From i30817 at gmail.com  Sat Jan 16 19:11:48 2010
From: i30817 at gmail.com (Paulo Levi)
Date: Sat, 16 Jan 2010 18:11:48 +0000
Subject: [mary-dev] WikipediaMarkupCleaner
Message-ID: <212322091001161011q5aff7f9bt97a1b50c75c2b7c0@mail.gmail.com>

Hi.

While replacing StringBuffers with StringBuilders in the code base, I noticed
the class marytts.tools.dbselection.WikipediaMarkupCleaner.
This class makes very poor use of StringBuffer and String, with lines like this:

line = new StringBuffer(line.toString().replaceAll("
", ""));
line = new StringBuffer(line.toString().replaceAll("
", ""));
...
(the StringBuffer does nothing here; the String-to-String round trip could be
avoided by making the line variable a String)

or

if( ( line.toString().startsWith("*") ||
line.toString().startsWith("#") ||
line.toString().startsWith(";") ||
line.toString().startsWith(".") ||
....
(the toString() calls could be avoided by making line a String)

I can remove most of this easily enough. However, the whole class seems to do
XML preprocessing. For this there are technologies that scan the input in a
single pass without holding it in memory, such as SAX (callback-based, if you
don't need control) or StAX (if you do need control).
For StAX I could use http://woodstox.codehaus.org/ if I wanted access to more
than "normal" StAX; otherwise I think the JDK implementation is fine. But first
I need to know exactly which XML entities are needed from Wikipedia.

From Marcela.Charfuelan at dfki.de  Mon Jan 18 14:51:11 2010
From: Marcela.Charfuelan at dfki.de (Marcela Charfuelan)
Date: Mon, 18 Jan 2010 14:51:11 +0100
Subject: [mary-dev] WikipediaMarkupCleaner
In-Reply-To: <212322091001161011q5aff7f9bt97a1b50c75c2b7c0@mail.gmail.com>
References: <212322091001161011q5aff7f9bt97a1b50c75c2b7c0@mail.gmail.com>
Message-ID: <4B54674F.2080601@dfki.de>

Dear Paulo Levi,

After splitting the big XML file we process small chunks. The procedure is as
follows:

1. Load the small XML document, like the page32.xml I am sending you.
2. Select and process the page section:

--
_______________________________________________________________
Marcela Charfuelan, Researcher, DFKI GmbH
Projektbuero Berlin, Alt-Moabit 91c, 10559 Berlin, Germany
Phone: +49 (0)30 39491821
URL  : http://www.dfki.de/~charfuel/
_______________________________________________________________
-------------- next part --------------
A non-text attachment was scrubbed...
Name: page32.xml
Type: text/xml
Size: 102781 bytes
Desc: not available
Url : http://www.dfki.de/pipermail/mary-dev/attachments/20100118/6d41e1d2/attachment-0001.bin
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: out_markup_clean.txt
Url: http://www.dfki.de/pipermail/mary-dev/attachments/20100118/6d41e1d2/attachment-0001.txt
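
A minimal sketch of the two ideas raised in this thread: keeping each line as a
plain String instead of round-tripping through StringBuffer, and reading a page
with the JDK's StAX reader rather than loading the whole document. The class
name WikiTextSketch, the helper names, and the assumption that the article body
sits in a <text> element are illustrative only and are not taken from
WikipediaMarkupCleaner. The thread also does not show what the original class
does with lines starting with "*", "#", ";" or ".", so dropping them here is an
assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class WikiTextSketch {

    // Hypothetical helper: the same prefix checks quoted in the thread,
    // but on a String, so no toString() calls are needed.
    static boolean isMarkupLine(String line) {
        return line.startsWith("*") || line.startsWith("#")
            || line.startsWith(";") || line.startsWith(".");
    }

    // Assumption: lines that start with list markup are simply dropped.
    static String dropMarkupLines(String text) {
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            if (!isMarkupLine(line)) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    // Single-pass StAX scan using the JDK implementation.
    // Assumption: the content of interest is in <text> elements.
    static List<String> extractText(String xml) {
        List<String> texts = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "text".equals(reader.getLocalName())) {
                    // Reads the element's character content and leaves the
                    // reader positioned on the matching end tag.
                    texts.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<page><revision><text>First line\n* a bullet\nLast line"
                   + "</text></revision></page>";
        for (String text : extractText(xml)) {
            System.out.print(dropMarkupLines(text));
        }
    }
}
```

For a real dump the StringReader would be replaced by a file stream, so that
only the current element's text is held in memory at any time.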