From i30817 at gmail.com  Sat Jan 16 19:11:48 2010
From: i30817 at gmail.com (Paulo Levi)
Date: Sat, 16 Jan 2010 18:11:48 +0000
Subject: [mary-dev] WikipediaMarkupCleaner
Message-ID: <212322091001161011q5aff7f9bt97a1b50c75c2b7c0@mail.gmail.com>

Hi. While replacing StringBuffers with StringBuilders in the code base I
noticed the class marytts.tools.dbselection.WikipediaMarkupCleaner.

This class makes very poor use of StringBuffer and String, with lines like:

    line = new StringBuffer(line.toString().replaceAll("...", ""));
    line = new StringBuffer(line.toString().replaceAll("...", ""));
    ...

(the StringBuffer does nothing, and the toString()-to-String round trip could
be avoided by making the line variable a String) or

    if( ( line.toString().startsWith("*") ||
          line.toString().startsWith("#") ||
          line.toString().startsWith(";") ||
          line.toString().startsWith(".") ||
    ....

(the toString() could be avoided by making line a String).

I can remove most of this easily enough; however, the whole class seems to do
XML preprocessing. For that there are technologies that do a one-pass,
not-in-memory scan, like SAX (callback-based, if you don't need control) or
StAX (if you do need control). For StAX I could use
http://woodstox.codehaus.org/ if I wanted access to more than "normal" StAX;
otherwise I think the JDK implementation is fine. But first I need to know
exactly which XML entities are needed from Wikipedia.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.dfki.de/pipermail/mary-dev/attachments/20100116/ea14bafc/attachment.htm

From Marcela.Charfuelan at dfki.de  Mon Jan 18 14:51:11 2010
From: Marcela.Charfuelan at dfki.de (Marcela Charfuelan)
Date: Mon, 18 Jan 2010 14:51:11 +0100
Subject: [mary-dev] WikipediaMarkupCleaner
In-Reply-To: <212322091001161011q5aff7f9bt97a1b50c75c2b7c0@mail.gmail.com>
References: <212322091001161011q5aff7f9bt97a1b50c75c2b7c0@mail.gmail.com>
Message-ID: <4B54674F.2080601@dfki.de>

Dear Paulo Levi,

After splitting the big XML file we process small chunks; the procedure is as
follows:

1. Load the small XML document, like the page32.xml I am sending you.

2. Select and process the page section:

       ...
       ...

   In the function getTextFromWikiPage(.. ) from DBHandler.java there is a
   first filter: do not use the page if the title contains:

       pageTitle.contains("Wikipedia:") ||
       pageTitle.contains("Image:") ||
       pageTitle.contains("Template:") ||
       pageTitle.contains("Category:") ||
       pageTitle.contains("List_of_"))

3.
select and process the text section if its length is > minTextLength (1000),
and clean it up as much as possible.

The problem here is that inside text there is no XML format; it might contain
anything: titles, subtitles, figures, lists, in HTML format I guess (?),
because those pages are intended to be seen with a browser by Wikipedia. The
idea was to select as much as possible: just clean text, clean paragraphs
from which we can select clean sentences. I do not know if there is a better
way to do this; if you know it...

The files I am sending you are:

page32.xml : this is an input example for the markup_cleaner
out_markup_clean.txt : this is the output of processing the previous; in this
file I have marked with ORI PAGE and CLEAN PAGE the input/output to the
function removeMarKup() in the WikipediaMarkupCleaner class.

Please let me know if this helps.
Regards,
Marcela.

Paulo Levi wrote:
> Hi. While replacing StringBuffers with StringBuilders in the code base I
> noticed the class marytts.tools.dbselection.WikipediaMarkupCleaner.
>
> This class makes very poor use of StringBuffer and String, with lines like:
>
>     line = new StringBuffer(line.toString().replaceAll("...", ""));
>     line = new StringBuffer(line.toString().replaceAll("...", ""));
>     ...
>
> (the StringBuffer does nothing, and the toString()-to-String round trip
> could be avoided by making the line variable a String) or
>
>     if( ( line.toString().startsWith("*") ||
>           line.toString().startsWith("#") ||
>           line.toString().startsWith(";") ||
>           line.toString().startsWith(".") ||
>     ....
>
> (the toString() could be avoided by making line a String).
>
> I can remove most of this easily enough; however, the whole class seems
> to do XML preprocessing. For that there are technologies that do a
> one-pass, not-in-memory scan, like SAX (callback-based, if you don't need
> control) or StAX (if you do need control). For StAX I could use
> http://woodstox.codehaus.org/ if I wanted access to more than "normal"
> StAX; otherwise I think the JDK implementation is fine. But first I need
> to know exactly which XML entities are needed from Wikipedia.
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Mary-dev mailing list
> Mary-dev at dfki.de
> http://www.dfki.de/mailman/cgi-bin/listinfo/mary-dev
>

--
_______________________________________________________________
Marcela Charfuelan, Researcher, DFKI GmbH
Projektbuero Berlin, Alt-Moabit 91c, 10559 Berlin, Germany
Phone: +49 (0)30 39491821
URL  : http://www.dfki.de/~charfuel/
_______________________________________________________________
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
_______________________________________________________________
-------------- next part --------------
A non-text attachment was scrubbed...
Name: page32.xml
Type: text/xml
Size: 102781 bytes
Desc: not available
Url : http://www.dfki.de/pipermail/mary-dev/attachments/20100118/6d41e1d2/attachment-0001.bin
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: out_markup_clean.txt
Url: http://www.dfki.de/pipermail/mary-dev/attachments/20100118/6d41e1d2/attachment-0001.txt
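[Editorial note: the refactor Paulo describes can be sketched as below. The
regex patterns are placeholders (the real ones were lost when the archive was
scrubbed), and the class and method names are illustrative, not the actual
WikipediaMarkupCleaner code; only the before/after shape of the code is taken
from the thread.]

```java
// Sketch of the refactor: keep `line` as a plain String while doing regex
// work, instead of wrapping every replaceAll result in a new StringBuffer.
public final class LineCleanup {

    // Before (paraphrased from the thread):
    //   line = new StringBuffer(line.toString().replaceAll("...", ""));
    // After: String.replaceAll already returns a new String, so the
    // StringBuffer wrapper and toString() calls are pure overhead.
    static String clean(String line) {
        line = line.replaceAll("<ref[^>]*>", "");        // placeholder pattern
        line = line.replaceAll("\\{\\{[^}]*\\}\\}", ""); // placeholder pattern
        return line;
    }

    // The startsWith chain needs no toString() once line is a String.
    static boolean isListLine(String line) {
        return line.startsWith("*") || line.startsWith("#")
            || line.startsWith(";") || line.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(clean("{{infobox}}plain text")); // prints "plain text"
        System.out.println(isListLine("* item"));           // prints "true"
    }
}
```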
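[Editorial note: a minimal sketch of the one-pass StAX scan Paulo proposes,
using the JDK implementation he mentions as sufficient. It assumes the
Wikipedia dump's <page>/<title>/<text> element names; the class and method
names are hypothetical, not from the dbselection tools.]

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public final class PageScanner {

    // Pull-parse the chunk in one pass, never holding the whole document in
    // memory; returns "title=textLength" per page so the result is testable.
    public static List<String> scan(String xml) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> pages = new ArrayList<>();
        String title = null;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                if ("title".equals(r.getLocalName())) {
                    title = r.getElementText();
                } else if ("text".equals(r.getLocalName())) {
                    // this is where the markup cleaner would be handed the text
                    pages.add(title + "=" + r.getElementText().length());
                }
            }
        }
        r.close();
        return pages;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(scan(
            "<mediawiki><page><title>Example</title>"
            + "<text>Some wiki text</text></page></mediawiki>"));
        // prints [Example=14]
    }
}
```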
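[Editorial note: the title filter and minTextLength check from Marcela's
steps 2 and 3 can be sketched as below. The contains(...) conditions and the
1000-character threshold are taken directly from her message; the class and
method names are illustrative, not the actual getTextFromWikiPage(..) code in
DBHandler.java.]

```java
// First filter: skip pages whose title marks a non-article namespace, and
// only keep text sections longer than minTextLength (1000).
public final class PageFilter {
    static final int MIN_TEXT_LENGTH = 1000;

    static boolean usableTitle(String pageTitle) {
        return !(pageTitle.contains("Wikipedia:")
              || pageTitle.contains("Image:")
              || pageTitle.contains("Template:")
              || pageTitle.contains("Category:")
              || pageTitle.contains("List_of_"));
    }

    static boolean usableText(String text) {
        return text != null && text.length() > MIN_TEXT_LENGTH;
    }
}
```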