[mary-dev] WikipediaMarkupCleaner
Marcela Charfuelan
Marcela.Charfuelan at dfki.de
Mon Jan 18 14:51:11 CET 2010
Dear Paulo Levi,
After splitting the big XML file, we process the small chunks. The
procedure is as follows:
1. Load the small XML document, like the page32.xml I am sending you.
2. Select and process the page section:
<page>
<revision>
...
<title> ... </title>
<text
...
</text>
</revision>
</page>
In the function getTextFromWikiPage(.. ) in DBHandler.java there is a
first filter: do not use the page if the title contains:
pageTitle.contains("Wikipedia:")
|| pageTitle.contains("Image:")
|| pageTitle.contains("Template:")
|| pageTitle.contains("Category:")
|| pageTitle.contains("List_of_"))
3. Select and process the text section if its length is > minTextLength
(1000), and clean it up as much as possible:
<text
...
</text>
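
The two filters in steps 2 and 3 can be sketched roughly as follows. This
is only an illustration, not the code in DBHandler.java: the class name
PageFilter and the method names are hypothetical, and the default of 1000
is taken from the value mentioned above.

```java
/** Illustrative sketch of the page-level filters; all names here are hypothetical. */
final class PageFilter {
    // Title fragments that mark pages we do not want (from the list above).
    private static final String[] SKIPPED_MARKERS = {
        "Wikipedia:", "Image:", "Template:", "Category:", "List_of_"
    };

    // Assumed default, following the "(1000)" mentioned in step 3.
    static final int MIN_TEXT_LENGTH = 1000;

    /** True if the title contains any of the skipped markers. */
    static boolean hasSkippedTitle(String pageTitle) {
        for (String marker : SKIPPED_MARKERS) {
            if (pageTitle.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    /** True if the title passes the first filter and the text is longer than minTextLength. */
    static boolean isUsable(String pageTitle, String text, int minTextLength) {
        return !hasSkippedTitle(pageTitle) && text.length() > minTextLength;
    }

    public static void main(String[] args) {
        char[] filler = new char[1500];
        java.util.Arrays.fill(filler, 'x');
        String longText = new String(filler);

        System.out.println(isUsable("Berlin", longText, MIN_TEXT_LENGTH));          // true
        System.out.println(isUsable("Category:Cities", longText, MIN_TEXT_LENGTH)); // false
        System.out.println(isUsable("Berlin", "too short", MIN_TEXT_LENGTH));       // false
    }
}
```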
The problem here is that the content inside text is not in XML format. It
might contain anything: titles, subtitles, figures, lists, in HTML format
I guess (?), because those pages are intended to be viewed in a browser
by wikipedia.
The idea was to select as much clean text as possible: clean paragraphs
from which we can select clean sentences.
I do not know if there is a better way to do this; if you know of one...
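
For reading the title and text elements themselves, one option is the
streaming approach (StAX) suggested in the quoted message below. Here is a
minimal sketch using the StAX implementation that ships with the JDK
(javax.xml.stream). The class PageReader and its method are hypothetical
names, and the sketch assumes title and text contain only character data
(escaped markup, no child elements). It does not clean the wiki markup
inside text; that step would still be needed afterwards.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Illustrative StAX sketch; the class and method names are hypothetical. */
final class PageReader {

    /**
     * Returns {title, text} from a small page chunk. If an element occurs
     * more than once the last occurrence wins; a missing element yields "".
     */
    static String[] readTitleAndText(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            String title = "";
            String text = "";
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("title".equals(name)) {
                        // getElementText() reads up to the matching end tag.
                        title = reader.getElementText();
                    } else if ("text".equals(name)) {
                        text = reader.getElementText();
                    }
                }
            }
            reader.close();
            return new String[] { title, text };
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse page chunk", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<page><revision><title>Berlin</title>"
                + "<text>'''Berlin''' is a &lt;b&gt;city&lt;/b&gt;.</text>"
                + "</revision></page>";
        String[] result = readTitleAndText(xml);
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```

Because the parser unescapes entities, the returned text contains the raw
markup (for example <b> rather than &lt;b&gt;), which is the form a
markup cleaner would then work on.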
The files I am sending you are:
page32.xml : this is an input example for the markup_cleaner
out_markup_clean.txt : this is the output of processing the previous file;
  in this file I have marked with ORI PAGE and CLEAN PAGE the input/output
  of the function removeMarKup() in the WikipediaMarkupCleaner class.
Please let me know if this helps.
Regards,
Marcela.
Paulo Levi wrote:
> Hi. In replacing stringbuffers by stringbuilders in the code base i
> noticed the class marytts.tools.dbselection.WikipediaMarkupCleaner.
>
> This class has a very poor use of stringbuffer & string with lines
> like this:
> line = new StringBuffer(line.toString().replaceAll("<p>", ""));
> line = new StringBuffer(line.toString().replaceAll("</p>", ""));
> ...
> (the string buffer does nothing, the string to string could be avoided
> by making the line variable a string)
> or
>
> if( ( line.toString().startsWith("*") ||
> line.toString().startsWith("#") ||
> line.toString().startsWith(";") ||
> line.toString().startsWith(".") ||
> ....
> (to string could be avoided by making line a string)
>
> I can remove most of this easily enough, however the whole class seems
> to do xml preprocessing. For this there are technologies that do a one
> pass not in memory scanning like sax (if don't need control -
> callback) or StAX if need control.
> For stax i could use http://woodstox.codehaus.org/ if i wanted access
> to more than "normal" stax otherwise i think the jdk implementation is
> fine, but first i need to know what xml entities are needed exactly
> from wikipedia.
>
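
The String-based variant described in the quoted message could look
roughly like this. It is only a sketch: LineCleaner and its method names
are hypothetical, and the prefix check covers only the four markers shown
in the quoted snippet, since the rest of that condition is elided there.

```java
/** Illustrative sketch of keeping the line as a String; names are hypothetical. */
final class LineCleaner {

    /**
     * Removes the paragraph tags from a line. String.replace() treats its
     * arguments literally, so no regular expression is compiled here.
     */
    static String stripParagraphTags(String line) {
        return line.replace("<p>", "").replace("</p>", "");
    }

    /** Checks only the four prefixes visible in the quoted snippet. */
    static boolean startsWithListMarker(String line) {
        return line.startsWith("*")
                || line.startsWith("#")
                || line.startsWith(";")
                || line.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(stripParagraphTags("<p>Some clean text.</p>")); // Some clean text.
        System.out.println(startsWithListMarker("* a list item"));         // true
        System.out.println(startsWithListMarker("A normal sentence."));    // false
    }
}
```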
--
_______________________________________________________________
Marcela Charfuelan, Researcher, DFKI GmbH
Projektbuero Berlin, Alt-Moabit 91c, 10559 Berlin, Germany
Phone: +49 (0)30 39491821
URL : http://www.dfki.de/~charfuel/
_______________________________________________________________
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
_______________________________________________________________
-------------- next part --------------
A non-text attachment was scrubbed...
Name: page32.xml
Type: text/xml
Size: 102781 bytes
Desc: not available
Url : http://www.dfki.de/pipermail/mary-dev/attachments/20100118/6d41e1d2/attachment-0001.bin
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: out_markup_clean.txt
Url: http://www.dfki.de/pipermail/mary-dev/attachments/20100118/6d41e1d2/attachment-0001.txt