[mary-dev] WikipediaMarkupCleaner

Mon Jan 18 14:51:11 CET 2010

Dear Paulo Levi,

After splitting the big XML file we process it in small chunks. The 
procedure is as follows:

1. Load the small XML document, like the page32.xml I am sending you.

2. Select and process the page section:
<page>
    <revision>
      ...
    <title> ... </title>

    <text
     ...
    </text>
  </revision>
</page>
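
Steps 1 and 2 could be sketched roughly as follows, assuming the chunk is
read with the JDK's StAX API (javax.xml.stream). The class and method names
are hypothetical and not taken from DBHandler.java; the sketch only shows
how the title and text elements might be picked out in a single pass:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper (not part of MARY): reads title and text from one page chunk. */
final class PageChunkReader {

    /** Returns {title, text}; an entry stays null if the element is missing. */
    static String[] readTitleAndText(String xml) {
        String title = null;
        String text = null;
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("title".equals(name)) {
                        // getElementText() returns the text-only content of the element
                        title = reader.getElementText();
                    } else if ("text".equals(name)) {
                        text = reader.getElementText();
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Assumption: a chunk that is not well-formed is treated as a caller error
            throw new IllegalArgumentException("Chunk is not well-formed XML", e);
        }
        return new String[] { title, text };
    }
}
```

The returned text is still raw page markup; it is only extracted here, not
cleaned.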

In the function getTextFromWikiPage(..) in DBHandler.java there is a
first filter: do not use the page if the title matches:

   pageTitle.contains("Wikipedia:")
|| pageTitle.contains("Image:")
|| pageTitle.contains("Template:")
|| pageTitle.contains("Category:")
|| pageTitle.contains("List_of_")
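
As a self-contained illustration of this first filter, here is a small
sketch. The class and method names are hypothetical; only the five marker
strings come from the condition above:

```java
/** Hypothetical wrapper around the title filter quoted above. */
final class PageTitleFilter {

    // Marker strings taken from the condition in getTextFromWikiPage(..)
    private static final String[] EXCLUDED_MARKERS = {
        "Wikipedia:", "Image:", "Template:", "Category:", "List_of_"
    };

    /** Returns true if the page should be skipped because of its title. */
    static boolean isExcluded(String pageTitle) {
        for (String marker : EXCLUDED_MARKERS) {
            if (pageTitle.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the markers in one array makes it easy to add further namespaces
later without touching the condition itself.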

3. Select and process the text section if its length is > minTextLength 
(1000), and clean it up as much as possible:
<text
 ...

</text>

The problem here is that the content of the text element is not XML. It 
might contain anything: titles, subtitles, figures, lists, in HTML format 
I guess (?), because those pages are intended to be viewed in a browser.

The idea was to keep as much as possible, but only clean text: clean 
paragraphs from which we can select clean sentences.
I do not know if there is a better way to do this; if you know one, 
please tell me.
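
To make step 3 more concrete, here is a very rough sketch of the kind of
cleaning I mean. It is not the code of removeMarKup(); the class name, the
regular expressions and the rule of skipping lines that start with list,
heading or table characters are all assumptions for illustration. Only the
minimum length of 1000 comes from the description above:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the cleaning idea, not the real WikipediaMarkupCleaner. */
final class MarkupCleanSketch {

    static final int MIN_TEXT_LENGTH = 1000; // minTextLength from step 3

    /** Step 3 precondition: only texts longer than the minimum are processed. */
    static boolean isLongEnough(String text) {
        return text.length() > MIN_TEXT_LENGTH;
    }

    /** Strips a few simple markup constructs from one line (assumed rules). */
    static String cleanLine(String line) {
        String s = line;
        s = s.replaceAll("<[^>]+>", "");                                  // HTML-like tags
        s = s.replaceAll("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]", "$1"); // [[target|label]] -> label
        s = s.replaceAll("'{2,}", "");                                    // '' and ''' emphasis
        return s.trim();
    }

    /** Keeps only lines that look like running text and cleans them. */
    static List<String> cleanParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String line : text.split("\n")) {
            String t = line.trim();
            // Assumption: lists, headings, templates and tables carry no clean sentences
            if (t.isEmpty() || t.startsWith("*") || t.startsWith("#")
                    || t.startsWith("=") || t.startsWith("{") || t.startsWith("|")) {
                continue;
            }
            String cleaned = cleanLine(t);
            if (!cleaned.isEmpty()) {
                paragraphs.add(cleaned);
            }
        }
        return paragraphs;
    }
}
```

Real pages contain many more constructs (nested templates, references,
tables spanning several lines), so a sketch like this would leave a lot of
residue behind.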

The files I am sending you are:

page32.xml : this is an input example for the markup_cleaner

out_markup_clean.txt : this is the output of processing the previous file.
                       In this file I have marked with ORI PAGE and
                       CLEAN PAGE the input/output of the function
                       removeMarKup() in the WikipediaMarkupCleaner class.

Please let me know if this helps.

Regards,
Marcela.

Paulo Levi wrote:
> Hi. While replacing StringBuffers with StringBuilders in the code base I 
> noticed the class marytts.tools.dbselection.WikipediaMarkupCleaner.
>
> This class makes very poor use of StringBuffer and String, with lines 
> like this:
> line = new StringBuffer(line.toString().replaceAll("<p>", ""));
> line = new StringBuffer(line.toString().replaceAll("</p>", ""));
> ...
> (the StringBuffer does nothing; the String-to-String round trip could be 
> avoided by making the line variable a String)
> or
>
>              if( ( line.toString().startsWith("*") ||
>                    line.toString().startsWith("#") ||
>                    line.toString().startsWith(";") ||
>                    line.toString().startsWith(".") ||
> ....
> (toString() could be avoided by making line a String)
>
> I can remove most of this easily enough; however, the whole class seems 
> to do XML preprocessing. For this there are technologies that do 
> one-pass, not-in-memory scanning, like SAX (callback-based, if you don't 
> need control) or StAX (if you need control).
> For StAX I could use http://woodstox.codehaus.org/ if I wanted access 
> to more than "normal" StAX; otherwise I think the JDK implementation is 
> fine. But first I need to know exactly which XML entities are needed 
> from Wikipedia.
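
The String-based variant suggested in the quoted message could look roughly
like this. It is a sketch only: the class and method names are hypothetical,
and the four prefixes are just the ones visible in the quoted condition,
which is truncated in the original:

```java
/** Hypothetical sketch: the quoted operations with line kept as a plain String. */
final class LineCleanupSketch {

    /** Removes paragraph tags without wrapping the result in a StringBuffer. */
    static String stripParagraphTags(String line) {
        // String.replace does a literal replacement, so no regex is compiled
        return line.replace("<p>", "").replace("</p>", "");
    }

    /** Checks the line prefixes shown in the quoted condition, without toString(). */
    static boolean startsWithListMarker(String line) {
        return line.startsWith("*")
                || line.startsWith("#")
                || line.startsWith(";")
                || line.startsWith(".");
    }
}
```

For literal patterns such as "<p>", replace() gives the same result as
replaceAll() and avoids interpreting the pattern as a regular expression.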

-- 
_______________________________________________________________
 Marcela Charfuelan, Researcher, DFKI GmbH
 Projektbuero Berlin, Alt-Moabit 91c, 10559 Berlin, Germany
 Phone: +49 (0)30 39491821
 URL  : http://www.dfki.de/~charfuel/
_______________________________________________________________
 Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
 Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
 Geschaeftsfuehrung:
 Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
 Dr. Walter Olthoff 
 Vorsitzender des Aufsichtsrats:
 Prof. Dr. h.c. Hans A. Aukes
 Amtsgericht Kaiserslautern, HRB 2313
_______________________________________________________________

-------------- next part --------------
A non-text attachment was scrubbed...
Name: page32.xml
Type: text/xml
Size: 102781 bytes
Desc: not available
Url : http://www.dfki.de/pipermail/mary-dev/attachments/20100118/6d41e1d2/attachment-0001.bin 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: out_markup_clean.txt
Url: http://www.dfki.de/pipermail/mary-dev/attachments/20100118/6d41e1d2/attachment-0001.txt