[mary-dev] WikipediaMarkupCleaner

Sat Jan 16 19:11:48 CET 2010

Hi. While replacing StringBuffers with StringBuilders in the code base, I
noticed the class marytts.tools.dbselection.WikipediaMarkupCleaner.

This class makes very poor use of StringBuffer and String, with lines like
this:
line = new StringBuffer(line.toString().replaceAll("<p>", ""));
line = new StringBuffer(line.toString().replaceAll("</p>", ""));
...
(the StringBuffer does nothing here, and the String-to-String round trip
could be avoided by making the line variable a String)
or

             if( ( line.toString().startsWith("*") ||
                   line.toString().startsWith("#") ||
                   line.toString().startsWith(";") ||
                   line.toString().startsWith(".") ||
....
(the toString() calls could be avoided by making line a String)
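
To illustrate what I mean, here is a minimal sketch of the String-based
version. The class and method names are made up for the example and are not
in WikipediaMarkupCleaner, and the prefix check only covers the four
characters quoted above, not the rest of the original condition.

```java
// Hypothetical helper for illustration; not the actual WikipediaMarkupCleaner code.
final class LineCleanerSketch {

    // Removes literal <p> and </p> tags. String.replace does a plain
    // (non-regex) replacement, so no StringBuffer wrapping and no
    // regex compilation on every call.
    static String stripParagraphTags(String line) {
        return line.replace("<p>", "").replace("</p>", "");
    }

    // True if the line begins with one of the four markup characters
    // shown in the quoted condition.
    static boolean startsWithListMarkup(String line) {
        if (line.isEmpty()) {
            return false;
        }
        char first = line.charAt(0);
        return first == '*' || first == '#' || first == ';' || first == '.';
    }
}
```

With line held as a String, the checks read directly off the variable and
each line is inspected once instead of once per startsWith call.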

I can remove most of this easily enough. However, the whole class seems to be
doing XML preprocessing, and for that there are technologies that scan the
input in a single pass without holding it all in memory: SAX (callback-based,
fine if I don't need to control the parsing) or StAX (pull-based, if I do need
control).
For StAX I could use http://woodstox.codehaus.org/ if I wanted access to more
than "normal" StAX; otherwise I think the JDK implementation is fine. But
first I need to know exactly which XML entities are needed from Wikipedia.
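
As a rough idea of the StAX approach with the JDK implementation, here is a
small sketch that pulls the text content out of every element with a given
name in one pass. Which elements actually matter for the Wikipedia data is
still open, so the element name is a parameter; the class and method names
are hypothetical.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; assumes the wanted elements contain text only.
final class StaxTextSketch {

    // Streams through the input once and collects the character content
    // of every element whose local name equals elementName.
    static List<String> collectText(Reader in, String elementName) {
        List<String> result = new ArrayList<String>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        // getElementText reads up to the matching end tag;
                        // it throws if the element has child elements.
                        result.add(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
        return result;
    }
}
```

The caller drives the loop, which is the "control" part: it can stop early or
skip whole subtrees, something SAX callbacks make more awkward.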