[mary-dev] WikipediaProcessor: Japanese Processing Exception

Marc Schroeder schroed at dfki.de
Tue Nov 10 09:05:15 CET 2009


Hello,

Before trying to profile the low-level code, it may make sense to step 
back for a moment and think about what we want here.

The problem seems to occur in WikipediaMarkupCleaner.updateWordList()
http://mary.opendfki.de/browser/trunk/java/marytts/tools/dbselection/WikipediaMarkupCleaner.java#L944
where *all* words are requested, sorted by descending frequency.

We are interested in a global list of the most frequent words, so that 
we can transcribe them manually into phonetic form. It is improbable 
that more than 100,000 words can be transcribed manually in a reasonable 
amount of time, so we may serve our purpose just fine by looking at the 
100k or 200k most frequent words rather than all 7 million.

What happens if you change any of the parameters when calling 
getMostFrequentWords(int numWords, int maxFrequency) from updateWordList()?

Instead of

	wlOld = wikiToDB.getMostFrequentWords(0, 0);

try putting a maximum number of words, e.g. 200000:

	wlOld = wikiToDB.getMostFrequentWords(200000, 0);

or retrieving only the words that occur, say, at least 5 times:

	wlOld = wikiToDB.getMostFrequentWords(0, 5);



Does that change the memory required? If not, then this is not the 
memory bottleneck.
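
If limiting the query does reduce the memory, the cap could even be pushed 
down into the SQL itself, so that the driver never materializes all ~7.7m 
rows. Below is a minimal sketch of that idea, not the actual DBHandler 
code; the table and column names ("wordList", "word", "frequency") are 
assumptions for illustration:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    public class TopWordsSketch {
        // Return at most numWords words, skipping words rarer than
        // minFrequency; a value of 0 disables the respective limit.
        public static Map<String, Integer> getTopWords(Connection conn,
                int numWords, int minFrequency) throws Exception {
            String query = "SELECT word, frequency FROM wordList"
                + (minFrequency > 0 ? " WHERE frequency >= " + minFrequency : "")
                + " ORDER BY frequency DESC"
                + (numWords > 0 ? " LIMIT " + numWords : "");
            Map<String, Integer> result = new HashMap<String, Integer>();
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery(query);
            while (rs.next()) {
                // decode the UTF-8 bytes stored in the db back into a String
                String word = new String(rs.getBytes(1), "UTF8");
                result.put(word, rs.getInt(2)); // autoboxing, no new Integer
            }
            rs.close();
            st.close();
            return result;
        }
    }

With LIMIT 200000, the HashMap holds at most 200k entries instead of 7.7m, 
which should bound the heap usage regardless of how the wiki grows.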



Regarding your question about UTF8: we use UTF8 throughout the system for 
all internal representation of the language-specific strings. With 
Wikipedia that should be straightforward, since Wikipedia encodes texts 
in UTF8 (is that true for the Japanese Wikipedia as well? It is true for 
many different languages in any case). So on the one end, we put 
UTF8-encoded bytes into the MySQL db, and therefore, when reading them 
back, we must use UTF8 to convert the bytes back into strings.
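
As a minimal sketch of that round trip (the class and variable names are 
illustrative, not actual MARY code):

    import java.io.UnsupportedEncodingException;

    public class Utf8RoundTrip {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String word = "日本語";                        // a Japanese word
            byte[] stored = word.getBytes("UTF8");         // what we put into the MySQL db
            String decoded = new String(stored, "UTF8");   // what we must do when reading back
            System.out.println(word.equals(decoded));      // true: the round trip is lossless
        }
    }

Decoding the stored bytes with a different charset such as "EUC_JP" only 
garbles the text, as seen with the unreadable wordList file below; it does 
not change how much memory the decoder allocates.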

Regards,
Marc


Hind Abdul-Khaleq wrote:
> Please bear in mind that the ResultSet here holds around 7,695,454 
> database records! Does this mean that we have to increase the memory 
> proportionally as the data size grows?
> 
> --- On Mon, 11/9/09, Hind Abdul-Khaleq <habdolkhaleq at yahoo.com> wrote:
> 
> 
>     From: Hind Abdul-Khaleq <habdolkhaleq at yahoo.com>
>     Subject: Re: [mary-dev] WikipediaProcessor: Japanese Processing
>     Exception
>     To: "Marc Schroeder" <schroed at dfki.de>, "waleed mohamed"
>     <waleed_cs2000 at yahoo.com>
>     Cc: mary-dev at dfki.de
>     Date: Monday, November 9, 2009, 7:58 PM
> 
>     Well, although -Xmx1750m has not produced the exception so far, it
>     is not continuing either; it just hangs!
>     Please check this issue:
>     http://mentormate.com/blog/javalangoutofmemoryerror-java-heap-space-unexpected-tomcat-crash-solved/
> 
>     Maybe the key is that this code snippet leaves those strings'
>     byte arrays in memory, so we could try to force GC or something
>     to clean them up?
>     Lines: 1365 and 1366 @
>     http://mary.opendfki.de/browser/trunk/java/marytts/tools/dbselection/DBHandler.java
>     while (rs.next()) {
>         wordBytes = rs.getBytes(1);
>         word = new String(wordBytes, "UTF8");
>         wordList.put(word, new Integer(rs.getInt(2)));
>     }
>     And what about the "UTF8" encoding? When I tried changing it, the
>     program continued past the point of failure, but the encoding of the
>     generated wordList file wasn't readable, and it produced another
>     exception at the next line, as at the top of this post. However,
>     changing "wordList.put(word, new Integer(rs.getInt(2)));" to
>     "wordList.put(word, rs.getInt(2));" skipped that exception; it seems
>     we avoided creating many new objects?
> 
>     Please check the attached VisualVM screenshots.
>     All of -Xmx1025m, -Xmx1250m, and -Xmx1500m produced the same exception:
>     Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:133)
>         at java.lang.StringCoding.decode(StringCoding.java:173)
>         at java.lang.String.<init>(String.java:443)
>         at java.lang.String.<init>(String.java:515)
>         at marytts.tools.dbselection.DBHandler.getMostFrequentWords(DBHandler.java:1365)
>         at marytts.tools.dbselection.WikipediaMarkupCleaner.updateWordList(WikipediaMarkupCleaner.java:953)
>         at marytts.tools.dbselection.WikipediaMarkupCleaner.processWikipediaPages(WikipediaMarkupCleaner.java:1133)
>         at marytts.tools.dbselection.WikipediaProcessor.main(WikipediaProcessor.java:368)
> 
> 
> 
> 
>         --- On Mon, 11/9/09, Paulo Levi <i30817 at gmail.com> wrote:
> 
> 
>         From: Paulo Levi <i30817 at gmail.com>
>         Subject: Re: [mary-dev] WikipediaProcessor: Japanese Processing
>         Exception
>         To: "Marc Schroeder" <schroed at dfki.de>
>         Cc: "Hind Abdul-Khaleq" <habdolkhaleq at yahoo.com>, mary-dev at dfki.de
>         Date: Monday, November 9, 2009, 8:31 AM
> 
>         The OOM condition doesn't necessarily occur because of the
>         (unlimited, I assume) cache, but it certainly seems a prime
>         suspect.
>         On a more productive note, you could use the jvisualvm
>         application (from a console) to find out whether the memory
>         leak comes from there. The heap dump needs as much disk space
>         as the heap, so remember to delete it after use.
> 
>         Open jvisualvm, right-click on the application's tree node,
>         and enable heap dump on OOM. Then play with it a little. Most
>         of the memory should be byte[] arrays, but there should be
>         structures that hold them (strings, probably), which are in
>         turn held by some persistent or unbounded data structure.
>         I don't think that jvisualvm can tell you where exactly in the
>         code the data comes from, but it should be possible to narrow
>         things down.
> 
> 
>         On Mon, Nov 9, 2009 at 7:52 AM, Marc Schroeder <schroed at dfki.de>
>         wrote:
>          > Thanks for this update -- good to hear that more memory
>         solves this
>          > problem. Of course it seems curious that 2GB of RAM should be
>         required
>          > for running this code; if anyone would like to try and reduce the
>          > footprint, let me know.
>          >
>          > Best,
>          > Marc
>          >
>          > Hind Abdul-Khaleq wrote:
>          >> The problem was solved by giving the VM "-Xmx2000m",
>          >> without other changes to the source.
>          >> Thanks a lot and all the best.
>          >>
>          >>
>          >>
>          >>     --- On Wed, 10/28/09, Hind Abdul-Khaleq
>          >>     <habdolkhaleq at yahoo.com> wrote:
>          >>
>          >>
>          >>         From: Hind Abdul-Khaleq <habdolkhaleq at yahoo.com>
>          >>         Subject: Re: [mary-dev] WikipediaProcessor: Japanese
>         Processing
>          >>         Exception
>          >>         To: mary-dev at dfki.de
>          >>         Date: Wednesday, October 28, 2009, 11:45 AM
>          >>
>          >>         I'm getting this exception while processing Japanese.
>          >>         I changed the encoding to "EUC_JP" at the line
>          >>
>          >>             word = new String(wordBytes, "UTF8");
>          >>
>          >>         in
>          >>         marytts.tools.dbselection.DBHandler.getMostFrequentWords(DBHandler.java:1366)
>          >>
>          >>         but it produced another exception at the next line:
>          >>             wordList.put(word, new Integer(rs.getInt(2)));
>          >>         Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>          >>             at java.util.HashMap.resize(HashMap.java:462)
>          >>             at java.util.HashMap.addEntry(HashMap.java:755)
>          >>             at java.util.HashMap.put(HashMap.java:385)
>          >>             at marytts.tools.dbselection.DBHandler.getMostFrequentWords(DBHandler.java:1367)
>          >>             at marytts.tools.dbselection.WikipediaMarkupCleaner.updateWordList(WikipediaMarkupCleaner.java:953)
>          >>             at marytts.tools.dbselection.WikipediaMarkupCleaner.processWikipediaPages(WikipediaMarkupCleaner.java:1133)
>          >>             at marytts.tools.dbselection.WikipediaProcessor.main(WikipediaProcessor.java:368)
>          >>
>          >>
>          >>         I also give the VM "-Xmx1000m", so what should I do?
>          >>
>          >>         --- On Wed, 10/28/09, Hind Abdul-Khaleq
>          >>         <habdolkhaleq at yahoo.com> wrote:
>          >>
>          >>
>          >>             From: Hind Abdul-Khaleq <habdolkhaleq at yahoo.com>
>          >>             Subject: [mary-dev] WikipediaProcessor: Japanese
>         Exception
>          >>             To: mary-dev at dfki.de
>          >>             Date: Wednesday, October 28, 2009, 11:34 AM
>          >>
>          >>             Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>          >>                 at java.util.Arrays.copyOf(Arrays.java:2882)
>          >>                 at java.lang.StringCoding.safeTrim(StringCoding.java:75)
>          >>                 at java.lang.StringCoding.access$100(StringCoding.java:34)
>          >>                 at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:151)
>          >>                 at java.lang.StringCoding.decode(StringCoding.java:173)
>          >>                 at java.lang.String.<init>(String.java:443)
>          >>                 at java.lang.String.<init>(String.java:515)
>          >>                 at marytts.tools.dbselection.DBHandler.getMostFrequentWords(DBHandler.java:1366)
>          >>                 at marytts.tools.dbselection.WikipediaMarkupCleaner.updateWordList(WikipediaMarkupCleaner.java:953)
>          >>                 at marytts.tools.dbselection.WikipediaMarkupCleaner.processWikipediaPages(WikipediaMarkupCleaner.java:1133)
>          >>                 at marytts.tools.dbselection.WikipediaProcessor.main(WikipediaProcessor.java:368)
>          >>
>          >>
>          >>
>          >>
>          >>
>          >>
>          >>
>          >>
>          >>
>          >>
>          >>
>          >
> 
> 
> 
> 
> 

-- 
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
Portal Editor http://emotion-research.net
Team Leader DFKI Speech Group http://mary.dfki.de

Homepage: http://www.dfki.de/~schroed
Email: schroed at dfki.de
Phone: +49-681-302-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
Saarbrücken, Germany
--
Official DFKI coordinates:
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313

