[mary-users] WikipediaProcessor.java

Ingmar Steiner ingmar.steiner at inria.fr
Thu Sep 15 09:19:39 CEST 2011


Dear Luisina,

quoting from http://mary.opendfki.de/wiki/NewLanguageSupport:

>     Once you have a MySQL database, you can start to extract clean text and words from the Wikipedia split files using the WikipediaProcessor program. The following script explains its usage and possible parameters (the script examples presented in this tutorial use enwiki, that is, locale en_US):
>
> #!/bin/bash
>
> # Before using this program, it is recommended to split the big xml dump into
> # small files using the wikipediaDumpSplitter.
> #
> # WikipediaProcessor: this program processes wikipedia xml files using
> # mwdumper-2008-04-13.jar (http://www.mediawiki.org/wiki/Mwdumper).
> # mwdumper extracts pages from the xml file and loads them as tables into a database.
> #
> # Once the tables are loaded, the WikipediaMarkupCleaner is used to extract
> # clean text and a wordList. As a result two tables will be created in the
> # database: local_cleanText and local_wordList (the wordList is also
> # saved in a file).
> #
> # NOTE: The mwdumper-2008-04-13.jar must be included in the classpath.
> #
> # Usage: java WikipediaProcessor -locale language -mysqlHost host -mysqlUser user -mysqlPasswd passwd
> #                                   -mysqlDB wikiDB -listFile wikiFileList
> #                                   [-minPage 10000 -minText 1000 -maxText 15000]
> #
> #      -listFile is a text file that contains the xml wikipedia file names (plus path) to be processed.
> #      This program requires the jar file mwdumper-2008-04-13.jar (or later).
> #
> #      default/optional: [-minPage 10000 -minText 1000 -maxText 15000]
> #      -minPage is the minimum size of a wikipedia page that will be considered for cleaning.
> #      -minText is the minimum size of a text to be kept in the DB.
> #      -maxText is used to split big articles into small chunks; this is the maximum chunk size.
>
>
> export MARY_BASE="[PATH TO MARY BASE]"
> export CLASSPATH="$MARY_BASE/java/:$MARY_BASE/java/commons-lang-2.4.jar:$MARY_BASE/java/mwdumper-2008-04-13.jar"
>
> java -Xmx512m -classpath "$CLASSPATH" marytts.tools.dbselection.WikipediaProcessor \
> -locale "en_US" \
> -mysqlHost "localhost" \
> -mysqlUser "mary" \
> -mysqlPasswd "wiki123" \
> -mysqlDB "wiki" \
> -listFile "/current-dir/wikilist.txt"
>
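
The -listFile argument expects a text file with one xml file name (plus path) per line. A minimal sketch of how such a list could be generated, assuming the split files end in .xml and sit directly in one directory; the helper name make_wiki_list and the example paths are my own placeholders, not part of MARY:

```shell
#!/bin/bash
# Hypothetical helper (not part of MARY TTS): write the absolute paths of all
# *.xml files directly inside a directory to a list file, one path per line.
make_wiki_list() {
    local split_dir="$1"   # directory holding the split wikipedia xml files (assumed layout)
    local list_file="$2"   # output file, to be passed to -listFile
    local abs_dir
    # Resolve to an absolute path so the list works from any working directory.
    abs_dir="$(cd "$split_dir" && pwd)" || return 1
    find "$abs_dir" -maxdepth 1 -type f -name '*.xml' | sort > "$list_file"
}

# Example call (placeholder paths):
# make_wiki_list /current-dir/split /current-dir/wikilist.txt
```

The resulting file can then be given to WikipediaProcessor via -listFile, as in the script above.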

Is this how you are trying to run the WikipediaProcessor? What is the error?
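
If what you see is the "cannot find symbol" output quoted further down in this thread, one possible cause (a guess, since I have not seen your setup) is that javac was started inside marytts/tools/dbselection. From there it cannot locate DBHandler and WikipediaMarkupCleaner, because it looks for classes of the package marytts.tools.dbselection relative to the source root. A rough sketch of compiling from the source root instead, assuming the sources and the two jars from the script above live under $MARY_BASE/java; the function name is hypothetical, and further jars may be needed on the classpath:

```shell
#!/bin/bash
# Hypothetical helper: compile WikipediaProcessor from the source root so that
# javac can resolve the other classes in marytts.tools.dbselection.
# Assumption: sources and jars are under the directory given as $1.
compile_wikipedia_processor() {
    local src_root="$1"    # e.g. "$MARY_BASE/java"
    local cp="$src_root:$src_root/commons-lang-2.4.jar:$src_root/mwdumper-2008-04-13.jar"
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$src_root" || exit 1
        javac -classpath "$cp" -sourcepath "$src_root" \
            marytts/tools/dbselection/WikipediaProcessor.java
    )
}

# Example call (MARY_BASE as set in the script above):
# compile_wikipedia_processor "$MARY_BASE/java"
```

With -sourcepath pointing at the source root, javac compiles the referenced classes from the same package along with WikipediaProcessor.java.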

Best wishes,

-Ingmar

On 14.09.2011 19:20, Luisina Violante wrote:
> Hello, I'm stuck in the same place. I know that it's a dependency problem, but I don't know how to solve it. I've never worked with Java before.
> I would appreciate it if you could help me with this issue.
>
> Luisina
>
>
> Dear all,
>
> I've split the Wikipedia XML dump and created the MySQL database, but I am
> stuck at WikipediaProcessor. In order to compile it, I run in a terminal:
> MARYTTS/java/marytts/tools/dbselection$ javac WikipediaProcessor.java
>
> I receive the message:
> WikipediaProcessor.java:305: cannot find symbol
> symbol  : class DBHandler
> location: class marytts.tools.dbselection.WikipediaProcessor
>          DBHandler wikiToDB = new DBHandler(wiki.getLocale());
>          ^
> WikipediaProcessor.java:305: cannot find symbol
> symbol  : class DBHandler
> location: class marytts.tools.dbselection.WikipediaProcessor
>          DBHandler wikiToDB = new DBHandler(wiki.getLocale());
>                                   ^
> WikipediaProcessor.java:350: cannot find symbol
> symbol  : class WikipediaMarkupCleaner
> location: class marytts.tools.dbselection.WikipediaProcessor
>                 WikipediaMarkupCleaner wikiCleaner = new WikipediaMarkupCleaner();
>                 ^
> WikipediaProcessor.java:350: cannot find symbol
> symbol  : class WikipediaMarkupCleaner
> location: class marytts.tools.dbselection.WikipediaProcessor
>                 WikipediaMarkupCleaner wikiCleaner = new WikipediaMarkupCleaner();
>                                                          ^
> 4 errors
>
> Could somebody tell me what my mistake is?
>
> --
> Thodoris Mironidis
> Master student
> Laboratory of Multimedia Technology and Computer Graphics (LTMI)
> Department of Applied Informatics (DoAI) | University of Macedonia (UoM)
> 156 Egnatia Street | GR-540 06 | Thessaloniki | Greece
> T: +30 2310 891 839 | E: mai1041 at uom.gr
> W: http://www.multimedia.uom.gr/

-- 
Ingmar Steiner
Postdoctoral Researcher

LORIA Speech Group, Nancy, France
National Institute for Research in
Computer Science and Control (INRIA)

