Paraphrase Fragment Extraction from Monolingual Comparable Corpora

Rui Wang, Chris Callison-Burch

In: Proceedings of the ACL Workshop on Building and Using Comparable Corpora. Workshop on Building and Using Comparable Corpora (BUCC-2011), June 24, Portland, Oregon, United States. Association for Computational Linguistics, 6/2011.


We present a novel method for extracting paraphrase fragment pairs from a monolingual comparable corpus containing different articles about the same topics or events. The procedure consists of three stages: document pair extraction, sentence pair extraction, and fragment pair extraction. At each stage, we evaluate the intermediate results manually and tune the later stages accordingly. With this minimally supervised approach, we achieve 62% accuracy on the paraphrase fragment pairs we collected and 67% on those extracted from the MSR corpus. The results look promising given the minimal supervision the approach requires, and the approach can be scaled up further.
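
The three-stage procedure named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the similarity measures, thresholds, or fragment extraction technique, so everything below is an assumption made for illustration. Word-overlap (Jaccard) similarity stands in for the document and sentence pairing steps, and a simple "strip the shared prefix and suffix" heuristic stands in for fragment extraction. All function names and threshold values are hypothetical.

```python
import re
from itertools import product


def tokens(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Word-overlap similarity between two token lists (0.0 if either is empty)."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def split_sentences(doc):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]


def document_pairs(docs_a, docs_b, threshold=0.2):
    """Stage 1 (assumed): keep document pairs whose word overlap is high enough."""
    return [(a, b) for a, b in product(docs_a, docs_b)
            if jaccard(tokens(a), tokens(b)) >= threshold]


def sentence_pairs(doc_a, doc_b, threshold=0.3):
    """Stage 2 (assumed): keep sentence pairs whose word overlap is high enough."""
    return [(sa, sb)
            for sa, sb in product(split_sentences(doc_a), split_sentences(doc_b))
            if jaccard(tokens(sa), tokens(sb)) >= threshold]


def fragment_pair(sent_a, sent_b):
    """Stage 3 (toy stand-in): drop the shared prefix and suffix tokens and
    return the two differing middle spans, or None if either span is empty."""
    ta, tb = tokens(sent_a), tokens(sent_b)
    shortest = min(len(ta), len(tb))
    start = 0
    while start < shortest and ta[start] == tb[start]:
        start += 1
    end = 0
    while end < shortest - start and ta[-1 - end] == tb[-1 - end]:
        end += 1
    frag_a = ta[start:len(ta) - end]
    frag_b = tb[start:len(tb) - end]
    if frag_a and frag_b:
        return " ".join(frag_a), " ".join(frag_b)
    return None


def extract_fragment_pairs(docs_a, docs_b):
    """Chain the three stages and collect candidate fragment pairs."""
    results = []
    for doc_a, doc_b in document_pairs(docs_a, docs_b):
        for sent_a, sent_b in sentence_pairs(doc_a, doc_b):
            pair = fragment_pair(sent_a, sent_b)
            if pair is not None:
                results.append(pair)
    return results


if __name__ == "__main__":
    docs_a = ["The company said it will buy the startup next year. "
              "Shares rose on the news."]
    docs_b = ["The company said it is going to acquire the startup next year. "
              "Analysts were surprised."]
    print(extract_fragment_pairs(docs_a, docs_b))
```

On the two made-up example documents, the sketch pairs the first sentences and returns the candidate fragment pair ("will buy", "is going to acquire"). In a setting like the one the abstract describes, the output of each stage would be inspected manually before the thresholds of the later stages are tuned.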


BUCC2011.pdf (pdf, 324 KB)

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence