DFKI-LT - Enhancing Chinese Word Segmentation Using Unlabeled Data

Weiwei Sun, Jia Xu
Enhancing Chinese Word Segmentation Using Unlabeled Data
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Pages 970-979, Edinburgh, Scotland, United Kingdom, ACL, Association for Computational Linguistics, 7/2011
This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea about transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.
Files: BibTeX, D11-1090