Publication

Word-Based and Character-Based Word Segmentation Models: Comparison and Combination

Weiwei Sun

In: Chu-Ren Huang; Dan Jurafsky (eds.). 23rd International Conference on Computational Linguistics. International Conference on Computational Linguistics (COLING-10), August 23-27, Beijing, China, Pages 1211-1219, Coling 2010 Organizing Committee, 8/2010.

Abstract

We present a theoretical and empirical comparative analysis of the two dominant categories of approaches in Chinese word segmentation: word-based models and character-based models. We show that, in spite of similar performance overall, the two models produce different distributions of segmentation errors, in a way that can be explained by theoretical properties of the two models. The analysis is further exploited to improve segmentation accuracy by integrating a word-based segmenter and a character-based segmenter. A Bootstrap Aggregating model is proposed. By letting multiple segmenters vote, our model improves segmentation consistently on the four different data sets from the second SIGHAN bakeoff.
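The voting step mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each segmenter has already produced a word list for the same sentence, that votes are cast per gap between adjacent characters, and that a word break is kept only when a strict majority of segmenters propose it. The function names `to_boundaries` and `vote` are hypothetical, and training the segmenters on bootstrap samples is left out.

```python
def to_boundaries(words):
    """Return one flag per gap between adjacent characters; True marks a word break."""
    flags = []
    for word in words:
        flags.extend([False] * (len(word) - 1))  # gaps inside a word
        flags.append(True)                       # gap after the word
    return flags[:-1]  # the position after the last character is not a gap


def vote(segmentations):
    """Combine several segmentations of one sentence by majority vote per gap.

    Assumption: a break is kept only if more than half of the segmenters
    propose it, so ties resolve to "no break".
    """
    sentence = "".join(segmentations[0])
    if any("".join(seg) != sentence for seg in segmentations):
        raise ValueError("all segmentations must cover the same sentence")
    if not sentence:
        return []

    boundary_sets = [to_boundaries(seg) for seg in segmentations]
    words, start = [], 0
    for gap, flags in enumerate(zip(*boundary_sets)):
        if sum(flags) * 2 > len(flags):  # strict majority votes for a break
            words.append(sentence[start:gap + 1])
            start = gap + 1
    words.append(sentence[start:])  # final word runs to the end of the sentence
    return words


# Three hypothetical segmenter outputs for the same five-character string.
outputs = [
    ["ab", "cde"],
    ["ab", "c", "de"],
    ["a", "b", "cde"],
]
combined = vote(outputs)
```

Here only the break after the second character is proposed by a majority, so `combined` is `["ab", "cde"]`. Voting over gaps rather than whole words keeps the combined output a valid segmentation of the original character sequence.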

Projects

TAKE - Technologies for Advanced Knowledge Extraction

Further Links

http://www.aclweb.org/anthology/C/C10/C10-2139.pdf