Word-Based and Character-Based Word Segmentation Models: Comparison and Combination

Weiwei Sun

In: Chu-Ren Huang; Dan Jurafsky (eds.). 23rd International Conference on Computational Linguistics. International Conference on Computational Linguistics (COLING-10), August 23-27, Beijing, China, Pages 1211-1219, Coling 2010 Organizing Committee, 8/2010.


We present a theoretical and empirical comparative analysis of the two dominant categories of approaches in Chinese word segmentation: word-based models and character-based models. We show that, despite similar overall performance, the two models produce different distributions of segmentation errors, in a way that can be explained by theoretical properties of the two models. The analysis is further exploited to improve segmentation accuracy by integrating a word-based segmenter and a character-based segmenter. A Bootstrap Aggregating model is proposed. By letting multiple segmenters vote, our model improves segmentation consistently on the four different data sets from the second SIGHAN bakeoff.
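The abstract does not spell out how the segmenters vote, so the following is only a minimal sketch of one plausible scheme, not the paper's implementation. It assumes each segmenter returns a list of words for the same sentence, converts each output to per-character tags (`B` for word-initial, `I` for word-internal), and takes a majority vote at every character position. The function names `bootstrap_sample`, `to_tags`, `from_tags`, and `vote` are hypothetical, as is the per-character voting granularity.

```python
import random
from collections import Counter


def bootstrap_sample(sentences, rng):
    """Draw len(sentences) items with replacement (one bagging replicate)."""
    return [rng.choice(sentences) for _ in sentences]


def to_tags(words):
    """Word list -> per-character tags: 'B' starts a word, 'I' continues it."""
    tags = []
    for word in words:
        if word:  # skip empty strings defensively
            tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def from_tags(chars, tags):
    """Rebuild a word list from characters and their B/I tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


def vote(segmentations):
    """Majority vote per character over several segmentations of one sentence.

    Assumption: an odd number of voters, so no ties arise with two tags.
    """
    chars = "".join(segmentations[0])
    all_tags = [to_tags(seg) for seg in segmentations]
    if any(len(tags) != len(chars) for tags in all_tags):
        raise ValueError("all segmentations must cover the same characters")
    voted = [Counter(column).most_common(1)[0][0] for column in zip(*all_tags)]
    return from_tags(chars, voted)


# Three hypothetical segmenter outputs for the same sentence.
outputs = [
    ["研究", "生命", "起源"],
    ["研究生", "命", "起源"],
    ["研究", "生命", "起源"],
]
result = vote(outputs)
print(result)
```

In a bagging setup, each voter would be a segmenter trained on a different `bootstrap_sample` of the training sentences; the training step itself is omitted here because the abstract gives no details about it.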

