Towards Generic Text-Line Extraction

Syed Saqib Bukhari; Faisal Shafait; Thomas Breuel
In: 12th International Conference on Document Analysis and Recognition (ICDAR-2013), August 25-28, Washington, DC, USA, IEEE, 2013.


Text-line extraction is the backbone of document image analysis. Over the past decades, a large number of text-line finding methods have been proposed, but these methods rely on certain assumptions about a target class of documents with respect to writing styles, digitization methods, intensity values, and scripts. There is no generic text-line finding method that can be robustly applied to a large variety of simple and complex document images. We previously introduced the ridge-based text-line finding method and published its initial results for curled text-line detection on camera-captured document images. In this paper, we demonstrate our ridge-based method as a generic text-line finding approach that can be robustly applied to a diverse collection of simple and complex document images. A comprehensive performance evaluation of the ridge-based method and a comparison with several state-of-the-art methods are presented in the paper. For this purpose, diverse categories of publicly available, standard datasets have been selected: UWIII (scanned, printed English script), DFKI-I (camera-captured, printed English script), UMD (handwritten Chinese, Hindi, and Korean scripts), ICDAR2007 handwritten segmentation contest (handwritten English, French, German, and Greek scripts), Arabic/Urdu (scanned, printed script), and Fraktur (scanned, calligraphic German script). Experiments on these datasets show that the ridge-based method achieves better text-line extraction results than the best-performing, domain-specific text-line finding methods. Firstly, these results show that the ridge-based method is a generic text-line extraction method. Secondly, these results help the community assess the advantages of this method.