Connected Component level Multiscript Identification from Ancient Document Images

Sheikh Faisal Rashid; Faisal Shafait; Thomas Breuel

In: 9th IAPR Workshop on Document Analysis Systems. IAPR International Workshop on Document Analysis Systems (DAS-2010), June 9-11, Boston, MA, USA, online only (DAS Workshop Web page), 6/2010.


In a multilingual optical character recognition (MOCR) environment, an MOCR system usually requires script identification of each word or character before passing it to the OCR engine. Many existing script identification techniques depend on various features extracted from document images at character, word, text line or block level. This feature extraction process is not robust, and features extracted for one script identification problem are not fully applicable to other script identification problems. In this paper, we present a novel and efficient technique for multi-script identification at connected component level using a convolutional neural network. The convolutional neural network acts as a discriminative learning model, where suitable script identification features are automatically extracted and learned as convolution kernels from the raw input. We test our approach on a dataset of ancient document images in mixed Greek and Latin script. We achieve 96.37% accuracy on a test dataset at connected component level, and this accuracy is further improved to 98.40% by taking the majority class in the left-right neighboring area. The main advantage of our approach is that it can be easily adapted to the identification of other scripts and that it can give 100% accuracy at block level.
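
The neighborhood step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the connected components of a text line are already sorted left to right and that each carries a script label predicted by the classifier. The function name `smooth_labels`, the `radius` parameter, and the tie-breaking rule are hypothetical choices made for illustration; the abstract does not specify the window size or how ties are handled.

```python
from collections import Counter


def smooth_labels(labels, radius=2):
    """Relabel each component by majority vote over its left-right window.

    labels: per-component script labels in reading order (assumed input).
    radius: number of neighbors taken on each side (hypothetical parameter).
    """
    smoothed = []
    for i, own in enumerate(labels):
        lo = max(0, i - radius)
        hi = min(len(labels), i + radius + 1)
        counts = Counter(labels[lo:hi])
        top_label, top_count = counts.most_common(1)[0]
        # On a tie, keep the component's own predicted label (assumed rule).
        if counts[own] == top_count:
            smoothed.append(own)
        else:
            smoothed.append(top_label)
    return smoothed


# One isolated "latin" prediction inside a Greek run is corrected,
# while the genuine Greek-to-Latin boundary is left in place.
line = ["greek", "greek", "latin", "greek", "greek", "greek",
        "latin", "latin", "latin", "latin"]
print(smooth_labels(line))
```

A larger window removes more isolated errors but blurs short runs of a second script, so the radius would need tuning against the typical run length in the documents at hand.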


Rashid-Script-Detection-DAS10.pdf (pdf, 3 MB)

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence