Page Frame Detection for Marginal Noise Removal from Scanned Documents

Faisal Shafait, Joost van Beusekom, Daniel Keysers, Thomas Breuel

In: Proceedings of the 15th Scandinavian Conference on Image Analysis (SCIA-2007), June 10-14, 2007, Aalborg, Denmark, pages 651-660, Lecture Notes in Computer Science (LNCS) 4522, Springer, 6/2007.


We describe and evaluate a method to robustly detect the page frame in document images, locating the actual page contents area and removing textual and non-textual noise along the page borders. We use a geometric matching algorithm to find the optimal page frame, which has the advantages of not assuming the existence of whitespace between noisy borders and actual page contents, and of giving a practical solution to the page frame detection problem without the need for parameter tuning. We define suitable performance measures and evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. In addition, we demonstrate that the use of page frame detection reduces the OCR error rate by removing textual noise. Experiments using a commercial OCR system show that the error rate due to elements outside the page frame is reduced from 4.3% to 1.7% on the UW-III dataset.
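As an illustration of the noise-removal step only, the following minimal sketch assumes a page frame has already been found and shows how elements outside it could be discarded before OCR. It does not implement the geometric matching algorithm from the paper; the `Box` type, the function names, and the rule that a component must lie fully inside the frame are all assumptions made for this example.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper type)."""
    x0: int  # left
    y0: int  # top
    x1: int  # right
    y1: int  # bottom


def is_inside(frame: Box, box: Box) -> bool:
    """Return True if `box` lies entirely within `frame` (borders included)."""
    return (
        frame.x0 <= box.x0
        and frame.y0 <= box.y0
        and box.x1 <= frame.x1
        and box.y1 <= frame.y1
    )


def filter_by_page_frame(frame: Box, components: List[Box]) -> List[Box]:
    """Keep only the components inside the page frame.

    Assumption: anything not fully contained in the frame is treated as
    marginal noise. The paper may handle partial overlaps differently.
    """
    return [c for c in components if is_inside(frame, c)]


# Hypothetical example: one text line inside the frame, one noise blob
# at the left border, and one fragment from a neighbouring page.
frame = Box(100, 100, 900, 1300)
components = [
    Box(120, 150, 880, 180),   # text line inside the frame
    Box(0, 0, 40, 1400),       # dark border strip from scanning
    Box(920, 200, 1000, 230),  # text from the neighbouring page
]
kept = filter_by_page_frame(frame, components)
```

Here only the first component survives the filter. In practice the frame would come from the detection method described above rather than being hard-coded.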

Further Links

shafait--page-frame-detection-for-marginal-noise-removal-from-scanned-documents--SCIA--2007.pdf (pdf, 456 KB)

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence