Combining OCR Outputs for Logical Document Structure Markup. Technical Background to the ACL 2012 Contributed Task

Ulrich Schäfer, Benjamin Weitz
Proceedings of the ACL-2012 Main Conference Workshop on Rediscovering 50 Years of Discoveries, pages 104-109, Jeju Island, Republic of Korea, Association for Computational Linguistics, July 2012
 
We describe how paperXML, a logical document structure markup for scholarly articles, is generated on the basis of OCR tool outputs. PaperXML was initially developed for the ACL Anthology Searchbench. Its main purpose was to robustly provide uniform access to sentences in ACL Anthology papers from the past 46 years, ranging from scanned, typewriter-written conference and workshop proceedings papers to recent high-quality typeset, born-digital journal articles with varying layouts. PaperXML markup includes information on page and paragraph breaks, section headings, footnotes, tables, captions, boldface and italic character styles, as well as bibliographic and publication metadata. The role of paperXML in the ACL Contributed Task Rediscovering 50 Years of Discoveries is to serve as a fallback source (1) for older, scanned papers (mostly published before the year 2000), for which born-digital PDF sources are not available, (2) for born-digital PDF papers on which the PDFExtract method failed, and (3) for document parts where PDFExtract does not yet output useful markup, such as tables. We sketch the transformation of paperXML into the ACL Contributed Task's TEI P5 XML.
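The three fallback cases named in the abstract can be read as a simple source-selection rule. The following is a minimal sketch of that rule only, not the authors' implementation: the `Paper` class, its fields, the `choose_source` function, and the `"table"` part label are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record; field names are assumptions, not from the paper."""
    paper_id: str
    born_digital_pdf: bool  # False for scanned papers
    pdfextract_ok: bool     # assumed flag: did PDFExtract succeed on this PDF?


def choose_source(paper: Paper, part: str = "text") -> str:
    """Return "paperxml" or "pdfextract" for one part of a paper."""
    if not paper.born_digital_pdf:
        return "paperxml"   # case (1): scanned paper, no born-digital PDF
    if not paper.pdfextract_ok:
        return "paperxml"   # case (2): PDFExtract failed on this paper
    if part == "table":
        return "paperxml"   # case (3): PDFExtract gives no useful table markup
    return "pdfextract"     # otherwise prefer the born-digital extraction
```

For instance, `choose_source(Paper("example-1", True, True), part="table")` selects paperXML for a table even though PDFExtract succeeded on the rest of the paper.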
 
Files: BibTeX, W12-3212.pdf