DFKI-LT - LThist 2012: First International Workshop on Language Technology for Historical Text(s)
1 Vienna, Austria, ÖGAI, 9/2012
The interaction between Human Language Technology (HLT) and Digital Humanities (DH) at large has been of interest in various projects and initiatives in recent years, which aim to advance language resources and tools for the Humanities, Social Sciences and Cultural Heritage. The specific focus of LThist 2012 lies on the development of technology and resources required for processing historical texts. Workshop contributors and participants discuss strategies for shaping HLT resources (tools, data and metadata) in ways that are maximally beneficial for researchers in the Humanities. The necessity for a strong interplay between proponents from language technology and from the Humanities is also reflected in the invited talks. While Caroline Sporleder takes a language technology perspective, Sonia Horn addresses the needs and requirements from a medical historian's point of view. A major aspect of the workshop is the exchange of experiences with, and comparison of, tools, approaches, and standards that make historical texts accessible to automatic processing. Moreover, LThist encourages the interchange of historical data and processing tools.
In the present workshop, historical texts are understood in two ways: i) texts as documents of older forms of languages, and ii) texts as documentation of historical content. Accordingly, the contributions cover a broad range of topics, genres and diachronic language varieties, including scientific prose, narratives, folk tales and riddles, as well as trade-related documents and marriage license books, the latter being a valuable resource for demographic studies. The presented papers address various aspects of data preparation and (semi-)automatic processing for a number of languages, including Old Swedish, Late Middle English, Middle English, Early Modern English and Modern English, diachronic varieties of German, Dutch and Spanish, and Old Occitan.
The proposed approaches and technical solutions center on problem areas such as improving the OCR quality of historical texts, orthography harmonization, and mapping historical to modern word forms, all of which are prerequisites for automatic mining of historical texts. The possibilities of cross-language transfer of morphosyntactic and syntactic annotation from resource-rich source languages to under-resourced target languages are also examined. Technical infrastructures specifically tailored for historical corpora are discussed, including mark-up languages for historical texts, representation formats for diachronic lexical databases, and processing tools and architectures. Overall, LThist 2012 reflects well the current discussions on automatic processing of historical texts, in which OCR errors and the lack of harmonization in orthography are still major practical issues, but in which machine learning and cross-language transfer are increasingly coming into focus.