Multi-lingual ICD-10 Coding using a Hybrid rule-based and Supervised Classification Approach at CLEF eHealth 2017

Jurica Seva, Madeleine Kittner, Roland Roller, Ulf Leser

In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum. Conference and Labs of the Evaluation Forum (CLEF-2017), Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, September 11-14, Dublin, Ireland. CEUR Workshop Proceedings 1866, 2017.


In this paper we present our approach and results for the CLEF eHealth challenge 2017, Track 1. The task involves the recognition of ICD-10 codes in English and French death certificates and their mapping to the certificate text. Our approach is a two-tier, two-stage process. First, a rule-based system, built on handcrafted rules and Apache Solr, performs Named Entity Recognition (NER) for ICD-10 codes. This step produces a set of candidate codes extracted from the input text. Next, tf-idf weighted character n-gram classification models normalize and rank the previously generated ICD-10 candidate set. The classification models follow the hierarchical structure of the given ICD-10 dictionaries: we create individual models for the first two hierarchical levels (chapters and blocks). Finally, the top candidate in the overlap between the list of possible ICD-10 code candidates (input list) and the ranked list of final ICD-10 candidates (output list) is taken as the final ICD-10 code. Although the ICD-10 candidate NER is language-dependent, the normalization and ranking of candidates use a language-independent approach.
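
The second stage and the final overlap step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionary, the n-gram range (2 to 4), and all function names are assumptions made for illustration. Cosine similarity against dictionary labels stands in for the trained classification models, and the chapter and block hierarchy is not modelled.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def fit_idf(documents):
    """Compute smoothed inverse document frequencies over character n-grams."""
    df = Counter()
    for doc in documents:
        df.update(set(char_ngrams(doc)))
    total = len(documents)
    return {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(text, idf):
    """Build an L2-normalised tf-idf vector; n-grams unseen at fit time are ignored."""
    tf = Counter(g for g in char_ngrams(text) if g in idf)
    vec = {g: count * idf[g] for g, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def rank_codes(text, dictionary, idf):
    """Rank every dictionary code by similarity to the text (the 'output list')."""
    query = tfidf_vector(text, idf)
    scores = {}
    for code, label in dictionary:
        score = cosine(query, tfidf_vector(label, idf))
        scores[code] = max(score, scores.get(code, 0.0))
    return sorted(scores, key=scores.get, reverse=True)


def pick_final_code(candidates, ranked):
    """Return the best-ranked code that also appears in the NER candidate set."""
    for code in ranked:
        if code in candidates:
            return code
    return None


# Hypothetical toy dictionary; the real ICD-10 dictionaries are far larger.
TOY_DICTIONARY = [
    ("I21", "acute myocardial infarction"),
    ("I50", "heart failure"),
    ("J18", "pneumonia"),
    ("C34", "malignant neoplasm of bronchus and lung"),
]
IDF = fit_idf([label for _, label in TOY_DICTIONARY])

# Candidates would come from the rule-based NER stage (the 'input list').
ranked = rank_codes("myocardial infarct", TOY_DICTIONARY, IDF)
final_code = pick_final_code({"I21", "J18"}, ranked)
```

Because the features are character n-grams rather than words, the same ranking code applies to English and French text without language-specific tokenization, which matches the language-independent claim made for this stage.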
