DFKI-LT - The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions
The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions
Journal of Biomedical Informatics, volume 45, number 5
The management of drug-drug interactions (DDIs) is a critical issue because of the overwhelming amount of information available on them. Natural Language Processing (NLP) techniques can reduce the time healthcare professionals spend reviewing the biomedical literature. However, NLP techniques rely mostly on the availability of annotated corpora. While there are several corpora annotated with biological entities and their relationships, there is a lack of corpora annotated with pharmacological substances and DDIs. Moreover, other works in this field have focused on pharmacokinetic (PK) DDIs only, and not on pharmacodynamic (PD) DDIs. To address this problem, we have created a manually annotated corpus consisting of 792 texts selected from the DrugBank database and 233 additional MEDLINE abstracts. This fine-grained corpus has been annotated with a total of 18,502 pharmacological substances and 5,028 DDIs, including both PK and PD interactions. The quality and consistency of the annotation process have been ensured through the creation of annotation guidelines and evaluated by measuring the inter-annotator agreement between two annotators. The agreement was almost perfect (Kappa up to 0.96 and generally over 0.80), except for the DDIs in the MEDLINE database (0.55–0.72). The DDI corpus has been used in the SemEval 2013 DDIExtraction challenge as a gold standard for the evaluation of information extraction techniques applied to the recognition of pharmacological substances and the detection of DDIs from biomedical texts. DDIExtraction 2013 attracted wide attention, with a total of 14 teams from 7 different countries. For the task of recognition and classification of pharmacological names, the best system achieved an F1 of 71.5%, while for the detection and classification of DDIs, the best result was an F1 of 65.1%.
These results show that the corpus is of sufficient quality to be used for training and testing NLP techniques applied to the field of pharmacovigilance. The DDI corpus and the annotation guidelines are free to use for academic research and are available at http://labda.inf.uc3m.es/ddicorpus.