Publication
SportsTables: A New Corpus for Semantic Type Detection (Extended Version)
Sven Langenecker; Christoph Sturm; Christian Schalles; Carsten Binnig
In: Datenbank-Spektrum (Spektrum), Vol. 23, No. 3, Pages 189-197, Springer, 2023.
Abstract
Table corpora such as VizNet or TURL, which contain annotated semantic types per column,
are important for building machine learning models for the task of automatic semantic type detection.
However, there is a large discrepancy between the corpora used for training and testing and
real-world data lakes, since data lakes contain a large fraction of numerical data that is not present in existing
corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion
of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our
corpus SportsTables has on average approximately 86% numerical columns, posing new challenges to
existing semantic type detection models, which have mainly targeted non-numerical columns so far.
To demonstrate this effect, we present the results of a first study that applies a state-of-the-art approach for
semantic type detection to our new corpus and show significant performance differences in
predicting semantic types for textual and numerical data.
