Publication
Pythagoras: Semantic Type Detection of Numerical Data Using Graph Neural Networks
Sven Langenecker; Christoph Sturm; Christian Schalles; Carsten Binnig
In: Michael Leyer; Johannes Wichmann (Eds.). Lernen, Wissen, Daten, Analysen (LWDA) Conference Proceedings, Marburg, Germany, October 9-11, 2023. GI-Workshop-Tage "Lernen, Wissen, Daten, Analysen" (LWDA), Pages 146-152, CEUR Workshop Proceedings, Vol. 3630, CEUR-WS.org, 2023.
Abstract
Detecting semantic types of table columns is a crucial task to enable
dataset discovery in data lakes. However, prior semantic type
detection approaches have primarily focused on non-numerical
data, even though numerical data play an essential role in
many real-world enterprise data lakes. Therefore, existing models
are typically inadequate when applied to data lakes that
contain a high proportion of numerical data. In this paper, we
introduce Pythagoras, our new learned semantic type detection
approach specially designed to support numerical along with
non-numerical data. Pythagoras uses a GNN in combination with
a novel graph representation of tables to predict the semantic
types for numerical data with high accuracy. In our experiments,
we compare Pythagoras against five state-of-the-art approaches
using two different datasets and show that our model significantly
outperforms these baselines on numerical data. In comparison
to the best existing approach, we achieve F1-Score increases of
around +22%, which sets new benchmarks.
