Publication
Pythagoras: Semantic Type Detection of Numerical Data in Enterprise Data Lakes
Sven Langenecker; Christoph Sturm; Christian Schalles; Carsten Binnig
In: Letizia Tanca; Qiong Luo; Giuseppe Polese; Loredana Caruccio; Xavier Oriol; Donatella Firmani (Eds.). Proceedings 27th International Conference on Extending Database Technology, EDBT 2024, Paestum, Italy, March 25 - March 28. International Conference on Extending Database Technology (EDBT), Pages 725-733, OpenProceedings.org, 2024.
Abstract
Detecting semantic types of table columns is a crucial task to enable dataset discovery in data lakes. However, prior semantic type detection approaches have primarily focused on non-numerical data despite the fact that numerical data play an essential role in many real-world enterprise data lakes. Therefore, existing models are typically rather inadequate when applied to data lakes that contain a high proportion of numerical data. In this paper, we introduce Pythagoras, our new learned semantic type detection approach specially designed to support numerical along with non-numerical data. Pythagoras uses a GNN in combination with a novel graph representation of tables to predict the semantic types for numerical data with high accuracy. In our experiments, we compare Pythagoras against five state-of-the-art approaches using two different datasets and show that our model significantly outperforms these baselines on numerical data. In comparison to the best existing approach, we achieve F1-Score increases of around +22%, which sets new benchmarks.
