
Publication

Pythagoras: Semantic Type Detection of Numerical Data in Enterprise Data Lakes

Sven Langenecker; Christoph Sturm; Christian Schalles; Carsten Binnig
In: Letizia Tanca; Qiong Luo; Giuseppe Polese; Loredana Caruccio; Xavier Oriol; Donatella Firmani (Eds.). Proceedings 27th International Conference on Extending Database Technology, EDBT 2024, Paestum, Italy, March 25 - March 28. International Conference on Extending Database Technology (EDBT), Pages 725-733, OpenProceedings.org, 2024.

Abstract

Detecting semantic types of table columns is a crucial task to enable dataset discovery in data lakes. However, prior semantic type detection approaches have primarily focused on non-numerical data despite the fact that numerical data play an essential role in many real-world enterprise data lakes. Therefore, existing models are typically rather inadequate when applied to data lakes that contain a high proportion of numerical data. In this paper, we introduce Pythagoras, our new learned semantic type detection approach specially designed to support numerical along with non-numerical data. Pythagoras uses a GNN in combination with a novel graph representation of tables to predict the semantic types for numerical data with high accuracy. In our experiments, we compare Pythagoras against five state-of-the-art approaches using two different datasets and show that our model significantly outperforms these baselines on numerical data. In comparison to the best existing approach, we achieve F1-Score increases of around +22%, which sets new benchmarks.
