WikiDBs: A Corpus of Relational Databases From Wikidata

Liane Vogel; Carsten Binnig
In: Rajesh Bordawekar; Cinzia Cappiello; Vasilis Efthymiou; Lisa Ehrlinger; Vijay Gadepally; Sainyam Galhotra; Sandra Geisler; Sven Groppe; Le Gruenwald; Alon Y. Halevy; Hazar Harmouch; Oktie Hassanzadeh; Ihab F. Ilyas; Ernesto Jiménez-Ruiz; Sanjay Krishnan; Tirthankar Lahiri; Guoliang Li; Jiaheng Lu; Wolfgang Mauerer; Umar Farooq Minhas; Felix Naumann; M. Tamer Özsu; El Kindi Rezig; Kavitha Srinivas; Michael Stonebraker; Satyanarayana R. Valluri; Maria-Esther Vidal; Haixun Wang; Jiannan Wang; Yingjun Wu; Xun Xue; Mohamed Zaït; Kai Zeng (Eds.). Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023). International Conference on Very Large Data Bases (VLDB), August 28 - September 1, Vancouver, Canada, CEUR Workshop Proceedings, Vol. 3462, 2023.


In recent years, deep learning on tabular data, also known as tabular representation learning, has gained growing interest. However, representation learning for relational databases with multiple tables is still an under-explored area, which might be due to the lack of openly available resources. Therefore, we introduce WikiDBs, a novel open-source corpus of 10,000 relational databases. Each database consists of multiple tables that are connected by foreign keys. The dataset is based on Wikidata and aims to follow the characteristics of real-world databases. In this paper, we describe the dataset and the method for creating it. We also conduct preliminary experiments on the tasks of imputing missing values and predicting column and table names in the databases.
