Publication

Generalizable Data Cleaning of Tabular Data in Latent Space

Eduardo Souza dos Reis; Mohamed Abdelaal; Carsten Binnig

In: Proceedings of the VLDB Endowment (PVLDB), Vol. 17, No. 13, Pages 4786-4798, VLDB, 2024.

Abstract

In this paper, we present a new method for learned data cleaning. In contrast to existing methods, our method learns to clean data in the latent space. The main idea is that we (1) shape the latent space such that we know the area where clean data resides and (2) learn latent operators trained on error repair (Lopster), which shift erroneous data (e.g., table rows with noise, outliers, or missing values) in their latent representation back to a “clean” region, thus abstracting the complexities of the input domain. By formulating data cleaning as a simple shift operation in latent space, we can repair all types of errors using the same method, which makes it more robust than other methods. Importantly, with our method, we can handle errors that are unseen during the training of our error repair model. We do not rely on an external error detection method as seen in the state-of-the-art; instead, we handle both detection and repair within the Lopster framework. In our evaluation, we show that our approach outperforms existing cleaning methods even when trained on only a subset of the errors that occur in the dirty data.
