Publication
Learning What Matters: Automated Feature Selection for Learned Performance Models in Parallel Stream Processing
Pratyush Agnihotri; Carsten Binnig; Manisha Luthra
In: Proceedings of the VLDB Endowment, VLDB 2025 Workshops. 6th Applied AI for Database Systems and Applications (AIDB-2025), located at the International Conference on Very Large Data Bases (VLDB-2025), September 1-5, London, United Kingdom, VLDB Endowment, 9/2025.
Abstract
Predicting performance in distributed stream processing systems relies on selecting relevant input features, a process that traditionally requires expert-driven manual tuning. However, manual selection is inefficient and prone to suboptimal choices. For example, features overly tied to one workload can cause the performance model to mispredict on another workload, hurting the model’s generalizability. This paper presents an automated feature selection approach that systematically identifies the most relevant features across workload (stream and query) and resource dimensions for learned performance models. We employ feature selection strategies, including feature ablation and statistical relevance analysis, to evaluate feature importance and to distinguish transferable from non-transferable features, thereby improving generalization. By optimizing the feature set, our approach enhances the accuracy of performance prediction, reduces feature redundancy, and thus improves parallelism tuning efficiency compared to manual selection. We demonstrate that our approach reduces reliance on manual tuning and training effort by 11× while maintaining robustness of the performance models.
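To make the two selection strategies named in the abstract concrete, the following is a minimal sketch of drop-column feature ablation combined with mutual-information relevance scoring for a learned latency model. The feature names, synthetic data, random-forest surrogate, and selection thresholds are illustrative assumptions, not the authors' actual implementation or feature set.

```python
# Hedged sketch: feature ablation + statistical relevance analysis for
# selecting input features of a learned performance model. All names and
# data below are hypothetical placeholders, not the paper's artifacts.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy workload/query/resource features standing in for real DSP metrics.
feature_names = ["tuple_rate", "window_size", "selectivity",
                 "operator_parallelism", "cpu_cores"]
X = rng.uniform(size=(2000, len(feature_names)))
# Synthetic latency target: driven mainly by rate, window, and parallelism.
y = 5 * X[:, 0] + 3 * X[:, 1] - 4 * X[:, 3] + 0.1 * rng.normal(size=2000)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def val_error(cols):
    """Train a surrogate model on the given feature columns, return val MAE."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_tr[:, cols], y_tr)
    return mean_absolute_error(y_val, model.predict(X_val[:, cols]))

baseline = val_error(list(range(len(feature_names))))

# Feature ablation: how much does validation error grow when a feature is dropped?
ablation = {}
for i, name in enumerate(feature_names):
    kept = [j for j in range(len(feature_names)) if j != i]
    ablation[name] = val_error(kept) - baseline

# Statistical relevance: mutual information between each feature and the target.
mi = dict(zip(feature_names,
              mutual_info_regression(X_tr, y_tr, random_state=0)))

for name in feature_names:
    print(f"{name:22s} ablation Δ={ablation[name]:+.3f}  MI={mi[name]:.3f}")

# Keep features that both hurt accuracy when removed and carry statistical signal
# (the 0.01 / 0.05 thresholds are arbitrary choices for this toy example).
selected = [n for n in feature_names if ablation[n] > 0.01 and mi[n] > 0.05]
print("selected features:", selected)
```

In this sketch, features with near-zero ablation impact and low mutual information (e.g. the synthetic `selectivity` and `cpu_cores` columns) would be pruned, mirroring the paper's goal of discarding redundant or workload-specific features before training the performance model.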