Cost-based Fault-tolerance for Parallel Data Processing
Abdallah Salama; Carsten Binnig; Tim Kraska; Erfan Zamanian
In: Timos K. Sellis; Susan B. Davidson; Zachary G. Ives (Eds.). Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM SIGMOD International Conference on Management of Data (SIGMOD-2015), May 31 - June 4, Melbourne, Victoria, Australia, Pages 285-297, ACM, 2015.
To deal with mid-query failures in parallel data engines (PDEs), different fault-tolerance schemes are implemented today: (1) fault-tolerance in parallel databases is typically implemented in a coarse-grained manner by restarting a query completely when a mid-query failure occurs, and (2) modern MapReduce-style PDEs implement a fine-grained fault-tolerance scheme, which either materializes intermediate results or implements a lineage model to recover from mid-query failures. However, neither of these schemes can efficiently handle mixed workloads with both short-running interactive queries and long-running batch queries, nor do they efficiently support a wide range of cluster setups that vary in cluster size and in other parameters such as the mean time between failures. In this paper, we present a novel cost-based fault-tolerance scheme which tackles this issue. In contrast to the existing schemes, our scheme selects a subset of intermediates to be materialized such that the total query runtime is minimized under mid-query failures. Our experiments show that our cost-based fault-tolerance scheme outperforms all existing strategies and always selects the sweet spot for short- and long-running queries as well as for different cluster setups.
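
The idea of choosing which intermediates to materialize can be illustrated with a small sketch. It is not the cost model from the paper; it is a toy model under assumptions made here for illustration: the query is a linear pipeline of operators, failures arrive as a Poisson process with rate 1/MTBF, and a failure restarts the work since the last materialized intermediate. Under these assumptions the expected time to finish a segment of length w is MTBF * (exp(w / MTBF) - 1). The names `expected_segment_time`, `expected_runtime`, and `best_plan` are hypothetical.

```python
from itertools import combinations
from math import expm1


def expected_segment_time(work, mtbf):
    """Expected time to finish `work` time units when a failure (Poisson,
    mean time between failures `mtbf`) restarts the segment from scratch.
    Assumption of this sketch, not taken from the paper."""
    return mtbf * expm1(work / mtbf)


def expected_runtime(op_times, mat_costs, materialized, mtbf):
    """Expected runtime of a linear pipeline for one materialization choice.
    `materialized` holds the indices of operators whose output is written out."""
    total = 0.0
    segment = 0.0
    for i, op_time in enumerate(op_times):
        segment += op_time
        if i in materialized:
            # Writing the intermediate is part of the segment it closes.
            segment += mat_costs[i]
            total += expected_segment_time(segment, mtbf)
            segment = 0.0
    if segment > 0.0:
        total += expected_segment_time(segment, mtbf)
    return total


def best_plan(op_times, mat_costs, mtbf):
    """Enumerate every subset of intermediates and keep the cheapest one.
    Exhaustive search is exponential; it is only meant for tiny examples."""
    candidates = range(len(op_times) - 1)  # the final output is not an intermediate
    best_cost, best_subset = None, None
    for k in range(len(op_times)):
        for subset in combinations(candidates, k):
            cost = expected_runtime(op_times, mat_costs, set(subset), mtbf)
            if best_cost is None or cost < best_cost:
                best_cost, best_subset = cost, set(subset)
    return best_cost, best_subset


if __name__ == "__main__":
    # Short query on a reliable cluster versus a long query on an unreliable one.
    print(best_plan([1, 1, 1], [0.5, 0.5, 0.5], mtbf=1e6))
    print(best_plan([100, 100, 100], [5, 5, 5], mtbf=100))
```

In this toy setting, the short query on a reliable cluster is cheapest with no materialization, while the long query on a failure-prone cluster is cheapest when both intermediates are materialized, which mirrors the trade-off between the coarse-grained and fine-grained schemes described above.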