Bridging the Gap: Towards Optimization Across Linear and Relational Algebra

Andreas Kunft, Alexander Alexandrov, Asterios Katsifodimos, Volker Markl

In: Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR), co-located with SIGMOD/PODS '16, July 1, 2016, San Francisco, CA, USA. ACM, New York, NY, USA, 2016. ISBN 978-1-4503-4311-4.


Advanced data analysis typically requires some form of pre-processing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink, etc.), while machine learning is expressed in linear algebra with iterations. Programmers therefore perform end-to-end data analysis utilizing multiple programming paradigms and systems. This impedance mismatch not only hinders productivity but also prevents optimization opportunities, such as sharing of physical data layouts (e.g., partitioning) and data structures among different parts of a data analysis program. The goal of this work is twofold. First, it aims to alleviate the impedance mismatch by allowing programmers to author complete end-to-end programs in one engine-independent language that is automatically parallelized. Second, it aims to enable joint optimizations over both relational and linear algebra. To achieve this goal, we present the design of Lara, a deeply embedded language in Scala which enables authoring scalable programs using two abstract data types (DataBag and Matrix) and control flow constructs. Programs written in Lara are compiled to an intermediate representation (IR) which enables optimizations across linear and relational algebra. The IR is finally used to compile code for different execution engines.
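To make the programming model concrete, the following is a minimal, self-contained Scala sketch of the kind of end-to-end program the abstract describes: a relational pre-processing step over a DataBag, followed by a bridge into a Matrix for a linear-algebra step. The names DataBag and Matrix come from the abstract, but their methods here (map, withFilter, toMatrix, transpose, multiplication) and the local, in-memory implementations are illustrative assumptions, not the actual Lara API.

```scala
// Minimal local stand-in for the DataBag abstraction (assumed API).
final case class DataBag[A](elems: Seq[A]) {
  def map[B](f: A => B): DataBag[B] = DataBag(elems.map(f))
  def withFilter(p: A => Boolean): DataBag[A] = DataBag(elems.filter(p))
}

// Minimal dense stand-in for the Matrix abstraction (assumed API).
final case class Matrix(rows: Vector[Vector[Double]]) {
  // Transpose: rows become columns.
  def t: Matrix = Matrix(rows.transpose.map(_.toVector).toVector)
  // Naive dense matrix multiplication: dot each row with each column.
  def *(that: Matrix): Matrix = Matrix(
    rows.map { r =>
      that.rows.transpose.map { c =>
        r.zip(c).map { case (a, b) => a * b }.sum
      }.toVector
    }
  )
}

object LaraSketch {
  // Relational part: filter invalid records and project to feature vectors.
  def preprocess(raw: DataBag[(Int, Double, Double)]): DataBag[Vector[Double]] =
    raw.withFilter(_._1 > 0).map { case (_, x1, x2) => Vector(x1, x2) }

  // Bridge between the two abstract data types: one matrix row per record.
  def toMatrix(bag: DataBag[Vector[Double]]): Matrix = Matrix(bag.elems.toVector)

  def main(args: Array[String]): Unit = {
    val raw  = DataBag(Seq((1, 1.0, 2.0), (-1, 9.0, 9.0), (2, 3.0, 4.0)))
    val x    = toMatrix(preprocess(raw)) // linear-algebra part begins here
    val gram = x.t * x                   // e.g. X^T X for a least-squares step
    println(gram.rows)
  }
}
```

In a holistically optimized setting, the point of such a unified program is that the compiler sees both halves: it could, for instance, choose a partitioning for the pre-processing output that is already suitable as a row-partitioned matrix layout, which is exactly the kind of cross-paradigm optimization the abstract motivates.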


Deutsches Forschungszentrum für Künstliche Intelligenz (German Research Center for Artificial Intelligence)