Scalable Outlier Detection over Categorical Big Data Streams

Aamir Lakdawala

Master's thesis, TU Kaiserslautern, 4/2017.


Outlier detection has been studied for many years, but it becomes more challenging when the data is categorical and arrives as a stream. Most algorithms that focus on outlier detection over categorical data are static in nature. Meanwhile, the amount of data is growing continuously, affecting several domains, so we must act on such "Big Data" streams and identify outliers "on the fly". Hence, as part of this thesis, a "Cumulative probability" pipeline has been implemented to parallelize the computation and scale with the amount of data in foreseeable time. "Cumulative probability" is used as the core metric to detect outliers over streams of data effectively and efficiently. The implementation uses the Apache Spark framework and socket programming. Amazon Web Services has been used to empirically evaluate the "Cumulative probability" pipeline with respect to generality and scalability. The design adapts to the Big Data streaming scenario and is best suited for detecting outliers in near real time. It finds application in settings where devices continuously emit data.


Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence