Outlier detection has been studied for many years. However, it becomes more challenging when
applied to streaming categorical data. Most algorithms that focus on outlier detection over
categorical data are static in nature. Meanwhile, the amount of data is growing
continuously, affecting several domains. We must act on such "Big Data" streams and identify
outliers "on the fly". Hence, as part of this thesis, a "Cumulative probability" pipeline has been
implemented to parallelize the computation and scale to the amount of data in foreseeable time. "Cumulative
probability" is used as the core metric to detect outliers effectively and efficiently over streams
of data. The implementation is done using the recent Apache Spark framework and socket
programming. Amazon Web Services has been used to empirically evaluate the "Cumulative probability"
pipeline on the basis of generality and scalability. The design adapts to the Big Data
streaming scenario. It is best suited for detecting outliers in near real time. It finds application
in outlier detection for domains where devices continuously emit data.