Using Balanced Training to Minimize Biased Classification

Redy Andriyansah, Syed Saqib Bukhari, Martin Jenckel, Andreas Dengel

In: HIP '19: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing. International Workshop on Historical Document Imaging and Processing (HIP), Sydney, Australia. Pages 31-36. ISBN 978-1-4503-7668-6. Association for Computing Machinery, 9/2019.


In this paper, we classify semantic zones in document images and observe how balanced training influences classification performance. Unlike whole documents, which normally differ in both content and structural layout, semantic zones exhibit stronger inter-class ambiguity because they lose layout features. Zone extraction from documents often results in an unbalanced class distribution, and our experiments show that training on such data leads to biased classification. We classify semantic zones using AlexNet, a Convolutional Neural Network (CNN), on three corpora: University of Washington (UW) III, German historical document images (OCRD), and a combination of both data sets. Because the zone distribution is heavily unbalanced, we augment the data and balance the training distribution to prevent the major classes from dominating. To maintain accuracy, we adopt transfer learning from a larger document corpus (RVL-CDIP). Besides deep learning, we also use a heuristic approach to compare performance between balanced and unbalanced training. The results show that balanced training can alleviate biased performance.
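
The abstract does not spell out how the training distribution is balanced. As an illustration only, the idea can be sketched as random oversampling, where every minority class is topped up until it matches the largest class. The function name `balance_by_oversampling`, the zone identifiers, and the class names below are hypothetical and not taken from the paper; in an image pipeline the extra draws would typically be replaced by augmented variants rather than exact copies.

```python
import random
from collections import Counter, defaultdict


def balance_by_oversampling(samples, labels, seed=0):
    """Oversample so that every class has as many items as the largest class.

    Illustrative sketch only; not the procedure used in the paper.
    """
    rng = random.Random(seed)

    # Group samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # The largest class sets the target size for all classes.
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Keep every original item, then draw the shortfall with replacement.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for item in items + extra:
            out_samples.append(item)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical, heavily unbalanced zone labels: 7 text, 2 table, 1 math.
zones = [f"zone_{i}" for i in range(10)]
labels = ["text"] * 7 + ["table"] * 2 + ["math"] * 1

balanced_zones, balanced_labels = balance_by_oversampling(zones, labels)
print(Counter(balanced_labels))  # each class now has 7 entries
```

Oversampling keeps all original zones, which matters when the minority classes are already small; undersampling the major classes would be the alternative when the data set is large enough to afford discarding samples.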

German Research Center for Artificial Intelligence