Visual Concept Learning from User-tagged Web Video
PhD-Thesis, University of Kaiserslautern, ISBN 978-3-86853-248-7, Dr. Hut-Verlag, München, 10/2009.
As digital video has become a source of information and entertainment for millions of users, video databases are growing at enormous rates, and both research and industry have recognized the need for new, efficient indexing and search strategies. In this context, concept detection aims at machine indexing by automatically linking video scenes with the semantic concepts appearing in them. Existing concept detection systems rely on manual annotation for concept learning and are thus limited by the effort of acquiring training data. To overcome this problem, this thesis describes a concept learning approach that requires significantly less manual supervision than standard methods. To achieve this, it employs user-tagged web video, as offered by portals like YouTube. Four contributions are made that greatly enhance our ability to use this data source for training, regarding its content, label noise, context, and motion information.
To make use of web video content, this thesis presents a concept detection system that employs clips downloaded from YouTube as training data, with class labels derived automatically from user-generated tags and descriptions. It is demonstrated on standard datasets from the TRECVID benchmark that the resulting detectors generalize to novel domains comparably to detectors trained on manually acquired ground truth. At the same time, the approach offers a much more scalable and flexible way of concept learning.
To address label noise (i.e., the problem that user-generated tags are coarse, subjective, and context-dependent), this thesis proposes to adapt the statistical models underlying concept detection. Web tags are viewed as unreliable indicators of the true label information, which is modeled as a latent random variable and inferred during concept detector training.
This novel approach (called relevance filtering) is shown to improve concept learning from web video significantly compared to standard supervised methods, for both a generative and a discriminative base model.
To make use of context, user-generated category labels are employed, another valuable feature of web video. It is demonstrated that this information can be used by combining concept detection with style modeling: a distinct model is learned per category (or style) and used for more accurate concept detection. Test images are mapped to a style using their context (for example, other pictures taken at the same event). This approach is demonstrated to improve performance by up to 100% on Flickr photos (n = 32,000). On the well-known COREL-5K image annotation benchmark, the proposed method gives a mean recall/precision of 39%/25%, which is the best result reported to date.
Finally, to make use of motion information, this thesis proposes improving the learning and recognition of objects using motion-based segmentation. Two novel motion segmentation approaches are presented, one based on a globally optimal branch-and-bound search of the parameter space, the other on a combination of motion and color information. These approaches are integrated with a patch-based recognition method, achieving improved robustness to clutter. Compared to a baseline operating on unsegmented images, the recognition error drops from 8.1% to 4.4% (n = 1,584), and the mean average precision (MAP) of concept detection rises from 31% to 41% (n = 4,160).
Altogether, these contributions suggest that web video can form the basis for a novel way of concept learning beyond the manual acquisition of small training sets that constitutes the state of the art. With the technology described in this thesis, we can now build concept detection systems that can learn thousands of concepts and offer better support for video search.
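The relevance filtering idea summarized in this abstract (web tags as unreliable indicators of a latent true label that is inferred during training) can be illustrated with a toy sketch. The code below is not the thesis's implementation: it assumes one-dimensional features, Gaussian class densities, and an expectation-maximization loop, and all function names (`gaussian_pdf`, `fit_gaussian`, `relevance_filter`) are hypothetical. It only shows how a per-sample relevance posterior can down-weight tagged clips that look like irrelevant content.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_gaussian(xs, weights=None):
    """Weighted maximum-likelihood fit of a 1-D Gaussian."""
    if weights is None:
        weights = [1.0] * len(xs)
    total = max(sum(weights), 1e-12)  # guard against all-zero weights
    mean = sum(w * x for w, x in zip(weights, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, xs)) / total
    return mean, max(math.sqrt(var), 1e-3)  # floor the std to avoid collapse


def relevance_filter(tagged, background, prior=0.5, iterations=50):
    """Toy EM over a latent per-sample relevance flag (illustrative only).

    tagged:     features of clips carrying the concept tag (noisy positives)
    background: features of clips without the tag (assumed irrelevant)
    Returns the concept model (mean, std), the estimated relevant fraction,
    and the posterior relevance of every tagged sample.
    """
    bg_mean, bg_std = fit_gaussian(background)  # fixed model of irrelevant content
    mean, std = fit_gaussian(tagged)            # start from all tagged samples
    posteriors = [prior] * len(tagged)
    for _ in range(iterations):
        # E-step: posterior probability that each tagged sample is relevant.
        posteriors = []
        for x in tagged:
            p_rel = prior * gaussian_pdf(x, mean, std)
            p_irr = (1.0 - prior) * gaussian_pdf(x, bg_mean, bg_std)
            posteriors.append(p_rel / (p_rel + p_irr + 1e-300))
        # M-step: refit the concept model with relevance-weighted samples.
        mean, std = fit_gaussian(tagged, posteriors)
        prior = min(max(sum(posteriors) / len(posteriors), 1e-3), 1.0 - 1e-3)
    return (mean, std), prior, posteriors


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data: half of the tagged clips actually show the concept.
    relevant = [random.gauss(5.0, 1.0) for _ in range(200)]
    mislabeled = [random.gauss(0.0, 1.0) for _ in range(200)]
    background = [random.gauss(0.0, 1.0) for _ in range(500)]
    model, fraction, post = relevance_filter(relevant + mislabeled, background)
    print("concept model (mean, std):", model)
    print("estimated relevant fraction:", fraction)
```

In this sketch the irrelevant-content model is estimated once from untagged material and kept fixed, so only the concept model and the relevant fraction are re-estimated; this is a simplifying assumption chosen to keep the example short.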