Which Languages do People Speak on Flickr? A Language and Geo-Location Study of the YFCC100m Dataset

Alireza Koochali; Sebastian Kalkowski; Andreas Dengel; Damian Borth; Christian Schulze
In: Proceedings of the 2016 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions. ACM Multimedia Community-Organized Multimodal Mining: Opportunities for Novel Solutions Workshop (MMCommons-16), Datasets, Evaluation, and Reproducibility, located at ACM MM 16, October 16, Amsterdam, Netherlands, ACM, 2016.


Recently, the Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset was introduced to the computer vision and multimedia research community. This dataset consists of millions of images and videos spread over the globe. This geo-distribution hints at a potentially large set of different languages being used in the titles, descriptions, and tags of these images and videos. Since the YFCC100m metadata does not provide any information about the languages used in the dataset, this paper presents the first analysis of this kind. The language and geo-location characteristics of the YFCC100m dataset are described by providing (a) an overview of the languages used, (b) language-to-country associations, and (c) second-language usage in this dataset. Knowing the language used in titles, descriptions, and tags, users of the dataset can make language-specific decisions when selecting subsets of images, e.g., for proper training of classifiers or for analyzing user behavior specific to the users' spoken language. This language information is also essential for further linguistic studies on the metadata of the YFCC100m dataset.



