


Data Science and NLP for transparency

According to the World Bank and the UN, some US$1tn is paid in bribes every year. Corrupt financial transactions divert funds from legitimate public services, distort free markets (potentially thwarting economic development) and reduce trust in institutions. The Organized Crime and Corruption Reporting Project (OCCRP) is a global platform for investigative reporting. It provides resources to journalists and media centres, enables cost-effective collaboration between editors and offers tools that help them protect themselves against threats to independent media. Exposing previously unknown connections between entities makes it possible for citizens, policymakers, activists and law enforcement agencies to act. As the number of leaks and publications grows, there is an increasing need for effective, scalable and reproducible methods to discover anomalies and evidence of malfeasance within them.

There are three components of research and development we plan to undertake as part of this project: 1) natural language processing, 2) networks and graphs, and 3) human-computer interaction (HCI).

1) At the moment, OCCRP data dumps are processed only for named entities, using off-the-shelf tools such as spaCy. These tools detect standard entity types such as persons, organisations, locations, dates, URLs and numbers, but they are trained on standard academic datasets and hence may perform very poorly on the types of documents contained in these dumps. In addition, we will investigate, together with platform users, the potential of other information extraction (IE) tasks, such as document categorisation and relation extraction.

2) Network embeddings provide low-dimensional latent representations of networks that might otherwise comprise billions of links and nodes. This facilitates clustering and classification of entities in the network, and permits visualisation of the data for better communication. An initial exploratory comparison of different network embeddings for the OCCRP data dumps will therefore show how this approach can open up more efficient ways for stakeholders to interact with the large datasets directly, and how low-dimensional representations can serve as a preprocessing step for further algorithms.

3) It is imperative that the work done as part of this project not only achieves scientific and research goals, but also results in a useful and usable solution for our end users (journalists and advocates), one that helps them unveil and expose corruption to broader society. As such, we intend to pay special attention to the HCI component throughout the project, from the scoping phase through to delivery. This will involve observing how users already interact with the OCCRP Aleph platform, and assessing whether that interaction changes or speeds up (with fewer redundant queries) when the interface is augmented with network statistics.
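
To make the idea of "network statistics" concrete, here is a minimal sketch of two simple statistics that could be surfaced alongside search results: degree centrality and connected components. It is an illustration under stated assumptions, not the project's method. The entity names, the `EDGES` list and all function names are hypothetical, and the project does not specify which statistics the interface would show.

```python
from collections import defaultdict, deque

# Hypothetical (source, target) links between entities; not real OCCRP data.
EDGES = [
    ("Person A", "Company X"),
    ("Person B", "Company X"),
    ("Company X", "Company Y"),
    ("Person C", "Company Z"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (source, target) pairs."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


def connected_components(graph):
    """Group nodes into connected components using breadth-first search."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


graph = build_graph(EDGES)
centrality = degree_centrality(graph)
components = connected_components(graph)

# The entity with the most direct links is a natural starting point for a query.
most_connected = max(centrality, key=centrality.get)
print(most_connected, round(centrality[most_connected], 2))
print(len(components))
```

In this toy network the hypothetical "Company X" has the most direct links, and the entities fall into two separate groups. Shown next to a search result, such figures might help a user decide which entity to explore next without issuing further queries.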