1 of 25

ICIJ’s leak refinery

Mar Cabra - Data editor

International Consortium of Investigative Journalists (ICIJ)

2 of 25

+190 journalists in more than 65 countries

12 staff members (USA, Costa Rica, Venezuela, Germany, France, Spain)

50% of the team = Data & Research Unit

3 of 25

Project

Project

Reporting

Publication

Editor (Spain)

Deputy director (USA)

Data analyst (Costa Rica)

Web developer (Germany)

Data journalist (France)

Data checkers

Research editor (Venezuela)

ICIJ Data & Research unit

Developer (Spain)

4 of 25

260 GB - 100,000 companies

+500 secret tax agreements (PDFs)

More than 100,000 HSBC clients

$100 Billion

5 of 25

6 of 25

raw files

metadata

author; sender...

database

search and discovery

raw text

7 of 25

file, attachment or embedded object

detect the type

do we know how to extract the text?

no!

log and tackle later

yes!

extract, OCR and repeat
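
The decision flow on this slide can be sketched roughly as follows. This is only an illustration, not ICIJ Extract's actual code: the `Item` container, the suffix-based `detect_type`, and the `EXTRACTORS` registry are all hypothetical stand-ins, and a real pipeline would hand detection and extraction to tools such as Apache Tika and Tesseract.

```python
from dataclasses import dataclass, field
from pathlib import PurePath
from typing import List


@dataclass
class Item:
    """A file, attachment or embedded object (hypothetical container)."""
    name: str
    data: bytes
    children: List["Item"] = field(default_factory=list)


def extract_plain(data: bytes) -> str:
    """Toy extractor: decode bytes as UTF-8 text."""
    return data.decode("utf-8", errors="replace")


# Hypothetical registry of the types we "know how to extract".
EXTRACTORS = {".txt": extract_plain, ".csv": extract_plain}


def detect_type(item: Item) -> str:
    """Assumption: the type is guessed from the file suffix only."""
    return PurePath(item.name).suffix.lower()


def process(item: Item, texts: list, deferred: list) -> None:
    kind = detect_type(item)
    extractor = EXTRACTORS.get(kind)
    if extractor is None:
        # "no!": log the item and tackle it later
        deferred.append((item.name, kind))
    else:
        # "yes!": extract the text
        texts.append((item.name, extractor(item.data)))
    # "repeat": attachments and embedded objects go through the same flow
    for child in item.children:
        process(child, texts, deferred)
```

Calling `process` on a top-level item fills `texts` with what could be extracted and `deferred` with the unknown types to revisit.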

8 of 25

3 million files × 10 seconds per file = 1 year

9 of 25

Redis queue

35 x g2.xlarge Amazon instances with Ubuntu + Tesseract + Extract
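
The idea of a shared queue feeding many workers can be sketched as below. This is a minimal in-process stand-in, not the setup on the slide: it swaps Redis for Python's standard `queue.Queue` and the Amazon instances for threads, and `run_workers` and `handle` are hypothetical names. In the deployment described here, the queue would live in Redis so that separate machines can pull from it.

```python
import queue
import threading


def run_workers(paths, handle, n_workers=4):
    """Process every path with `handle`, spread across n_workers threads."""
    tasks = queue.Queue()
    for path in paths:
        tasks.put(path)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                path = tasks.get_nowait()  # take the next pending file
            except queue.Empty:
                return  # queue drained, this worker is done
            output = handle(path)
            with lock:
                results.append((path, output))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each worker only takes the next pending item, adding workers shortens the total run without any change to the per-file logic.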

10 of 25

1 year ÷ 35 machines = 11 days
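
The back-of-the-envelope numbers on these two slides can be checked with a few lines. The figures are the slides' own; the only added assumption is perfect parallelism with no queueing or start-up overhead. Unrounded, the result is about 347 days on one machine and about 10 days on 35, in line with the rounded "1 year" and "11 days" shown.

```python
FILES = 3_000_000
SECONDS_PER_FILE = 10
MACHINES = 35
SECONDS_PER_DAY = 24 * 60 * 60

# One machine working through every file in sequence.
single_machine_days = FILES * SECONDS_PER_FILE / SECONDS_PER_DAY

# Assumption: the work divides evenly across all machines.
parallel_days = single_machine_days / MACHINES

print(f"single machine: {single_machine_days:.0f} days")
print(f"{MACHINES} machines: {parallel_days:.1f} days")
```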

11 of 25

Lucene syntax queries with proximity matching!
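
For readers unfamiliar with proximity matching: in Lucene's classic query syntax, a quoted phrase followed by `~N` matches documents where the terms occur within N positions of each other. The sketch below builds such a query string and encodes it as request parameters; the search terms and the `proximity_query` helper are made up for illustration, while `q` and `rows` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def proximity_query(phrase: str, distance: int) -> str:
    """Build a Lucene proximity query such as "shell company"~5."""
    # Escape backslashes and double quotes so the phrase stays one quoted unit.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"~{distance}'


# Hypothetical search: both words within 5 positions of each other.
query = proximity_query("shell company", 5)
params = urlencode({"q": query, "rows": 10})
```

Here `query` is the string a user would type into the search box, and `params` is the same query ready to append to a search URL.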

400 users

12 of 25

Unstructured data extraction

  • ICIJ Extract (open source, Java: https://github.com/ICIJ/extract), leverages Apache Tika, Tesseract OCR and JBIG2-ImageIO.

Structured data extraction

  • A bunch of Python

Database

  • Apache Solr (open source, Java)
  • Redis (open source, C)
  • Neo4j (open source, Java)

App

  • Blacklight (open source, Rails)
  • Linkurious (closed source, JS)

Stack

13 of 25

14 of 25

Our platforms by the numbers

  • +14 million documents, 20 formats
  • 500 users* - 100 active each week, close to 200 per month
  • 1 full-time programmer for improvements

Open-source** + exclusive expertise

*not unique users

**except graph database

15 of 25

Not just a network…

… but a community

16 of 25

17 of 25

18 of 25

19 of 25

Our community by the numbers

  • +600 users*
  • Shared status updates on +21,000 occasions**
  • 1,500 forum topics with 5,400 posts**
  • Uploaded +600 files**

*not unique users

**October 2015

20 of 25

the next steps

  • entity extraction
  • making our data silos talk

21 of 25

Next: ICIJ Knowledge Center

Public databases

22 of 25

Our current challenge (global data sharing)

Offshore Leaks

Bank accounts

Sanctions Data

Indian politicians

Argentinian fraudsters

UK contractors

One-way data check

23 of 25

Encrypted email (easy)

24 of 25

Encrypted email: PGP

25 of 25

Questions?

Thanks!

mcabra@icij.org