ICIJ’s leak refinery
Mar Cabra - Data editor
International Consortium of Investigative Journalists (ICIJ)
190+ journalists in more than 65 countries
12 staff members (USA, Costa Rica, Venezuela, Germany, France, Spain)
50% of the team = Data & Research Unit
Project
Project
Reporting
Publication
Editor (Spain)
Deputy director (USA)
Data analyst (Costa Rica)
Web developer (Germany)
Data journalist (France)
Data checkers
Research editor (Venezuela)
ICIJ Data & Research Unit
Developer (Spain)
260 GB - 100,000 companies
+500 secret tax agreements (PDFs)
More than 100,000 HSBC clients
$100 Billion
raw files
metadata
author; sender...
database
search and discovery
raw text
file, attachment or embedded object
detect the type
do we know how to extract the text?
no!
log and tackle later
yes!
extract, OCR and repeat
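The decision flow above (detect the type, extract if we know how, otherwise log and tackle later, then repeat for attachments and embedded objects) can be sketched roughly as follows. This is a minimal illustration, not ICIJ's Extract tool: the `EXTRACTORS` table, `extract_plain_text` and `process` are hypothetical names, type detection uses only the file extension via Python's standard `mimetypes` module, and a real pipeline would plug in PDF parsers and an OCR engine.

```python
import logging
import mimetypes
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("extract")

# An extractor returns the text plus any embedded (name, bytes) children.
Extracted = Tuple[str, List[Tuple[str, bytes]]]


def extract_plain_text(data: bytes) -> Extracted:
    """Hypothetical extractor for plain text: decode, no embedded children."""
    return data.decode("utf-8", errors="replace"), []


# Hypothetical registry: MIME type -> extractor. Only one entry here;
# PDF, email or image (OCR) extractors would be registered the same way.
EXTRACTORS: Dict[str, Callable[[bytes], Extracted]] = {
    "text/plain": extract_plain_text,
}


def process(filename: str, data: bytes,
            texts: Dict[str, str], unsupported: List[str]) -> None:
    """Detect the type; extract if possible, otherwise log for later."""
    mime_type, _ = mimetypes.guess_type(filename)
    extractor = EXTRACTORS.get(mime_type or "")
    if extractor is None:
        # "no!" branch: log and tackle later
        logger.warning("no extractor for %s (%s)", filename, mime_type)
        unsupported.append(filename)
        return
    # "yes!" branch: extract, then repeat for attachments / embedded objects
    text, children = extractor(data)
    texts[filename] = text
    for child_name, child_data in children:
        process(child_name, child_data, texts, unsupported)
```

Keeping the unsupported files in a list, rather than failing on them, mirrors the "log and tackle later" step: the run continues and the gaps stay visible.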
3 million files × 10 seconds per file ≈ 1 year
Redis queue
35 x g2.xlarge Amazon instances with Ubuntu + Tesseract + Extract
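The idea behind the queue is that many workers pull file paths from one shared list until it is empty. The sketch below shows that pattern only: it swaps in Python's standard `queue.Queue` and threads for the Redis queue and the 35 machines, and `run_workers` and `handle` are hypothetical names, so it says nothing about how the actual setup was configured.

```python
import queue
import threading
from typing import Callable, List


def run_workers(paths: List[str], handle: Callable[[str], str],
                n_workers: int) -> List[str]:
    """Process every path once, using n_workers workers on a shared queue."""
    tasks: "queue.Queue[str]" = queue.Queue()
    for path in paths:
        tasks.put(path)

    results: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                path = tasks.get_nowait()  # take the next pending file
            except queue.Empty:
                return  # nothing left: this worker is done
            out = handle(path)
            with lock:  # protect the shared results list
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each worker only ever asks the queue for "the next file", adding machines needs no coordination beyond the queue itself, which is what makes the division by 35 possible.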
1 year ÷ 35 machines ≈ 11 days
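The back-of-the-envelope arithmetic can be checked in a few lines. The inputs are the figures from the slides; the variable names are illustrative, and the calculation assumes perfectly even splitting with no overhead.

```python
FILES = 3_000_000
SECONDS_PER_FILE = 10
MACHINES = 35
SECONDS_PER_DAY = 24 * 60 * 60

# One machine working through every file in sequence
serial_days = FILES * SECONDS_PER_FILE / SECONDS_PER_DAY

# The same work split evenly across all machines
parallel_days = serial_days / MACHINES

print(f"serial:   {serial_days:.1f} days")    # serial:   347.2 days
print(f"parallel: {parallel_days:.1f} days")  # parallel: 9.9 days
```

Rounding the serial time up to a full year gives 365 ÷ 35 ≈ 10.4 days, so the slide's figure of about 11 days is the right order of magnitude once some overhead is allowed for.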
Lucene syntax queries with proximity matching!
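In Lucene's classic query syntax, a proximity search is a quoted phrase followed by `~` and a maximum word distance, for example `"offshore trust"~5`. A small helper for building such query strings might look like this; `proximity_query` is a hypothetical name, and the search terms are made up for illustration.

```python
from typing import Sequence


def proximity_query(terms: Sequence[str], distance: int) -> str:
    """Build a Lucene proximity query: the terms within `distance` words."""
    if not terms:
        raise ValueError("at least one term is required")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    phrase = " ".join(terms)
    # Escape backslashes and double quotes so the phrase stays well-formed
    phrase = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{phrase}"~{distance}'


print(proximity_query(["offshore", "trust"], 5))  # "offshore trust"~5
```

Proximity matching is useful on OCR'd or loosely formatted documents, where two names may be separated by titles, line breaks or stray words rather than sitting side by side.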
400 users
Unstructured data extraction
Structured data extraction
Database
App
Stack
Our platforms by the numbers
Open-source** + exclusive expertise
*not unique users
*except graph database
Not just a network…
… but a community
Our community by the numbers
*not unique users
**October 2015
the next steps
Next: ICIJ Knowledge Center
Public databases
Our current challenge (global data sharing)
Offshore Leaks
Bank accounts
Sanctions Data
Indian politicians
Argentinian fraudsters
UK contractors
One-way data check
Encrypted email (easy)
Encrypted email: PGP
Questions?
Thanks!
mcabra@icij.org