1 of 23

Saving Data Journalism

Katherine Boss, New York University Libraries

Meredith Broussard, NYU Arthur L. Carter Journalism Institute

Fernando Chirigati, NYU Tandon School of Engineering

Rémi Rampin, NYU Tandon School of Engineering

Vicky Steeves, NYU Libraries & Center for Data Science

National Institute for Computer-Assisted Reporting (NICAR), Newport Beach, 2019

2 of 23

How do News Apps differ from other born-digital news content?

  • Interactive
  • Exploratory
  • Some rely on a database
  • Some use external APIs
  • Custom-built software

Source: Times analysis of LAFD data. Credits: Ben Welsh, Robert Lopez, Kate Linthicum. Map data: (c) OpenStreetMap contributors, CC-BY-SA. http://graphics.latimes.com/how-fast-is-lafd/#14/34.0521/-118.2756

3 of 23

The Problem:

Interactive news stories can’t be archived

4 of 23

Why?

• Web archiving crawlers were built to crawl the static web

• Current web archiving technology cannot capture dynamic elements of a site, or back-end files (like data!)

• Older interactives (such as Flash-based ones) depend on older browsers to display

• Newsrooms are volatile organizations: they get bought and sold, the servers are shut down, labor is re-organized
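
To make the first two points concrete, here is a minimal sketch of how a static crawler discovers content: it reads URLs out of markup attributes and never runs the page's JavaScript. The page, the paths, and the `/api/incidents` endpoint are hypothetical, invented only for this illustration; real crawlers are more sophisticated, but the blind spot is the same in kind.

```python
from html.parser import HTMLParser


class StaticLinkExtractor(HTMLParser):
    """Collects the URLs that are visible in href/src attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)


# Hypothetical news app page: the data is requested by JavaScript at runtime.
PAGE = """
<html><body>
<a href="/about.html">About</a>
<script src="/static/app.js"></script>
<script>
  fetch("/api/incidents?bbox=" + map.getBounds());
</script>
</body></html>
"""


def discover(html):
    """Return the URLs a markup-only crawler would queue for capture."""
    parser = StaticLinkExtractor()
    parser.feed(html)
    return parser.urls


if __name__ == "__main__":
    print(discover(PAGE))
```

The API request only exists once the script runs in a browser, so it never reaches the crawl queue, and the database behind it is not reachable from the markup at all.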

5 of 23

Early news app from 2009 — “Fatal Flights”

The Washington Post. 2009. “Fatal Flights: Fatal Medical Helicopter Crashes.” From: http://www.washingtonpost.com/wp-srv/special/nation/medical-helicopters/fatal-crashes.html

6 of 23

ProPublica “How much is a limb worth?”

Groeger, L., Grabell, M., & Cotts, C. 2015. Workers’ Comp Benefits: How Much is a Limb Worth? From http://projects.propublica.org/graphics/workers-compensation-benefits-by-limb

7 of 23

Internet Archive snapshot of Workers’ Comp, Mar. 8, 2016

8 of 23

Okay, but as long as AWS is around in 20 years, won’t my stuff be okay?

9 of 23

10 of 23

But all of my work is containerized. Isn’t Docker a reliable way of saving stuff?

11 of 23

12 of 23

The Solution

13 of 23

14 of 23

How? An emulation-based web archiving tool

NYU Libraries and the NYU Center for Data Science were awarded an IMLS grant to build an extension to ReproZip that captures remote front-end files (i.e. files not located on the server where the app is running), so that the *complete* news app can be captured!

15 of 23

How does ReproZip work?

open, unpack, and reproduce anywhere, anytime!

necessary data files, libraries, environment variables, etc. required to reproduce your data journalism

16 of 23

ReproZip: Reproducibility in 2 Steps

Packing (Linux): data files, libraries, environment variables, etc. required to reproduce the research

ReproZip Bundle

Unpacking (Linux, macOS, Windows): open, unpack, and reproduce anywhere, anytime!
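
As a rough sketch of the two steps, the helper below only builds the ReproZip command lines and prints them; it does not run anything. The bundle name `news-app.rpz`, the target directory `news-app`, and the `python manage.py runserver` command are placeholders chosen for this illustration, not part of the project.

```python
import shlex


def pack_commands(app_command, bundle="news-app.rpz"):
    """Step 1, on the original Linux machine: trace one run, then pack it."""
    return [
        ["reprozip", "trace", *app_command],
        ["reprozip", "pack", bundle],
    ]


def unpack_commands(bundle="news-app.rpz", target="news-app", unpacker="docker"):
    """Step 2, on any supported machine: set up the bundle, then run it."""
    return [
        ["reprounzip", unpacker, "setup", bundle, target],
        ["reprounzip", unpacker, "run", target],
    ]


if __name__ == "__main__":
    # Placeholder command standing in for however the news app is started.
    steps = pack_commands(["python", "manage.py", "runserver"]) + unpack_commands()
    for cmd in steps:
        print(shlex.join(cmd))
```

Keeping the commands as lists means they could be passed to `subprocess.run` unchanged once the placeholders are replaced with real values.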

17 of 23

Step 1: Tracing & packing

Computational Environment E (Linux, in this case a web server)

Journalists → Data Analysis, App, etc. → reprozip (Executing, Tracing) → Creating Configuration → Configuration File → Configuring → Packing → Preservation Bundle (.rpz file)

Detailed Provenance

  • Data: input files, output files, parameters…
  • Workflow: executable programs and steps
  • Environment: environment variables, dependencies, …

18 of 23

Step 2: Unpacking & replaying

Computational Environment E’ (something potentially different from original)

Readers → reprounzip (Unpacking) ← Preservation Bundle (.rpz file)

  • directory (Linux)
  • chroot (Linux)
  • vagrant (Linux, macOS, Windows)
  • docker (Linux, macOS, Windows)
  • Provenance Graph (any OS)
  • VisTrails (any OS)
  • Singularity (upcoming)
  • ReproServer (in-browser)
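
One way to read this slide is as a lookup from the reader's operating system to the unpackers that can run there. The sketch below assumes the pairing directory and chroot on Linux only, vagrant and docker on Linux, macOS, and Windows; that pairing is an interpretation of the diagram, and the function name is made up for this example.

```python
import platform

# Unpacker -> operating systems it runs on, as read from the slide.
# platform.system() reports macOS as "Darwin".
UNPACKERS = {
    "directory": {"Linux"},
    "chroot": {"Linux"},
    "vagrant": {"Linux", "Darwin", "Windows"},
    "docker": {"Linux", "Darwin", "Windows"},
}


def usable_unpackers(system=None):
    """List the unpackers available on the given OS (default: this machine)."""
    system = system or platform.system()
    return [name for name, systems in UNPACKERS.items() if system in systems]


if __name__ == "__main__":
    print(usable_unpackers())
```

A reader on macOS or Windows would therefore go through a virtual machine or container, while a reader on Linux can also unpack directly.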

19 of 23

Our prototype - extending ReproZip for news apps!

Pack & record: Data Analysis, App, etc. → reprozip (Executing, Tracing) → ReproZip Bundle (.rpz file) → Record, Add WARC → Preservation Bundle (.rpz file w/ WARC)

Unpack & replay: Preservation Bundle (.rpz file w/ WARC) → Unpack (reprounzip dj) → playback → View news app locally, even without wifi!
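
The record-then-replay idea can be pictured with a toy in-memory store: responses for remote front-end files are saved once while the network is available and later served from the store. This is an analogy only, not the prototype's implementation and not the WARC format; the class, the fake fetcher, and the example URL are all invented for the illustration.

```python
class ReplayArchive:
    """Toy stand-in for a web archive: maps each URL to its recorded body."""

    def __init__(self):
        self._records = {}

    def record(self, url, fetch):
        """Fetch a remote resource once and keep a copy of the response."""
        body = fetch(url)
        self._records[url] = body
        return body

    def playback(self, url):
        """Serve the stored copy; no network access is involved."""
        try:
            return self._records[url]
        except KeyError:
            raise LookupError(f"{url} was never recorded") from None


def fake_fetch(url):
    """Hypothetical fetcher so that the sketch runs without a network."""
    return f"<response for {url}>".encode()


def demo():
    archive = ReplayArchive()
    url = "https://cdn.example.org/map-tiles.js"  # made-up remote asset
    archive.record(url, fake_fetch)
    return archive.playback(url)


if __name__ == "__main__":
    print(demo())
```

Anything that was not requested during recording raises an error at playback, which is why the recording step has to exercise the app thoroughly.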

20 of 23

Demo: Interactive Election Polls 2015 from

the Guardian

21 of 23

Current Limitations of the Prototype

  • ReproZip only captures what was executed. Some news apps need parts of the environment (e.g. Ruby gems) that may or may not be executed during tracing, and these must be present for all the links in the front-end to be captured correctly
    • Implemented an extra rule to detect and automatically capture Ruby gems; will implement the same for other languages in the next phase of dev work
  • Access to external APIs and data (e.g. on S3 stores) when re-running the news app from the .rpz file
  • Tested only on a few similar types of interactives (e.g. built on Rails, Node.js, and Django), chosen based on an environmental scan of news apps
    • More testing coming in the next phase of work and/or by anyone who wants to test out the prototype and give us feedback!
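
The first limitation can be illustrated with a toy tracer. ReproZip traces the running process at the operating-system level; the sketch below instead uses Python's audit hooks (Python 3.8+) as a stand-in, and the file names `core.txt` and `optional.txt` are invented. The point is only that a file the traced run never opens never shows up in the captured set.

```python
import os
import sys
import tempfile


class OpenTracer:
    """Toy tracer: records paths passed to open() while tracing is active."""

    def __init__(self):
        self.active = False
        self.seen = set()
        # Audit hooks cannot be removed, so recording is gated on a flag.
        sys.addaudithook(self._hook)

    def _hook(self, event, args):
        if self.active and event == "open" and isinstance(args[0], str):
            self.seen.add(os.path.abspath(args[0]))

    def trace(self, func, *args):
        self.active = True
        try:
            return func(*args)
        finally:
            self.active = False


def run_app(root, use_optional):
    """Hypothetical app: the optional file is read only on some code paths."""
    with open(os.path.join(root, "core.txt")) as f:
        f.read()
    if use_optional:
        with open(os.path.join(root, "optional.txt")) as f:
            f.read()


def demo():
    root = tempfile.mkdtemp()
    for name in ("core.txt", "optional.txt"):
        with open(os.path.join(root, name), "w") as f:
            f.write(name)
    tracer = OpenTracer()
    tracer.trace(run_app, root, False)  # this run never touches optional.txt
    return sorted(os.path.basename(p) for p in tracer.seen if p.startswith(root))


if __name__ == "__main__":
    print(demo())
```

A bundle built from that trace would be missing `optional.txt`, which is the same gap the extra Ruby gem rule is meant to close.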

22 of 23

Next steps, funding to:

  • Continue generalizing the tool for a wider range of interactive sites
  • Build in functionality to point to archived, emulated web browsers
  • Create a GUI that works for newsrooms

23 of 23

Thank you! Questions?

Get this presentation: https://goo.gl/JKZKK5

Try out our prototype: https://github.com/reprozip-news-apps/reprozip-news-apps

Documentation: https://reprozip-news-app-archiving-tool.readthedocs.io/

Any feedback? Either leave us an issue on GitHub or email us: rz-dj@nyu.edu