Saving Data Journalism
Katherine Boss, New York University Libraries
Meredith Broussard, NYU Arthur L. Carter Journalism Institute
Fernando Chirigati, NYU Tandon School of Engineering
Rémi Rampin, NYU Tandon School of Engineering
Vicky Steeves, NYU Libraries & Center for Data Science
National Institute for Computer-Assisted Reporting (NICAR), Newport Beach, 2019
How do News Apps differ from other born-digital news content?
Source: Times analysis of LAFD data. Credits: Ben Welsh, Robert Lopez, Kate Linthicum. Map data: (c) OpenStreetMap contributors, CC-BY-SA. http://graphics.latimes.com/how-fast-is-lafd/#14/34.0521/-118.2756
The Problem:
Interactive news stories can’t be archived
Why?
• Web archiving crawlers were built to crawl the static web
• Current web archiving technology cannot capture dynamic elements of a site, or back-end files (like data!)
• Older interactives depend on older browsers and plugins to display (for example, Flash-based interactives)
• Newsrooms are volatile organizations: they get bought and sold, the servers are shut down, labor is re-organized
Early news app from 2009 — “Fatal Flights”
The Washington Post. 2009. “Fatal Flights: Fatal Medical Helicopter Crashes.” From: http://www.washingtonpost.com/wp-srv/special/nation/medical-helicopters/fatal-crashes.html
ProPublica “How much is a limb worth?”
Groeger, L., Grabell, M., & Cotts, C. 2015. Workers’ Comp Benefits: How Much is a Limb Worth? From http://projects.propublica.org/graphics/workers-compensation-benefits-by-limb
Internet Archive snapshot of Workers’ Comp, Mar. 8, 2016
Okay, but as long as AWS is around in 20 years, won’t my stuff be okay?
But all of my work is containerized. Isn’t Docker a reliable way of saving stuff?
The Solution
How? An emulation-based web archiving tool
NYU Libraries and the NYU Center for Data Science were awarded an IMLS grant to build an extension to ReproZip that captures remote front-end files (i.e., files not located on the server where the app is running), so that the *complete* news app can be captured!
How does ReproZip work?
open, unpack, and reproduce anywhere, anytime!
necessary data files, libraries, environment variables, etc. required to reproduce your data journalism
ReproZip: Reproducibility in 2 Steps
Packing (Linux): data files, libraries, environment variables, etc. required to reproduce the research go into a ReproZip bundle
Unpacking (Linux, macOS, Windows): open, unpack, and reproduce anywhere, anytime!
Step 1: Tracing & packing
Journalists run their data analysis, app, etc. under reprozip in computational environment E (Linux, in this case a web server).
reprozip stages: executing, tracing, creating the configuration file, configuring, and packing into a preservation bundle (.rpz file)
Detailed provenance captured:
• Data: input files, output files, parameters…
• Workflow: executable programs and steps
• Environment: environment variables, dependencies, …
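The tracing and packing step can be sketched as a small Python helper that assembles the two command lines involved, `reprozip trace` followed by `reprozip pack`. This is only an illustration: the app command and the bundle name `news-app.rpz` are hypothetical placeholders, and the helper builds the commands without running them.

```python
import shlex


def pack_commands(app_command, bundle="news-app.rpz"):
    """Return the two reprozip invocations for step 1 as argv lists.

    app_command and bundle are illustrative placeholders, not values
    taken from the presentation.
    """
    # 1. Run the news app under the tracer so that the files, libraries,
    #    and environment variables it touches are recorded.
    trace = ["reprozip", "trace"] + shlex.split(app_command)
    # 2. Pack everything the trace found into a single .rpz bundle.
    pack = ["reprozip", "pack", bundle]
    return [trace, pack]


if __name__ == "__main__":
    # Print the commands a journalist would run on the Linux web server.
    for argv in pack_commands("python manage.py runserver 0.0.0.0:8000"):
        print(shlex.join(argv))
```

Both commands have to run in the original Linux environment, because tracing observes the app while it executes.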
Step 2: Unpacking & replaying
Readers open the preservation bundle (.rpz file) with reprounzip in computational environment E’ (something potentially different from the original).
Unpackers:
• directory (Linux)
• chroot (Linux)
• vagrant (Linux, macOS, Windows)
• docker (Linux, macOS, Windows)
• Provenance graph (any OS)
• VisTrails (any OS)
• Singularity (upcoming)
• ReproServer (in-browser)
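As a rough sketch of the unpacking step, the Python snippet below picks the reprounzip unpackers that fit the reader's operating system and assembles the `setup` and `run` commands for one of them. The bundle and target directory names are hypothetical, only the four command-line unpackers that replay an app are listed, and the commands are built but not executed.

```python
import platform

# Operating systems (as reported by platform.system()) on which each
# unpacker can be used. directory and chroot run on the host directly,
# so they need Linux; vagrant and docker wrap the bundle in a VM or
# container and also work on macOS ("Darwin") and Windows.
UNPACKERS = {
    "directory": {"Linux"},
    "chroot": {"Linux"},
    "vagrant": {"Linux", "Darwin", "Windows"},
    "docker": {"Linux", "Darwin", "Windows"},
}


def available_unpackers(system=None):
    """List the unpackers usable on the given (or current) OS."""
    system = system or platform.system()
    return [name for name, systems in UNPACKERS.items() if system in systems]


def unpack_commands(unpacker, bundle, target):
    """Return the reprounzip setup and run invocations as argv lists."""
    setup = ["reprounzip", unpacker, "setup", bundle, target]
    run = ["reprounzip", unpacker, "run", target]
    return [setup, run]


if __name__ == "__main__":
    print(available_unpackers())
    # "news-app.rpz" and "news-app-replay" are placeholder names.
    for argv in unpack_commands("docker", "news-app.rpz", "news-app-replay"):
        print(" ".join(argv))
```

A reader on macOS or Windows would normally reach for docker or vagrant, since those unpackers supply the Linux environment the bundle was traced in.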
Our prototype - extending ReproZip for news apps!
Pack & record
• Run the data analysis, app, etc. under reprozip (executing, tracing) to produce a ReproZip bundle (.rpz file)
• Record the app and add the WARC to get a preservation bundle (.rpz file w/ WARC)
Unpack & replay
• Unpack the preservation bundle (.rpz file w/ WARC) with reprounzip dj
• Playback: view the news app locally, even without wifi!
Demo: Interactive Election Polls 2015 from the Guardian
Source found at: https://github.com/guardian/interactive-election-polls-2015
Current Limitations of the Prototype
Next steps, funding to:
Thank you! Questions?
Get this presentation: https://goo.gl/JKZKK5
Try out our prototype: https://github.com/reprozip-news-apps/reprozip-news-apps
Documentation: https://reprozip-news-app-archiving-tool.readthedocs.io/
Any feedback? Open an issue on GitHub or email us: rz-dj@nyu.edu