Sixteen solutions for
data-driven projects
Christopher Groskopf
@onyxfish
Slides: http://bit.ly/nicar15-process-slides
Example Code: http://bit.ly/nicar15-process
Notes: http://bit.ly/nicar15-process-notes
Solutions for what?
We’re so bad at everything!
Use BASH.
2. Make your scripts idempotent
Cleanup after yourself.
3. Never modify the originals
Treat files as immutable objects.
4. Name things using a least-to-most specific pattern
2015-03-06-doj-traffic-stops.csv
5. Test to validate the original data
Is it what your source said it is?
6. Test to audit your transformations
Is your file format conversion lossy?
7. Test when you assert novel facts
Cover your ass.
8. Write detailed logs
It doesn’t hurt to write to several different files.
9. Script setup and querying of your DB
Then export back to a neutral format.
10. Follow the crowd
Use CSV for tables.
Use JSON for hierarchies.
Use csvkit, ogr2ogr, jq, etc for processing
11. Define and follow coding conventions
NPR Visuals best practices: http://bit.ly/nprviz-best-practices
12. Store documentation with your code
In version control.
13. Document the provenance of your data
How was this data created?
14. Have a setup script
New users should not need to know how your dependencies work.
15. Be a ticketing zealot
If you find a bug, don’t fix it.
16. Use comments to explain the why
But your code should never be confusing.
(17?) Be smart about breaking the rules