1 of 24

Sixteen solutions for data-driven projects

Christopher Groskopf

@onyxfish

2 of 24

Solutions for what?

3 of 24

We’re so bad at everything!

4 of 24

1. Automate everything you reasonably can

Use Bash.

5 of 24

6 of 24

2. Make your scripts idempotent

Clean up after yourself.
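One way to read "idempotent" here: running the script twice should leave exactly the same result as running it once. A minimal sketch, assuming a hypothetical `build` step that owns its output directory and rebuilds it from scratch on every run:

```python
import shutil
import tempfile
from pathlib import Path


def build(output_dir: Path, rows: list[str]) -> Path:
    """Recreate output_dir from scratch so reruns give the same result."""
    # Remove leftovers from any earlier run before writing anything.
    if output_dir.exists():
        shutil.rmtree(output_dir)
    output_dir.mkdir(parents=True)

    out_file = output_dir / "clean.csv"
    out_file.write_text("\n".join(rows) + "\n", encoding="utf-8")
    return out_file


# Run the same step twice against a throwaway workspace.
workspace = Path(tempfile.mkdtemp())
rows = ["id,name", "1,a"]
first = build(workspace / "output", rows).read_text(encoding="utf-8")
second = build(workspace / "output", rows).read_text(encoding="utf-8")
```

Deleting and recreating the whole directory is the blunt option; it only works if nothing else writes there.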

7 of 24

3. Never modify the originals

Treat files as immutable objects.
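A rough sketch of what this could look like in practice: copy the original into a working directory, mark the original read-only, and only ever edit the copy. The helper name `working_copy` and the `raw`/`work` layout are assumptions for illustration.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def working_copy(original: Path, work_dir: Path) -> Path:
    """Copy the original into work_dir and make the original read-only."""
    work_dir.mkdir(parents=True, exist_ok=True)
    copy = work_dir / original.name
    # copyfile copies contents only, so the copy stays writable.
    shutil.copyfile(original, copy)
    # Guard against accidental edits to the source file.
    original.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return copy


root = Path(tempfile.mkdtemp())
raw = root / "raw" / "stops.csv"
raw.parent.mkdir()
raw.write_text("id,stops\n1,10\n", encoding="utf-8")

# All edits happen on the copy; the raw file is left as received.
copy = working_copy(raw, root / "work")
copy.write_text(
    copy.read_text(encoding="utf-8").replace("10", "11"), encoding="utf-8"
)
```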

8 of 24

4. Name things using a least-to-most specific pattern

2015-03-06-doj-traffic-stops.csv
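The example filename above can be generated rather than typed. A small sketch, assuming a hypothetical helper `data_filename` that puts the ISO date first, then the source, then the subject:

```python
from datetime import date


def data_filename(received: date, source: str, subject: str, ext: str = "csv") -> str:
    """Build a least-to-most specific filename: date, source, subject."""
    slug = "-".join([source.lower(), *subject.lower().split()])
    return f"{received.isoformat()}-{slug}.{ext}"


name = data_filename(date(2015, 3, 6), "DOJ", "traffic stops")
# name == "2015-03-06-doj-traffic-stops.csv"

# ISO dates at the front mean a plain sort is also a chronological sort.
names = sorted([
    data_filename(date(2015, 3, 6), "DOJ", "traffic stops"),
    data_filename(date(2014, 11, 20), "DOJ", "traffic stops"),
])
```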

9 of 24

10 of 24

11 of 24

5. Test to validate the original data

Is it what your source said it is?
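A sketch of a validation test, assuming the source promised certain columns and a certain row count. The column names, the expected count, and the sample rows are all made up for illustration.

```python
import csv
import io


def validate(csv_text: str, expected_columns: list[str], expected_rows: int,
             numeric_column: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks as promised."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    problems = []

    if reader.fieldnames != expected_columns:
        problems.append(f"columns are {reader.fieldnames}, expected {expected_columns}")
    if len(rows) != expected_rows:
        problems.append(f"found {len(rows)} rows, expected {expected_rows}")

    # Line 1 is the header, so data starts on line 2.
    for line_no, row in enumerate(rows, start=2):
        value = row.get(numeric_column) or ""
        if not value.isdigit():
            problems.append(f"line {line_no}: {numeric_column}={value!r} is not a count")
    return problems


COLUMNS = ["date", "agency", "stops"]  # hypothetical schema
good = "date,agency,stops\n2015-03-01,A,12\n2015-03-02,B,30\n"
bad = "date,agency,stops\n2015-03-01,A,12\n2015-03-02,B,n/a\n"

good_problems = validate(good, COLUMNS, expected_rows=2, numeric_column="stops")
bad_problems = validate(bad, COLUMNS, expected_rows=3, numeric_column="stops")
```

Here `bad_problems` holds two entries: the row count is short and one value is not a number.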

12 of 24

6. Test to audit your transformations

Is your file format conversion lossy?
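One possible audit is a round trip: convert, convert back, and compare with what you started with. The sketch below uses invented data and a hypothetical `to_json` converter to show how casting numeric-looking strings silently drops a leading zero from a ZIP code.

```python
import csv
import io
import json

SOURCE = "zip,stops\n02134,7\n60601,12\n"  # made-up sample data
original = list(csv.DictReader(io.StringIO(SOURCE)))


def to_json(records: list[dict], cast_numbers: bool) -> str:
    """Convert records to JSON, optionally casting digit strings to int."""
    out = []
    for record in records:
        out.append({
            key: int(value) if cast_numbers and value.isdigit() else value
            for key, value in record.items()
        })
    return json.dumps(out)


def round_trip_matches(records: list[dict], json_text: str) -> bool:
    """Read the JSON back as strings and compare with the source records."""
    restored = [
        {key: str(value) for key, value in record.items()}
        for record in json.loads(json_text)
    ]
    return restored == records


safe = round_trip_matches(original, to_json(original, cast_numbers=False))
lossy = round_trip_matches(original, to_json(original, cast_numbers=True))
# safe is True; lossy is False because "02134" comes back as "2134".
```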

13 of 24

7. Test when you assert novel facts

Cover your ass.
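A sketch of what testing a claim might look like. Suppose the story says one district accounts for more than half of all stops; the figures and district names below are invented. The assertions fail loudly if a later data update makes the claim untrue.

```python
from collections import defaultdict

# Hypothetical (district, stop count) records.
stops = [("North", 120), ("South", 45), ("North", 80), ("East", 60)]

totals = defaultdict(int)
for district, count in stops:
    totals[district] += count

grand_total = sum(count for _, count in stops)
top_district = max(totals, key=totals.get)
share = totals[top_district] / grand_total

# Sanity check: the grouped totals must add up to the raw total.
assert sum(totals.values()) == grand_total
# The claims the story makes, written down as tests.
assert top_district == "North"
assert share > 0.5
```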

14 of 24

8. Write detailed logs

It doesn’t hurt to write to several different files.
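A minimal sketch of logging to several files with the standard `logging` module: one file gets everything, another only errors. The logger name `pipeline` and the file names are assumptions.

```python
import logging
import tempfile
from pathlib import Path


def make_logger(log_dir: Path) -> logging.Logger:
    """Log everything to debug.log and errors only to errors.log."""
    logger = logging.getLogger("pipeline")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()  # avoid duplicate handlers if called twice

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for filename, level in [("debug.log", logging.DEBUG), ("errors.log", logging.ERROR)]:
        handler = logging.FileHandler(log_dir / filename, encoding="utf-8")
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


log_dir = Path(tempfile.mkdtemp())
log = make_logger(log_dir)
log.info("Loaded %d rows", 1200)
log.error("Row %d has no date", 57)

debug_lines = (log_dir / "debug.log").read_text(encoding="utf-8").splitlines()
error_lines = (log_dir / "errors.log").read_text(encoding="utf-8").splitlines()
```

The detailed file is for reconstructing what happened; the short one is for spotting what went wrong.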

15 of 24

9. Script setup and querying of your DB

Then export back to a neutral format.
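A sketch of the whole loop in one script, using the standard-library `sqlite3` module as a stand-in for whatever database the project uses: create the table, load the CSV, run the query, and write the result back out as CSV. The table name, columns, and data are invented.

```python
import csv
import io
import sqlite3

SOURCE_CSV = "agency,stops\nA,12\nB,30\nA,8\n"  # made-up sample data


def run_pipeline(csv_text: str) -> str:
    """Load CSV into a scratch database, query it, and export CSV again."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE traffic_stops (agency TEXT, stops INTEGER)")

    # Each CSV row is a dict, which matches the named placeholders.
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany("INSERT INTO traffic_stops VALUES (:agency, :stops)", rows)

    cursor = conn.execute(
        "SELECT agency, SUM(stops) AS total "
        "FROM traffic_stops GROUP BY agency ORDER BY agency"
    )

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    conn.close()
    return out.getvalue()


result = run_pipeline(SOURCE_CSV)
```

Because the database is rebuilt on every run, nothing about the analysis lives only inside the database.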

16 of 24

17 of 24

10. Follow the crowd

Use CSV for tables.

Use JSON for hierarchies.

Use csvkit, ogr2ogr, jq, etc. for processing.

18 of 24

11. Define and follow coding conventions

NPR Visuals best practices: http://bit.ly/nprviz-best-practices

19 of 24

12. Store documentation with your code

In version control.

20 of 24

13. Document the provenance of your data

How was this data created?

21 of 24

14. Have a setup script

New users should not need to know how your dependencies work.

22 of 24

15. Be a ticketing zealot

If you find a bug, don’t fix it.

23 of 24

16. Use comments to explain the why

But your code should never be confusing.

24 of 24

(17?) Be smart about breaking the rules