1 of 77

Crunching the Numbers: Data Journalism 101

Marina Villeneuve & Kae Petrin

Slides are here: https://shorturl.at/e82pk

2 of 77

What is data journalism?

3 of 77

Data Science

Data Art

Data Storytelling!

4 of 77

Data journalism is another set of tools for figuring out the best way to tell a story

5 of 77

Why use data?

1. Data can lend credence and complexity to (or debunk) anecdote

2. It's a powerful investigative tool

3. It can communicate information efficiently and clearly

6 of 77

7 of 77

8 of 77

Building Science Graphics: An illustrated guide to communicating science through diagrams and visualizations, by Jen Christiansen

9 of 77

What is data?

10 of 77

Some common sources

  • Financial records
  • Environmental testing
  • Disciplinary data, bullying, civil rights data in schools
  • Demographic information in broader datasets
  • Housing, property taxes
  • Police calls, crime, safety
  • Polls and research studies

Thinking broader — what's on your beat?

11 of 77

Finding data for reporting

12 of 77

Where to look

  • Federal agency websites — large databases to dashboards
  • State department websites
  • Reports from municipal, state, and federal entities
  • Research institutes (e.g. Invisible Institute, Williams Institute)
  • Journalism open source resources (ProPublica, Big Local Data, IRE)
  • Census and American Community Survey data
  • RAND Surveys: https://www.bentobento.info/

Or… obtain it yourself through FOIA.

13 of 77

If someone fills out a form, that's data.

14 of 77

State Department of Education Websites

Good place to check first for data – there is almost always a section of the website entitled “data center”, “library”, “report card” or something similar.

15 of 77

Government databases

16 of 77

Terrible government data viz

17 of 77

Government dashboards

18 of 77

You have to interview data just like any other source.

19 of 77

Beware of data definitions

20 of 77

How to interview your data

  • WHO: Who collected the data? Who is in the data? Is this a complete universe, a representative sample or a non-representative sample?
  • WHAT: What am I actually looking at? Figure out the basic shape of the data (e.g. the range of the time period) and what every column means.
    • What is missing? E.g.: time periods, underrepresented groups, locations.
  • WHERE: Which areas does this data cover?
  • WHEN: When was data collected? Is it up to date?
  • WHY: Why was data collected? (E.g.: mandated by law or a lawsuit, market research, voluntary survey).
  • HOW: How was data collected? Uploaded automatically or manually? Is it self-reported?

21 of 77

Best practices for obtaining data

  1. Request documentation (fields, collection methods, update history, etc) with any data sets you try to obtain
  2. If something looks weird, or you can’t explain it — or the data is just poorly documented — ask an official to explain
  3. Like with a story, it’s okay to show officials summaries of your findings and give them a chance to respond or contradict, but don’t show them a full spreadsheet, your entire analysis, or a Google Sheet that you're actively working in
  4. Check your work, ask lots of questions

22 of 77

Additional tips

  • Look for or request structured data that comes in a .csv, .xlsx, .tsv or other machine-readable format
  • If you see an interesting dashboard, check for a “download CSV” button
  • Ask for more data than you think you’ll need, i.e., “all schools in a district from 2012 until most recently available,” even if the story focuses on 2 schools.
    1. Scale back your request if they claim that will be burdensome to fulfill.
  • Check state or federal websites for datasets that contain mandated reporting information from local school districts
  • If you can’t find any data, or you found information that’s in a different format (like dashboard, web map, etc), reach out to industry data resources for help!

23 of 77

Checking your work

  1. Ask clear questions that you can prove — or disprove
  2. Start from what you already know
  3. Restructure your data and see what happens
  4. Retrace your steps with different tools — like a calculator
  5. Try a new method of calculation, with fresh assumptions

It’s all about shifting perspective

Read more about ways to think through your results.

24 of 77

Thinking more broadly about data

25 of 77

Data can come in surprising forms

26 of 77

Learning more

27 of 77

Low-tech tools for data journalism

  • Turning PDFs into data
    • Adobe Acrobat/Reader
    • Tabula
    • Google Pinpoint
  • Scraping basic websites
    • IMPORTHTML in Google Sheets
  • Visualizing data and making maps
    • Datawrapper
    • QGIS

28 of 77

Learning more and checking your work - web resources

29 of 77

Books on thinking about data

  • Investigative Reporters & Editors' Numbers in the Newsroom
  • The Wall Street Journal's Guide to Information Graphics
  • How Charts Lie by Alberto Cairo
  • Precision Journalism by Philip Meyer
  • Data Portraits: Visualizing Black America by W.E.B. DuBois
  • Invisible Women: Data Bias in a World Designed for Men by Caroline Criado Pérez
  • Queer Data by Kevin Guyan

30 of 77

Learning to code

Journalism-specific resources

Other resources

31 of 77

Finding the story:

Using data to report on communities, states

32 of 77

33 of 77

Data as adding context, finding stories

For example:

If you want to do a story on people living with diabetes …

find a dataset on diabetes rates by county

https://gis.cdc.gov/grasp/diabetes/DiabetesAtlas.html

And see which counties have the highest rates per capita…

And speak to people who live there

34 of 77

Statehouse data…

Who’s influencing who?

Lobbying data

Campaign finance contribution data

Financial disclosure statements

How are they spending our money and their campaign funds?

Budgets

Legislative reimbursements

Campaign finance expenditure data

35 of 77

Campaign finance data

  • Know the reporting deadlines, put them in your calendar
  • Know contribution limits (or lack thereof…)
  • Call up spenders yourself, sign up for trainings, read filing guides, know Board of Elections experts, know about recent campaign finance reforms, ethics reform activists/groups...
  • Who’s giving most and why? (See if big donors are getting rewards, check out new groups)
  • Who’s not giving anymore and why? (See if big donors are unhappy with party etc)
  • Who’s spending what on what? (Candidates using funds as piggy banks?)
  • Who’s giving at the last minute? (Find in 24 hour reports)
  • Is it really dark money? (Nonprofit 501(c)(4)s and 501(c)(6)s don’t have to disclose donors. But they may anyways)
  • Who’s giving money in other states? (A starting point but verify data: followthemoney.org)
  • Who’s buying campaign ads in your state and for how much? https://publicfiles.fcc.gov/

36 of 77

37 of 77

38 of 77

39 of 77

Lobbying data: Fights to defeat, pass, tweak bills

40 of 77

41 of 77

Let’s get hands on:

With Excel!

42 of 77

First up…

Sorting and filtering

43 of 77

https://publicreporting.elections.ny.gov/CandidateCommitteeDisclosure/CandidateCommitteeDisclosure

44 of 77

Search by Committee

Search by Committee for:

Democratic Senate Campaign Committee - Housekeeping

Click search

45 of 77

Click on CSV Full Period

That will download the data for you

46 of 77

If Excel offers to convert the data, choose “Don’t Convert” (removing leading zeroes will mess things up)
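Why leading zeroes matter, as a minimal Python sketch. The sample rows and the "zip" column are made up for illustration and are not taken from the New York file; the point is only that an ID or ZIP code stored as text keeps its zero, while the same value converted to a number loses it.

```python
import csv
import io

# Hypothetical two-row sample; names and ZIP codes are invented for illustration.
sample = "name,zip\nExample Donor A,02134\nExample Donor B,10001\n"

rows = list(csv.DictReader(io.StringIO(sample)))

# csv.DictReader returns every field as text, so the leading zero survives.
zip_as_text = rows[0]["zip"]

# Converting the field to a number is what drops the leading zero.
zip_as_number = int(rows[0]["zip"])

print(zip_as_text)    # 02134
print(zip_as_number)  # 2134
```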

47 of 77

Steps:

Copy and paste data into a new worksheet (always keep an original)

Make sure the dataset is clean - columns are labeled, column headings are in first row, no empty rows, etc

Read columns to understand what data is here

48 of 77

On the main page: https://publicreporting.elections.ny.gov/

Look around and you’ll find a guide that explains each column

https://publicreporting.elections.ny.gov/Content/Help/FileFormatReferenceFiler.pdf

49 of 77

Sorting and filtering can be powerful!

Let’s sort the whole sheet by contribution amount

Go to Column Z (Amount) and click on the cell right below “Amount”

Press Control-A

Go to Sort & Filter and click Sort Largest to Smallest
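If you ever want to retrace this sort outside Excel, here is a rough Python equivalent. The column names ("entity_name", "amount") and the three rows are placeholders, not the real headers or figures from the downloaded CSV.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded CSV.
sample = """entity_name,amount
Example Org A,2500.00
Example Org B,100000.00
Example Org C,50.00
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Sort largest to smallest. Amounts arrive as text, so convert them to
# numbers for the comparison; otherwise "50.00" would sort above "2500.00".
sorted_rows = sorted(rows, key=lambda r: float(r["amount"]), reverse=True)

for r in sorted_rows:
    print(r["entity_name"], r["amount"])
```

Sorting on the text instead of the number is a classic way to get a wrong "top donor," which is one more reason to check results with a second tool.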

50 of 77

51 of 77

52 of 77

Now you can see the organizations donating the most to the NY Senate Democratic housekeeping committee

(Which is a committee that’s supposed to just be about funding the costs of a party headquarters and not for funding campaigns… but can powerful donors curry favor by donating?)

53 of 77

Now let’s try filtering…

Go to sort and filter, then click filter

54 of 77

Now you see little drop down boxes next to each column

  • In the Entity Name (Column N) column, hit that button
  • then click Select All (so it empties all the options)
  • Then type in GNYHA (Greater New York Hospital Association)
  • Click GNYHA and GNYHA
  • Now you’ve filtered!
  • To return to seeing everything, simply click Select All
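The same filter, sketched in Python as a cross-check. The rows and amounts below are invented; only the name GNYHA comes from the exercise. Matching on an uppercased, trimmed name is an assumption about how spelling variants might show up in the file.

```python
# Hypothetical rows; the amounts are made up for illustration.
rows = [
    {"entity_name": "GNYHA", "amount": 1000.0},
    {"entity_name": "Example Org A", "amount": 250.0},
    {"entity_name": "gnyha ", "amount": 500.0},
]

# Keep only rows whose entity name contains "GNYHA", ignoring case and
# stray spaces so that near-duplicate spellings are not missed.
matches = [r for r in rows if "GNYHA" in r["entity_name"].strip().upper()]

total = sum(r["amount"] for r in matches)
print(len(matches), total)  # 2 1500.0
```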

55 of 77

Let’s make a Pivot Table!

Press Control-A to select all the data you want.

Depending on which version of Excel you have, you then either:

  • Go to Insert, then hit Pivot table
  • Go to Data and then hit “Summarize with PivotTable”

56 of 77

Hit OK

57 of 77

Now you can start summarizing and looking for trends in data

Try playing around with placing different columns in columns and rows

How might you show entities by how much they donated?
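One way to think about what the pivot table is doing: it groups rows by entity and adds up the amounts. A minimal Python sketch of that idea, using invented entities and amounts:

```python
from collections import defaultdict

# Hypothetical (entity, amount) pairs; not real contribution records.
donations = [
    ("Example Org A", 500.0),
    ("Example Org B", 1200.0),
    ("Example Org A", 250.0),
]

# Group by entity and sum the amounts, like Entity Name in Rows
# and Sum of Amount in Values.
totals = defaultdict(float)
for entity, amount in donations:
    totals[entity] += amount

# Rank entities from largest total to smallest.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('Example Org B', 1200.0), ('Example Org A', 750.0)]
```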

58 of 77

59 of 77

Click on the cell right below “Sum of Amount”

And then hit “Sort Largest to Smallest”

60 of 77

61 of 77

To switch to $$

Go to “home” then highlight the column you want to change

And click on the $ sign

62 of 77

A lot of learning data is just playing around and getting used to it.

From here on out, you’ll want to think about things like:

  • Cleaning data (what about extra spaces or the difference between “Microsoft” and “Microsoft Inc”?)

  • Learning formulas for median, mean, percent change… (how much are things increasing?)

  • Learning what data are out there
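A small sketch of the cleaning idea, assuming you want “Microsoft” and “Microsoft Inc” to count as the same donor. The function name normalize_name and the list of suffixes are illustrative choices, not a standard; always eyeball the results, because two different organizations can share a cleaned name.

```python
import re

def normalize_name(name):
    """Collapse extra spaces, uppercase, and drop a trailing corporate suffix."""
    cleaned = " ".join(name.split()).upper()   # trims and collapses whitespace
    cleaned = cleaned.rstrip(".,")             # drops trailing punctuation
    # Remove one trailing suffix such as INC or LLC (illustrative list only).
    cleaned = re.sub(r"[,\s]+(?:INC|LLC|CORP|CO)$", "", cleaned)
    return cleaned

print(normalize_name("  Microsoft   Inc. "))  # MICROSOFT
print(normalize_name("Microsoft"))            # MICROSOFT
```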

Resources:

https://gijn.org/resource/analyzing-data-spreadsheets/

63 of 77

An exercise on calculating percent change:

64 of 77

https://mainecampaignfinance.com/#/transactionSearch/151

Let’s look at data about how much outside groups spend on campaigns in Maine…

Check off election years 2016 and 2020

Then click export results

65 of 77

Let’s make a pivot table…

66 of 77

Put Election Year in Columns

Filer Type in Rows

And Sum of Amount in Values

You get a summary of spending by filer type for each year
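The two-way layout above (filer type down the side, election year across the top) can be sketched in Python like this. The filer types and dollar amounts are placeholders, not values from the Maine export.

```python
from collections import defaultdict

# Hypothetical (filer_type, election_year, amount) records.
records = [
    ("PAC", 2016, 100.0),
    ("PAC", 2020, 150.0),
    ("Party Committee", 2016, 300.0),
    ("PAC", 2020, 50.0),
]

# pivot[filer_type][year] holds the summed amount for that cell of the table.
pivot = defaultdict(lambda: defaultdict(float))
# totals_by_year plays the role of the pivot table's Grand Total row.
totals_by_year = defaultdict(float)

for filer_type, year, amount in records:
    pivot[filer_type][year] += amount
    totals_by_year[year] += amount

print(pivot["PAC"][2020])    # 200.0
print(totals_by_year[2016])  # 400.0
```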

67 of 77

Formula for percentage change is

(new-old)/old

So, if we’re using cell names, it’s the cell for 2020 spending minus the cell for 2016 spending, divided by the cell for 2016 spending.
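Here is the same formula as a tiny Python function, handy for double-checking what the spreadsheet gives you. The numbers in the example are made up, not the Maine totals.

```python
def percent_change(old, new):
    """Return (new - old) / old as a decimal, e.g. 0.25 for a 25% increase."""
    if old == 0:
        # Percent change from zero is undefined; report the raw numbers instead.
        raise ValueError("old value is zero, percent change is undefined")
    return (new - old) / old

# Hypothetical totals: 200 in the earlier year, 250 in the later year.
change = percent_change(200, 250)
print(f"{change:.0%}")  # 25%
```

Note the order: the old value goes in the denominator. Dividing by the new value instead is a common slip and gives a smaller number.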

68 of 77

Type in that formula and then hit enter

And then hit “%” in the Home pane to change the decimal to a %

So now we know that outside spending increased 38% from 2016 to 2020

(And check an inflation calculator to see how much prices rose over the same period)

69 of 77

And to calculate a plain old percent…

Divide the part by the whole: Cell 1/Cell 2

Or, C6/D6 to find 2020 spending as a percentage of all spending in 2016 and 2020
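A quick Python check of the part-over-whole idea, with invented totals standing in for the 2016 and 2020 figures:

```python
def share_of_total(part, total):
    """Return part / total as a decimal, e.g. 0.6 for 60%."""
    return part / total

# Hypothetical spending totals for illustration only.
spending_2016 = 400.0
spending_2020 = 600.0

share_2020 = share_of_total(spending_2020, spending_2016 + spending_2020)
print(f"{share_2020:.0%}")  # 60%
```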

70 of 77

To find the median…

Go to your spreadsheet (not the pivot table)

Go to Column P and scroll to the bottom

In the cell below the last entry in Column P, type in:

=MEDIAN(P1:P1914)

That tells Excel to calculate the median (which is kinda useless in this particular example but oh well)

You can switch out MEDIAN for AVERAGE to get the mean instead.

But use median for monetary amounts!

Use mean for things like… average cat life span.
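Why median for money? One huge value can drag the mean far away from what is typical. A short Python illustration with made-up contribution amounts:

```python
import statistics

# Hypothetical contributions: four small ones and one very large one.
amounts = [50, 100, 100, 250, 100000]

median_amount = statistics.median(amounts)
mean_amount = statistics.mean(amounts)

print(median_amount)  # 100
print(mean_amount)    # 20100
```

The single 100,000 contribution pulls the mean to 20,100, while the median stays at 100, which is much closer to what a typical donor in this toy list gave.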

71 of 77

Other resources:

Show your work on Github

https://github.com/reportermarina

For example, I did a project on school discipline data in MA

Here’s my readme, as well as links to data I used, and my methodology:

https://github.com/ReporterMarina/project3-schools

72 of 77

I HIGHLY recommend Columbia’s Lede Program - a data journalism certificate you can do virtually

https://ledeprogram.com/

They surprisingly have financial aid available that made it really affordable for me (someone who applied literally last minute)

I still have access to all the lectures, walkthroughs, tutorials… a godsend

73 of 77

74 of 77

I also highly recommend IRE’s data Bootcamp!

https://www.ire.org/training/bootcamps/data-journalism-bootcamps/

And search for tutorials on Excel, SQL, R online

SQL: http://www.padjo.org/tutorials/

https://ksj.mit.edu/resource/data-journalism-tools/introduction/

75 of 77

76 of 77

Contact info

@ReporterMarina

marina.villeneuve@gmail.com

77 of 77