1 of 16

API CAN CODE �Computational Foundations of �Data Science

Lesson 2.2: Manual Data Processing

This work was made possible through generous support from the National Science Foundation (Award # 2141655).

2 of 16

2.1 Recap

  • Investigated Food Deserts in DC using a dataset from OpenDataDC and a Google Sheet for analysis�
  • Discussed data science elements in the process of an investigation on why TikTok songs go viral�
  • Data science as the intersection of programming, statistics, and content fields

2

3 of 16

Warmup - Explore “Streetlights”

  • Open the OpenDataDC Streetlights dataset. �
    • Click the download button (it looks like a cloud with a downwards arrow pointing from it). �
    • Under the “CSV” category, select “Download file previously generated on [last update date]”�
    • Save the file locally.�
  • Import the CSV into Google Sheets. �
  • Explore the dataset. What is the data about? What kinds of variables are stored within the dataset?

3

4 of 16

Warmup - Explore “Streetlights”

  • Open the OpenDataDC Streetlights dataset in Google Sheets. This dataset stores the location and information about every streetlight in DC.�
  • Explore the dataset.
    • What is the data about?
    • What kinds of variables are stored within the dataset? (look at different columns)
    • What is stored in each row?
    • What is stored in each column?
    • Why is it important for data to be organized this way?

4

5 of 16

Streetlights - Missing Data

Use conditional formatting across the entire dataset to highlight missing data cells.

  • Press ctrl+A to select the entire dataset.
  • Go to Format → Conditional Formatting.
  • Format cells if… “Is empty.”
  • Color the cells any color you want!
  • Look through the data. Is there any other cell value that you might like to have highlighted?

5

6 of 16

Streetlights - Countif Observations

Use a countif function to count a number of observations meeting �a particular variable value. �

The “CountIf” function looks at a cell and produces a certain value �IF that cell’s contents match a specified number or string of text, �and a different value if the cell’s contents are something ELSE.

=COUNTIF(O2:O72037, “Posttop”)

6

function name

range of values to search

criteria to check match

7 of 16

Streetlights - Countif Observations

Go to the bottom of the column O - “Fixture Style Description”.

  • Below the last data value in this column, enter the following: �=COUNTIF(O2:O72037, “Posttop”) �72037 may be replaced by whatever is the last row of data in your sheet, since your downloaded data may differ slightly from our version!
  • Press enter. What number appears? What does this number mean?
  • Go to Column Y, “LIGHTMANUFACTURER_DESC”. In the row below the last data value, type a similar function: =COUNTIF(Y2:Y72037, “General Electric”) and press enter. �How many street lights are made by General Electric?
  • How might missing data cells affect these counts?

7

8 of 16

Streetlights - Function Calculation

Use the Average function to calculate a mean value for several of the variables.

  • Go to the bottom of column AV, “WATTAGE 1”, and in the row below the last data value, type =AVERAGE(AV2:AV72037) What is the average wattage of street lights in the dataset? How might missing data affect this average?
  • Go to the bottom of column AH, “POLEHEIGHT_DESC”, and type a similar function: =AVERAGE(AH2:AH72037) and press enter.
  • This second function doesn’t seem to work. Why not? What went wrong, and how could we fix it?

8

9 of 16

Streetlights - Transformation

  • Variable transformation. Create new columns using a formula.
    • Look at column AG, “POLECOMPOSITION_DESC.” What information does this store?
    • Click on the header “AG,” right-click, and select “Insert 1 column right.” This will create a new, blank column. In the title row, enter “Steel.”
    • Go to cell AH2. Enter =IF(AG2=”Steel”, “Yes”, “No”) and press enter.
    • Click in cell AH2, click on the small circle in the bottom right corner of the cell, and drag downwards all the way to the bottom of the sheet (whoa!). This should apply this formula to all the cells in this column.
    • Note: does this keep comparing cells to AG2? Or does it compare each new cell to a different cell? (Click in AH72037 to check!)
    • How does missing data affect these “Yes” and “No” values?

9

10 of 16

The Switch to Programming

  • Now, we are going to look at working with the same data through programmed analysis instead of manual analysis. �
  • We’ll use a program called EduBlocks, which allows us to code in Python through structured “blocks” of code.�
  • Think about what benefits programming offers over the manual work we’ve already done!

10

11 of 16

Intro to EduBlocks

EduBlocks is a block-based programming tool that helps novices learn how to code with text-based programming languages.

11

Block-view

Code-view

There’s a quick guide in your handouts for future use!

12 of 16

Streetlights - Programming

  • Display this program, which imports the live-updated Streetlights data and prints an example streetlight data structure.�
  • Run the program, which will include a printout of one chunk of the data. The data structure here is quite different from the tabular structure of the CSV. How is this data organized?

12

13 of 16

Streetlights - Programming

  • Drag the lower chunk of code, beginning with “steelcount = 0” to connect it with the top chunk.
  • Each “block” in this program represents a line of Python code. You can see the text equivalent on the right-hand side. What parts do you recognize? What parts do you NOT recognize?

  • Look at the green “if” block. What criteria is required for a “match”? What happens within the “if” statement if these criteria are met? �
  • The end of the code should print out a number. Does this “Steel Pole Count” match the output of our countif statement on steel poles from earlier?

13

14 of 16

Streetlights - Programming

There are clear differences between the block-based programming process and our Google Sheets work earlier.

  • Which do you think was easier? In what ways?
  • What is challenging about the Google Sheets work?
  • What is challenging about the programming work?

14

15 of 16

Exit Ticket

  • If I want to take the average of a set of values, which function will give me that?
    • =AVERAGE(B2:B5)
    • =COUNTIF(B2:B5, “Yes”)�
  • If I want to count the number of “Blue” cars in a column, which function will do that?
    • =AVERAGE(AD2:AD57)
    • =COUNTIF(AD2:AD57, “Blue”)
  • I surveyed 100 randomly-selected students and calculated their average GPA, but 23 students did not provide a GPA value. I come up with an average GPA of 2.75. How might the missing data have affected this average?

15

16 of 16

Thanks!

apicancode@umd.edu

16

This work was made possible through generous support from the National Science Foundation (Award # 2141655).

API Can Code is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike

4.0 International (CC BY-NC-SA 4.0) License