Lecture 6

Charts; Census

DATA 8

Fall 2018

Slides created by John DeNero (denero@berkeley.edu) and Ani Adhikari (adhikari@berkeley.edu)

Announcements

Data Visualization

Types of Data

All values in a column should be of the same type and comparable to each other in some way

  • Numerical — Each value is from a numerical scale
    • Numerical measurements are ordered
    • Differences are meaningful
  • Categorical — Each value is from a fixed inventory
    • May or may not have an ordering
    • Categories are the same or different

“Numerical” Data

Just because the values are numbers doesn’t mean the variable is numerical

  • The Census example has a numerical SEX code (0, 1, and 2)

  • It doesn’t make sense to perform arithmetic on these “numbers”, e.g. 1 - 0 or (0+1+2)/3 are meaningless

  • The variable SEX is still categorical, even though numbers were used for the categories
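A minimal sketch of the point above, using made-up SEX codes: Python will happily average the codes, but only counting the categories gives an interpretable result. The list `sex_codes` is hypothetical data, not taken from the Census table.

```python
from collections import Counter

# Hypothetical SEX codes for eight rows (made-up values for illustration)
sex_codes = [0, 1, 2, 1, 2, 2, 0, 1]

# Arithmetic runs without error, but the result has no interpretation
meaningless_mean = sum(sex_codes) / len(sex_codes)

# Counting how often each category occurs is a meaningful operation
counts = Counter(sex_codes)

print(meaningless_mean)
print(counts[0], counts[1], counts[2])
```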

Plotting Two Numerical Variables

Line graph: plot

Scatter plot: scatter


Table Review

Table Structure

  • A Table is a sequence of labeled columns
  • Labels are strings
  • Columns are arrays, all with the same length

Name        Code  Area (sq mi)
California  CA    163696
Nevada      NV    110567

The diagram marks one label, one column, and one row of this table.

Table Methods

  • Creating and extending tables:
    • Table().with_column and Table.read_table
  • Finding the size: num_rows and num_columns
  • Referring to columns: labels, relabeling, and indices
    • labels and relabeled; column indices start at 0
  • Accessing data in a column
    • column takes a label or index and returns an array
  • Using array methods to work with data in columns
    • item, sum, min, max, and so on
  • Creating new tables containing some of the original columns:
    • select, drop
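To make the column operations concrete without depending on the course library, here is a plain-Python stand-in that models a table as labels mapped to equal-length columns. It is only an illustration of the ideas; the helper functions `column`, `select`, and `drop` below are hypothetical re-creations, not the Table class's own code.

```python
# A plain-Python stand-in for a table: labels mapped to equal-length columns.
# This is an illustration only, not the Table class itself.
states = {
    'Name': ['California', 'Nevada'],
    'Code': ['CA', 'NV'],
    'Area': [163696, 110567],
}

labels = list(states)               # column labels, in order
num_columns = len(labels)
num_rows = len(states[labels[0]])   # every column has this length

def column(table, label_or_index):
    """Return a column given its label or its 0-based index."""
    if isinstance(label_or_index, int):
        label_or_index = list(table)[label_or_index]
    return table[label_or_index]

def select(table, *keep):
    """Return a new table containing only the named columns."""
    return {label: table[label] for label in keep}

def drop(table, *remove):
    """Return a new table containing every column except the named ones."""
    return {label: col for label, col in table.items() if label not in remove}

# Column values can then be combined with ordinary functions such as sum
total_area = sum(column(states, 'Area'))
```

Note that `select` and `drop` build new tables and leave `states` unchanged, mirroring the "creating new tables" wording above.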

(Demo)

Manipulating Rows

  • t.sort(column) sorts the rows in increasing order of the values in that column
  • t.take(row_numbers) keeps the rows with the given indices
    • Each row has an index, starting at 0
  • t.where(column, are.condition) keeps all rows for which a column's value satisfies a condition
  • t.where(column, value) keeps all rows for which a column's value equals some particular value
  • t.with_row makes a new table that has another row
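The row operations above can be sketched in plain Python on a list of row tuples. This shows what each operation means, not how the Table methods are implemented; the players and salaries are made-up values.

```python
# Hypothetical rows with columns NAME, POSITION, SALARY (in $M); made-up data
rows = [
    ('Ann', 'PG', 20),
    ('Bo', 'C', 12),
    ('Cy', 'PG', 9),
]
POSITION, SALARY = 1, 2  # column indices start at 0

# sort: rows in increasing order of one column
by_salary = sorted(rows, key=lambda row: row[SALARY])

# take: keep the rows at the given 0-based indices
first_and_last = [rows[i] for i in [0, 2]]

# where with a condition: keep rows whose value satisfies a predicate
above_10 = [row for row in rows if row[SALARY] > 10]

# where with a value: keep rows whose value equals it
guards = [row for row in rows if row[POSITION] == 'PG']

# with_row: a new list with one more row; the original list is unchanged
extended = rows + [('Di', 'SF', 15)]
```

Every operation here produces a new list and leaves `rows` as it was, which is the behavior the "makes a new table" wording describes.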

Lists

Lists are Generic Sequences

A list is a sequence of values (just like an array), but the values can all have different types

[2+3, 'four', Table().with_column('K', [3, 4])]

  • Lists can be used to create table rows
  • If you create a table column from a list, it will be converted to an array automatically
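A short sketch of a list holding values of different types. A nested list stands in for the table in the example above so that the snippet needs no extra library.

```python
# One list, three different kinds of values
mixed = [2 + 3, 'four', [3, 4]]

# Record the type name of each element
kinds = [type(value).__name__ for value in mixed]
print(kinds)  # ['int', 'str', 'list']

# A list can also hold the values of one table row, one entry per column
row = ['California', 'CA', 163696]
```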

(Demo)

Discussion Questions

The table nba has columns NAME, POSITION, and SALARY.

a) Create an array containing the names of all point guards (PG) who make more than $15M/year

guards = nba.where('POSITION', 'PG')

guards.where('SALARY', are.above(15)).column('NAME')

b) After evaluating these two expressions in order, what's the result of the second one?

nba.with_row(['Bernie', 'Mascot', 100])

nba.where('NAME', are.containing('Bern'))

Census Data

The Decennial Census

  • Every ten years, the Census Bureau counts how many people there are in the U.S.

  • In between censuses, the Bureau estimates how many people there are each year.

  • Article 1, Section 2 of the Constitution:
    • “Representatives and direct Taxes shall be apportioned among the several States … according to their respective Numbers …”

Analyzing Census Data

Leads to the discovery of interesting features and trends in the population

(Demo)

Census Table Description

  • Values have column-dependent interpretations
    • The SEX column: 1 is Male, 2 is Female
    • The POPESTIMATE2010 column: 7/1/2010 estimate
  • In this table, some rows are sums of other rows
    • The SEX column: 0 is Total (of Male + Female)
    • The AGE column: 999 is Total of all ages
  • Numeric codes are often used for storage efficiency
  • Values in a column have the same type, but are not necessarily comparable (AGE 12 vs AGE 999)
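Because some rows are totals of other rows, summing a whole column counts people more than once. The sketch below uses made-up population figures (not real Census values) laid out as (SEX, AGE, POPESTIMATE2010) tuples, and excludes the total codes before adding.

```python
# Hypothetical rows: (SEX, AGE, POPESTIMATE2010). All figures are made up.
rows = [
    (0, 0, 200), (1, 0, 102), (2, 0, 98),
    (0, 1, 190), (1, 1, 97), (2, 1, 93),
    (0, 999, 390), (1, 999, 199), (2, 999, 191),
]

# Keep only individual ages for males and females, excluding total rows
detail = [r for r in rows if r[0] != 0 and r[1] != 999]
population = sum(r[2] for r in detail)

# Summing every row counts each person four times in this layout:
# once in the detail row, once per kind of total, and once in the grand total
naive = sum(r[2] for r in rows)

print(population, naive)
```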

Growth Rate

  • Growth rate = g (for example 3%, or 0.03)
  • Initial value x, final value y after t periods of time

Value after 1 period = x + xg = x * (1+g)

Value after 2 periods = x(1+g)(1+g) = x * (1+g) ** 2

Value after t periods = y = x * (1+g) ** t

So (1+g) ** t = y/x and so 1+g = (y/x) ** (1/t)

So g = (y/x) ** (1/t) - 1
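The final formula translates directly into Python. The function name `growth_rate` and the sample numbers are chosen here for illustration only.

```python
def growth_rate(x, y, t):
    """Per-period growth rate g such that y == x * (1 + g) ** t."""
    return (y / x) ** (1 / t) - 1

# A value that grows from 100 to 121 over 2 periods grows about 10% per period
g = growth_rate(100, 121, 2)
print(round(g, 4))  # 0.1

# Check: applying g for 2 periods to 100 recovers the final value
recovered = 100 * (1 + g) ** 2
```

Floating-point arithmetic makes `g` only approximately 0.1, which is why the printed value is rounded.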
