Lecture 6

Charts; Census

DATA 8

Fall 2018

Slides created by John DeNero (denero@berkeley.edu) and Ani Adhikari (adhikari@berkeley.edu)

Announcements

Data Visualization

Types of Data

All values in a column should be of the same type and comparable to each other in some way

- Numerical: each value is from a numerical scale
  - Numerical measurements are ordered
  - Differences are meaningful
- Categorical: each value is from a fixed inventory
  - May or may not have an ordering
  - Categories are the same or different

“Numerical” Data

Just because the values are numbers doesn’t mean the variable is numerical

- The Census example has a numerical SEX code (0, 1, and 2)

- It doesn’t make sense to perform arithmetic on these “numbers”, e.g. 1 - 0 or (0+1+2)/3 are meaningless

- The variable SEX is still categorical, even though numbers were used for the categories
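As a small illustration of the point above, the sketch below uses plain Python and a made-up list of SEX codes (the six values are invented; the code-to-label mapping follows the coding used in this lecture's Census table). It suggests why averaging the codes is uninformative while counting categories is meaningful.

```python
from collections import Counter

# Hypothetical SEX codes for six records (invented data).
# Coding assumed from the Census table: 0 = total, 1 = male, 2 = female.
sex_codes = [1, 2, 2, 1, 2, 1]

# Arithmetic runs without error, but the result names no category.
mean_code = sum(sex_codes) / len(sex_codes)

# Treating the codes as categories: translate each code, then count.
code_labels = {0: 'Total', 1: 'Male', 2: 'Female'}
counts = Counter(code_labels[code] for code in sex_codes)

print(mean_code)        # 1.5, which is neither Male nor Female
print(dict(counts))
```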

Plotting Two Numerical Variables

Line graph: plot

Scatter plot: scatter

Anthony Daniels, actor (https://en.wikipedia.org/wiki/C-3PO)

Table Review

Table Structure

- A Table is a sequence of labeled columns
- Labels are strings
- Columns are arrays, all with the same length

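Name | Code | Area (m2)
California | CA | 163696
Nevada | NV | 110567

Each column heading (e.g. Name) is a label, the values under a label form a column, and each line of values (e.g. California | CA | 163696) is a row.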

Table Methods

- Creating and extending tables:
  - Table().with_column and Table.read_table
- Finding the size: num_rows and num_columns
- Referring to columns: labels, relabeling, and indices
  - labels and relabeled; column indices start at 0
- Accessing data in a column
  - column takes a label or index and returns an array
- Using array methods to work with data in columns
  - item, sum, min, max, and so on
- Creating new tables containing some of the original columns:
  - select, drop

(Demo)

Manipulating Rows

- t.sort(column) sorts the rows in increasing order
- t.take(row_numbers) keeps the numbered rows
  - Each row has an index, starting at 0
- t.where(column, are.condition) keeps all rows for which a column's value satisfies a condition
- t.where(column, value) keeps all rows for which a column's value equals some particular value
- t.with_row makes a new table that has another row

Lists

Lists are Generic Sequences

A list is a sequence of values (just like an array), but the values can all have different types

[2+3, 'four', Table().with_column('K', [3, 4])]

- Lists can be used to create table rows
- If you create a table column from a list, it will be converted to an array automatically

(Demo)

Discussion Questions

The table nba has columns NAME, POSITION, and SALARY.

a) Create an array containing the names of all point guards (PG) who make more than $15M/year

guards = nba.where('POSITION', 'PG')

guards.where('SALARY', are.above(15)).column('NAME')

b) After evaluating these two expressions in order, what's the result of the second one?

nba.with_row(['Bernie', 'Mascot', 100])

nba.where('NAME', are.containing('Bern'))

Census Data

The Decennial Census

- Every ten years, the Census Bureau counts how many people there are in the U.S.

- In between censuses, the Bureau estimates how many people there are each year.

- Article 1, Section 2 of the Constitution:
  - “Representatives and direct Taxes shall be apportioned among the several States … according to their respective Numbers …”

Analyzing Census Data

Leads to the discovery of interesting features and trends in the population

(Demo)

Census Table Description

- Values have column-dependent interpretations
  - The SEX column: 1 is Male, 2 is Female
  - The POPESTIMATE2010 column: estimate as of July 1, 2010
- In this table, some rows are sums of other rows
  - The SEX column: 0 is Total (of Male + Female)
  - The AGE column: 999 is Total of all ages
- Numeric codes are often used for storage efficiency
- Values in a column have the same type, but are not necessarily comparable (AGE 12 vs AGE 999)

Growth Rate

- Growth rate = g (for example 3%, or 0.03)
- Initial value x, final value y after t periods of time

Value after 1 period = x + xg = x * (1+g)

Value after 2 periods = x(1+g)(1+g) = x * (1+g) ** 2

Value after t periods = y = x * (1+g) ** t

So (1+g) ** t = y/x and so 1+g = (y/x) ** (1/t)

So g = (y/x) ** (1/t) - 1
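The final formula translates directly into Python. This is a sketch with made-up numbers; the function name `growth_rate` is chosen here for illustration.

```python
def growth_rate(x, y, t):
    """Per-period growth rate g satisfying y = x * (1 + g) ** t."""
    return (y / x) ** (1 / t) - 1

# Made-up check: grow 100 at 3% per period for 5 periods,
# then recover the rate from the initial and final values.
x = 100.0
g = 0.03
t = 5
y = x * (1 + g) ** t

recovered = growth_rate(x, y, t)
print(round(recovered, 6))   # 0.03
```

The recovered rate matches g up to floating-point rounding, which is why the printed value is rounded.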