1 of 21

Lecture 11

Pivots and Joins

DATA 8

Fall 2023

2 of 21

Announcements

  • HW 4 due Wednesday 9/20 at 11pm
  • Project 1 due Friday 9/29
    • Checkpoint due Friday 9/22
    • World population through data
  • Sahai OH: Today, 5-7pm @ Free Speech Movement

3 of 21

Review: Grouping and Functions

4 of 21

Grouping and Collection

A

3

B

1

C

4

A

1

B

5

C

9

A

2

C

5

12

7

3

2

7

3

8

6

A

3

A

1

A

2

B

1

B

5

C

4

C

9

C

5

group

12

2

8

7

7

3

3

6

A

1

2

B

1

7

C

4

3

collect function (e.g., min)

Condenses results back into one table

(Demo)

5 of 21

Lists

6 of 21

Lists are Generic Sequences

A list is a sequence of values (just like an array), � but the values can have different types

[2+3, 'four', Table().with_column('K', [3, 4])]

  • Lists can be used to create table rows.
  • If you create a table column from a list, it will be converted to an array automatically

(Demo)

7 of 21

Cross-Classification

8 of 21

Grouping By Multiple Columns

The group method can also aggregate all rows that share the combination of values in multiple columns

  • First argument: A list of which columns to group by
  • collect = … (Optional) How to combine values

Group by one column (and minimize each other column): t.group(“label”, collect=min) or t.group(“label”, min)

Group by two columns (and minimize each other): �t.group([“label 1”, “label 2”], collect=min) or�t.group([“label 1”, “label 2”], min)

(Demo)

9 of 21

10 of 21

Pivot Tables

11 of 21

Pivot

  • Cross-classifies according to two categorical variables
  • Produces a grid of counts or aggregated values
  • Two required arguments:
    • First: variable that forms column labels of the grid
    • Second: variable that forms row labels of the grid
  • Two optional arguments (include both or neither)
    • values=’column_label_to_aggregate’
    • collect=function_to_aggregate_with

(Demo)

12 of 21

Let’s Practice

How to solve a table manipulation problem:

  1. Understand the result — what information is needed to fill in a particular value of the goal.
  2. Describe the operations — articulate (in English) the operations that will produce the result.
  3. Write Python expressions — express the operations using table methods, functions, & array arithmetic.

13 of 21

Discussion Question

  1. For each city, what’s the height of the tallest building for each material?
  2. For each city, what’s the height difference between the tallest steel building and the tallest concrete building?

(Demo)

sky

14 of 21

Group or Pivot?

  • When to Group:
    • aggregates of one categorical variable
    • aggregates of many variables
    • Multiple outputs (aggregate columns)

  • When to Pivot:
    • Aggregates of exactly two variables
    • Few unique values for column variable
    • Interested in every combination of values

15 of 21

16 of 21

Joins

17 of 21

Joining Two Tables

Drink

Cafe

Price

Milk Tea

Asha

5.5

Espresso

Strada

1.75

Latte

Strada

3.25

Espresso

FSM

2

Coupon

Location

10%

Asha

25%

Strada

5%

Asha

drinks

discounts

Cafe

Drink

Price

Coupon

Asha

Milk Tea

5.5

10%

Asha

Milk Tea

5.5

5%

Strada

Espresso

1.75

25%

Strada

Latte

3.25

25%

drinks.join('Cafe', discounts, 'Location')

Match rows in this table ...

… using values in this column ...

… with rows in that table ...

… using values in that column.

The joined column is sorted automatically

(Demo)

Columns from both tables

18 of 21

Table Review

19 of 21

Important Table Methods

t.select(column, …) or t.drop(column, …)

t.take([row_num, …]) or t.exclude([row_num, …])

t.sort(column, descending=False)

t.where(column, are.condition(...))

t.apply(function_name, column, …)

t.group(column) or t.group(column, function_name)

t.group([column, …]) or t.group([column, …], function_name)

t.pivot(cols, rows) or t.pivot(cols, rows, vals, function_name)

t.join(column, other_table, other_table_column)

20 of 21

Extra Example

21 of 21

Challenge Question

Generate a table of the names of the oldest buildings for each material for each city: