1 of 69

Final Exam Review


Data 6 Summer 2022

Getting ready to crush the final

Data 6 Staff

2 of 69

Expressions

  • Python expressions represent values.

  • Python expressions follow the rules of PEMDAS.

  • You can do addition and multiplication on strings!

Examples of expressions:

17

"Hello" + " World"

2 ** 3

(17 - 14) / 2

15 % 2
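As a quick check, the example expressions can be evaluated in plain Python. This is a minimal sketch; the variable names are chosen only for illustration.

```python
# Arithmetic follows the usual precedence rules (PEMDAS).
power = 2 ** 3            # exponentiation
quotient = (17 - 14) / 2  # parentheses first; / always produces a float
remainder = 15 % 2        # remainder after dividing 15 by 2

# Strings support + (concatenation) and * (repetition).
greeting = "Hello" + " World"
echo = "ha" * 3

print(power, quotient, remainder, greeting, echo)
```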

3 of 69

Data Types

  • Fundamental Python Data Types
    • int, str, float, NoneType, bool
  • Aggregated Data Types
    • dictionaries, numpy arrays, tables
  • You can cast values into different data types.

str(1) -> "1"

float(5) -> 5.0

int(10.3) -> 10

int(True) -> 1
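The casts above can be tried directly. A minimal sketch, with variable names chosen only for illustration:

```python
# Casting converts a value from one data type to another.
as_text = str(1)        # int -> str
as_float = float(5)     # int -> float
truncated = int(10.3)   # float -> int drops the decimal part (no rounding)
from_bool = int(True)   # bool -> int: True becomes 1, False becomes 0

print(as_text, as_float, truncated, from_bool)
print(type(as_text), type(as_float), type(None))
```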

4 of 69

Names

  • We can save expressions and values into names!

  • Names allow us to easily keep track of values and functions.

two = 2

james = "James"

data = 6

max

sum

5 of 69

Arrays and Indexing

  • Arrays hold a sequence of values of the same data type.
  • You can create an array with the function make_array().
  • You can access an item in an array with arr.item(index).
  • Arrays are zero-indexed, meaning indices start at 0.


6 of 69

Arrays and Indexing

  • You can do arithmetic with arrays.
  • Array to value arithmetic applies the same expression to every element in the array.
  • Array to array arithmetic operates elementwise through each array.
    • Both arrays must be the same length.

make_array(1,2) * 4 -> [4, 8]

make_array("mr.", "James") + make_array("professor", " w") -> ["mr.professor", "James w"]
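A minimal sketch of indexing and array arithmetic with numbers. It uses np.array as a stand-in for make_array, on the assumption that make_array returns a NumPy array; the values are made up for illustration.

```python
import numpy as np

# np.array stands in for make_array here (assumption).
scores = np.array([10, 20, 30])

first = scores.item(0)          # indices start at 0, so this is the first element
scaled = scores * 4             # array-to-value: applied to every element
bonus = np.array([1, 2, 3])
total = scores + bonus          # array-to-array: elementwise, same length required

print(first, scaled, total)
```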

7 of 69

WWPD?

data = 6.0

six = 10.4

data + six + float(True)


8 of 69

WWPD?

data = 6.0

six = 10.4

data + six + float(True)

17.4


9 of 69

WWPD?

x = True

false = x

str(False) + str(false)


10 of 69

WWPD?

x = True

false = x

str(False) + str(false)

"FalseTrue"

11 of 69

WWPD?

car = "bar"

cdr = "foo"

car = cdr + car

cdr = (car + cdr) * 2

cdr


12 of 69

WWPD?

car = "bar"

cdr = "foo"

car = cdr + car

cdr = (car + cdr) * 2

cdr

"foobarfoofoobarfoo"

13 of 69

WWPD?

one = 2.6

two = 1.2

three = 47.3

four = make_array(three, two, one)

one = str(int(four.item(1)))

(one + str(four.item(0))) * int(four.item(2))
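One way to check this one is to trace it step by step. The sketch below uses np.array as a stand-in for make_array (an assumption) and stores the final expression in a name, result, that is not part of the original question.

```python
import numpy as np

one = 2.6
two = 1.2
three = 47.3

# np.array stands in for make_array (assumption).
four = np.array([three, two, one])

# four.item(1) is 1.2; int() truncates it to 1; str() turns that into "1".
one = str(int(four.item(1)))

# "1" + "47.3" is "147.3"; int(four.item(2)) is int(2.6), i.e. 2,
# so the string is repeated twice.
result = (one + str(four.item(0))) * int(four.item(2))
print(result)
```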


14 of 69

Visualization

Data 6 Summer 2022

FINAL EXAM REVIEW

Bar Charts, Histograms, Scatter Plots, Line Plots, Maps

Developed by students and faculty at UC Berkeley and Tuskegee University

data6.org/su22/syllabus/#acknowledgements-

15 of 69

Encoding

An encoding is a mapping from a variable to a visual element.

Examples:

  • Bar length: longer bar = higher average age
  • Point size: larger point = older player
  • Point color: different colors for different categories (e.g. one color for forwards, one color for guards)
  • Point label: according to a category (name, position, team, etc.)

16 of 69

Three-Step Process for Visualization

  1. Pre-Process: Create a table with only the columns necessary to create the visualization.
  2. Choose the Plot Type: Call the correct visualization (depending on variable type).
  3. Customize the Plot: Provide the correct arguments for visual customization.

17 of 69

Choose the Plot Type: Call the correct visualization (depending on variable type).

Variable Type

  • Numerical (aka Quantitative): can do arithmetic with.
    • Discrete: whole numbers; can be counted. E.g. number of cars, number of Cal students
    • Continuous: numbers with decimals; often measured. E.g. price, temperature, GPA, weight
  • Categorical (aka Qualitative): cannot do arithmetic with.
    • Ordinal: categories with some inherent ordering. E.g. highest degree attained, Yelp stars
    • Nominal: categories with no inherent ordering. E.g. colors, political affiliation

18 of 69

Optional Exam Practice Problems: Question 3.1

How many variables are encoded in this scatter plot?


Quick Check

19 of 69

Categorical Distributions, Bar Charts

20 of 69

Bar Charts and Categorical Distributions

Bar charts are often used to display the relationship between a categorical variable and a numerical variable.

tbl.barh(column_for_categories)

  • Values in column_for_categories are the unique categories on the y-axis
  • Bars represent every other column in tbl.

Cookie          | Count
chocolate chip  | 15
red velvet      | 15
oatmeal raisin  | 10
sugar cookies   | 10
peanut butter   | 5

cookies.barh('Cookie')

21 of 69

Visualization Note: Bar Order

Depending on the type of categorical variable we’re displaying, we may want to sort the bars of our bar charts differently.

Sort by bar length: e.g., if the categorical variable has no natural order to the categories.

Cookie          | Count
chocolate chip  | 15
red velvet      | 15
oatmeal raisin  | 10
sugar cookies   | 10
peanut butter   | 5

Sort by category: e.g., if the categorical variable has an inherent ordering like alphabetical, numerical, etc.

Semester     | Enrollment
Fall 2020    | 70
Spring 2021  | 55
Fall 2021    | 80
Spring 2022  | 60

The bar order depends on what you want to express through your visualization.

22 of 69

sort()

The method tbl.sort(...) returns a new table with the rows sorted according to the values in some column. There are two ways we can call it:

  1. tbl.sort(column_or_label)
    • Sorts rows according to the specified column, in ascending (increasing) order.
  2. tbl.sort(column_or_label, descending = True)
    • Sorts rows according to the specified column, in descending (decreasing) order.

Cookie          | Count
chocolate chip  | 15
red velvet      | 15
oatmeal raisin  | 10
sugar cookies   | 10
peanut butter   | 5

cookies.sort('Count', descending = True)

23 of 69

Numerical Distributions, Histograms


24 of 69

Histograms

A histogram visualizes the distribution of a numerical variable by binning (counting the number of numerical values that fall within ranges, called “bins”).

tbl.hist(column)

  • It automatically chooses bins for us. We can change them.
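To see what binning does to the data, the sketch below counts values per bin with np.histogram. This is a stand-in for tbl.hist, which draws the chart rather than returning counts; the heights are hypothetical values made up for illustration.

```python
import numpy as np

# Hypothetical heights in inches, made up for illustration.
heights = np.array([64, 64.5, 66, 67.9, 68, 70.2])

# Bin edges 64, 66, 68, 70, 72 define four bins, each 2 inches wide.
edges = np.arange(64, 73, 2)

# np.histogram counts how many values fall into each bin.
# Each bin includes its left edge and excludes its right edge
# (except the last bin, which includes both).
counts, _ = np.histogram(heights, bins=edges)
print(edges, counts)
```

Passing different edges through the bins argument is how the automatic choice gets overridden.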


25 of 69

np.arange()

The NumPy function np.arange() creates a sequence of numbers.

Array ranges work like indexing: inclusive of the starting position, and exclusive of the ending position.

Format                        | Returns
np.arange(n)                  | An array of all integers from 0 to n-1.
np.arange(start, stop)        | An array of all integers from start to stop-1.
np.arange(start, stop, step)  | An array of integers from start up to (but not including) stop, counting by step.
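A short sketch of the three call formats; the variable names are chosen only for illustration.

```python
import numpy as np

first_five = np.arange(5)         # 0, 1, 2, 3, 4
two_to_five = np.arange(2, 6)     # 2, 3, 4, 5 (6 is excluded)
by_threes = np.arange(1, 10, 3)   # 1, 4, 7 (10 is excluded)

print(first_five, two_to_five, by_threes)
```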

26 of 69

A Note on Bins

By looking at a histogram, we cannot tell how values are distributed within a bin.


All heights in this bin could be 64 inches.

Or they could all be 66 inches.

Or half could be 65 and half could be 67.

Unless we have the actual data, we can’t tell.

27 of 69

Optional Exam Practice Problems: Question 3.2

Given a table tips with one column called "tips" containing the tip amount the server received for each order, write one line of code to generate the following histogram:


Quick Check

28 of 69

Optional Exam Practice Problems: Question 3.2 SOLUTION

Given a table tips with one column called "tips" containing the tip amount the server received for each order, write one line of code to generate the following histogram:

tips.hist('tips', bins = np.arange(1, 11))

Quick Check

29 of 69

Bar Charts vs. Histograms

Bar charts visualize the distribution of a categorical variable, or the relationship between a categorical variable and a numerical variable.

  • Length of bar corresponds to value.
  • Width of bar means nothing.

Histograms visualize the distribution of a numerical variable.

  • Length of bar corresponds to number of values within bin.
  • Width of bar corresponds to the width of the bin.
    • Wider bin → more values within bin → smoother histogram.


30 of 69

Scatter Plots


31 of 69

Scatter Plots

Scatter plots are used to visualize two numerical variables at once. To create a scatter plot from a table, you need two columns:

  • A numerical column for the x-axis.
  • A numerical column for the y-axis.

The resulting graph has one point for every row in your table.

32 of 69

.scatter()

tbl.scatter(column_for_x, column_for_y)

  • If only column_for_x is provided, a separate scatter plot is drawn for every other column in tbl (similar to the behavior of barh).

33 of 69

Line Plots


34 of 69

Line Plots and .plot()

What if we want to visualize two numerical variables, but one of them is time?

  • There’s only one y for every x.
  • We want to emphasize a trend by “connecting the dots”.

tbl.plot(column_for_x, column_for_y)

  • column_for_x should contain some time-based numerical variable.
  • If only column_for_x is provided, a separate line plot is drawn for every other column in tbl (similar to the behavior of barh and scatter).

35 of 69

Scatter Plots vs. Line Plots

Scatter plots visualize the relationship between any two numerical variables.

  • No need to have unique x (or y) values.
  • Useful for identifying patterns between variables

Line plots visualize the relationship between two numerical variables, one of which is ordered.

  • x-axis generally represents time or distance.
  • There should only be one y value for every x value.
  • Useful for identifying trends over time

36 of 69

Maps


37 of 69

Scatter Plot Maps

When we want to visualize the geographic locations of a lot of data points, it's often helpful to start with a scatter plot map.

  • Scatter plots with geographic maps
  • Help you visualize geographic locations in relation to cities, states, and countries.

Use px.scatter_geo(df, lat, lon): data frame, latitude, longitude

Overcrowding!

38 of 69

Choropleth Maps

Choropleth maps are useful for visualizing numerical variables across different states or countries. In this sense they are analogous to bar charts, since they encode one categorical variable (state or country) and one numerical variable.

Aggregation! .group()

Use px.choropleth(df, locations): data frame, state abbreviations

39 of 69

Summary

Visualization | Description | Python
Bar Chart     | distribution of a categorical variable, or the relationship between a categorical variable and a numerical variable | tbl.barh(column_for_categories)
Histogram     | distribution of a numerical variable | tbl.hist(column)
Scatter Plot  | relationship between any two numerical variables | tbl.scatter(column_for_x, column_for_y)
Line Plot     | relationship between two numerical variables, one of which is ordered | tbl.plot(column_for_x, column_for_y)

40 of 69

Table Manipulations


41 of 69

Table Properties

Table: a sequence of labeled columns

Row: one individual, one data point

Column: one attribute, one feature

42 of 69

Method 1: .take(row_index or array of indices)

  • Can be used with tables or arrays
  • Examples:
    • table.take(0) takes the 1st row from a table
    • array.take(0) takes the 1st element from an array
    • table.take(np.arange(5)) takes the first 5 rows from a table

43 of 69

Method 2: Select, Drop, Relabel, Add Columns

  • school.select(n1, n2, ...): takes in one or more labels (or indices) and returns a new table with just those columns
  • school.drop(n1, n2, ...): takes in one or more labels (or indices) and returns a new table without those columns
  • school.relabeled(old_name, new_name): replaces the label old_name with new_name and returns a new table

*The methods above return new tables. The original table school is not changed!

  • school.with_column(name, vals): adds a column with label name and values vals
  • school.with_columns(n1, v1, n2, v2, ...): adds multiple columns at once

*If you add a column whose label already exists, the old column's values are replaced with the new ones.

44 of 69

Method 3: Filtering with Where

school.where("Founded", 1869)

or: school.where("Founded", are.equal_to(1869))

school.where(label, predicate): returns a new table that contains only the rows whose label field/attribute satisfies the predicate

More examples: Lab 2

List of predicates

45 of 69

Method 4: .join()

table1.join('col1', table2, 'col2')

  • Looks for matches between col1 from table1 and col2 from table2
  • Combines the rows that have a match

46 of 69

Practice Problem

data8_roster has the following 3 columns: (25 rows)

  • 'First Name', Student's (not necessarily unique) first name
  • 'Student ID', Student's (unique) Student ID
  • 'Test Average', Student's average test grade

while englishr1a_roster has the following 5 columns: (35 rows)

  • 'Name', Student's (not necessarily unique) first name
  • 'Student ID', Student's (unique) Student ID
  • 'Essay 1', Student's grade on Essay 1
  • 'Essay 2', Student's grade on Essay 2
  • 'Final Essay', Student's grade on the Final Essay
There are 12 students enrolled in BOTH classes.

  • How many columns does the table data8_roster.join('Student ID', englishr1a_roster) have?
  • How many rows does the table data8_roster.join('Student ID', englishr1a_roster) have?

47 of 69

Practice Problem

data8_roster has the following 3 columns: (25 rows)

  • 'First Name', Student's (not necessarily unique) first name
  • 'Student ID', Student's (unique) Student ID
  • 'Test Average', Student's average test grade

while englishr1a_roster has the following 5 columns: (35 rows)

  • 'Name', Student's (not necessarily unique) first name
  • 'Student ID', Student's (unique) Student ID
  • 'Essay 1', Student's grade on Essay 1
  • 'Essay 2', Student's grade on Essay 2
  • 'Final Essay', Student's grade on the Final Essay

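  • How many columns does the table data8_roster.join('Student ID', englishr1a_roster) have? 7
    • When joining two tables, the resulting table contains all of the columns in both tables minus one, because the shared join column ('Student ID') appears only once: 3 + 5 - 1 = 7.
  • How many rows does the table data8_roster.join('Student ID', englishr1a_roster) have? 12
    • Only rows whose 'Student ID' appears in both tables are kept, and 12 students are enrolled in both classes.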
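The row and column counts can be checked with a plain-Python sketch of the join logic. This is not the datascience join itself: it uses sets and lists as stand-ins, and the Student IDs are hypothetical, chosen so that exactly 12 appear in both rosters.

```python
# Hypothetical Student IDs, made up so that exactly 12 appear in both rosters.
data8_ids = set(range(0, 25))        # 25 students
englishr1a_ids = set(range(13, 48))  # 35 students

# The join keeps only IDs present in both tables.
matched_ids = data8_ids & englishr1a_ids
joined_rows = len(matched_ids)

data8_columns = ['First Name', 'Student ID', 'Test Average']
englishr1a_columns = ['Name', 'Student ID', 'Essay 1', 'Essay 2', 'Final Essay']

# The join column appears once in the result, so it is not counted twice.
joined_columns = data8_columns + [c for c in englishr1a_columns if c != 'Student ID']

print(joined_rows, len(joined_columns))
```

Because each Student ID is unique within its roster, every matched ID contributes exactly one row.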

48 of 69

Method 5: .group(column)

  • Used for data aggregation
  • table.group(column) counts the number of rows for each unique value in column, and returns the counts in a two-column table
    • Column 1: every unique value in column
    • Column 2: count for each unique value

49 of 69

Practice: streams table


Fill in the blanks below to generate a table that contains the top 10 artists sorted by most songs

  • streams.group(_________).sort(________, ________).take(__________)

50 of 69

Practice: streams table


Fill in the blanks below to generate a table that contains the top 10 artists sorted by most songs

  • streams.group(artist_names).sort('count', descending=True).take(np.arange(10))
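The same group, sort, take pattern can be sketched in plain Python. This uses collections.Counter and sorted as stand-ins for the table methods; the artist names are hypothetical, and the sketch keeps the top 2 rather than the top 10 to stay short.

```python
from collections import Counter

# Hypothetical artist column, one entry per song (made-up data).
artists = ["Ana", "Ben", "Ana", "Cy", "Ben", "Ana", "Dee"]

# group: count the rows for each unique value.
counts = Counter(artists)

# sort(..., descending=True) then take the first n rows: keep the n largest counts.
top_two = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:2]
print(top_two)
```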

51 of 69

Method 6: .pivot(columns, rows, values, collect)

  • default: counts the number of occurrences of each [col, row] combination
  • When values and collect function are specified:
    • Returns a table with entries specified by values combined using the collect function for every [col, row] combination

52 of 69

Method 7: .apply(function)

t.apply(function, column_or_columns)

  • applies function to every element in column_or_columns, and returns an array with the results.
    • If you only supply one column name, function should only take one argument.
    • If you supply X column names, function should take X arguments.


53 of 69

Practice Problem: assets table


We want to compute the Closing Price today of each commodity in the assets table using formula:

Closing Price Today = Closing Price Yesterday * (100% + Growth Today)

Step 1: Define a function that computes the closing price today

def get_today_price(yesterday_price, pct_str):
    # Get the growth rate as a float
    pct = __(a)__
    # Apply the formula
    today_price = __(b)__
    # Round the result to 2 decimal places and return
    return np.round(today_price, 2)

Step 2: compute closing price today for each commodity in the assets table:

prices_applied = assets.apply(??????????)

54 of 69

Practice Problem: assets table


We want to compute the Closing Price today of each commodity in the assets table using formula:

Closing Price Today = Closing Price Yesterday * (100% + Growth Today)

Step 1: Define a function that computes the closing price today

def get_today_price(yesterday_price, pct_str):
    # Get the growth rate as a float
    pct = float(pct_str.replace('%', ''))
    # Apply the formula
    today_price = yesterday_price * (1 + pct/100)
    # Round the result to 2 decimal places and return
    return np.round(today_price, 2)

Step 2: compute closing price today for each commodity in the assets table:

prices_applied = assets.apply(get_today_price, 'Closing Price Yesterday', 'Growth Today')
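To run the solution without the assets table, the sketch below feeds the function two hypothetical columns and calls it once per row, which is roughly what apply does when given two column names. The sample prices and growth strings are made up for illustration.

```python
import numpy as np

def get_today_price(yesterday_price, pct_str):
    # Turn a string such as '10%' into the float 10.0
    pct = float(pct_str.replace('%', ''))
    # Closing Price Today = Closing Price Yesterday * (100% + Growth Today)
    today_price = yesterday_price * (1 + pct / 100)
    # Round the result to 2 decimal places and return
    return np.round(today_price, 2)

# Hypothetical column values standing in for the assets table.
closing_yesterday = [100.0, 50.0]
growth_today = ['10%', '-2%']

# Call the function once per row, pairing up the two columns.
prices_applied = np.array([
    get_today_price(price, growth)
    for price, growth in zip(closing_yesterday, growth_today)
])
print(prices_applied)
```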

55 of 69

Control & Iteration


56 of 69

Booleans

Any expression that evaluates to True or False:

  • Comparison operators: ==, <=, >=, !=, <, >
  • Expressions using the and and or keywords
  • Any number besides 0 is treated as True
  • Any non-empty string is treated as True

These are all treated as False:

  • False
  • '' (the empty string)
  • 0 (and hence 0.0)
  • None
  • Generally, things that are empty (empty lists, sets, dictionaries, etc.)

57 of 69

Boolean Expression Examples

'' == True -> False

(True or False) and (False or True) or 1/0 -> True (1/0 is never evaluated)

(5 and -1) and 0 -> 0 (which is treated as False)

"abc" < "def" -> True
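A minimal sketch that checks these rules in plain Python. bool() shows how a value is treated in a condition; the variable names are chosen only for illustration.

```python
# bool() shows whether Python treats a value as True or False.
truthy = [bool(5), bool(-1), bool("abc")]
falsy = [bool(0), bool(0.0), bool(""), bool(None), bool([])]

# `and` and `or` return one of their operands, not necessarily True or False.
and_result = (5 and -1) and 0      # 5 is truthy -> -1; -1 is truthy -> 0

# `or` stops as soon as its left side is truthy, so 1 / 0 is never evaluated.
or_result = (True or False) and (False or True) or 1 / 0

print(truthy, falsy, and_result, or_result)
```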

58 of 69

If-Statements

if <boolean expression>:
    <if body>
elif <boolean expression>:
    <elif body>
else:
    <else body>

The elif and else clauses are optional.

The conditions are checked in order. For the first <boolean expression> that is True, the corresponding <... body> is run and the rest are skipped.

If all <boolean expression>(s) are False, <else body> is run.

59 of 69

If-Statement Example

def mystery(x, y):
    if x < 7:
        if y == "Berkeley":
            return "Yay!"
        else:
            return "Boo!"
    else:
        return y

What do the following return?

mystery(5, "Stanford") -> "Boo!"

mystery(9, "Berkeley") -> "Berkeley"

60 of 69

While Loops

while <boolean expression>:
    <body>

"While the expression evaluates to True, run the body."

* Make sure that your <boolean expression> eventually evaluates to False; otherwise the loop never ends.

61 of 69

For Loops

for <element> in <sequence>:
    <for body>

"For each element in the sequence, run the body."

* The sequence can be an array, list, string, etc.
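A minimal sketch of two common for loops: one over an existing sequence (a string) and one over np.arange(n) to repeat the body n times. The variable names and values are made up for illustration.

```python
import numpy as np

# Loop over an existing sequence: one iteration per character.
letters = []
for character in "Data":
    letters.append(character)

# Loop a fixed number of times using np.arange(n).
total = 0
for i in np.arange(4):
    total = total + i       # adds 0 + 1 + 2 + 3

print(letters, total)
```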

62 of 69

While Loops Vs. For Loops

While Loops

  • Use when you don’t know how many iterations will be run in advance

  • To determine when the while loop will terminate, we often keep track of some sort of counter and check if it is above, below, or equal to a certain value
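A minimal sketch of the counter pattern: the number of iterations is not known in advance, so the loop runs until a condition on the tracked value fails. The limit and variable names are made up for illustration.

```python
# Count how many times 1 must be doubled before it exceeds a limit.
limit = 100
value = 1
doublings = 0

while value <= limit:
    value = value * 2           # the body changes value, so the condition eventually fails
    doublings = doublings + 1   # counter that tracks the number of iterations

print(value, doublings)
```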

For Loops

  • Use when you know how many iterations will be run in advance

  • You can use these to loop through existing arrays, strings, etc.

  • You can also use the array created by np.arange(n) in order to execute the body of the for loop n times


63 of 69

Practice!

Fill in the blanks in the fours_and_sevens(n) function so that it does the following given an integer input n:

  • Starting with 1, find all multiples of 4 or 7 up to (and including) n
    • If the number is a multiple of 4 (but not 7), print "I'm a boring number"
    • If the number is a multiple of 7 (but not 4), print "I'm a cool number"
    • If the number is a multiple of 4 and 7, print "I guess I'm sorta cool"


64 of 69

Solution
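One possible implementation is sketched below. The fill-in-the-blank skeleton is not reproduced here, so the structure is an assumption: in particular, the helper describe is hypothetical and exists only to keep the three cases easy to check.

```python
def describe(number):
    """Return the message for one number, or None if it is not a multiple of 4 or 7."""
    by_four = number % 4 == 0
    by_seven = number % 7 == 0
    if by_four and by_seven:
        return "I guess I'm sorta cool"
    elif by_four:
        return "I'm a boring number"
    elif by_seven:
        return "I'm a cool number"
    return None


def fours_and_sevens(n):
    # range(1, n + 1) starts at 1 and includes n itself.
    for number in range(1, n + 1):
        message = describe(number)
        if message is not None:
            print(message)


fours_and_sevens(28)
```

The check for "multiple of 4 and 7" comes first; otherwise a number like 28 would be caught by the multiple-of-4 branch.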


65 of 69

Practice!

Fill in the blanks to create a function that:

  • Takes in a String
  • Prints out each character of the String in uppercase.


66 of 69

Solution
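One possible implementation is sketched below. The fill-in-the-blank skeleton is not reproduced here, so the function name print_upper is hypothetical.

```python
def print_upper(text):
    # A string is a sequence, so the loop visits one character at a time.
    for character in text:
        print(character.upper())


print_upper("data")
```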


67 of 69

Practice!

Fill in the two blanks in np.arange() so that the following code works as displayed.


68 of 69

Solution


69 of 69

Good Luck!
