Final Exam Review
Data 6 Summer 2022
Getting ready to crush the final
Data 6 Staff
Expressions
17
"Hello" + " World"
2 ** 3
(17 - 14) / 2
15 % 2
expression
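As a quick check, the expressions above can be evaluated directly in plain Python. The variable names below are illustrative only and do not come from the slides.

```python
# Hypothetical names; the right-hand sides are the expressions from the slide.
literal = 17                  # a number on its own is already an expression
joined = "Hello" + " World"   # + concatenates strings
power = 2 ** 3                # ** is exponentiation
quotient = (17 - 14) / 2      # / always produces a float
remainder = 15 % 2            # % gives the remainder after division

print(literal, joined, power, quotient, remainder)
```

Note that `/` returns a float even when both operands are integers.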
Data Types
str(1) -> "1"
float(5) -> 5.0
int(10.3) -> 10
int(True) -> 1
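The conversions above can be tried in plain Python. This is a small sketch with illustrative variable names.

```python
# Hypothetical names; each line converts a value from one type to another.
as_text = str(1)          # number -> string
as_float = float(5)       # int -> float
truncated = int(10.3)     # int() drops the fractional part; it does not round
also_truncated = int(10.9)
from_bool = int(True)     # True converts to 1, False to 0

print(as_text, as_float, truncated, also_truncated, from_bool)
```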
Names
two = 2
james = "James"
data = 6
max
sum
Arrays and Indexing
make_array(1,2) * 4 -> [4, 8]
make_array("mr.", "James") + make_array("professor", " w") -> ["mr.professor", "James w"]
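The element-wise behaviour described above can be mimicked with plain Python lists. This is only a stand-in sketch: `scale` and `add_pairs` are hypothetical helpers, not part of the datascience library, and they are here to show what "element-wise" means.

```python
def scale(values, factor):
    """Multiply every element by factor, as array * number does."""
    return [v * factor for v in values]

def add_pairs(left, right):
    """Add elements position by position, as array + array does."""
    return [a + b for a, b in zip(left, right)]

scaled = scale([1, 2], 4)
paired = add_pairs(["mr.", "James"], ["professor", " w"])

# Contrast: a plain list times a number repeats the list instead.
repeated = [1, 2] * 4

print(scaled, paired, repeated)
```

The contrast with `repeated` is the reason arrays are used for arithmetic: a plain list multiplied by 4 is repeated four times, not scaled.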
WWPD?
data = 6.0
six = 10.4
data + six + float(True)
17.4
WWPD?
x = True
false = x
str(False) + str(false)
"FalseTrue"
WWPD?
car = "bar"
cdr = "foo"
car = cdr + car
cdr = (car + cdr) * 2
cdr
"foobarfoofoobarfoo"
WWPD?
one = 2.6
two = 1.2
three = 47.3
four = make_array(three, two, one)
one = str(int(four.item(1)))
(one + str(four.item(0))) * int(four.item(2))
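One way to work through this problem is to trace it line by line. The sketch below uses a plain Python list as a stand-in for `make_array`, so `four[1]` plays the role of `four.item(1)`. That substitution is an assumption made to keep the example self-contained.

```python
one = 2.6
two = 1.2
three = 47.3
four = [three, two, one]   # list stand-in for make_array(three, two, one)

# four[1] is 1.2; int() truncates it to 1; str() turns that into "1".
one = str(int(four[1]))

# "1" + "47.3" is string concatenation; int(2.6) is 2, so the string repeats twice.
result = (one + str(four[0])) * int(four[2])
print(result)
```

The key point is that `four` was built before `one` was reassigned, so the array still holds the original 2.6 in its last position.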
Visualization
Bar Charts, Histograms, Scatter Plots, Line Plots, Maps
Developed by students and faculty at UC Berkeley and Tuskegee University
Encoding
An encoding is a mapping from a variable to a visual element.
Examples:
Three-Step Process for Visualization
Pre-Process
Create a table with only the columns necessary to create the visualization
Choose the Plot Type
Call the correct visualization (depending on variable type)
Customize the Plot
Provide the correct arguments for visual customization
Choose the Plot Type
Call the correct visualization (depending on variable type)
Variable Type | Subtype | Description | Examples |
Numerical (aka Quantitative): can do arithmetic with | Discrete | Whole numbers; can be counted. | e.g. number of cars, number of Cal students |
Numerical (aka Quantitative): can do arithmetic with | Continuous | Numbers with decimals; often measured. | e.g. price, temperature, GPA, weight |
Categorical (aka Qualitative): cannot do arithmetic with | Ordinal | Categories with some inherent ordering. | e.g. highest degree attained, Yelp stars |
Categorical (aka Qualitative): cannot do arithmetic with | Nominal | Categories with no inherent ordering. | e.g. colors, political affiliation |
Optional Exam Practice Problems: Question 3.1
How many variables are encoded in this scatter plot?
Quick Check
Categorical Distributions, Bar Charts
Bar Charts and Categorical Distributions
Bar charts are often used to display the relationship between a categorical variable and a numerical variable.
tbl.barh(column_for_categories)
Cookie | Count |
chocolate chip | 15 |
red velvet | 15 |
oatmeal raisin | 10 |
sugar cookies | 10 |
peanut butter | 5 |
cookies.barh('Cookie')
Visualization Note: Bar Order
Depending on the type of categorical variable we’re displaying, we may want to sort the bars of our bar charts differently.
Sort by bar length: e.g., if categorical variable has no natural order to the categories.
Sort by category: e.g., if categorical variable has an inherent ordering like alphabetical, numerical, etc.
Cookie | Count |
chocolate chip | 15 |
red velvet | 15 |
oatmeal raisin | 10 |
sugar cookies | 10 |
peanut butter | 5 |
Semester | Enrollment |
Fall 2020 | 70 |
Spring 2021 | 55 |
Fall 2021 | 80 |
Spring 2022 | 60 |
The bar order depends on what you want to express through your visualization.
sort()
The method tbl.sort(...) returns a new table with the rows sorted according to the values in some column. There are two ways we can call it:
Cookie | Count |
chocolate chip | 15 |
red velvet | 15 |
oatmeal raisin | 10 |
sugar cookies | 10 |
peanut butter | 5 |
cookies.sort('Count', descending = True)
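To see what sorting by a column in descending order does, here is a plain-Python sketch that uses a list of (Cookie, Count) tuples in place of the `cookies` table. The starting row order is made up for illustration; the values are the ones from the table above.

```python
# Rows of the cookies table, deliberately listed out of order.
rows = [
    ("peanut butter", 5),
    ("chocolate chip", 15),
    ("oatmeal raisin", 10),
    ("red velvet", 15),
    ("sugar cookies", 10),
]

# sorted() returns a new list and leaves `rows` untouched,
# just as tbl.sort(...) returns a new table.
by_count_desc = sorted(rows, key=lambda row: row[1], reverse=True)

for cookie, count in by_count_desc:
    print(cookie, count)
```

As with `tbl.sort(...)`, the original data is not modified; the sorted result has to be assigned to a name to be reused.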
Numerical Distributions, Histograms
Histograms
A histogram visualizes the distribution of a numerical variable by binning (counting the number of numerical values that fall within ranges, called “bins”).
tbl.hist(column)
np.arange()
The NumPy function np.arange() creates a sequence of numbers.
Array ranges work like indexing: inclusive of the starting position, and exclusive of the ending position.
Format | Returns |
np.arange(n) | An array of all integers from 0 to n-1. |
np.arange(start, stop) | An array of all integers from start to stop-1. |
np.arange(start, stop, step) | An array of all integers from start to stop-1, counting by step. |
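Python's built-in `range` follows the same start-inclusive, stop-exclusive rule, so it can be used to check the three call formats without NumPy. This is a sketch by analogy: `np.arange` returns an array rather than a range object, and unlike `range` it also accepts non-integer steps.

```python
# Same three call shapes as np.arange, using the built-in range.
up_to_n = list(range(5))            # 0 up to, but not including, 5
start_stop = list(range(2, 6))      # 2 up to, but not including, 6
with_step = list(range(1, 10, 3))   # 1, then every 3rd integer below 10

print(up_to_n, start_stop, with_step)
```

The stop value itself never appears in the result, which is why a call such as `np.arange(1, 11)` ends at 10.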
A Note on Bins
By looking at a histogram, we cannot tell how values are distributed within a bin.
All heights in this bin could be 64 inches.
Or they could all be 66 inches.
Or half could be 65 and half could be 67.
Unless we have the actual data, we can’t tell.
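The point about bins can be checked with a small counting sketch. The bin edges and the height lists below are hypothetical, and `count_in_bins` is an illustrative helper, not a library function. It treats every bin as including its left edge and excluding its right edge, which is a simplification of how plotting libraries handle the final bin.

```python
def count_in_bins(values, edges):
    """Count how many values fall in each [left, right) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

edges = [60, 64, 68, 72]          # hypothetical bin edges, in inches

all_64 = [64, 64, 64, 64]
all_66 = [66, 66, 66, 66]
half_65_half_67 = [65, 65, 67, 67]

print(count_in_bins(all_64, edges))
print(count_in_bins(all_66, edges))
print(count_in_bins(half_65_half_67, edges))
```

All three datasets produce identical counts, so their histograms would look the same even though the underlying values differ.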
Optional Exam Practice Problems: Question 3.2
Given a table tips with one column called "tips" containing the tip amount the server received for each order, write one line of code to generate the following histogram:
Quick Check
Solution
tips.hist('tips', bins = np.arange(1, 11))
Bar Charts vs. Histograms
Bar charts visualize the distribution of a categorical variable, or the relationship between a categorical variable and a numerical variable.
Histograms visualize the distribution of a numerical variable.
Scatter Plots
Scatter plots are used to visualize two numerical variables at once. To create a scatter plot from a table, you need two columns:
The resulting graph has one point for every row in your table.
.scatter()
tbl.scatter(column_for_x, column_for_y)
Line Plots
Line Plots and .plot()
What if we want to visualize two numerical variables, but one of them is time?
tbl.plot(column_for_x, column_for_y)
Scatter Plots vs. Line Plots
Scatter plots visualize the relationship between any two numerical variables.
Line plots visualize the relationship between two numerical variables when one of them is ordered, such as time.
Maps
Scatter Plot Maps
When we want to visualize the geographic locations of a lot of data points, it's often helpful to start with a scatter plot map.
Use px.scatter_geo(df, lat, lon)
data frame, latitude, longitude
Overcrowding!
Choropleth Maps
Choropleth maps are useful for visualizing numerical variables across different states or countries. In this sense they are analogous to bar charts, since they encode one categorical variable (state or country) and one numerical variable.
Aggregation!
.group()
Use px.choropleth(df, locations)
data frame, state abbreviations
Summary
Visualization | Description | Python |
Bar Chart | distribution of a categorical variable, or the relationship between a categorical variable and a numerical variable | tbl.barh(column_for_categories) |
Histogram | distribution of a numerical variable | tbl.hist(column) |
Scatter Plot | relationship between any two numerical variables | tbl.scatter(column_for_x, column_for_y) |
Line Plot | relationship between two numerical variables when one of them is ordered, such as time | tbl.plot(column_for_x, column_for_y) |
Table Manipulations
Table Properties
Table: a sequence of labeled columns
Row: one individual, one data point
Column: one attribute, one feature
Method 1: .take(row_index/array of index)
Method 2: Select, Drop, Relabel, Add Columns
*Above methods return new tables. The original table school is not changed!
*Adding a column whose label already exists replaces the old column's values with the new ones.
Method 3: Filtering with Where
school.where("Founded", 1869)
or: school.where("Founded", are.equal_to(1869))
school.where(label, predicate): returns a new table that contains only the rows whose label field/attribute satisfies the predicate
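The idea behind `where` can be sketched with a list of dictionaries, one per row. The rows, school names, and the helper `where_equal` below are all hypothetical stand-ins; the real `school` table and the datascience predicates are not reproduced here.

```python
# Hypothetical rows standing in for the school table.
school_rows = [
    {"Name": "School A", "Founded": 1869},
    {"Name": "School B", "Founded": 1901},
    {"Name": "School C", "Founded": 1869},
]

def where_equal(rows, label, value):
    """Return a new list with only the rows whose `label` field equals `value`."""
    return [row for row in rows if row[label] == value]

founded_1869 = where_equal(school_rows, "Founded", 1869)

print([row["Name"] for row in founded_1869])
print(len(school_rows))   # the original rows are unchanged
```

Like `school.where(...)`, the helper builds a new collection and leaves the original untouched.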
More examples: Lab 2
List of predicates
Method 4: .join()
table1.join('col1', table2, 'col2')
Practice Problem
data8_roster has the following 3 columns: (25 rows)
while englishr1a_roster has the following 5 columns: (35 rows)
Method 5: .group(column)
Practice: streams table
Fill in the blanks below to generate a table that contains the top 10 artists sorted by most songs
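The blanks for this problem are not reproduced here, but the underlying idea (count rows per artist, order by count, keep the top few) can be sketched in plain Python. This assumes the streams table has one row per song with an artist name; the artist names below are made up, and `top_artists` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical artist column: one entry per song in the streams table.
artists = [
    "Artist A", "Artist B", "Artist A", "Artist C",
    "Artist A", "Artist B", "Artist D",
]

def top_artists(artist_column, n):
    """Count songs per artist, then keep the n artists with the most songs."""
    counts = Counter(artist_column)      # plays the role of grouping by artist
    return counts.most_common(n)         # sorts by count descending and keeps n rows

top_two = top_artists(artists, 2)
print(top_two)
```

In table terms this corresponds roughly to grouping by the artist column, sorting the counts in descending order, and taking the first rows; the exact calls expected by the exercise may differ.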
Method 6: .pivot(columns, rows, values, collect)
Method 7: .apply(function)
t.apply(function, column_or_columns)
Practice Problem: assets table
We want to compute the Closing Price today of each commodity in the assets table using formula:
Closing Price Today = Closing Price Yesterday * (100% + Growth Today)
Step 1: Define a function that computes the closing price today
def get_today_price(yesterday_price, pct_str):
    # Get the growth rate as a float
    pct = __(a)__
    # Apply the formula
    today_price = __(b)__
    # Round the result to 2 decimal places and return
    return np.round(today_price, 2)
Step 2: compute closing price today for each commodity in the assets table:
prices_applied = assets.apply(??????????)
Practice Problem: assets table SOLUTION
Step 1: Define a function that computes the closing price today
def get_today_price(yesterday_price, pct_str):
    # Get the growth rate as a float
    pct = float(pct_str.replace('%', ''))
    # Apply the formula
    today_price = yesterday_price * (1 + pct/100)
    # Round the result to 2 decimal places and return
    return np.round(today_price, 2)
Step 2: compute closing price today for each commodity in the assets table:
prices_applied = assets.apply(get_today_price,
                              'Closing Price Yesterday',
                              'Growth Today')
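The solution can be tried without the assets table by applying the function to two parallel lists. In this sketch the built-in `round` stands in for `np.round`, and the prices and growth strings are hypothetical sample values, not data from the course.

```python
def get_today_price(yesterday_price, pct_str):
    # Strip the percent sign and convert the growth rate to a float, e.g. "5%" -> 5.0
    pct = float(pct_str.replace('%', ''))
    # Closing Price Today = Closing Price Yesterday * (100% + Growth Today)
    today_price = yesterday_price * (1 + pct / 100)
    # Built-in round is used here in place of np.round (an assumption for portability)
    return round(today_price, 2)

# Hypothetical columns standing in for 'Closing Price Yesterday' and 'Growth Today'.
closing_yesterday = [100.0, 50.0, 20.0]
growth_today = ["5%", "-2%", "0.5%"]

# Calling the function once per row is what assets.apply(...) does with two columns.
prices_applied = [
    get_today_price(price, growth)
    for price, growth in zip(closing_yesterday, growth_today)
]
print(prices_applied)
```

The order of the column labels passed to `apply` matters: the first column supplies `yesterday_price` and the second supplies `pct_str`.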
Control & Iteration
Booleans
Any expression that evaluates to True or False:
These all evaluate to False:
Boolean Expression Examples
'' == True -> False
(True or False) and (False or True) or 1/0 -> True
(5 and -1) and 0 -> 0 (a falsy value, so it behaves like False in a condition)
"abc" < "def" -> True
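These four expressions can be evaluated directly. The sketch below stores each result under an illustrative name so the values and their types can be inspected.

```python
# Hypothetical names; the right-hand sides are the slide's expressions.
empty_vs_true = ('' == True)

# `or` stops as soon as its left side is truthy, so 1/0 is never evaluated.
short_circuit = (True or False) and (False or True) or 1/0

# `and` returns the first falsy operand, or the last operand if all are truthy.
chained_and = (5 and -1) and 0

# Strings compare alphabetically, character by character.
string_compare = "abc" < "def"

print(empty_vs_true, short_circuit, chained_and, string_compare)
print(bool(chained_and))
```

The third expression evaluates to the integer 0 rather than the Boolean False; it only acts as False when used in a condition or passed to `bool`.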
If-Statements
if <boolean expression>:
    <if body>
elif <boolean expression>:
    <elif body>
else:
    <else body>
The elif and else clauses are optional.
If a <boolean expression> is True, the corresponding <... body> is run.
If all <boolean expression>(s) are False, <else body> is run.
If-Statement Example
def mystery(x, y):
    if x < 7:
        if y == "Berkeley":
            return "Yay!"
        else:
            return "Boo!"
    else:
        return y
What do the following return?
mystery(5, "Stanford") -> "Boo!"
mystery(9, "Berkeley") -> "Berkeley"
While Loops
“While the expression evaluates to True, run the body.”
* Make sure that your “<boolean expression>” eventually evaluates to False
while <boolean expression>:
    <body>
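A minimal while-loop sketch follows. The function `halvings` is a made-up example, not one from the course; it shows a condition that is guaranteed to become False because the loop variable shrinks on every pass.

```python
def halvings(n):
    """Count how many times n can be halved (integer division) before reaching 0."""
    count = 0
    while n > 0:
        n = n // 2    # n gets smaller each pass, so n > 0 eventually becomes False
        count += 1
    return count

print(halvings(8))
print(halvings(0))
```

If the body never changed `n`, the condition would stay True and the loop would never end.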
For Loops
for <element> in <sequence>:
    <for body>
“For each element in the sequence, run the body.”
* Sequence can be arrays, lists, strings, etc.
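Two short for-loop sketches, one over a string and one over a list. Both `count_vowels` and the `squares` list are illustrative examples, not taken from the course materials.

```python
def count_vowels(text):
    """Count the lowercase vowels in text by visiting one character at a time."""
    total = 0
    for ch in text:            # a string is a sequence of characters
        if ch in "aeiou":
            total += 1
    return total

squares = []
for n in [1, 2, 3]:            # a list is a sequence of elements
    squares.append(n ** 2)

print(count_vowels("berkeley"), squares)
```

In both loops the body runs once per element, and the loop ends on its own when the sequence is exhausted.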
While Loops Vs. For Loops
While Loops
For Loops
Practice!
Fill in the blanks in the fours_and_sevens(n) function so that it does the following given an integer input n:
Solution
Practice!
Fill in the blanks to create a function that:
Solution
Practice!
Fill in the two blanks in np.arange() so that the following code works as displayed.
Solution
Good Luck!