Join at slido.com #1308228
Week 3 Climate Survey Results
Based on 187 Responses
Visualization I
Visualizing distributions and KDEs
LECTURE 7
Data 100/Data 200, Spring 2024 @ UC Berkeley
Narges Norouzi and Joseph E. Gonzalez
Goals for this Lecture
Lecture 7, Data 100 Spring 2024
Understand the theories behind effective visualizations and start to generate plots of our own
Agenda
Goals of Visualization
Where Are We?
The data science lifecycle: Question & Problem Formulation, Data Acquisition, Exploratory Data Analysis, Prediction and Inference, and Reports, Decisions, and Solutions.
Part I: Processing Data: Data Wrangling, Intro to EDA, Working with Text Data, Regular Expressions
Part II: Visualizing and Reporting Data (today): Plots and variables, Seaborn, Viz principles, KDE/Transformations
Visualizations in Data 8 (and Data 100, so far)
You worked with many types of visualizations throughout Data 8.
Line plot
Scatter plot
Histogram from Homework #1
What did these achieve?
Goals of Data Visualization
Goal 1: To help your own understanding of your data/results.
Goal 2: To communicate results/conclusions to others.
What do these goals imply?
Visualizations aren't a matter of making "pretty" pictures.
We need to do a lot of thinking about what stylistic choices communicate ideas most effectively.
First half of visualization topics in Data 100: Choosing the "right" plot for
Second half of visualization topics in Data 100: Stylizing plots appropriately
Visualizing Distributions
Distributions
A distribution describes…
…for a single variable
Example: Distribution of faculty to different departments at Cal.
In other words: How is the variable distributed across all of its possible values?
This means that percentages should sum to 100% (if using proportions) and counts should sum to the total number of datapoints (if using raw counts).
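One way to check this property is sketched below with pandas. The department labels are hypothetical toy data, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical toy data: the department of each of six faculty members.
departments = pd.Series(["Stat", "CS", "CS", "Math", "CS", "Stat"])

counts = departments.value_counts()                     # raw count per department
proportions = departments.value_counts(normalize=True)  # fraction per department

# Counts sum to the number of datapoints; proportions sum to 1 (i.e. 100%).
print(counts.sum(), len(departments))
print(proportions.sum())
```

If either sum fails to match, the chart built from those numbers is probably not showing a distribution of a single variable.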
Let's see some examples.
Does this chart show a distribution?
No.
Does this chart show a distribution?
Yes!
Variable Types Should Inform Plot Choice
Different plots are more or less suited for displaying particular types of variables.
First step of visualization: Identify the variables being visualized. Then, select a plot type accordingly.
Bar Plots: Distributions of Qualitative Variables
Bar plots are the most common way of displaying the distribution of a qualitative variable.
*Sometimes quantitative discrete data too, if there are few unique values.
World Bank Dataset
We will be using the wb dataset about world countries for most of our work today.
Generating Bar Plots: Matplotlib
In Data 100, we will mainly use two libraries for generating plots: Matplotlib and Seaborn.
import matplotlib.pyplot as plt
plt.plotting_function(x_values, y_values)
Matplotlib is typically given the alias plt
Most Matplotlib plotting functions follow the same structure: We pass in a sequence (list, array, or Series) of values to be plotted on the x-axis, and a second sequence of values to be plotted on the y-axis.
To add labels and a title:
plt.xlabel("x axis label")
plt.ylabel("y axis label")
plt.title("Title of the plot");
Generating Bar Plots: Matplotlib
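To create a bar plot in Matplotlib: plt.bar()
continents = wb["Continent"].value_counts()
plt.bar(continents.index, continents.values);
Here continents.index supplies the x values (the category names) and continents.values supplies the y values (the counts).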
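The same pattern can be tried end to end without the real dataset. The sketch below assumes a small hypothetical DataFrame, wb_toy, standing in for wb. The Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for wb; only the "Continent" column matters here.
wb_toy = pd.DataFrame(
    {"Continent": ["Africa", "Asia", "Africa", "Europe", "Asia", "Africa"]}
)

# value_counts() returns a Series: index = category names, values = counts.
continents = wb_toy["Continent"].value_counts()

bars = plt.bar(continents.index, continents.values)
plt.xlabel("Continent")
plt.ylabel("Number of countries")
plt.title("Countries per continent (toy data)");
```

Note that the counting happens in pandas before plotting; plt.bar only draws the heights it is given.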
Generating Bar Plots: pandas Native Plotting
wb["Continent"].value_counts().plot(kind='bar')
To create a bar plot in native pandas: .plot(kind='bar')
Generating Bar Plots: Seaborn
import seaborn as sns
sns.plotting_function(data=df, x="x_col", y="y_col")
Seaborn is typically given the alias sns
Seaborn plotting functions use a different structure: Pass in an entire DataFrame, then specify what column(s) to plot.
To add labels and a title, use the same syntax as before:
plt.xlabel("x axis label")
plt.ylabel("y axis label")
plt.title("Title of the plot");
Generating Bar Plots: Seaborn
import seaborn as sns
sns.countplot(data=wb, x="Continent");
countplot operates at a higher level of abstraction!
You give it the entire DataFrame and it does the counting for you.
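A minimal runnable sketch of countplot follows, again on a hypothetical toy DataFrame (wb_toy) rather than the real wb data. Unlike the Matplotlib version, no call to value_counts() is needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for wb; only the "Continent" column matters here.
wb_toy = pd.DataFrame(
    {"Continent": ["Africa", "Asia", "Africa", "Europe", "Asia", "Africa"]}
)

# countplot counts the rows in each category itself and draws one bar per category.
ax = sns.countplot(data=wb_toy, x="Continent")

# The tallest bar should match the most frequent category's count.
tallest_bar = max(p.get_height() for p in ax.patches)
most_common_count = wb_toy["Continent"].value_counts().max()
print(tallest_bar, most_common_count)
```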
Distributions of Quantitative Variables
Earlier, we said that bar plots are appropriate for distributions of qualitative variables.
Why only qualitative? Why not quantitative as well?
A bar plot will create a separate bar for each unique value. This leads to too many bars for continuous data!
Distributions of Quantitative Variables
To visualize the distribution of a continuous quantitative variable:
Box plot
Violin plot
Histogram
Box plots and Violin Plots
Box plots and violin plots display distributions using information about quartiles.
sns.boxplot(data=df, y="y_variable");
sns.violinplot(data=df, y="y_variable");
Quartiles
For a quantitative variable:
The interval [first quartile, third quartile] contains the "middle 50%" of the data.
Interquartile range (IQR) measures spread.
The length of this region is the IQR
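As a small worked example, the quartiles and IQR can be computed with NumPy. The values array is hypothetical, and np.percentile uses linear interpolation by default, which is one of several conventions for computing quartiles.

```python
import numpy as np

# Hypothetical toy data.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First quartile, median (second quartile), and third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])

# The IQR is the length of the interval [first quartile, third quartile].
iqr = q3 - q1

print(q1, median, q3, iqr)  # 3.0 5.0 7.0 4.0
```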
Box Plots
sns.boxplot(data=wb, y="Gross domestic product: % growth : 2016")
First quartile (25th percentile)
Second quartile (median)
Third quartile (75th percentile)
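Upper whisker: extends to the largest datapoint at or below third quartile + 1.5*IQR
Lower whisker: extends to the smallest datapoint at or above first quartile - 1.5*IQR
Outliers: datapoints beyond the whiskers, drawn as individual points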
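The 1.5*IQR rule can be sketched by hand as follows. The data is a hypothetical toy array, and the quartiles use np.percentile's default linear interpolation, so the cutoffs could differ slightly under other quartile conventions.

```python
import numpy as np

# Hypothetical toy data with one unusually large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Cutoffs beyond which a datapoint is flagged as an outlier.
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

is_outlier = (values < lower_cutoff) | (values > upper_cutoff)
outliers = values[is_outlier]

# Whiskers stop at the most extreme datapoints that are not outliers.
whisker_low = values[~is_outlier].min()
whisker_high = values[~is_outlier].max()

print(lower_cutoff, upper_cutoff)  # -3.5 14.5
print(outliers)                    # [30]
print(whisker_low, whisker_high)   # 1 9
```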
Violin Plots
Violin plots are similar to box plots, but also show smoothed density curves.
sns.violinplot(data=wb, y="Gross domestic product: % growth : 2016")
Side-by-side Box and Violin Plots
What if we wanted to incorporate a qualitative variable as well? For example, compare the distribution of a quantitative continuous variable across different qualitative categories.
GDP growth: quantitative continuous
Continent: qualitative nominal
sns.boxplot(data=wb, x="Continent", y="Gross domestic product: % growth : 2016");
Histograms
A histogram:
The first bin has a width of $16,410 and a height of 4.77 × 10^-5.
This means that it contains 16410 × (4.77 × 10^-5) ≈ 0.783, or 78.3%, of all datapoints in the dataset.
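The "area equals proportion" reading of a density histogram can be checked numerically. In this sketch the data is randomly generated (a hypothetical stand-in for an income-like variable), and np.histogram with density=True plays the role of the plotted bar heights.

```python
import numpy as np

# Arithmetic from the example above: width times height gives a proportion.
first_bin_proportion = 16410 * 4.77e-5
print(round(first_bin_proportion, 3))  # 0.783

# Hypothetical data: 500 draws from a right-skewed distribution.
rng = np.random.default_rng(0)
data = rng.exponential(scale=10_000, size=500)

# density=True scales heights so that the total bar area is 1.
heights, edges = np.histogram(data, bins=10, density=True)
widths = np.diff(edges)

# Area of each bar = proportion of datapoints falling in that bin.
proportions = heights * widths

# Multiplying by the number of datapoints recovers the raw counts.
counts, _ = np.histogram(data, bins=10)
print(proportions.sum())
print(np.allclose(proportions * len(data), counts))
```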
How many observations are in the bin [110, 115) given the following information?
- There are 1174 observations in total.
- Width of bin [110, 115): 5
- Height of bar [110, 115): 0.02
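Answer
There are 1174 observations total.
The bar's area gives the proportion of observations in the bin: width × height = 5 × 0.02 = 0.1.
Number of observations in [110, 115): 0.1 × 1174 = 117.4, so about 117.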
Histograms in Code
In Matplotlib: plt.hist(x_values, density=True)
In Seaborn: sns.histplot(data=df, x="x_column", stat="density")
Overlaid Histograms
To compare a quantitative variable's distribution across qualitative categories, overlay histograms on top of one another.
The hue parameter of Seaborn plotting functions sets the column that should be used to determine color.
sns.histplot(data=wb, hue="Hemisphere", x="Gross national income…")
Always include a legend when color is used to encode information!
Interpreting Histograms
The skew of a histogram describes the direction in which its "tail" extends.
A histogram with no clear skew is called symmetric.
Skewed right: a long right tail
Skewed left: a long left tail
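A common rule of thumb, offered here as a supplement rather than as part of the lecture, is that a long tail pulls the mean toward it, so the mean tends to sit on the tail side of the median. The sketch below checks this on hypothetical randomly generated data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample: most values are small, with a long right tail.
right_skewed = rng.exponential(scale=1.0, size=10_000)

# Mirroring the sample flips the tail to the left.
left_skewed = -right_skewed

# Right skew: mean above the median. Left skew: mean below the median.
print(right_skewed.mean() > np.median(right_skewed))
print(left_skewed.mean() < np.median(left_skewed))
```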
Interpreting Histograms
The mode(s) of a histogram are the peak values in the distribution.
Unimodal
Bimodal?
Interlude
Made by DALL-E
Lecture 7 ended here!
We will cover the rest in lecture 8
Kernel Density Estimation
Kernel Density Estimation: Intuition
Often, we want to identify general trends across a distribution, rather than focus on detail. Smoothing a distribution helps generalize the structure of the data and eliminate noise.
A KDE curve
Idea: approximate the probability distribution that generated the data.
Kernel Density Estimation: Process
Idea: Approximate the probability distribution that generated the data.
A kernel is a function that tries to capture the randomness of our sampled data.
A datapoint in our dataset
The kernel models the probability of us sampling that datapoint.
Area below integrates to 1
Step 1️⃣ – Place a Kernel at Each Data Point
Consider a fake dataset with just five collected datapoints.
Each line represents a datapoint in the dataset
(e.g. one country’s HIV rate).
Place a kernel on top of each datapoint.
Step 2️⃣ – Normalize Kernels
In Step 3, we will sum these kernels to produce a probability distribution. For the sum to have total area 1, first scale each kernel by 1/n, where n is the number of datapoints (here, n = 5).
Each kernel has area 1.
Each normalized kernel has area ⅕.
Step 3️⃣ – Sum the Normalized Kernels
At each point in the distribution, add up the values of all kernels. This gives us a smooth curve with area 1 – an approximation of a probability distribution!
Sum these five normalized curves together.
The final KDE curve.
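The three steps can be sketched by hand in NumPy. This is a minimal illustration under stated assumptions, not the lecture's own code: the five datapoints and the bandwidth alpha = 1 are hypothetical, a Gaussian kernel is assumed, and gaussian_kernel is a helper defined only for this sketch.

```python
import numpy as np

def gaussian_kernel(x, x_i, alpha):
    """Gaussian density centered at x_i with standard deviation alpha."""
    return np.exp(-((x - x_i) ** 2) / (2 * alpha**2)) / np.sqrt(2 * np.pi * alpha**2)

# Hypothetical dataset of five datapoints and an assumed bandwidth.
points = np.array([2.0, 4.0, 5.0, 6.0, 9.0])
alpha = 1.0

# Grid of x values on which to evaluate the curves.
x = np.linspace(-10, 20, 3001)
dx = x[1] - x[0]

# Step 1: place a kernel on each datapoint (one row per datapoint).
kernels = gaussian_kernel(x[None, :], points[:, None], alpha)

# Step 2: normalize each kernel by the number of datapoints.
normalized = kernels / len(points)

# Step 3: sum the normalized kernels at every x to get the KDE curve.
kde = normalized.sum(axis=0)

# Approximate areas with a Riemann sum over the grid.
area_per_kernel = normalized.sum(axis=1) * dx  # each close to 1/5
total_area = kde.sum() * dx                    # close to 1
print(area_per_kernel, total_area)
```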
Result
Each line represents a datapoint in the dataset
(e.g. one country’s HIV rate).
The density at each point corresponds to the KDE calculated based on kernels placed on all data points
Summary of KDE
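The general KDE formula:
f_𝝰(x) = (1/n) Σ_{i=1}^{n} K_𝝰(x, x_i)
1️⃣ K_𝝰(x, x_i): the kernel centered on the datapoint x_i (for example, K_1(x, 2) and K_1(x, 6) are kernels with 𝝰 = 1 centered on 2 and 6).
2️⃣ 1/n: normalizes each of the n kernels.
3️⃣ Σ: sums the normalized kernels.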
𝝰 is the bandwidth or smoothing parameter.
Kernels
A kernel (for our purposes) is a valid density function, meaning that it is non-negative for all inputs and integrates to 1.
The most common kernel is the Gaussian kernel:
K_𝝰(x, x_i) = (1 / √(2π𝝰²)) · exp(−(x − x_i)² / (2𝝰²))
Memorizing this formula is less important than knowing the shape and how the bandwidth parameter 𝝰 smooths the KDE.
Effect of Bandwidth on KDEs
Bandwidth is analogous to the width of each bin in a histogram.
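To see the effect numerically, the sketch below builds a Gaussian KDE at two bandwidths and counts the peaks in each curve. The datapoints, the two bandwidth values, and the helpers gaussian_kde_curve and count_peaks are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_curve(x, points, alpha):
    """Average of Gaussian kernels (standard deviation alpha) centered on each point."""
    kernels = np.exp(-((x[None, :] - points[:, None]) ** 2) / (2 * alpha**2))
    kernels = kernels / np.sqrt(2 * np.pi * alpha**2)
    return kernels.mean(axis=0)

def count_peaks(curve):
    """Count strict local maxima in the interior of a sampled curve."""
    interior = curve[1:-1]
    return int(np.sum((interior > curve[:-2]) & (interior > curve[2:])))

# Hypothetical dataset and evaluation grid.
points = np.array([2.0, 4.0, 5.0, 6.0, 9.0])
x = np.linspace(-10, 20, 3001)

# Small bandwidth: each datapoint keeps its own narrow spike.
narrow = gaussian_kde_curve(x, points, alpha=0.1)

# Large bandwidth: the kernels blur together into one smooth bump.
wide = gaussian_kde_curve(x, points, alpha=5.0)

print(count_peaks(narrow), count_peaks(wide))  # 5 1
```

As with histogram bin widths, a bandwidth that is too small shows noise and one that is too large hides structure.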
Other Kernels: Boxcar
As an example of another kernel, consider the boxcar kernel.
A boxcar kernel centered on xi = 4 with 𝝰 = 2.
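A boxcar kernel can be sketched as below. The definition is assumed here because the formula is not reproduced in the text: the kernel takes the constant value 1/alpha within alpha/2 of its center and is 0 elsewhere, so its area is 1. The helper name boxcar_kernel is also an assumption.

```python
import numpy as np

def boxcar_kernel(x, x_i, alpha):
    """Assumed boxcar definition: height 1/alpha within alpha/2 of x_i, else 0."""
    return np.where(np.abs(x - x_i) <= alpha / 2, 1 / alpha, 0.0)

# The example from the text: centered on x_i = 4 with alpha = 2.
# Under this definition it is nonzero on [3, 5] with height 0.5.
sample_x = np.array([2.0, 3.5, 4.0, 4.5, 6.0])
sample_heights = boxcar_kernel(sample_x, x_i=4, alpha=2)
print(sample_heights)  # [0.  0.5 0.5 0.5 0. ]

# Approximate the area with a Riemann sum; it should be close to 1.
x = np.linspace(0, 8, 8001)
dx = x[1] - x[0]
area = boxcar_kernel(x, x_i=4, alpha=2).sum() * dx
print(area)
```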
Which of the following are valid kernel density plots?
Have a Normal Day!
Content credit: Acknowledgments