Visualization II
KDEs, Transformations, and Visualization Theory
LECTURE 8
Data 100/Data 200, Spring 2024 @ UC Berkeley
Narges Norouzi and Joseph E. Gonzalez
Goals for this Lecture
Where Are We?
Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions
(Part I: Processing Data) Data Wrangling, Intro to EDA, Working with Text Data, Regular Expressions
(Part II: Visualizing and Reporting Data) Plots and variables, Seaborn, Viz principles, KDE/Transformations (today)
Agenda
Plotting Distributions - Revisited
Kernel Density Estimation: Intuition
Often, we want to identify general trends across a distribution, rather than focus on detail. Smoothing a distribution helps generalize the structure of the data and eliminate noise.
A KDE curve
Idea: approximate the probability distribution that generated the data.
Kernel Density Estimation: Process
Idea: Approximate the probability distribution that generated the data.
A kernel is a function that tries to capture the randomness of our sampled data.
A datapoint in our dataset
The kernel models the probability of us sampling that datapoint.
Area below integrates to 1
Step 1️⃣ – Place a Kernel at Each Data Point
Consider a fake dataset with just five collected datapoints.
Each line represents a datapoint in the dataset (e.g. one country’s HIV rate).
Place a kernel on top of each datapoint.
sns.rugplot(points, height=0.5)
Step 2️⃣ – Normalize Kernels
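In Step 3, we will sum these kernels to produce a probability distribution. For the result to have total area 1, each kernel must first be scaled by 1/n, where n is the number of datapoints.
Before normalizing, each kernel has area 1.
After normalizing, each of the five kernels has area ⅕.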
Step 3️⃣ – Sum the Normalized Kernels
At each point in the distribution, add up the values of all kernels. This gives us a smooth curve with area 1 – an approximation of a probability distribution!
Sum these five normalized curves together.
The final KDE curve.
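The three steps can be sketched directly in NumPy. This is a minimal illustration rather than seaborn's implementation: the five values in `points`, the bandwidth `alpha = 1`, and the helper name `gaussian_kernel` are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(x, center, alpha):
    """Gaussian density centered on `center` with standard deviation `alpha`."""
    return np.exp(-((x - center) ** 2) / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)

# Hypothetical dataset of five points (not the values from the slides).
points = np.array([2.0, 4.0, 5.0, 6.0, 9.0])
alpha = 1.0
xs = np.linspace(-5, 16, 2001)

# Step 1: place a kernel on each datapoint (one row per datapoint).
kernels = np.array([gaussian_kernel(xs, p, alpha) for p in points])

# Step 2: normalize so that each kernel has area 1/n.
normalized = kernels / len(points)

# Step 3: sum the normalized kernels to get the KDE curve.
kde = normalized.sum(axis=0)

# Riemann-sum check that the curve encloses an area of about 1.
dx = xs[1] - xs[0]
area = kde.sum() * dx
print(round(area, 3))
```

Plotting `kde` against `xs` should give a curve comparable to what `sns.kdeplot` draws, although seaborn picks its bandwidth automatically.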
Result
Each line represents a datapoint in the dataset (e.g. one country’s HIV rate).
The height of the curve at each point is the sum of the normalized kernels placed on all of the datapoints.
Summary of KDE
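A general “KDE formula”:
$$\hat{f}_\alpha(x) = \frac{1}{n} \sum_{i=1}^{n} K_\alpha(x, x_i)$$
1️⃣ $K_\alpha(x, x_i)$ is the kernel centered on datapoint $x_i$, e.g. $K_1(x, 2)$ and $K_1(x, 6)$.
2️⃣ Multiplying by $\frac{1}{n}$ normalizes the kernels.
3️⃣ The sum adds up the normalized kernels.
𝝰 is the bandwidth or smoothing parameter.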
Kernels
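A kernel (for our purposes) is a valid density function, meaning it is non-negative for all inputs and it integrates to 1.
The most common kernel is the Gaussian kernel:
$$K_\alpha(x, x_i) = \frac{1}{\sqrt{2\pi\alpha^2}} e^{-\frac{(x - x_i)^2}{2\alpha^2}}$$
Memorizing this formula is less important than knowing the shape and how the bandwidth parameter 𝝰 smooths the KDE.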
Effect of Bandwidth on KDEs
Bandwidth is analogous to the width of each bin in a histogram.
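One rough way to see the effect of bandwidth without a plot is to count the peaks of the KDE curve as the bandwidth grows. The sketch below assumes a Gaussian kernel; the dataset and the helpers `gaussian_kde_curve` and `count_peaks` are hypothetical names introduced for illustration.

```python
import numpy as np

def gaussian_kde_curve(xs, data, alpha):
    """Average of Gaussian kernels (standard deviation `alpha`) centered on each datapoint."""
    diffs = xs[:, None] - data[None, :]
    kernels = np.exp(-diffs ** 2 / (2 * alpha ** 2)) / np.sqrt(2 * np.pi * alpha ** 2)
    return kernels.mean(axis=1)

def count_peaks(curve):
    """Number of strict local maxima in a sampled curve."""
    interior = curve[1:-1]
    return int(np.sum((interior > curve[:-2]) & (interior > curve[2:])))

data = np.array([2.0, 4.0, 5.0, 6.0, 9.0])  # hypothetical datapoints
xs = np.linspace(-10, 20, 3001)

# Small bandwidth: one spike per datapoint. Large bandwidth: one broad bump.
peaks = {alpha: count_peaks(gaussian_kde_curve(xs, data, alpha))
         for alpha in (0.1, 1.0, 10.0)}
print(peaks)
```

A very small bandwidth behaves like a histogram with very narrow bins (noisy, one spike per point), while a very large one behaves like a single wide bin that hides the structure.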
Other Kernels: Boxcar
As an example of another kernel, consider the boxcar kernel.
A boxcar kernel centered on xi = 4 with 𝝰 = 2.
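The boxcar formula itself is not reproduced in the text, so the sketch below assumes a common definition: height 1/𝝰 on a window of total width 𝝰 centered on the datapoint, and 0 elsewhere. The function name `boxcar_kernel` is hypothetical.

```python
import numpy as np

def boxcar_kernel(x, center, alpha):
    """Assumed boxcar kernel: height 1/alpha within alpha/2 of `center`, else 0."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x - center) <= alpha / 2, 1.0 / alpha, 0.0)

# The example from the slide: centered on 4 with bandwidth 2.
xs = np.linspace(0, 8, 8001)
values = boxcar_kernel(xs, center=4, alpha=2)

# Riemann-sum check that the kernel is a valid density (area close to 1).
dx = xs[1] - xs[0]
area = values.sum() * dx
print(values.max(), round(area, 2))
```

Under this definition the kernel is a rectangle of height 0.5 over the interval from 3 to 5, so it is non-negative and has area 1.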
Which of the following are valid kernel density plots?
Plotting Distributions - Revisited
displot
displot is a wrapper for histplot, kdeplot, and ecdfplot to plot distributions.
sns.displot(data=wb, x="gni", kind="hist", stat="density")
sns.displot(data=wb, x="gni", kind="kde")
sns.displot(data=wb, x="gni", kind="ecdf")
ECDF: Empirical Cumulative Distribution Function
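As a rough sketch of what an ECDF plot shows, the curve can be computed by sorting the data and recording the fraction of observations at or below each value. The helper name `ecdf` and the sample values are assumptions for illustration, not part of seaborn.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and, for each, the fraction of observations up to it.

    For tied values, the last (largest) fraction is the ECDF value.
    """
    xs = np.sort(np.asarray(values, dtype=float))
    fractions = np.arange(1, len(xs) + 1) / len(xs)
    return xs, fractions

sample = [3.0, 1.0, 4.0, 1.0, 5.0]  # hypothetical observations
xs, fractions = ecdf(sample)
print(xs)
print(fractions)
```

The curve always ends at 1 because every observation is less than or equal to the maximum.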
Relationships between Quantitative Variables
From Distributions to Relationships
Up until now, we focused exclusively on visualizing variable distributions.
Now we will visualize relationships between variables. In other words, how do sets of two (or more) variables vary in relation to one another?
Scatter Plots
Scatter plots are used to reveal relationships between pairs of numerical variables.
Example patterns: simple linear; simple nonlinear; linear, spreading (the relationship appears linear, but with increasing spread as x gets larger); v-shaped.
plt.scatter(x_values, y_values)
sns.scatterplot(data=df, x="x_column", y="y_column", hue="hue_column")
Overplotting
The plot on the previous slide suffered from overplotting – scatter points all stacked on top of one another are difficult to see.
Jittering: adding a small amount of random noise to all x and y values to slightly move each scatter point. Main trends are still present, but individual datapoints are easier to distinguish.
x_noise = np.random.uniform(-1, 1, len(wb))
y_noise = np.random.uniform(-5, 5, len(wb))
plt.scatter(wb['% growth'] + x_noise,
            wb['Literacy rate: Female'] + y_noise,
            s=15);
Decreasing point size also helps. s specifies the marker size in Matplotlib.
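To see why jittering helps, the sketch below counts how many distinct positions a scatter plot would actually draw before and after adding noise. The data are synthetic (rounded values that overlap heavily), and the jitter range of ±0.2 is an arbitrary choice for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical rounded measurements: many points share identical coordinates.
x = rng.integers(0, 5, size=200).astype(float)
y = rng.integers(0, 5, size=200).astype(float)

# Jitter: small uniform noise, kept well below the spacing between values.
x_jit = x + rng.uniform(-0.2, 0.2, size=x.size)
y_jit = y + rng.uniform(-0.2, 0.2, size=y.size)

# Count the distinct (x, y) positions before and after jittering.
unique_before = len(set(zip(x, y)))
unique_after = len(set(zip(x_jit, y_jit)))
print(unique_before, unique_after)
```

Keeping the noise small relative to the spacing of the data preserves the overall pattern while separating points that would otherwise sit exactly on top of each other.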
Scatter Plot Alternatives
Seaborn includes several built-in functions for making more complex scatter plots.
sns.lmplot(data=df, x="x_column", y="y_column")
sns.jointplot(data=df, x="x_column", y="y_column")
Hex Plots
Rather than plot individual datapoints, plot the density of their joint distribution.
Can be thought of as a two dimensional histogram.
sns.jointplot(data=df, x="x_column", y="y_column", kind="hex")
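The "two-dimensional histogram" idea can be sketched with `np.histogram2d`, which counts points in rectangular bins. This swaps in square bins for the hexagons that seaborn draws, and the data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with a positive linear trend.
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.5, size=1000)

# Count how many points land in each cell of a 10 x 10 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)
print(counts.shape)
print(int(counts.sum()))
```

A hex plot colors each cell by a count like the ones in `counts`, so darker cells mark regions where many points overlap.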
Contour Plots
A two-dimensional version of a KDE plot.
Similar to a topographic map: each contour line traces points that share the same estimated density. Darker colors indicate more datapoints in the region.
sns.kdeplot(data=df, x="x_column", y="y_column", fill=True)
Dark color → many datapoints
Summary
Next, we’ll go deeper into the theory behind visualization.
Transformations
Visualization Theory
Remember our goals of visualization:
These are influenced by our choice of visualization and our choices in how to prepare data for visualization.
What problems are there here?
We often transform a dataset to help prepare it for being visualized.
Linearization
When applying transformations, we often want to linearize the data – rescale the data so the x and y variables share a linear relationship.
Why?
Applying Transformations
What makes this plot non-linear?
2. Many large y values are all clumped together, compressing the vertical axis.
Applying Transformations
What makes this plot non-linear?
Resolve by log-transforming the x data:
Applying Transformations
What makes this plot non-linear?
2. Many large y values are all clumped together, compressing the vertical axis.
Resolve by power-transforming the y data:
Interpreting Transformed Data
Now, we see a linear relationship between the transformed variables.
This tells us about the underlying relationship between the original x and y!
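As a sketch of how a linear fit on transformed variables translates back to the original ones, the example below assumes the power transform is a square, so that y² is linear in log(x). The data are synthetic and noise-free, and the slope and intercept values are made up for illustration.

```python
import numpy as np

# Hypothetical noise-free data generated so that y**2 is linear in log(x).
true_slope, true_intercept = 3.0, 2.0
x = np.linspace(1, 500, 200)
y = np.sqrt(true_slope * np.log(x) + true_intercept)

# Correlation before and after transforming both variables.
r_raw = np.corrcoef(x, y)[0, 1]
r_transformed = np.corrcoef(np.log(x), y ** 2)[0, 1]

# Fit a line to the transformed variables: y^2 = slope * log(x) + intercept.
slope, intercept = np.polyfit(np.log(x), y ** 2, deg=1)

# Invert the transformation to predict y on the original scale.
y_hat = np.sqrt(slope * np.log(x) + intercept)
print(round(r_raw, 3), round(r_transformed, 3))
print(round(slope, 3), round(intercept, 3))
```

Solving the fitted line for y gives y = sqrt(slope · log(x) + intercept), which describes the relationship between the original x and y.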
Tukey-Mosteller Bulge Diagram
The Tukey-Mosteller Bulge Diagram is a guide to possible transforms to try to get linearity.
You should still understand the logic we just worked through to decide how to transform the data. The bulge diagram is just a summary.
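One way to check a transform suggested by the diagram is to compare how strongly each candidate correlates with y. The sketch below uses synthetic data with a concave, increasing shape and treats the correlation coefficient as a rough measure of linearity; the candidate list is an assumption for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y grows like log(x), giving a concave, increasing curve.
x = rng.uniform(1, 100, size=300)
y = 2.0 * np.log(x) + rng.normal(scale=0.1, size=x.size)

# Candidate transforms of x to try.
candidates = {
    "x": x,
    "sqrt(x)": np.sqrt(x),
    "log(x)": np.log(x),
}

# Correlation of each transformed x with y; closer to 1 means closer to linear.
correlations = {name: np.corrcoef(values, y)[0, 1] for name, values in candidates.items()}
best = max(correlations, key=lambda name: abs(correlations[name]))
print(best)
```

A scatter plot of the transformed data is still worth checking, since a high correlation alone does not guarantee a linear pattern.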
If the data bulges like this…
…transform y by this
…or transform x by this
Applying to the data from before:
Could have transformed y by y², y³
Could have transformed x by log(x), sqrt(x)
Visualization Theory
Visualizations Are For Humans
“Looks like older people didn’t spend more money on tickets for the Titanic than younger people.”
(Note: A histogram or KDE would give stronger evidence than a scatter plot.)
Visualizations Are More Expressive than Summary Statistics
Each of these 13 datasets has the same mean, standard deviation, and correlation coefficient.
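The same point can be checked numerically with a smaller classic example. The sketch below uses the first two datasets of Anscombe's quartet, swapped in here for the 13 datasets shown on the slide; the helper name `summarize` is hypothetical.

```python
import numpy as np

# First two datasets of Anscombe's quartet (they share the same x values).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def summarize(x, y):
    """Mean and sample standard deviation of y, plus the x-y correlation."""
    return np.mean(y), np.std(y, ddof=1), np.corrcoef(x, y)[0, 1]

summary1 = summarize(x, y1)
summary2 = summarize(x, y2)
print(np.round(summary1, 2))
print(np.round(summary2, 2))
```

The two summaries agree to about two decimal places, yet a scatter plot shows a noisy line for the first dataset and a smooth curve for the second.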
Visualizations complement statistics.
Information Channels
Take Advantage of the Human Visual Perception System
Data can be visualized in many ways!
Rug Plot: Encoding 1 Variable
Encoding (maps datum to visual position): 10px, 16px, 11px, NONE, 11px, 15px, ...
Mark (represents a datum)
Rug Plot: Different Marks
Encoding (maps datum to visual position): 10px, 16px, 11px, NONE, 11px, 15px, ...
Mark (represents a datum)
Scatter Plot: Encoding 2 Variables
Encoding (maps datum to visual position): (10px, 7px), (70px, 60px), (45px, 9px), (5px, 24px), (45px, 37px), (66px, 8px), ...
Mark (represents a datum)
Going Beyond: Encoding 3+ Variables
How many variables are we encoding here?
Answer: 4.
We could add even more: shapes, outline colors of shapes, shading, etc. There are infinite possibilities!
Abusing Encodings: Length
There are many things that can go wrong in a visualization. For example, the visualization below abuses the length channel:
For the next huge chunk of today’s lecture, we’ll dive into ways to properly use other aspects of a visualization:
(Note: this is a very famous paper, but it is not clear why Mackinlay thinks the bar chart would suggest USA cars are longer.)
Harnessing X/Y
Case Study: Planned Parenthood Hearing
In 2015, Planned Parenthood was accused of selling aborted fetal tissue for profit.
Congressman Chaffetz (R-UT) showed this plot which originally appeared in a report by Americans United for Life.
Keep Axis Scales Consistent
The scales for the two lines are completely different!
In 2013:
Do not use two different scales for the same axis!
Always Consider the Scale When Comparing "Similar" Data
The top plot draws all of the data on the same scale.
We could also visualize abortions and cancer screenings as a percentage of total procedures.
Reveal the Data
Recommendations:
Terrible White House COVID-19 visualization:
Harnessing Color
Perception of Color
Choosing a set of colors which work together is a challenging task!
Download the Color Oracle App to simulate common color vision impairments.
Colormaps
Jet
Viridis
The Jet/Rainbow Colormap Actively Misleads
"Rainbow Colormap (Still) Considered Harmful", Borland and Taylor, 2007.
Use a Perceptually Uniform Colormap!
x-axis is color, y-axis is “lightness”
Viridis: slope is constant
Jet: bounces all over
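The difference between the two colormaps can be checked roughly in code. The sketch below uses relative luminance (computed from linearized sRGB values) as a stand-in for the perceptual lightness plotted on the slide, which is an approximation; the helper names are hypothetical, and `plt.get_cmap` is used only to look up the colormap colors.

```python
import numpy as np
from matplotlib import pyplot as plt

def relative_luminance(cmap_name, n=256):
    """Approximate lightness of `n` evenly spaced colors from a named colormap."""
    rgb = plt.get_cmap(cmap_name)(np.linspace(0, 1, n))[:, :3]
    # Undo the sRGB gamma encoding, then weight the channels.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    return linear @ np.array([0.2126, 0.7152, 0.0722])

def direction_changes(values):
    """How many times a sequence switches between increasing and decreasing."""
    signs = np.sign(np.diff(values))
    signs = signs[signs != 0]
    return int(np.sum(signs[1:] != signs[:-1]))

viridis = relative_luminance("viridis")
jet = relative_luminance("jet")
print(direction_changes(viridis), direction_changes(jet))
```

Viridis gets steadily lighter from one end to the other, while Jet brightens and darkens more than once, so equal steps in the data do not look like equal steps in color.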
Except When Not :) The Google Turbo Colormap
X-axis is color, y-axis is “lightness”
Use Color to Highlight Data Type
The plot on the right has both distinctions!
Sequential vs. Diverging Colormaps for Quantitative Data
If the data progresses from low to high, use a sequential scheme where lighter colors are for more extreme values.
If low and high values deserve equal emphasis, use a diverging scheme where lighter colors represent middle values.
Default matplotlib Colormaps
Taken from matplotlib documentation.
Harnessing Markings
Perception of Markings
The accuracy of our judgments depends on the type of marking.
How much longer is the long bar?
The long bar is 7 times longer than the short bar.
How much bigger is the big circle?
The area of the big circle is 7 times larger than the area of the small circle.
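A quick calculation hints at why the circle comparison is harder than the bar comparison: a circle's area grows with the square of its radius, so a 7-fold difference in area corresponds to a much smaller difference in width. This is one plausible explanation, not a claim made on the slide.

```python
import math

# The area ratio stated above.
area_ratio = 7

# Area scales with radius squared, so the radius (and diameter) ratio is the square root.
radius_ratio = math.sqrt(area_ratio)
print(round(radius_ratio, 2))  # → 2.65
```

The big circle is only about 2.65 times as wide as the small one, while the long bar is a full 7 times as long as the short bar.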
Lengths Are Easy to Distinguish. Others, Like Angles, Are Hard.
Don’t use pie charts! Visual angle judgments are inaccurate.
Areas Are Hard to Distinguish
(South Africa has twice the GDP of Algeria, but that isn’t clear from the areas.)
Avoid area charts! Visual area judgments are inaccurate.
Avoid word clouds too!
It’s hard to tell the area taken up by a word.
…that being said, if you are not trying to make quantifiable comparisons, then word clouds are useful for “the idea.”
Avoid "Jiggling" the Baseline!
Stacked bar charts, histograms, and area charts are hard to read because the baseline moves ("jiggles").
In the second plot:
In the first plot:
Avoid Jiggling the Baseline
Here, by switching to a line plot, comparisons are made much easier.
Harnessing Conditioning
Use Conditioning to Aid Comparison
This data comes from the Bureau of Labor Statistics, which oversees surveys regarding the economic health of the US. They have plotted median weekly earnings for men and women by education level.
How could we more easily make this difficult comparison?
Having two separate lines makes clear the wage difference between men and women.
How Does the Income Gap Increase with Education?
See notebook for how to get this figure with groupby!
Other Notes: Superposition vs. Juxtaposition
Superposition: placing multiple density curves or scatter plots on top of each other (what we’ve usually been doing).
Juxtaposition: placing multiple plots side by side, with the same scale (called “small multiples”) (see left).
An example of small multiples.
Harnessing Context (for Publication)
Getting Ready for Publication
Publication-Ready: Add Context Directly to Plot
A publication-ready plot needs:
The plots you create in this class always need titles and axis labels.
Publication-Ready: Captions
A picture is worth a thousand words, but not all thousand words you want to tell may be in the picture. In many cases, we need captions to help tell the story:
Content credit: Acknowledgments