1 of 55

Join at slido.com #1308228

2 of 55

Week 3 Climate Survey Results

Based on 187 Responses

3 of 55

Visualization I

Visualizing distributions and KDEs

LECTURE 7

Data 100/Data 200, Spring 2024 @ UC Berkeley

Narges Norouzi and Joseph E. Gonzalez

4 of 55

Goals for this Lecture

Lecture 7, Data 100 Spring 2024

Understand the theories behind effective visualizations and start to generate plots of our own

The necessary "pre-thinking" before creating a plot
Python libraries for visualizing data

5 of 55

Agenda

Visualization

Goals of visualization
Visualizing distributions
Kernel density estimation

6 of 55

Goals of Visualization

7 of 55

Where Are We?

Question & Problem Formulation
Data Acquisition
Exploratory Data Analysis
Prediction and Inference
Reports, Decisions, and Solutions

Part I: Processing Data

Data Wrangling
Intro to EDA
Working with Text Data
Regular Expressions

Part II: Visualizing and Reporting Data (today)

Plots and variables
Seaborn
Viz principles
KDE/Transformations

8 of 55

Visualizations in Data 8 (and Data 100, so far)

You worked with many types of visualizations throughout Data 8.

Line plot

Scatter plot

Histogram from Homework #1

What did these achieve?

Provided a high-level overview of a complex dataset.
Communicated trends to viewers.

9 of 55

Goals of Data Visualization

Goal 1: To help your own understanding of your data/results.

Key part of exploratory data analysis.
Summarize trends visually before in-depth analysis.
Lightweight, iterative and flexible.

Goal 2: To communicate results/conclusions to others.

Highly editorial and selective.
Be thoughtful and careful!
Fine-tuned to achieve a communications goal.
Considerations: clarity, accessibility, and necessary context.

What do these goals imply?

Visualizations aren't a matter of making "pretty" pictures.

We need to do a lot of thinking about what stylistic choices communicate ideas most effectively.

10 of 55

Goals of Data Visualization

First half of visualization topics in Data 100: Choosing the "right" plot for

Introducing plots for different variable types
Generating these plots through code

Second half of visualization topics in Data 100: Stylizing plots appropriately

Smoothing and transforming visual data
Providing context through labeling and color

11 of 55

Visualizing Distributions

12 of 55

Distributions

A distribution describes, for a single variable:

The set of values that the variable can possibly take.
The frequency with which each value occurs.

Example: Distribution of faculty to different departments at Cal.

The list of departments at Cal.
The number of faculty in each department.

In other words: How is the variable distributed across all of its possible values?

This means that percentages should sum to 100% (if using proportions) and counts should sum to the total number of datapoints (if using raw counts).
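
As a quick check of this property, the sketch below counts a small made-up column with pandas `value_counts` and confirms that counts sum to the number of datapoints and proportions sum to 1. The `faculty` Series and its department labels are hypothetical, not real Cal data.

```python
import pandas as pd

# Hypothetical data: one department label per faculty member.
faculty = pd.Series(
    ["Statistics", "EECS", "EECS", "History", "Statistics", "EECS"],
    name="department",
)

counts = faculty.value_counts()                     # raw counts per value
proportions = faculty.value_counts(normalize=True)  # fraction per value

print(counts)
print(proportions)

# A distribution accounts for every datapoint exactly once.
total_count = counts.sum()
total_proportion = proportions.sum()
```

If an individual could appear in more than one category, these totals would exceed the number of datapoints (or 100%), and the chart would no longer show a distribution.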

Let's see some examples.

13 of 55

Does this chart show a distribution?

14 of 55

Does this chart show a distribution?

No.

The chart does show percents of individuals in different categories!
But, this is not a distribution because individuals can be in more than one category (see the fine print).

15 of 55

Does this chart show a distribution?

16 of 55

Does this chart show a distribution?

Yes!

This chart shows the distribution of the qualitative ordinal variable "income tier."
Each individual is in exactly one category.
The values we see are the proportions of individuals in that category.
Everyone is represented, as the total percentage is 100%.

17 of 55

Variable Types Should Inform Plot Choice

Different plots are more or less suited for displaying particular types of variables.

First step of visualization: Identify the variables being visualized. Then, select a plot type accordingly.

18 of 55

Bar Plots: Distributions of Qualitative Variables

Bar plots are the most common way of displaying the distribution of a qualitative variable.

For example, the proportion of adults in the upper, middle, and lower classes.
Lengths encode values.

Widths encode nothing!
Color could indicate a sub-category (but not necessarily).

Bar plots are sometimes used for quantitative discrete data too, if there are few unique values.

19 of 55

World Bank Dataset

We will be using the wb dataset about world countries for most of our work today.

20 of 55

Generating Bar Plots: Matplotlib

In Data 100, we will mainly use two libraries for generating plots: Matplotlib and Seaborn.

import matplotlib.pyplot as plt

plt.plotting_function(x_values, y_values)

Matplotlib is typically given the alias plt

Most Matplotlib plotting functions follow the same structure: We pass in a sequence (list, array, or Series) of values to be plotted on the x-axis, and a second sequence of values to be plotted on the y-axis.

To add labels and a title:

plt.xlabel("x axis label")

plt.ylabel("y axis label")

plt.title("Title of the plot");

21 of 55

Generating Bar Plots: Matplotlib

To create a bar plot in Matplotlib: plt.bar( )

[Documentation]

continents = wb["Continent"].value_counts()

plt.bar(continents.index, continents.values);

continents.index supplies the x values and continents.values supplies the y values.
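
A self-contained version of this pattern might look like the sketch below. The real wb dataset is not reproduced here, so `wb_demo` is a hypothetical stand-in with made-up rows, and the `Agg` backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the wb DataFrame (not the real World Bank data).
wb_demo = pd.DataFrame(
    {"Continent": ["Africa", "Asia", "Africa", "Europe", "Asia", "Africa"]}
)

# Count rows per continent: the index holds categories, the values hold counts.
continents = wb_demo["Continent"].value_counts()

bars = plt.bar(continents.index, continents.values)
plt.xlabel("Continent")
plt.ylabel("Count")
plt.title("Number of countries per continent (toy data)")

# Bar lengths encode the counts.
bar_heights = [bar.get_height() for bar in bars]
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped so the figure renders inline.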

22 of 55

Generating Bar Plots: pandas Native Plotting

wb["Continent"].value_counts().plot(kind='bar')

To create a bar plot in native pandas: .plot(kind='bar')

23 of 55

Generating Bar Plots: Seaborn

import seaborn as sns

sns.plotting_function(data=df, x="x_col", y="y_col")

Seaborn is typically given the alias sns

Seaborn plotting functions use a different structure: Pass in an entire DataFrame, then specify what column(s) to plot.

To add labels and a title, use the same syntax as before:

plt.xlabel("x axis label")

plt.ylabel("y axis label")

plt.title("Title of the plot");

24 of 55

Generating Bar Plots: Seaborn

To create a bar plot in Seaborn: sns.countplot( )

[Documentation]

import seaborn as sns

sns.countplot(data=wb, x="Continent");

countplot operates at a higher level of abstraction!

You give it the entire DataFrame and it does the counting for you.

25 of 55

Distributions of Quantitative Variables

Earlier, we said that bar plots are appropriate for distributions of qualitative variables.

Why only qualitative? Why not quantitative as well?

For example: The distribution of gross national income per capita.

A bar plot will create a separate bar for each unique value. This leads to too many bars for continuous data!
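
To see why, the sketch below counts how many bars a bar plot would need for a qualitative column versus a continuous one. The `demo` DataFrame, its column names, and the simulated values are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns: a qualitative variable and a continuous one.
demo = pd.DataFrame(
    {
        "continent": rng.choice(["Africa", "Asia", "Europe"], size=200),
        "income": rng.normal(loc=20_000, scale=5_000, size=200),
    }
)

# One bar per unique value.
n_bars_qualitative = demo["continent"].nunique()
n_bars_continuous = demo["income"].nunique()
print(n_bars_qualitative, n_bars_continuous)
```

The continuous column has almost as many unique values as rows, so nearly every bar would have height 1.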

26 of 55

Distributions of Quantitative Variables

To visualize the distribution of a continuous quantitative variable:

Box plot

Violin plot

Histogram

27 of 55

Box plots and Violin Plots

Box plots and violin plots display distributions using information about quartiles.

In a box plot, the width of the box encodes no meaning.
In a violin plot, the width of the "violin" indicates the density of datapoints at each value.

sns.boxplot(data=df, y="y_variable");

[Documentation]

sns.violinplot(data=df, y="y_variable");

[Documentation]

28 of 55

Quartiles

For a quantitative variable:

First or lower quartile: 25th percentile.
Second quartile: 50th percentile (median).
Third or upper quartile: 75th percentile.

The interval [first quartile, third quartile] contains the "middle 50%" of the data.

Interquartile range (IQR) measures spread.

IQR = third quartile – first quartile.

The length of this region is the IQR
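
As a small illustration, the quartiles and IQR can be computed with `np.percentile`. The `values` array is hypothetical, and percentile conventions differ slightly between tools; NumPy's default uses linear interpolation.

```python
import numpy as np

# Hypothetical sample of a quantitative variable.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First, second (median), and third quartiles.
q1, q2, q3 = np.percentile(values, [25, 50, 75])

# The IQR is the length of the interval holding the middle 50% of the data.
iqr = q3 - q1
print(q1, q2, q3, iqr)
```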

29 of 55

Box Plots

sns.boxplot(data=wb, y="Gross domestic product: % growth : 2016")

First quartile (25th percentile)

Second quartile (median)

Third quartile (75th percentile)

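Upper whisker: extends to the largest datapoint at or below upper quartile + 1.5*IQR

Lower whisker: extends to the smallest datapoint at or above lower quartile - 1.5*IQR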

Outliers

Why an outlier? [link]
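
The 1.5*IQR rule can be sketched directly, assuming the common convention (the default in Matplotlib and Seaborn box plots) that whiskers stop at the most extreme datapoints inside the fences. The `values` array is hypothetical.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as outliers.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers stop at the most extreme datapoints still inside the fences.
whisker_low, whisker_high = inside.min(), inside.max()
print(lower_fence, upper_fence, whisker_low, whisker_high, outliers)
```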

30 of 55

Violin Plots

Violin plots are similar to box plots, but also show smoothed density curves.

The "width" of our "box" now has meaning!
The three quartiles and "whiskers" are still present – look closely.

sns.violinplot(data=wb, y="Gross domestic product: % growth : 2016")

31 of 55

Side-by-side Box and Violin Plots

What if we wanted to incorporate a qualitative variable as well? For example, compare the distribution of a quantitative continuous variable across different qualitative categories.

GDP growth: quantitative continuous

Continent: qualitative nominal

sns.boxplot(data=wb, x="Continent", y="Gross domestic product: % growth : 2016");
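
A side-by-side box plot draws one set of quartiles per category. The sketch below computes those per-group quartiles with pandas; `wb_demo` and its `gdp_growth` column are hypothetical stand-ins, not the real wb columns.

```python
import pandas as pd

# Hypothetical stand-in for wb: GDP growth (%) for a few made-up countries.
wb_demo = pd.DataFrame(
    {
        "Continent": ["Africa"] * 5 + ["Europe"] * 5,
        "gdp_growth": [1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 1.0, 2.0, 3.0, 4.0],
    }
)

# One row per continent, one column per quartile.
quartiles = (
    wb_demo.groupby("Continent")["gdp_growth"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```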

32 of 55

Histograms

A histogram:

Collects datapoints with similar values into a shared "bin".
Scales the bins such that the area of each bin is equal to the percentage of datapoints it contains (as in Data 8).

The first bin has a width of $16410 and a height of 4.77 x 10^-5.

This means that it contains 16410 x (4.77 x 10^-5) ≈ 0.783 = 78.3% of all datapoints in the dataset.
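
The same area calculation can be checked numerically. This sketch uses `np.histogram` with `density=True` on hypothetical data with two bins of unequal width, then recovers proportions and counts from the bar heights.

```python
import numpy as np

# Hypothetical data: 10 values sorted into 2 bins of unequal width.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 15, 25])
bin_edges = np.array([0, 10, 30])

# density=True scales heights so that the bar areas sum to 1.
heights, edges = np.histogram(values, bins=bin_edges, density=True)
widths = np.diff(edges)

proportions = heights * widths      # area of each bar
counts = proportions * len(values)  # back to raw counts
print(heights, proportions, counts)
```

The wider bin gets a shorter bar for the same proportion, which is why height alone does not tell you how many datapoints a bin holds.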

33 of 55

How many observations are in the bin [110, 115) given the following information?

- There are 1174 observations in total.

- Width of bin [110, 115): 5

- Height of bar [110, 115): 0.02

34 of 55

Answer

There are 1174 observations total.

Width of bin [110, 115): 5
Height of bar [110, 115): 0.02
Proportion in bin = 5 * 0.02 = 0.1
Number in bin = 0.1 * 1174 = 117.4, i.e. about 117 observations

35 of 55

Histograms in Code

In Matplotlib [Documentation]: plt.hist(x_values, density=True)

In Seaborn [Documentation]: sns.histplot(data=df, x="x_column", stat="density")

Matplotlib

Seaborn

36 of 55

Overlaid Histograms

To compare a quantitative variable's distribution across qualitative categories, overlay histograms on top of one another.

The hue parameter of Seaborn plotting functions sets the column that should be used to determine color.

sns.histplot(data=wb, hue="Hemisphere", x="Gross national income…")

Always include a legend when color is used to encode information!

37 of 55

Interpreting Histograms

The skew of a histogram describes the direction in which its "tail" extends.

A distribution with a long right tail is skewed right.
A distribution with a long left tail is skewed left.

A histogram with no clear skew is called symmetric.

A long right tail

A long left tail

38 of 55

Interpreting Histograms

The mode(s) of a histogram are the peak values in the distribution.

A distribution with one clear peak is called unimodal.
Two peaks: bimodal.
More peaks: multimodal.

Unimodal

Bimodal?

39 of 55

Interlude

Made by DALL-E

40 of 55

Lecture 7 ended here!

We will cover the rest in Lecture 8.

41 of 55

Kernel Density Estimation

42 of 55

Kernel Density Estimation: Intuition

Often, we want to identify general trends across a distribution, rather than focus on detail. Smoothing a distribution helps generalize the structure of the data and eliminate noise.

A KDE curve

Idea: approximate the probability distribution that generated the data.

Assign an “error range” to each data point in the dataset – if we were to sample the data again, we might get a different value.
Sum up the error ranges of all data points.
Scale the resulting distribution to integrate to 1.

43 of 55

Kernel Density Estimation: Process

Idea: Approximate the probability distribution that generated the data.

Place a kernel at each data point.
Normalize kernels so that total area = 1.
Sum all kernels together.

A kernel is a function that tries to capture the randomness of our sampled data.

A datapoint in our dataset

The kernel models the probability of us sampling that datapoint.

Area below integrates to 1

44 of 55

Step 1️⃣ – Place a Kernel at Each Data Point

Consider a fake dataset with just five collected datapoints.

Place a Gaussian kernel with a bandwidth of 𝝰 = 1.
We will precisely define both the Gaussian kernel and bandwidth in a few slides.

Each line represents a datapoint in the dataset

(e.g. one country’s HIV rate).

Place a kernel on top of each datapoint.

45 of 55

Step 2️⃣ – Normalize Kernels

In Step 3, we will be summing each of these kernels to produce a probability distribution.

We want the result to be a valid probability distribution that has area 1.
We have 5 different kernels, each with area 1.
So, we normalize by multiplying each kernel by ⅕.

Each kernel has area 1.

Each normalized kernel has area ⅕.

46 of 55

Step 3️⃣ – Sum the Normalized Kernels

At each point in the distribution, add up the values of all kernels. This gives us a smooth curve with area 1 – an approximation of a probability distribution!

Sum these five normalized curves together.

The final KDE curve.
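
The three steps can be sketched from scratch with NumPy. This is a minimal illustration, not the course's implementation: the five `data` values are hypothetical (the slide's actual values are not given in the text), and `gaussian_kernel` is a helper written for this example.

```python
import numpy as np

def gaussian_kernel(x, x_i, alpha):
    """Gaussian density with mean x_i and standard deviation alpha."""
    return np.exp(-((x - x_i) ** 2) / (2 * alpha**2)) / np.sqrt(2 * np.pi * alpha**2)

# Hypothetical five-point dataset.
data = np.array([2.0, 3.0, 4.5, 6.0, 6.5])
alpha = 1.0
x = np.linspace(-6.0, 15.0, 2101)  # grid wide enough to cover the tails

# Step 1: one kernel per datapoint (rows = datapoints, columns = grid values).
kernels = gaussian_kernel(x[None, :], data[:, None], alpha)
# Step 2: normalize each kernel by 1/n.
normalized = kernels / len(data)
# Step 3: sum the normalized kernels.
kde = normalized.sum(axis=0)

dx = x[1] - x[0]
area = kde.sum() * dx  # Riemann-sum approximation of the integral
print(round(area, 4))
```

Each row of `normalized` has area of about 1/5, so the summed curve has area of about 1.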

47 of 55

Result

A summary of the distribution using KDE.

Each line represents a datapoint in the dataset

(e.g. one country’s HIV rate).

The density at each point corresponds to the KDE calculated based on kernels placed on all data points

48 of 55

Summary of KDE

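The general KDE formula is:

$$f_\alpha(x) = \frac{1}{n} \sum_{i=1}^{n} K_\alpha(x, x_i)$$

K𝝰(x, xi) is the kernel function centered on observation i.

Each kernel individually has area 1.
K represents our kernel function of choice. We’ll talk about the math of these functions soon.

For example, K1(x, 2) and K1(x, 6) are kernels with 𝝰 = 1 centered on the observations 2 and 6.

The kernels correspond to Step 1️⃣, the factor 1/n to Step 2️⃣, and the sum to Step 3️⃣.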

49 of 55

Summary of KDE

In the KDE formula above:

x represents any number on the number line. It is the input to our function.

n is the number of observed data points that we have.

We multiply by 1/n to normalize the kernels so that the total area of the KDE is still 1.

Each xi (x1, x2, …, xn) represents an observed data point. We sum the kernels for each datapoint to create the final KDE curve.

𝝰 is the bandwidth or smoothing parameter.

50 of 55

Kernels

A kernel (for our purposes) is a valid density function, meaning:

It must be non-negative for all inputs.
It must integrate to 1 (area under curve = 1).

The most common kernel is the Gaussian kernel:

$$K_\alpha(x, x_i) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - x_i)^2}{2 \alpha^2}\right)$$

Gaussian = Normal distribution = bell curve.
Here, x represents any input, and xi represents the ith observed value (datapoint).
Each kernel is centered on one of our observed values (and so its distribution mean is xi).
𝝰 is the bandwidth parameter. It controls the smoothness of our KDE. Here, it is also the standard deviation of the Gaussian.

Memorizing this formula is less important than knowing the shape and how the bandwidth parameter 𝝰 smooths the KDE.
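
Since the Gaussian kernel is a normal density with mean xi and standard deviation 𝝰, one way to sanity-check an implementation is to compare it with the standard library's `statistics.NormalDist`. The function name `gaussian_kernel` and the test inputs below are choices made for this sketch.

```python
import math
from statistics import NormalDist

def gaussian_kernel(x, x_i, alpha):
    """Gaussian kernel centered on x_i with bandwidth (standard deviation) alpha."""
    coefficient = 1 / math.sqrt(2 * math.pi * alpha**2)
    return coefficient * math.exp(-((x - x_i) ** 2) / (2 * alpha**2))

# The kernel should match a Normal(mean=x_i, sd=alpha) density.
reference = NormalDist(mu=3.0, sigma=2.0)
for x in (-1.0, 3.0, 4.5):
    print(x, gaussian_kernel(x, x_i=3.0, alpha=2.0), reference.pdf(x))
```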

51 of 55

Effect of Bandwidth on KDEs

Bandwidth is analogous to the width of each bin in a histogram.

As 𝝰 increases, the KDE becomes more smooth.
A KDE with a large 𝝰 is simpler to understand, but it discards potentially important distributional information (e.g. multimodality).
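
The loss of multimodality can be seen numerically. The sketch below evaluates a Gaussian KDE on hypothetical two-cluster data at a cluster center (x = 1) and at the midpoint between clusters (x = 3). The helper `kde_at` and the "valley versus peak" comparison are simplifications chosen for this example.

```python
import numpy as np

def kde_at(x, data, alpha):
    """Average of Gaussian kernels (sd = alpha) centered on each datapoint."""
    z = (np.asarray(x, dtype=float)[..., None] - data) / alpha
    kernels = np.exp(-0.5 * z**2) / (alpha * np.sqrt(2 * np.pi))
    return kernels.mean(axis=-1)

# Hypothetical bimodal data: one cluster near 1 and one near 5.
data = np.array([0.8, 1.0, 1.2, 4.8, 5.0, 5.2])

for alpha in (0.5, 3.0):
    at_cluster, at_midpoint = kde_at([1.0, 3.0], data, alpha)
    # A valley at the midpoint means the two clusters still show up as separate peaks.
    shape = "two peaks" if at_midpoint < at_cluster else "one peak"
    print(f"alpha={alpha}: density at 1 = {at_cluster:.3f}, at 3 = {at_midpoint:.3f} ({shape})")
```

With the small bandwidth the midpoint is a valley; with the large bandwidth the two clusters merge into a single bump centered between them.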

52 of 55

Other Kernels: Boxcar

As an example of another kernel, consider the boxcar kernel.

It assigns uniform density to points within a “window” of the observation, and 0 elsewhere.
Resembles a histogram… sort of.

Not of any practical use in Data 100! Presented as a simple theoretical alternative.

A boxcar kernel centered on xi = 4 with 𝝰 = 2.
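
A boxcar kernel can be sketched as below. The text does not state the exact window, so this assumes a window of total width 𝝰 centered on xi, with height 1/𝝰 so that the area is 1. The function name `boxcar_kernel` is chosen for this example.

```python
def boxcar_kernel(x, x_i, alpha):
    """Uniform density 1/alpha on a window of total width alpha centered on x_i.

    Assumption: the window spans [x_i - alpha/2, x_i + alpha/2].
    """
    return 1 / alpha if abs(x - x_i) <= alpha / 2 else 0.0

# Kernel centered on x_i = 4 with alpha = 2: nonzero on [3, 5] under this assumption.
for x in (2.0, 3.5, 4.0, 5.5):
    print(x, boxcar_kernel(x, x_i=4, alpha=2))
```

Under this assumption the kernel is non-negative everywhere and its area is height times width, (1/𝝰) * 𝝰 = 1, so it is a valid density.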

53 of 55

Which of the following are valid kernel density plots?

54 of 55

Have a Normal Day!

55 of 55

Visualization I

LECTURE 7

Content credit: Acknowledgments
