1 of 60

Data Visualization

(Reading: 6.4-6.6)

(Slides adapted from Deb Nolan, Sandrine Dudoit, & Fernando Perez)

UC Berkeley Data 100 Summer 2019

Sam Lau

Learning goals:

Understand the principles of Scale, Conditioning, Perception, Transformation, Context, and Smoothing

2 of 60

Announcements

Project 1 due today!
Small group tutoring is starting this week

HW2 out Wednesday

Will be “officially” due Friday but we will take submissions without penalty until Tuesday (July 9)

HW3 out Friday

Due the following Friday (July 12)

3 of 60

When Submitting Assignments...

(Demo)

Run the last cell of the notebook

Ensure that OkPy link shows the latest version of the notebook

Then, submit the PDF to Gradescope!

You must label each question with the pages it’s on
If jassign breaks, you can use Print to PDF.

4 of 60

Data Visualization Principles

5 of 60

Six Principles Today

Scale
Conditioning
Perception
Transformations
Context
Smoothing

Explored via three case studies.

6 of 60

Case 1: Planned Parenthood 2015 Hearing

Investigation of federal funding of Planned Parenthood in light of the fetal tissue controversy
Congressman Chaffetz (R-UT) showed a plot that originally appeared in a report by Americans United for Life (http://www.aul.org/)

Full report available at https://oversight.house.gov/interactivepage/plannedparenthood/.

7 of 60

Case 1: Planned Parenthood 2015 Hearing

Procedures: cancer screenings and abortions
How many data points are plotted?
What is suspicious?
What message is this plot trying to convey?

8 of 60

Case 2: Median Weekly Earnings

The Bureau of Labor Statistics surveys the economics of labor
www.bls.gov: web interface to a report-generating app
Plot of median weekly earnings for males and females by education level

9 of 60

Case 2: Median Weekly Earnings

What comparisons are easily made with this plot?
What comparisons are most interesting and important?

10 of 60

Case 3: Cherry Blossom Runners

10-mile run in Washington, DC, every April
Results available from 1999 to 2019
Over 17,000 runners in 2019
Scatter plot of run time (min) against age (years)

http://www.cherryblossom.org/

11 of 60

Case 3: Cherry Blossom Runners

70,000+ points in the plot!
What’s the relationship between run time and age?

12 of 60

Principles of Scale

14 of 60

Keep consistent axis scales

Don’t change scale mid-axis
Don’t use two different scales for same axis
How does this plot change perception of information?

15 of 60

Consider Scale of Data

Counts of cancer screenings and abortions are on quite different scales
Can plot percent change instead of raw counts
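A minimal sketch of the percent-change idea, assuming each series is measured relative to its own first value. The counts below are hypothetical placeholders, not the figures from the hearing, and `percent_change` is an illustrative helper name.

```python
def percent_change(series):
    """Return each value's percent change relative to the first value."""
    baseline = series[0]
    return [100 * (value - baseline) / baseline for value in series]

# Hypothetical counts (NOT the hearing's figures): two series whose raw
# magnitudes differ by roughly a factor of ten.
large_series = [2_000_000, 1_500_000, 1_000_000]
small_series = [200_000, 220_000, 250_000]

large_change = percent_change(large_series)  # [0.0, -25.0, -50.0]
small_change = percent_change(small_series)  # [0.0, 10.0, 25.0]
```

Both series now share one axis (percent), so a single scale can show them without hiding the difference in magnitude.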

16 of 60

Reveal the Data

Choose axis limits to fill plot
If necessary, zoom into region with most of data

Can make separate plots for different regions
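One possible way to pick axis limits that zoom into the bulk of the data is to cut at low and high percentiles. This is a sketch under that assumption; the 1st/99th percentile defaults, the helper name `zoom_limits`, and the sample values are all illustrative choices.

```python
import statistics

def zoom_limits(values, lower_pct=1, upper_pct=99):
    """Return (low, high) limits that cover the central bulk of the data."""
    # quantiles(n=100) returns the 99 percentile cut points.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

# Hypothetical data: 99 ordinary values plus one extreme outlier.
values = list(range(1, 100)) + [10_000]
low, high = zoom_limits(values)
```

The resulting pair could be passed to a plotting call such as matplotlib's `ax.set_xlim(low, high)`; the outlier would then deserve its own plot or an annotation.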

17 of 60

Principles of Conditioning

18 of 60

Conditioning

19 of 60

Use Conditioning To Aid Comparison

Conditioning on male/female aligns points on x-axis

What does it reveal?
Why is this interesting?
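A rough sketch of what conditioning does numerically: once values for both groups are aligned at each education level, the gap at each level can be read off directly. The rows below are hypothetical, not BLS figures, and `gap_by_group` is an illustrative name.

```python
from collections import defaultdict

# Hypothetical (education, sex, median weekly earnings) rows, NOT BLS data.
rows = [
    ("High school", "Male", 700), ("High school", "Female", 560),
    ("Bachelor's", "Male", 1200), ("Bachelor's", "Female", 960),
]

def gap_by_group(rows):
    """Align the two sexes within each education level and take the gap."""
    by_group = defaultdict(dict)
    for education, sex, earnings in rows:
        by_group[education][sex] = earnings
    return {edu: vals["Male"] - vals["Female"] for edu, vals in by_group.items()}

gaps = gap_by_group(rows)  # {'High school': 140, "Bachelor's": 240}
```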

20 of 60

Use Small Multiples To Aid Comparison

Faceted plots that share scales are easy to compare

https://statmodeling.stat.columbia.edu/2009/07/15/hard_sell_for_b/

21 of 60

Principles of Perception

22 of 60

Color Choices Matter!

Jet Colormap

Viridis Colormap

23 of 60

Use a Perceptually Uniform Color Map

Perceptually uniform: changing data from 0.1 to 0.2 appears similar to change from 0.8 to 0.9.

Measure by running experiments on people!

Jet, the old matplotlib default, was far from uniform!
Our own Stéfan van der Walt and Nathaniel Smith at the Berkeley Institute for Data Science fixed this :)

https://bids.github.io/colormap/

Also, avoid red + green since many people are colorblind

24 of 60

Use a Perceptually Uniform Color Map

Jet Colormap

Viridis Colormap

25 of 60

Use Color to Highlight Data Type

Qualitative: Choose a qualitative scheme that makes it easy to distinguish between categories
Quantitative: Choose a color scheme that implies magnitude.
Plot on right has both!

26 of 60

Use Color to Highlight Data Type

Does the data progress from low to high? Use a sequential scheme, where light colors typically represent low values and dark colors represent high values

27 of 60

Use Color to Highlight Data Type

Do both low and high values deserve equal emphasis? Use a diverging scheme where light colors represent middle values

28 of 60

Not All Marks Are Good!

Accuracy of judgements depends on the type of mark.
Aligned lengths most accurate
Color least accurate

29 of 60

Lengths are Easy to Understand

People can easily distinguish two different lengths

E.g. Heights of bars in bar chart

30 of 60

Angles are Hard to Understand

Avoid pie charts!

Angle judgements are inaccurate

In general, people underestimate the size of the larger angle.

31 of 60

Areas are Hard to Understand

Avoid area charts!

Area judgements are inaccurate

In general, people underestimate the size of the larger area.

32 of 60

Areas are Hard to Understand

Avoid word clouds!

Hard to tell the “area” taken up by a word

33 of 60

Avoid Jiggling Baseline

Stacked bar charts / histograms hard to read because baseline moves
Notice that top bars are all about the same height

34 of 60

Avoid Jiggling Baseline

Stacked area charts hard to read because baseline moves

35 of 60

Avoid Jiggling Baseline

Instead, plot lines themselves

36 of 60

Break!

Fill out Attendance:

http://bit.ly/at-d100

37 of 60

Principles of Transformation

38 of 60

Transforming Data Can Reveal Patterns

When data are heavy tailed, useful to take the log and replot

39 of 60

Transforming Data Can Reveal Patterns

Shows a mode when log(fare) = 2 and a smaller mode at 3.4.
What do these correspond to in actual dollars?
exp(2) ≈ $7.4
exp(3.4) ≈ $30
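The back-transformation above can be checked in a couple of lines, assuming the plot used the natural log (consistent with the use of exp here).

```python
import math

log_modes = [2.0, 3.4]  # modes read off the log(fare) histogram
dollar_modes = [math.exp(v) for v in log_modes]
rounded = [round(d, 1) for d in dollar_modes]  # [7.4, 30.0]
```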

40 of 60

Transforming Data Can Reveal Patterns

Log of nonlinear data can reveal pattern in scatter plot!

41 of 60

Log of y-values

Linear relationship after log of y-values implies exponential model for original plot

Fit line to log of y-values:
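As a sketch of why a line on the log scale means an exponential model: if log(y) = b*x + c, then y = exp(c) * exp(b*x). The data here are synthetic, generated from y = 3 * exp(0.5 * x) purely for illustration, and `fit_line` is a hand-rolled least-squares helper rather than any particular library's API.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares: return (slope, intercept) for ys ~ slope*xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data from y = 3 * exp(0.5 * x); illustrative only.
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * math.exp(0.5 * x) for x in xs]

# Fit a line to (x, log y), then undo the log to recover the model.
slope, intercept = fit_line(xs, [math.log(y) for y in ys])
growth_rate = slope              # close to 0.5
multiplier = math.exp(intercept)  # close to 3
```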

42 of 60

Log of both x and y-values

Fit line to log of x and y-values:

Linear relationship after log of x and y-values implies a power-law (single-term polynomial) model, y = a * x^b, for original plot
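A similar sketch for the log-log case: if log(y) = b*log(x) + c, then y = exp(c) * x^b, so the fitted slope is the exponent. The data are synthetic, generated from y = 2 * x^3 for illustration, and `fit_line` is again a hand-rolled helper.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares: return (slope, intercept) for ys ~ slope*xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data from y = 2 * x**3; illustrative only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x ** 3 for x in xs]

# Fit a line to (log x, log y); the slope is the exponent.
slope, intercept = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
exponent = slope                   # close to 3
coefficient = math.exp(intercept)  # close to 2
```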

43 of 60

Principles of Context

46 of 60

Add Context Directly to Plot

A publication-ready plot needs:

Informative title (takeaway, not description)

“Older passengers spend more on plane tickets” instead of “Scatter plot of price vs. age”.

Axis labels
Reference lines and markers for important values
Labels for unusual points
Captions that describe data

47 of 60

Principles of Smoothing

48 of 60

Apply Smoothing for Large Datasets

49 of 60

A Histogram is a Smoothed Rug Plot
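One way to see this: a rug plot marks every point, while a histogram replaces the points in each bin with a single bar. The sketch below assumes equal-width bins and density scaling (bar areas sum to 1); the points and the helper name `histogram_density` are hypothetical.

```python
def histogram_density(points, bin_width, start):
    """Count points per bin, then scale so the bar areas sum to 1."""
    counts = {}
    for p in points:
        index = int((p - start) // bin_width)
        counts[index] = counts.get(index, 0) + 1
    n = len(points)
    # Map each bin's left edge to its bar height.
    return {start + i * bin_width: c / (n * bin_width)
            for i, c in sorted(counts.items())}

rug = [1.2, 1.9, 2.1, 2.4, 3.8]  # hypothetical rug-plot positions

narrow = histogram_density(rug, bin_width=1, start=1)  # three bars
wide = histogram_density(rug, bin_width=2, start=1)    # two bars
```

Wider bins smooth more: the five points collapse into three bars with width 1 and only two bars with width 2.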

50 of 60

Smoothing Needs Tuning

51 of 60

Kernel Density Estimation (KDE)

Sophisticated smoothing technique
Used to estimate a probability density function from a set of data

52 of 60

Kernel Density Estimation

Intuition:

Place a “kernel” at each data point

53 of 60

Kernel Density Estimation

Intuition:

Place a “kernel” at each data point
Normalize kernels so that total area = 1

54 of 60

Kernel Density Estimation

Intuition:

Place a “kernel” at each data point
Normalize kernels so that total area = 1
Sum all kernels together
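The three steps above can be sketched as follows, assuming a Gaussian kernel. The data points and the bandwidth of 1 are hypothetical, and the function names are illustrative rather than any library's API.

```python
import math

def gaussian_kernel(x, center, bandwidth):
    """Step 1: a Gaussian bump with area 1, centered at one data point."""
    z = (x - center) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2 * math.pi))

def kde(x, data, bandwidth):
    """Steps 2 and 3: scale each kernel by 1/n, then sum them at x."""
    return sum(gaussian_kernel(x, c, bandwidth) / len(data) for c in data)

data = [2.0, 3.0, 5.0]  # hypothetical data points
height_at_3 = kde(3.0, data, bandwidth=1.0)
```

Because each kernel has area 1 and is scaled by 1/n, the summed curve also has total area 1, which is what makes it a density estimate.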

55 of 60

Kernel Density Estimation

Gaussian kernel most common (default for seaborn).

56 of 60

Kernel Density Estimation

Changing width of each kernel = changing bandwidth

Narrow bandwidth is analogous to narrow bins for histogram
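A small sketch of the bandwidth effect, assuming a Gaussian kernel: count the bumps in the estimate for a narrow and a wide bandwidth. The data points, the bandwidth values, and the helper names are hypothetical.

```python
import math

def gaussian_kde(x, data, bandwidth):
    """Average of Gaussian kernels centered at each data point."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in data) / norm

def count_modes(data, bandwidth, grid):
    """Count strict local maxima of the estimate on a grid of x values."""
    heights = [gaussian_kde(x, data, bandwidth) for x in grid]
    return sum(
        1 for i in range(1, len(heights) - 1)
        if heights[i - 1] < heights[i] > heights[i + 1]
    )

data = [0.0, 10.0, 20.0]                  # hypothetical data points
grid = [i / 10 for i in range(-50, 251)]  # x from -5 to 25 in steps of 0.1

narrow_modes = count_modes(data, 0.5, grid)  # one spike per data point
wide_modes = count_modes(data, 10.0, grid)   # a single broad bump
```

As with histogram bin widths, too narrow a bandwidth shows noise as structure, and too wide a bandwidth can hide real structure.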

57 of 60

KDE Example — Uniform Kernel

Uniform kernel with bandwidth of 2.

Data points at:

Kernel at each x:

58 of 60

KDE Example — Uniform Kernel

Scale each kernel by 1/4 since there are four points:

59 of 60

KDE Example — Uniform Kernel

Add kernels together:

Height at 1.5? 0.5
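A sketch of the same procedure in code. It assumes the uniform kernel has total width equal to the bandwidth and height 1/bandwidth, which is one common convention and may not be the one on the slide. The four data points are hypothetical, not the slide's points, so the heights below will not match the slide's answer.

```python
def boxcar_kernel(x, center, bandwidth):
    """Uniform kernel: height 1/bandwidth within bandwidth/2 of the center."""
    return 1 / bandwidth if abs(x - center) <= bandwidth / 2 else 0.0

def uniform_kde(x, data, bandwidth):
    """Scale each kernel by 1/n and add the kernels together at x."""
    return sum(boxcar_kernel(x, c, bandwidth) for c in data) / len(data)

points = [1.0, 2.0, 4.0, 7.0]  # hypothetical, NOT the slide's data

two_overlap = uniform_kde(1.5, points, bandwidth=2)  # 0.25: kernels at 1 and 2 overlap
no_kernel = uniform_kde(5.5, points, bandwidth=2)    # 0.0: no kernel reaches 5.5
one_kernel = uniform_kde(7.0, points, bandwidth=2)   # 0.125: only the kernel at 7
```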

60 of 60

Summary

When choosing a visualization, consider the principles of Scale, Conditioning, Perception, Transformation, Context, and Smoothing!
In general: show the data!

Maximize data-ink ratio: cut out everything that isn’t data-related
