1 of 60

Data Visualization

(Reading: 6.4-6.6)

(Slides adapted from Deb Nolan, Sandrine Dudoit, & Fernando Perez)

UC Berkeley Data 100 Summer 2019

Sam Lau

Learning goals:

  • Understand the principles of Scale, Conditioning, Perception, Transformation, Context, and Smoothing

2 of 60

Announcements

  • Project 1 due today!
  • Small group tutoring is starting this week
    • Sign up: http://bit.ly/d100-tutor
  • HW2 out Wednesday
    • Will be “officially” due Friday but we will take submissions without penalty until Tuesday (July 9)
  • HW3 out Friday
    • Due the following Friday (July 12)

3 of 60

When Submitting Assignments...

(Demo)

  • Run the last cell of the notebook
    • Ensure that OkPy link shows the latest version of the notebook
  • Then, submit the PDF to Gradescope!
    • You must label each question with the pages it’s on
    • If jassign breaks, you can use Print to PDF.

4 of 60

Data Visualization Principles

5 of 60

Six Principles Today

  1. Scale
  2. Conditioning
  3. Perception
  4. Transformations
  5. Context
  6. Smoothing

Explored via three case studies.

6 of 60

Case 1: Planned Parenthood 2015 Hearing

  • Investigation of federal funding of Planned Parenthood in light of fetal tissue controversy
  • Congressman Chaffetz (R-UT) showed a plot that originally appeared in a report by Americans United for Life (http://www.aul.org/)

7 of 60

Case 1: Planned Parenthood 2015 Hearing

  • Procedures: cancer screenings and abortions
  • How many data points are plotted?
  • What is suspicious?
  • What message is this plot trying to convey?

8 of 60

Case 2: Median Weekly Earnings

  • Bureau of Labor Statistics surveys economics of labor
  • www.bls.gov - Web interface to a report generating app
  • Plot of median weekly earnings for males and females by education level

9 of 60

Case 2: Median Weekly Earnings

  • What comparisons are easily made with this plot?
  • What comparisons are most interesting and important?

10 of 60

Case 3: Cherry Blossom Runners

  • 10 mi run in DC every April
  • Results available from 1999-2019
  • In 2019 over 17,000 runners
  • Scatter plot of run time (min) against age (yrs)

11 of 60

Case 3: Cherry Blossom Runners

  • 70,000+ points in the plot!
  • What’s the relationship between run time and age?

12 of 60

Principles of Scale

13 of 60

Scale

14 of 60

Keep consistent axis scales

  • Don’t change scale mid-axis
  • Don’t use two different scales for same axis
  • How does this plot change perception of information?

15 of 60

Consider Scale of Data

  • Scales of cancer screenings vs. abortions quite different
  • Can plot percent change instead of raw counts
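
A minimal sketch of the percent-change idea. The counts below are made up for illustration and are not the figures from the hearing plot; the helper name `percent_change` is also an assumption.

```python
def percent_change(series):
    """Return each value's percent change relative to the first value."""
    base = series[0]
    return [100 * (value - base) / base for value in series]

# Hypothetical yearly counts on very different scales (NOT real data).
large_series = [2_000_000, 1_500_000, 1_000_000]
small_series = [300_000, 315_000, 330_000]

large_pct = percent_change(large_series)  # [0.0, -25.0, -50.0]
small_pct = percent_change(small_series)  # [0.0, 5.0, 10.0]
```

Plotted as raw counts, the small series would look flat next to the large one. As percent change, both series share one axis that runs from about -50 to +10.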

16 of 60

Reveal the Data

  • Choose axis limits to fill plot
  • If necessary, zoom into region with most of data
    • Can make separate plots for different regions

17 of 60

Principles of Conditioning

18 of 60

Conditioning

19 of 60

Use Conditioning To Aid Comparison

  • Conditioning on male/female aligns points on x-axis
    • What does it reveal?
    • Why is this interesting?

20 of 60

Use Small Multiples To Aid Comparison

  • Faceted plots that share scales are easy to compare

21 of 60

Principles of Perception

22 of 60

Color Choices Matter!

Jet Colormap

Viridis Colormap

23 of 60

Use a Perceptually Uniform Color Map

  • Perceptually uniform: changing data from 0.1 to 0.2 appears similar to change from 0.8 to 0.9.
    • Measure by running experiments on people!
  • Jet, the old matplotlib default, was far from uniform!
  • Our own Stéfan van der Walt and Nathaniel Smith at the Berkeley Institute for Data Science fixed this :)
  • Also, avoid red + green since many people are colorblind

24 of 60

Use a Perceptually Uniform Color Map

Jet Colormap

Viridis Colormap

25 of 60

Use Color to Highlight Data Type

  • Qualitative: Choose a qualitative scheme that makes it easy to distinguish between categories
  • Quantitative: Choose a color scheme that implies magnitude.
  • Plot on right has both!

26 of 60

Use Color to Highlight Data Type

  • Does the data progress from low to high? Use a sequential scheme where color lightness changes steadily from low values to high values

27 of 60

Use Color to Highlight Data Type

  • Do both low and high value deserve equal emphasis? Use a diverging scheme where light colors represent middle values

28 of 60

Not All Marks Are Good!

  • Accuracy of judgements depends on the type of mark.
  • Aligned lengths most accurate
  • Color least accurate

29 of 60

Lengths are Easy to Understand

People can easily distinguish two different lengths

E.g. Heights of bars in bar chart

30 of 60

Angles are Hard to Understand

Avoid pie charts!

Angle judgements are inaccurate

In general, people underestimate the size of the larger angle.

31 of 60

Areas are Hard to Understand

Avoid area charts!

Area judgements are inaccurate

In general, people underestimate the size of the larger area.

32 of 60

Areas are Hard to Understand

Avoid word clouds!

Hard to tell the “area” taken up by a word

33 of 60

Avoid Jiggling Baseline

  • Stacked bar charts / histograms hard to read because baseline moves
  • Notice that top bars are all about the same height

34 of 60

Avoid Jiggling Baseline

  • Stacked area charts hard to read because baseline moves

35 of 60

Avoid Jiggling Baseline

Instead, plot lines themselves

36 of 60

Break!

Fill out Attendance:

http://bit.ly/at-d100

37 of 60

Principles of Transformation

38 of 60

Transforming Data Can Reveal Patterns

  • When data are heavy tailed, useful to take the log and replot

39 of 60

Transforming Data Can Reveal Patterns

  • Shows a mode when log(fare) = 2 and a smaller mode at 3.4.
  • What do these correspond to in actual dollars?
  • exp(2) ≈ $7.4
  • exp(3.4) ≈ $30
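
The back-transform above can be checked with a few lines. This sketch assumes the fares were logged with the natural log, which is what the use of exp on the slide suggests.

```python
import math

# Mode locations read off the log(fare) histogram on the slide.
log_modes = [2.0, 3.4]

# Undo the natural log to get back to dollars.
dollar_modes = [math.exp(value) for value in log_modes]

for log_value, dollars in zip(log_modes, dollar_modes):
    print(f"log(fare) = {log_value}  ->  fare of about ${dollars:.2f}")
```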

40 of 60

Transforming Data Can Reveal Patterns

  • Log of nonlinear data can reveal pattern in scatter plot!

41 of 60

Log of y-values

Linear relationship after log of y-values implies exponential model for original plot

Fit line to log of y-values:
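
A minimal sketch of this fit, assuming natural logs and ordinary least squares. The data are synthetic, and `fit_line` is a hypothetical helper written for this example. If log y = intercept + slope * x, then exponentiating both sides gives y = exp(intercept) * exp(slope * x).

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic data generated from y = 3 * exp(0.5 * x) (illustrative only).
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * math.exp(0.5 * x) for x in xs]

# Fit a line to (x, log y): log y = intercept + slope * x.
intercept, slope = fit_line(xs, [math.log(y) for y in ys])

# Exponentiate to recover the exponential model y = a * exp(b * x).
a, b = math.exp(intercept), slope
```

Here the fit recovers a close to 3 and b close to 0.5, the values used to generate the data.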

42 of 60

Log of both x and y-values

Fit line to log of x and y-values:

Linear relationship after log of x and y-values implies a power-law (single-term polynomial) model, y = a·x^b, for the original plot
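
A minimal sketch of the log-log case, again assuming natural logs and ordinary least squares on synthetic data (`fit_line` is a hypothetical helper). If log y = intercept + slope * log x, then y = exp(intercept) * x^slope, so the slope of the fitted line is the exponent.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic data generated from y = 2 * x**3 (illustrative only).
xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x ** 3 for x in xs]

# Fit a line to (log x, log y).
intercept, slope = fit_line([math.log(x) for x in xs],
                            [math.log(y) for y in ys])

# Recover y = coefficient * x**exponent.
coefficient, exponent = math.exp(intercept), slope
```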

43 of 60

Principles of Context

44 of 60

45 of 60

46 of 60

Add Context Directly to Plot

A publication-ready plot needs:

  • Informative title (takeaway, not description)
    • “Older passengers spend more on plane tickets” instead of “Scatter plot of price vs. age”.
  • Axis labels
  • Reference lines and markers for important values
  • Labels for unusual points
  • Captions that describe data

47 of 60

Principles of Smoothing

48 of 60

Apply Smoothing for Large Datasets

49 of 60

A Histogram is a Smoothed Rug Plot

50 of 60

Smoothing Needs Tuning
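
One way to see the tuning problem is to bin the same points with two bin widths. This is a sketch with made-up observations, and `histogram_counts` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

def histogram_counts(data, bin_width):
    """Count points in bins [k*w, (k+1)*w), keyed by each bin's left edge."""
    counts = Counter(math.floor(x / bin_width) * bin_width for x in data)
    return dict(sorted(counts.items()))

# Hypothetical observations (illustrative only).
data = [1.1, 1.4, 1.9, 2.2, 2.3, 2.8, 5.0, 5.1, 7.6]

wide = histogram_counts(data, bin_width=4)    # {0: 6, 4: 3}
narrow = histogram_counts(data, bin_width=1)  # {1: 3, 2: 3, 5: 2, 7: 1}
```

Wide bins smooth heavily and hide the gap between 3 and 5. Narrow bins keep that detail but start to look like the rug plot again.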

51 of 60

Kernel Density Estimation (KDE)

  • Sophisticated smoothing technique
  • Used to estimate a probability density function from a set of data

52 of 60

Kernel Density Estimation

Intuition:

  1. Place a “kernel” at each data point

53 of 60

Kernel Density Estimation

Intuition:

  • Place a “kernel” at each data point
  • Normalize kernels so that total area = 1

54 of 60

Kernel Density Estimation

Intuition:

  • Place a “kernel” at each data point
  • Normalize kernels so that total area = 1
  • Sum all kernels together
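
The three steps can be sketched in plain Python. This version assumes a Gaussian kernel and made-up data points; the function names are hypothetical, and a plotting library would normally do this work for you.

```python
import math

def gaussian_kernel(u):
    """Standard normal density; it already integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, bandwidth):
    """Estimate the density at x from the points in data."""
    n = len(data)
    # Step 1: a kernel centered at each data point, stretched by the bandwidth.
    # Step 2: dividing by n * bandwidth keeps the total area equal to 1.
    # Step 3: sum the scaled kernels.
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

data = [2.0, 3.5, 4.0, 7.0]  # hypothetical points

# Check the normalization with a Riemann sum over the range -10 to 20.
grid_step = 0.01
grid = [-10 + i * grid_step for i in range(3001)]
area = sum(kde(x, data, bandwidth=1.0) for x in grid) * grid_step
```

The computed area comes out very close to 1, which is what makes the result usable as a density estimate.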

55 of 60

Kernel Density Estimation

Gaussian kernel most common (default for seaborn).

56 of 60

Kernel Density Estimation

Changing width of each kernel = changing bandwidth

Narrow bandwidth is analogous to narrow bins for histogram
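
A rough sketch of the bandwidth effect, assuming a Gaussian kernel and made-up points. It counts the peaks of the estimate on a grid; `count_modes` is a hypothetical helper for this example.

```python
import math

def gaussian_kde(x, data, bandwidth):
    """Gaussian kernel density estimate at x."""
    total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
    return total / (len(data) * bandwidth * math.sqrt(2 * math.pi))

def count_modes(data, bandwidth):
    """Count local peaks of the estimate on a grid from -5 to 15."""
    grid = [i / 20 for i in range(-100, 301)]
    heights = [gaussian_kde(x, data, bandwidth) for x in grid]
    return sum(
        1
        for left, mid, right in zip(heights, heights[1:], heights[2:])
        if left < mid >= right
    )

data = [1.0, 3.0, 5.0, 7.0]  # hypothetical points

narrow_modes = count_modes(data, bandwidth=0.3)  # one bump per point
wide_modes = count_modes(data, bandwidth=3.0)    # bumps merge into one
```

With the narrow bandwidth the estimate has four separate peaks, much like a histogram with very narrow bins. With the wide bandwidth the four kernels blend into a single peak.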

57 of 60

KDE Example — Uniform Kernel

Uniform kernel with bandwidth of 2.

Data points at:

Kernel at each x:

58 of 60

KDE Example — Uniform Kernel

Scale each kernel by 1/4 since there are four points:

59 of 60

KDE Example — Uniform Kernel

Add kernels together:

Height at 1.5? 0.5
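
A minimal sketch of the uniform-kernel computation. It assumes that "bandwidth of 2" means each box has total width 2 and therefore height 1/2. The data points are hypothetical, since the slide's own points are not reproduced in this text.

```python
def uniform_kernel(x, center, bandwidth):
    """Box of total width `bandwidth` centered at `center`, with area 1."""
    half_width = bandwidth / 2
    return 1 / bandwidth if abs(x - center) <= half_width else 0.0

def uniform_kde(x, data, bandwidth):
    """Scale each box by 1/n and add the boxes together."""
    return sum(uniform_kernel(x, xi, bandwidth) for xi in data) / len(data)

# Hypothetical points, all within 1 unit of x = 1.5.
data = [1.0, 1.25, 2.0, 2.25]

height_at_1_5 = uniform_kde(1.5, data, bandwidth=2)  # all four boxes overlap here
height_at_3_1 = uniform_kde(3.1, data, bandwidth=2)  # only the box at 2.25 reaches
```

Each scaled box has height (1/2) × (1/4) = 1/8. Where all four boxes overlap the estimate is 4 × 1/8 = 0.5, and where only one box reaches it is 0.125.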

60 of 60

Summary

  • When choosing a visualization, consider the principles of Scale, Conditioning, Perception, Transformation, Context, and Smoothing!
  • In general: show the data!
    • Maximize data-ink ratio: cut out everything that isn’t data-related