1 of 69

Applied Data Analysis (CS401)

Robert West

Lecture 6

Data Visualization

2 of 69

Announcements

  • HW 1 grades available
    • Post-mortem in tomorrow’s lab session
  • HW 2 was due last night
    • Peer reviews due in 1 week (Thu, Nov 2, noon)
    • Grades available in 2 weeks
  • HW 3 released tomorrow (due in 2 weeks)
  • Tomorrow’s lab session:
    • More on data visualization
    • Office hours (HW 1, HW 2, project); add questions to FAQ
  • A word on peer reviews

3 of 69

Uses for data visualization

Support reasoning about information (analysis)

    • Finding relationships
    • Discovering structure
    • Quantifying values and influences
    • Should be part of a query/analyze cycle

Inform and persuade others (communication)

    • Capture attention, engage
    • Tell a story visually
    • Can focus on certain aspects and omit others

4 of 69

Old-fashioned viz

Great for data exploration, developed throughout the last few centuries…

Interactive viz

More and more common when delivering the results. New frameworks are the key enabler.

5 of 69

Want to learn more?

6 of 69

Today’s lecture

  • Visualization for data exploration
  • Principles of data visualization
  • Examples
  • Tools (more in lab session)

7 of 69

Visualization for data exploration

8 of 69

Histograms

Histograms can tell you a lot about a single variable, discrete or continuous

9 of 69

Histograms

Skewed distributions

10 of 69

Box plots

11 of 69

Heavy-tailed data

12 of 69

Heavy-tailed data: power laws

    • Very very large values are rare, “but not very rare”
    • Body size vs. city size
    • Many natural phenomena follow power laws (e.g., # of friends)
    • To deal with them, you need to know some tricks
    • E.g., a power law is a straight line on log-log axes:
      • y = C x^(–α) ↔ log(y) = log(C) – α log(x)
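
The straight-line relation above can be checked numerically. The following is a minimal sketch, assuming noise-free synthetic data (the values C = 3 and α = 2.5 are illustrative, not from the lecture): an ordinary least-squares line through (log x, log y) has slope –α and intercept log(C).

```python
import math

def fit_power_law_loglog(xs, ys):
    """Least-squares line through (log x, log y).

    Returns (C, alpha) for the model y = C * x**(-alpha).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx                      # equals -alpha for an exact power law
    intercept = mean_y - slope * mean_x    # equals log(C)
    return math.exp(intercept), -slope

# Synthetic, noise-free data (illustrative values only).
xs = [1, 2, 4, 8, 16, 32, 64]
ys = [3.0 * x ** -2.5 for x in xs]
C_hat, alpha_hat = fit_power_law_loglog(xs, ys)
print(f"C ~ {C_hat:.3f}, alpha ~ {alpha_hat:.3f}")
```

On real, noisy data a least-squares fit on log-log axes is only a rough diagnostic; it is shown here to illustrate the straight-line property, not as a recommended estimator.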

PDF

13 of 69

Heavy-tailed data: power laws

  • Complementary cumulative distribution function (CCDF):
    • P(x) := Pr{X >= x}
  • CCDF of power law is also a power law (with exponent α – 1)
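
A short derivation of the exponent α – 1, assuming the density is p(x) = C x^(–α) for x ≥ x_min with α > 1 (the lower cutoff x_min is an assumption not stated on the slide):

```latex
P(x) = \Pr\{X \ge x\}
     = \int_x^{\infty} C\, t^{-\alpha}\, dt
     = \frac{C}{\alpha - 1}\, x^{-(\alpha - 1)},
     \qquad \alpha > 1,\ x \ge x_{\min}
```

Taking logs gives log P(x) = log(C / (α – 1)) – (α – 1) log(x), so the CCDF is again a straight line on log-log axes, with slope –(α – 1).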

PDF

CCDF

14 of 69

Heavy-tailed data: power laws

  • Smart trick for plotting CCDF of any distribution:
    • x-axis: data sorted in ascending order
    • y-axis: (n, n–1, …, 1)/n (where n is the number of data points)
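
The trick above can be sketched in a few lines. This is a minimal illustration with made-up data; the function name `empirical_ccdf` is hypothetical.

```python
def empirical_ccdf(data):
    """Return (xs, ys) where ys[i] is the fraction of points >= xs[i]."""
    xs = sorted(data)                      # x-axis: data in ascending order
    n = len(xs)
    ys = [(n - i) / n for i in range(n)]   # y-axis: n/n, (n-1)/n, ..., 1/n
    return xs, ys

xs, ys = empirical_ccdf([5, 1, 3, 2, 4])
print(list(zip(xs, ys)))
# -> [(1, 1.0), (2, 0.8), (3, 0.6), (4, 0.4), (5, 0.2)]
```

Plotting `ys` against `xs` on log-log axes gives the CCDF plot. With tied values, several points share the same x; the highest of their y values is the true Pr{X >= x}.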

CCDF

15 of 69

Multimodal data

  • Two or more distinct peaks in a histogram.
  • Suggests two or more distinct populations of samples.
  • Often arise from gender/political views, or other binary factors.
  • But don’t guess! Explore further by using, e.g., color and a histogram of multiple populations.
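
One way to explore a multimodal histogram is to bin each candidate population separately and compare. The sketch below assumes a hypothetical group label per sample and toy values; in a plot, each group's counts would be drawn in its own color.

```python
from collections import Counter, defaultdict

def histograms_by_group(values, groups, bin_width):
    """Count values per bin, separately for each group label."""
    hists = defaultdict(Counter)
    for v, g in zip(values, groups):
        bin_start = (v // bin_width) * bin_width   # left edge of the bin containing v
        hists[g][bin_start] += 1
    return {g: dict(sorted(c.items())) for g, c in hists.items()}

# Hypothetical toy data: the pooled histogram has peaks at 160 and 180.
values = [160, 162, 165, 168, 178, 180, 182, 185]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
hists = histograms_by_group(values, groups, bin_width=10)
for g, h in hists.items():
    print(g, h)
```

If the peaks separate cleanly by group, as in this toy data, the grouping variable is a plausible explanation for the multimodality; if not, try another variable.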

16 of 69

Multimodal data

Explore further by using, e.g., color and a histogram of multiple populations

17 of 69

Weird data

  • Some data is very hard to explain.
  • Don’t guess! Trace through the data pipeline to find where the strangeness comes from. Usually it’s a processing bug.

18 of 69

Proactive “weird-data detection”

If data looks ok, take a picture and save it for later…

Then periodically compare new data with old whenever there is a pipeline update.

Always try to have a theory of what the data should look like.
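
The "picture" saved for later does not have to be an image. A minimal sketch, assuming a single numeric column and a made-up 10% tolerance: store a few summary statistics as JSON and compare them after each pipeline update. The names `summarize` and `drifted` are hypothetical.

```python
import json
import statistics

def summarize(values):
    """Compact 'picture' of a numeric column: count, extremes, quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {"n": len(values), "min": min(values), "q1": q1,
            "median": q2, "q3": q3, "max": max(values)}

def drifted(old, new, rel_tol=0.1):
    """Names of summary fields whose relative change exceeds rel_tol."""
    changed = []
    for key, old_val in old.items():
        scale = max(abs(old_val), 1e-12)   # avoid division by zero
        if abs(new[key] - old_val) / scale > rel_tol:
            changed.append(key)
    return changed

baseline = summarize([10, 12, 11, 13, 12, 11, 10, 12])
snapshot = json.dumps(baseline)   # store this string with the pipeline version

# After a pipeline update, one value has been corrupted (toy example).
after_update = summarize([10, 12, 11, 13, 12, 11, 10, 120])
print(drifted(json.loads(snapshot), after_update))
```

Here only the maximum moves by more than the tolerance, which points at a single outlier rather than a shift of the whole distribution.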

19 of 69

Remarks on exploration

  • Form expectations of what the data should look like. This helps you guard against pipeline errors and identify interesting patterns.
  • But expect the unexpected

20 of 69

Principles of data visualization

21 of 69

Visualization definitions

  • “Transformation of the symbolic into the geometric” [McCormick et al. 1987]
  • “... finding the artificial memory that best supports our natural means of perception.” [Bertin 1967]
  • “The use of computer-generated, interactive, visual representations of data to amplify cognition.” [Card, Mackinlay, & Shneiderman 1999]

22 of 69

Edward Tufte

23 of 69

Tufte’s Rules

24 of 69

Perception of magnitudes

Which is brighter?

(128, 128, 128)

(144, 144, 144)

25 of 69

Just Noticeable Difference

  • JND (Weber’s Law): ΔI / I = k

  • I : intensity; ΔI : increase from I; k : constant factor
  • Required increase ΔI depends on original intensity I
  • Most continuous variations in stimuli are perceived in discrete steps
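
Weber's Law states that the just noticeable increase satisfies ΔI / I = k. The sketch below applies it to the two gray levels from the "Which is brighter?" slide. It treats the raw 0-255 values as intensities, which ignores display gamma, and k = 0.1 is a made-up threshold used only for illustration.

```python
def weber_fraction(i_old, i_new):
    """Relative change (i_new - i_old) / i_old between two intensities."""
    return (i_new - i_old) / i_old

def is_noticeable(i_old, i_new, k):
    """True if the increase reaches the just-noticeable threshold k * i_old."""
    return (i_new - i_old) >= k * i_old

# Same absolute step of 16, different baselines (k = 0.1 is an assumption).
print(weber_fraction(128, 144))         # 0.125
print(is_noticeable(128, 144, k=0.1))   # True
print(is_noticeable(200, 216, k=0.1))   # False: 16 < 0.1 * 200
```

The same absolute step is noticeable at a low baseline and not at a high one, which is the point of the law.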

26 of 69

Compare area of circles

27 of 69

Compare area of circles

28 of 69

Perception of magnitudes

Most accurate

  • Position
  • Length
  • Slope
  • Angle
  • Area
  • Volume
  • Color hue-saturation-density

Least accurate

Source: Cleveland, McGill (1984), “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods”

29 of 69

Use colors wisely

Choose colors based on the information you want to convey

      • Sequential
      • Diverging
      • Categorical

Use online resources to discover and record your color schemes

      • Color Brewer
      • Kuler
      • Colour Lovers

30 of 69

Use colors wisely

31 of 69

Use colors wisely

32 of 69

Use colors wisely

33 of 69

34 of 69

Colorblind-friendly palettes

  • 10% of males (i.e., ~10% of your reviewers...) have some form of colorblindness

35 of 69

Use structure

Gestalt psychology principles (1912)

36 of 69

Use structure (but not like this!)

37 of 69

Less is more

38 of 69

Interactive chart design: simplifying

  • With interactive charts you can keep things very simple by hiding and dynamically revealing important structure.
  • On an interactive chart, you reveal the information most useful for navigating the chart.

39 of 69

Navigating the chart landscape

40 of 69

Chart selection (by Andrew Abela)

41 of 69

One variable: histograms, box plots

42 of 69

Two variables: scatter plots

Scatter plots quickly expose the relationships between two variables

43 of 69

> 2 variables: scatter plot matrix

44 of 69

> 2 variables: stacked plots

Stack index, color, height

Stack variable and color variable: categorical

45 of 69

> 2 variables: parallel-coord. plots

Color, x, y

Color variable is categorical, others arbitrary

46 of 69

> 2 variables: radar charts

  • Similar to parallel-coord. plots
  • Doesn’t pretend that x axis has meaningful order
  • Good for periodic data

47 of 69

Dimensionality reduction

  • One example, PCA: allows visualization of high-dimensional continuous data in 2D using principal components
  • The principal components are the orthogonal directions of highest variance in the dataset
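
A minimal sketch of PCA for 2D visualization, assuming NumPy is available and using synthetic data. It computes the principal components from the SVD of the centered data matrix; the function name `pca_project` is hypothetical, and this is one standard way to compute PCA, not necessarily the one used in the lab.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                # orthonormal, highest variance first
    explained_var = (S ** 2) / (len(X) - 1)       # variance along each component
    return Xc @ components.T, components, explained_var[:n_components]

rng = np.random.default_rng(0)                    # synthetic data, for illustration only
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
Z, components, var = pca_project(X)
print(Z.shape, var)
```

The two columns of `Z` are the coordinates to put on a scatter plot; `var` indicates how much of the spread each axis captures.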

48 of 69

One Dataset, visualized 25 ways

http://flowingdata.com/2017/01/24/one-dataset-visualized-25-ways/

“You must help the data focus and get to the point. Otherwise, it just ends up rambling about what it had for breakfast this morning and how the coffee wasn’t hot enough.”

49 of 69

Good examples

50 of 69

Charles Joseph Minard (1869): Napoleon’s march

According to Tufte: “It may well be the best statistical graphic ever drawn.”

5 variables: army size, location, dates, direction, temperature during retreat

51 of 69

Interactivity to educate

Hans Rosling:

200 Countries, 200 Years, 4 Minutes

https://www.youtube.com/watch?feature=player_embedded&v=jbkSRLYSojo

52 of 69

Examples: public information

53 of 69

The future of journalism?

NY Times interactive visualizations (recession/recovery 2014)

http://www.nytimes.com/interactive/2014/06/05/upshot/how-the-recession-reshaped-the-economy-in-255-charts.html

And 2014 “the year in interactive storytelling”

http://www.nytimes.com/interactive/2014/12/29/us/year-in-interactive-storytelling.html?_r=0

NY Times graphics are a great source of best practices in viz (except for when they’re not…)

54 of 69

Bad examples

Courtesy of viz.wtf

55 of 69

Visualization to educate?

56 of 69

Pie in the sky?

57 of 69

58 of 69

Needs fixing

59 of 69

Data viz in the sciences

60 of 69

Uses for Data Viz

61 of 69

A case for ugly visualizations

People instinctively gravitate to attractive visualizations, and they have a better chance of getting on the cover of a journal.

But does this conflict with the goals of visualization?

  • Rapid exploration
  • Focus on most important details
  • Easy and fast to develop and customize

62 of 69

Tools

63 of 69

Interactive toolkits: D3

Without doubt, the most widely used interactive visualization framework is D3, developed around 2011 by Jeff Heer, Mike Bostock, and Vadim Ogievetsky.

Note from the authors: D3 is intentionally a low-level system. During the early design of D3, we even referred to it as a "visualization kernel" rather than a "toolkit" or "framework"

64 of 69

Interactive toolkits: Vega

Vega is a “visualization grammar” developed on top of D3.js.

It specifies graphics in JSON format.
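
To give a feel for what a JSON chart specification looks like, here is a sketch of a small bar chart spec built as a Python dict and serialized to JSON. Its structure (data, scales, axes, marks) is modeled on Vega's bar chart example; the schema version, sizes, and the field names `category` and `amount` are assumptions chosen for illustration.

```python
import json

# Hypothetical toy data; field names "category" and "amount" are illustrative.
values = [{"category": "A", "amount": 28},
          {"category": "B", "amount": 55},
          {"category": "C", "amount": 43}]

spec = {
    "$schema": "https://vega.github.io/schema/vega/v5.json",  # version is an assumption
    "width": 300,
    "height": 150,
    "data": [{"name": "table", "values": values}],
    "scales": [
        {"name": "xscale", "type": "band", "range": "width", "padding": 0.05,
         "domain": {"data": "table", "field": "category"}},
        {"name": "yscale", "type": "linear", "range": "height", "nice": True,
         "domain": {"data": "table", "field": "amount"}},
    ],
    "axes": [{"orient": "bottom", "scale": "xscale"},
             {"orient": "left", "scale": "yscale"}],
    "marks": [{
        "type": "rect",                       # one rectangle per data row
        "from": {"data": "table"},
        "encode": {"enter": {
            "x": {"scale": "xscale", "field": "category"},
            "width": {"scale": "xscale", "band": 1},
            "y": {"scale": "yscale", "field": "amount"},
            "y2": {"scale": "yscale", "value": 0},
        }},
    }],
}

spec_json = json.dumps(spec, indent=2)   # the JSON text a Vega runtime would consume
print(spec_json[:200])
```

The chart is described declaratively: the spec says which data fields map to which visual properties, and the runtime decides how to draw it.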

65 of 69

Interactive toolkits: Vincent

Vincent is a Python-to-Vega translator.

Trivia question: why is it called Vincent? Hint: Vincent+Vega= ?

67 of 69

Bokeh: another interactive viz library

Bokeh is an independent visualization library focused more heavily on big data visualization. It has both Python and Scala bindings.

68 of 69

Visualizing maps: Folium

More in tomorrow’s lab session!

69 of 69

Credits

  • Last year’s version of these slides
