1 of 69

Applied Data Analysis (CS401)

Maria Brbić

Lecture 3

Visualizing data

24 Sep 2025

2 of 69

Announcements

  • You must find 4 teammates and register by this Fri Sep 26th 23:59
    • Every student must register individually
    • Still looking for a team or for team members? Use Ed! If need be, talk to Sevda & Pierre [TAs]
  • Project milestone 1 was released last Sun; due Fri 3 Oct 23:59
  • Friday’s lab session:
    • 15:30–16:30: Project milestone P1 office hours (Zoom)
    • In parallel: Exercise on data visualization and working with data from the Web (Exercise 2)
    • Happening only in GCC330, CE14 (not BCH2201!)

3 of 69

Feedback

Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec3-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

4 of 69

Uses for data visualization

Analysis: Support reasoning about information

    • Finding relationships
    • Discovering structure
    • Quantifying values and influences
    • Should be part of the data analysis cycle

Communication: Inform and persuade others

    • Capture attention, engage
    • Tell a story visually
    • Can focus on certain aspects and omit others

Decision making: Make it easier to evaluate potential courses of action

5 of 69

Data visualizations and plots

Basic stats

6 of 69

An unconventional example

“Garden of Eden”: 8 lettuces, each of which is enclosed in its own airtight plexiglas box and represents a major city. The concentration of ozone in each box is controlled in real time to reflect the current pollution level in the city.

7 of 69

Static viz

Great for data exploration, developed throughout the last few centuries…

Interactive viz

More and more common when delivering the results (and also during exploration). New frameworks are the key enabler.

8 of 69

Want to learn more?

9 of 69

Today’s lecture

  • Part 1: Navigating the chart landscape
  • Part 2: Principles and best practices
  • Part 3: A (small) selection of use cases for data visualization

10 of 69

Part 1

Navigating the chart landscape

11 of 69

Chart selection

12 of 69

One variable: histograms

Histograms can tell you a lot about a single variable, discrete or continuous

Easy to recognize skewed distributions!

Smoothed histogram (a.k.a. kernel density estimate)
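
A minimal sketch of what sits behind both plots, using only the Python standard library. The sample, the bin count, and the bandwidth are made-up values for illustration, and `histogram` and `gaussian_kde` are hypothetical helper names, not functions from a plotting library.

```python
import math
import random

def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of overflowing.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def gaussian_kde(values, bandwidth):
    """Return a density function: the average of one Gaussian bump per value."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density

rng = random.Random(0)
# Hypothetical right-skewed sample (exponential), purely for illustration.
sample = [rng.expovariate(1.0) for _ in range(2000)]

counts = histogram(sample, n_bins=10)
density = gaussian_kde(sample, bandwidth=0.3)

print(counts)                        # tall bins on the left, long thin tail on the right
print(density(0.5), density(5.0))    # the smoothed curve shows the same skew
```

The bandwidth plays the same role for the smoothed curve as the bin width does for the histogram: too small gives a spiky plot, too large hides the structure.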

13 of 69

One variable: box plots
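
The slide shows only the figure, so the following is a sketch of the numbers a box plot typically draws. It assumes the common Tukey convention (whiskers reach the most extreme points within 1.5 × IQR of the box, anything beyond is an outlier), which the slide itself does not specify. The data and the helper name `box_plot_stats` are hypothetical.

```python
import statistics

def box_plot_stats(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 x IQR convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme point still inside the fences
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < lo_fence or v > hi_fence),
    }

data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 40]   # made-up values with one extreme point
stats = box_plot_stats(data)
print(stats)
```

Plotting libraries differ in how they compute quartiles, so the exact box edges may vary slightly between tools.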

14 of 69

Two variables: scatter plots

Scatter plots quickly expose the relationships between two variables

2D histograms

a.k.a. heatmap

15 of 69

Two variables: line plots

If the relationship is functional, i.e., one y-value per x-value (for instance, after binning and aggregating)
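
A small sketch of "binning and aggregating": scattered (x, y) points are grouped into equal-width x-bins and each bin is reduced to its mean y, which leaves exactly one y per x and can therefore be drawn as a line. The data, the bin width, and the helper name `bin_and_aggregate` are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def bin_and_aggregate(xs, ys, bin_width):
    """Group points into x-bins and return (bin center, mean y) per non-empty bin."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[int(x // bin_width)].append(y)
    return [((b + 0.5) * bin_width, mean(groups[b])) for b in sorted(groups)]

xs = [0.2, 0.7, 1.1, 1.9, 2.5, 2.6, 3.3]
ys = [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 6.0]
points = bin_and_aggregate(xs, ys, bin_width=1.0)
print(points)  # [(0.5, 2.0), (1.5, 3.0), (2.5, 6.0), (3.5, 6.0)]
```

The mean is only one possible aggregate; the median is a safer choice when the y-values within a bin are heavy-tailed.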

16 of 69

> 2 variables: scatter plot matrix

17 of 69

> 2 variables: stacked plots

Here: 3 variables: stack index, height, color

Stack and color variables categorical, height variable continuous:

Color variable categorical, stack and height variables continuous:

18 of 69

Dimensionality reduction

  • For example, PCA: allows visualization of high-dimensional continuous data in 2D using principal components
  • The principal components are mutually orthogonal directions, ordered by how much of the dataset’s variance they capture (the first captures the most)
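
A dependency-free sketch of PCA for intuition. It finds the top two eigenvectors of the covariance matrix by power iteration with deflation, which is one simple way to do it and an assumption of this example (library implementations typically use an SVD). The 3D data set and all function names are made up.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpair(m, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of symmetric m."""
    v = [float(i + 1) for i in range(len(m))]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(2):
        lam, v = top_eigenpair(cov)
        components.append(v)
        # Deflation: remove the direction just found, so the next one is orthogonal.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    projected = [[sum(a * b for a, b in zip(r, c)) for c in components]
                 for r in centered]
    return components, projected

rng = random.Random(0)
# Hypothetical 3D data: most variation along (1, 1, 0), very little along the third axis.
rows = []
for _ in range(500):
    a, b, c = rng.gauss(0, 3), rng.gauss(0, 1), rng.gauss(0, 0.3)
    rows.append([a + b, a - b, c])

components, projected = pca_2d(rows)
# The first component is expected to point roughly along (1, 1, 0) / sqrt(2), up to sign.
print([round(x, 2) for x in components[0]])
```

The 2D coordinates in `projected` are what a PCA scatter plot would show: x is the position along the first component, y along the second.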

19 of 69

One dataset, visualized 25 ways

http://flowingdata.com/2017/01/24/one-dataset-visualized-25-ways

“You must help the data focus and get to the point. Otherwise, it just ends up rambling about what it had for breakfast this morning and how the coffee wasn’t hot enough.”

20 of 69

Part 2

Principles and best practices

21 of 69

Instructive coffee table books by Edward Tufte

22 of 69

Perception of magnitudes

Which is brighter?

(134, 134, 134)

(144, 144, 144)

23 of 69

Just noticeable difference (JND)

  • Weber’s law: ΔI / I = k

  • I: intensity; ΔI: smallest increase over I that is needed to notice a difference; k: constant
  • Required increase ΔI depends on original intensity I
  • Most continuous variations in stimuli are perceived in discrete (multiplicative) steps
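
A small numeric illustration, assuming the usual statement of Weber’s law, ΔI / I = k. The value k = 0.1 is a made-up Weber fraction (real values depend on the stimulus), and the function names are hypothetical.

```python
def jnd(intensity, k):
    """Smallest noticeable increase at a given intensity under Weber's law."""
    return k * intensity

def weber_fraction(old, new):
    """Relative change between two intensities, to be compared against k."""
    return (new - old) / old

def perceptual_steps(start, stop, k):
    """Intensities that are each exactly one JND above the previous one."""
    levels = [start]
    while levels[-1] * (1 + k) <= stop:
        levels.append(levels[-1] * (1 + k))
    return levels

K = 0.1  # hypothetical Weber fraction, for illustration only

# The required increase grows with the starting intensity.
print(jnd(10, K), jnd(100, K))

# Equal perceptual steps form a geometric (multiplicative) series, not an arithmetic one.
levels = perceptual_steps(10, 100, K)
print(len(levels), [round(x, 1) for x in levels[:4]])

# The two grays (134 vs. 144) differ by a relative change of 10/134.
print(round(weber_fraction(134, 144), 3))
```

This is why evenly spaced values on a linear scale do not look evenly spaced: the same absolute step is easy to see at low intensity and hard to see at high intensity.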

24 of 69

Compare area of circles

26 of 69

Perception of magnitudes

Ranked from most to least accurately perceived:

  • Position (most accurate)
  • Length
  • Slope
  • Angle
  • Area
  • Volume
  • Color hue-saturation-density (least accurate)

Cleveland & McGill (1984)

Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods

27 of 69

Choose your axes wisely!

Time series of pageviews of the Wikipedia article about “Coronavirus”

(linear y-axis vs. logarithmic y-axis)

28 of 69

Choose your axes wisely:

Visualizing heavy-tailed distributions

Linear x-axis

Logarithmic x-axis
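
A sketch of the binning that usually goes with a logarithmic x-axis: bin edges are spaced by a constant ratio instead of a constant difference, and counts are divided by bin width so that wide bins in the tail are not visually inflated. The helper names and the sample values are assumptions for illustration; all values are assumed to lie within [x_min, x_max] and to be positive.

```python
import math

def log_bin_edges(x_min, x_max, n_bins):
    """Bin edges that are evenly spaced on a logarithmic axis."""
    ratio = (x_max / x_min) ** (1 / n_bins)
    return [x_min * ratio ** i for i in range(n_bins + 1)]

def log_binned_density(values, edges):
    """Fraction of values per bin, divided by the (linear) bin width."""
    counts = [0] * (len(edges) - 1)
    log_lo = math.log(edges[0])
    log_width = (math.log(edges[-1]) - log_lo) / len(counts)
    for v in values:
        idx = min(int((math.log(v) - log_lo) / log_width), len(counts) - 1)
        counts[idx] += 1
    n = len(values)
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]

# Made-up values spanning three orders of magnitude.
values = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]
edges = log_bin_edges(1, 1000, 3)            # approximately [1, 10, 100, 1000]
densities = log_binned_density(values, edges)
print([round(e, 3) for e in edges])
print(densities)
```

With equal-width linear bins, almost all of these values would land in the first bin and the tail would be a row of empty or single-count bins.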

29 of 69

Heavy-tailed data: power laws

  • Probability of x: p(x) = C x^(–α)
    • Very large values are rare, “but not very rare”
    • Body size vs. city size
    • Many natural phenomena follow power laws (e.g., # of friends)
    • For dealing with them, need to know some tricks
    • E.g., for small α, mean & var = ∞ (infinite mean for α ≤ 2, infinite variance for α ≤ 3) → use median!
    • E.g., straight line on log-log axes:
      • y = C x^(–α)  ⇒  log(y) = log(C) – α log(x)
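
A sketch of both tricks, assuming a density of the form p(x) ∝ x^(–α) for x ≥ x_min with α > 1. The sampler uses inverse-transform sampling, and α = 2, x_min = 1, the constant C, and the seed are arbitrary choices for illustration.

```python
import math
import random
import statistics

def sample_power_law(alpha, x_min, n, rng):
    """Inverse-transform samples from p(x) ∝ x^(-alpha), x >= x_min, alpha > 1."""
    return [x_min * (1 - rng.random()) ** (-1 / (alpha - 1)) for _ in range(n)]

def loglog_slope(x1, y1, x2, y2):
    """Slope of the line through two points after taking logs of both axes."""
    return (math.log(y2) - math.log(y1)) / (math.log(x2) - math.log(x1))

rng = random.Random(1)
samples = sample_power_law(alpha=2.0, x_min=1.0, n=100_000, rng=rng)

# For alpha = 2 the theoretical mean is infinite, so the sample mean is dominated by
# a few huge values and does not settle. The median is stable; under this
# parameterization its theoretical value is x_min * 2 ** (1 / (alpha - 1)) = 2.
print("mean  :", statistics.mean(samples))
print("median:", statistics.median(samples))

# y = C * x^(-alpha) is a straight line with slope -alpha on log-log axes.
C, alpha = 3.0, 2.5
slope = loglog_slope(1.0, C * 1.0 ** -alpha, 100.0, C * 100.0 ** -alpha)
print("slope :", slope)
```

Rerunning with a different seed changes the mean noticeably but barely moves the median, which is the practical reason for reporting the median on heavy-tailed data.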

PDF

30 of 69

Commercial break

31 of 69

Heavy-tailed data: power laws

  • Complementary cumulative distribution function (CCDF):

[Figure panels: PDF; PDF (with binned x-axis)]

POLLING TIME

  • Do you recall how the CDF is defined?
  • Scan QR code or go to https://go.epfl.ch/ada2025-lec3-poll

P(x) := Pr{X >= x}

32 of 69

Heavy-tailed data: power laws

  • Complementary cumulative distribution function (CCDF): P(x) := Pr{X >= x}
  • CCDF of power law is also a power law (with exponent α – 1)
  • CCDF plot is monotonically decreasing (even without binning)
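
Why the exponent drops by one: a short derivation, assuming the density has the form p(t) = C t^(–α) for t ≥ x_min with α > 1 (the constant C and the lower cutoff x_min are assumptions of this sketch).

```latex
P(x) = \Pr\{X \ge x\}
     = \int_x^{\infty} C\,t^{-\alpha}\,dt
     = \left[\frac{C\,t^{-(\alpha-1)}}{-(\alpha-1)}\right]_x^{\infty}
     = \frac{C}{\alpha-1}\,x^{-(\alpha-1)},
\qquad \alpha > 1,\ x \ge x_{\min}.
```

On log-log axes the CCDF is therefore also a straight line, with slope –(α – 1) instead of –α.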

[Figure panels: PDF; PDF (with binned x-axis); CCDF]

33 of 69

Answer fast: which time series has a higher mean value?

34 of 69

Answer fast: which time series has a higher mean value?

Use consistent axes!

35 of 69

Label your axes!

36 of 69

THINK FOR A MINUTE:

How could we show details of both time series without using different y-axes?

(Feel free to discuss with your neighbor.)

37 of 69

logarithmic!

38 of 69

Which beer is more popular, Guinness or Paulaner?

Show data uncertainty!

“error bars”
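
A sketch of how error bars could be computed for such a comparison. It uses a normal-approximation 95% confidence interval for the mean (mean ± 1.96 standard errors), which is an assumption of this example and rough for small samples; a bootstrap interval is a common alternative. The ratings are invented and are not the data behind the slide.

```python
import math
import statistics

def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) of an approximate 95% CI for the mean."""
    m = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # standard error of the mean
    return m, m - z * sem, m + z * sem

def intervals_overlap(a, b):
    """True if the (lower, upper) parts of two (mean, lower, upper) tuples overlap."""
    return a[1] <= b[2] and b[1] <= a[2]

# Hypothetical ratings on a 1-5 scale, for illustration only.
guinness = [3.9, 4.1, 3.7, 4.4, 4.0, 3.8, 4.2, 4.1]
paulaner = [4.0, 4.3, 3.6, 4.5, 4.1, 3.9, 4.4, 3.8]

g_ci = mean_with_ci(guinness)
p_ci = mean_with_ci(paulaner)
print("Guinness:", [round(x, 3) for x in g_ci])
print("Paulaner:", [round(x, 3) for x in p_ci])
print("intervals overlap:", intervals_overlap(g_ci, p_ci))
```

Paulaner has the higher mean in this made-up sample, but the intervals overlap heavily, so a bar chart without error bars would suggest a difference that the data do not support. Overlap is only a visual hint, not a formal test.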

39 of 69

Consider using small multiples!

Use colors consistently!

Media attention

40 of 69

Use colors wisely!

Choose colors based on the information you want to convey

41 of 69

42 of 69

Use colorblind-safe palettes!

  • Remember: roughly 8% of males (about 1 in 12) have some form of colorblindness

43 of 69

Use data ink wisely! Avoid chart junk!

44 of 69

Use visual contrast

  • Use visual contrast to highlight the important information
  • Vary the size, shape, position, orientation, or color of an element to make the key part of the figure the most visually prominent
  • Avoid incorporating every form of visual contrast

45 of 69

The good, the bad and the ugly

46 of 69

Which principles and best practices do these graphics violate?

Courtesy of viz.wtf

47 of 69

Part 3

A (small) selection of use cases for data visualization

48 of 69

Use case:

Presenting scientific results

49 of 69

Multimodal data

  • Two or more distinct peaks in a histogram often suggest 2 or more distinct populations of samples.
  • But don’t guess! Explore further by using, e.g., color and a histogram of multiple populations (p.t.o.).

Use case:

Data wrangling

50 of 69

Multimodal data

Explore further by using, e.g., color and a histogram of multiple populations

Use case:

Data wrangling

51 of 69

Weird data

  • Maintain a theory of what the data should look like.
  • Some data is very hard to explain.
  • Never just blink it away!
  • First, assume a bug. Try to fix it.
  • If not a bug: you might have made an interesting discovery!
  • Some of science’s most important findings were made by not ignoring weird data, but dwelling on it!

Use case:

Data wrangling

52 of 69

NY Times interactive visualizations (recession/recovery 2014):

http://www.nytimes.com/interactive/2014/06/05/upshot/how-the-recession-reshaped-the-economy-in-255-charts.html

And 2014 “the year in interactive storytelling”:

http://www.nytimes.com/interactive/2014/12/29/us/year-in-interactive-storytelling.html?_r=0

NY Times graphics are a great source of best practices in viz (except for when they’re not…)

Use case:

Journalism

53 of 69

Hans Rosling:

200 countries, 200 years, 4 minutes

https://www.youtube.com/watch?v=jbkSRLYSojo

Use case:

Educating the public

54 of 69

Charles Joseph Minard, 1869: Napoleon’s march

According to Tufte: “It may well be the best statistical graphic ever drawn.”

5 variables: army size, location, dates, direction, temperature during retreat

Use case:

Give new perspectives

55 of 69

Tools

(remaining slides for your personal perusal)

56 of 69

Interactive toolkits: D3

Without doubt, the most widely used interactive visualization framework is D3.

Note from the authors: D3 is intentionally a low-level system. During the early design of D3, we even referred to it as a "visualization kernel" rather than a "toolkit" or "framework"

57 of 69

Interactive toolkits: Vega

Vega is a “visualization grammar” developed on top of D3.js

It specifies graphics in JSON format.

58 of 69

Interactive toolkits: Vincent

Vincent is a Python-to-Vega translator.

Trivia question: why is it called Vincent? Hint: Vincent+Vega= ?

60 of 69

Bokeh: another interactive viz library

Bokeh is an independent Viz library focused more heavily on big data visualization. Has both Python and Scala bindings.

61 of 69

Visualizing maps: Folium

More in Friday’s lab session!

63 of 69

> 2 variables: parallel-coord. plots

Color, x, y

Color variable is categorical, others arbitrary

64 of 69

> 2 variables: radar charts

  • Similar to parallel-coord. plots
  • Doesn’t pretend that x axis has meaningful order
  • Also good for periodic data

65 of 69

Heavy-tailed data: power laws

  • Smart trick for plotting CCDF of any distribution:
    • x-axis: data sorted in ascending order
    • y-axis: (n, n–1, …, 1)/n, i.e., the i-th smallest value gets y = (n – i + 1)/n (where n is the number of data points)
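
The trick above can be sketched in a few lines: sort the data, then pair the i-th smallest value with the fraction of points that are at least as large. The helper name `empirical_ccdf` and the toy data are assumptions for illustration.

```python
def empirical_ccdf(values):
    """Return (xs, ys) where ys[i] is the fraction of data points >= xs[i]."""
    xs = sorted(values)
    n = len(xs)
    ys = [(n - i) / n for i in range(n)]   # n/n, (n-1)/n, ..., 1/n
    return xs, ys

xs, ys = empirical_ccdf([5, 1, 3, 2])
print(xs)  # [1, 2, 3, 5]
print(ys)  # [1.0, 0.75, 0.5, 0.25]
```

No binning is needed, which is why the CCDF plot stays smooth and monotonically decreasing even in the sparse tail. If the data contain ties, the same x appears with several y-values, and the largest of them is Pr{X >= x}.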

CCDF

66 of 69

Interactive chart design: simplifying

  • With interactive charts you can keep things very simple by hiding and dynamically revealing important structure.
  • On an interactive chart, you reveal the information most useful for navigating the chart.

67 of 69

Use structure!

Gestalt psychology principles (1912)

68 of 69

A case for ugly visualizations

People instinctively gravitate to attractive visualizations, which also have a better chance of getting on the cover of a journal.

But does this conflict with the goals of visualization?

  • Rapid exploration
  • Focus on most important details
  • Easy and fast to develop and customize

69 of 69

Guide your audience!

17th March (St. Patrick’s Day)
