2 of 81

“...that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.” - Edward Tufte

“...graphical excellence requires telling the truth about the data.” - Edward Tufte

edwardtufte.com

3 of 81

Data Visualization

Title
Abstract
Introduction
Methods
Results
Discussion
Citations
Tables
Figures

Summarize the data

tell us what the important trends are

Illustrate and support

details, statistics
tables, figures

4 of 81

What is a graph

A graph/figure/plot maps values in a data set to visual coordinates. It can reveal patterns that are hard to see in the raw numbers alone.
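One way to see this is Anscombe's quartet, which ships with base R as the `anscombe` data frame. The sketch below is an illustration added here, not part of the original slides: it prints near-identical summary statistics for four x/y pairs and then plots them, where the differences become obvious.

```r
# Anscombe's quartet: columns x1..x4 and y1..y4
data(anscombe)

# Means and correlations are nearly identical across the four pairs
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
})
print(round(stats, 2))

# Mapping the same values to coordinates shows four different patterns
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```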

7 of 81

Components of a graph

y-axis - Vertical axis. Often used to represent the response variable.

x-axis - Horizontal axis. Often used to represent the predictor variable.

coordinates - The x- and y-axes together form the coordinate system of the plot.

data - The values of y for each value of x. Commonly shown as points.

colors - Used to distinguish different groups of data. Can be used to represent a separate response variable.

shapes - Used to distinguish different groups of data. Can be used to represent a separate response variable.

lines - Often used to show trends in the data.
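As a rough sketch, each component above corresponds to one piece of a ggplot2 call. The choice of `Sepal.Length` as predictor and `Petal.Length` as response is an assumption made only for illustration.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length,     # x-axis: predictor (assumed)
                      y = Petal.Length,     # y-axis: response (assumed)
                      color = Species,      # colors distinguish groups
                      shape = Species)) +   # shapes distinguish the same groups
  geom_point() +                            # data: one point per flower
  geom_smooth(method = "lm", se = FALSE)    # lines: a linear trend per species
print(p)
```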

8 of 81

Components of a good graph

It shows all of the data
It shows the data summaries
The text is meaningful
The text is legible
It has a good figure caption
All of that looks good in the paper

9 of 81

What can you learn about the study just from the graph?

10 of 81

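1. It shows all of the data

library(dplyr)     # provides %>% and filter()
library(ggplot2)

# Every observation is plotted
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05)

# Same plot after dropping sepals longer than 6 cm: part of the data is hidden
iris %>%
  filter(Sepal.Length <= 6) %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05) +
  ylim(4, 8)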

11 of 81

2. It shows the data summaries

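iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +   # summary: median and quartiles per species
  geom_jitter(width = 0.05)              # raw data drawn on top

# For comparison: the raw data without the summary
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05)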

12 of 81

3. The text is meaningful

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05) +
  labs(x = "Plant Species",
       y = "Sepal Length (cm)")

13 of 81

4. The text is legible

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05) +
  labs(x = "Plant Species",
       y = "Sepal Length (cm)")
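One possible way to make the text larger is to raise the theme's base font size. This is a sketch, not necessarily what the original slide did; the value 18 is an arbitrary choice (the ggplot2 default is 11).

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05) +
  labs(x = "Plant Species", y = "Sepal Length (cm)") +
  theme_gray(base_size = 18)   # scales all text in the plot; 18 is arbitrary
print(p)
```

Check legibility at the size the figure will have in the paper, not at the size it has on screen.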

14 of 81

5. It has a good figure legend or caption

Poor: Figure 1. Results from data.

Better: Figure 1. Comparison of sepal length among three species of iris plants.

15 of 81

6. All of it looks good in the paper

16 of 81

How to do this poorly

Deception
Not deceptive, just confusing
Not deceptive, not confusing, but also not revealing the data

24 of 81

It’s all there, but it hurts to look at. Low resolution, corners stretched.

25 of 81

[Figure axes: Life expectancy, GDP]

27 of 81

How to do this well

Show the full scale
Consider what you want to communicate
Show all of the data
Test it out on a friend
Describe magnitudes in the text

28 of 81

Some tricks

Use the log scale

Sort axis by value

29 of 81

Use the log scale

Sort axis by value

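# d: the data frame used on these slides, with columns year, country,
# and gdpPercap (its definition is not shown here)
d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)")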

30 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)") +
  scale_y_log10()

31 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)") +
  scale_y_log10() +
  geom_boxplot()   # no group aesthetic: a continuous x gives a single box

32 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)") +
  scale_y_log10() +
  geom_boxplot(aes(group = year))   # one box per year

33 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = country, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)")

34 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = country, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)") +
  geom_boxplot(aes(group = country))

35 of 81

Use the log scale

Sort axis by value

d %>%
  # reorder() sorts countries by their mean gdpPercap
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country))

36 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country))

37 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country)) +
  coord_flip()   # countries on the vertical axis

38 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country)) +
  coord_flip() +
  theme(axis.text.y = element_text(size = 2))   # shrink the country labels

39 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country)) +
  coord_flip() +
  theme(axis.text.y = element_blank())   # drop the country labels entirely

40 of 81

Use the log scale

Sort axis by value

41 of 81

What are we supposed to say about this graph?

Results

???

How much does y change across groups?
What is the minimum change?
What is the maximum change?
What is the most important result?
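The questions above can be answered with numbers to quote in the Results text. A minimal sketch, using the built-in `iris` data from the earlier slides; the variable names are chosen here for illustration.

```r
# Mean sepal length per species
means <- tapply(iris$Sepal.Length, iris$Species, mean)
print(round(means, 2))

# All pairwise differences between group means
diffs <- outer(means, means, "-")
pairwise <- abs(diffs[upper.tri(diffs)])

min(pairwise)   # smallest difference between two groups
max(pairwise)   # largest difference between two groups
```

Reporting these magnitudes, with units, tells the reader how much y changes across groups rather than only that it changes.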

48 of 81

Tables

Three parts:

The data summaries
The lines (visual guides)
The description (aka legend/caption/title)
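The data summaries in a table can be computed directly from the raw data. The sketch below assumes dplyr is available; the column names `n`, `mean_cm`, and `sd_cm` are made up for this example.

```r
library(dplyr)

# One row per species: sample size, mean, and standard deviation
summary_table <- iris %>%
  group_by(Species) %>%
  summarise(n = n(),
            mean_cm = mean(Sepal.Length),
            sd_cm = sd(Sepal.Length))
print(summary_table)
```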

49 of 81

Tables

Table legends go on top of the table

Three parts:

The data summaries
The lines (visual guides)
The description (aka legend/caption/title)

50 of 81

Table legends go on top of the table?

52 of 81

Three problems

And a solution

53 of 81

Spurious correlations

Ecological fallacy

Statistical significance

54 of 81

Spurious correlations

Shockingly easy to find

Reveals no useful information

No plausible underlying biology

55 of 81

Spurious correlations

Shockingly easy to find. Can you guess the predictor variables?

64 of 81

Spurious correlations

Shockingly easy to find

Your brain is a pattern-finding machine

You are not alone
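A small simulation suggests how easy they are to find. Everything below is random noise; the seed, sample size, and number of candidate predictors are arbitrary choices made for illustration.

```r
set.seed(1)        # arbitrary seed, for reproducibility
n_obs  <- 20       # e.g. 20 yearly measurements
n_vars <- 1000     # 1000 unrelated candidate predictors

y <- rnorm(n_obs)                                  # response: pure noise
X <- matrix(rnorm(n_obs * n_vars), nrow = n_obs)   # predictors: pure noise

r <- as.vector(cor(X, y))   # correlation of each predictor with y
max(abs(r))                 # strongest correlation found by chance alone
sum(abs(r) > 0.5)           # number of "strong" correlations in pure noise
```

With enough candidate predictors and few observations, some will correlate strongly with the response even though none is related to it.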

65 of 81

Ecological fallacy

Using data from populations to infer something about individual risk

Underlying mechanisms can be worth studying

Can generate useful future hypotheses
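The fallacy can be shown with a toy data set in which the population-level and individual-level relationships point in opposite directions. All numbers are invented, and `d_sim` and `pop_means` are names used only for this sketch.

```r
# Three hypothetical populations of five individuals each
d_sim <- do.call(rbind, lapply(1:3, function(g) {
  x <- g * 10 + (-2:2)   # exposure, centered on 10, 20, 30
  y <- g * 10 - (-2:2)   # outcome, falls as x rises within a population
  data.frame(population = g, x = x, y = y)
}))

# Population-level view: mean x and mean y rise together
pop_means <- aggregate(cbind(x, y) ~ population, data = d_sim, FUN = mean)
cor_between <- cor(pop_means$x, pop_means$y)

# Individual-level view: within each population the relationship is reversed
cor_within <- sapply(split(d_sim, d_sim$population),
                     function(p) cor(p$x, p$y))

cor_between
cor_within
```

Inferring individual risk from the population means here would get the direction of the relationship wrong.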

66 of 81

https://www.scientificamerican.com/article/graphics-that-seem-clear-can-easily-be-misread/

68 of 81

Spurious correlations

Ecological fallacy

Statistical significance

71 of 81

P-values and statistical significance

P-value

probability of observing a given value of a test statistic, or a more extreme value, under the assumption that the null hypothesis is exactly true

A p-value is not:

a measure of importance

the probability that the result is due to chance alone

the probability of your hypothesis (or the null)
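The definition of the p-value can be checked by hand against `t.test()`. This is a sketch using the built-in `iris` data; the choice of the two species is arbitrary.

```r
# Two-sample t-test on sepal length (Welch test by default)
a <- iris$Sepal.Length[iris$Species == "versicolor"]
b <- iris$Sepal.Length[iris$Species == "virginica"]
tt <- t.test(a, b)

# The same p-value from the definition: the probability, under the null,
# of a t statistic at least as extreme as the observed one (two-sided)
t_obs <- unname(tt$statistic)
df    <- unname(tt$parameter)
p_manual <- 2 * pt(-abs(t_obs), df)

c(t_test = tt$p.value, manual = p_manual)
```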

72 of 81

A p-value is not a measure of importance
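A simulated example, with made-up numbers, shows why a small p-value says little about importance: with a large enough sample, a negligible difference becomes highly significant.

```r
set.seed(42)   # arbitrary seed
n <- 1e6       # very large sample per group (arbitrary)

control   <- rnorm(n, mean = 0,    sd = 1)
treatment <- rnorm(n, mean = 0.01, sd = 1)   # true difference: 0.01 SD

res <- t.test(treatment, control)
effect <- unname(res$estimate[1] - res$estimate[2])

res$p.value   # very small
effect        # estimated difference, still only about 0.01 SD
```

The p-value shrinks with sample size while the effect size does not, so the effect size is what should be described in the text.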

81 of 81

Keep it simple
Show the raw data
Label axes clearly
Use ggplot()

The “gg” in ggplot() stands for the Grammar of Graphics. It comes with a built-in set of design principles that make it harder (but certainly not impossible) to make bad graphs.
