1 of 81

Data Visualization

2 of 81

“...that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.” - Edward Tufte

“...graphical excellence requires telling the truth about the data.” - Edward Tufte

edwardtufte.com

3 of 81

Data Visualization

  • Title
  • Abstract
  • Introduction
  • Methods
  • Results
  • Discussion
  • Citations
  • Tables
  • Figures
  1. Summarize the data
    • tell us what the important trends are
  2. Illustrate and support
    • details, statistics
    • tables, figures

4 of 81

What is a graph

A graph/figure/plot maps values in a data set to visual coordinates. It reveals patterns that cannot be understood just from the data.

7 of 81

Components of a graph

y-axis - Vertical axis. Often used to represent the response variable.

x-axis - Horizontal axis. Often used to represent the predictor variable.

coordinates - The x- and y-axes together define the coordinate system of the plot.

data - The values of y for each value of x. Commonly shown as points.

colors - Used to distinguish different groups of data. Can be used to represent a separate response variable.

shapes - Used to distinguish different groups of data. Can be used to represent a separate response variable.

lines - Often used to show trends in the data.
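As a rough illustration of how these components map onto code, the sketch below builds one plot with ggplot2. The built-in iris data and the choice of Petal.Length as the predictor are assumptions made only for this example.

```r
library(ggplot2)

# Assumption: iris (built into R) stands in for any data set with a
# predictor, a response, and a grouping variable.
p <- ggplot(iris, aes(x = Petal.Length,    # x-axis: predictor
                      y = Sepal.Length,    # y-axis: response
                      color = Species,     # colors distinguish groups
                      shape = Species)) +  # shapes distinguish groups
  geom_point() +                           # data shown as points
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE)                  # lines show the trend in each group

print(p)
```

Each component in the list corresponds to one aesthetic or one layer, so removing a line removes that component from the figure.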

8 of 81

Components of a good graph

  1. It shows all of the data
  2. It shows the data summaries
  3. The text is meaningful
  4. The text is legible
  5. It has a good figure caption
  6. All of that looks good in the paper

9 of 81

What can you learn about the study just from the graph?

10 of 81

1. It shows all of the data

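library(dplyr)    # provides %>% and filter()
library(ggplot2)

# All 150 observations are plotted
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05)

# Same axes, but sepals longer than 6 cm are silently dropped
iris %>%
  filter(Sepal.Length <= 6) %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05) +
  ylim(4, 8)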

11 of 81

2. It shows the data summaries

# With a summary layer: boxplots behind the raw points
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05)

# Without a summary layer: raw points only
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05)

12 of 81

3. The text is meaningful

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05) +
  # Axis titles name the variable and its units
  labs(x = "Plant Species",
       y = "Sepal Length (cm)")

13 of 81

4. The text is legible

iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05) +
  labs(x = "Plant Species",
       y = "Sepal Length (cm)")
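One possible way to make the text legible is to raise the base font size of the theme. This is a minimal sketch, assuming the default gray theme is acceptable; the value 18 is an arbitrary choice for illustration, not a recommendation from the slides.

```r
library(ggplot2)

# Assumption: base_size = 18 is an illustrative value; choose a size that
# stays readable at the final printed width of the figure.
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05) +
  labs(x = "Plant Species", y = "Sepal Length (cm)") +
  theme_gray(base_size = 18)   # default theme with larger text throughout

print(p)
```

Setting base_size scales axis titles, tick labels, and legend text together, which is usually simpler than sizing each element separately.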

14 of 81

5. It has a good figure legend or caption

Poor: Figure 1. Results from data.

Better: Figure 1. Comparison of sepal length among three species of iris plants.

15 of 81

6. All of it looks good in the paper

16 of 81

How to do this poorly

  1. Deception
  2. Not deceptive, just confusing
  3. Not deceptive, not confusing, but also not revealing the data

24 of 81

It’s all there, but it hurts to look at. Low resolution, corners stretched.

25 of 81

[Figure: plot with axes labeled Life expectancy and GDP]

27 of 81

How to do this goodly

  • Show the full scale
  • Consider what you want to communicate
  • Show all of the data
  • Test it out on a friend
  • Describe magnitudes in the text
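The first point, showing the full scale, can be sketched as follows. The data frame d_demo and its values are hypothetical and exist only to show how a truncated axis exaggerates a small difference; expand_limits() is one of several ways to extend an axis in ggplot2.

```r
library(ggplot2)

# Hypothetical data for illustration only: two groups that differ by 2%.
d_demo <- data.frame(group = c("A", "B"), value = c(98, 100))

# Default axis: the y-axis spans only 98 to 100, so the gap looks large.
p_zoomed <- ggplot(d_demo, aes(x = group, y = value)) +
  geom_point(size = 3)

# Same data with the y-axis extended to zero, so the gap is in proportion.
p_full <- p_zoomed + expand_limits(y = 0)

print(p_zoomed)
print(p_full)
```

Whether zero belongs on the axis depends on what you want to communicate, so it helps to compare both versions before choosing.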

28 of 81

Some tricks

Use the log scale

Sort axis by value

29 of 81

Use the log scale

Sort axis by value

# d: a data frame with country, year, and gdpPercap columns
d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)")

30 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)") +
  # log10 spacing on the y-axis; tick labels stay in the original units
  scale_y_log10()

31 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)") +
  scale_y_log10() +
  # no group aesthetic: the continuous x gives a single box for all years
  geom_boxplot()

32 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)") +
  scale_y_log10() +
  # grouping by year gives one box per year
  geom_boxplot(aes(group = year))

33 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = country, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)")

34 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = country, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)") +
  geom_boxplot(aes(group = country))

35 of 81

Use the log scale

Sort axis by value

d %>%
  # reorder() sorts countries by their mean gdpPercap
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country))

36 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  # smaller points keep the boxes visible
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country))

37 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country)) +
  # horizontal boxes leave room for the country names
  coord_flip()

38 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country)) +
  coord_flip() +
  # shrink the country labels so they do not overlap
  theme(axis.text.y = element_text(size = 2))

39 of 81

Use the log scale

Sort axis by value

d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product per capita (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country)) +
  coord_flip() +
  # or drop the country labels entirely
  theme(axis.text.y = element_blank())

41 of 81

What are we supposed to say about this graph?

Results

???

  • How much does y change across groups?
  • What is the minimum change?
  • What is the maximum change?
  • What is the most important result?
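The questions above can be answered with a few summary numbers before writing the Results text. The following is a minimal sketch in base R, assuming the iris sepal-length comparison from earlier slides stands in for "y" and "groups".

```r
# Assumption: iris Sepal.Length by Species stands in for y across groups.
group_means <- sapply(split(iris$Sepal.Length, iris$Species), mean)

# All pairwise differences between group means
pair_idx <- combn(names(group_means), 2)
diffs <- abs(group_means[pair_idx[1, ]] - group_means[pair_idx[2, ]])
names(diffs) <- paste(pair_idx[1, ], "vs", pair_idx[2, ])

round(group_means, 2)   # how much y changes across groups
min(diffs)              # minimum change
max(diffs)              # maximum change

# Largest difference as a percentage of the smallest group mean
100 * max(diffs) / min(group_means)
```

A Results sentence can then state the magnitude directly, for example that virginica sepals average about 1.6 cm longer than setosa sepals.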

42 of 81

Tables

Three parts:

  1. The data summaries
  2. The lines (visual guides)
  3. The description (aka legend/caption/title)

Table legends go on top of the table.
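The three parts can be sketched with knitr::kable(). This is only one possible approach: the knitr package, the iris summary, and the caption text are assumptions for illustration.

```r
library(knitr)

# Part 1, the data summaries: one row per species (iris is a stand-in)
sepal_summary <- data.frame(
  Species = levels(iris$Species),
  Mean    = round(as.vector(tapply(iris$Sepal.Length, iris$Species, mean)), 2),
  SD      = round(as.vector(tapply(iris$Sepal.Length, iris$Species, sd)), 2),
  N       = as.vector(table(iris$Species))
)

# Part 2, the lines: the "simple" format draws a rule under the header.
# Part 3, the description: kable() places the caption above the table.
tab <- kable(sepal_summary,
             format = "simple",
             col.names = c("Species", "Mean (cm)", "SD (cm)", "n"),
             caption = "Table 1. Sepal length of three iris species.")
print(tab)
```

Putting the units in the column headers keeps the caption short and the cells free of repeated text.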

50 of 81

Table legends go on top of the table?

52 of 81

Three problems

And a solution

53 of 81

Spurious correlations

Ecological fallacy

Statistical significance

54 of 81

Spurious correlations

Shockingly easy to find

Reveals no useful information

No plausible underlying biology

55 of 81

Spurious correlations

Shockingly easy to find. Can you guess the predictor variables?

64 of 81

Spurious Correlations

  • Shockingly easy to find

  • Your brain is a pattern finding machine

  • You are not alone
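A small simulation can show how easy such correlations are to find. The sketch below is an illustration under stated assumptions: every series is an independent random walk, and the series length (10) and number of candidate predictors (200) are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n_years <- 10    # a short series, like a decade of annual values
n_vars  <- 200   # number of unrelated candidate "predictors"

# A response and 200 predictors, all independent random walks by construction
y <- cumsum(rnorm(n_years))
x <- replicate(n_vars, cumsum(rnorm(n_years)))   # n_years x n_vars matrix

r <- as.vector(cor(x, y))   # correlation of each predictor with y
best <- which.max(abs(r))

sum(abs(r) > 0.8)           # how many unrelated series correlate strongly
r[best]                     # the most "impressive" correlation found

plot(x[, best], y,
     xlab = "Unrelated random walk", ylab = "Response",
     main = sprintf("r = %.2f, from pure noise", r[best]))
```

Because every series is noise, any strong correlation here comes from searching many candidates, not from an underlying mechanism.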

65 of 81

Ecological fallacy

Using data from populations to infer something about individual risk

Underlying mechanisms can be worth studying

Can generate useful future hypotheses
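The fallacy can be illustrated with simulated data in which the population-level and individual-level relationships point in opposite directions. Everything in this sketch (five populations, the slopes, the noise levels) is an assumption chosen to make the contrast visible.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

n_groups <- 5
n_per    <- 50
group    <- rep(seq_len(n_groups), each = n_per)

# By construction: populations with higher average exposure have higher
# average outcomes, but within every population the slope is negative.
group_center <- 2 * group
x <- group_center + rnorm(n_groups * n_per)
y <- 3 * group_center - (x - group_center) + rnorm(n_groups * n_per, sd = 0.5)

# Population-level view: correlation between the group means
mean_x <- as.vector(tapply(x, group, mean))
mean_y <- as.vector(tapply(y, group, mean))
r_population <- cor(mean_x, mean_y)

# Individual-level view: correlation within each population
r_within <- sapply(split(data.frame(x, y), group),
                   function(g) cor(g$x, g$y))

r_population   # strongly positive
r_within       # negative in every population
```

Inferring individual risk from r_population alone would get the direction of the individual-level relationship wrong in this example.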

71 of 81

P-values and statistical significance

P-value

probability of observing a given value of a test statistic, or a more extreme value, under the assumption that the null hypothesis is exactly true

A p-value is not:

  • a measure of importance
  • the probability that the result is due to chance alone
  • the probability of your hypothesis (or the null)
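Two short simulations may help separate the definition from the common misreadings. This is a sketch under stated assumptions: normally distributed data, Welch t-tests, and an arbitrary effect size of 0.02 standard deviations.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

# 1. When the null hypothesis is true, about 5% of tests still give
#    p < 0.05, because the p-value is computed assuming the null.
p_null <- replicate(2000, t.test(rnorm(30), rnorm(30))$p.value)
mean(p_null < 0.05)

# 2. A p-value is not a measure of importance: a trivially small difference
#    becomes "significant" once the sample is large enough.
n <- 200000
a <- rnorm(n, mean = 0)
b <- rnorm(n, mean = 0.02)   # assumed effect: 0.02 standard deviations
tiny_effect <- t.test(a, b)

tiny_effect$p.value          # very small
diff(tiny_effect$estimate)   # yet the estimated difference is about 0.02
```

Reporting the estimated difference and its interval alongside the p-value lets the reader judge whether the effect matters.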

72 of 81

P-value is not a measure of importance

81 of 81

  1. Keep it simple
  2. Show the raw data
  3. Label axes clearly
  4. Use ggplot()

The “gg” in ggplot() stands for the Grammar of Graphics, a set of design principles built into the ggplot2 package. They make it harder to make bad graphs, but certainly not impossible.