Data Visualization
“... that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space” - Edward Tufte
“...graphical excellence requires telling the truth about the data.” - Edward Tufte
edwardtufte.com
What is a graph
A graph (also called a figure or plot) maps values in a data set to visual coordinates. It can reveal patterns that are hard to see in the raw data alone.
Components of a graph
y-axis - Vertical axis. Often used to represent the response variable.
x-axis - Horizontal axis. Often used to represent the predictor variable.
coordinates - The x- and y-axes together form the coordinate system of the plot.
data - The values of y for each value of x. Commonly shown as points.
colors - Used to distinguish different groups of data. Can be used to represent a separate response variable.
shapes - Used to distinguish different groups of data. Can be used to represent a separate response variable.
lines - Often used to show trends in the data.
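As a rough sketch of how the components listed above show up in code, the example below builds one plot with ggplot2 and the built-in iris data set. The choice of variables and the straight-line trend are assumptions made for illustration only.

```r
library(ggplot2)

p <- ggplot(iris,
            aes(x = Petal.Length,   # x-axis: predictor variable
                y = Sepal.Length,   # y-axis: response variable
                color = Species,    # colors: distinguish groups
                shape = Species)) + # shapes: distinguish groups
  geom_point() +                             # data: one point per flower
  geom_smooth(method = "lm", se = FALSE) +   # lines: linear trend per group
  coord_cartesian()                          # coordinates: x and y together
```

Printing `p` draws the plot.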
Components of a good graph
What can you learn about the study just from the graph?
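library(dplyr)    # provides %>% and filter()
library(ggplot2)  # provides ggplot() and the geom_*() layers
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05)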
# Filtering out values above 6 hides part of the data
iris %>%
  filter(Sepal.Length <= 6) %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05) +
  ylim(4, 8)
2. It shows the data summaries
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05)
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05)
3. The text is meaningful
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05) +
  labs(x = "Plant Species",
       y = "Sepal Length (cm)")
4. The text is legible
iris %>%
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05) +
  labs(x = "Plant Species",
       y = "Sepal Length (cm)")
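One possible way to make the text more legible is to raise the base font size through a theme. The sketch below uses theme_classic() with base_size = 18. Both the theme and the size are assumptions, not the settings used for the original slide.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(group = Species)) +
  geom_jitter(width = 0.05) +
  labs(x = "Plant Species",
       y = "Sepal Length (cm)") +
  theme_classic(base_size = 18)  # assumed size; axis text and titles scale from it
```

Printing `p` draws the plot. A larger base_size is usually needed for slides than for a printed page.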
5. Has a good figure legend or caption.
Weak: Figure 1. Results from data.
Better: Figure 1. Comparison of sepal length among three species of iris plants.
6. All of it looks good in the paper
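A minimal sketch of exporting a figure at a fixed size and resolution with ggsave(), so that it does not need to be stretched or upscaled once it is placed in the paper. The file name, the 5 x 4 inch size, and 300 dpi are assumptions; the right values depend on the journal or template.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.05)

# Hypothetical file name; tempdir() is used only to keep the sketch self-contained
out_file <- file.path(tempdir(), "figure1.png")

# Fixing width and height sets the aspect ratio; dpi sets the resolution
ggsave(out_file, plot = p, width = 5, height = 4, units = "in", dpi = 300)
```

Inserting the saved file at its native size, rather than dragging its corners, keeps the proportions intact.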
How to do this poorly
It’s all there, but it hurts to look at. Low resolution, corners stretched.
[Figure: life expectancy and GDP]
How to do this goodly
Some tricks
Use the log scale
Sort axis by value
Use the log scale
d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)")
d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)") +
  scale_y_log10()
# If year is numeric, geom_boxplot() without a group draws a single box
d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)") +
  scale_y_log10() +
  geom_boxplot()
# Grouping by year gives one box per year
d %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)") +
  scale_y_log10() +
  geom_boxplot(aes(group = year))
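A small numeric sketch of why the log scale helps when values span several orders of magnitude. The numbers are hypothetical and chosen only to make the spacing easy to see; they are not taken from the data set `d`.

```r
# Hypothetical values spanning several orders of magnitude
gdp <- c(300, 1000, 3000, 10000, 30000, 100000)

# On the raw scale, the gaps between neighbours grow from 700 to 70000,
# so the small values end up squeezed together at the bottom of the axis
raw_gaps <- diff(gdp)

# On the log10 scale, the same values are roughly evenly spaced
log_gaps <- diff(log10(gdp))

print(raw_gaps)
print(round(log_gaps, 2))
```

scale_y_log10() applies this transformation to the axis while keeping the tick labels in the original units.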
Sort axis by value
d %>%
  ggplot(aes(x = country, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)")
d %>%
  ggplot(aes(x = country, y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)") +
  geom_boxplot(aes(group = country))
# reorder() sorts the countries by gdpPercap instead of alphabetically
d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point() +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country))
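To see what reorder() does to the axis order, the sketch below applies it to a tiny hypothetical data frame (the `toy` data and its values are made up for illustration). By default reorder() sorts the factor levels by the mean of the second argument within each level.

```r
# Hypothetical toy data: three countries, two observations each
toy <- data.frame(
  country   = factor(c("A", "A", "B", "B", "C", "C")),
  gdpPercap = c(900, 1100, 200, 400, 5000, 7000)
)

# Default factor levels are alphabetical
default_order <- levels(toy$country)

# reorder() sorts levels by the per-country mean: B (300), A (1000), C (6000)
sorted_order <- levels(reorder(toy$country, toy$gdpPercap))

# FUN selects a different summary, such as the median
median_order <- levels(reorder(toy$country, toy$gdpPercap, FUN = median))

print(default_order)
print(sorted_order)
```

ggplot2 draws a discrete axis in factor-level order, which is why wrapping the x variable in reorder() sorts the axis.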
d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country))
# coord_flip() puts the countries on the vertical axis
d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country)) +
  coord_flip()
# Shrink the country labels so they do not overlap
d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country)) +
  coord_flip() +
  theme(axis.text.y = element_text(size = 2))
# Or drop the country labels entirely
d %>%
  ggplot(aes(x = reorder(country, gdpPercap), y = gdpPercap)) +
  geom_point(size = 0.2) +
  labs(y = "Gross Domestic Product (US dollars 2007)",
       x = "Country") +
  geom_boxplot(aes(group = country)) +
  coord_flip() +
  theme(axis.text.y = element_blank())
What are we supposed to say about this graph?
Results
???
Tables
Three parts:
Table legends go on top of the table
Three problems
And a solution
Spurious correlations
Ecological fallacy
Statistical significance
Spurious correlations
Shockingly easy to find
Reveals no useful information
No plausible underlying biology
Can you guess the predictor variables?
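A small simulation can illustrate how easy spurious correlations are to find. The sketch below correlates one random response with 1000 unrelated random predictors; the sample size, the number of predictors, and the seed are all assumptions. At a 0.05 threshold, about 5% of the predictors (roughly 50) are expected to look "significant" by chance alone.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n_obs        <- 20    # observations per variable (assumed)
n_predictors <- 1000  # unrelated candidate predictors (assumed)

response   <- rnorm(n_obs)
predictors <- matrix(rnorm(n_obs * n_predictors), nrow = n_obs)

# P-value of the correlation between the response and each random predictor
p_values <- apply(predictors, 2, function(x) cor.test(x, response)$p.value)

# Number of "significant" correlations found in pure noise
n_significant <- sum(p_values < 0.05)

# Strongest correlation among the noise predictors
strongest <- max(abs(cor(predictors, response)))

print(n_significant)
print(strongest)
```

None of these predictors has any relationship with the response, so every "hit" is spurious.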
Ecological fallacy
Using population-level data to infer something about individual risk
Underlying mechanisms can be worth studying
Can generate useful future hypotheses
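The simulation below is a sketch of how the ecological fallacy can arise. It uses three hypothetical populations in which average exposure and average outcome rise together, while within each population higher exposure goes with a lower outcome. All numbers are invented for illustration.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Three hypothetical populations of 200 individuals each
group_means <- c(1, 5, 9)
pop      <- rep(1:3, each = 200)
exposure <- rnorm(600, mean = group_means[pop], sd = 1)

# Population averages rise with exposure, but within a population
# more exposure goes with a LOWER outcome (slope of -0.5)
outcome <- 2 * group_means[pop] -
  0.5 * (exposure - group_means[pop]) +
  rnorm(600, sd = 0.5)

# Population-level view: correlate the three group averages
cor_between <- cor(as.vector(tapply(exposure, pop, mean)),
                   as.vector(tapply(outcome, pop, mean)))

# Individual-level view: correlation within each population
cor_within <- sapply(split(data.frame(exposure, outcome), pop),
                     function(g) cor(g$exposure, g$outcome))

print(cor_between)
print(cor_within)
```

The population-level correlation is strongly positive while every within-population correlation is negative, so the group pattern says nothing reliable about an individual's risk.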
P-values and statistical significance
P-value
probability of observing a given value of a test statistic, or a more extreme value, under the assumption that the null hypothesis is exactly true
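To make the definition concrete, the sketch below computes a p-value by hand for a hypothetical experiment: 60 heads in 100 coin flips, with the null hypothesis that the coin is fair. The experiment and its numbers are assumptions chosen for illustration.

```r
n_flips  <- 100  # hypothetical number of flips
observed <- 60   # hypothetical number of heads

# Under the null hypothesis (fair coin), heads ~ Binomial(100, 0.5)
outcomes <- 0:n_flips
null_prob <- dbinom(outcomes, size = n_flips, prob = 0.5)

# "As or more extreme": at least as far from the expected 50 as the observed 60
extreme <- abs(outcomes - 50) >= abs(observed - 50)
p_manual <- sum(null_prob[extreme])

# The built-in exact test gives the same two-sided p-value
p_builtin <- binom.test(observed, n_flips, p = 0.5)$p.value

print(p_manual)
print(p_builtin)
```

The p-value is a statement about the data assuming the null hypothesis is true, not a statement about the hypothesis itself.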
A P-value is not:
a measure of importance
the probability that the result is due to chance alone
the probability of your hypothesis (or of the null)
P-value ≠ measure of importance
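A simulated sketch of why a small p-value does not imply an important effect. It compares a tiny true difference measured in very large samples with a large true difference measured in very small samples. The effect sizes, sample sizes, and seed are assumptions.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# A tiny true difference (0.02 standard deviations) with 500,000 per group
big_a <- rnorm(500000, mean = 0)
big_b <- rnorm(500000, mean = 0.02)
p_tiny_effect <- t.test(big_a, big_b)$p.value

# A large true difference (1 standard deviation) with only 4 per group
small_a <- rnorm(4, mean = 0)
small_b <- rnorm(4, mean = 1)
p_large_effect <- t.test(small_a, small_b)$p.value

print(p_tiny_effect)
print(p_large_effect)
```

The first p-value is extremely small even though the difference is negligible. The second depends heavily on the random draw and will often exceed 0.05 even though the true difference is large. The p-value reflects sample size and noise as much as effect size.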
The “gg” in ggplot() stands for the Grammar of Graphics. It’s a built-in set of design principles that makes it harder (but certainly not impossible) to make bad graphs.