Same Data, Different Visual Forms
Data Visualization for Scientific Discovery
Zan Armstrong
@zanstrong
zanarmstrong@gmail.com
I'm a data visualization designer and engineer
Common first lessons in data viz
Pie charts are bad.
Non-zero bar charts are bad.
Dual y-axes are bad.
Rainbows are bad.
Chart-junk is bad.
A high ink-to-data ratio is bad.
What should you do?
Know your goals & Consider your Constraints
Look at your Data
Use color intentionally
Make small multiples (many charts, at same time, related)
Know your Goals, Consider your Constraints
Form is function
Goal & Constraints
For communication (journalism, etc)
Does it attract the audience's attention?
Does it get the message across?
Is it true to the data? (not misleading)
How much space on the page or screen?
Is it immediately understandable?
Can everybody see it (colorblind, etc)?
Is there anything distracting from the main point?
For scientific analysis/discovery
How much of your time and energy does it take to create and interpret?
If something in your data is important, will you see it in the viz?
How much time do you have (as the analyst)?
How long does it take to create a viz?
How much mental effort does it take?
How hard is it to ask a new question?
Same data. Different purpose. Different form.
For analysis/discovery
For communication in Scientific American
Goal: Scientific Analysis & Discovery
1. Iterative exploratory analysis
2. Data processing pipelines
Many Charts.
Mostly boring.
Important things should be obvious.
Per Chart: different goals, different form
More intuitive sense of the data; Overview rather than detail; more attention to high wind speeds. Compact form.
More clearly see where there are no observations. Can see data for no-wind/low wind as well as for high wind.
Look at your Data
1977: Exploratory Data Analysis by Tukey
1973: Anscombe's Quartet
2016: Datasaurus Dozen
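Anscombe's Quartet is a quick way to see why summary statistics alone are not enough. A minimal sketch using the `anscombe` data frame that ships with base R; the name `quartet_stats` is my own and not from the talk:

```r
# Anscombe's quartet is bundled with base R as `anscombe`
# (columns x1..x4 and y1..y4).
data(anscombe)

# Summary statistics for each of the four x/y pairs.
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), var_x = var(x), cor_xy = cor(x, y))
})
colnames(quartet_stats) <- paste0("set", 1:4)
print(round(quartet_stats, 2))

# The statistics nearly match, but the scatterplots look nothing alike.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), main = paste("Set", i))
}
par(op)
```

The table of statistics is the "mostly boring" view; the four plots are where the differences become obvious.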
Use Color Intentionally
to focus attention
Demo
Pre-attentive processing for hue and intensity
Any interesting patterns in this set of numbers*?
* Example inspired by Storytelling with Data. Cole's workshops, book, and blog are fantastic; check them out.
How many 7's are there?
Easier now?
How many 3's are there?
There are lots :)
What about 1's?
It is the loneliest number.
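The counting demo can be approximated in ggplot2. This is only a sketch under my own assumptions: the grid size, the random digits, the colors, and the names `digits` and `target` are illustrative and are not the numbers shown on the slides.

```r
library(ggplot2)

set.seed(1)
# A hypothetical 20 x 8 grid of random digits.
digits <- expand.grid(col = 1:20, row = 1:8)
digits$value <- sample(0:9, nrow(digits), replace = TRUE)

# Pick the digit to highlight; everything else fades to grey.
target <- 7
digits$is_target <- digits$value == target

p <- ggplot(digits, aes(x = col, y = row, label = value, color = is_target)) +
  geom_text() +
  scale_color_manual(values = c("FALSE" = "grey70", "TRUE" = "firebrick"),
                     guide = "none") +
  scale_y_reverse() +
  theme_void()
print(p)
```

Changing `target` and re-running answers a new counting question at a glance, which is the point of using hue as a pre-attentive cue.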
Examples in the wild
In my work
Highlighting points by category
Part of interactive gene expression lookup tool
Gene Expression in the Brain: highlight by cell type
Moss Landing data & code examples
Data
library(RColorBrewer)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(grid)
# Set the ggplot2 theme to theme_bw() for all charts
theme_set(theme_bw())
df <- read.csv('2017-12.csv')
colnames(df)
Bucket data we might be interested in
# Note: one reading has barometric pressure under 1010; exclude it from this analysis
df <- subset(df, baro > 1010)
# bucket windspeed, direction, rain, barometric pressure
df$wspd_discrete <- cut(df$wspd, seq(-2,12,2))
df$wdir_discrete <- cut(df$wdir, seq(-1,360,5))
df$rain_discrete <- cut(df$rain, seq(-0.005,0.025,.005))
df$baro_discrete <- cut(df$baro, seq(1000,1040,5))
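To see what `cut()` does with these breaks, here is a small base R sketch on made-up wind speeds (the vector `wspd_example` is hypothetical). By default the intervals are open on the left and closed on the right, which is presumably why the breaks start below zero: a reading of exactly 0 still lands in a bucket.

```r
# Hypothetical wind speeds, including an exact zero.
wspd_example <- c(0, 0.5, 2, 3.7, 11.9)

# Same breaks as the wind speed bucketing above: -2, 0, 2, ..., 12.
buckets <- cut(wspd_example, seq(-2, 12, 2))

# Show each value next to the bucket it falls into.
print(data.frame(wspd = wspd_example, bucket = buckets))

# Count of readings per bucket, including empty buckets.
print(table(buckets))
```

A value outside the breaks would come back as `NA`, so it is worth checking `sum(is.na(...))` after bucketing real data.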
Let's look at wind direction and wind speed
ggplot(df, aes(x=wdir, y=wspd))
Moss Landing: wind direction vs wind speed
ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=.1)
With color for rain
ggplot(df, aes(x=wdir, y=wspd, color=rain_discrete)) + geom_point(alpha=.1)
[Legend: Not Raining, Raining Some, Raining a Lot]
With highlight color for rain
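# One color and one alpha per rain level: faint grey for no rain, stronger blue for rain.
# scale_alpha_manual takes one value per level; scale_alpha_discrete(range=) only takes a min and a max.
ggplot(df, aes(x=wdir, y=wspd, color=rain_discrete, alpha=rain_discrete)) + geom_point() +
  scale_color_manual(values = c('lightgrey', 'steelblue', 'steelblue')) + scale_alpha_manual(values = c(0.1, .3, .3))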
[Legend: Raining, Not raining]
Timeseries: time vs wind speed
ggplot(df, aes(x=unix_time, y=wspd)) + geom_point(alpha=0.1)
Timeseries: add color
ggplot(df, aes(x=unix_time, y=wspd, color=rain_discrete)) + geom_point(alpha=0.1)
[Legend: Not Raining, Raining Some, Raining a Lot]
Timeseries: color to highlight rain
ggplot(df, aes(x=unix_time, y=wspd, color=rain_discrete)) + geom_point(alpha=0.1) + scale_color_manual(values = c('grey', 'steelblue', 'steelblue'))
[Legend: Raining, Not raining]
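The highlight pattern used in these slides can be tried without the Moss Landing file. The sketch below runs on synthetic data: the data frame `weather`, its distributions, the level names, and the alpha values are all my own assumptions for illustration.

```r
library(ggplot2)

set.seed(42)
n <- 500
# Synthetic stand-in for the weather readings (not the real data).
weather <- data.frame(
  wdir = runif(n, 0, 360),
  wspd = rexp(n, rate = 0.5),
  rain = sample(c(0, 0.005, 0.02), n, replace = TRUE, prob = c(0.8, 0.15, 0.05))
)
weather$rain_level <- cut(weather$rain,
                          breaks = c(-Inf, 0, 0.01, Inf),
                          labels = c("none", "some", "a lot"))

# Grey and faint for "none"; one highlight color for both rain levels.
highlight <- ggplot(weather, aes(x = wdir, y = wspd,
                                 color = rain_level, alpha = rain_level)) +
  geom_point() +
  scale_color_manual(values = c(none = "lightgrey", some = "steelblue",
                                "a lot" = "steelblue")) +
  scale_alpha_manual(values = c(none = 0.1, some = 0.6, "a lot" = 0.6))
print(highlight)
```

Naming the values by level keeps the mapping explicit, so the colors do not shift if a level happens to be missing from a subset of the data.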
Use Color Intentionally
by choosing appropriate color schemes
with meaningful mappings from numbers to colors
Demo
Choosing color mappings
Choosing meaningful mappings from numbers to colors
Let's take another look at these numbers
Let's add some color. Viridis is popular.
It really makes the 9's pop. What if we invert it?
Now the 0's pop.
0->9 is ascending. Let's look at a sequential scale.
This is just another sequential option
Interested in extremes? Diverging from 4.5?
Or, we could clamp the extremes.
Or, we exclude them entirely.
Or focus only on the extremes?
Or, just focus on half the data (0-4)
Or, be even more direct: highlighting 3-5
[Color legend on each chart runs from 0 to 9]
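A few of these mappings can be sketched in ggplot2 on a made-up grid of digits. The grid, the colors, and the plot names below are my own choices, not the slides' originals; `scales::squish` is the out-of-bounds handler from the scales package that ggplot2 already depends on.

```r
library(ggplot2)

set.seed(7)
# A 16 x 10 grid holding each digit 0-9 exactly 16 times, shuffled.
digits <- expand.grid(col = 1:16, row = 1:10)
digits$value <- sample(rep(0:9, times = 16))

base <- ggplot(digits, aes(x = col, y = row, fill = value, label = value)) +
  geom_tile() +
  geom_text(size = 3) +
  theme_void()

# Sequential: light to dark across the full 0-9 range.
p_sequential <- base + scale_fill_gradient(low = "white", high = "steelblue")

# Diverging from 4.5: both extremes stand out, the middle fades.
p_diverging <- base + scale_fill_gradient2(low = "firebrick", mid = "white",
                                           high = "steelblue", midpoint = 4.5)

# Clamped: values below 2 share the color of 2, values above 7 the color of 7.
p_clamped <- base + scale_fill_gradient(low = "white", high = "steelblue",
                                        limits = c(2, 7), oob = scales::squish)

# Excluded: values outside 2-7 are drawn in a neutral grey instead.
p_excluded <- base + scale_fill_gradient(low = "white", high = "steelblue",
                                         limits = c(2, 7), na.value = "grey90")

print(p_clamped)
```

The data never changes between the four plots; only the mapping from numbers to colors does.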
Different colors & mappings reveal different things
What aspects of your data do you want to see?
Do your colors and your color mapping show you that?
If not, how can you change your color map, or your mapping from numbers to colors, to see that?
In my work
With Lusann Yang on the Google Accelerated Science team and John Gregoire of Caltech.
We want to notice when circles in the same box are different colors
Original color scale
New color scale
One chart, not enough visibility.
Problems:
High values are far more obvious than low values, but both are important for the science.
Extreme high values, either real or bad data, could wash out the color scale. This would make it impossible to see differences near the median or low end.
Choose a color map that is more balanced
Colormap with more perceptual variation overall, more even highs/lows
Same, but inverted min/max colors
Min/max set to first standard deviation
Color min-to-median, everything higher is purple
Color median-to-max, everything lower is red
And, invert it just in case
Standard colormap with lots of hue variation, max to min set by min/max in dataset
Same, but inverted min/max colors
Min/max set to first standard deviation
Color min-to-median, everything higher is purple
Color median-to-max, everything lower is red
Set min/max to standard deviation
Focus on only the bottom half of the data
And, one for the top half
New color scheme.
5 charts, each with a different mapping between colors and numbers.
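The different cut-offs can be written down as a small base R helper. This is a sketch, not the code used in the project: the function names `color_limits` and `to_unit`, the sample `values`, and my reading of "first standard deviation" as mean plus or minus one standard deviation are all assumptions.

```r
# Hypothetical measurements with one extreme high value.
values <- c(1, 2, 3, 4, 100)

# Candidate min/max pairs for the color scale.
color_limits <- function(x) {
  list(
    full_range = range(x),                     # min/max of the dataset
    one_sd     = mean(x) + c(-1, 1) * sd(x),   # assumed: mean +/- 1 SD
    lower_half = c(min(x), median(x)),         # min-to-median
    upper_half = c(median(x), max(x))          # median-to-max
  )
}

# Rescale to 0-1 within the limits, clamping anything outside them.
to_unit <- function(x, lim) {
  pmin(pmax((x - lim[1]) / (lim[2] - lim[1]), 0), 1)
}

limits <- color_limits(values)
# Full range: the outlier takes the top of the scale, the rest crowd near 0.
print(round(to_unit(values, limits$full_range), 3))
# Lower half: differences among the small values become visible.
print(to_unit(values, limits$lower_half))
```

Each set of limits answers a different question, which is the argument for drawing several charts rather than hunting for one perfect scale.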
Examples from the wild
Is your data diverging?
If yes, use diverging colors.
Annual world temperature compared to average
Dark blues just below average
Highest values in pinks and reds
Dark purples just above average
Coldest, furthest below average in bright blue
https://www.bloomberg.com/graphics/hottest-year-on-record/
Hue variation (bright blue vs dark blue, purple vs red) helps to distinguish near-average from extreme values.
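For data that diverges from a meaningful center, one simple approach is to make the color limits symmetric around that center so the neutral color lands exactly on it. A base R sketch with made-up anomalies (the numbers and the blue/white/red palette are illustrative and are not the Bloomberg chart's data or colors):

```r
# Hypothetical temperature anomalies relative to an average of 0.
anomaly <- c(-0.4, -0.1, 0, 0.2, 0.9)

# Symmetric limits so that 0 maps to the middle of the palette.
limit <- max(abs(anomaly))

# 101 colors: blue for below average, white at average, red for above.
palette_div <- colorRampPalette(c("blue", "white", "red"))(101)

# Map each anomaly to a palette index between 1 and 101.
index <- round((anomaly + limit) / (2 * limit) * 100) + 1
colors <- palette_div[index]

print(data.frame(anomaly = anomaly, color = colors))
```

With asymmetric data like this, the cold side never reaches full blue, which honestly reflects that the warm extreme is larger than the cold one.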
Examples from the wild
Make Small Multiples
Multiple similar charts that you view at the same time
Demo
Small multiples, with the same data
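In ggplot2, small multiples usually come from faceting. A minimal sketch on synthetic data (the data frame `obs`, its columns, and the pressure bands are assumptions made up for this example):

```r
library(ggplot2)

set.seed(3)
n <- 600
# Synthetic observations standing in for real weather readings.
obs <- data.frame(
  wdir = runif(n, 0, 360),
  wspd = rexp(n, rate = 0.5),
  baro = runif(n, 1010, 1030)
)
# Four 5-unit pressure bands; include.lowest keeps a reading of exactly 1010.
obs$baro_band <- cut(obs$baro, breaks = seq(1010, 1030, 5), include.lowest = TRUE)

# One panel per band. facet_wrap shares the x and y scales by default,
# so the panels can be compared directly.
multiples <- ggplot(obs, aes(x = wdir, y = wspd)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~ baro_band, ncol = 2)
print(multiples)
```

Shared scales are what make the panels comparable at a glance; freeing them with `scales = "free"` trades that away for per-panel detail.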
Examples from the wild
In my work
With Lusann Yang on the Google Accelerated Science team and John Gregoire of Caltech.
Many charts, fewer problems!
Solution:
Choose a color map that shows both highs and lows more evenly.
Replace 1 graph with 5 graphs. Different min/max cut-offs for colors.
Standard colormap with lots of hue variation, max to min set by min/max in dataset
Same, but inverted min/max colors
Min/max set to first standard deviation
Color min-to-median, everything higher is purple
Color median-to-max, everything lower is red
In my work
Baby birth data
Imagine analyzing birth data, and seeing this
Time of Day
Number of babies born
Drill down: by day of week
Drill down by delivery method & day of week
Total births
C-Section
Induction
Spontaneous
For Communication: Scientific American
Constraints: space, needs to be immediately understood.
Strategic use of small multiples to enable comparison.
For Communication: too much of a good thing
[Panels: Mon, Tues, Wed, Thurs, Fri, Sat, Sun]
Moss Landing data & code examples
Moss Landing: wind direction vs wind speed
ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=.1)
[x-axis: wind direction; y-axis: wind speed]
baseplot <- ggplot(df, aes(x=wdir, y=wspd))
# Add a point layer with the given alpha to whichever plot is passed in
plotAlpha <- function(plot, alpha) {plot + geom_point(alpha=alpha)}
grid.arrange(
  plotAlpha(baseplot, .01),
  plotAlpha(baseplot, .05),
  plotAlpha(baseplot, .1),
  plotAlpha(baseplot, .5),
  ncol=2)
[Four panels: Alpha .01, Alpha .05, Alpha .1, Alpha .5. x-axis: wind direction; y-axis: wind speed]
2a. Same axes, different barometric pressure
ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=.05) + facet_wrap(~ baro_discrete)
[x-axis: wind direction; y-axis: wind speed]
What about rain? Could use color to highlight.
# scale_alpha_manual sets one alpha per rain level
ggplot(df, aes(x=wdir, y=wspd, color=rain_discrete, alpha=rain_discrete)) + scale_alpha_manual(values = c(0.025, .4, .3)) + scale_color_manual(values = c('darkgrey', 'lightblue', 'steelblue')) + geom_point() + facet_wrap(~ baro_discrete)
[x-axis: wind direction; y-axis: wind speed]
2b. Same axes, diff rain/diff barometric pressure
ggplot(df, aes(x=wdir, y=wspd, color=rain_discrete, alpha=rain_discrete)) + geom_point() + facet_grid(rain_discrete ~ baro_discrete) + scale_alpha_manual(values = c(0.01, .5, .5)) +
  scale_color_manual(
    values = c(
      'darkgrey',
      'lightblue',
      'steelblue'
    ))
[x-axis: wind direction; y-axis: wind speed; rows: raining?; columns: barometric pressure]
What about other chart forms instead?
ggplot(df, aes(x=wdir, y=wspd, xend=wdir)) + geom_segment(yend=0, alpha=.01) + coord_polar()
2b. Same axes, diff rain/diff barometric pressure
ggplot(df, aes(x=wdir, y=wspd, xend=wdir)) + geom_segment(yend=0, alpha=.05) + coord_polar() +
  facet_grid(rain_discrete ~ baro_discrete)
[Rows: raining?; columns: barometric pressure]
3. Same x-axis, different y-metrics
What if we're interested in wind dir and baro?
Flip direction of color scale
+ scale_color_continuous(high = "#132B43", low = "#56B1F7")
Switch color scales to have more variation*
+ scale_color_continuous(high = "#132B43", low = "#56B1F7")
*not colorblind-safe
Didn't notice these before
Or, bin the data?
Same data, different aspects of data highlighted
[Four stacked panels, each labeled "Wind direction", over a shared time axis (Dec 2017). One panel per barometric pressure range: 1010-1015, 1015-1020, 1020-1025, 1025-1030]
grid.arrange(c,d,e,f,nrow=4)
Which one of these should you do?
It depends!
What are your goals?
What are your constraints?
What helps you see what's important in your data?
Learn more?
Recommended resources
Seaborn - Python library with good support for colors and small multiples. Plays well with matplotlib.
Cmocean - matplotlib color scales for oceanographic data
ggplot2 documentation - good support for small multiples and other visualizations in R
Perceptual distance in colormaps - shows which popular color maps have the greatest overall perceptual differentiation, as well as variation in perceptual distance along the colormap
Storytelling with Data - human perception, communicating with data. Book, blog, and workshops.
Tamara Munzner - systematic way of thinking about visualization forms
Flowing Data - data viz blog with lots of examples
Zanarmstrong.com - my portfolio
OpenVis Conf - all presentations posted online each year
Thank you!
Appendix
(recycling bin)
What about when there is no wind?
ggplot(subset(df, wspd == 0), aes(x=baro, y=rain_discrete, color=rain_discrete, alpha=rain_discrete)) + geom_point() + scale_color_manual(values = c('darkgrey', 'lightblue', 'steelblue')) + scale_alpha_manual(values = c(0.1, .5, .5))
Barometric pressure
Raining?
Need context
ggplot(df, aes(x=baro, y=rain_discrete, color=rain_discrete)) + geom_point(alpha=.1) +
  facet_wrap(~ wspd_discrete, ncol=1) +
  scale_color_manual(values = c('darkgrey', 'lightblue', 'steelblue'))
[x-axis: barometric pressure; y-axis: raining?]
Faceted by Wind Speed
[x-axis: barometric pressure; y-axis: raining?]
As a heatmap
[see example scripts]
Heatmap Small Multiple!
[see example scripts]
Points, but radial
ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=.01) + coord_polar()
Points, but radial Small Multiples
ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=0.05) + coord_polar() + facet_grid(rain_discrete ~ baro_discrete)
Barometric pressure
Raining?
Radial Heatmap
[see example script]
Radial Heatmap:
small multiples
[see example script]
Resources
Being Clever with Color - For Explanatory Analysis - by Storytelling with Data
R - many versions - https://www.r-bloggers.com/my-commonly-done-ggplot2-graphs/
For Storytelling/Communication
Where is Larry - Storytelling with Data
Daily data
ggplot(subset(df, year == "2014"), aes(x = date, y=births)) + geom_line() + theme_bw() + ylim(0,14000)
Look at granular data. Every minute.
ggplot(df, aes(x = hourmin, y = value)) + geom_line() + theme_bw() + ylim(0,6000)
outtakes
Same comparison, in polar
All points shown in black
If it was raining, blue; otherwise black
The user is not showered with graphical displays. He can get them only with trouble, cunning, and a fighting spirit.
In my work
Color scales for metagenomics
What's the tool?
Key challenge:
Two types of data.
For Single-Copy: 1 is good, 0 means sequence is missing data, 2 means that it has too much.
For Gene Abundance: Most important difference is between 0 and at least 1.
Use Color Intentionally: Resources
Cmocean - Beautiful matplotlib colormaps for oceanographic variables
Being Clever with Color - Color for Explanatory Analysis - by Storytelling with Data