1 of 157

Same Data, Different Visual Forms

Data Visualization for Scientific Discovery

Zan Armstrong

@zanstrong

zanarmstrong@gmail.com

2 of 157

I'm a data visualization designer and engineer

8 of 157

Common first lessons in data viz

Pie charts are bad.

Bar charts with a non-zero baseline are bad.

Dual y-axes are bad.

Rainbows are bad.

Chart-junk is bad.

A high ink-to-data ratio is bad.

9 of 157

10 of 157

15 of 157

What should you do?

Know your goals & consider your constraints

Look at your data

Use color intentionally

Make small multiples (many related charts, viewed at the same time)

16 of 157

Know your Goals, Consider your Constraints

Form is function

17 of 157

Goal & Constraints

For communication (journalism, etc.)

Does it attract the audience's attention?

Does it get the message across?

Is it true to the data? (not misleading)

How much space on the page or screen?

Is it immediately understandable?

Can everybody see it (colorblind, etc.)?

Is there anything distracting from the main point?

For scientific analysis/discovery

How much of your time and energy does it take to create and interpret?

If something in your data is important, will you see it in the viz?

How much time do you have (as the analyst)?

How long does it take to create a viz?

How much mental effort does it take?

How hard is it to ask a new question?

18 of 157

Same data. Different purpose. Different form.

For analysis/discovery

For communication in Scientific American

22 of 157

Goal: Scientific Analysis & Discovery

1. Iterative exploratory analysis

2. Data processing pipelines

Many Charts.

Mostly boring.

Important things should be obvious.

23 of 157

Per Chart: different goals, different form

More intuitive sense of the data; Overview rather than detail; more attention to high wind speeds. Compact form.

More clearly see where there are no observations. Can see data for no-wind/low wind as well as for high wind.

24 of 157

Look at your Data

25 of 157

1977: Exploratory Data Analysis by Tukey

26 of 157

1973: Anscombe's Quartet
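
The quartet ships with base R as the `anscombe` data frame, so the point is easy to check. A minimal sketch using only base R (the name `quartet_stats` is made up for this example): the summary statistics nearly match across the four sets, and only the plots tell them apart.

```r
# Summary statistics for each of the four x/y pairs in the built-in dataset
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), mean_y = mean(y),
             var_x = var(x), var_y = var(y),
             cor_xy = cor(x, y))
}))
print(round(quartet_stats, 2))

# The plots are what distinguish the four sets
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```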

27 of 157

28 of 157

2016: Datasaurus Dozen

29 of 157

31 of 157

Use Color Intentionally

to focus attention

32 of 157

Demo

Pre-attentive processing for hue and intensity

33 of 157

Any interesting patterns in this set of numbers*?

* Example inspired by Storytelling with Data. Cole's workshops, book, and blog are fantastic; check them out.

34 of 157

How many 7's are there?

35 of 157

Easier now?

37 of 157

How many 3's are there?

38 of 157

There are lots :)

39 of 157

What about 1's?

40 of 157

It is the loneliest number.
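
A rough way to rebuild this demo, assuming ggplot2 is installed. The grid size, seed, colors, and the names `grid_digits` and `is_target` are all invented for the sketch; the slides do not show how the original grid was made. One digit gets a highlight color and everything else is pushed back to grey.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed so the grid is reproducible
grid_digits <- expand.grid(col = 1:20, row = 1:8)
grid_digits$digit <- sample(0:9, nrow(grid_digits), replace = TRUE)
grid_digits$is_target <- grid_digits$digit == 7  # the digit to highlight

p <- ggplot(grid_digits, aes(x = col, y = row, label = digit, color = is_target)) +
  geom_text(size = 5) +
  # grey for everything else, one strong hue for the target digit
  scale_color_manual(values = c("FALSE" = "grey70", "TRUE" = "firebrick"),
                     guide = "none") +
  scale_y_reverse() +
  theme_void()
print(p)

sum(grid_digits$is_target)  # the count that the highlight makes easy to see
```

Changing the `7` to another digit repeats the question for the 3's or the 1's.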

41 of 157

Examples in the wild

42 of 157

43 of 157

In my work

Highlighting points by category

Part of interactive gene expression lookup tool

44 of 157

Gene Expression in the Brain: highlight by cell type

47 of 157

Moss Landing data & code examples

48 of 157

Data

49 of 157

Data

library(RColorBrewer)  # color palettes
library(ggplot2)       # plotting
library(dplyr)         # data manipulation
library(gridExtra)     # grid.arrange() for laying out several plots
library(grid)

theme_set(theme_bw())

df <- read.csv('2017-12.csv')

50 of 157

Data

theme_set(theme_bw()) sets the ggplot2 theme to theme_bw() for all charts.

51 of 157

Data

colnames(df)

52 of 157

Bucket data we might be interested in

# note there is one reading where barometric pressure is under 1010, will exclude it for this analysis

df <- subset(df, baro > 1010)

# bucket windspeed, direction, rain, barometric pressure

df$wspd_discrete <- cut(df$wspd, seq(-2,12,2))

df$wdir_discrete <- cut(df$wdir, seq(-1,360,5))

df$rain_discrete <- cut(df$rain, seq(-0.005,0.025,.005))

df$baro_discrete <- cut(df$baro, seq(1000,1040,5))
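
Why the breaks start just below zero: by default `cut()` builds bins that exclude their lower edge and include their upper edge, so a reading of exactly 0 needs a break below it. A small sketch with made-up readings (`wspd`, `wspd_bins`, `wdir_breaks`, and `wdir_bins` are names for this example only):

```r
wspd <- c(0, 1.9, 2, 11.5)              # made-up wind speeds
wspd_bins <- cut(wspd, seq(-2, 12, 2))
as.character(wspd_bins)
# "(-2,0]" "(0,2]" "(0,2]" "(10,12]": 0 and 2 each land in the bin they close

wdir_breaks <- seq(-1, 360, 5)
max(wdir_breaks)                        # 359, the last step of 5 that fits under 360
wdir_bins <- cut(c(0, 359, 360), wdir_breaks)
as.character(wdir_bins)
# "(-1,4]" "(354,359]" NA: a reading past the last break becomes NA
```

If the data can contain a wind direction of exactly 360, the breaks would need to extend past it.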

53 of 157

Let's look at wind direction and wind speed

ggplot(df, aes(x=wdir, y=wspd))

54 of 157

Moss Landing: wind direction vs wind speed

ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=.1)

55 of 157

With color for rain

ggplot(df, aes(x=wdir, y=wspd, color=rain_discrete)) + geom_point(alpha=.1)

[Figure legend: Not Raining, Raining Some, Raining a Lot]

56 of 157

With highlight color for rain

ggplot(df, aes(x=wdir, y=wspd, color=rain_discrete, alpha=rain_discrete)) + geom_point() +
  scale_color_manual(values = c('lightgrey', 'steelblue', 'steelblue')) +
  # one alpha per rain bucket; scale_alpha_discrete(range=) takes only a min and a max
  scale_alpha_manual(values = c(0.1, .3, .3))

[Figure legend: Raining, Not raining]

57 of 157

Timeseries: time vs wind speed

ggplot(df, aes(x=unix_time, y=wspd)) + geom_point(alpha=0.1)

58 of 157

Timeseries: add color

ggplot(df, aes(x=unix_time, y=wspd, color=rain_discrete)) + geom_point(alpha=0.1)

Not Raining

Raining Some

Raining a Lot

59 of 157

Timeseries: color to highlight rain

ggplot(df, aes(x=unix_time, y=wspd, color=rain_discrete)) + geom_point(alpha=0.1) + scale_color_manual(values = c('grey', 'steelblue', 'steelblue'))

[Figure legend: Raining, Not raining]

62 of 157

Use Color Intentionally

by choosing appropriate color schemes

with meaningful mappings from numbers to colors

63 of 157

Demo

Choosing color mappings

Choosing meaningful mappings from numbers to colors

64 of 157

Let's take another look at these numbers

65 of 157

Let's add some color. Viridis is popular.

66 of 157

It really makes the 9's pop. What if we invert it?

[Color scale legend: 0 to 9]

67 of 157

Now the 0's pop.

[Color scale legend: 0 to 9]

68 of 157

0->9 is ascending. Let's look at a sequential scale.

[Color scale legend: 0 to 9]

69 of 157

This is just another sequential option

[Color scale legend: 0 to 9]

70 of 157

Interested in extremes? Diverging from 4.5?

[Color scale legend: 0 to 9]

71 of 157

Or, we could clamp the extremes.

[Color scale legend: 0 to 9]

72 of 157

Or, we exclude them entirely.

[Color scale legend: 0 to 9]

73 of 157

Or focus only on the extremes?

[Color scale legend: 0 to 9]

74 of 157

Or just focus on half the data (0-4)

[Color scale legend: 0 to 9]

75 of 157

Or, be even more direct: highlighting 3-5

[Color scale legend: 0 to 9]

76 of 157

Different colors & mappings reveal different things

What aspects of your data do you want to see?

Do your colors and your color mapping show you that?

If not, how can you change your color map or your mapping from numbers to colors to see that?
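
The mappings in this demo can be sketched in base R. This is not the code behind the slides: the helper `map_to_colors` and the color choices are invented for illustration. One function covers the sequential, diverging, and clamped cases, and a plain `ifelse()` covers direct highlighting.

```r
# Map numbers to hex colors by interpolating along a list of colors.
# Values outside [lo, hi] are clamped to the nearest end of the scale.
map_to_colors <- function(x, colors, lo = min(x), hi = max(x)) {
  pos <- (pmin(pmax(x, lo), hi) - lo) / (hi - lo)  # position in [0, 1]
  rgb(colorRamp(colors)(pos), maxColorValue = 255)
}

digits <- 0:9

sequential <- map_to_colors(digits, c("white", "steelblue"))
inverted   <- map_to_colors(digits, c("steelblue", "white"))

# Three colors put the neutral middle at the midpoint of the range, 4.5
diverging  <- map_to_colors(digits, c("steelblue", "white", "firebrick"))

# Clamp the extremes: 0-2 share one color, 7-9 share another
clamped    <- map_to_colors(digits, c("white", "steelblue"), lo = 2, hi = 7)

# Be direct: highlight 3-5, grey out the rest
highlighted <- ifelse(digits >= 3 & digits <= 5, "steelblue", "grey80")

data.frame(digits, sequential, diverging, clamped, highlighted)
```

The numbers never change; only `colors`, `lo`, and `hi` do, which is why each version draws the eye to a different part of the data.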

77 of 157

In my work

With Lusann Yang on the Google Accelerated Science team and John Gregoire of Caltech.

79 of 157

We want to notice when circles in the same box are different colors

Original color scale

New color scale

80 of 157

One chart, not enough visibility.

Problems:

High values are far more obvious than low values, but both matter for the science.

Extreme high values, whether real or bad data, can wash out the color scale, making it impossible to see differences near the median or at the low end.

81 of 157

Choose a color map that is more balanced

Colormap with more perceptual variation overall, more even highs/lows

Same, but inverted min/max colors

Min/max set to first standard deviation

Color min-to-median, everything higher is purple

Color median-to-max, everything lower is red

82 of 157

And, invert it just in case

Standard colormap with lots of hue variation, max to min set by min/max in dataset

Same, but inverted min/max colors

Min/max set to first standard deviation

Color min-to-median, everything higher is purple

Color median-to-max, everything lower is red

83 of 157

Set min/max to standard deviation

84 of 157

Focus on only the bottom half of the data

85 of 157

And, one for the top half

86 of 157

New color scheme.

5 charts, each with a different mapping between colors and numbers.
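
The five mappings differ only in which limits anchor the ends of the color scale. A sketch of that idea in base R, on made-up values. The data, the names `limits`, `scale_position`, and `positions`, and the reading of "first standard deviation" as mean plus or minus one standard deviation are all assumptions, not the project's code.

```r
set.seed(1)  # arbitrary seed
# Hypothetical measurements: 200 typical values plus one extreme reading
values <- c(rnorm(200, mean = 50, sd = 10), 120)

limits <- list(
  full_range = range(values),
  one_sd     = mean(values) + c(-1, 1) * sd(values),
  lower_half = c(min(values), median(values)),
  upper_half = c(median(values), max(values))
)

# Position of each value along the color scale: 0 = first color, 1 = last.
# Values outside the limits become NA so they can share one fixed color.
scale_position <- function(x, lim) {
  pos <- (x - lim[1]) / (lim[2] - lim[1])
  pos[pos < 0 | pos > 1] <- NA
  pos
}

positions <- lapply(limits, function(lim) scale_position(values, lim))
positions$inverted <- 1 - positions$full_range  # same limits, colors reversed

# Share of values that get a color of their own under each mapping
sapply(positions, function(pos) mean(!is.na(pos)))
```

In ggplot2 the same effect should be available through the `limits` and `na.value` arguments of a continuous color scale such as `scale_color_gradientn()`.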

87 of 157

Examples from the wild

88 of 157

Is your data diverging?

89 of 157

If yes, use diverging colors.

Annual world temperature compared to average

Dark blues just below average

Highest values in pinks and reds

Dark purples just above average

Coldest, furthest below average in bright blue

https://www.bloomberg.com/graphics/hottest-year-on-record/

90 of 157

If yes, use diverging colors.

Hue variation (bright blue vs dark blue, purple vs red) helps to distinguish near-average from extreme values.

91 of 157

92 of 157

Examples from the wild

93 of 157

94 of 157

Make Small Multiples

Multiple similar charts that you look at together

95 of 157

Demo

Small multiples, with the same data
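
A minimal small-multiples sketch, assuming ggplot2 is installed. It uses R's built-in `airquality` dataset rather than the data behind these slides: the same two variables are plotted once per month, with shared axes so the panels can be compared directly.

```r
library(ggplot2)

# airquality: daily New York air measurements, May to September 1973 (built into R)
p <- ggplot(airquality, aes(x = Wind, y = Temp)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ Month, nrow = 1)  # one panel per month, same axes in each
print(p)
```

Keeping the scales fixed across panels (the `facet_wrap()` default) is what makes differences between panels readable at a glance.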

96 of 157

97 of 157

Examples from the wild

98 of 157

99 of 157

In my work

With Lusann Yang on the Google Accelerated Science team and John Gregoire of Caltech.

100 of 157

Many charts, fewer problems!

Solution:

Choose a color map that balances highs and lows.

Replace 1 graph with 5 graphs. Different min/max cut-offs for colors.

Standard colormap with lots of hue variation, max to min set by min/max in dataset

Same, but inverted min/max colors

Min/max set to first standard deviation

Color min-to-median, everything higher is purple

Color median-to-max, everything lower is red

101 of 157

In my work

Baby birth data

102 of 157

Imagine analyzing birth data, and seeing this

Time of Day

Number of babies born

103 of 157

Drill down: by day of week

104 of 157

105 of 157

106 of 157

107 of 157

Drill down by delivery method & day of week

Total births

C-Section

Induction

Spontaneous

108 of 157

For Communication: Scientific American

Constraints: space, needs to be immediately understood.

Strategic use of small multiples to enable comparison.

109 of 157

For Communication: too much of a good thing

[Figure: one panel per day of the week, Mon through Sun]

110 of 157

Moss Landing data & code examples

111 of 157

Moss Landing: wind direction vs wind speed

ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=.1)

[Axes: wind direction (x), wind speed (y)]

112 of 157

1. Same data, different opacity

baseplot <- ggplot(df, aes(x=wdir, y=wspd))

# add a point layer with the given opacity to whichever plot is passed in
plotAlpha <- function(plot, alpha) { plot + geom_point(alpha=alpha) }

grid.arrange(
  plotAlpha(baseplot, .01),
  plotAlpha(baseplot, .05),
  plotAlpha(baseplot, .1),
  plotAlpha(baseplot, .5),
  ncol=2)

[Figure: 2x2 grid of panels at alpha .01, .05, .1, and .5; each plots wind direction (x) against wind speed (y)]
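
Why the four panels look so different: overlapping translucent points add up. A back-of-the-envelope sketch, assuming each point is blended over the ones beneath it (standard "over" compositing) and ignoring how a particular graphics device rounds very small alpha values. The names `stacked_opacity` and `points_to_saturate` are made up here.

```r
alphas <- c(0.01, 0.05, 0.1, 0.5)

# Opacity after n points overlap: 1 - (1 - alpha)^n
stacked_opacity <- function(alpha, n) 1 - (1 - alpha)^n

# Fewest overlapping points needed to reach 95% opacity at each alpha
points_to_saturate <- ceiling(log(1 - 0.95) / log(1 - alphas))
data.frame(alpha = alphas, points_to_saturate)
# alpha 0.01 -> 299 points, 0.05 -> 59, 0.1 -> 29, 0.5 -> 5
```

Low alpha keeps the densest regions from saturating, while high alpha keeps sparse observations visible, so each panel answers a different question about the same data.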

113 of 157

2a. Same axes, different barometric pressure

ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=.05) + facet_wrap(~ baro_discrete)

[Axes: wind direction (x), wind speed (y); one panel per barometric pressure bucket]

114 of 157

What about rain? Could use color to highlight.

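ggplot(df, aes(x=wdir, y=wspd, color=rain_discrete, alpha=rain_discrete)) +
  # one alpha per rain bucket; scale_alpha_discrete(range=) takes only a min and a max
  scale_alpha_manual(values = c(0.025, .4, .3)) +
  scale_color_manual(values = c('darkgrey', 'lightblue', 'steelblue')) +
  geom_point() + facet_wrap(~ baro_discrete)

[Axes: wind direction (x), wind speed (y); one panel per barometric pressure bucket]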

115 of 157

2b. Same axes, diff rain/diff barometric pressure

ggplot(df, aes(x=wdir, y=wspd, color=rain_discrete, alpha=rain_discrete)) + geom_point() +
  facet_grid(rain_discrete ~ baro_discrete) +
  scale_alpha_manual(values = c(0.01, .5, .5)) +
  scale_color_manual(values = c('darkgrey', 'lightblue', 'steelblue'))

[Figure: grid of scatterplots; rows: raining?, columns: barometric pressure; axes: wind direction (x), wind speed (y)]

116 of 157

What about other chart forms instead?

ggplot(df, aes(x=wdir, y=wspd, xend=wdir)) + geom_segment(yend=0, alpha=.01) + coord_polar()

117 of 157

2b. Same axes, diff rain/diff barometric pressure

ggplot(df, aes(x=wdir, y=wspd, xend=wdir)) + geom_segment(yend=0, alpha=.05) + coord_polar() +
  facet_grid(rain_discrete ~ baro_discrete)

[Figure: grid of polar plots; rows: raining?, columns: barometric pressure]

118 of 157

3. Same x-axis, different y-metrics

119 of 157

120 of 157

What if we're interested in wind dir and baro?

121 of 157

Flip direction of color scale

+ scale_color_continuous(high = "#132B43", low = "#56B1F7")

122 of 157

Switch color scales to have more variation*

+ scale_color_continuous(high = "#132B43", low = "#56B1F7")

*not colorblind-safe

Didn't notice these before

123 of 157

Or, bin the data?

124 of 157

Same data, different aspects of data highlighted

[Figure: four stacked panels of wind direction over time (Dec 2017), one per barometric pressure bucket: 1010-1015, 1015-1020, 1020-1025, 1025-1030]

grid.arrange(c,d,e,f,nrow=4)

126 of 157

Same data, different aspects of data highlighted

[Figure: four panels of wind direction over time (Dec 2017)]

132 of 157

Which one of these should you do?

It depends!

What are your goals?

What are your constraints?

What helps you see what's important in your data?

133 of 157

Learn more?

134 of 157

Recommended resources

Seaborn - Python library with good support for colors and small multiples. Plays well with matplotlib.

Cmocean - matplotlib color scales for oceanographic data

ggplot2 documentation - good support for small multiples and other visualization in R

Perceptual distance in colormaps - shows which popular color maps have the greatest overall perceptual differentiation, as well as variation in perceptual distance along the colormap

Storytelling with Data - human perception, communicating with data. Book, blog, and workshops.

Tamara Munzner - systematic way of thinking about visualization forms

Flowing Data - data viz blog with lots of examples

Zanarmstrong.com - my portfolio

OpenVis Conf - all presentations posted online each year

135 of 157

Thank you!

136 of 157

Appendix

(recycling bin)

137 of 157

What about when there is no wind?

ggplot(subset(df, wspd == 0), aes(x=baro, y=rain_discrete, color=rain_discrete, alpha=rain_discrete)) + geom_point() +
  scale_color_manual(values = c('darkgrey', 'lightblue', 'steelblue')) +
  # one alpha per rain bucket; scale_alpha_discrete(range=) takes only a min and a max
  scale_alpha_manual(values = c(0.1, .5, .5))

[Axes: barometric pressure (x), raining? (y)]

138 of 157

Need context

ggplot(df, aes(x=baro, y=rain_discrete, color=rain_discrete)) + geom_point(alpha=.1) +
  facet_wrap(~ wspd_discrete, ncol=1) +
  scale_color_manual(values = c('darkgrey', 'lightblue', 'steelblue'))

[Figure: panels faceted by wind speed; axes: barometric pressure (x), raining? (y)]

139 of 157

As a heatmap

[see example scripts]

140 of 157

Heatmap Small Multiple!

[see example scripts]

141 of 157

Points, but radial

ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=.01) + coord_polar()

142 of 157

Points, but radial Small Multiples

ggplot(df, aes(x=wdir, y=wspd)) + geom_point(alpha=0.05) + coord_polar() + facet_grid(rain_discrete ~ baro_discrete)

Barometric pressure

Raining?

143 of 157

Radial Heatmap

[see example script]

144 of 157

Radial Heatmap:

small multiples

[see example script]

145 of 157

Resources

Being Clever with Color - For Explanatory Analysis - by Storytelling with Data

R- many versions - https://www.r-bloggers.com/my-commonly-done-ggplot2-graphs/

For Storytelling/Communication

Where is Larry - Storytelling with Data

146 of 157

147 of 157

Daily data

ggplot(subset(df, year == "2014"), aes(x = date, y=births)) + geom_line() + theme_bw() + ylim(0,14000)

148 of 157

Look at granular data. Every minute.

ggplot(df, aes(x = hourmin, y = value)) + geom_line() + theme_bw() + ylim(0,6000)

149 of 157

outtakes

150 of 157

Same comparison, in polar

  • coord_polar()

All points shown in black

Blue if it was raining, otherwise black

151 of 157

The user is not showered with graphical displays. He can get them only with trouble, cunning, and a fighting spirit.

  • Anscombe, 1973

152 of 157

In my work

Color scales for

metagenomics

154 of 157

What's the tool?

155 of 157

Key challenge:

Two types of data.

For Single-Copy: 1 is good, 0 means the sequence is missing data, 2 means that it has too much.

For Gene Abundance: the most important difference is between 0 and at least 1.
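
One way to express those two rules as color mappings, sketched in base R. The function names and the colors are hypothetical; the slides do not say what the tool actually uses. Each data type gets its own mapping because the same count means something different in each.

```r
# Single-copy genes: exactly 1 is the expected value, so it recedes into grey,
# and both kinds of problem get their own attention-grabbing color.
single_copy_color <- function(count) {
  ifelse(count == 1, "grey85",            # expected: exactly one copy
         ifelse(count == 0, "firebrick",  # missing
                "darkorange"))            # two or more: too many
}

# Gene abundance: the key contrast is absent vs. present at all.
abundance_color <- function(count) {
  ifelse(count == 0, "white", "steelblue")
}

counts <- c(0, 1, 2, 5)
data.frame(counts,
           single_copy = single_copy_color(counts),
           abundance = abundance_color(counts))
```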

157 of 157

Use Color Intentionally: Resources

Cmocean - Beautiful matplotlib colormaps for oceanographic variables

Being Clever with Color - Color for Explanatory Analysis - by Storytelling with Data