All about histograms.
Data 94, Spring 2021 @ UC Berkeley
Suraj Rampure, with help from many others
Visualizing Numerical Variables
26
Overview
Announcements
Note: the linked textbook reading for this lecture doesn’t follow our lecture exactly; it explains a different kind of histogram. It may still be somewhat helpful.
Review: bar charts
barh
Histograms
Towards numerical distributions
Bar charts visualize the distribution of a categorical variable, or the relationship between a categorical variable and one or more numerical variables.
Bar charts cannot display the distribution of a numerical variable.
Question: How might we display the distribution of a numerical variable?
Answer: One solution is binning.
Binning
Binning is counting the number of numerical values that fall within ranges, called “bins”.
Example values: 72, 61, 63, 74, 68, 67, 65, 73, 65, 62, 66
[Figure: number line with bin edges at 60, 64, 68, 72, and 76, giving bins of width 4; the [72, 76) bin is labeled.]
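As a rough sketch of the binning above, the count in each bin can be computed with NumPy's np.histogram. The variable names are illustrative and do not come from the lecture. Note that np.histogram treats the final bin as closed on the right; that makes no difference here because no value equals 76.

```python
import numpy as np

# The eleven values from the slide.
values = np.array([72, 61, 63, 74, 68, 67, 65, 73, 65, 62, 66])

# Bin edges from the slide: four bins, each of width 4.
bin_edges = np.array([60, 64, 68, 72, 76])

# counts[i] is the number of values between bin_edges[i] and bin_edges[i + 1].
counts, _ = np.histogram(values, bins=bin_edges)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left}, {right}): {count}")
```

For instance, the [72, 76) bin contains three values: 72, 74, and 73.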
Histograms
A histogram visualizes the distribution of a numerical variable by binning. The method
t.hist(column, density = False)
creates a histogram of the column column of t. This column must contain numerical values.
“right-skewed” or “right-tailed”
Example
Telling us that the number of tips between $1.90 and $2.80 is 79.
Aside: we can confirm this result using where and are.between.
Also telling us that the number of tips between $1.90 and $2.80 is 79.
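A minimal sketch of the same kind of check in plain NumPy, assuming that the range includes its left endpoint and excludes its right endpoint, as the bins do. The tips array is made up for illustration; it is not the lecture's data, so the count here is not 79.

```python
import numpy as np

# Hypothetical tip amounts in dollars (not the dataset from lecture).
tips = np.array([1.01, 1.90, 2.00, 2.50, 2.79, 2.80, 3.50, 5.00])

# True where a tip falls in [1.90, 2.80): left end included, right end excluded.
in_range = (tips >= 1.90) & (tips < 2.80)

# Count how many tips fall in the range.
num_in_range = np.count_nonzero(in_range)
print(num_in_range)
```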
Why do we need density = False?
Same data as before. The y-axis is no longer “Count”.
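To see what can change on the y-axis, here is a sketch using np.histogram, which has its own density argument. With density=True, NumPy divides each count by (number of values times bin width), so the areas of the bars sum to 1. The course's hist may scale or label its axis differently, so treat this as an illustration of the idea rather than of hist itself. The values are the eleven numbers from the binning example earlier in the lecture.

```python
import numpy as np

values = np.array([72, 61, 63, 74, 68, 67, 65, 73, 65, 62, 66])
bin_edges = np.array([60, 64, 68, 72, 76])
widths = np.diff(bin_edges)  # width of each bin (all 4 here)

# Raw counts per bin.
counts, _ = np.histogram(values, bins=bin_edges)

# Density per bin, as computed by NumPy.
density, _ = np.histogram(values, bins=bin_edges, density=True)

# The same density computed by hand: count / (n * width).
manual_density = counts / (counts.sum() * widths)

# With density, it is the total area of the bars that equals 1.
total_area = (density * widths).sum()
print(counts, density, total_area)
```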
Quick Check 1
Answer the following questions about this histogram of (fake) heights. If you don’t believe it’s possible to answer the question, write “can’t tell”.
A note on bins
By looking at a histogram, we cannot tell how values are distributed within a bin.
All heights in this bin could be 64 inches.
Or they could all be 66 inches.
Or half could be 65 and half could be 67.
Unless we have the actual data, we can’t tell.
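This can be checked numerically. In the sketch below, three different sets of heights fall in the same bin and therefore produce identical counts. The bin edges (a [64, 68) bin with one neighbor on each side) and the group size of four are assumptions chosen to fit the numbers on the slide, since the slide's histogram is not reproduced here.

```python
import numpy as np

# Assumed bin edges; the middle bin is [64, 68).
bin_edges = np.array([60, 64, 68, 72])

# Three different datasets that all sit inside the middle bin.
all_64 = np.array([64, 64, 64, 64])
all_66 = np.array([66, 66, 66, 66])
half_and_half = np.array([65, 65, 67, 67])

# Bin each dataset with the same edges and keep only the counts.
counts = [
    np.histogram(heights, bins=bin_edges)[0].tolist()
    for heights in (all_64, all_66, half_and_half)
]
print(counts)
```

All three count lists are the same, so a histogram drawn from any of them would look identical.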
Customization
We can use the same optional customization arguments with hist as we did with barh.
Choosing bins
Bins
hist chooses bins by default for us.
The optional bins argument in hist requires an array, and uses the values in the array as bins instead.
Very last bin is [10, 11), not [11, 12).
np.arange, revisited
If we only need a few bins, we can make an array of bins by hand, like in the previous example.
But if we want more than a handful of bins, it will be tiring to create them by hand. Instead, we can use np.arange to create an array of equally spaced values.
Same as np.array([0, 1, 2, 3, …, 11]).
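A short sketch of np.arange for building bin edges. The call np.arange(0, 12) is an assumption about what the slide showed, inferred from the array it is said to equal. The optional third argument is the step, which becomes the bin width.

```python
import numpy as np

# Bin edges typed out by hand.
by_hand = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

# The same edges from np.arange: start is included, stop is excluded.
with_arange = np.arange(0, 12)
print(np.array_equal(by_hand, with_arange))

# A step of 5 gives edges 0, 5, 10, 15, 20, i.e. bins of width 5.
wide_bins = np.arange(0, 21, 5)
print(wide_bins)
```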
Example
It’s a good idea to determine the min and max values in a column before choosing bins, to make sure your bins encompass the entire range.
We need to make sure that the largest value in the column is part of one of our bins.
If we used np.arange(3, 51, 3), for example, the last two elements in bins_3 would be 45 and 48, meaning that the last bin is [45, 48), and 50.81 isn’t placed in a bin.
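One way to guard against this, sketched under the assumption that the bins start at 3 and the largest value in the column is 50.81 as in the example: push the stop value of np.arange one bin width past the maximum. The helper name covering_bins is hypothetical.

```python
import numpy as np

def covering_bins(start, largest, width):
    """Return equally spaced bin edges from `start` whose last edge is at least `largest`."""
    # Stopping one width past the largest value guarantees an edge at or beyond it.
    return np.arange(start, largest + width, width)

# The problematic choice from the example: the last edge is 48, below 50.81.
too_short = np.arange(3, 51, 3)
print(too_short[-2:])

# Edges that extend far enough to cover 50.81.
bins_3 = covering_bins(3, 50.81, 3)
print(bins_3[-2:])
```

If the largest value lands exactly on the last edge, whether it is counted depends on how the final bin is treated, so adding one more bin is the safer choice in that case.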
Tradeoff
The bin width we choose dramatically changes the resulting histogram.
In practice, it’s up to you to choose the bin size of your histogram.
Width = 3
Width = 7
Width = 10
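The effect of bin width also shows up in the counts themselves. This sketch re-bins the eleven values from the binning example with widths 2, 4, and 8. These widths are chosen to suit that small dataset; they are not the widths 3, 7, and 10 used for the plots on this slide.

```python
import numpy as np

values = np.array([72, 61, 63, 74, 68, 67, 65, 73, 65, 62, 66])

counts_by_width = {}
for width in (2, 4, 8):
    # Edges run from 60 up to and including 76 for each of these widths.
    edges = np.arange(60, 76 + width, width)
    counts, _ = np.histogram(values, bins=edges)
    counts_by_width[width] = counts.tolist()
    print(f"width {width}: {counts_by_width[width]}")
```

Narrow bins show more detail but can look jagged, while wide bins look smoother but hide detail.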
Overlaid and side-by-side histograms
group and overlay
What if we want to show the distribution of a numerical variable, grouped by some categorical variable? For example, tips on weekends vs. tips on weekdays?
We can create grouped histograms by using the group and overlay arguments in hist.
t.hist(column, density = False, group = categorical_column)
t.hist(column, density = False, group = categorical_column,
overlay = False)
Example
See the notebook for more details.
It can be hard to compare two distributions if one has significantly more observations than the other.
When comparing distributions, compare the shape, not the absolute heights.
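A sketch of why shape is the fairer comparison, using made-up weekday and weekend tips (not the lecture's data) binned with the same edges. Converting counts to proportions of each group puts groups of different sizes on the same scale. This is plain NumPy, not the course's hist method.

```python
import numpy as np

# Hypothetical data: more weekday tips than weekend tips.
weekday_tips = np.array([1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 3.1, 3.5, 4.0, 5.5])
weekend_tips = np.array([2.0, 3.0, 3.5, 5.0])

# Shared bin edges 0, 2, 4, 6 so the two groups are directly comparable.
bin_edges = np.arange(0, 7, 2)

weekday_counts, _ = np.histogram(weekday_tips, bins=bin_edges)
weekend_counts, _ = np.histogram(weekend_tips, bins=bin_edges)

# Proportion of each group in each bin; each set of proportions sums to 1.
weekday_props = weekday_counts / weekday_counts.sum()
weekend_props = weekend_counts / weekend_counts.sum()

print(weekday_counts, weekend_counts)
print(weekday_props, weekend_props)
```

The weekday counts are larger in every bin simply because there are more weekday tips; the proportions show how each group is spread across the bins.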
Quick Check 2
Which other arguments are needed to create this histogram? Select all that apply.
tips.where('tip percentage', are.below(25)) \
.hist('tip percentage', density = False, ...)
(See Ed for the options.)
“left-skewed” or “left-tailed”
Summary, next time
Bar charts vs. histograms
Bar charts visualize the distribution of a categorical variable, or the relationship between a categorical variable and a numerical variable.
Histograms visualize the distribution of a numerical variable.
Both have bars, but only one is a bar chart!
In our class, bar charts are horizontal while histograms are vertical; this won’t always be true.
Three-step visualization process
Homework 7!
hist
A histogram visualizes the distribution of a numerical variable by binning. The method
t.hist(column, density = False)
creates a histogram of the column column of t. This column must contain numerical values.
Optional arguments:
Documentation
In addition to referring to lecture slides, make sure to refer to the following resources from earlier:
Next time
Announcements