1 of 31

All about histograms.

Data 94, Spring 2021 @ UC Berkeley

Suraj Rampure, with help from many others

Visualizing Numerical Variables

Lecture 26

2 of 31

Overview

  • Review of bar charts.
  • Numerical distributions.
    • Binning.
  • Histograms.
  • Choosing bins.
  • Overlaid and side-by-side histograms.

Note: the linked textbook reading for this lecture doesn’t follow our lecture exactly; it explains a different kind of histogram. It may still be somewhat helpful.

Announcements

  • Fill out Survey 6 if you haven’t already – even if you didn’t submit Homework 6!
  • Homework 7 and Survey 7 are out and are due on Thursday (11:59PM).
    • Do the survey after you finish the homework.

3 of 31

Review: bar charts

4 of 31

barh

  • Bar charts are used to display the distribution of a categorical variable, or the relationship between a categorical variable and one or more numerical variables.
  • The method t.barh(column_for_categories) creates a bar chart with:
    • The values of the column column_for_categories as the categories on the y-axis. This column should contain a categorical variable, and values should be unique.
    • Bars for every other column in t. These columns should contain numerical variables.
      • Multiple other columns → grouped bar chart.
    • Bars should be sorted depending on the type of the categorical variable.
  • Depending on the goal and table structure, grouping or pivoting may be necessary before creating the bar chart.
    • A somewhat-common pattern: t.group('column').barh('column').
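
The group-then-barh pattern works because grouping collapses the table to one row per category, paired with a count, so the category values become unique. A minimal plain-Python sketch of that counting step follows; the list of days and the variable names are made up for illustration and are not from the lecture.

```python
from collections import Counter

# Hypothetical categorical column: one entry per row of a table.
days = ["Sat", "Sun", "Sat", "Thu", "Sun", "Sat"]

# Counting rows per category mirrors what t.group('column') computes:
# one entry per unique category, paired with its count.
counts = Counter(days)

for day, count in counts.items():
    print(day, count)
```

Each (category, count) pair corresponds to one bar in the resulting bar chart.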

5 of 31

6 of 31

7 of 31

Histograms

8 of 31

Towards numerical distributions

Bar charts visualize the distribution of a categorical variable, or the relationship between a categorical variable and one or more numerical variables.

Bar charts cannot display the distribution of a numerical variable.

Question: How might we display the distribution of a numerical variable?

Answer: One solution is binning.

9 of 31

Binning

Binning is counting the number of numerical values that fall within ranges, called “bins”.

  • A bin is defined by a left endpoint (lower bound) and right endpoint (upper bound).
  • A value falls in a bin if it is greater than or equal to the left endpoint and less than the right endpoint.
    • [a, b): a is included, b is not.
  • The width of a bin is its right endpoint minus its left endpoint.

Example values: 72, 61, 63, 74, 68, 67, 65, 73, 65, 62, 66

Bin endpoints on the number line: 60, 64, 68, 72, 76

Highlighted: the [72, 76) bin, which has width 4.
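
To make the [a, b) rule concrete, here is a small plain-Python sketch that bins the eleven values from this slide using the endpoints 60, 64, 68, 72, 76. The helper name count_in_bins is hypothetical; it is not part of any library used in the course.

```python
values = [72, 61, 63, 74, 68, 67, 65, 73, 65, 62, 66]
edges = [60, 64, 68, 72, 76]  # consecutive pairs define the bins


def count_in_bins(values, edges):
    """Count how many values fall in each [left, right) bin."""
    counts = []
    for left, right in zip(edges, edges[1:]):
        # Left endpoint included, right endpoint excluded.
        counts.append(sum(1 for v in values if left <= v < right))
    return counts


bin_counts = count_in_bins(values, edges)
for left, right, count in zip(edges, edges[1:], bin_counts):
    print(f"[{left}, {right}): {count}")
```

Note that 72 lands in [72, 76) rather than [68, 72), because the right endpoint of a bin is excluded.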

10 of 31

Histograms

A histogram visualizes the distribution of a numerical variable by binning. The method

t.hist(column, density = False)

creates a histogram of the column column of t. This column must contain numerical values.

  • It automatically chooses bins for us. We can change them.
  • We will always set density = False.

“right-skewed” or “right-tailed”

11 of 31

Example

The histogram tells us that the number of tips between $1.90 and $2.80 is 79.

12 of 31

Aside: we can confirm results using where and are.between

This also tells us that the number of tips between $1.90 and $2.80 is 79.
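
The check relies on are.between(x, y) keeping values that are at least x and strictly less than y, which matches the [a, b) bin rule. Below is a plain-Python analog of that filter on a made-up list of tips; the data, the table name tips_table, and the column name 'tip' are assumptions for illustration, not the lecture's notebook.

```python
# Hypothetical tip amounts in dollars (not the lecture's data).
tips = [1.00, 1.90, 2.00, 2.79, 2.80, 3.50]

# Analog of tips_table.where('tip', are.between(1.90, 2.80)).num_rows:
# the lower bound is included and the upper bound is excluded.
in_range = [t for t in tips if 1.90 <= t < 2.80]

print(len(in_range))
```

Here 1.90 is counted and 2.80 is not, just as a value equal to a bin's right endpoint belongs to the next bin.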

13 of 31

Why do we need density = False?

  • By setting density = False, the resulting histogram shows counts on the y-axis (i.e. number of observations per bin).
  • The default, density = True, creates a histogram with “percent per unit”.
    • Percentage of observations in bin = width of bin * height of bar.
    • Allows us to have bins with unequal sizes.
    • We will not worry about this version of histograms in this class – see Data 8 or future courses.

Same data as before. The y-axis is no longer “Count”.
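
As a rough worked example of "percent per unit", take the eleven values from the binning slide, whose counts in the four width-4 bins are 3, 4, 1, and 3. This sketch computes the bar heights by hand; it does not call hist, and the variable names are illustrative.

```python
counts = [3, 4, 1, 3]  # counts for [60, 64), [64, 68), [68, 72), [72, 76)
width = 4              # every bin has the same width here
total = sum(counts)

# Height of each bar in "percent per unit": the bin's percentage of all
# observations, divided by the bin's width.
heights = [100 * c / total / width for c in counts]

# Reading a percentage back off the picture: width of bin * height of bar.
percentages = [width * h for h in heights]

print(heights)
print(sum(percentages))
```

The percentages across all bins add up to 100, which is why the y-axis in this version is no longer a count.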

14 of 31

Quick Check 1

Answer the following questions about this histogram of (fake) heights. If you don’t believe it’s possible to answer the question, write “can’t tell”.

  1. How many heights are between 60 inches (inclusive) and 64 inches (exclusive)?
  2. How many heights are between 60 inches (inclusive) and 68 inches (exclusive)?
  3. How many heights are between 62 inches (inclusive) and 68 inches (exclusive)?

15 of 31

A note on bins

By looking at a histogram, we cannot tell how values are distributed within a bin.

All heights in this bin could be 64 inches.

Or they could all be 66 inches.

Or half could be 65 and half could be 67.

Unless we have the actual data, we can’t tell.
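
A short sketch of why the histogram alone cannot settle this: three different datasets produce the same count in one bin. The bin [64, 68) and the four-value datasets are assumptions chosen to match the three scenarios described above.

```python
def count_in_bin(values, left, right):
    """Number of values v with left <= v < right."""
    return sum(1 for v in values if left <= v < right)


all_64 = [64, 64, 64, 64]
all_66 = [66, 66, 66, 66]
half_and_half = [65, 65, 67, 67]

for data in (all_64, all_66, half_and_half):
    # Each dataset gives the same bar height for the [64, 68) bin.
    print(count_in_bin(data, 64, 68))
```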

16 of 31

Customization

We can use the same optional customization arguments with hist as we did with barh.

  • xaxis_title
  • yaxis_title
  • title
  • width
  • height

17 of 31

Choosing bins

18 of 31

Bins

hist chooses bins by default for us.

  • The resulting histogram often looks nice, but has non-integer bin endpoints, which are harder for us to interpret.
  • We can choose our own bins.
  • We will only consider histograms where all bins have equal width, though it is possible to draw histograms where bins have unequal width.

The optional bins argument in hist takes an array and uses the values in the array as the bin endpoints instead of the defaults.

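The very last bin is [10, 11), not [11, 12): the final value in the array is the right endpoint of the last bin, not the start of a new one.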

19 of 31

np.arange, revisited

If we only need a few bins, we can make an array of bins by hand, like in the previous example.

But if we want more than a handful of bins, it will be tiring to create them by hand. Instead, we can use np.arange to create an array of equally spaced values.

Same as np.array([0, 1, 2, 3, ..., 11]).
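
One call that produces this array is np.arange(0, 12), since np.arange includes the start and excludes the stop. The sketch below checks that against the handwritten array; the exact call used on the slide is not reproduced here, so treat this as one possible version.

```python
import numpy as np

# The array of bin endpoints written out by hand.
by_hand = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

# Start is included, stop is excluded, and the default step is 1.
with_arange = np.arange(0, 12)

print(np.array_equal(by_hand, with_arange))

# n endpoints define n - 1 bins.
num_bins = len(with_arange) - 1
print(num_bins)
```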

20 of 31

Example

It’s a good idea to determine the min and max values in a column before choosing bins, to make sure your bins encompass the entire range.

We need to make sure that the largest value in the column is part of one of our bins.

If we used np.arange(3, 51, 3), for example, the last two elements in bins_3 would be 45 and 48, meaning that the last bin is [45, 48), and 50.81 isn’t placed in a bin.
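
One way to avoid this is to compute the bin endpoints from the column's min and max. The sketch below is an assumption-laden illustration: the helper covering_bins is hypothetical, and the minimum of 3.07 is made up (only the maximum, 50.81, comes from the slide).

```python
import math

import numpy as np


def covering_bins(lowest, highest, width):
    """Equal-width endpoints whose bins contain both lowest and highest."""
    start = math.floor(lowest / width) * width
    # Go one full bin past the bin that contains `highest`, so that the
    # last endpoint is strictly greater than the largest value.
    stop = (math.floor(highest / width) + 1) * width
    return np.arange(start, stop + width, width)


too_short = np.arange(3, 51, 3)
print(too_short[-2:])  # the last two endpoints; compare with the max

bins = covering_bins(3.07, 50.81, 3)
print(bins[0], bins[-1])
```

With these inputs the last endpoint is greater than 50.81, so the largest value falls inside the final bin.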

21 of 31

Tradeoff

The bin width we choose can dramatically change the resulting histogram.

  • Narrow bins → many bins, each with few values. Resulting histogram is choppy.
  • Wide bins → few bins, each with many values. Resulting histogram is smooth.

In practice, it’s up to you to choose the bin size of your histogram.

Figure panels: histograms with bin widths of 3, 7, and 10.
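
To see the "narrow bins → many bins" side of the tradeoff in numbers, this sketch counts how many equal-width bins are needed to cover a fixed range for widths 3, 7, and 10. The range of 3 to 51 is an assumption borrowed from the previous example, and bins_needed is a hypothetical helper.

```python
import math


def bins_needed(low, high, width):
    """Number of equal-width bins needed to cover the range low..high."""
    return math.ceil((high - low) / width)


low, high = 3, 51  # assumed range of the data
bin_counts_by_width = {w: bins_needed(low, high, w) for w in (3, 7, 10)}

for width, n in bin_counts_by_width.items():
    print(f"width {width}: {n} bins")
```

The same observations get spread across fewer bins as the width grows, so each bar holds more values.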

22 of 31

Overlaid and side-by-side histograms

23 of 31

group and overlay

What if we want to show the distribution of a numerical variable, grouped by some categorical variable? For example, tips on weekends vs. tips on weekdays?

We can create grouped histograms by using the group and overlay arguments in hist.

  • To create one histogram for every unique category in some categorical column, assign the group argument to the name of the categorical column.

t.hist(column, density = False, group = categorical_column)

  • By default, the above creates multiple histograms on the same set of axes. If you would rather have multiple individual histograms, set the overlay argument to False.

t.hist(column, density = False, group = categorical_column, overlay = False)
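
Conceptually, the group argument splits the numerical column by category and then bins each piece with the same bins. Here is a plain-Python sketch of that idea with made-up rows; it does not call hist, and the data, bin endpoints, and helper name are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical (tip, day type) rows; not the lecture's data.
rows = [(1.5, "weekday"), (2.2, "weekend"), (3.1, "weekend"),
        (2.9, "weekday"), (4.0, "weekend")]
edges = [0, 2, 4, 6]  # shared bin endpoints for every group


def count_in_bins(values, edges):
    """Count how many values fall in each [left, right) bin."""
    return [sum(1 for v in values if left <= v < right)
            for left, right in zip(edges, edges[1:])]


# Step 1: split the numerical values by category.
by_group = defaultdict(list)
for tip, day_type in rows:
    by_group[day_type].append(tip)

# Step 2: bin each group separately, using the same bins.
group_counts = {g: count_in_bins(v, edges) for g, v in by_group.items()}
print(group_counts)
```

Each entry of group_counts corresponds to one histogram; overlay only decides whether they share a set of axes.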

24 of 31

Example

See the notebook for more details.

It can be hard to compare two distributions if one has significantly more observations than the other.

When comparing distributions, compare the shape, not the absolute heights.
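
One way to compare shapes rather than heights is to convert each group's counts into proportions of that group's total. The counts below are made up so that the two groups differ in size but not in shape; proportions is a hypothetical helper, not a course function.

```python
# Hypothetical per-bin counts for two groups of very different sizes.
weekday_counts = [10, 30, 10]
weekend_counts = [2, 6, 2]


def proportions(counts):
    """Fraction of a group's observations that fall in each bin."""
    total = sum(counts)
    return [c / total for c in counts]


weekday_shape = proportions(weekday_counts)
weekend_shape = proportions(weekend_counts)

print(weekday_shape)
print(weekend_shape)
```

The raw bars for the larger group are five times taller, yet the two lists of proportions match, meaning the distributions have the same shape.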

25 of 31

Quick Check 2

Which other arguments are needed to create this histogram? Select all that apply.

tips.where('tip percentage', are.below(25)) \
    .hist('tip percentage', density = False, ...)

(See Ed for the options.)

“left-skewed” or “left-tailed”

26 of 31

Summary, next time

27 of 31

Bar charts vs. histograms

Bar charts visualize the distribution of a categorical variable, or the relationship between a categorical variable and a numerical variable.

  • Length of bar corresponds to value.
  • Width of bar means nothing.

Histograms visualize the distribution of a numerical variable.

  • Length of bar corresponds to number of values within bin.
  • Width of bar corresponds to the width of the bin.
    • Wider bin → more values within bin → smoother histogram.

Both have bars, but only one is a bar chart!

In our class, bar charts are horizontal while histograms are vertical; this won’t always be true.

28 of 31

Three-step visualization process

  1. Create a table with only the columns necessary to create the visualization.
  2. Call the correct visualization method. (So far, barh and hist.)
  3. Provide the correct arguments to customize the plot (axis labels, title, size, etc).

Homework 7!

29 of 31

hist

A histogram visualizes the distribution of a numerical variable by binning. The method

t.hist(column, density = False)

creates a histogram of the column column of t. This column must contain numerical values.

Optional arguments:

  • bins (array): allows us to manually select bins. Frequently used with np.arange.
  • group (string): allows us to draw separate histograms, one for each unique value in the specified categorical column.
  • overlay (boolean): only used in conjunction with group. If False, draws histograms on separate axes rather than on top of one another.
  • All customization arguments from before (xaxis_title, yaxis_title, title, width, height) – see Slide 23 of Lecture 25.

30 of 31

Documentation

In addition to referring to lecture slides, make sure to refer to the following resources from earlier:

  • Data 8 Python reference.
  • Data 8 textbook (reading linked on course website).
  • Official datascience documentation.
    • Scroll down to “Visualizations”.
    • Remember that you can type the name of a method followed by ? in a notebook to see the documentation inline.

31 of 31

Next time

  • Visualizing pairs of numerical variables.
  • Scatter plots.
  • Line plots.

Announcements

  • Fill out Survey 6 if you haven’t already – even if you didn’t submit Homework 6!
  • Homework 7 and Survey 7 are out and are due on Thursday (11:59PM).
    • Do the survey after you finish the homework.