1 of 20

Lecture 9

Histograms

DATA 8

Summer 2017

Slides created by John DeNero (denero@berkeley.edu), Ani Adhikari (adhikari@berkeley.edu), and Sam Lau (samlau95@berkeley.edu)

2 of 20

Announcements

3 of 20

Bar Charts (Review)

4 of 20

Types of Data

All values in a column should be of the same type and be comparable to each other in some way

  • Numerical — Each value is from a numerical scale
    • Numerical measurements are ordered
    • Differences are meaningful
  • Categorical — Each value is from a fixed inventory
    • May or may not have an ordering
    • Categories are the same or different

5 of 20

“Numerical” Data

Just because the values are numbers doesn’t mean the variable is numerical

  • Census example had numerical SEX code (0, 1, and 2)

  • It doesn’t make sense to perform arithmetic on these “numbers”, e.g. 1 - 0 or (0+1+2)/3 are nonsense here

  • The variable SEX is still categorical, even though numbers were used for the categories

6 of 20

Bar Charts

Compare some quantity across categories

  • % of smartphone owners who have used their phone for the following in the last year: online banking, job search, etc.
  • Gross ticket sales for individual movies

(Demo)

7 of 20

Discussion Question

Top 10 highest grossing movies

How long ago each one was released

8 of 20

Bar Charts of Counts

Distributions:

  • The distribution of a variable (a column) describes the frequency of its different values
  • The group method counts the number of rows for each value in a column

Bar charts can display the distribution of categorical values

  • Proportion of US residents who are male or female
  • Count of how many top movies were released by each studio
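
A minimal plain-Python sketch of the counting idea above, assuming a made-up column of studio labels (the names and values are hypothetical, and this uses the standard library rather than the course's group method):

```python
from collections import Counter

# Hypothetical column of studio labels (made-up values, not the lecture's data)
studios = ["Studio A", "Studio B", "Studio A", "Studio C", "Studio B", "Studio A"]

# Count the number of rows for each distinct value in the column
counts = Counter(studios)

# Divide by the total to express the distribution as proportions
total = sum(counts.values())
proportions = {studio: n / total for studio, n in counts.items()}

for studio, n in counts.most_common():
    print(studio, n, f"{proportions[studio]:.0%}")
```

The counts (or the proportions) are what a bar chart of a categorical distribution would display, one bar per category.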

(Demo)

9 of 20

Binning

10 of 20

Binning Numerical Values

Binning is counting the number of numerical values that lie within ranges, called bins.

  • Bins are defined by their lower bounds (inclusive)
  • The upper bound is the lower bound of the next bin

188, 170, 189, 163, 183, 171, 185, 168, 173, ...

[Figure: number line with bin boundaries at 160, 165, 170, 175, 180, 185, 190; label: The [185, 190) bin]
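
A rough sketch of binning, assuming only the nine values listed on the slide (the rest are elided there) and using NumPy's np.histogram rather than the course library:

```python
import numpy as np

# Only the nine values listed on the slide; the remaining values are elided there
values = np.array([188, 170, 189, 163, 183, 171, 185, 168, 173])

# Bin edges 160, 165, ..., 190: each edge is the inclusive lower bound of a bin
edges = np.arange(160, 195, 5)

# np.histogram counts the values in [lower, upper) for each bin.
# Caveat: NumPy closes the final bin on the right, [185, 190], which differs
# from the slide's convention; no listed value equals 190, so it does not matter here.
counts, _ = np.histogram(values, bins=edges)

for lower, upper, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{lower}, {upper}): {n}")
```

With these nine values, three of them (188, 189, 185) land in the [185, 190) bin.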

11 of 20

Histogram

Chart to display the distribution of numerical values using bins

(Demo)

12 of 20

Attendance

13 of 20

The Density Scale

14 of 20

Histogram Axes

By default, hist uses a scale (normed=True) that ensures the areas of the bars sum to 100%

  • The horizontal axis is a number line (e.g., years)
  • The vertical axis is a rate (e.g., percent per year)
  • The area of a bar is a percentage of the whole

(Demo)

15 of 20

How to Calculate Height

The [20, 40) bin contains 59 out of 200 movies

  • “59 out of 200” is 29.5%
  • The bin is 40 - 20 = 20 years wide

Height of bar = 29.5 percent / 20 years = 1.475 percent per year
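
The same arithmetic as a short Python check, using the numbers from this slide (the variable names are illustrative only):

```python
# Numbers from the slide: the [20, 40) bin contains 59 out of 200 movies
count_in_bin = 59
total = 200
lower, upper = 20, 40

# Percent of all movies that fall in the bin
percent_in_bin = 100 * count_in_bin / total

# Width of the bin in years
width = upper - lower

# Height of the bar, in percent per year
height = percent_in_bin / width

print(percent_in_bin, "percent /", width, "years =", height, "percent per year")
```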

16 of 20

Height Measures Density

Height = (% in bin) / (width of bin)

  • The height measures the percent of data in the bin relative to the amount of space in the bin.

  • So height measures crowdedness, or density.
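
A small sketch of why height measures crowdedness, assuming three made-up bins of unequal width (the percentages below are hypothetical, not from the demo):

```python
def bar_height(percent_in_bin, lower, upper):
    """Return the bar height: percent in the bin divided by the bin width."""
    return percent_in_bin / (upper - lower)

# Hypothetical bins: (lower bound, upper bound, percent of data in the bin)
bins = [(0, 10, 30.0), (10, 40, 45.0), (40, 50, 25.0)]

heights = [bar_height(pct, lo, hi) for lo, hi, pct in bins]

for (lo, hi, pct), h in zip(bins, heights):
    print(f"[{lo}, {hi}): {pct}% of data, height {h} percent per unit")
```

In this made-up example the [10, 40) bin holds the most data (45%) but has the shortest bar, because that data is spread over a bin three times as wide.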

(Demo)

17 of 20

Area Measures Percent

Area = % in bin = Height x width of bin

  • “How many individuals in the bin?” Use area.
  • “How crowded is the bin?” Use height.
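
A hedged sketch of the area/height distinction, assuming made-up values and unequal bins, and using NumPy's np.histogram with density=True (which returns proportion per unit, so it is multiplied by 100 here to get percent per unit):

```python
import numpy as np

# Made-up values for illustration (not the lecture's data)
values = np.array([1, 3, 5, 8, 12, 15, 22, 25, 31, 38])

# Unequal bins: [0, 10), [10, 20), and a wider final bin from 20 to 40
edges = np.array([0, 10, 20, 40])

# density=True scales the bars so that their total area is 1
density, _ = np.histogram(values, bins=edges, density=True)

heights = 100 * density      # percent per unit: "how crowded is the bin?"
widths = np.diff(edges)      # width of each bin
areas = heights * widths     # percent in bin: "how many individuals in the bin?"

print("heights:", heights)
print("areas:  ", areas, "sum =", areas.sum())
```

Here the first and last bins have equal areas (each holds 40% of the values), but the last bar is half as tall because its bin is twice as wide.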

18 of 20

Chart Types

19 of 20

Bar Chart vs. Histogram

Bar Chart

  • 1 categorical axis & 1 numerical axis
  • Bars have arbitrary (but equal) widths and spacings
  • For distributions: heights (or lengths) of bars are proportional to the percent of individuals

Histogram

  • Horizontal axis is numerical, hence to scale with no gaps
  • Height measures density; areas are proportional to the percent of individuals

20 of 20

Overlaid Graphs

For visually comparing two populations

(Demo)