1 of 16

Lecture 7

Distributions

Summer 2021

2 of 16

Announcements

  • Lab03: submit by Tues at 11:59pm; you must pass all autograder tests to receive credit
  • HW2 due at 11:59pm tonight
  • HW3 released later today, due Fri 7/02 at 11:59pm
  • Lab01 & Lab02 regrade requests are due by Wednesday 6/30; make sure to read the Piazza post on how to submit one

3 of 16

Weekly Goals

  • Today
    • Recap of Visualizations
    • Visualize two kinds of distributions: Categorical, Numerical
  • Wednesday
    • Proportions as areas
    • Histograms + Numerical Distributions
  • Thursday
    • Functions!

4 of 16

Terminology

  • Individuals: those whose features are recorded
  • Variable: a feature, an attribute
    • acquires different values across rows
    • numerical, and therefore ordered; or
    • categorical (e.g., colors, genders), with possible sub-types
    • takes on exactly one value for each individual
    • has a Distribution:
      • For each different value of the variable, the frequency of individuals that have that value
        • Ex. Count of movies made in the same studio
        • Ex. Count of 50-year-olds in our data set
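The definition of a distribution above can be sketched in a few lines of plain Python. This is only an illustration: the `ages` list is made-up data, and `Counter` from the standard library stands in for whatever table tooling the course demo uses.

```python
from collections import Counter

# Hypothetical data: one age per individual (one value per row).
ages = [50, 23, 50, 31, 23, 50]

# The distribution: for each distinct value, how many individuals have it.
distribution = Counter(ages)

for value, count in sorted(distribution.items()):
    print(f"age {value}: {count} individual(s)")
```

Because every individual has exactly one value, the counts add up to the number of individuals.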

5 of 16

[Figure labels: Categorical, Numerical, Individual]

6 of 16

Charts

7 of 16

Plotting Two Numerical Variables

Line plot: plot

Scatter plot: scatter

(Demo)

8 of 16

Line Plot v. Scatter Plot: When to Use Each?

  • Use line plots for sequential data, if:
    • The attribute plotted along the x-axis has an order
    • Sequential differences in y-values are meaningful (e.g., stocks)
    • There’s only one y-value for each x-value
  • Usually, the x-axis denotes time or distance
  • Use scatter plots for data that is not necessarily sequential:
    • When you’re looking for associations (e.g., trends)
    • When there are multiple y-values per x-value
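As a rough rule of thumb, part of the checklist above can be written as a small helper. This is a sketch, not a definitive rule: the function names and the `x_is_ordered` flag are hypothetical, and whether sequential differences are meaningful remains a judgment call that code cannot make for you.

```python
def one_y_per_x(xs):
    """True if no x-value repeats, so each x has exactly one y-value."""
    return len(set(xs)) == len(xs)


def suggest_plot(xs, x_is_ordered):
    """Suggest 'line' only when x is ordered and each x has a single y-value."""
    if x_is_ordered and one_y_per_x(xs):
        return "line"
    return "scatter"


# Hypothetical x-values: days in sequence vs. repeated measurements.
print(suggest_plot([1, 2, 3, 4], x_is_ordered=True))   # one y per day
print(suggest_plot([1, 1, 2, 2], x_is_ordered=True))   # several y per x
```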

9 of 16

Numerical and Categorical

  • How does a numerical value vary with a categorical variable?
  • One axis is categorical, one numerical
  • Bars are equally spaced and have the same width; the length of each bar corresponds to the numerical value
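One way to get a single numerical value per category, which then becomes the bar length, is to average within each category. The sketch below assumes made-up studio names and gross figures, and `mean_by_category` is a hypothetical helper, not part of any course library.

```python
from collections import defaultdict


def mean_by_category(categories, values):
    """Return {category: mean of the values that belong to that category}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, value in zip(categories, values):
        totals[category] += value
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Hypothetical data: one studio (categorical) and one gross (numerical) per movie.
studios = ["Studio A", "Studio B", "Studio A", "Studio B", "Studio C"]
gross_millions = [100.0, 40.0, 200.0, 60.0, 75.0]

bar_lengths = mean_by_category(studios, gross_millions)
for studio, mean_gross in bar_lengths.items():
    print(f"{studio}: {mean_gross}")
```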

(Demo)

10 of 16

Discussion Question

For the following, which visualization would you use?

  1. Association between price of BTC and market cap of USDT?
  2. Average percent change in BTC price on days when USDT market cap increases, decreases or stays the same?
  3. How did the price of Dogecoin change during Elon Musk’s SNL appearance?

11 of 16

Distributions of Categorical Variables

12 of 16

Distribution

Distribution:

  • For each different value of a variable, the frequency of individuals that have that value

Categorical Distribution

  • For each different value of a categorical variable, the frequency of individuals that have that category

13 of 16

Bar Charts & Visualization of Categorical Variables

  • Bar charts are commonly used to visualize categorical distributions
  • One axis is categorical (the category), one numerical (the frequency)

(Demo)

14 of 16

Displaying a Categorical Distribution

  • The distribution of a variable (a column, e.g., Stable Coin) describes the frequencies of its different values
  • The group method counts the number of rows for each value in the column (e.g., the number of coins priced at $1)
  • Bar charts can display the distribution of a categorical variable (e.g., studios):
    • One bar for each category
    • Length of bar is the count of individuals in that category
    • You can choose the order of the bars
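The counting step and the bar display described above can be imitated in plain Python. This is a minimal sketch under stated assumptions: `group` here is a hypothetical stand-in written with `Counter`, not the table method from the demo, the studio names are made up, and the bars are drawn as text.

```python
from collections import Counter


def group(values):
    """Count rows per distinct value; return (value, count) pairs sorted by value."""
    return sorted(Counter(values).items())


def text_bars(pairs):
    """One text bar per category; the bar length equals the count."""
    return [f"{label:<12} {'#' * count} ({count})" for label, count in pairs]


# Hypothetical column: the studio of each movie (one row per movie).
studios = ["Studio B", "Studio A", "Studio B", "Studio C", "Studio B", "Studio A"]

counts = group(studios)

# The order of the bars is a choice; here, largest count first.
by_count = sorted(counts, key=lambda pair: pair[1], reverse=True)

for line in text_bars(by_count):
    print(line)
```

Sorting by count is only one option; alphabetical order or any order meaningful to the categories works equally well, since categories have no built-in ordering.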

15 of 16

Distributions of Numerical Variables

16 of 16

Visualizing Numerical Values: Binning

Binning is counting the number of numerical values that lie within ranges, called bins.

  • Bins are defined by their lower bounds (inclusive)
  • The upper bound is the lower bound of the next bin

Example values: 188, 170, 189, 163, 183, 171, 185, 168, 173, ...

[Figure: number line with bin boundaries at 160, 165, 170, 175, 180, 185, 190; the [185,190) bin is highlighted]

(Demo)
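The binning rule above (lower bound inclusive, upper bound equal to the next bin's lower bound) can be sketched in plain Python using the sample values and boundaries from the slide. The helper `bin_counts` is hypothetical, and it makes one simplifying assumption: values outside every bin, including a value exactly equal to the last boundary, are ignored. Library histogram functions may treat the last bin differently.

```python
from bisect import bisect_right


def bin_counts(values, edges):
    """Count values in half-open bins [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        i = bisect_right(edges, value) - 1  # index of the last edge <= value
        if 0 <= i < len(counts):            # skip values outside all bins
            counts[i] += 1
    return counts


# The values listed on the slide (the trailing "..." is not filled in).
values = [188, 170, 189, 163, 183, 171, 185, 168, 173]
edges = list(range(160, 195, 5))  # 160, 165, ..., 190

counts = bin_counts(values, edges)
for lower, upper, count in zip(edges, edges[1:], counts):
    print(f"[{lower}, {upper}): {count}")
```

Note that 185 lands in the [185,190) bin, not in [180,185), because each bin includes its lower bound and excludes its upper bound.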