1 of 24

Descriptive Statistics/ Summary Statistics (Chapter 7)

CMSC 320 - Introduction to Data Science

Fardina Alam

Part 02

2 of 24

Descriptive Statistics/ Summary Statistics

What are the main characteristics that describe a dataset?

Descriptive statistics help us:

  • Summarize data quickly
  • Organize information clearly
  • Present key patterns concisely

They let us understand a dataset without examining every detail, and they provide a foundation for further analysis and prediction.

Common examples: Mean, median, mode

All four of these have the same mean and variance, but are clearly generated by four different processes.

3 of 24

Types of Descriptive/ Summary Statistics

  • Measures of central tendency/The location of the data
    • e.g. mean, median, mode, etc.
  • The shape of the data:
    • Measures of Skewness
    • Modality
  • The spread of the data/ Measure of dispersion:
    • Measures of variability (spread)
    • e.g. range, variance, etc…
  • Associations between variables: (LATER TOPIC)
    • Understanding correlation, covariance etc.
    • Measuring correlation
    • Scatter plots and regression

Part of descriptive statistics, used to summarize data:

  • Convey lots of information with extreme simplicity

4 of 24

Choose Topic

⭐ Star

🔔 Bell

🏆 Trophy

🎈 Balloon

🍀 Clover

🍎 Apple

🌸 Flower

💎 Gem

🔷 Diamond

🎵 Note

⚡ Lightning

🕊️ Dove

🌙 Moon

🎯 Target

🐾 Paw

🍉 Watermelon

5 of 24

Topic No.

⭐ Star (6)

🔔 Bell (13)

🏆 Trophy (5)

🎈 Balloon (4)

🍀 Clover (9)

🍎 Apple (14)

🌸 Flower (3)

💎 Gem (15)

🔷 Diamond (16)

🎵 Note (2)

⚡ Lightning (8)

🕊️ Dove (7)

🌙 Moon (1)

🎯 Target (12)

🐾 Paw (10)

🍉 Watermelon (11)

6 of 24

  1. Measures of Central Tendency

Measures of Central Tendency describe the center of a distribution by representing the entire dataset with a single value. These include:

  • Pythagorean means: three types of averages:
    • Arithmetic → for additive data
    • Geometric → for growth/multiplicative data
    • Harmonic → for rates and ratios
  • Median
  • Mode

Mean → average value

Median → middle ordered value

Mode → most common value

7 of 24

  • Data form a distribution
  • We want to describe its center
  • Mean and median both measure central tendency
  • They differ when data are not symmetric

These are 30 hours of average defect data on sets of circuit boards. Roughly what is the typical value?

1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35

Example: MEASURES OF LOCATION

Purpose: Identify the "center" of a distribution of values.
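As a rough check on the "typical value" question, the 30 readings can be summarized with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative and not part of the slides.

```python
import statistics

# The 30 defect readings listed on this slide.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

mean_defects = statistics.mean(defects)      # arithmetic average
median_defects = statistics.median(defects)  # middle of the sorted values
mode_defects = statistics.mode(defects)      # most frequent value

print(f"mean   = {mean_defects:.3f}")
print(f"median = {median_defects:.2f}")
print(f"mode   = {mode_defects:.2f}")
```

All three measures fall between roughly 1.7 and 1.9, so a typical value is a little under 2.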

8 of 24

What is the “Center” of Data?

The center is a value that is closest to most data points in a distribution. To define “closeness,” we need a distance measure between values.

Two Ways to Measure Distance

  • Absolute deviation: | x₁ – x₂ |
  • Squared deviation: (x₁ – x₂)²

We define the center based on these metrics:

  • Median: minimizes the sum of absolute deviations
  • Mean: minimizes the sum of squared deviations
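The two minimization claims can be checked numerically with a brute-force search over candidate centers. This is only a sketch: the sample values are borrowed from the outlier example in this deck, and the grid step of 0.01 and the helper names are assumptions chosen for illustration.

```python
import statistics

# Small illustrative sample (any numeric sample would do).
data = [1, 2, 4, 6, 8, 9, 17]

def sum_abs_dev(c):
    """Sum of absolute deviations of the data from candidate center c."""
    return sum(abs(x - c) for x in data)

def sum_sq_dev(c):
    """Sum of squared deviations of the data from candidate center c."""
    return sum((x - c) ** 2 for x in data)

# Candidate centers from 0.00 to 20.00 in steps of 0.01.
candidates = [i / 100 for i in range(0, 2001)]

best_abs = min(candidates, key=sum_abs_dev)  # should land on the median
best_sq = min(candidates, key=sum_sq_dev)    # should land near the mean

print("minimizer of absolute deviations:", best_abs)
print("median:", statistics.median(data))
print("minimizer of squared deviations:", best_sq)
print("mean:", round(statistics.mean(data), 3))
```

The grid search only approximates the mean to the nearest grid point, so the squared-deviation minimizer agrees with the mean up to the step size.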

9 of 24

THE MEAN OF AGGREGATE DATA

Average list price:

1/51 ($898,800 + $713,864 + … + $164,326) = $369,687

10 of 24

Averaging Averages? Be Careful!

Simply averaging group averages can be misleading if group sizes differ.

Both states get equal weight, even though Illinois has 10× more people. When group sizes differ, use a weighted mean to avoid misleading conclusions.

11 of 24

WEIGHTED AVERAGE

State population data: http://www.factmonster.com/ipka/A0004986.html

Sometimes an unequal weighting of the observations is necessary

The weighted average is $409,234, compared to $369,687 without weights, a difference of about 11%.

Solution: Because Illinois has many more people, it should receive more weight when computing the average.
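A weighted mean multiplies each group average by its weight, sums the products, and divides by the total weight. The sketch below uses made-up numbers for one large and one small group; the prices and populations are hypothetical and are not the data behind these slides.

```python
def weighted_mean(values, weights):
    """Return sum(v * w) / sum(w) for paired values and weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical figures for illustration only:
# average list price in a large group and a small group,
# with population (in millions) used as the weight.
avg_price = [300_000, 200_000]
population = [10.0, 1.0]

unweighted = sum(avg_price) / len(avg_price)
weighted = weighted_mean(avg_price, population)

print(f"unweighted mean: {unweighted:,.0f}")
print(f"weighted mean:   {weighted:,.0f}")
```

With these numbers the weighted mean sits much closer to the large group's average, which is the behavior the slide asks for.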

12 of 24

Pythagorean Means: (1) Arithmetic Mean

  • Arithmetic Mean: Your typical average
    • Sensitive to Outliers: Affected by extreme values.
    • Equal Weight: Treats all values equally in the calculation, regardless of their distance from the center of the dataset.

13 of 24

Pythagorean Means: (2) Geometric Mean

Geometric Mean: Used for growth, ratios, and percentages. Multiply all values → take the n-th root

14 of 24

Geometric Mean Is Less Sensitive to Outliers

Arithmetic mean → adds values (Large numbers pull it up a lot)

Geometric mean → multiplies values (Looks at relative change, not just size)

Example: Consider two datasets of investment returns (in percentage):

Dataset A: 2%, 4%, 6%, 8%

Dataset B (with an outlier): 2%, 4%, 6%, 100%

Calculating the Mean:

Observe: In Dataset B, the outlier pulls the geometric mean up far less than the arithmetic mean, because the geometric mean uses multiplication (relative change), not addition.
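The comparison for the two datasets can be sketched as follows. One assumption to flag: the percentages are treated as plain positive numbers (2, 4, 6, ...) rather than converted to growth factors such as 1.02, which is what a compound-return calculation would use. The function names are illustrative.

```python
import math

def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.prod(values) ** (1 / len(values))

# Returns in percent, treated as plain numbers (assumption, see above).
dataset_a = [2, 4, 6, 8]
dataset_b = [2, 4, 6, 100]  # same as A, but with an outlier

for name, values in [("A", dataset_a), ("B", dataset_b)]:
    print(
        f"Dataset {name}: arithmetic = {arithmetic_mean(values):.2f}, "
        f"geometric = {geometric_mean(values):.2f}"
    )
```

Under this assumption the arithmetic mean jumps from 5 to 28 when the outlier is added, while the geometric mean moves from about 4.43 to about 8.32.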

15 of 24

Pythagorean Means: (3) Harmonic Mean

A measure of central tendency used for rates, ratios, and “per-unit” data.

How it is computed: take the reciprocal of the arithmetic mean of the reciprocals of the values.

When to Use: when averaging speeds, work rates, and ratios (such as price-to-earnings or density).

Key Properties

  • Gives more weight to smaller values
  • Less influenced by very large values
  • Cannot be used if any value is zero

16 of 24

Harmonic Means: Example (Average Speed)

A car travels equal distances at two speeds:

  • First half of the trip: 60 km/h
  • Second half of the trip: 40 km/h

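Try Arithmetic Mean = (60 + 40)/2 = 50 km/h

Incorrect: this weights the two speeds equally, which would be right only if the car spent equal time at each speed. Over equal distances, it spends more time at the slower speed.

Try Geometric Mean = √(60 × 40) ≈ 48.99 km/h

Incorrect: closer, but still not the true average speed.

Solution: Harmonic Mean = 2 / (1/60 + 1/40) = 2 / (2/120 + 3/120) = 2 / (5/120) = (2 × 120)/5 = 48 km/h

The harmonic mean works because it accounts for time spent at each speed and gives more weight to lower speeds.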

Harmonic mean is correct for average speed because slower speeds dominate total travel time.
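The result can be checked against the definition of average speed (total distance divided by total time). In this sketch the leg length of 120 km is a hypothetical value picked to keep the arithmetic clean; any equal leg length gives the same answer.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean requires positive values")
    return len(values) / sum(1 / v for v in values)

speeds_kmh = [60, 40]  # speed on each half of the trip
leg_km = 120           # hypothetical length of each half

# Average speed from first principles: total distance / total time.
total_time_h = sum(leg_km / s for s in speeds_kmh)  # 2 h + 3 h
true_average = (leg_km * len(speeds_kmh)) / total_time_h

hm = harmonic_mean(speeds_kmh)

print(f"total distance / total time = {true_average:.2f} km/h")
print(f"harmonic mean of speeds     = {hm:.2f} km/h")
```

Both lines print 48.00 km/h, matching the hand calculation.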

17 of 24

2. Measures of Skewness: Other Descriptors

Extreme observations distort means but not medians.

Outlying observations distort the mean:

  • Med [1,2,4,6,8,9,17] = 6
  • Mean[1,2,4,6,8,9,17] = 6.714
  • Med [1,2,4,6,8,9,17000] = 6 (still)
  • Mean[1,2,4,6,8,9,17000] = 2432.9 (!)

This typically occurs when there are some outlying observations, such as in cross sections of income or wealth, and/or when the sample is not very large.
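The four values in the list above can be reproduced with the standard `statistics` module. A minimal sketch, with illustrative variable names:

```python
import statistics

typical = [1, 2, 4, 6, 8, 9, 17]
with_outlier = [1, 2, 4, 6, 8, 9, 17000]  # last value replaced by an outlier

for label, values in [("typical", typical), ("with outlier", with_outlier)]:
    med = statistics.median(values)
    avg = statistics.mean(values)
    print(f"{label:>12}: median = {med}, mean = {avg:.3f}")
```

The median stays at 6 in both cases, while the mean moves from about 6.714 to about 2432.857.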

The shape of the data

18 of 24

SKEWED DATA

Monthly Earnings: N = 595, Median = 800, Mean = 883

These data are skewed to the right.

The mean will exceed the median when the distribution is skewed to the right.

Skewness is in the direction of the long tail

  • Symmetric: Mean = Median
  • Left-Skewed: Mean < Median
  • Right-Skewed: Median < Mean
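The link between the direction of skew and the mean/median ordering can be illustrated with a small computation. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second), which is one common definition among several; the data are hypothetical.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5

# Hypothetical data with a long right tail, and its mirror image.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
left_skewed = [-x for x in right_skewed]

for label, values in [("right", right_skewed), ("left", left_skewed)]:
    print(
        f"{label}-skewed: mean = {statistics.mean(values):.1f}, "
        f"median = {statistics.median(values):.1f}, "
        f"skewness = {skewness(values):.2f}"
    )
```

For the right-skewed sample the skewness is positive and the mean exceeds the median; mirroring the data flips both.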

19 of 24

3. Modality: Other Descriptors

Understand the number and nature of peaks in a distribution, which provides insight into the data's structure and characteristics

The shape of the data

20 of 24

3. Modality: number of peaks or modes in a distribution

  • Unimodal Distribution: One peak, showing a single trend.

Example: Adult human heights having one peak around the average height.

  • Bimodal Distribution: Two peaks, indicating two groups.

Example: Exam scores with high and low achievers.

  • Trimodal Distribution: Three peaks.

Example: Monthly website traffic, with peaks around product launches, seasonal sales, and special events.

  • Multimodal Distribution: Several peaks (strictly, any distribution with more than one peak), suggesting various factors.

Example: Retail sales with peaks during holidays and promotions.

Modality is useful because it reveals the underlying structure and characteristics of the data by identifying the number of distinct groups or clusters within a dataset.
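One simple way to count peaks is to bin the data into a histogram and count the bins that are higher than both neighbors. The sketch below does this on hypothetical bin counts; it ignores plateaus, and real data is noisy enough that it usually needs smoothing or wider bins first.

```python
def count_peaks(bin_counts):
    """Count strict local maxima in a sequence of histogram bin counts."""
    # Pad with zeros so the first and last bins can also be peaks.
    padded = [0] + list(bin_counts) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )

# Hypothetical histogram counts, for illustration only.
unimodal = [1, 3, 7, 12, 7, 3, 1]
bimodal = [2, 8, 3, 1, 4, 9, 2]
trimodal = [1, 5, 2, 6, 2, 7, 1]

for name, counts in [("unimodal", unimodal), ("bimodal", bimodal), ("trimodal", trimodal)]:
    print(f"{name}: {count_peaks(counts)} peak(s)")
```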

21 of 24

4. Measures of Variability

The spread of the data/ Measure of dispersion:

An important characteristic of any set of data is the variation in the data; it reflects how tightly or widely data points are distributed around the mean.

The standard deviation and variance are the most common measures of this spread.

22 of 24

The Standard Deviation σ and the Variance σ²

Variance: Measures the average squared distance from the mean

  • Units are squared (harder to interpret)

Standard Deviation: Square root of variance

  • Measured in the same units as the data
  • Easier to interpret

Population vs Sample

  • Population: divide the sum of squared deviations by N
  • Sample: divide by n − 1 (Bessel’s correction)

Key Idea: Larger variance or standard deviation → greater spread in the data.
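Python's standard `statistics` module exposes both conventions: `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by n − 1. The sketch below applies them to a small hypothetical dataset whose mean is 5.

```python
import statistics

# Hypothetical data, chosen so the population variance is a whole number.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)    # sum of squared deviations / N
pop_sd = statistics.pstdev(data)        # square root of pop_var
sample_var = statistics.variance(data)  # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(data)      # square root of sample_var

print(f"population: variance = {pop_var}, sd = {pop_sd}")
print(f"sample:     variance = {sample_var:.3f}, sd = {sample_sd:.3f}")
```

The sum of squared deviations here is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7 ≈ 4.571; the sample version is always the larger of the two.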

Further reading: https://www.shiksha.com/online-courses/articles/variance-and-standard-deviation/

23 of 24

Interpreting Variability with Standard Deviation

Greater Variability (a wide, flat distribution curve): Larger standard deviation.

Less Variability (tall, spike-like distribution curve): Smaller standard deviation (data points are closer to the mean).

This visual representation helps us understand the level of variability in the data.

24 of 24

5. Correlation and Relationships (Later Topics)

Correlation: Measures the strength and direction of the relationship between two variables.

Understanding Correlation:

  • Positive: Variables move in the same direction.
  • Negative: Variables move in opposite directions.

Measuring Correlation: Pearson’s Correlation Coefficient (r) ranges from -1 to 1.

  • Example: Hours studied vs. exam scores show a positive correlation (r ≈ 0.8).

The End