Descriptive Statistics/ Summary Statistics (Chapter 7)
CMSC 320 - Introduction to Data Science
Fardina Alam
Part 02
Descriptive Statistics/ Summary Statistics
What are the main characteristics that describe a dataset?
Descriptive statistics let us understand a dataset without examining every detail, and they provide a foundation for further analysis and prediction.
Common examples: Mean, median, mode
All four of these have the same mean and variance, but are clearly generated by four different processes.
Types of Descriptive/ Summary Statistics
Part of descriptive statistics, used to summarize data:
Measures of Central Tendency describe the center of a distribution by representing the entire dataset with a single value. These include:
Mean → average value
Median → middle ordered value
Mode → most common value
These are 30 hours of average defect data on sets of circuit boards. Roughly what is the typical value?
1.45 1.65 1.50 2.25 1.65 1.60 2.30 2.20 2.70 1.70
2.35 1.70 1.90 1.45 1.40 2.60 2.05 1.70 1.05 2.35
1.90 1.55 1.95 1.60 2.05 2.05 1.70 2.30 1.30 2.35
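One way to answer the "typical value" question is to compute the three measures of central tendency directly. The sketch below uses Python's standard `statistics` module on the 30 values from the slide; the variable names are illustrative only.

```python
import statistics

# The 30 hourly defect averages listed on the slide.
defects = [
    1.45, 1.65, 1.50, 2.25, 1.65, 1.60, 2.30, 2.20, 2.70, 1.70,
    2.35, 1.70, 1.90, 1.45, 1.40, 2.60, 2.05, 1.70, 1.05, 2.35,
    1.90, 1.55, 1.95, 1.60, 2.05, 2.05, 1.70, 2.30, 1.30, 2.35,
]

mean_value = statistics.mean(defects)      # sum of values / number of values
median_value = statistics.median(defects)  # middle of the sorted values
mode_value = statistics.mode(defects)      # most frequent value

print(f"mean   = {mean_value:.2f}")    # about 1.88
print(f"median = {median_value:.2f}")  # 1.80
print(f"mode   = {mode_value:.2f}")    # 1.70
```

All three land between 1.7 and 1.9, so "a bit under 2" is a reasonable rough answer.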
Example: MEASURES OF LOCATION
Purpose: Identify the "center" of a distribution of values.
What is the “Center” of Data?
The center is a value that is closest to most data points in a distribution. To define “closeness,” we need a distance measure between values.
Two Ways to Measure Distance
We’ll define the center based on these metrics
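The two distance measures are not listed in this text. Assuming they are the usual pair, squared difference and absolute difference, the sketch below searches a grid of candidate centers and shows which one minimizes each total distance. The data and function names are hypothetical.

```python
import statistics

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]

def total_squared_distance(center, values):
    """Sum of squared differences between each value and the candidate center."""
    return sum((v - center) ** 2 for v in values)

def total_absolute_distance(center, values):
    """Sum of absolute differences between each value and the candidate center."""
    return sum(abs(v - center) for v in values)

# Candidate centers 0.0, 0.1, ..., 100.0.
candidates = [c / 10 for c in range(0, 1001)]

best_squared = min(candidates, key=lambda c: total_squared_distance(c, data))
best_absolute = min(candidates, key=lambda c: total_absolute_distance(c, data))

print(best_squared, statistics.mean(data))     # both 22
print(best_absolute, statistics.median(data))  # both 3
```

Under this assumption, the mean is the center for squared distance and the median is the center for absolute distance.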
THE MEAN OF AGGREGATE DATA
Average list price:
1/51 ($898,800 + $713,864 + … + $164,326) = $369,687
Averaging Averages? Be Careful!
Simply averaging group averages can be misleading if group sizes differ.
Both states get equal weight, even though Illinois has 10× more people. When group sizes differ, use a weighted mean to avoid misleading conclusions.
WEIGHTED AVERAGE
State population data: http://www.factmonster.com/ipka/A0004986.html
Sometimes an unequal weighting of the observations is necessary
The weighted average is $409,234, compared to $369,687 without weights, a difference of about 11%.
Solution: Because Illinois has many more people, it should receive more weight when computing the average.
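A minimal sketch of the difference between a simple and a population-weighted mean. The two populations and prices below are made-up numbers chosen only to give a 10× size gap; they are not the figures behind the slide.

```python
# Hypothetical figures: one large state and one state a tenth its size.
populations = [12_000_000, 1_200_000]
average_prices = [200_000, 400_000]

# Simple mean: every state counts equally, regardless of size.
simple_mean = sum(average_prices) / len(average_prices)

# Weighted mean: each state's average is weighted by its population.
weighted_mean = (
    sum(price * pop for price, pop in zip(average_prices, populations))
    / sum(populations)
)

print(f"simple mean   = {simple_mean:,.0f}")    # 300,000
print(f"weighted mean = {weighted_mean:,.0f}")  # 218,182
```

The weighted mean sits much closer to the large state's price, because most people live there.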
Pythagorean Means: (1) Arithmetic Mean
Pythagorean Means: (2) Geometric Mean
Geometric Mean: Used for growth, ratios, and percentages. Multiply all values → take the n-th root
The geometric mean is less sensitive to outliers than the arithmetic mean:
Arithmetic mean → adds values (Large numbers pull it up a lot)
Geometric mean → multiplies values (Looks at relative change, not just size)
Example: Consider two datasets of investment returns (in percentage):
Dataset A: 2%, 4%, 6%, 8%
Dataset B (with an outlier): 2%, 4%, 6%, 100%
Calculating the Mean:
Observe: In Dataset B, the outlier pulls the geometric mean up far less than the arithmetic mean, because the geometric mean multiplies values (relative change) instead of adding them.
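The calculation for both datasets can be sketched as follows. This follows the slide in applying the means to the raw percentage values; `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics

dataset_a = [2, 4, 6, 8]    # returns in percent
dataset_b = [2, 4, 6, 100]  # same, but with one outlier

arithmetic_a = statistics.mean(dataset_a)
arithmetic_b = statistics.mean(dataset_b)

# Geometric mean: n-th root of the product of the n values.
geometric_a = statistics.geometric_mean(dataset_a)
geometric_b = statistics.geometric_mean(dataset_b)

print(f"A: arithmetic = {arithmetic_a}, geometric = {geometric_a:.2f}")  # 5 and 4.43
print(f"B: arithmetic = {arithmetic_b}, geometric = {geometric_b:.2f}")  # 28 and 8.32
```

The outlier raises the arithmetic mean from 5 to 28, but the geometric mean only from about 4.43 to about 8.32.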
Pythagorean Means: (3) Harmonic Mean
A measure of central tendency used for rates, ratios, and “per-unit” data.
How it is computed: take the reciprocal of the arithmetic mean of the reciprocals of the values.
When to Use: when averaging speeds, work rates, and ratios (such as price-to-earnings or density).
Key Properties
Harmonic Means: Example (Average Speed)
A car travels equal distances at two speeds: 60 km/h and 40 km/h.
Try Arithmetic Mean = (60 + 40)/2 = 50 km/h
Incorrect: this assumes equal time at each speed, but over equal distances the car spends more time at the slower speed.
Try Geometric Mean = √(60 × 40) ≈ 48.99 km/h
Incorrect: close, but still too high.
Solution: Harmonic Mean = 2 / (1/60 + 1/40) = 2 / (2/120 + 3/120) = 2 / (5/120) = (2 × 120)/5 = 48 km/h
The harmonic mean is correct for average speed because it accounts for the time spent at each speed: slower speeds dominate total travel time and so receive more weight.
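As a check, the sketch below computes the average speed from first principles (total distance divided by total time) and compares it with the harmonic mean. The 120 km leg length is an arbitrary assumption; any equal leg length gives the same answer.

```python
import statistics

speeds = [60, 40]   # km/h, from the slide
leg_distance = 120  # km per leg; hypothetical, chosen for round numbers

total_distance = leg_distance * len(speeds)              # 240 km
total_time = sum(leg_distance / s for s in speeds)       # 2 h + 3 h = 5 h
true_average_speed = total_distance / total_time         # 48 km/h

harmonic = statistics.harmonic_mean(speeds)

print(true_average_speed)  # 48.0
print(harmonic)
```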
2. Measures of Skewness: Other Descriptors
Extreme observations distort means but not medians.
Outlying observations distort the mean. This typically happens in data such as cross sections of income or wealth, and/or when the sample is not very large.
The shape of the data
SKEWED DATA
Monthly Earnings: N = 595, Median = 800, Mean = 883
These data are skewed to the right.
The mean will exceed the median when the distribution is skewed to the right.
Skewness is in the direction of the long tail
Symmetric: Mean = Median
Left-Skewed: Mean < Median
Right-Skewed: Median < Mean
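A small sketch of the mean/median relationship under right skew. The earnings values are hypothetical (not the 595 observations behind the slide), and skewness is computed here as the average cubed z-score, which is one common definition.

```python
import statistics

# Hypothetical monthly earnings with one very high earner (long right tail).
earnings = [500, 600, 700, 800, 800, 900, 1000, 3000]

mean_value = statistics.mean(earnings)      # 1037.5
median_value = statistics.median(earnings)  # 800

# Population skewness: mean of cubed standardized deviations.
std = statistics.pstdev(earnings)
skewness = sum(((x - mean_value) / std) ** 3 for x in earnings) / len(earnings)

print(f"mean = {mean_value}, median = {median_value}")
print(f"skewness = {skewness:.2f}")  # positive means right-skewed
```

The single large value pulls the mean above the median and makes the skewness positive.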
3. Modality: Other Descriptors
Understand the number and nature of peaks in a distribution, which provides insight into the data's structure and characteristics
The shape of the data
3. Modality: number of peaks or modes in a distribution
Unimodal (one peak). Example: adult human heights, with one peak around the average height.
Bimodal (two peaks). Example: exam scores with separate peaks for high and low achievers.
Modality is useful because it reveals the underlying structure and characteristics of the data by identifying the number of distinct groups or clusters within a dataset.
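One rough way to count peaks is to look for histogram bins that are higher than both neighbours. The sketch below is a simplification under stated assumptions: the bin counts are made up, `count_peaks` is a hypothetical helper, and it ignores the first and last bins as well as flat plateaus.

```python
def count_peaks(counts):
    """Count bins strictly higher than both of their neighbours."""
    peaks = 0
    for i in range(1, len(counts) - 1):
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]:
            peaks += 1
    return peaks

# Hypothetical histogram bin counts.
heights_hist = [1, 3, 7, 12, 7, 3, 1]  # one central peak
scores_hist = [2, 8, 3, 1, 4, 9, 2]    # a low-score peak and a high-score peak

print(count_peaks(heights_hist))  # 1 (unimodal)
print(count_peaks(scores_hist))   # 2 (bimodal)
```

On real data the result depends heavily on bin width, so a plot is usually the first check.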
4. Measures of Variance
The spread of the data/ Measure of dispersion:
An important characteristic of any set of data is the variation in the data; it reflects how tightly or widely data points are distributed around the mean.
The standard deviation and variance are the most common measures of this spread.
The standard deviation σ and The Variance
Variance: Measures the average squared distance from the mean
Standard Deviation: Square root of variance
Population vs Sample
Key Idea: Larger variance or standard deviation → greater spread in the data.
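The population and sample versions differ only in the divisor: the population variance divides the summed squared deviations by n, the sample variance by n - 1. A minimal sketch with the standard `statistics` module, using made-up data:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values; mean is 5

# Population versions: divide by n.
population_variance = statistics.pvariance(data)  # 32 / 8 = 4
population_std = statistics.pstdev(data)          # sqrt(4) = 2

# Sample versions: divide by n - 1.
sample_variance = statistics.variance(data)       # 32 / 7
sample_std = statistics.stdev(data)

print(population_variance, population_std)
print(round(sample_variance, 3), round(sample_std, 3))
```

The sample values are slightly larger, and the gap shrinks as n grows.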
Further reading: https://www.shiksha.com/online-courses/articles/variance-and-standard-deviation/
Interpreting Variability with Standard Deviation
Greater Variability (a wide, flat distribution curve): Larger standard deviation.
Less Variability (tall, spike-like distribution curve): Smaller standard deviation (data points are closer to the mean).
This visual representation helps us understand the level of variability in the data.
5. Correlation and Relationships (Later Topics)
Correlation: Measures the strength and direction of the relationship between two variables.
Understanding Correlation:
Measuring Correlation: Pearson’s Correlation Coefficient (r): Ranges from -1 to 1.
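A minimal sketch of Pearson's r, computed as the covariance of the two variables divided by the product of their standard deviations. The function name and the data are hypothetical, and the function assumes both lists have the same length and neither is constant.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: summed products of deviations (n times the covariance).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of summed squared deviations.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(spread_x * spread_y)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # increases exactly with hours
falling = [10, 8, 6, 4, 2]  # decreases exactly with hours

print(pearson_r(hours, rising))   # 1.0  (perfect positive relationship)
print(pearson_r(hours, falling))  # -1.0 (perfect negative relationship)
```

Values near 0 indicate little or no linear relationship.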