Overview

Purpose

This module introduces students to the idea of related data sets and describes how to merge data sets into one large data collection. Real data sets are analyzed in depth to demonstrate the benefits of building a nuanced, complete picture of the world.

Lessons

Data sets are related when they describe some common feature, topic, idea, or metric. They can be related in more than one way, and they do not need a complete overlap of variables or data points to be related.
Data sets with different variables can be merged by adding more columns (one per variable), and data sets describing different data points can be merged by adding more rows (one per data point).
Data sets don’t always merge cleanly and may contain missing data, differing units, conflicting data, and more. There is no single right way to resolve these issues, but there are questions you can consider to make an informed decision.
Given that data sets can be merged, we should stop thinking of data as separate “sets” and instead think of it as one collection of information that can be continually expanded with more data.
Merging data has several benefits, such as improved organization, discrepancy checking, and, most importantly, more nuanced analysis.
Using a variety of data sets and expanding your data collection gives you a more complete picture of the world and can prevent one-dimensional, surface-level analysis.

Introduction to Relating Data “Sets”

What Are Related Data Sets?

Examples of Related Data Sets

What About “Unrelated” Data Sets?

Exercise: Finding Relationships

How Do We Combine Related Data Sets?

Visualizing Data Sets

Examples of Tabular Data

Basic Data Combinations

Adding Rows

Adding Columns

More Complicated Combinations

Exercise: Find the Combination

Analyzing Combined Data

Why Do We Combine Related Data Sets?

Case Studies

Blood Pressure vs. Age

Simpson’s Paradox

Global Covid Data

Which Country is the “Best”?

Exercise: Exploring New Places

Exercise: Where to Live?

Introduction to Relating Data “Sets”

What Are Related Data Sets?

Data sets are related when they describe some common feature, topic, idea, or metric. Although this definition may feel very open-ended, it is purposely vague because much of the data available in our world today is connected by some common thread.

This connection could be something as simple as “both data sets contain data for the year 2018” or as specific as “all three data sets measure child obesity in Shenzhen, China for the month of June 2020.” In this module, we will focus on data sets that involve the same location(s) or overlap in time periods for the sake of simplicity.

It is also worth noting that data sets do not need to have complete overlap to be related. For example, if data set A has data from 2018 to 2020 and data set B has data from 2019 to 2022, they are still related since they contain data for the years 2019 and 2020.
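
The overlap check in this example can be sketched in a few lines of Python. This is only a minimal illustration: the names `years_a` and `years_b` are hypothetical stand-ins for the years covered by data set A and data set B.

```python
# Years covered by each data set, using the ranges from the example above.
years_a = set(range(2018, 2021))  # 2018 to 2020 inclusive
years_b = set(range(2019, 2023))  # 2019 to 2022 inclusive

# The intersection holds the years that appear in both data sets.
shared_years = sorted(years_a & years_b)

# For this sketch, any shared year counts as a relationship.
are_related = len(shared_years) > 0

print(shared_years)  # [2019, 2020]
print(are_related)   # True
```

Neither data set contains the other, yet the two shared years are enough to relate them.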

Examples of Related Data Sets

The following examples introduce pairs of data sets and some reasons why they are related.

1. “US Malaria Cases Data” and “US Chickenpox Cases Data”

The first data set contains the number of malaria cases year by year for the United States. The second data set contains the number of chickenpox cases year by year for the United States.
They are related because both describe yearly disease case counts for the same country.

Quarter	US Consumer Price Index
2018-01	248
2018-04	251
2018-06	252
2018-11	252

State	State Capital	Population	Median Income ($)
New York	Albany	19.3M	25.4K
California	Sacramento	39.4M	33.7K
Washington	Olympia	7.69M	37.7K
Alabama	Montgomery	4.92M	27K

State	Population: Business Major	Population: Education Major
Florida	1.04M	585K
Texas	1.22M	653K

State	Population: Science and Engineering Major
Florida	1.35M
Texas	1.82M
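
Because the two tables above share the same rows (Florida and Texas) but describe different variables, they can be combined by adding columns. The following Python sketch shows one way to do that under simple assumptions: each table is stored as a dictionary keyed by state, the populations are written out as whole numbers, and the helper `add_columns` and the shortened column names are hypothetical names chosen for this illustration.

```python
# Values copied from the two tables above, written as whole numbers.
# Column names are shortened for readability (an assumption of this sketch).
majors_a = {
    "Florida": {"Business": 1_040_000, "Education": 585_000},
    "Texas": {"Business": 1_220_000, "Education": 653_000},
}
majors_b = {
    "Florida": {"Science and Engineering": 1_350_000},
    "Texas": {"Science and Engineering": 1_820_000},
}

def add_columns(left, right):
    """Combine two tables keyed by the same row label by adding columns."""
    merged = {}
    for state in sorted(left.keys() | right.keys()):
        # Rows missing from one table simply lack that table's columns.
        merged[state] = {**left.get(state, {}), **right.get(state, {})}
    return merged

merged = add_columns(majors_a, majors_b)
for state, row in merged.items():
    print(state, row)
```

Each state ends up with three columns. A state that appeared in only one table would keep only that table's columns, which is one way missing data can arise after a merge.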

County	State	Median Income ($)
New Castle	Delaware	37,000
Kent	Delaware	31,100
Sussex	Delaware	32,700

State	Median Income (¢)	Population
Delaware	3,400,000	1,000,000
Nevada	3,190,000	3,140,000
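
The county table reports median income in dollars, while the state table above reports it in cents. Before the two can be compared or merged, the values need a common unit. A minimal Python sketch of that conversion follows, assuming 100 cents per dollar; the variable names are hypothetical.

```python
CENTS_PER_DOLLAR = 100

# Median income in cents, copied from the state table above.
state_income_cents = {"Delaware": 3_400_000, "Nevada": 3_190_000}

# Convert to dollars so the values match the unit used in the county table.
state_income_dollars = {
    state: cents // CENTS_PER_DOLLAR
    for state, cents in state_income_cents.items()
}

print(state_income_dollars)  # {'Delaware': 34000, 'Nevada': 31900}
```

Once converted, Delaware's state value of $34,000 sits between its lowest and highest county values ($31,100 and $37,000).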

Rank	GDP	GDP per capita	Gini Index	Life Expectancy	Crime
1	United States	United States	France	France	China
2	China	France	India	United States	France
3	India	China	United States	China	India
4	France	Brazil	China	Brazil	United States
5	Brazil	India	Brazil	India	Brazil