Relating Data "Sets"
Combining Data Collections & Exploring Nuanced Data Analysis
This module introduces students to the idea of related data sets and describes how to merge data sets into one large data collection. Real data sets are analyzed in depth to demonstrate the benefits of building a nuanced, complete picture of the world.
Introduction to Relating Data “Sets”
What About “Unrelated” Data Sets?
Exercise: Finding Relationships
How Do We Combine Related Data Sets?
Exercise: Find the Combination
Why Do We Combine Related Data Sets?
Exercise: Exploring New Places
Data sets are related when they describe some common feature, topic, idea, or metric. Although this definition may feel very open-ended, it is purposely vague because much of the data available in our world today is connected by some common thread.
This connection could be something as simple as “both data sets contain data for the year 2018” or as specific as “all three data sets measure child obesity in Shenzhen, China for the month of June 2020.” In this module, we will focus on data sets that involve the same location(s) or overlap in time periods for the sake of simplicity.
It is also worth noting that data sets do not need to have complete overlap to be related. For example, if data set A has data from 2018 to 2020 and data set B has data from 2019 to 2022, they are still related since they contain data for the years 2019 and 2020.
The following examples introduce pairs of data sets and some reasons why they are related.
1. “US Malaria Cases Data” and “US Chickenpox Cases Data”
The first data set contains the number of malaria cases year by year for the United States. The second data set contains the number of chickenpox cases year by year for the United States.
[Source]
Similarities
2. “US Annual Inflation Rate” and “China Annual Inflation Rate”
The first data set measures the annual inflation rate for the US using the Consumer Price Index and the second data set measures the annual inflation rate for China using the same index.
[Source]
[Source]
Similarities
3. “Music Publishers in California” and “Median Income in California”
The first data set measures the number of music publishers in California and the second data set measures the median income in California.
[Source]
[Source]
Similarities:
4. “Greenhouse Gas Emissions in California (2010 - 2019)” and “Greenhouse Gas Emissions in New York (2010 - 2019)”
The first data set measures greenhouse gas emissions in California between 2010 and 2019 and the second data set measures the same variable for New York over the same time period.
[Source]
Similarities:
Given how broad the definition of a related data set is, you might wonder exactly how different two data sets have to be in order to be “unrelated.” In reality, it is difficult to define when data sets are unrelated because almost every pair of data sets has some theoretical similarity.
Therefore, it is best to define “unrelated” data sets within the context in which you are using the data. For example, if you are trying to analyze health trends in California, then income data for Russia is likely contextually unrelated. Although the two data sets may be related, those similarities are irrelevant to your end goal.
In summary, data sets are related when they describe some common feature, topic, idea, or metric. They can be related by several similarities and do not require a complete overlap of variables or data points to be related. Sometimes, it helps to narrow the scope of what counts as “related” depending on the context.
Read the descriptions of the data sets listed below. Write down at least three pairs of related data sets and describe what makes them related.
Data Set A: A data set that measures the median age in the US from 2011 and 2019 (source).
Data Set B: A data set that measures the life expectancy for France, Italy, and the US from 1960 and 2019 (source).
Data Set C: A data set that measures the number of people who have attained a bachelors degree or higher in each state of the US for the year 2020 (source).
Data Set D: A data set that measures the number of criminal activities in each state of the US for the year 2019 (source).
Data Set E: A data set that measures gender income inequality in 2018 and the median age in 2020 for each county in Nebraska (source).
Oftentimes, when data sets are related, it is useful to combine them into one big data set to perform analysis on (we will discuss what kind of analysis is possible in a later section). This section will discuss how to combine various types of related data sets. Once we learn how data can be merged, we should stop thinking of data in terms of “sets” but instead as one collection of information that you can continually expand with more data.
Most data sets can be visualized tabularly, i.e. each column represents a variable that is being measured and each row represents a data point, where each value in that row all describes a common feature like location or time.
The following tables provide examples of what data visualized tabularly might look like, and include captions that explain what each row or column represents.
US Inflation Rate
Quarter | US Consumer Price Index |
2018-01 | 248 |
2018-04 | 251 |
2018-06 | 252 |
2018-11 | 252 |
Table 1: A simple example where each row represents a quarter in 2018 and the variables are “quarter” and “US Consumer Price Index.”
[Source]
State Demographics (2020)
State | State Capital | Population | Median Income ($) |
New York | Albany | 19.3M | 25.4K |
California | Sacramento | 39.4M | 33.7K |
Washington | Olympia | 7.69M | 37.7K |
Alabama | Montgomery | 4.92M | 27K |
Table 2: A more complex example where each row represents a state and the variables are “state”, “state capital”, “population”, and “median income” for the year 2020.
[Source]
While there are also other ways to visualize data (see Module 4 for an overview), tables allow us to easily determine how to combine data sets, as we’ll get into in the next subsection.
When combining data sets, there are usually two main approaches depending on the structure of the data: adding columns or adding rows.
When data sets have the same variables, you can simply add the second data set as additional rows to the table. For example, the following data sets both have the same four variables (state, state capital, population, and median income) for the same year.
Data Set A: State Demographics (2020)
State | State Capital | Population | Median Income ($) |
New York | Albany | 19.3M | 25.4K |
California | Sacramento | 39.4M | 33.7K |
Washington | Olympia | 7.69M | 37.7K |
Alabama | Montgomery | 4.92M | 27K |
[Source]
Data Set B: State Demographics (2020)
State | State Capital | Population | Median Income ($) |
Florida | Tallahassee | 19.3M | 29.2K |
Texas | Austin | 39.4M | 31.5K |
[Source]
Combined Data Set: State Demographics (2020)
State | State Capital | Population | Median Income ($) |
New York | Albany | 19.3M | 25.4K |
California | Sacramento | 39.4M | 33.7K |
Washington | Olympia | 7.69M | 37.7K |
Alabama | Montgomery | 4.92M | 27K |
Florida | Tallahassee | 19.3M | 29.2K |
Texas | Austin | 39.4M | 31.5K |
When the data sets don’t have the same variables but have data points that represent the same location, topic, time, or some other unique variable value, you can combine data sets by adding columns for the variables that are unique to the second data set. Then, you fill in the missing values as appropriate.
Data Set A: State Education (2018)
State | Population: Business Major | Population: Education Major |
Florida | 1.04M | 585k |
Texas | 1.22M | 653K |
[Source]
Data Set B: State Education (2018)
State | Population: Science and Engineering Major |
Florida | 1.35M |
Texas | 1.82M |
[Source]
Combined Data Set: State Education (2018)
State | Population: Business Major | Population: Education Major | Population: Science and Engineering Major |
Florida | 1.04M | 585k | 1.35M |
Texas | 1.22M | 653K | 1.82M |
In summary, data sets with different variables can be merged by adding more columns (one per variable) and data sets describing different data points can be merged by adding more rows (one per data point).
Sometimes, data sets don’t merge cleanly. For example, you might run into any of the following problems:
In these cases, you need to evaluate your priorities and make a decision on what data to include or exclude based on your use case and expectations. Here are some questions to consider:
Data sets don’t always merge cleanly and may contain missing data, differing units, conflicting data, and more. There is no right answer when it comes to how to resolve these issues but there are questions you can consider to make an informed decision.
Clearly, data combination can be a daunting task especially since many real world data sets are not as complete, consistent, and structured as the data sets we’ve seen so far. There are many efforts to compile these data sets into one large collection such as Data Commons, which already contains data from a wide variety of sources.
Data Set 1
County | State | Median Income ($) |
New Castle | Delaware | 37,000 |
Kent | Delaware | 31,100 |
Sussex | Delaware | 32,700 |
[Source]
Data Set 2: This data set was collected by the national census bureau using an online survey that reached millions of citizens. It contains state-level data on median income and the population.
Data Set 2
State | Median Income (¢) | Population |
Delaware | 3,400,000 | 1,000,000 |
Nevada | 3,190,000 | 3,140,000 |
Combined Data Set
As mentioned above, combined data sets can be analyzed in the same way we analyze regular data sets. However, by combining them, we unlock certain benefits:
Perhaps the most critical benefit is the ability to perform more nuanced analysis. When we are limited to one data set, the analysis performed often lacks information about the greater context and by combining various data sets, we can gather a more complete picture. We will focus on this idea for the rest of this section.
In general, merging data has several benefits such as improved organization, discrepancy checking, and, most importantly, more nuanced analysis.
Oftentimes, when we have few data points or few variables, it can be easy to draw conclusions or spot patterns that don’t actually exist. The following case studies walks through how adding more data allows us to draw increasingly informed conclusions.
The following graph plots a data set that contains data about the median age and high blood pressure prevalence of counties in Mississippi and Vermont. The blue line plots the line of best fit and we can see that the data seems to imply that age is negatively correlated with blood pressure prevalence (i.e. the older you are, the lower your chance of having high blood pressure).
However, once we add in a variable that accounts for the state of each county, we can see that within each state there is a positive correlation between the two variables.
When we explore why the original data to figure out why our data initially seemed to suggest that older people were less at risk of high blood pressure, we can see that Vermont is generally older and healthier than Mississippi so the data appears to trend downward. By adding in the state variable, we are able to separate out differences between states to better isolate the relationship between age and high blood pressure. This is a great example of Simpson’s Paradox, where a trend can disappear or reverse when combined with data from other groups
The following graph depicts the cumulative count of Covid cases for Brazil, China, France, India, and the United States:
[Source]
We can observe from this graph the following facts: (1) Brazil and France appear to have roughly the same number of cases, and (2) the United States has the most cases but is currently only a factor of 2 greater than India.
These observations might lead us to believe that the severity of the Covid-19 pandemic is the worst in the United States and the ranking by Covid-19 severity for these five countries, from most severe to least, would be the United States, India, Brazil, France, and finally China.
However, if we gather data on the population of each country, we would observe that the ranking of countries by population would be, from most to least, China, India, the United States, Brazil, and finally France.
[Source]
If we combine these data sets, we can see that France has significantly more cases per capita (i.e. cases per person) than any other country. The following graph shows the cumulative count of Covid-19 cases per capita. Note how the ranking of the countries now changes to France, the United States, Brazil, India, and finally China. Additionally, we can see that the cumulative count per capita in the United States is roughly 10 times worse than in India, instead of the factor of 2 we observed before.
[Source]
Just by combining two data sets, we were able to better inform our view of the Covid-19 pandemic and prevent ourselves from making misleading claims about which country has higher prevalence of Covid-19. However, there are several other potential explanations for this data that we have not yet explored. For example, medical reporting infrastructure might be better in wealthier countries like France and the United States than in developing countries like India. This might lead to more unreported cases of Covid-19 in India and therefore the data that we have might not truly reflect the state of the world.
It is also important, once you have observed Covid-19 prevalence or severity, to not draw the wrong conclusions from this data. You may have several hypotheses for why some countries are harder hit by Covid-19 than others, but you should base your conclusion on data exploring these hypotheses.
For example, if you believe that some countries had fewer Covid-19 cases because they had higher rates of vaccination, you could combine a vaccination rate data set into your collection to observe if the trend is true. In this case, we can see that the countries with the highest vaccination rates (i.e. number of vaccines administered per capita) are China, France, Brazil, the United States, and finally, India.
[Source]
Therefore, it seems as if our hypothesis would not hold true since France has the second highest vaccination rate while also having the highest prevalence of Covid-19. However, there are variants of Covid-19 that did not have corresponding vaccines at the time of data collection. Therefore, it is possible that France is mainly contracting the variant that people are unprotected by. If we could obtain data on the counts separated by the variant of Covid-19 that was vaccinated against or contracted, we could maybe spot an inverse relationship between vaccination rate and covid prevalence.
In general, public health is an incredibly nuanced field and combining data sets will allow you to approach the true state of the world and make more informed observations.
When it comes to comparing countries, there are a handful of metrics that are commonly used to determine which country is the “best.” One such metric is gross domestic product (GDP), which is a popular measure of the size of a country’s economy. The following graph shows the GDP of a handful of countries:
[Source]
We can see ranking the countries by highest to lowest GDP would give us the United States, China, India, France, and finally Brazil. We can perform the same per capita calculation that we did in the previous case study by adding in population data. This gives us a new ranking of the United States, France, China, Brazil, and finally India.
[Source]
However, GDP is only a measure of the size of the overall economy and does not tell us what the distribution of wealth or income looks like within a country. For this, we could turn to the Gini Index, which measures income inequality where a higher value means more inequality. Under this new metric, the country rankings would be France, India, the United States, China, and finally Brazil (for the year 2010 since data for India does not exist past 2011).
[Source]
Alternatively, some would argue that economic status should not define what country is best, and we should instead consider the health and well-being of its residents. If we examine life expectancy data (i.e. the average lifespan of a resident), we would see the rankings shift to France, the United States, China, Brazil, and finally India.
[Source]
Finally, we can look at public safety and crime data as another facet of society. Focusing on just the per capita count of homicide and manslaughter, we can see that the new ranking would be China, France, India, the United States, and finally Brazil.
[Source]
The following table summarizes how the rankings would change depending on the metric:
GDP | GDP per capita | Gini Index | Life Expectancy | Crime | |
1 | United States | United States | France | France | China |
2 | China | France | India | United States | France |
3 | India | China | United States | China | India |
4 | France | Brazil | China | Brazil | United States |
5 | Brazil | India | Brazil | India | Brazil |
As we can see, there is no consistent ranking across each of these variables. Attempting to evaluate each country using just one of these metrics would provide a reductionist picture of the world and it takes several variables from a variety of data sets to even begin to understand the whole picture.
Additionally, some data sets may be misleading without the presence of additional explanatory variables. For example, trying to rank the countries by prevalence of depression would lead to many developed countries ranking lower than small developing nations. However, large, developed nations often have better mental health infrastructure and thus are more likely to diagnose depression than smaller or less developed countries.
Similarly, places like the United States would be ranked low if we used hate crimes as the main metric but it is also worth noting that the United States has greater diversity of race, religion, etc. than many countries with lower hate crime rates. By adding in diversity data, we can explain certain trends and factor that into our analysis.
In summary, using a variety of data sets and expanding your data collection gives you a more complete picture of the world and can prevent one-dimensional, surface-level analysis.
[Source]
If no such panel shows up, pick another location and try again until you see this graph with the “Explore more” button at the bottom.
Imagine you are moving to New York City and you have to choose a neighborhood to live in. Unfortunately, you know nothing about the city and need to do some research to make your choice.
An example of a neighborhood summary site that combines several data sets and presents summaries of this data for its users. This picture shows data for Chelsea, NY.
[Source]
Use one of these neighborhood summary sites (e.g. Zillow, Redfin, Good Migrations, etc.) and try to find the best 5 neighborhoods in New York using the original 5 concerns you listed in step (1). Discuss how this process was different from steps (2) and (3), and how the sites made it easier (or harder!) to form this ranking.