Reshaping data
Google Sheets
Apps Script
Google Colab
A guide by Huyen Nguyen, Kansas State University
For my slides:
https://bit.ly/48T57bz
Intro
Reshaping data is considered a part of analyzing and visualizing data. Basically, you will process your data for a specific type of data visualization.
Two approaches:
1. Convert data from a wide to a long format, or spread data (Unpivot Table)
2. Convert data from a long to a wide format, or summarize data (Pivot Table)
Normally, we reshape data from wide to long format when we have multiple columns, variables or measures that we want to consolidate or plot together.
1. Wide to long
Source: Unemployment Data / Bureau of Labor Statistics
Wide table
Long table
Line chart
Dataset #1 from BLS – Job Seekers
In some situations, you will need to collapse your data from multiple columns into just one column, for example, if you have to process historical unemployment data from the Bureau of Labor Statistics, to build a line chart.
The website is a bit hard to navigate around. You want to locate Historical 'A' Tables (Household data): Previous years and months. Select Tables.
Locate Table A-11. Unemployed persons by reason for unemployment
Under UNEMPLOYED AS A PERCENT OF THE CIVILIAN LABOR FORCE, check all boxes on the second column Seasonally Adjusted.
Click Retrieve data.
Create a new spreadsheet on Google Sheets
Sign up/Launch Google Sheets, a spreadsheet tool to handle data in rows and columns
Make a copy of my Google Sheet here. File > Make a copy
Look at the first sheet Job seekers.
We are going to combine the months and the years in one single column and collapse all relevant values (in percent of unemployment ) in another column, in order to build a line chart.
We will look at the reasons in Job losers and persons who completed temporary jobs, or the first data table.
Google Sheets Function – Flatten()
The =FLATTEN() function helps consolidate all the values from one or more data ranges into a single column.
Let's create a new sheet. Name it Job seekers_wtl
Create two new columns starting from cell A1. Name them Time and Percent, respectively.
In cell A2, type a formula to flatten the rows and columns into one single column.
=FLATTEN('Job seekers'!B15:M15&'Job seekers'!A16:A73)
Google Sheets Function – ARRAYFORMULA()
To hold the range of consolidated values returned from the FLATTEN() function, we need to use the =ARRAYFORMULA() function because it enables the display of values into multiple rows and/or columns.
So, we have this formula:
=ARRAYFORMULA(FLATTEN('Job seekers'!A16:A73&'Job seekers'!B15:M15))
But if we just use this formula, the year and the month will stick together, so let's modify the FLATTEN formula by adding a vertical bar between the two ranges. Note: You can use any delimiters, like a space or a comma.
FLATTEN('Job seekers'!B15:M15&"| "&'Job seekers'!A16:A73)
Or, we will have this:
=ARRAYFORMULA(FLATTEN('Job seekers'!B15:M15&" "&'Job seekers'!A16:A73))
Do It Again – FLATTEN()
Now, we just need to collapse all percentage values into the second column using the same =FLATTEN() function. In cell B2, type or select the relevant data range on the Job seekers sheet.
=FLATTEN('Job seekers'!B16:M73)
This time, you don’t need to wrap the function inside the =ARRAYFORMULA() function because you just flatten one dimension, not using the &.
That's how you flip data from wide to long format in most cases. Voila! We're done!
Dataset #2 from DHS – Green card
Lawful permanent residents (LPRs) are normally referred to as “green card” holders, or foreigners who have long-term working permits. The U.S. Department of Homeland Security kept track of these people through a multi-table database. Table 4 contains data classified by state or territory of residence from 2013 to 2022.
The Census Bureau estimated the US population based on its survey and published data on this site.
Using =ARRAYFORMULA() and FLATTEN() functions, please flip lawful permanent residents data from the wide to the long format.
Hint: On a new sheet, you need to create three new columns: State or territory, Year, and Persons.
The case of 3 columns
With the first dataset, you need to flip a wide table to just a 2-column long table. But in this dataset, you need to flip the wide table to a 3-column long table: State or territory, Year, and Persons.
Unfortunately, using the formulas above can only return and store consolidated values in a single column.
If you need to flip and spread data into 2 columns, you need to use the SPLIT() function
Google Sheets Function – SPLIT()
Let’s say you wrote this formula:
=ARRAYFORMULA(FLATTEN('Greencard holders'!A5:A60& "|" & 'Greencard holders'!B4:K4))
If you need to split the newly created column into two columns, you can use the =SPLIT() function and wrap it around the =FLATTEN() function, as below.
SPLIT(FLATTEN('Greencard holders'!A5:A60& "| " & 'Greencard holders'!B4:K4), "| ")
Just make sure you wrap the above function inside the =ARRAYFORMULA() function as below.
=ARRAYFORMULA(SPLIT(FLATTEN('Greencard holders'!A5:A60& "|" & 'Greencard holders'!B4:K4), " |"))
* You can also go to Data > Split text to columns after selecting the newly flipped column.
Normally, we create a pivot table when we have data in a long format with multiple categories or groups in one column and we want to see the summary statistics for each category/group.
2. Long to wide
Source: Weekly United States COVID-19 Cases and Deaths/ Centers for Disease Control and Prevention
Wide table
Long table
Racing bar chart
Dataset #3 from CDC – COVID-19
Since the start of the COVID-19 pandemic, COVID-19 data have been gathered by the CDC. Reporting of new Aggregate Case and Death Count data was discontinued May 11, 2023, with the expiration of the COVID-19 public health emergency declaration.
The data can be accessed through this portal. The datatable was structured in a long format. If you want to create a racing bar chart, you will need to flip the data from a long to a wide format.
The task can be done easily by creating a Pivot table, a feature popular for data summary and drill-down.
Use Pivot Table
First, select the long table (the one with year and month separately) by placing your cursor inside it, then hold Cmd/Ctrl-A.
Next, go to Insert > Pivot table > New sheet. Name the new sheet COVID-19_ltw
Under Rows, add state.
Under Columns, add data_updated.
Under Values, add tot_deaths.
Uncheck Show Totals
Dataset #4 from the GCP - CO2
The Washington Post published a racing line chart on March 1, 2023. Harry Stevens, the reporter and designer, provides his data source and codebook at this link.
Estimates for every country’s carbon dioxide emissions since 1750 was gathered from The Global Carbon Project. The data was last updated on Oct. 17, 2022.
The data can be accessed through this portal. The datatable was also structured in a long format. If you want to create a racing line chart, you will need to flip the data from a long to a wide format.
This dataset has 3 ID columns, instead of one as Dataset #3.
The case of multiple ID columns
ID columns refer to the columns in the long table that you want to keep and not spread in the flipped table.
Using a Pivot table in Google Sheets may be the fastest and most efficient way to flip data from a long to a wide format. It works even when you have multiple ID columns in your datatable.
Use Pivot Table
First, select the long table (the one with year and month separately) by placing your cursor inside it, then hold Cmd/Ctrl-A.
Next, go to Insert > Pivot table > New sheet. Name the new sheet CO2_ltw
Under Filters > Year > Filter by condition > Greater than or equal to 1900. You will notice that Google Sheets can’t flip data since 1750.
Under Rows, add all 3 ID columns: Country, ISO 3166-1, and UN M49.
Under Columns, add Year.
Under Values, add Total.
Uncheck Show Totals
App Scripts
What’s for?
Apps Script can be used to create custom functions interacting with your Google Sheet files and the data they contain.
To write the scripts, you need to first go to Extensions > Apps Script and add the custom scripts.
To learn how to write the scripts, check out this guide.
In this session, we will adopt and revise the free scripts written by Robin Gertenbach, a data scientist working for Google. Using scripts is a faster way to reshape data. These scripts will generate two custom functions melt(), to reshape data from wide to long, and cast(), to reshape data from long to wide.
You just need to set up the scripts for one Google Sheets template, and you can make a copy and reuse it for multiple reshaping tasks.
How to set up the scripts?
Please create a new spreadsheet. Go to Extensions > Apps Script and add the custom scripts. You need to click on the Plus sign to add a new script or a new HTML page.
First, please copy and paste all Robin’s code inside Auxilliaries.gs, Cast.gs, Front End.gs, and Melt.gs into Apps Script.
Make sure you modified the function nativeType as in the sidebar.
Then, add an HTML page. Make sure you paste Robin’s LICENSE text between the <body> </body> HTML tags.
Save them all.
If you are successful, you can start using the cast() and melt() functions in Google Sheets.
For your convenience, I have the scripts ready here for you to make a copy.
* Casts a string to a number of possible, otherwise returns the string.
*/
function nativeType(x) {
if (!isNaN(Number(x))) {
return Number(x);
}
// Check if the string represents a date in the format YYYY-MM-DD
var dateRegex = /^\d{4}-\d{2}-\d{2}$/;
if (dateRegex.test(x)) {
return x;
}
return x;
}
How to use custom functions?
WIDE TO LONG
Type the MELT() function:
=MELT(<Table>, *<ID Columns>*, <Measure Column Name>, <Value Column Name>, <Blanks Behavior>)
Table is the range of the wide table. �ID Columns are the range of the column headers that are not pivoted.�Measure Column Name (optional) contains the names of the molten columns.�Value Column Name (optional) contains the molten values.�Blanks Behavior (optional) controls how the melt function handles blank columns and or rows. No Removal is default.
LONG TO WIDE
Type the CAST() function:
=CAST(<Table>, <Measure Column>, <Value Column>, <Default Value>)
Table is the range of the long table.�Measure Column is the column that will be pivoted into multiple columns.�Value Column is the column that contains the actual measurement that will be populated into a two dimensional matrix.�Default Value The default value is equal to a blank cell.
Wide to Long
Let’s reuse the first dataset about job seekers.
You just need to create a new sheet and type:
=MELT('Job seekers'!A15:M73,'Job seekers'!A15, "Month", "Percent")
For the second dataset about green card holders, you can type:
=MELT('Greencard holders'!A4:K60, 'Greencard holders'!A4, "Year", "Persons")
Long to wide
If you want to flip the first dataset from a long back to a wide format, you can type:
=CAST(Jobseekers_2!A1:C697, Jobseekers_2!B1, Jobseekers_2!C1)
�If you want to flip the second dataset from a long back to a wide format, you can type:
=CAST('GCH 2_wtl'!A1:C561,'GCH 2_wtl'!B1,'GCH 2_wtl'!C1)
*The Apps Script we added to Google Sheets can’t flip long tables of the third and fourth datasets because both are too large for the script to handle.
Google Colab
What’s for?
To deal with large datasets, you need a more powerful platform or a programming language such as Python and R outside of Google Sheets.
A convenient platform for you to write Python/R code to reshape data and write the output back to Google Sheets is Google Colab.
To learn more about Colab, check out this introduction.
For this training, please make a copy of my Colab notebook.
How to reshape data in Colab?
First, you need to upload your data to Google Drive, create a new spreadsheet in Google Sheets, or upload your data to Colab directly (if datasets are too large, and then you have to use a different method to read data into a dataframe).
Then, connect Google Colab to Google Sheets using the gspread package.
Third, import the pandas package and use the pivot() function to flip data from a long to a wide format. To flip data from a wide to a long format, use the melt() function.
Fourth, write the new dataframe back to your Google Sheets.
Let’s get started!
Dataset #3 from CDC – COVID-19
Since the start of the COVID-19 pandemic, COVID-19 data have been gathered by the CDC. Reporting of new Aggregate Case and Death Count data was discontinued May 11, 2023, with the expiration of the COVID-19 public health emergency declaration.
We will reuse this dataset.
To keep it simple, you can make a copy of my spreadsheet. The datatable was first structured in a long format. You will flip it in a wide format and reverse the process.
Dataset #4 from the GCP - CO2
The Washington Post published a racing line chart on March 1, 2023. Harry Stevens, the reporter and designer, provides his data source and codebook at this link.
Estimates for every country’s carbon dioxide emissions since 1750 was gathered from The Global Carbon Project. The data was last updated on Oct. 17, 2022.
The data can be accessed through this portal. The datatable was also structured in a long format. If you want to create a racing line chart, you will need to flip the data from a long to a wide format.
Try using the pivot() function, but make sure you keep the ID columns or the index columns in Python.
wide_df = df.pivot(index=['Country', 'ISO 3166-1 alpha-3','UN M49'], columns='Year', values='Total')
Dataset #5 from DW - Air Quality
Deutsche Welle (DW) published an article about air pollution in Europe, in collaboration with the European Data Journalism Network.
The data can be accessed through this GitHub account. DW processed data from the Copernicus Atmospheric Monitoring Service (CAMS).
There are several Excel files on the GitHub account.
You can try flipping data in CAMS-Europe-Renalaysis-Countries-Yearly-2018-2022.xlsx file, which has a wide format, and CAMS-Europe-Forecast-Daily-2023.xlsx, which has a long format.
Check out the short answer here.
I hope these tips are useful for your data processing projects.
If you have any questions, please feel free to contact: huyenme@ksu.edu
Thank you for your attention!