Data Preparation I: Structural
October 3, 2023
Data 101, Fall 2023 @ UC Berkeley
Lisa Yan https://fa23.data101.org/
1
LECTURE 12
Join at slido.com #12345679
Data Preparation
Data preparation is a very broad subject, covering everything from data models to statistical assessments of data to string algorithms to scalable data processing. In some sense, most of Data Engineering---most of data science!---boils down to Data Preparation.
This week:
3
Last time, we looked at different model paradigms for rectangular data. This is a key component of Structural Data Transformations.
[Review] PIVOT: Use a DataFrame to transform Relation → Matrix
4
Currently, our data is tidy since we used df.melt() [docs melt].
How do we pivot to get back the original (dense) matrix? [pandas docs pivot]
A. mm_melted.pivot(index='variable', columns='Year')
B. mm_melted.pivot(index='Year', columns='variable')
🤔
Exercise
Demo
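As a sketch of the round trip, with toy data and illustrative column names (not the lecture's actual files):

```python
import pandas as pd

# A toy round trip: melt a wide table into tidy form, then pivot back.
# Data and column names here are illustrative, not the lecture's dataset.
mm = pd.DataFrame({"Year": [2021, 2022],
                   "Jan": [4.5, 3.1],
                   "Feb": [2.2, 5.0]})

# melt: wide -> tidy, producing columns Year, variable, value
mm_melted = mm.melt(id_vars="Year")

# pivot: tidy -> wide, one row per Year, one column per original variable
mm_wide = mm_melted.pivot(index="Year", columns="variable", values="value")
```

With this toy setup, the melted frame has columns Year/variable/value, so pivoting with index='Year' and columns='variable' restores the wide matrix.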
Summary so far
We’ve covered three rectangular data models: matrix, relation, and dataframe.
Transposing, pivoting, and unpivoting are very common operations that change the model paradigm and enable:
7
Data Unboxing
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
8
Lecture 12, Data 101 Fall 2023
Common file categories
Record per line: newline-delimited rows of uniform, symbol-delimited data
Dictionaries/Objects: explicit key:value pairs, may be nested! Two common cases:
Unions: a mixture of rows from k distinct schemas. Two common cases:
Natural Language (prose): natural language intended for human consumption, e.g., Project Gutenberg
Everything else: There is a long tail of file formats in the world. If they're not readable as text, there's likely a commercial or open source tool to translate them to a readable text format.
9
For the rest of this class, we will focus on the common case of text that can be read and written easily in tools like Jupyter, Python, PostgreSQL, etc.
One final note in the wild: text formats themselves are a complex issue.
Command-line tools for assessing structure
Suppose you have a file named fname:
These are command-line tools!
11
Demo
Start by running du -h and head on some files, or have a peek in your Finder/Explorer and favorite (scale-savvy) text editor.
12
Demo
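If a shell isn't handy, the same first-look checks can be sketched in Python (the helper name and defaults here are ours, not a standard tool):

```python
import os

# A rough Python equivalent of `du` and `head` for a first look at a file.
def peek(path, n_bytes=1024, n_lines=5):
    size = os.path.getsize(path)              # file size in bytes (cf. du)
    with open(path, "rb") as f:               # read bytes, in case it's not text
        head = f.read(n_bytes)                # cf. head -c 1024
    lines = head.decode("utf-8", errors="replace").splitlines()[:n_lines]
    return size, lines
```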
Exploration, Part 1
What category of data is each file? Any observations about the data?
13
🤔
A. Record per line
B. Dictionaries/Objects
C. Unions of distinct schemas
D. Natural Language
E. Something else
head -c 1024 data/jq2.txt
Demo
Exploration, Part 1
What category of data is each file? Any observations about the data?
15
Note: this process is a form of data visualization!
In short, this text-based, lean data visualization gives you only limited information.
Demo
[On your own] Data Unboxing, Part II
16
Demo
Data Unboxing, Part II
[Bonus] For a richer data visualization: Trifacta, aka Google Cloud Dataprep.
Next up: More transformations!
17
Demo
Structural Data Transformations
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
18
Up next: Various data transformations
19
Different types of data transformation; we’ll cover a few today.
Structural Transformations
Even within rectangular data, there are several structural transformations we can do. From earlier:
Next, we work with a scrape of rainfall data from the website of the National Oceanic and Atmospheric Administration (NOAA).
This was compiled by visiting each annual website of monthly precipitation by regions in the California Nevada River Forecast Center (CNRFC) area.
20
Demo
CNRFC-NOAA Rainfall Data Unboxing: Structural Analysis
A scrape of rainfall data from the CNRFC-NOAA website.
1. mm.txt 2. mmp.txt 3. mmr.txt 4. mpf.txt
21
🤔
A. Matrix (wide)
B. Pivot (wide)
C. Tidy (long)
D. Union of models
E. Something else
How would you characterize these four files, structurally?
A, A, A, A
B, B, B, B
A, A, C, C
A, B, C, C
A, B, C, D
A, B, C, E
A, E, C, E
Demo
CNRFC-NOAA Rainfall Data Unboxing - Structural Analysis
A scrape of rainfall data from the website of the National Oceanic and Atmospheric Administration (NOAA).
1. mm.txt 2. mmp.txt 3. mmr.txt 4. mpf.txt
24
1. mm.txt: A. matrix (wide)
2. mmp.txt: B. pivot (wide)
3. mmr.txt: C. tidy (long)
4. mpf.txt: D. a union of different year/region pivots!
Demo
Fulfilling structural transformation requests
Suppose we have the “relational” version of the data, mmr.txt.
25
A data analyst asks for just the precipitation grouped by year and month, with no location data. What does this mean? How do we do this?
🤔
Demo
Fulfilling structural transformation requests
One possible solution
mmr_grouped = mmr.groupby(['Year', 'Month'])['Inches of Precipitation'].mean()
mmr_grouped = mmr_grouped.reset_index()
mmr_pivoted = mmr_grouped.pivot(index='Year', columns='Month',
                                values='Inches of Precipitation')
26
Demo
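The same pipeline, runnable on toy data (the values and the Location column are made up; the precipitation column name follows the slide):

```python
import pandas as pd

# Toy stand-in for mmr.txt: tidy rows of (Year, Month, Location, precipitation).
mmr = pd.DataFrame({
    "Year":  [2021, 2021, 2021, 2021],
    "Month": ["Jan", "Jan", "Feb", "Feb"],
    "Location": ["A", "B", "A", "B"],
    "Inches of Precipitation": [4.0, 6.0, 2.0, 4.0],
})

# Average over locations, then widen: one row per Year, one column per Month.
mmr_grouped = (mmr.groupby(["Year", "Month"])["Inches of Precipitation"]
                  .mean()
                  .reset_index())
mmr_pivoted = mmr_grouped.pivot(index="Year", columns="Month",
                                values="Inches of Precipitation")
```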
Structural Transformation: Takeaways
Pivoting and unpivoting are relatively easy in dataframes, though perhaps at the cost of losing data type/metadata information.
Rectangular model performance and scale: Also super important considerations! Some questions to always keep in mind:
27
Demo
Slow Jupyter? Try these tips…
How to restart your server/delete database:
https://fa23.data101.org/resources/assignment-tips/
Project party 2 tomorrow (Wed 10/4) 5-7pm, Warren 101B. We scaled up DataHub node allocation during this time period and around the deadline.
Project 1 grades released - Regrade window until this Friday 10/6 [Ed announcement]
28
Data Type Transformations (Induction/Coercion)
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
29
Type Induction and Coercion
[Review] Dataframes are symmetric, but idiosyncratic.
Many of the systems we work with will enforce and coerce types to data.
30
As data engineers, we need to coerce data types to ones that data scientists/analysts can easily use.
It is very common to work with data that has little or no metadata.
31
As a result, data types must be inferred, i.e., type induction must occur!
Formalizing Techniques for Type Induction
Given: a column c of potentially "dirty" data values, and a set of possible data types H
Goal: write an algorithm to choose a type for column c
Assume H = {bool, char, int16, int32, int64, float32, float64, string}.
We discuss three options:
1. "Hard" rules: for the types in H ordered from most- to least-restrictive (e.g., H as sorted above), choose the first type that matches all the values in c.
2. Classification rules (supervised learning): predict the data type of c, where the model is trained on existing data labeled with types in H.
3. Minimum Description Length (MDL): compute a simple statistical heuristic that accounts for the "penalty" of encoding data that "doesn't fit" into a given type.
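A minimal sketch of the "hard rules" option in Python, using a simplified hypothesis set (the matchers and their ordering here are illustrative):

```python
# "Hard" rule type induction: walk H from most- to least-restrictive and
# return the first type that matches every value in the column.
def fits_bool(v):
    return v.lower() in ("true", "false")

def fits_int16(v):
    try:
        return -32768 <= int(v) <= 32767
    except ValueError:
        return False

def fits_float(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

# Ordered most- to least-restrictive; string matches everything.
HYPOTHESES = [("bool", fits_bool), ("int16", fits_int16),
              ("float", fits_float), ("string", lambda v: True)]

def induce_type(column):
    for name, fits in HYPOTHESES:
        if all(fits(v) for v in column):
            return name
```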
Bits and bytes
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
34
Slides Credit to CS61C (Garcia, Yokota), Fall 2023
Data 101 note: in Python, ints are >=32 bits (plain ints) or arbitrary-precision (long ints).
ASCII characters are stored in one byte (8 bits)
39
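These sizes are easy to check from Python's standard library (a quick illustration, not part of the original slides):

```python
import struct

# int16: two bytes; the largest value is 2**15 - 1 = 32767.
packed = struct.pack("<h", 32767)        # little-endian signed 16-bit
assert len(packed) == 2

# ASCII: one byte (8 bits) per character.
assert len("J25".encode("ascii")) == 3   # 3 bytes = 24 bits

# Python's own ints are arbitrary-precision, so this does not overflow:
big = 2 ** 100
```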
Minimum Description Length (MDL)
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
40
Minimum Description Length (MDL)
MDL is a simple statistical heuristic that accounts for the “penalty” of encoding data that “doesn’t fit” into a given type.
Define:
41
Despite the equation, MDL is straightforward to measure.
if value v “fits” in type T, charge the cost of T
otherwise, charge the default cost of value v, since it has to be processed in some other way
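The two rules above can be written as one cost function (a reconstruction, since the slide's equation did not survive the export; the symbol names are ours):

```latex
\mathrm{cost}(T, c) \;=\; \sum_{v \in c}
\begin{cases}
  \mathrm{bits}(T) & \text{if } v \text{ fits in } T,\\[2pt]
  \mathrm{bits}_{\mathrm{default}}(v) & \text{otherwise,}
\end{cases}
```

where bits(T) is the fixed width of type T and bits_default(v) is the cost of v under the default type (e.g., 8 bits per ASCII character for strings). Choose the type in H with the minimum cost.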
Despite the equation, MDL is straightforward to measure
Suppose we have the column of values c = [J25, 2, 12, 4750].
Assume the default type is “string”. If ASCII, then the cost is 8 bits per character.
1. What is the cost of encoding c as 16-bit integers?
2. What is the cost of encoding c as strings?
3. What datatype do we choose?
42
if value v “fits” in type T, charge the cost of T
otherwise, charge the default cost of value v, since it has to be processed in some other way
As int16:
    values:    J25   2    12   4750
    bit-cost:  3*8   16   16   16      total = 72 bits
As strings:
    values:    J25   2    12   4750
    bit-cost:  3*8   1*8  2*8  4*8     total = 80 bits
int16 has the lower cost.
Note: the value J25 likely gets discarded or force-converted.
Despite the equation, MDL is straightforward to measure
Suppose we have the column of values c = [J25, Lisa, 2, 12, 4750].
Assume the default type is “string”. If ASCII, then the cost is 8 bits per character.
1. What is the cost of encoding c as 16-bit integers?
2. What is the cost of encoding c as strings?
3. What datatype do we choose (int16 or str)?
43
if value v “fits” in type T, charge the cost of T
otherwise, charge the default cost of value v, since it has to be processed in some other way
🤔
As int16:
    values:    J25   Lisa   2    12   4750
    bit-cost:  3*8   4*8    16   16   16     total = 104 bits
As strings:
    values:    J25   Lisa   2    12   4750
    bit-cost:  3*8   4*8    1*8  2*8  4*8    total = 112 bits
int16 still has the lower cost.
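This computation is a few lines of Python (a sketch under the slide's assumptions: 16 bits per int16 value, 8 bits per ASCII character, strings as the default type):

```python
# MDL cost of a column under a candidate type: charge the type's cost when
# a value fits, else charge the default (string) cost of the value.
def fits_int16(v):
    try:
        return -32768 <= int(v) <= 32767
    except ValueError:
        return False

def string_bits(v):
    return 8 * len(v)          # ASCII: 8 bits per character

def mdl_cost_int16(column):
    return sum(16 if fits_int16(v) else string_bits(v) for v in column)

def mdl_cost_string(column):
    return sum(string_bits(v) for v in column)

c = ["J25", "Lisa", "2", "12", "4750"]
int16_cost = mdl_cost_int16(c)    # 24 + 32 + 16 + 16 + 16
string_cost = mdl_cost_string(c)  # 24 + 32 + 8 + 16 + 32
```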
Data Value Transformations
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
45
Data Value Transformations
After you’ve restructured your data, you will likely need to transform values.
There are many value transformations. We'll start covering two today:
46
String Manipulation
Typical transforms include the following (names may vary across systems/DSLs):
A lot of the above is covered in Data 100 + our SQL project unit!
47
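For instance, the pandas .str accessor covers many common string transforms (the examples below are ours, not the slide's exact list):

```python
import pandas as pd

# Common string transforms via the pandas .str accessor; names vary
# across systems/DSLs.
s = pd.Series(["  Alice ", "BOB", "carol d."])

trimmed  = s.str.strip()                        # trim whitespace
lowered  = s.str.lower()                        # case-fold
replaced = s.str.replace(".", "", regex=False)  # substring substitution
tokens   = s.str.split()                        # split on whitespace
lengths  = s.str.len()                          # string length
```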
Scalar Functions
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions
48
Scalar Functions
A scalar function is a function on atomic values.
Etymology note: You have likely seen the word “scalar” in different contexts:
As part of our definition of scalar functions, we combine these to define a scalar as a value in any individual attribute.
49
Exercise: How many scalar functions?
50
WITH yearnum AS
(SELECT yearid, (yearid % 100) AS year
FROM batting
)
SELECT yearid,
CONCAT('''', LPAD(year::text, 2, '0')) AS year
FROM yearnum
LIMIT 5;
LPAD [docs 16.2 link]:
lpad ( string text, length integer [, fill text ] ) → text
Extends the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right).
lpad('hi', 5, 'xy') → xyxhi
'''' ← a Postgres string literal containing one single-quote character ' (a quote is escaped by doubling it)
value::type ← postgres-specific typecast syntax
🤔
A. 0
B. 1
C. 2
D. 3
E. 4
F. Something else
Side note: Scalar Functions and Parallelism
Scalar functions are computed individually on each record in your data.
Because of this per-record abstraction, the processing engine has several options for efficiency:
52
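For intuition, the earlier yearid query's scalar pipeline can be mimicked per record in pandas (toy yearid values; column names follow the query):

```python
import pandas as pd

# Per-record scalar functions mirroring the SQL:
# yearid % 100, cast to text, left-pad to 2 chars, prepend a quote.
batting = pd.DataFrame({"yearid": [1871, 1905, 2023]})

year = batting["yearid"] % 100                         # yearid % 100
batting["year"] = "'" + year.astype(str).str.zfill(2)  # ::text, LPAD, CONCAT
```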
Let’s check it out
A “flattening” of the query above (to remove the CTE):
EXPLAIN (VERBOSE true)
SELECT yearid,
       CONCAT('''', LPAD((yearid % 100)::text, 2, '0')) AS year
FROM batting;
53
Demo
What if scalar functions mention multiple tables?
54
The below query computes an arbitrary statistic for pitchers:
EXPLAIN (VERBOSE true)
SELECT p.playerid, p.so - b.so
  FROM pitching p
  INNER JOIN batting b
  ON p.playerid = b.playerid;
Scalar functions are computed as each tuple is output from the join.
Demo
Nested Loop (cost=0.43..13004.27 rows=339358 width=13)
Output: p.playerid, (p.so - b.so)
-> Seq Scan on public.pitching p (cost=0.00..1374.06 rows=45806 width=13)
Output: p.playerid, p.yearid, p.stint, p.teamid, p.lgid, p.w, p.l, p.g, p.gs, p.cg, p.sho, p.sv, p.ipouts, p.h, p.er, p.hr, p.bb, p.so, p.baopp, p.era, p.ibb, p.wp, p.hbp, p.bk, p.bfp, p.gf, p.r, p.sh, p.sf, p.gidp
-> Memoize (cost=0.43..0.73 rows=7 width=13)
Output: b.so, b.playerid
Cache Key: p.playerid
Cache Mode: logical
-> Index Scan using batting_pkey on public.batting b (cost=0.42..0.72 rows=7 width=13)
Output: b.so, b.playerid
Index Cond: ((b.playerid)::text = (p.playerid)::text)
(11 rows)
(using psql directly, which preserves whitespace formatting)
Demo
[Bonus] User-Defined Functions (UDFs)
Some DBMSes allow you to define scalar functions of your own.
%sql CREATE EXTENSION plpythonu;
%%sql
CREATE OR REPLACE FUNCTION pyhash(s text)
RETURNS text
AS $$
## Python text goes here, can reference variable s
import hashlib
m = hashlib.sha256()
m.update(s) # uses variable s
return m.hexdigest() # return a text
$$ LANGUAGE plpythonu;
%sql SELECT pyhash('Joe'), pyhash('Joel');
56
Demo