Data Preparation I: Structural
October 3, 2023
Data 101, Fall 2023 @ UC Berkeley
Lisa Yan https://fa23.data101.org/
1
LECTURE 12
Join at slido.com #12345679
Data Preparation
Data preparation is a very broad subject, covering everything from data models to statistical assessments of data to string algorithms to scalable data processing. In some sense, most of Data Engineering---most of data science!---boils down to Data Preparation.
This week:
3
Last time, we looked at different model paradigms for rectangular data. This is a key component of Structural Data Transformations.
[Review] PIVOT: Use a DataFrame to transform Relation → Matrix
4
Currently, our data is tidy since we used df.melt() [docs melt].
How do we pivot to get back the original (dense) matrix? [pandas docs pivot]
A. mm_melted.pivot(index='variable', columns='Year')
B. mm_melted.pivot(index='Year', columns='variable')
🤔
Exercise
Demo
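As a sketch of the round trip, with toy data and illustrative column names (not the lecture's actual files):

```python
import pandas as pd

# A toy round trip: melt a wide table into tidy form, then pivot back.
# Data and column names here are illustrative, not the lecture's dataset.
mm = pd.DataFrame({"Year": [2021, 2022],
                   "Jan": [4.5, 3.1],
                   "Feb": [2.2, 5.0]})

# melt: wide -> tidy, producing columns Year, variable, value
mm_melted = mm.melt(id_vars="Year")

# pivot: tidy -> wide, one row per Year, one column per original variable
mm_wide = mm_melted.pivot(index="Year", columns="variable", values="value")
```

With this toy setup, the melted frame has columns Year/variable/value, so pivoting with index='Year' and columns='variable' restores the wide matrix.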
Summary so far
We’ve covered three rectangular data models: matrix, relation, and dataframe.
Transposing, pivoting, and unpivoting are very common operations that change the model paradigm and enable:
7
Data Unboxing
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
8
Lecture 12, Data 101 Fall 2023
Common file categories
Record per line: newline-delimited rows of uniform, symbol-delimited data
Dictionaries/Objects: explicit key:value pairs, may be nested! Two common cases:
Unions: a mixture of rows from k distinct schemas. Two common cases:
Natural Language (prose): natural language intended for human consumption, e.g., Project Gutenberg
Everything else: There is a long tail of file formats in the world. If they're not readable as text, there's likely a commercial or open source tool to translate them to a readable text format.
9
For the rest of this class, we will focus on the common case of text that can be read and written easily in tools like Jupyter, Python, PostgreSQL, etc.
One final note in the wild: text formats themselves are a complex issue.
Command-line tools for assessing structure
Suppose you have a file named fname:
These are command-line tools!
11
Demo
Start by running du -h and head on some files, or have a peek in your Finder/Explorer and favorite (scale-savvy) text editor.
12
Demo
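If a shell isn't handy, the same first-look checks can be sketched in Python (the helper name and defaults here are ours, not a standard tool):

```python
import os

# A rough Python equivalent of `du` and `head` for a first look at a file.
def peek(path, n_bytes=1024, n_lines=5):
    size = os.path.getsize(path)              # file size in bytes (cf. du)
    with open(path, "rb") as f:               # read bytes, in case it's not text
        head = f.read(n_bytes)                # cf. head -c 1024
    lines = head.decode("utf-8", errors="replace").splitlines()[:n_lines]
    return size, lines
```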
Exploration, Part 1
What category of data is each file? Any observations about the data?
13
🤔
A. Record per line
B. Dictionaries/Objects
C. Unions of distinct schemas
D. Natural Language
E. Something else
head -c 1024 data/jq2.txt
Demo
Exploration, Part 1
What category of data is each file? Any observations about the data?
15
Note: this process is a form of data visualization!
In short, this text-based, lean data visualization gives you only limited information.
Demo
[On your own] Data Unboxing, Part II
16
Demo
Data Unboxing, Part II
[Bonus] For a richer data visualization: Trifacta, aka Google Cloud Dataprep.
Next up: More transformations!
17
Demo
Structural Data Transformations
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
18
Up next: Various data transformations
19
Different types of data transformation; we’ll cover a few today.
Structural Transformations
Even within rectangular data, there are several structural transformations we can do. From earlier:
Next, we work with a scrape of rainfall data from the website of the National Oceanic and Atmospheric Administration (NOAA).
This was compiled by visiting each annual website of monthly precipitation by regions in the California Nevada River Forecast Center (CNRFC) area.
20
Demo
CNRFC-NOAA Rainfall Data Unboxing: Structural Analysis
A scrape of rainfall data from the CNRFC-NOAA website.
1. mm.txt 2. mmp.txt 3. mmr.txt 4. mpf.txt
21
🤔
A. Matrix (wide)
B. Pivot (wide)
C. Tidy (long)
D. Union of models
E. Something else
How would you characterize these four files, structurally?
A, A, A, A
B, B, B, B
A, A, C, C
A, B, C, C
A, B, C, D
A, B, C, E
A, E, C, E
Demo
CNRFC-NOAA Rainfall Data Unboxing - Structural Analysis
A scrape of rainfall data from the website of the National Oceanic and Atmospheric Administration (NOAA).
1. mm.txt 2. mmp.txt 3. mmr.txt 4. mpf.txt
24
1. mm.txt: A. matrix (wide)
2. mmp.txt: B. pivot (wide)
3. mmr.txt: C. tidy (long)
4. mpf.txt: D. a union of different year/region pivots!
Demo
Fulfilling structural transformation requests
Suppose we have the “relational” version of the data, mmr.txt.
25
A data analyst asks for just the precipitation grouped by year and month, with no location data. What does this mean? How do we do this?
🤔
Demo
Fulfilling structural transformation requests
One possible solution
mmr_grouped = mmr.groupby(['Year', 'Month'])['Inches of Precipitation'].mean()
mmr_grouped = mmr_grouped.reset_index()
mmr_pivoted = mmr_grouped.pivot(index='Year', columns='Month',
                                values='Inches of Precipitation')
26
Demo
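The same pipeline, runnable on toy data (the values and the Location column are made up; the precipitation column name follows the slide):

```python
import pandas as pd

# Toy stand-in for mmr.txt: tidy rows of (Year, Month, Location, precipitation).
mmr = pd.DataFrame({
    "Year":  [2021, 2021, 2021, 2021],
    "Month": ["Jan", "Jan", "Feb", "Feb"],
    "Location": ["A", "B", "A", "B"],
    "Inches of Precipitation": [4.0, 6.0, 2.0, 4.0],
})

# Average over locations, then widen: one row per Year, one column per Month.
mmr_grouped = (mmr.groupby(["Year", "Month"])["Inches of Precipitation"]
                  .mean()
                  .reset_index())
mmr_pivoted = mmr_grouped.pivot(index="Year", columns="Month",
                                values="Inches of Precipitation")
```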
Structural Transformation: Takeaways
Pivoting and unpivoting are relatively easy in dataframes, though perhaps at the cost of losing data type/metadata information.
Rectangular model performance and scale: Also super important considerations! Some questions to always keep in mind:
27
Demo
Slow Jupyter? Try these tips…
How to restart your server/delete database:
https://fa23.data101.org/resources/assignment-tips/
Project party 2 tomorrow (Wed 10/4) 5-7pm, Warren 101B. We scaled up DataHub node allocation during this time period and around the deadline.
Project 1 grades released - Regrade window until this Friday 10/6 [Ed announcement]
28
Data Type Transformations (Induction/Coercion)
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
29
Type Induction and Coercion
[Review] Dataframes are symmetric, but idiosyncratic.
Many of the systems we work with will enforce and coerce types to data.
30
As data engineers, we need to coerce data types to ones that data scientists/analysts can easily use.
It is very common to work with data that has little or no metadata.
31
As a result, data types must be inferred, i.e., type induction must occur!
Formalizing Techniques for Type Induction
Given: a column c of potentially "dirty" data values, and a set of possible data types H
Goal: write an algorithm to choose a type for column c
Assume H = {bool, char, int16, int32, int64, float32, float64, string}.
We discuss three options:
1. "Hard" rules: for the types in H ordered from most- to least-restrictive (e.g., H as sorted above), choose the first type that matches all the values in c.
2. Classification rules (supervised learning): predict the data type of c, where the model is trained on existing data labeled with types in H.
3. Minimum Description Length (MDL): compute a simple statistical heuristic that accounts for the "penalty" of encoding data that "doesn't fit" into a given type.
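A minimal sketch of the "hard rules" option in Python, using a simplified hypothesis set (the matchers and their ordering here are illustrative):

```python
# "Hard" rule type induction: walk H from most- to least-restrictive and
# return the first type that matches every value in the column.
def fits_bool(v):
    return v.lower() in ("true", "false")

def fits_int16(v):
    try:
        return -32768 <= int(v) <= 32767
    except ValueError:
        return False

def fits_float(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

# Ordered most- to least-restrictive; string matches everything.
HYPOTHESES = [("bool", fits_bool), ("int16", fits_int16),
              ("float", fits_float), ("string", lambda v: True)]

def induce_type(column):
    for name, fits in HYPOTHESES:
        if all(fits(v) for v in column):
            return name
```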
Bits and bytes
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
34
Slides Credit to CS61C (Garcia, Yokota), Fall 2023
Data 101 note: in Python, ints are >=32 bits (plain ints) or arbitrary-precision (long ints).
ASCII characters are stored in one byte (8 bits)
39
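These sizes are easy to check from Python's standard library (a quick illustration, not part of the original slides):

```python
import struct

# int16: two bytes; the largest value is 2**15 - 1 = 32767.
packed = struct.pack("<h", 32767)        # little-endian signed 16-bit
assert len(packed) == 2

# ASCII: one byte (8 bits) per character.
assert len("J25".encode("ascii")) == 3   # 3 bytes = 24 bits

# Python's own ints are arbitrary-precision, so this does not overflow:
big = 2 ** 100
```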
Minimum Description Length (MDL)
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
40
Minimum Description Length (MDL)
MDL is a simple statistical heuristic that accounts for the “penalty” of encoding data that “doesn’t fit” into a given type.
Define:
41
Despite the equation, MDL is straightforward to measure.
if value v “fits” in type T, charge the cost of T
otherwise, charge the default cost of value v, since it has to be processed in some other way
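The two rules above can be written as one cost function (a reconstruction, since the slide's equation did not survive the export; the symbol names are ours):

```latex
\mathrm{cost}(T, c) \;=\; \sum_{v \in c}
\begin{cases}
  \mathrm{bits}(T) & \text{if } v \text{ fits in } T,\\[2pt]
  \mathrm{bits}_{\mathrm{default}}(v) & \text{otherwise,}
\end{cases}
```

where bits(T) is the fixed width of type T and bits_default(v) is the cost of v under the default type (e.g., 8 bits per ASCII character for strings). Choose the type in H with the minimum cost.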
Despite the equation, MDL is straightforward to measure
Suppose we have the column of values c = [J25, 2, 12, 4750].
Assume the default type is “string”. If ASCII, then the cost is 8 bits per character.
1. What is the cost of encoding c as 16-bit integers?
2. What is the cost of encoding c as strings?
3. What datatype do we choose?
42
if value v “fits” in type T, charge the cost of T
otherwise, charge the default cost of value v, since it has to be processed in some other way
As int16:
    values:    J25   2    12   4750
    bit-cost:  3*8   16   16   16      total = 72 bits
As strings:
    values:    J25   2    12   4750
    bit-cost:  3*8   1*8  2*8  4*8     total = 80 bits
int16 has the lower cost.
Note: the value J25 likely gets discarded or force-converted.
Despite the equation, MDL is straightforward to measure
Suppose we have the column of values c = [J25, Lisa, 2, 12, 4750].
Assume the default type is “string”. If ASCII, then the cost is 8 bits per character.
1. What is the cost of encoding c as 16-bit integers?
2. What is the cost of encoding c as strings?
3. What datatype do we choose (int16 or str)?
43
if value v “fits” in type T, charge the cost of T
otherwise, charge the default cost of value v, since it has to be processed in some other way
🤔
As int16:
    values:    J25   Lisa   2    12   4750
    bit-cost:  3*8   4*8    16   16   16     total = 104 bits
As strings:
    values:    J25   Lisa   2    12   4750
    bit-cost:  3*8   4*8    1*8  2*8  4*8    total = 112 bits
int16 still has the lower cost.
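This computation is a few lines of Python (a sketch under the slide's assumptions: 16 bits per int16 value, 8 bits per ASCII character, strings as the default type):

```python
# MDL cost of a column under a candidate type: charge the type's cost when
# a value fits, else charge the default (string) cost of the value.
def fits_int16(v):
    try:
        return -32768 <= int(v) <= 32767
    except ValueError:
        return False

def string_bits(v):
    return 8 * len(v)          # ASCII: 8 bits per character

def mdl_cost_int16(column):
    return sum(16 if fits_int16(v) else string_bits(v) for v in column)

def mdl_cost_string(column):
    return sum(string_bits(v) for v in column)

c = ["J25", "Lisa", "2", "12", "4750"]
int16_cost = mdl_cost_int16(c)    # 24 + 32 + 16 + 16 + 16
string_cost = mdl_cost_string(c)  # 24 + 32 + 8 + 16 + 32
```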
Data Value Transformations
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions [next time]
45
Data Value Transformations
After you’ve restructured your data, you will likely need to transform values.
There are many value transformations. We'll start covering two today:
46
String Manipulation
Typical transforms include the following (names may vary across systems/DSLs):
A lot of the above is covered in Data 100 + our SQL project unit!
47
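For instance, the pandas .str accessor covers many common string transforms (the examples below are ours, not the slide's exact list):

```python
import pandas as pd

# Common string transforms via the pandas .str accessor; names vary
# across systems/DSLs.
s = pd.Series(["  Alice ", "BOB", "carol d."])

trimmed  = s.str.strip()                        # trim whitespace
lowered  = s.str.lower()                        # case-fold
replaced = s.str.replace(".", "", regex=False)  # substring substitution
tokens   = s.str.split()                        # split on whitespace
lengths  = s.str.len()                          # string length
```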
Scalar Functions
Pivot/Unpivot [from last time]
Data Unboxing
Structural Data Transformations
Data Type Transformations
Data Value Transformations
Scalar Functions
48
Scalar Functions
A scalar function is a function on atomic values.
Etymology note: You have likely seen the word “scalar” in different contexts:
As part of our definition of scalar functions, we combine these to define a scalar as a value in any individual attribute.
49
Exercise: How many scalar functions?
50
WITH yearnum AS
(SELECT yearid, (yearid % 100) AS year
FROM batting
)
SELECT yearid,
CONCAT('''', LPAD(year::text, 2, '0')) AS year
FROM yearnum
LIMIT 5;
LPAD [docs 16.2 link]:
lpad ( string text, length integer [, fill text ] ) → text
Extends the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right).
lpad('hi', 5, 'xy') → xyxhi
'''' ← a Postgres string literal containing one single-quote character ' (a quote is escaped by doubling it)
value::type ← postgres-specific typecast syntax
🤔
A. 0
B. 1
C. 2
D. 3
E. 4
F. Something else
Side note: Scalar Functions and Parallelism
Scalar functions are computed individually on each record in your data.
Because of this per-record abstraction, the processing engine has several options for efficiency:
52
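For intuition, the earlier yearid query's scalar pipeline can be mimicked per record in pandas (toy yearid values; column names follow the query):

```python
import pandas as pd

# Per-record scalar functions mirroring the SQL:
# yearid % 100, cast to text, left-pad to 2 chars, prepend a quote.
batting = pd.DataFrame({"yearid": [1871, 1905, 2023]})

year = batting["yearid"] % 100                         # yearid % 100
batting["year"] = "'" + year.astype(str).str.zfill(2)  # ::text, LPAD, CONCAT
```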
Let’s check it out
A “flattening” of the query above (to remove the CTE):
EXPLAIN (VERBOSE true)
SELECT yearid,
       CONCAT('''', LPAD((yearid % 100)::text, 2, '0')) AS year
FROM batting;
53
Demo
What if scalar functions mention multiple tables?
54
The below query computes an arbitrary statistic for pitchers:
EXPLAIN (VERBOSE true)
SELECT p.playerid, p.so - b.so
  FROM pitching p
  INNER JOIN batting b
  ON p.playerid = b.playerid;
Scalar functions are computed as each tuple is output from the join.
Demo
Nested Loop (cost=0.43..13004.27 rows=339358 width=13)
Output: p.playerid, (p.so - b.so)
-> Seq Scan on public.pitching p (cost=0.00..1374.06 rows=45806 width=13)
Output: p.playerid, p.yearid, p.stint, p.teamid, p.lgid, p.w, p.l, p.g, p.gs, p.cg, p.sho, p.sv, p.ipouts, p.h, p.er, p.hr, p.bb, p.so, p.baopp, p.era, p.ibb, p.wp, p.hbp, p.bk, p.bfp, p.gf, p.r, p.sh, p.sf, p.gidp
-> Memoize (cost=0.43..0.73 rows=7 width=13)
Output: b.so, b.playerid
Cache Key: p.playerid
Cache Mode: logical
-> Index Scan using batting_pkey on public.batting b (cost=0.42..0.72 rows=7 width=13)
Output: b.so, b.playerid
Index Cond: ((b.playerid)::text = (p.playerid)::text)
(11 rows)
(using psql directly, which preserves whitespace formatting)
Demo
[Bonus] User-Defined Functions (UDFs)
Some DBMSes allow you to define scalar functions of your own.
%sql CREATE EXTENSION plpythonu;
%%sql
CREATE OR REPLACE FUNCTION pyhash(s text)
RETURNS text
AS $$
## Python text goes here, can reference variable s
import hashlib
m = hashlib.sha256()
m.update(s) # uses variable s
return m.hexdigest() # return a text
$$ LANGUAGE plpythonu;
%sql SELECT pyhash('Joe'), pyhash('Joel');
56
Demo