1 of 62


Applied Data Analysis (CS401)

Maria Brbic / Robert West

Lecture 2

Handling data

18 Sep 2024

2 of 62

Announcements


  • Register your teams (5 people) here by Fri Sep 27th
    • Each team member must individually complete the form!
  • Project milestone P1 to be released this Fri, due Fri Oct 4th
  • First quiz (“Q1”*) will be released today after the lecture
    • An exercise to recap the lecture material
  • Friday’s lab session:
    • Intro to Pandas (very important for Homework H1 and the rest of the course)
    • Exercise 1 is already on GitHub

 

3 of 62

Deadlines

  • Homeworks
    • Homework 1
      • Release Oct 4th 2024
      • Due Oct 18th 2024
    • Homework 2
      • Release Nov 15th 2024
      • Due Nov 29th 2024
  • Final exam
    • Date TBD
  • Project deliverables
    • Project milestone P1
      • Due Oct 4th 2024
    • Project milestone P2
      • Due Nov 15th 2024
    • Project milestone P3
      • Due Dec 20th 2024


All deadlines are 23:59 CET

4 of 62

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec2-feedback

Feedback form available for each lecture and lab session

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Is it nicer to follow the lecture online or offline?

5 of 62

Cooking with data

Part 2:

Data sources

Part 1:

Data models

Part 3:

Data wrangling

6 of 62

Cooking with data

Part 2:

Data sources

Part 1:

Data models

Part 3:

Data wrangling

7 of 62

Simple definition: A data model specifies how you think about the world

8 of 62

Q: “How do you think about the world?”

entity, relationship, attributes

A: “See my entity–relationship diagram!”

9 of 62

Part 1:

Data models

entity, relationship, attributes

How to store my data on a computer?

Q: “How do you think about the world?”

A: “See my entity–relationship diagram!”

10 of 62

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

11 of 62

Flat model

66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] "GET /support.html HTTP/1.1" 200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

111.111.111.111 - - [08/Oct/2007:11:17:55 -0400] "GET / HTTP/1.1" 200 10801 "http://www.google.com/search?q=in+love+with+ada+lovelace+what+to+do&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"

  • Example: log files; e.g., Apache web server (httpd)
    • Entities = requests from clients to server

  • Another common format: CSV (“comma-separated values”)

“The world is simple: one type of entity, all with the same attributes”
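Below is a minimal sketch (not from the slides) of how such a flat log could be parsed in Python; the regular expression and field names are illustrative and only cover well-formed lines in this "combined" log format:

import re

# Illustrative pattern for Apache combined-log lines like the ones above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = ('66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] '
        '"GET /support.html HTTP/1.1" 200 11179 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

match = LOG_PATTERN.match(line)
if match:
    # One request = one entity; all entities share the same attributes.
    print(match.groupdict())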

12 of 62

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

13 of 62

Relational model

  • The relational model is ubiquitous:
    • MySQL, PostgreSQL, Oracle, DB2, SQLite, …
    • You use it many times every day
  • Data represented as tables describing
    • entities
    • relationships between entities

Entities (presidents):

id   name
1    Bush
2    Trump
3    Obama

Relationships (successor-of):

president   successor
1           3
3           2

“The world contains many types of entities, connected by relationships.”

14 of 62

Processing data in the relational model: SQL

  • Declarative language for core data manipulations
  • You think about what you want, not how to compute it

SQL (Structured Query Language)

Declarative (e.g., SQL) vs. imperative (e.g., Python)

15 of 62

SQL


  • You should know basics of SQL
  • Need a refresher? → Watch/do online tutorials!
  • Key operations:
    • Select (!), update, delete
    • Unique keys
    • Joins (inner, left outer, right outer, full)
    • Sorting
    • Aggregation (group by, count, min, max, avg, etc.)
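A minimal, self-contained sketch of these operations, using Python's built-in sqlite3 module and the presidents example from the relational-model slide (table and column names are chosen here for illustration):

import sqlite3

# Toy database: the two tables from the relational-model slide.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE presidents (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE successions (president INTEGER, successor INTEGER);
    INSERT INTO presidents VALUES (1, 'Bush'), (2, 'Trump'), (3, 'Obama');
    INSERT INTO successions VALUES (1, 3), (3, 2);
""")

# Select + inner joins + sorting: who succeeded whom?
for row in con.execute("""
    SELECT p.name AS president, s.name AS successor
    FROM successions
    JOIN presidents AS p ON p.id = successions.president
    JOIN presidents AS s ON s.id = successions.successor
    ORDER BY president
"""):
    print(row)   # ('Bush', 'Obama'), ('Obama', 'Trump')

# Left outer join + aggregation: how many recorded successors per president?
for row in con.execute("""
    SELECT p.name, COUNT(s.successor) AS n_successors
    FROM presidents AS p
    LEFT OUTER JOIN successions AS s ON s.president = p.id
    GROUP BY p.name
"""):
    print(row)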

16 of 62


POLLING TIME

  • “Have you worked with SQL joins?”
  • Scan QR code or go to https://app.sli.do/event/jiFfkrP812UjtpwMbFJNxD

17 of 62

SQL implementations

(Logos of various SQL implementations, etc.)

18 of 62

SQL and “SQL”

  • The declarative-programming principles of SQL are widespread, even where it’s less obvious


19 of 62

“SQL”: Pandas (Python library)

  • Similar to SQL (declarative), with additional elements of functional programming (map(), filter(), etc.)
  • SQL “table” ←→ Pandas “DataFrame”
  • Need intro? Come to Friday’s lab session!
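For a first taste before the lab session, here is a minimal sketch using the presidents tables from above (the DataFrames are constructed inline for illustration):

import pandas as pd

# The two relational tables from earlier, as DataFrames.
presidents = pd.DataFrame({"id": [1, 2, 3], "name": ["Bush", "Trump", "Obama"]})
successions = pd.DataFrame({"president": [1, 3], "successor": [3, 2]})

# SQL-style join: resolve successor ids to names.
joined = successions.merge(presidents, left_on="successor", right_on="id")

# Declarative filtering and aggregation (you say what you want, not how).
print(presidents[presidents["name"].str.startswith("O")])
print(joined.groupby("name").size())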


20 of 62

Pandas vs. SQL

+ Pandas is lightweight and fast

+ Natively Python, i.e., full SQL expressiveness plus the expressiveness of Python, especially for function evaluation

+ Integration with plotting functions like Matplotlib

- In Pandas, tables must fit into memory

- No post-load indexing functionality: indices are built when a table is created

- No transactions, journaling, etc. (matters for parallel applications)

- Large, complex joins are slower


21 of 62

“SQL”: Unix command line

Goal: Find top 5 URLs visited most frequently by users between 18 and 25 years old

users.txt:

user_id   age
User145   33
User24    15
User5     76

url_visits.txt:

user_id   url
User2     ada.epfl.ch
User244   facebook.com

cat users.txt \
  | awk '$2 >= 18 && $2 <= 25' \
  | join -1 1 -2 1 url_visits.txt - \
  | cut -f 4 \
  | sort \
  | uniq -c \
  | sort -k 1,1 -n -r \
  | head -n 5

22 of 62

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

23 of 62

Document model

  • XML format:

<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l’Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>

  • JSON format:

{
  "contact": {
    "id": 656,
    "firstname": "Chuck",
    "lastname": "Smith",
    "phones": ["(123) 555-0178", "(890) 555-0133"],
    "address": {
      "street": "Rue de l’Ale 8",
      "city": "Lausanne",
      "zip": 1007,
      "country": "CH"
    }
  }
}

“The world is a hierarchy of entities”

24 of 62

“Think for a minute”

Think for a minute: If we want to use a relational DB (e.g., MySQL) instead of XML, how can we store several phone numbers for the same person? (Feel free to discuss with your neighbor.)

  • Document model:

<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l’Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>

25 of 62

Solution to “Think for a minute”

  • Same in relational model:

contacts:

id    first name   ...
656   Chuck        ...
...   ...          ...

phones:

id    phone
656   (123) 555-0178
656   (890) 555-0133
...   ...

  • Document model:

<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l’Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>

26 of 62

Processing XML and JSON

  • Document structure = tree
  • Processing via tree traversal (depth- or breadth-first search)
  • Or use proper query language, such as XQuery or jq
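A minimal sketch of such a traversal in Python, using the <contact> document from the previous slides:

import xml.etree.ElementTree as ET

# The document is a tree, so processing is tree traversal.
xml = """
<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l'Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>
"""

root = ET.fromstring(xml)
print(root.findtext("firstname"))                # Chuck
print([p.text for p in root.findall("phone")])   # both phone numbers
print(root.findtext("address/city"))             # Lausanne (path = walk down the tree)

# Depth-first traversal of the whole tree
for element in root.iter():
    print(element.tag, element.text)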


27 of 62

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes”�→ Flat model
  • “The world contains many types of entities, connected by relationships”�→ Relational model
  • “The world is a hierarchy of entities”�→ Document model
  • “The world is a complex network of entities”�→ Network model

28 of 62

Network model

“The world is a complex network of entities”

29 of 62

Commercial break


ADA: more than just rocket science!

30 of 62

“How should I store my data on a computer?”

—A word (or two) on binary formats

  • “Parsing” = converting strings (as stored in text files) to data types used by computer programs (e.g., int, float, boolean, array, list)
  • Possibly expensive, but can be avoided by using binary formats: store data to disk “as is”, without first converting to strings
  • Modern binary formats support nested structures, various levels of schema enforcement, compression, etc.
  • Python pickle, Java Serializable, Protocol Buffers (Google), Avro (supports schema evolution), Parquet (column-oriented), etc.

→ Consider converting to a binary format at the beginning of your processing pipeline (especially when using “big data”)
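A minimal sketch of what this looks like in practice with Pandas (Parquet support requires the pyarrow or fastparquet package; the file names are arbitrary):

import pandas as pd

# Write the same DataFrame as text (CSV) and as a binary, column-oriented format.
df = pd.DataFrame({"user_id": ["User145", "User24"], "age": [33, 15]})

df.to_csv("users.csv", index=False)     # text: every value stored as a string
df.to_parquet("users.parquet")          # binary: types and schema preserved, no parsing on load

print(pd.read_parquet("users.parquet").dtypes)   # 'age' comes back as an integer column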


31 of 62

Cooking with data

Part 2:

Data sources

Part 1:

Data models

Part 3:

Data wrangling

32 of 62

Data sources at Web companies

Examples from Facebook / Meta

Structured data (with clear schema):
    • Application databases

Semi-structured data (“self-describing” structure; CSV etc.):
    • Web server logs
    • Client-side event logs
    • API server logs
    • Ad server logs
    • Search server logs

Unstructured data:
    • Advertisement landing page content
    • Wikipedia
    • Images and video

33 of 62

Another example: Wikipedia

  • 300+ languages
  • 42 million entities
  • Mind-boggling richness of data


34 of 62


35 of 62

Wikipedia

How to work with Wikipedia?

  • REST API (cf. later)
  • XML dumps with wiki markup, SQL database dumps
  • Issues: Unicode, size, recency, etc.
  • To make your life easier:
    • Find projects on GitHub to help you
    • Use more structured versions (see next slides)


36 of 62

Wikidata

  • “Database version” of Wikipedia
  • {fr:Suisse, de:Schweiz, it:Svizzera, en:Switzerland, …} → Q39


37 of 62


38 of 62

Wikidata

  • “Database version” of Wikipedia
  • {fr:Suisse, de:Schweiz, it:Svizzera, en:Switzerland, …} → Q39
  • Both API access and full database dumps
  • Available as
    • JSON (document model)
    • RDF (network model)
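A minimal sketch of API access (assuming the standard Special:EntityData endpoint, which returns the JSON document describing one entity):

import requests

# Fetch the JSON document for Q39 (Switzerland) and inspect it.
url = "https://www.wikidata.org/wiki/Special:EntityData/Q39.json"
entity = requests.get(url, timeout=10).json()["entities"]["Q39"]

print(entity["labels"]["en"]["value"])   # "Switzerland"
print(len(entity["claims"]))             # number of properties attached to Q39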


39 of 62


Wikipedia pageview logs

40 of 62

Crawling and processing webpages: HTML

Plenty of bulk-downloadable HTML data:

  • Common Crawl dataset, about 1.82 billion web pages -- huge!
  • (… but less than 0.1% of Google’s Web crawl, as of 2015)
  • 145 TB, hosted on Amazon S3, also available for download

… but if you need a specific website: use a crawler/spider: Apache Nutch, Storm, Heritrix 3, Scrapy, etc. (or simply wget…)


41 of 62

Useful HTML tools

Requests http://docs.python-requests.org/en/master/

An elegant and simple HTTP library for Python

Scrapy https://scrapy.org/

An open-source framework to build Web crawlers

Beautiful Soup http://www.crummy.com/software/BeautifulSoup/

A Python API for handling real HTML

Plain ol’ /regular/expres*ion/s…
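A minimal sketch combining the first and third tool above (the URL is just an example):

import requests
from bs4 import BeautifulSoup

# Fetch a page and extract all outgoing link targets.
html = requests.get("https://www.epfl.ch/", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

links = [a["href"] for a in soup.find_all("a", href=True)]
print(len(links), links[:5])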


42 of 62

Schema.org: microformats for Web pages

  • Nuggets of structured information embedded in (semantically) unstructured HTML


Text as rendered by browser:

Avatar
Director: James Cameron (born August 16, 1954)

HTML under the hood: the same text, annotated with schema.org microdata attributes (sketched below)
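A sketch of what such microdata typically looks like, and how to pull the structured nuggets back out with Beautiful Soup. The markup below is hypothetical (not copied from the slide); itemscope, itemtype, and itemprop are the standard schema.org microdata attributes:

from bs4 import BeautifulSoup

# Hypothetical microdata-annotated HTML for the rendered text above.
html = """
<div itemscope itemtype="https://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="https://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span>
    (born <span itemprop="birthDate">1954-08-16</span>)
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Every element carrying an itemprop attribute is a structured nugget.
for tag in soup.find_all(attrs={"itemprop": True}):
    print(tag["itemprop"], "->", tag.get_text(" ", strip=True))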

43 of 62

Web services

  • Most large web sites today actively discourage “screen-scraping” to get their content
  • Instead: Web service APIs, for interoperable machine-to-machine interaction over a network
  • The preferred way to get data from online sources
  • Most common framework: REST
    • You request a URL from the server via HTTP
    • The server responds with a text file (e.g., JSON, XML, plain text)


44 of 62

REST example

The resource shown below is a description of a user named Jane

  • Requested by sending GET request for the resource’s URL, e.g., via curl: curl http://www.example.org/users/jane/
  • If users need to modify the resource, they GET it, modify it, and PUT it back
  • The href to the location resource allows savvy clients to get more information with another simple GET request
  • Implication: Clients cannot be too “thin”; need to understand resource formats!


{
  "user": {
    "name": "Jane",
    "gender": "female",
    "location": {
      "href": "http://www.example.org/us/ny/new_york",
      "text": "New York"
    }
  }
}
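A minimal sketch of the same interaction in Python (example.org is a placeholder domain, so this is illustrative rather than runnable against a real service):

import requests

# GET the resource and parse the JSON body.
response = requests.get("http://www.example.org/users/jane/", timeout=10)
user = response.json()["user"]
print(user["name"], user["location"]["text"])

# A savvy client follows the embedded href with another simple GET request.
location = requests.get(user["location"]["href"], timeout=10)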

45 of 62

Cooking with data

Part 2:

Data sources

Part 1:

Data models

Part 3:

Data wrangling

46 of 62

Working with raw data sucks

Data comes in all shapes and sizes
– CSV files, PDFs, SQL dumps, .jpg, …

Different files have different formatting
– Empty string or space instead of NULL, extra header rows, character encoding, …

“Dirty” data
– Unwanted anomalies, duplicates


47 of 62

Raw data without thinking:

A recipe for disaster!


48 of 62

What is data wrangling?

  • Goal: extract and standardize the raw data
  • Combine multiple data sources
  • Clean data anomalies
  • Strategy: Combine automation with interactive visualizations to aid in cleaning


49 of 62


Wrangling takes between 50% and 80% of your time…

[Source]

50 of 62

Types of data problems

  • Missing data
  • Incorrect data
  • Inconsistent representations of the same data
  • About 75% of data problems require human intervention (e.g., experts, crowdsourcing, etc.)
  • Tradeoff between cleaning data vs. over-sanitizing data


51 of 62

“Dirty data” horror stories

“Dear Idiot” letter

17,000 men are pregnant

As the crow flies

[Source]


52 of 62

Diagnosing data problems

  • Visualizations and basic stats can convey issues in “raw” data
  • Different representations highlight different types of issues:
    • Outliers often stand out in the right kind of plot
    • Missing data will cause gaps or zero values in the right kind of plot
  • Becomes increasingly difficult as data gets larger
    • Sampling to the rescue!
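A minimal sketch of such quick diagnostics in Pandas (the tiny DataFrame is made up to show the kinds of issues that stand out):

import pandas as pd

df = pd.DataFrame({"age": [33, 15, 76, None, 220],
                   "country": ["CH", "ch", "CH", "FR", None]})

print(df.describe())                  # basic stats: the max age of 220 stands out
print(df.isna().sum())                # missing values per column
print(df["country"].value_counts())   # "CH" vs "ch": inconsistent representations
print(df.sample(n=3))                 # sampling: eyeball a few rows of a large table
df.hist(column="age")                 # outliers often stand out in the right plot (needs matplotlib)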


53 of 62

Facebook graph


54 of 62

Matrix view (1)

Automatic permutation of rows and columns to highlight patterns of connectivity


55 of 62

Matrix view (2)

Rows and columns sorted in the order in which data was retrieved via the Facebook API

Can you guess what’s going on here?


56 of 62

Viz at scale? Careful!


57 of 62

Before you start analyzing your data

  • “Do I have missing data?” “If data were missing, how could I know?”
  • “Do I have corrupted data?” (May arise from measurement errors, processing bugs, etc.)
  • Parse/transform data into appropriate format for your specific analysis (see “Part 1: Data models”)
  • Don’t be surprised if you need to come back to this stage!


58 of 62

Desiderata

It is always ideal if you can get your hands on the code and documentation behind the dataset you are analyzing (provenance)

It is also ideal if the provided data format is easily parseable (otherwise you may need regexes, or even paid human annotators)


59 of 62

Highly non-parseable data

Entire NY Times archive (since 1851) digitized as of 2015


60 of 62

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec2-feedback

Feedback form available for each lecture and lab session

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Is it nicer to follow the lecture online or offline?

61 of 62

Dealing with missing data

  • Set values to zero?
  • Interpolate based on existing data?
  • Omit missing data?

Knowledge about domain and data collection should drive your choice!
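A minimal sketch of the three options in Pandas (the toy series below stands in for a yearly count with one missing value; it is not the actual census data from the figure):

import numpy as np
import pandas as pd

counts = pd.Series([3.0, 3.4, np.nan, 4.1], index=[1870, 1880, 1890, 1900])

print(counts.fillna(0))         # set missing values to zero
print(counts.interpolate())     # interpolate based on existing data
print(counts.dropna())          # omit missing data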


U.S. census counts of people working as “farm laborers”; values from 1890 are missing due to the records being burned in a fire

62 of 62

Inconsistent data: “My name is Willy”

First name   Last name
Willy        NULL
...          ...