1
Applied Data Analysis (CS401)
Maria Brbic / Robert West
Lecture 2
Handling data
18 Sep 2024
Announcements
2
Deadlines
3
All deadlines are 23:59 CET
Feedback
4
Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec2-feedback
Feedback form available for each lecture and lab session
Cooking with data
Part 2:
Data sources
Part 1:
Data models
Part 3:
Data wrangling
Cooking with data
Part 2:
Data sources
Part 1:
Data models
Part 3:
Data wrangling
Simple definition: A data model specifies how you think about the world
Q: “How do you think about the world?”
entity
relationship
attributes
A: “See my entity–relationship diagram!”
Part 1:
Data models
entity
relationship
attributes
How to store my data on a computer?
Q: “How do you think about the world?”
A: “See my entity–relationship diagram!”
Q1: “How should I store my data on a computer?”
Q2: “How do I think about the world?”
Flat model
66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] "GET /support.html HTTP/1.1" 200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
111.111.111.111 - - [08/Oct/2007:11:17:55 -0400] "GET / HTTP/1.1" 200 10801 "http://www.google.com/search?q=in+love+with+ada+lovelace+what+to+do&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"
“The world is simple: one type of entity, all with the same attributes”
Q1: “How should I store my data on a computer?”
Q2: “How do I think about the world?”
Relational model
13
id | name |
1 | Bush |
2 | Trump |
3 | Obama |
president | successor |
1 | 3 |
3 | 2 |
“The world contains many types of entities, connected by relationships.”
Processing data in the relational model: SQL
14
SQL (Structured Query Language)
Declarative
(e.g., SQL)
Imperative
(e.g., Python)
SQL
15
16
POLLING TIME
SQL implementations
etc.
17
SQL and “SQL”
18
“SQL”: Pandas (Python library)
19
Pandas vs. SQL
+ Pandas is lightweight and fast
+ Natively Python, i.e., full SQL expressiveness plus the expressiveness of Python, especially for function evaluation
+ Integration with plotting functions like Matplotlib
- In Pandas, tables must fit into memory
- No post-load indexing functionality: indices are built when a table is created
- No transactions, journaling, etc. (matters for parallel applications)
- Large, complex joins are slower
20
“SQL”: Unix command line
21
cat users.txt \
| awk ‘$2 >= 18 && $2 <= 25’ \
| join -1 1 -2 1 url_visits.txt - \
| cut -f 4 \
| sort \
| uniq -c \
| sort -k 1,1 -n -r \
| head -n 5
user_id age
User145 33
User24 15
User5 76
…
user_id url
User2 ada.epfl.ch
User244 facebook.com
…
Goal: Find top 5 URLs visited most frequently by users between 18 and 25 years old
users.txt
url_visits.txt
Q1: “How should I store my data on a computer?”
Q2: “How do I think about the world?”
Document model
address
street,
city,
country
<contact>� <id>656</id>� <firstname>Chuck</firstname>� <lastname>Smith</lastname>� <phone>(123) 555-0178</phone>� <phone>(890) 555-0133</phone>� <address>� <street>Rue de l’Ale 8</street>� <city>Lausanne</city>� <zip>1007</zip>� <country>CH</country>� </address>�</contact>
contact
contact: {� id: 656,� firstname: “Chuck”,� lastname: “Smith”,� phones: [“(123) 555-0178”,� “(890) 555-0133”],� address: {� street: “Rue de l’Ale 8”,� city: “Lausanne”,� zip: 1007,� country: “CH”� }�}
“The world is a hierarchy of entities”
“Think for a minute”
Think for a minute:
If we want to use a relational DB (e.g., MySQL)
instead of XML, how can we store
several phone numbers for the same person?
(Feel free to discuss with your neighbor.)
<contact>� <id>656</id>� <firstname>Chuck</firstname>� <lastname>Smith</lastname>� <phone>(123) 555-0178</phone>� <phone>(890) 555-0133</phone>� <address>� <street>Rue de l’Ale 8</street>� <city>Lausanne</city>� <zip>1007</zip>� <country>CH</country>� </address>�</contact>
Solution to “Think for a minute”
id | first name | ... |
656 | Chuck | ... |
... | ... | ... |
id | phone |
656 | (123) 555-0178 |
656 | (890) 555-0133 |
... | ... |
<contact>� <id>656</id>� <firstname>Chuck</firstname>� <lastname>Smith</lastname>� <phone>(123) 555-0178</phone>� <phone>(890) 555-0133</phone>� <address>� <street>Rue de l’Ale 8</street>� <city>Lausanne</city>� <zip>1007</zip>� <country>CH</country>� </address>�</contact>
Processing XML and JSON
26
Q1: “How should I store my data on a computer?”
Q2: “How do I think about the world?”
Network model
“The world is a complex network of entities”
Commercial break
29
ADA: more than just rocket science!
“How should I store my data on a computer?”
—A word (or two) on binary formats
→ Consider converting to a binary format at the beginning of your processing pipeline (especially when using “big data”)
30
Cooking with data
Part 2:
Data sources
Part 1:
Data models
Part 3:
Data wrangling
Data sources at Web companies
Examples from Facebook / Meta
32
Structured data (with clear schema)
Semi-structured data
(“self-describing” structure; CSV etc.)
Unstructured data
Another example: Wikipedia
33
34
Wikipedia
How to work with Wikipedia?
35
Wikidata
36
37
Wikidata
38
39
Wikipedia pageview logs
Crawling and processing webpages: HTML
Plenty of bulk-downloadable HTML data:
… but if you need a specific website: use a crawler/“spider”: Apache Nutch, Storm, Heritrix 3, Scrapy, etc. (or simply wget…)
40
Useful HTML tools
Requests http://docs.python-requests.org/en/master/
An elegant and simple HTTP library for Python
Scrapy https://scrapy.org/
An open-source framework to build Web crawlers
Beautiful Soup http://www.crummy.com/software/BeautifulSoup/ �A Python API for handling real HTML
Plain ol’ /regular/expres*ion/s…
41
Schema.org: microformats for Web pages
42
Avatar
Director: James Cameron (born August 16, 1954)
…
Text as rendered by browser:
HTML under the hood:
Web services
43
REST example
← This resource is a description of a user named Jane
44
{
"user": {
"name": "Jane",
"gender": "female",
"location": {
"href": "http://www.example.org/us/ny/new_york",
"text": "New York"
}
}
}
Cooking with data
Part 2:
Data sources
Part 1:
Data models
Part 3:
Data wrangling
Working with raw data sucks
Data comes in all shapes and sizes�– CSV files, PDFs, SQL dumps, .jpg, …
Different files have different formatting�– Empty string or space instead of NULL, extra header rows, character encoding, …
“Dirty” data�– Unwanted anomalies, duplicates
46
Raw data without thinking:
A recipe for disaster!
47
What is data wrangling?
48
49
Types of data problems
50
“Dirty data” horror stories
51
Diagnosing data problems
52
Facebook graph
53
Matrix view (1)
Automatic permutation of rows and columns to highlight patterns of connectivity
54
Matrix view (2)
Rows and columns sorted in the order in which data was retrieved via the Facebook API
Can you guess what’s going on here?
55
Viz at scale? Careful!
56
Before you start analyzing your data
57
Desiderata
It’s always ideal if you can put your hands on the code/documentation about the dataset you are analyzing (provenance)
It’s always ideal if the provided data format is nicely parseable (otherwise you need regexes, or maybe even pay humans)
58
Highly non-parseable data
Entire NY Times archive (since 1851) digitized as of 2015
59
Feedback
60
Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec2-feedback
Feedback form available for each lecture and lab session
Dealing with missing data
Knowledge about domain and data collection should drive your choice!
61
U.S. census counts of people working as ‘‘farm laborers’’; values from 1890 are missing due to records being burned in a fire
Inconsistent data: “My name is Willy”
62
First name | Last name |
Willy | NULL |
... | ... |