1
Applied Data Analysis (CS401)
Robert West
Lecture 2
Handling data
23 Sep 2020
Announcements
2
Feedback
3
Give us feedback on this lecture here: https://go.epfl.ch/ada2020-lec2-feedback
Feedback form available for each lecture and lab session
Cooking with data
Part 2:
Data sources
Part 1:
Data models
Part 3:
Data wrangling
Cooking with data
Part 2:
Data sources
Part 3:
Data wrangling
Part 1:
Data models
Bob’s definition: A data model specifies how you think about the world
Part 1:
Data models
Q: “How do you think about the world?”
A: “See my entity–relationship diagram!”
entity
relationship
attributes
How to store my data on a computer?
Q1: “How should I store my data on a computer?”
Q2: “How do I think about the world?”
Flat model
66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] "GET /support.html HTTP/1.1" 200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
111.111.111.111 - - [08/Oct/2007:11:17:55 -0400] "GET / HTTP/1.1" 200 10801 "http://www.google.com/search?q=in+love+with+ada+lovelace+what+to+do&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"
Q1: “How should I store my data on a computer?”
Q2: “How do I think about the world?”
Relational model
11
id | name |
1 | Bush |
2 | Trump |
3 | Obama |
president | successor |
1 | 3 |
3 | 2 |
Processing data in the relational model: SQL
12
SQL
13
SQL implementations
etc.
14
SQL and “SQL”
15
“SQL”: Pandas (Python library)
16
Pandas vs. SQL
+ Pandas is lightweight and fast.
+ Natively Python, i.e., full SQL expressiveness plus the expressiveness of Python, especially for function evaluation.
+ Integration with plotting functions like Matplotlib.
- In Pandas, tables must fit into memory.
- No post-load indexing functionality: indices are built when a table is created.
- No transactions, journaling, etc. (matters for parallel applications)
- Large, complex joins are slower.
17
“SQL”: Unix command line
18
cat users.txt \
| awk ‘$2 >= 18 && $2 <= 25’ \
| join -1 1 -2 1 pages.txt - \
| cut -f 4 \
| sort \
| uniq -c \
| sort -k 1,1 -n -r \
| head -n 5
Q1: “How should I store my data on a computer?”
Q2: “How do I think about the world?”
Document model
address
street,
city,
country
<contact>� <id>656</id>� <firstname>Chuck</firstname>� <lastname>Smith</lastname>� <phone>(123) 555-0178</phone>� <phone>(890) 555-0133</phone>� <address>� <street>Rue de l’Ale 8</street>� <city>Lausanne</city>� <zip>1007</zip>� <country>CH</country>� </address>�</contact>
contact
contact: {� id: 656,� firstname: “Chuck”,� lastname: “Smith”,� phones: [“(123) 555-0178”,� “(890) 555-0133”],� address: {� street: “Rue de l’Ale 8”,� city: “Lausanne”,� zip: 1007,� country: “CH”� }�}
<contact>� <id>656</id>� <firstname>Chuck</firstname>� <lastname>Smith</lastname>� <phone>(123) 555-0178</phone>� <phone>(890) 555-0133</phone>� <address>� <street>Rue de l’Ale 8</street>� <city>Lausanne</city>� <zip>1007</zip>� <country>CH</country>� </address>�</contact>
CHAT ROULETTE!
Think for 1 minute:
If we want to use a relational DB (e.g., MySQL)
instead of XML, how can we store
2 phone numbers for the same person?
Solution to chat roulette puzzle
id | first name | ... |
656 | Chuck | ... |
... | ... | ... |
id | phone |
656 | (123) 555-0178 |
656 | (890) 555-0133 |
... | ... |
<contact>� <id>656</id>� <firstname>Chuck</firstname>� <lastname>Smith</lastname>� <phone>(123) 555-0178</phone>� <phone>(890) 555-0133</phone>� <address>� <street>Rue de l’Ale 8</street>� <city>Lausanne</city>� <zip>1007</zip>� <country>CH</country>� </address>�</contact>
Processing XML and JSON
23
Q1: “How should I store my data on a computer?”
Q2: “How do I think about the world?”
Network model
“How should I store my data on a computer?”
—A word (or two) on binary formats
→ Consider converting to a binary format at the beginning of your processing pipeline (especially when using “big data”)
26
Cooking with data
Part 3:
Data wrangling
Part 2:
Data sources
Part 1:
Data models
Data sources at Web companies
Examples from Facebook
28
Structured data (with clear schema)
Semi-structured data
(“self-describing” structure; CSV etc.)
Unstructured data
Another example: Wikipedia
29
30
Wikipedia
How to work with Wikipedia?
31
Wikidata
32
33
Crawling and processing webpages: HTML
Plenty of bulk-downloadable HTML data:
… but if you need a specific website: use a crawler/“spider”: Apache Nutch, Storm, Heritrix 3, Scrapy, etc. (or simply wget…)
34
Useful HTML tools
Requests http://docs.python-requests.org/en/master/
An elegant and simple HTTP library for Python
Scrapy https://scrapy.org/
An open-source framework to build Web crawlers
Beautiful Soup http://www.crummy.com/software/BeautifulSoup/ �A Python API for handling real HTML
Plain ol’ /regular/expres*ion/s…
35
Schema.org: microformats for Web pages
36
Web services
37
REST example
← This resource is a description of a user named Jane
38
{
"user": {
"name": "Jane",
"gender": "female",
"location": {
"href": "http://www.example.org/us/ny/new_york",
"text": "New York"
}
}
}
39
I said, “What’s that?” and he said, “That’s jazz.” “How do you write that?” And he spelt it out. Somehow I saw my name in there, and I liked this word.
Joe Zawinul
I said, “What’s that?” and he said, “That’s REST.” “How do you write that?” And he spelt it out. Somehow I saw my name in there, and I liked this word.
Robert West
Cooking with data
Part 3:
Data wrangling
Part 2:
Data sources
Part 1:
Data models
Working with raw data sucks
Data comes in all shapes and sizes�– CSV files, PDFs, SQL dumps, .jpg, …
Different files have different formatting�– Empty string or space instead of NULL, extra header rows, character encoding, …
“Dirty” data�– Unwanted anomalies, duplicates
41
Raw data without thinking:
A recipe for disaster!
42
What is data wrangling?
43
44
Types of data problems
45
“Dirty data” horror stories
46
Diagnosing data problems
47
Facebook graph
48
Matrix view (1)
Automatic permutation of rows and columns to highlight patterns of connectivity
49
Matrix view (2)
Rows and columns sorted in the order in which data was retrieved via the Facebook API
Can you guess what’s going on here?
50
51
CHAT ROULETTE!
Think for 1 minute:
What causes the white block�in the adjacency matrix?
Viz at scale? Careful!
52
Dealing with missing data
Knowledge about domain and data collection should drive your choice!
53
U.S. census counts of people working as ‘‘farm laborers’’; values from 1890 are missing due to records being burned in a fire
Inconsistent data: “My name is Willy”
54
First name | Last name |
Willy | NULL |
... | ... |
Before you start analyzing your data
55
Desiderata
It’s always ideal if you can put your hands on the code/documentation about the dataset you are analyzing (provenance)
It’s always ideal if the provided data format is nicely parseable (otherwise you need regexes, or maybe even pay humans)
56
Highly non-parseable data
Entire NY Times archive (since 1851) digitized as of 2015
57
Example from Bob’s research (AD 2013)
Q: What do people eat when and where?
58
Feedback
59
Give us feedback on this lecture here: https://go.epfl.ch/ada2020-lec2-feedback
Feedback form available for each lecture and lab session