1 of 59


Applied Data Analysis (CS401)

Robert West

Lecture 2

Handling data

23 Sep 2020

2 of 59

Announcements


  • Register your teams (3 people) here by Wed 30 Sep
    • Each team member must individually complete the form!
  • Project milestone P0 to be released this Fri 25 Sep, due on Fri 2 Oct
  • Friday’s lab session:
    • Intro to Pandas (very important for Homework 1 and rest of course)

3 of 59

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2020-lec2-feedback

Feedback form available for each lecture and lab session

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Is it nicer to follow the lecture online or offline?

4 of 59

Cooking with data

Part 1: Data models

Part 2: Data sources

Part 3: Data wrangling

5 of 59

Cooking with data

Part 1: Data models ← up next

Part 2: Data sources

Part 3: Data wrangling

6 of 59

Bob’s definition: A data model specifies how you think about the world

7 of 59

Part 1:

Data models

Q: “How do you think about the world?”

A: “See my entity–relationship diagram!”

[Entity–relationship diagram, labeling entities, relationships, and attributes]

How to store my data on a computer?

8 of 59

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

9 of 59

Flat model

66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] "GET /support.html HTTP/1.1" 200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

111.111.111.111 - - [08/Oct/2007:11:17:55 -0400] "GET / HTTP/1.1" 200 10801 "http://www.google.com/search?q=in+love+with+ada+lovelace+what+to+do&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"

  • Example: log files; e.g., Apache web server (httpd)
    • Entities = requests from clients to server

  • Another common format: CSV (“comma-separated values”)
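Since every record has the same fields in the same order, a single regular expression parses a line. A minimal Python sketch (the field names are labels of my choosing, not part of the log format):

import re

# One Apache log line = one flat record with a fixed set of attributes
log_line = ('66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] '
            '"GET /support.html HTTP/1.1" 200 11179 "-" '
            '"Mozilla/5.0 (compatible; Googlebot/2.1; '
            '+http://www.google.com/bot.html)"')

pattern = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)
fields = pattern.match(log_line).groupdict()
print(fields["ip"], fields["status"], fields["request"])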

10 of 59

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

11 of 59

Relational model

  • “The world contains many types of entities, connected by relationships”
  • The relational model is ubiquitous:
    • MySQL, PostgreSQL, Oracle, DB2, SQLite, …
    • You use it many times every day
  • Data represented as tables (“relations”) describing
    • entities,
    • relationships between entities
  • Most of the data we will use can be “reduced” to the relational model


presidents:

id | name
---+------
 1 | Bush
 2 | Trump
 3 | Obama

successions:

president | successor
----------+----------
    1     |     3
    3     |     2

12 of 59

Processing data in the relational model: SQL

  • Declarative language for core data manipulations
  • You think about what you want, not how to compute it


13 of 59

SQL


  • We expect you to know SQL (basics of DBs are a class prerequisite)
  • Need a refresher? → Watch/do online tutorials!
  • Key concepts:
    • Select (!), update, delete
    • Unique keys
    • Joins (inner, left outer, right outer, full)
    • Sorting
    • Aggregation (group by, count, min, max, avg, etc.)
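As a quick refresher, a runnable sketch of several of these concepts, using Python’s built-in sqlite3 module and the presidents tables from the previous slide:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE presidents  (id INTEGER PRIMARY KEY, name TEXT);  -- unique key
    CREATE TABLE successions (president INTEGER, successor INTEGER);
    INSERT INTO presidents  VALUES (1, 'Bush'), (2, 'Trump'), (3, 'Obama');
    INSERT INTO successions VALUES (1, 3), (3, 2);
""")

# Inner joins resolve ids to names; ORDER BY sorts the result
query = """
    SELECT pred.name, succ.name
    FROM successions
    JOIN presidents AS pred ON pred.id = successions.president
    JOIN presidents AS succ ON succ.id = successions.successor
    ORDER BY pred.name;
"""
for pred, succ in con.execute(query):
    print(pred, "→", succ)  # Bush → Obama, Obama → Trump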

14 of 59

SQL implementations

[Logos of popular implementations: MySQL, PostgreSQL, Oracle, SQLite, etc.]

15 of 59

SQL and “SQL”

  • The declarative-programming principles of SQL are widespread, even where they’re less obvious


16 of 59

“SQL”: Pandas (Python library)

  • Similar to SQL (declarative), with additional elements of functional programming (map(), filter(), etc.)
  • SQL “table” ←→ Pandas “DataFrame”
  • Need intro? Come to Friday’s lab session!
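To make the correspondence concrete, a small sketch with hypothetical DataFrames mirroring the earlier presidents example (merge plays the role of JOIN, sort_values of ORDER BY):

import pandas as pd

presidents = pd.DataFrame({"id": [1, 2, 3], "name": ["Bush", "Trump", "Obama"]})
successions = pd.DataFrame({"president": [1, 3], "successor": [3, 2]})

# Two merges resolve both ids to names (cf. the SQL double join)
merged = successions.merge(
    presidents, left_on="president", right_on="id"
).merge(
    presidents, left_on="successor", right_on="id", suffixes=("_pred", "_succ")
)

print(merged[["name_pred", "name_succ"]].sort_values("name_pred"))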


17 of 59

Pandas vs. SQL

+ Pandas is lightweight and fast.

+ Natively Python, i.e., full SQL expressiveness plus the expressiveness of Python, especially for function evaluation.

+ Integration with plotting functions like Matplotlib.

- In Pandas, tables must fit into memory.

- No post-load indexing functionality: indices are built when a table is created.

- No transactions, journaling, etc. (matters for parallel applications)

- Large, complex joins are slower.


18 of 59

“SQL”: Unix command line


cat users.txt \
| awk '$2 >= 18 && $2 <= 25' \
| join -1 1 -2 1 pages.txt - \
| cut -f 4 \
| sort \
| uniq -c \
| sort -k 1,1 -n -r \
| head -n 5
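Read stage by stage, the pipeline is a declarative query in disguise (assuming users.txt carries the age in field 2, and that field 4 of the joined lines is a page name):

  • awk '$2 >= 18 && $2 <= 25' → WHERE age BETWEEN 18 AND 25
  • join -1 1 -2 1 → inner join on the first field of both inputs
  • cut -f 4 → SELECT page
  • sort | uniq -c → GROUP BY page + COUNT(*)
  • sort -k 1,1 -n -r → ORDER BY count DESC
  • head -n 5 → LIMIT 5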

19 of 59

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

20 of 59

Document model

  • “The world is a hierarchy of entities”

[Tree diagram: a contact entity containing an address entity with street, city, and country attributes]

  • XML format:

<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l’Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>

  • JSON format:

{
  "contact": {
    "id": 656,
    "firstname": "Chuck",
    "lastname": "Smith",
    "phones": ["(123) 555-0178", "(890) 555-0133"],
    "address": {
      "street": "Rue de l’Ale 8",
      "city": "Lausanne",
      "zip": 1007,
      "country": "CH"
    }
  }
}

21 of 59

  • Document model: the XML contact record from the previous slide, storing two <phone> elements for one person

CHAT ROULETTE!

Think for 1 minute:

If we want to use a relational DB (e.g., MySQL) instead of XML, how can we store 2 phone numbers for the same person?

  • Then: chat with a fellow student for 3 minutes
    • Rolex Forum: Talk to neighbor (priority: left, right)
    • Zoom: You’ll be randomized into small “breakout rooms”

22 of 59

Solution to chat roulette puzzle

contacts table:

id  | first name | ...
----+------------+-----
656 | Chuck      | ...
... | ...        | ...

phones table:

id  | phone
----+----------------
656 | (123) 555-0178
656 | (890) 555-0133
... | ...

  • Same data in the relational model: phone numbers move to a separate table, keyed by contact id, with one row per number
  • Document model: repeated <phone> elements nested inside <contact> (see previous slide)

23 of 59

Processing XML and JSON

  • Document structure = tree
  • Processing via tree traversal (depth- or breadth-first search)
  • Or use a proper query language, such as the one implemented by jq (see the sketch below)
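A minimal standard-library sketch of both routes in Python, reusing a trimmed version of the contact record from slide 20:

import json
import xml.etree.ElementTree as ET

xml_doc = ("<contact><id>656</id><firstname>Chuck</firstname>"
           "<phone>(123) 555-0178</phone><phone>(890) 555-0133</phone></contact>")

# XML: parse into a tree, then traverse it (iter() walks depth-first)
root = ET.fromstring(xml_doc)
phones = [el.text for el in root.iter("phone")]
print(phones)  # ['(123) 555-0178', '(890) 555-0133']

# JSON: parses directly into nested dicts and lists
record = json.loads('{"contact": {"id": 656, "phones": ["(123) 555-0178"]}}')
print(record["contact"]["phones"])

On the command line, the jq equivalent of the last lookup would be something like jq '.contact.phones'.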


24 of 59

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

25 of 59

Network model

  • “The world is a complex network of entities”
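The slides name no tooling here; as one possible illustration (my choice, not from the lecture), the networkx Python library represents such data directly as nodes and edges:

import networkx as nx

# Toy network: nodes are entities, edges are relationships
g = nx.Graph()
g.add_edge("Ada", "Chuck")
g.add_edge("Chuck", "Grace")
g.add_edge("Grace", "Alan")

# Network-model queries follow links instead of scanning tables
print(list(g.neighbors("Chuck")))          # ['Ada', 'Grace']
print(nx.shortest_path(g, "Ada", "Alan"))  # ['Ada', 'Chuck', 'Grace', 'Alan']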

26 of 59

“How should I store my data on a computer?”

—A word (or two) on binary formats

  • Binary format often the key to performance, avoiding expensive parsing
  • Modern binary formats support nested structures, various levels of schema enforcement, compression, etc.
  • Python pickle, Java Serializable, Protocol Buffers (Google), Avro (supports schema evolution), Parquet (column-oriented), etc.

→ Consider converting to a binary format at the beginning of your processing pipeline (especially when using “big data”)
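For example, the conversion step at the head of a Pandas pipeline might look like this (file names are placeholders; Parquet support requires the pyarrow or fastparquet package):

import pandas as pd

# Parse the text format once, at the start of the pipeline ...
df = pd.read_csv("events.csv")
df.to_parquet("events.parquet")  # column-oriented, compressed binary format

# ... so that all later stages skip the expensive parsing
df = pd.read_parquet("events.parquet")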


27 of 59

Cooking with data

Part 1: Data models

Part 2: Data sources ← up next

Part 3: Data wrangling

28 of 59

Data sources at Web companies

Examples from Facebook

    • Application databases
    • Web server logs
    • Event logs
    • API server logs
    • Ad server logs
    • Search server logs
    • Advertisement landing page content
    • Wikipedia
    • Images and video


These sources range from structured to unstructured:

  • Structured data (with a clear schema)
  • Semi-structured data (“self-describing” structure; CSV etc.)
  • Unstructured data (e.g., images, video)

29 of 59

Another example: Wikipedia

  • 200+ languages
  • Over 50 million entities
  • Mind-boggling richness of data


30 of 59


31 of 59

Wikipedia

How to work with Wikipedia?

  • XML dumps with wiki markup, SQL database dumps
  • Issues: Unicode, size, recency, etc.
  • To make your life easier:
    • Find projects on GitHub to help you
    • Use more structured versions (see next slide)


32 of 59

Wikidata

  • “Database version” of Wikipedia
  • {fr:Suisse, de:Schweiz, it:Svizzera, en:Switzerland, …} → Q39
  • Both API access and full database dumps
  • Available as
    • JSON (document model)
    • RDF (network model)
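For example, fetching Q39 via the API from Python (a minimal sketch; Special:EntityData returns the item in the JSON document model):

import requests

url = "https://www.wikidata.org/wiki/Special:EntityData/Q39.json"
entity = requests.get(url).json()["entities"]["Q39"]

# Labels in many languages, all attached to the same item Q39
print(entity["labels"]["en"]["value"])  # Switzerland
print(entity["labels"]["de"]["value"])  # Schweiz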


33 of 59


34 of 59

Crawling and processing webpages: HTML

Plenty of bulk-downloadable HTML data:

  • Common Crawl dataset, about 1.82 billion web pages -- huge!
  • (… but less than 0.1% of Google’s Web crawl, as of 2015)
  • 145 TB, hosted on Amazon S3, also available for download

… but if you need a specific website: use a crawler/spider: Apache Nutch, StormCrawler, Heritrix 3, Scrapy, etc. (or simply wget…)


35 of 59

Useful HTML tools

Requests http://docs.python-requests.org/en/master/

An elegant and simple HTTP library for Python

Scrapy https://scrapy.org/

An open-source framework to build Web crawlers

Beautiful Soup http://www.crummy.com/software/BeautifulSoup/

A Python API for handling real HTML

Plain ol’ /regular/expres*ion/s…
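A tiny scraping sketch combining Requests and Beautiful Soup (URL and tag choice are placeholders):

import requests
from bs4 import BeautifulSoup

# Fetch the page, then let Beautiful Soup cope with real-world HTML
html = requests.get("http://www.example.org/").text
soup = BeautifulSoup(html, "html.parser")

# Extract all link targets on the page
links = [a.get("href") for a in soup.find_all("a")]
print(links)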


36 of 59

Schema.org: microformats for Web pages

  • Nuggets of structured information embedded in (semantically) unstructured HTML


37 of 59

Web services

  • Most large web sites today actively discourage screen-scraping to get their content
  • Instead: Web service APIs, for interoperable machine-to-machine interaction over a network
  • The preferred way to get data from online sources
  • Most common framework: REST
    • You request a URL from the server via HTTP
    • The server responds with a text file (e.g., JSON, XML, plain text)


38 of 59

REST example

The JSON resource below describes a user named Jane

  • Requested by sending a GET request for the resource’s URL, e.g., via curl: curl http://www.example.org/users/jane/
  • If a client needs to modify the resource, it GETs it, modifies it, and PUTs it back
  • The href to the location resource allows savvy clients to get more information with another simple GET request
  • Implication: Clients cannot be too “thin”; need to understand resource formats!


{
  "user": {
    "name": "Jane",
    "gender": "female",
    "location": {
      "href": "http://www.example.org/us/ny/new_york",
      "text": "New York"
    }
  }
}
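The read-modify-write pattern from the bullets, sketched in Python against the slide’s fictional URLs (not a real service):

import requests

url = "http://www.example.org/users/jane/"
resource = requests.get(url).json()                # GET the resource

resource["user"]["location"]["text"] = "Brooklyn"  # modify it locally
requests.put(url, json=resource)                   # PUT it back

# Savvy clients follow the href for more detail with another GET
location = requests.get(resource["user"]["location"]["href"]).json()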

39 of 59


I said, “What’s that?” and he said, “That’s jazz.” “How do you write that?” And he spelt it out. Somehow I saw my name in there, and I liked this word.

Joe Zawinul

I said, “What’s that?” and he said, “That’s REST.” “How do you write that?” And he spelt it out. Somehow I saw my name in there, and I liked this word.

Robert West

40 of 59

Cooking with data

Part 1: Data models

Part 2: Data sources

Part 3: Data wrangling ← up next

41 of 59

Working with raw data sucks

Data comes in all shapes and sizes: CSV files, PDFs, SQL dumps, .jpg, …

Different files have different formatting: empty string or space instead of NULL, extra header rows, character encoding, …

“Dirty” data: unwanted anomalies, duplicates
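Many of these quirks can be absorbed at load time. A sketch with Pandas (file name and option values are assumptions about the data at hand):

import pandas as pd

df = pd.read_csv(
    "survey.csv",
    na_values=["", " ", "NULL", "N/A"],  # map ad-hoc NULL encodings to NaN
    skiprows=2,                          # skip extra header rows
    encoding="utf-8",                    # be explicit about the encoding
)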


42 of 59

Raw data without thinking:

A recipe for disaster!


43 of 59

What is data wrangling?

  • Goal: extract and standardize the raw data
  • Combine multiple data sources
  • Clean data anomalies
  • Strategy: Combine automation with interactive visualizations to aid in cleaning
  • Outcome: Improve efficiency and scale of data importing
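A toy sketch of these steps in Pandas (the data is invented for illustration):

import pandas as pd

df_a = pd.DataFrame({"name": [" Chuck ", "chuck", "Ada"], "age": [44, 44, 36]})
df_b = pd.DataFrame({"name": ["Chuck", "Ada"], "city": ["Lausanne", "London"]})

# Standardize representations, then remove the resulting exact duplicates
for df in (df_a, df_b):
    df["name"] = df["name"].str.strip().str.title()
df_a = df_a.drop_duplicates()

# Combine the two sources on the cleaned key
merged = df_a.merge(df_b, on="name", how="left")
print(merged)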


44 of 59


Wrangling takes between 50% and 80% of your time

[Source]

45 of 59

Types of data problems

  • Missing data
  • Incorrect data
  • Inconsistent representations of the same data
  • About 75% of data problems require human intervention (experts, crowdsourcing, etc.)
  • Tradeoff between cleaning data vs. over-sanitizing data


46 of 59

“Dirty data” horror stories

“Dear Idiot” letter

17,000 men are pregnant

As the crow flies

CHF 10,000 compute-cluster bill

[Source]


47 of 59

Diagnosing data problems

  • Visualizations and basic stats can convey issues in “raw” data
  • Different representations highlight different types of issues:
    • Outliers often stand out in the right kind of plot
    • Missing data will cause gaps or zero values in the right kind of plot
  • Becomes increasingly difficult as data gets larger
    • Sampling to the rescue!
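For instance, a first diagnostic pass in Pandas (the DataFrame here is synthetic; plotting requires Matplotlib):

import numpy as np
import pandas as pd

# Synthetic stand-in for a large table, with some injected missing values
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(size=1_000_000)})
df.loc[rng.choice(len(df), 1000), "value"] = None

sample = df.sample(n=100_000, random_state=0)  # downsample for quick inspection
print(sample.describe())             # outliers stand out in min/max
print(sample.isna().mean())          # fraction of missing values per column
sample["value"].plot.hist(bins=100)  # gaps and outliers show up visually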


48 of 59

Facebook graph


49 of 59

Matrix view (1)

Automatic permutation of rows and columns to highlight patterns of connectivity


50 of 59

Matrix view (2)

Rows and columns sorted in the order in which data was retrieved via the Facebook API

Can you guess what’s going on here?


51 of 59


CHAT ROULETTE!

Think for 1 minute:

What causes the white block in the adjacency matrix?

  • Then: chat with a fellow student for 3 minutes
    • Rolex Forum: Talk to neighbor (priority: left, right)
    • Zoom: You’ll be randomized into small “breakout rooms”

52 of 59

Viz at scale? Careful!


53 of 59

Dealing with missing data

  • Set values to zero?
  • Interpolate based on existing data?
  • Omit missing data?

Knowledge about domain and data collection should drive your choice!


U.S. census counts of people working as “farm laborers”; values from 1890 are missing due to records being burned in a fire
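A sketch of the three options in Pandas, on an invented toy series with a gap (these are not the actual census numbers):

import pandas as pd

counts = pd.Series([400.0, 450.0, None, 520.0, 560.0],
                   index=[1880, 1885, 1890, 1895, 1900])

zeroed = counts.fillna(0)      # set the missing value to zero
interp = counts.interpolate()  # linear interpolation from the neighbors
omitted = counts.dropna()      # omit the missing year entirely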

54 of 59

Inconsistent data: “My name is Willy”


First name | Last name
-----------+----------
Willy      | NULL
...        | ...

55 of 59

Before you start analyzing your data

  • “Do I have missing data?” “If data were missing, how could I know?”
  • “Do I have corrupted data?” (May arise from measurement errors, wrong sampling strategies, etc.)
  • Parse/transform data into appropriate format for your specific analysis (see “Part 1: Data models”)
  • Don’t be surprised if you need to come back to this stage!


56 of 59

Desiderata

Ideally, you can get your hands on the code/documentation for the dataset you are analyzing (provenance)

Ideally, the provided data format is nicely parseable (otherwise you need regexes, or maybe even to pay humans)


57 of 59

Highly non-parseable data

Entire NY Times archive (since 1851) digitized as of 2015


58 of 59

Example from Bob’s research (AD 2013)

Q: What do people eat when and where?


59 of 59

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2020-lec2-feedback

Feedback form available for each lecture and lab session

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Is it nicer to follow the lecture online or offline?