1 of 59


Applied Data Analysis (CS401)

Robert West

Lecture 2

Handling data

23 Sep 2020

2 of 59

Announcements


  • Register your teams (3 people) here by Wed 30 Sep
    • Each team member must individually complete the form!
  • Project milestone P0 to be released this Fri 25 Sep, due on Fri 2 Oct
  • Friday’s lab session:
    • Intro to Pandas (very important for Homework 1 and rest of course)

3 of 59

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2020-lec2-feedback

Feedback form available for each lecture and lab session

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Is it nicer to follow the lecture online or offline?

4 of 59

Cooking with data

Part 1: Data models

Part 2: Data sources

Part 3: Data wrangling

5 of 59

Cooking with data

Part 1: Data models ← up next

Part 2: Data sources

Part 3: Data wrangling

6 of 59

Bob’s definition: A data model specifies how you think about the world

7 of 59

Part 1:

Data models

Q: “How do you think about the world?”

A: “See my entity–relationship diagram!”

[Entity–relationship diagram, labeling entities, relationships, and attributes]

How to store my data on a computer?

8 of 59

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

9 of 59

Flat model

66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] "GET /support.html HTTP/1.1" 200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

111.111.111.111 - - [08/Oct/2007:11:17:55 -0400] "GET / HTTP/1.1" 200 10801 "http://www.google.com/search?q=in+love+with+ada+lovelace+what+to+do&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"

  • Example: log files; e.g., Apache web server (httpd)
    • Entities = requests from clients to server

  • Another common format: CSV (“comma-separated values”)
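Since every record has the same fields in the same order, a single regular expression parses a line. A minimal Python sketch (the field names are labels of my choosing, not part of the log format):

import re

# One Apache log line = one flat record with a fixed set of attributes
log_line = ('66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] '
            '"GET /support.html HTTP/1.1" 200 11179 "-" '
            '"Mozilla/5.0 (compatible; Googlebot/2.1; '
            '+http://www.google.com/bot.html)"')

pattern = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)
fields = pattern.match(log_line).groupdict()
print(fields["ip"], fields["status"], fields["request"])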

10 of 59

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

11 of 59

Relational model

  • “The world contains many types of entities, connected by relationships”
  • The relational model is ubiquitous:
    • MySQL, PostgreSQL, Oracle, DB2, SQLite, …
    • You use it many times every day
  • Data represented as tables (“relations”) describing
    • entities,
    • relationships between entities
  • Most of the data we will use can be “reduced” to the relational model


presidents:

id | name
---+------
 1 | Bush
 2 | Trump
 3 | Obama

successions:

president | successor
----------+----------
    1     |     3
    3     |     2

12 of 59

Processing data in the relational model: SQL

  • Declarative language for core data manipulations
  • You think about what you want, not how to compute it


13 of 59

SQL


  • We expect you to know SQL (basics of DBs are a class prerequisite)
  • Need a refresher? → Watch/do online tutorials!
  • Key concepts:
    • Select (!), update, delete
    • Unique keys
    • Joins (inner, left outer, right outer, full)
    • Sorting
    • Aggregation (group by, count, min, max, avg, etc.)
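As a quick refresher, a runnable sketch of several of these concepts, using Python’s built-in sqlite3 module and the presidents tables from the previous slide:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE presidents  (id INTEGER PRIMARY KEY, name TEXT);  -- unique key
    CREATE TABLE successions (president INTEGER, successor INTEGER);
    INSERT INTO presidents  VALUES (1, 'Bush'), (2, 'Trump'), (3, 'Obama');
    INSERT INTO successions VALUES (1, 3), (3, 2);
""")

# Inner joins resolve ids to names; ORDER BY sorts the result
query = """
    SELECT pred.name, succ.name
    FROM successions
    JOIN presidents AS pred ON pred.id = successions.president
    JOIN presidents AS succ ON succ.id = successions.successor
    ORDER BY pred.name;
"""
for pred, succ in con.execute(query):
    print(pred, "→", succ)  # Bush → Obama, Obama → Trump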

14 of 59

SQL implementations

[Logos of popular implementations: MySQL, PostgreSQL, Oracle, SQLite, etc.]

15 of 59

SQL and “SQL”

  • The declarative-programming principles of SQL are widespread, even where they’re less obvious


16 of 59

“SQL”: Pandas (Python library)

  • Similar to SQL (declarative), with additional elements of functional programming (map(), filter(), etc.)
  • SQL “table” ←→ Pandas “DataFrame”
  • Need intro? Come to Friday’s lab session!
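To make the correspondence concrete, a small sketch with hypothetical DataFrames mirroring the earlier presidents example (merge plays the role of JOIN, sort_values of ORDER BY):

import pandas as pd

presidents = pd.DataFrame({"id": [1, 2, 3], "name": ["Bush", "Trump", "Obama"]})
successions = pd.DataFrame({"president": [1, 3], "successor": [3, 2]})

# Two merges resolve both ids to names (cf. the SQL double join)
merged = successions.merge(
    presidents, left_on="president", right_on="id"
).merge(
    presidents, left_on="successor", right_on="id", suffixes=("_pred", "_succ")
)

print(merged[["name_pred", "name_succ"]].sort_values("name_pred"))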


17 of 59

Pandas vs. SQL

+ Pandas is lightweight and fast.

+ Natively Python, i.e., full SQL expressiveness plus the expressiveness of Python, especially for function evaluation.

+ Integration with plotting functions like Matplotlib.

- In Pandas, tables must fit into memory.

- No post-load indexing functionality: indices are built when a table is created.

- No transactions, journaling, etc. (matters for parallel applications)

- Large, complex joins are slower.


18 of 59

“SQL”: Unix command line


cat users.txt \
| awk '$2 >= 18 && $2 <= 25' \
| join -1 1 -2 1 pages.txt - \
| cut -f 4 \
| sort \
| uniq -c \
| sort -k 1,1 -n -r \
| head -n 5
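Read stage by stage, the pipeline is a declarative query in disguise (assuming users.txt carries the age in field 2, and that field 4 of the joined lines is a page name):

  • awk '$2 >= 18 && $2 <= 25' → WHERE age BETWEEN 18 AND 25
  • join -1 1 -2 1 → inner join on the first field of both inputs
  • cut -f 4 → SELECT page
  • sort | uniq -c → GROUP BY page + COUNT(*)
  • sort -k 1,1 -n -r → ORDER BY count DESC
  • head -n 5 → LIMIT 5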

19 of 59

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

20 of 59

Document model

  • “The world is a hierarchy of entities”

[Tree diagram: a contact entity containing an address entity with street, city, and country attributes]

  • XML format:

<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l’Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>

  • JSON format:

{
  "contact": {
    "id": 656,
    "firstname": "Chuck",
    "lastname": "Smith",
    "phones": ["(123) 555-0178", "(890) 555-0133"],
    "address": {
      "street": "Rue de l’Ale 8",
      "city": "Lausanne",
      "zip": 1007,
      "country": "CH"
    }
  }
}

21 of 59

  • Document model: the XML contact record from the previous slide, storing two <phone> elements for one person

CHAT ROULETTE!

Think for 1 minute:

If we want to use a relational DB (e.g., MySQL) instead of XML, how can we store 2 phone numbers for the same person?

  • Then: chat with a fellow student for 3 minutes
    • Rolex Forum: Talk to neighbor (priority: left, right)
    • Zoom: You’ll be randomized into small “breakout rooms”

22 of 59

Solution to chat roulette puzzle

contacts table:

id  | first name | ...
----+------------+-----
656 | Chuck      | ...
... | ...        | ...

phones table:

id  | phone
----+----------------
656 | (123) 555-0178
656 | (890) 555-0133
... | ...

  • Same data in the relational model: phone numbers move to a separate table, keyed by contact id, with one row per number
  • Document model: repeated <phone> elements nested inside <contact> (see previous slide)

23 of 59

Processing XML and JSON

  • Document structure = tree
  • Processing via tree traversal (depth- or breadth-first search)
  • Or use a proper query language, such as the one implemented by jq (see the sketch below)
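A minimal standard-library sketch of both routes in Python, reusing a trimmed version of the contact record from slide 20:

import json
import xml.etree.ElementTree as ET

xml_doc = ("<contact><id>656</id><firstname>Chuck</firstname>"
           "<phone>(123) 555-0178</phone><phone>(890) 555-0133</phone></contact>")

# XML: parse into a tree, then traverse it (iter() walks depth-first)
root = ET.fromstring(xml_doc)
phones = [el.text for el in root.iter("phone")]
print(phones)  # ['(123) 555-0178', '(890) 555-0133']

# JSON: parses directly into nested dicts and lists
record = json.loads('{"contact": {"id": 656, "phones": ["(123) 555-0178"]}}')
print(record["contact"]["phones"])

On the command line, the jq equivalent of the last lookup would be something like jq '.contact.phones'.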


24 of 59

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

25 of 59

Network model

  • “The world is a complex network of entities”
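The slides name no tooling here; as one possible illustration (my choice, not from the lecture), the networkx Python library represents such data directly as nodes and edges:

import networkx as nx

# Toy network: nodes are entities, edges are relationships
g = nx.Graph()
g.add_edge("Ada", "Chuck")
g.add_edge("Chuck", "Grace")
g.add_edge("Grace", "Alan")

# Network-model queries follow links instead of scanning tables
print(list(g.neighbors("Chuck")))          # ['Ada', 'Grace']
print(nx.shortest_path(g, "Ada", "Alan"))  # ['Ada', 'Chuck', 'Grace', 'Alan']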

26 of 59

“How should I store my data on a computer?”

—A word (or two) on binary formats

  • Binary format often the key to performance, avoiding expensive parsing
  • Modern binary formats support nested structures, various levels of schema enforcement, compression, etc.
  • Python pickle, Java Serializable, Protocol Buffers (Google), Avro (supports schema evolution), Parquet (column-oriented), etc.

→ Consider converting to a binary format at the beginning of your processing pipeline (especially when using “big data”)
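For example, the conversion step at the head of a Pandas pipeline might look like this (file names are placeholders; Parquet support requires the pyarrow or fastparquet package):

import pandas as pd

# Parse the text format once, at the start of the pipeline ...
df = pd.read_csv("events.csv")
df.to_parquet("events.parquet")  # column-oriented, compressed binary format

# ... so that all later stages skip the expensive parsing
df = pd.read_parquet("events.parquet")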


27 of 59

Cooking with data

Part 1: Data models

Part 2: Data sources ← up next

Part 3: Data wrangling

28 of 59

Data sources at Web companies

Examples from Facebook

    • Application databases
    • Web server logs
    • Event logs
    • API server logs
    • Ad server logs
    • Search server logs
    • Advertisement landing page content
    • Wikipedia
    • Images and video


These sources range from structured to unstructured:

  • Structured data (with a clear schema)
  • Semi-structured data (“self-describing” structure; CSV etc.)
  • Unstructured data (e.g., images, video)

29 of 59

Another example: Wikipedia

  • 200+ languages
  • Over 50 million entities
  • Mind-boggling richness of data


30 of 59


31 of 59

Wikipedia

How to work with Wikipedia?

  • XML dumps with wiki markup, SQL database dumps
  • Issues: Unicode, size, recency, etc.
  • To make your life easier:
    • Find projects on GitHub to help you
    • Use more structured versions (see next slide)


32 of 59

Wikidata

  • “Database version” of Wikipedia
  • {fr:Suisse, de:Schweiz, it:Svizzera, en:Switzerland, …} → Q39
  • Both API access and full database dumps
  • Available as
    • JSON (document model)
    • RDF (network model)
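For example, fetching Q39 via the API from Python (a minimal sketch; Special:EntityData returns the item in the JSON document model):

import requests

url = "https://www.wikidata.org/wiki/Special:EntityData/Q39.json"
entity = requests.get(url).json()["entities"]["Q39"]

# Labels in many languages, all attached to the same item Q39
print(entity["labels"]["en"]["value"])  # Switzerland
print(entity["labels"]["de"]["value"])  # Schweiz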


33 of 59


34 of 59

Crawling and processing webpages: HTML

Plenty of bulk-downloadable HTML data:

  • Common Crawl dataset, about 1.82 billion web pages -- huge!
  • (… but less than 0.1% of Google’s Web crawl, as of 2015)
  • 145 TB, hosted on Amazon S3, also available for download

… but if you need a specific website: use a crawler/spider: Apache Nutch, StormCrawler, Heritrix 3, Scrapy, etc. (or simply wget…)


35 of 59

Useful HTML tools

Requests http://docs.python-requests.org/en/master/

An elegant and simple HTTP library for Python

Scrapy https://scrapy.org/

An open-source framework to build Web crawlers

Beautiful Soup http://www.crummy.com/software/BeautifulSoup/

A Python API for handling real HTML

Plain ol’ /regular/expres*ion/s…
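A tiny scraping sketch combining Requests and Beautiful Soup (URL and tag choice are placeholders):

import requests
from bs4 import BeautifulSoup

# Fetch the page, then let Beautiful Soup cope with real-world HTML
html = requests.get("http://www.example.org/").text
soup = BeautifulSoup(html, "html.parser")

# Extract all link targets on the page
links = [a.get("href") for a in soup.find_all("a")]
print(links)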


36 of 59

Schema.org: microformats for Web pages

  • Nuggets of structured information embedded in (semantically) unstructured HTML


37 of 59

Web services

  • Most large web sites today actively discourage screen-scraping to get their content
  • Instead: Web service APIs, for interoperable machine-to-machine interaction over a network
  • The preferred way to get data from online sources
  • Most common framework: REST
    • You request a URL from the server via HTTP
    • The server responds with a text file (e.g., JSON, XML, plain text)


38 of 59

REST example

The JSON resource below describes a user named Jane

  • Requested by sending a GET request for the resource’s URL, e.g., via curl: curl http://www.example.org/users/jane/
  • If a client needs to modify the resource, it GETs it, modifies it, and PUTs it back
  • The href to the location resource allows savvy clients to get more information with another simple GET request
  • Implication: Clients cannot be too “thin”; need to understand resource formats!


{
  "user": {
    "name": "Jane",
    "gender": "female",
    "location": {
      "href": "http://www.example.org/us/ny/new_york",
      "text": "New York"
    }
  }
}
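The read-modify-write pattern from the bullets, sketched in Python against the slide’s fictional URLs (not a real service):

import requests

url = "http://www.example.org/users/jane/"
resource = requests.get(url).json()                # GET the resource

resource["user"]["location"]["text"] = "Brooklyn"  # modify it locally
requests.put(url, json=resource)                   # PUT it back

# Savvy clients follow the href for more detail with another GET
location = requests.get(resource["user"]["location"]["href"]).json()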

39 of 59


I said, “What’s that?” and he said, “That’s jazz.” “How do you write that?” And he spelt it out. Somehow I saw my name in there, and I liked this word.

Joe Zawinul

I said, “What’s that?” and he said, “That’s REST.” “How do you write that?” And he spelt it out. Somehow I saw my name in there, and I liked this word.

Robert West

40 of 59

Cooking with data

Part 1: Data models

Part 2: Data sources

Part 3: Data wrangling ← up next

41 of 59

Working with raw data sucks

Data comes in all shapes and sizes: CSV files, PDFs, SQL dumps, .jpg, …

Different files have different formatting: empty string or space instead of NULL, extra header rows, character encoding, …

“Dirty” data: unwanted anomalies, duplicates
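Many of these quirks can be absorbed at load time. A sketch with Pandas (file name and option values are assumptions about the data at hand):

import pandas as pd

df = pd.read_csv(
    "survey.csv",
    na_values=["", " ", "NULL", "N/A"],  # map ad-hoc NULL encodings to NaN
    skiprows=2,                          # skip extra header rows
    encoding="utf-8",                    # be explicit about the encoding
)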


42 of 59

Raw data without thinking:

A recipe for disaster!


43 of 59

What is data wrangling?

  • Goal: extract and standardize the raw data
  • Combine multiple data sources
  • Clean data anomalies
  • Strategy: Combine automation with interactive visualizations to aid in cleaning
  • Outcome: Improve efficiency and scale of data importing
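A toy sketch of these steps in Pandas (the data is invented for illustration):

import pandas as pd

df_a = pd.DataFrame({"name": [" Chuck ", "chuck", "Ada"], "age": [44, 44, 36]})
df_b = pd.DataFrame({"name": ["Chuck", "Ada"], "city": ["Lausanne", "London"]})

# Standardize representations, then remove the resulting exact duplicates
for df in (df_a, df_b):
    df["name"] = df["name"].str.strip().str.title()
df_a = df_a.drop_duplicates()

# Combine the two sources on the cleaned key
merged = df_a.merge(df_b, on="name", how="left")
print(merged)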


44 of 59


Wrangling takes between 50% and 80% of your time

[Source]

45 of 59

Types of data problems

  • Missing data
  • Incorrect data
  • Inconsistent representations of the same data
  • About 75% of data problems require human intervention (experts, crowdsourcing, etc.)
  • Tradeoff between cleaning data vs. over-sanitizing data


46 of 59

“Dirty data” horror stories

“Dear Idiot” letter

17,000 men are pregnant

As the crow flies

CHF 10,000 compute-cluster bill

[Source]


47 of 59

Diagnosing data problems

  • Visualizations and basic stats can convey issues in “raw” data
  • Different representations highlight different types of issues:
    • Outliers often stand out in the right kind of plot
    • Missing data will cause gaps or zero values in the right kind of plot
  • Becomes increasingly difficult as data gets larger
    • Sampling to the rescue!
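For instance, a first diagnostic pass in Pandas (the DataFrame here is synthetic; plotting requires Matplotlib):

import numpy as np
import pandas as pd

# Synthetic stand-in for a large table, with some injected missing values
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(size=1_000_000)})
df.loc[rng.choice(len(df), 1000), "value"] = None

sample = df.sample(n=100_000, random_state=0)  # downsample for quick inspection
print(sample.describe())             # outliers stand out in min/max
print(sample.isna().mean())          # fraction of missing values per column
sample["value"].plot.hist(bins=100)  # gaps and outliers show up visually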


48 of 59

Facebook graph


49 of 59

Matrix view (1)

Automatic permutation of rows and columns to highlight patterns of connectivity


50 of 59

Matrix view (2)

Rows and columns sorted in the order in which data was retrieved via the Facebook API

Can you guess what’s going on here?


51 of 59


CHAT ROULETTE!

Think for 1 minute:

What causes the white block in the adjacency matrix?

  • Then: chat with a fellow student for 3 minutes
    • Rolex Forum: Talk to neighbor (priority: left, right)
    • Zoom: You’ll be randomized into small “breakout rooms”

52 of 59

Viz at scale? Careful!


53 of 59

Dealing with missing data

  • Set values to zero?
  • Interpolate based on existing data?
  • Omit missing data?

Knowledge about domain and data collection should drive your choice!


U.S. census counts of people working as “farm laborers”; values from 1890 are missing due to records being burned in a fire
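A sketch of the three options in Pandas, on an invented toy series with a gap (these are not the actual census numbers):

import pandas as pd

counts = pd.Series([400.0, 450.0, None, 520.0, 560.0],
                   index=[1880, 1885, 1890, 1895, 1900])

zeroed = counts.fillna(0)      # set the missing value to zero
interp = counts.interpolate()  # linear interpolation from the neighbors
omitted = counts.dropna()      # omit the missing year entirely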

54 of 59

Inconsistent data: “My name is Willy”


First name | Last name
-----------+----------
Willy      | NULL
...        | ...

55 of 59

Before you start analyzing your data

  • “Do I have missing data?” “If data were missing, how could I know?”
  • “Do I have corrupted data?” (May arise from measurement errors, wrong sampling strategies, etc.)
  • Parse/transform data into appropriate format for your specific analysis (see “Part 1: Data models”)
  • Don’t be surprised if you need to come back to this stage!


56 of 59

Desiderata

Ideally, you can get your hands on the code/documentation for the dataset you are analyzing (provenance)

Ideally, the provided data format is nicely parseable (otherwise you need regexes, or maybe even to pay humans)


57 of 59

Highly non-parseable data

Entire NY Times archive (since 1851) digitized as of 2015


58 of 59

Example from Bob’s research (AD 2013)

Q: What do people eat when and where?


59 of 59

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2020-lec2-feedback

Feedback form available for each lecture and lab session

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Is it nicer to follow the lecture online or offline?