1 of 62


Applied Data Analysis (CS401)

Maria Brbic / Robert West

Lecture 2

Handling data

18 Sep 2024

2 of 62

Announcements


  • Register your teams (5 people) here by Fri Sep 27th
    • Each team member must individually complete the form!
  • Project milestone P1 to be released this Fri, due Fri Oct 4th
  • First quiz (“Q1”*) will be released today after the lecture
    • An exercise to recap the lecture material
  • Friday’s lab session:
    • Intro to Pandas (very important for Homework H1 and the rest of the course)
    • Exercise 1 is already on GitHub

 

3 of 62

Deadlines

  • Homeworks
    • Homework 1
      • Release Oct 4th 2024
      • Due Oct 18th 2024
    • Homework 2
      • Release Nov 15th 2024
      • Due Nov 29th 2024
  • Final exam
    • Date TBD
  • Project deliverables
    • Project milestone P1
      • Due Oct 4th 2024
    • Project milestone P2
      • Due Nov 15th 2024
    • Project milestone P3
      • Due Dec 20th 2024


All deadlines are 23:59 CET

4 of 62

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec2-feedback

Feedback form available for each lecture and lab session

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Is it nicer to follow the lecture online or offline?

5 of 62

Cooking with data

Part 2:

Data sources

Part 1:

Data models

Part 3:

Data wrangling

6 of 62

Cooking with data

Part 2:

Data sources

Part 1:

Data models

Part 3:

Data wrangling

7 of 62

Simple definition: A data model specifies how you think about the world

8 of 62

Q: “How do you think about the world?”

entity, relationship, attributes

A: “See my entity–relationship diagram!”

9 of 62

Part 1:

Data models

entity, relationship, attributes

How to store my data on a computer?

Q: “How do you think about the world?”

A: “See my entity–relationship diagram!”

10 of 62

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

11 of 62

Flat model

66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] "GET /support.html HTTP/1.1" 200 11179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

111.111.111.111 - - [08/Oct/2007:11:17:55 -0400] "GET / HTTP/1.1" 200 10801 "http://www.google.com/search?q=in+love+with+ada+lovelace+what+to+do&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"

  • Example: log files; e.g., Apache web server (httpd)
    • Entities = requests from clients to server

  • Another common format: CSV (“comma-separated values”)

“The world is simple: one type of entity, all with the same attributes”
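Below is a minimal sketch (not from the slides) of how such a flat log could be parsed in Python; the regular expression and field names are illustrative and only cover well-formed lines in this "combined" log format:

import re

# Illustrative pattern for Apache combined-log lines like the ones above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = ('66.249.65.107 - - [08/Oct/2007:04:54:20 -0400] '
        '"GET /support.html HTTP/1.1" 200 11179 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

match = LOG_PATTERN.match(line)
if match:
    # One request = one entity; all entities share the same attributes.
    print(match.groupdict())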

12 of 62

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

13 of 62

Relational model

  • The relational model is ubiquitous:
    • MySQL, PostgreSQL, Oracle, DB2, SQLite, …
    • You use it many times every day
  • Data represented as tables describing
    • entities
    • relationships between entities

Entities (presidents):

id   name
1    Bush
2    Trump
3    Obama

Relationships (successor-of):

president   successor
1           3
3           2

“The world contains many types of entities, connected by relationships.”

14 of 62

Processing data in the relational model: SQL

  • Declarative language for core data manipulations
  • You think about what you want, not how to compute it

SQL (Structured Query Language)

Declarative (e.g., SQL) vs. imperative (e.g., Python)

15 of 62

SQL


  • You should know basics of SQL
  • Need a refresher? → Watch/do online tutorials!
  • Key operations:
    • Select (!), update, delete
    • Unique keys
    • Joins (inner, left outer, right outer, full)
    • Sorting
    • Aggregation (group by, count, min, max, avg, etc.)
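A minimal, self-contained sketch of these operations, using Python's built-in sqlite3 module and the presidents example from the relational-model slide (table and column names are chosen here for illustration):

import sqlite3

# Toy database: the two tables from the relational-model slide.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE presidents (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE successions (president INTEGER, successor INTEGER);
    INSERT INTO presidents VALUES (1, 'Bush'), (2, 'Trump'), (3, 'Obama');
    INSERT INTO successions VALUES (1, 3), (3, 2);
""")

# Select + inner joins + sorting: who succeeded whom?
for row in con.execute("""
    SELECT p.name AS president, s.name AS successor
    FROM successions
    JOIN presidents AS p ON p.id = successions.president
    JOIN presidents AS s ON s.id = successions.successor
    ORDER BY president
"""):
    print(row)   # ('Bush', 'Obama'), ('Obama', 'Trump')

# Left outer join + aggregation: how many recorded successors per president?
for row in con.execute("""
    SELECT p.name, COUNT(s.successor) AS n_successors
    FROM presidents AS p
    LEFT OUTER JOIN successions AS s ON s.president = p.id
    GROUP BY p.name
"""):
    print(row)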

16 of 62


POLLING TIME

  • “Have you worked with SQL joins?”
  • Scan QR code or go to https://app.sli.do/event/jiFfkrP812UjtpwMbFJNxD

17 of 62

SQL implementations

(Logos of various SQL implementations, etc.)

18 of 62

SQL and “SQL”

  • The declarative-programming principles of SQL are widespread, even where it’s less obvious


19 of 62

“SQL”: Pandas (Python library)

  • Similar to SQL (declarative), with additional elements of functional programming (map(), filter(), etc.)
  • SQL “table” ←→ Pandas “DataFrame”
  • Need intro? Come to Friday’s lab session!
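For a first taste before the lab session, here is a minimal sketch using the presidents tables from above (the DataFrames are constructed inline for illustration):

import pandas as pd

# The two relational tables from earlier, as DataFrames.
presidents = pd.DataFrame({"id": [1, 2, 3], "name": ["Bush", "Trump", "Obama"]})
successions = pd.DataFrame({"president": [1, 3], "successor": [3, 2]})

# SQL-style join: resolve successor ids to names.
joined = successions.merge(presidents, left_on="successor", right_on="id")

# Declarative filtering and aggregation (you say what you want, not how).
print(presidents[presidents["name"].str.startswith("O")])
print(joined.groupby("name").size())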


20 of 62

Pandas vs. SQL

+ Pandas is lightweight and fast

+ Natively Python, i.e., full SQL expressiveness plus the expressiveness of Python, especially for function evaluation

+ Integration with plotting functions like Matplotlib

- In Pandas, tables must fit into memory

- No post-load indexing functionality: indices are built when a table is created

- No transactions, journaling, etc. (matters for parallel applications)

- Large, complex joins are slower


21 of 62

“SQL”: Unix command line

Goal: Find top 5 URLs visited most frequently by users between 18 and 25 years old

users.txt:

user_id   age
User145   33
User24    15
User5     76

url_visits.txt:

user_id   url
User2     ada.epfl.ch
User244   facebook.com

cat users.txt \
  | awk '$2 >= 18 && $2 <= 25' \
  | join -1 1 -2 1 url_visits.txt - \
  | cut -f 4 \
  | sort \
  | uniq -c \
  | sort -k 1,1 -n -r \
  | head -n 5

22 of 62

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes” → Flat model
  • “The world contains many types of entities, connected by relationships” → Relational model
  • “The world is a hierarchy of entities” → Document model
  • “The world is a complex network of entities” → Network model

23 of 62

Document model

  • XML format:

<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l’Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>

  • JSON format:

{
  "contact": {
    "id": 656,
    "firstname": "Chuck",
    "lastname": "Smith",
    "phones": ["(123) 555-0178", "(890) 555-0133"],
    "address": {
      "street": "Rue de l’Ale 8",
      "city": "Lausanne",
      "zip": 1007,
      "country": "CH"
    }
  }
}

“The world is a hierarchy of entities”

24 of 62

“Think for a minute”

Think for a minute: If we want to use a relational DB (e.g., MySQL) instead of XML, how can we store several phone numbers for the same person? (Feel free to discuss with your neighbor.)

  • Document model:

<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l’Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>

25 of 62

Solution to “Think for a minute”

  • Same in relational model:

contacts:

id    first name   ...
656   Chuck        ...
...   ...          ...

phones:

id    phone
656   (123) 555-0178
656   (890) 555-0133
...   ...

  • Document model:

<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l’Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>

26 of 62

Processing XML and JSON

  • Document structure = tree
  • Processing via tree traversal (depth- or breadth-first search)
  • Or use proper query language, such as XQuery or jq
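A minimal sketch of such a traversal in Python, using the <contact> document from the previous slides:

import xml.etree.ElementTree as ET

# The document is a tree, so processing is tree traversal.
xml = """
<contact>
  <id>656</id>
  <firstname>Chuck</firstname>
  <lastname>Smith</lastname>
  <phone>(123) 555-0178</phone>
  <phone>(890) 555-0133</phone>
  <address>
    <street>Rue de l'Ale 8</street>
    <city>Lausanne</city>
    <zip>1007</zip>
    <country>CH</country>
  </address>
</contact>
"""

root = ET.fromstring(xml)
print(root.findtext("firstname"))                # Chuck
print([p.text for p in root.findall("phone")])   # both phone numbers
print(root.findtext("address/city"))             # Lausanne (path = walk down the tree)

# Depth-first traversal of the whole tree
for element in root.iter():
    print(element.tag, element.text)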


27 of 62

Q1: “How should I store my data on a computer?”

Q2: “How do I think about the world?”

  • “The world is simple: one type of entity, all with the same attributes”�→ Flat model
  • “The world contains many types of entities, connected by relationships”�→ Relational model
  • “The world is a hierarchy of entities”�→ Document model
  • “The world is a complex network of entities”�→ Network model

28 of 62

Network model

“The world is a complex network of entities”

29 of 62

Commercial break


ADA: more than just rocket science!

30 of 62

“How should I store my data on a computer?”

—A word (or two) on binary formats

  • “Parsing” = converting strings (as stored in text files) to data types used by computer programs (e.g., int, float, boolean, array, list)
  • Possibly expensive, but can be avoided by using binary formats: store data to disk “as is”, without first converting to strings
  • Modern binary formats support nested structures, various levels of schema enforcement, compression, etc.
  • Python pickle, Java Serializable, Protocol Buffers (Google), Avro (supports schema evolution), Parquet (column-oriented), etc.

→ Consider converting to a binary format at the beginning of your processing pipeline (especially when using “big data”)
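A minimal sketch of what this looks like in practice with Pandas (Parquet support requires the pyarrow or fastparquet package; the file names are arbitrary):

import pandas as pd

# Write the same DataFrame as text (CSV) and as a binary, column-oriented format.
df = pd.DataFrame({"user_id": ["User145", "User24"], "age": [33, 15]})

df.to_csv("users.csv", index=False)     # text: every value stored as a string
df.to_parquet("users.parquet")          # binary: types and schema preserved, no parsing on load

print(pd.read_parquet("users.parquet").dtypes)   # 'age' comes back as an integer column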


31 of 62

Cooking with data

Part 2:

Data sources

Part 1:

Data models

Part 3:

Data wrangling

32 of 62

Data sources at Web companies

Examples from Facebook / Meta

Structured data (with clear schema):
    • Application databases

Semi-structured data (“self-describing” structure; CSV etc.):
    • Web server logs
    • Client-side event logs
    • API server logs
    • Ad server logs
    • Search server logs

Unstructured data:
    • Advertisement landing page content
    • Wikipedia
    • Images and video

33 of 62

Another example: Wikipedia

  • 300+ languages
  • 42 million entities
  • Mind-boggling richness of data


34 of 62


35 of 62

Wikipedia

How to work with Wikipedia?

  • REST API (cf. later)
  • XML dumps with wiki markup, SQL database dumps
  • Issues: Unicode, size, recency, etc.
  • To make your life easier:
    • Find projects on GitHub to help you
    • Use more structured versions (see next slides)


36 of 62

Wikidata

  • “Database version” of Wikipedia
  • {fr:Suisse, de:Schweiz, it:Svizzera, en:Switzerland, …} → Q39


37 of 62


38 of 62

Wikidata

  • “Database version” of Wikipedia
  • {fr:Suisse, de:Schweiz, it:Svizzera, en:Switzerland, …} → Q39
  • Both API access and full database dumps
  • Available as
    • JSON (document model)
    • RDF (network model)
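A minimal sketch of API access (assuming the standard Special:EntityData endpoint, which returns the JSON document describing one entity):

import requests

# Fetch the JSON document for Q39 (Switzerland) and inspect it.
url = "https://www.wikidata.org/wiki/Special:EntityData/Q39.json"
entity = requests.get(url, timeout=10).json()["entities"]["Q39"]

print(entity["labels"]["en"]["value"])   # "Switzerland"
print(len(entity["claims"]))             # number of properties attached to Q39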


39 of 62


Wikipedia pageview logs

40 of 62

Crawling and processing webpages: HTML

Plenty of bulk-downloadable HTML data:

  • Common Crawl dataset, about 1.82 billion web pages -- huge!
  • (… but less than 0.1% of Google’s Web crawl, as of 2015)
  • 145 TB, hosted on Amazon S3, also available for download

… but if you need a specific website: use a crawler/spider: Apache Nutch, Storm, Heritrix 3, Scrapy, etc. (or simply wget…)


41 of 62

Useful HTML tools

Requests http://docs.python-requests.org/en/master/

An elegant and simple HTTP library for Python

Scrapy https://scrapy.org/

An open-source framework to build Web crawlers

Beautiful Soup http://www.crummy.com/software/BeautifulSoup/

A Python API for handling real HTML

Plain ol’ /regular/expres*ion/s…
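A minimal sketch combining the first and third tool above (the URL is just an example):

import requests
from bs4 import BeautifulSoup

# Fetch a page and extract all outgoing link targets.
html = requests.get("https://www.epfl.ch/", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

links = [a["href"] for a in soup.find_all("a", href=True)]
print(len(links), links[:5])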


42 of 62

Schema.org: microformats for Web pages

  • Nuggets of structured information embedded in (semantically) unstructured HTML


Text as rendered by browser:

Avatar
Director: James Cameron (born August 16, 1954)

HTML under the hood: the same text, annotated with schema.org microdata attributes (sketched below)
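A sketch of what such microdata typically looks like, and how to pull the structured nuggets back out with Beautiful Soup. The markup below is hypothetical (not copied from the slide); itemscope, itemtype, and itemprop are the standard schema.org microdata attributes:

from bs4 import BeautifulSoup

# Hypothetical microdata-annotated HTML for the rendered text above.
html = """
<div itemscope itemtype="https://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="https://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span>
    (born <span itemprop="birthDate">1954-08-16</span>)
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Every element carrying an itemprop attribute is a structured nugget.
for tag in soup.find_all(attrs={"itemprop": True}):
    print(tag["itemprop"], "->", tag.get_text(" ", strip=True))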

43 of 62

Web services

  • Most large web sites today actively discourage “screen-scraping” to get their content
  • Instead: Web service APIs, for interoperable machine-to-machine interaction over a network
  • The preferred way to get data from online sources
  • Most common framework: REST
    • You request a URL from the server via HTTP
    • The server responds with a text file (e.g., JSON, XML, plain text)


44 of 62

REST example

The resource shown below is a description of a user named Jane

  • Requested by sending GET request for the resource’s URL, e.g., via curl: curl http://www.example.org/users/jane/
  • If users need to modify the resource, they GET it, modify it, and PUT it back
  • The href to the location resource allows savvy clients to get more information with another simple GET request
  • Implication: Clients cannot be too “thin”; need to understand resource formats!


{
  "user": {
    "name": "Jane",
    "gender": "female",
    "location": {
      "href": "http://www.example.org/us/ny/new_york",
      "text": "New York"
    }
  }
}
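A minimal sketch of the same interaction in Python (example.org is a placeholder domain, so this is illustrative rather than runnable against a real service):

import requests

# GET the resource and parse the JSON body.
response = requests.get("http://www.example.org/users/jane/", timeout=10)
user = response.json()["user"]
print(user["name"], user["location"]["text"])

# A savvy client follows the embedded href with another simple GET request.
location = requests.get(user["location"]["href"], timeout=10)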

45 of 62

Cooking with data

Part 2:

Data sources

Part 1:

Data models

Part 3:

Data wrangling

46 of 62

Working with raw data sucks

Data comes in all shapes and sizes
– CSV files, PDFs, SQL dumps, .jpg, …

Different files have different formatting
– Empty string or space instead of NULL, extra header rows, character encoding, …

“Dirty” data
– Unwanted anomalies, duplicates


47 of 62

Raw data without thinking:

A recipe for disaster!


48 of 62

What is data wrangling?

  • Goal: extract and standardize the raw data
  • Combine multiple data sources
  • Clean data anomalies
  • Strategy: Combine automation with interactive visualizations to aid in cleaning


49 of 62


Wrangling takes between 50% and 80% of your time…

[Source]

50 of 62

Types of data problems

  • Missing data
  • Incorrect data
  • Inconsistent representations of the same data
  • About 75% of data problems require human intervention (e.g., experts, crowdsourcing, etc.)
  • Tradeoff between cleaning data vs. over-sanitizing data


51 of 62

“Dirty data” horror stories

“Dear Idiot” letter

17,000 men are pregnant

As the crow flies

[Source]


52 of 62

Diagnosing data problems

  • Visualizations and basic stats can convey issues in “raw” data
  • Different representations highlight different types of issues:
    • Outliers often stand out in the right kind of plot
    • Missing data will cause gaps or zero values in the right kind of plot
  • Becomes increasingly difficult as data gets larger
    • Sampling to the rescue!
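A minimal sketch of such quick diagnostics in Pandas (the tiny DataFrame is made up to show the kinds of issues that stand out):

import pandas as pd

df = pd.DataFrame({"age": [33, 15, 76, None, 220],
                   "country": ["CH", "ch", "CH", "FR", None]})

print(df.describe())                  # basic stats: the max age of 220 stands out
print(df.isna().sum())                # missing values per column
print(df["country"].value_counts())   # "CH" vs "ch": inconsistent representations
print(df.sample(n=3))                 # sampling: eyeball a few rows of a large table
df.hist(column="age")                 # outliers often stand out in the right plot (needs matplotlib)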


53 of 62

Facebook graph


54 of 62

Matrix view (1)

Automatic permutation of rows and columns to highlight patterns of connectivity


55 of 62

Matrix view (2)

Rows and columns sorted in the order in which data was retrieved via the Facebook API

Can you guess what’s going on here?


56 of 62

Viz at scale? Careful!


57 of 62

Before you start analyzing your data

  • “Do I have missing data?” “If data were missing, how could I know?”
  • “Do I have corrupted data?” (May arise from measurement errors, processing bugs, etc.)
  • Parse/transform data into appropriate format for your specific analysis (see “Part 1: Data models”)
  • Don’t be surprised if you need to come back to this stage!


58 of 62

Desiderata

It is always ideal if you can get your hands on the code and documentation behind the dataset you are analyzing (provenance)

It is also ideal if the provided data format is easily parseable (otherwise you may need regexes, or even paid human annotators)


59 of 62

Highly non-parseable data

Entire NY Times archive (since 1851) digitized as of 2015


60 of 62

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec2-feedback

Feedback form available for each lecture and lab session

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Is it nicer to follow the lecture online or offline?

61 of 62

Dealing with missing data

  • Set values to zero?
  • Interpolate based on existing data?
  • Omit missing data?

Knowledge about domain and data collection should drive your choice!
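A minimal sketch of the three options in Pandas (the toy series below stands in for a yearly count with one missing value; it is not the actual census data from the figure):

import numpy as np
import pandas as pd

counts = pd.Series([3.0, 3.4, np.nan, 4.1], index=[1870, 1880, 1890, 1900])

print(counts.fillna(0))         # set missing values to zero
print(counts.interpolate())     # interpolate based on existing data
print(counts.dropna())          # omit missing data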


U.S. census counts of people working as “farm laborers”; values from 1890 are missing due to the records being burned in a fire

62 of 62

Inconsistent data: “My name is Willy”

First name   Last name
Willy        NULL
...          ...