1 of 34


Applied Data Analysis (CS401)

Maria Brbic

Lecture 14

ADA in action

17 Dec 2025

2 of 34

Announcements

  • Today: last lecture [click here for a famous last lecture]
  • No lab session this Friday
  • Final project milestone P3 due on Sun 21 Dec 2025
  • Final exam: Tue 13 Jan 2026, 15:15–18:15
    • Dry run, QA example released this week
    • Announcement re: exam protocol, room assignment, etc., to be made in early January


3 of 34

Course evaluation available on Moodle!


Course evaluation

4 of 34

Today we will

show how everything you’ve learned in lectures 1–13 comes together in one project

Paper available at https://doi.org/10.1073/pnas.2106152118


5 of 34


Philip Seymour Hoffman † 2014

Amy Winehouse † 2011

6 of 34

Questions

  • Who is remembered by society after death?
    • “Postmortem collective memory”
  • Are there prototypical patterns of postmortem collective memory?
  • Are certain kinds of people remembered in certain ways?
  • Are dead people remembered differently in news vs. social media?


#6 Observational studies

7 of 34

Why should we care about postmortem collective memory?

Fact: Humans care a lot about being remembered after death


Ara Pacis, Rome

[Wikipedia]

Damnatio memoriae [Wikipedia]

8 of 34

An ADA approach

Let’s use lots of data and count stuff!

  • Detect names of dead people in big corpus of news and social media
  • Build time series of name counts
  • Analyze the shape of time series
  • Correlate shapes with biographic info about dead people from knowledge base


Stuffed Count von Count counting stuff

9 of 34

Part 1:

Getting the data


10 of 34

The raw data:

  • “Spinn3r provides APIs for social media, weblogs, news, video, and live web content to our customers in any language and in large volumes.” (Source: spinn3r.com)
  • Firehose stored to disk for over 5 years
  • Several billions of documents
  • 40 terabytes


#12 Scaling to massive data

11 of 34

Detecting news and social media in Spinn3r

  • Found a list of all 151K online news articles about Osama bin Laden’s killing (2 May 2011) indexed by Google News
  • Assumption: every relevant news outlet reported on bin Laden’s death, and Google News is reasonably complete
  • News defined as documents from the 6,608 Web domains appearing in the “bin Laden list”
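
The domain-based definition of news above can be sketched as a simple set lookup. This is only an illustration: the seed URLs are hypothetical stand-ins for the “bin Laden list”, and stripping a leading "www." is an assumption about how domains were normalized.

```python
from urllib.parse import urlparse

def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def build_news_domains(seed_urls):
    """Collect the set of domains that appear in the seed list of news articles."""
    domains = {domain_of(u) for u in seed_urls}
    domains.discard("")  # drop URLs without a parsable host
    return domains

def is_news(doc_url, news_domains):
    """A document counts as news if its domain appears in the seed list."""
    return domain_of(doc_url) in news_domains

# Hypothetical seed list (placeholder URLs, not real data).
seed = [
    "http://www.example-news.com/world/story1",
    "https://daily.example.org/a/b",
]
news_domains = build_news_domains(seed)
print(is_news("http://example-news.com/another-story", news_domains))
print(is_news("http://blog.example.net/post", news_domains))
```

Everything not matched this way would fall on the non-news side of the corpus; how subdomains should be treated is left open here.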


#2 Handling data

12 of 34

Data volume: Number of docs per day

#3 Visualizing data

13 of 34

Detecting mentions of people’s names

  • Easy: unambiguous names
    • e.g., “Amy Winehouse”
  • Hard: ambiguous names
    • e.g., “Michael Jackson”

Wikipedia-based solution:

  • Given: a name X (e.g., “Michael Jackson”)
  • Consider all N links in English Wikipedia with X as anchor text (cf. examples on the right)
  • Build distribution over the N link targets
  • If there is a target T to which ≥90% of all links point: consider X “sufficiently unambiguous” and assume that all mentions of X in Spinn3r refer to T
  • Else: consider X “too ambiguous” and ignore it in the analysis


The city of Gary is known as the birthplace of singer Michael Jackson.

The city of Gary is known as the birthplace of singer Michael Jackson.

The city of Gary is known as the birthplace of singer Michael Jackson.

The city of Gary is known as the birthplace of singer Michael Jackson.

Beer and whisky expert Michael Jackson was born in Leeds.
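
The Wikipedia-based rule above can be sketched in a few lines. The target labels and counts below are toy values (the first case mirrors the five example sentences); only the 90% threshold comes from the slide.

```python
from collections import Counter

def resolve_name(link_targets, threshold=0.9):
    """Return the dominant link target if at least `threshold` of all links
    with this anchor text point to it; otherwise return None (too ambiguous)."""
    if not link_targets:
        return None
    target, count = Counter(link_targets).most_common(1)[0]
    return target if count / len(link_targets) >= threshold else None

# Toy link targets for the anchor text "Michael Jackson" (hypothetical labels).
five_examples = ["MJ_singer"] * 4 + ["MJ_beer_expert"]
print(resolve_name(five_examples))  # 4 of 5 links = 80%, below the threshold

# With a more lopsided toy distribution the name passes the test.
lopsided = ["MJ_singer"] * 19 + ["MJ_beer_expert"]
print(resolve_name(lopsided))       # 19 of 20 links = 95%
```

Names for which the function returns None would simply be dropped from the analysis, as stated in the last bullet of the rule.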

#2 Handling data

#13 Handling network data

#10+11 Handling text data

14 of 34

Recruiting the army of the dead:

  • Freebase, a once-popular knowledge graph
    • Now defunct, but still available
  • Contains information about >3M people
    • e.g., Philip Seymour Hoffman = /m/02qgqt
    • More info about some, less about others
  • Start from 33K people with death date 2009–2014
    • Further filtering → 2,362 people
  • Extract biographic information from Freebase


#2 Handling data

#13 Handling network data

15 of 34

Tools used

  • Counting names in Spinn3r dump: Hadoop (Java)
  • Extracting info from Freebase: Python, Perl
  • Once data was small enough: R (script, repo)
    • Statistical analyses
    • Plotting
    • Read data as CSV once (minutes), then serialized to binary format and deserialized in later runs (seconds)
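
The "parse CSV once, then reuse a binary copy" trick from the last bullet looks roughly as follows. The original analysis was done in R; this is a Python analogue using `pickle`, and the function name and file paths are hypothetical.

```python
import csv
import pickle
from pathlib import Path

def load_table(csv_path, cache_path):
    """Load rows from the binary cache if it is up to date;
    otherwise parse the CSV (slow) and write the cache for later runs."""
    csv_path, cache_path = Path(csv_path), Path(cache_path)
    cache_is_fresh = (
        cache_path.exists()
        and cache_path.stat().st_mtime >= csv_path.stat().st_mtime
    )
    if cache_is_fresh:
        with cache_path.open("rb") as f:
            return pickle.load(f)
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    with cache_path.open("wb") as f:
        pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)
    return rows
```

Comparing modification times means that an edited CSV invalidates the cache automatically. Only load pickle files you created yourself.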


#12 Scaling to massive data

#2 Handling data

16 of 34

Our protagonist: Mention time series

  • t: relative time, counted in days since death
    • t = 0: day of death
  • S_i(t): fraction of documents in which person i was mentioned, out of all documents published on day t
    • For mention time series, consider logarithms: log10 S_i(t)
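
A minimal sketch of how such a series could be built from daily counts. The counts are toy numbers, and mapping days without any mention to None is an assumption (log10 of 0 is undefined; the slides do not say how such days were treated).

```python
import math
from datetime import date

def mention_series(mentions, totals, death_day):
    """Return {t: log10 S_i(t)}, where t counts days since death.

    mentions: {calendar day: number of documents mentioning person i}
    totals:   {calendar day: number of all documents published that day}
    """
    series = {}
    for day, total in totals.items():
        if total == 0:
            continue  # no documents at all on this day
        t = (day - death_day).days
        fraction = mentions.get(day, 0) / total
        series[t] = math.log10(fraction) if fraction > 0 else None
    return series

# Toy counts around one death date.
death = date(2011, 7, 23)
totals = {date(2011, 7, 22): 1000, date(2011, 7, 23): 1000, date(2011, 7, 24): 2000}
mentions = {date(2011, 7, 23): 10, date(2011, 7, 24): 2}
print(mention_series(mentions, totals, death))
```

Dividing by the daily total makes days with very different data volume comparable before taking logs.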


#1–13 ADA loves logs!

17 of 34

Mention time series: examples


News

Smoothed via Friedman’s “super smoother”

Twitter

#3 Visualizing data

18 of 34

Part 2:

The shape of postmortem memory


19 of 34

Average mention time series


News

#4 Describing data

20 of 34

Curve characteristics


Median over people: 1.98, 95% CI [1.90, 2.03]

Median over people: 0.00055, 95% CI [-0.00091, 0.0017]

Premortem mean: arithmetic mean of days 360 through 30 before death

Short-term boost: maximum of days 0 through 29 after death, minus the premortem mean

Long-term boost: arithmetic mean of days 30 through 360 after death, minus the premortem mean

Halving time: number of days required to accumulate half of the total area between the postmortem curve (including the day of death) and the minimum postmortem value
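
The four definitions above translate almost directly into code. This sketch assumes a complete series of log10 values for days -360 through 360; the 0-to-360 window for the halving time and reporting it as the index of the day on which half the area is reached are both assumptions, since the slide leaves them open.

```python
from statistics import fmean

def curve_characteristics(s):
    """s maps relative day t (int) to log10 S_i(t) for t = -360, ..., 360."""
    pre = fmean(s[t] for t in range(-360, -29))         # days 360 through 30 before death
    short = max(s[t] for t in range(0, 30)) - pre       # days 0 through 29 after death
    long_ = fmean(s[t] for t in range(30, 361)) - pre   # days 30 through 360 after death

    # Halving time: accumulate the area between the postmortem curve and its minimum.
    post = [s[t] for t in range(0, 361)]
    floor = min(post)
    total = sum(v - floor for v in post)
    acc, halving = 0.0, 0
    for t, v in enumerate(post):
        acc += v - floor
        if acc >= total / 2:
            halving = t
            break

    return {
        "premortem_mean": pre,
        "short_term_boost": short,
        "long_term_boost": long_,
        "halving_time": halving,
    }

# Toy curve: flat at -5, with a one-day spike to -3 on the day of death.
toy = {t: -5.0 for t in range(-360, 361)}
toy[0] = -3.0
print(curve_characteristics(toy))
```

For this toy curve the short-term boost is 2 (a 100-fold increase on the linear scale), the long-term boost is 0, and all of the area sits on day 0.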

#4 Describing data

21 of 34


Commercial break

22 of 34

Are there prototypical curve shapes?


  • Each curve one data point
  • Represented via its 4 curve characteristics
    • ≈ manual dimensionality reduction
    • Values standardized via z-scores
  • Cluster via k-means

Q: How to find the best number k of clusters?

#8 Applied ML

#9 Unsupervised learning

23 of 34

Average silhouette width!

News

Twitter

#9 Unsupervised learning
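
Choosing k by average silhouette width can be sketched without any libraries. This is a plain Lloyd's k-means with a single random initialization, not necessarily the setup used for the plots; the helper names, the seed, the candidate range for k, and the toy points are assumptions.

```python
import math
import random
from statistics import fmean, pstdev

def zscore_columns(rows):
    """Standardize each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    mus = [fmean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant column: avoid division by zero
    return [tuple((x - m) / s for x, m, s in zip(r, mus, sds)) for r in rows]

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm with random initial centers; returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(tuple(fmean(c) for c in zip(*members)) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def avg_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points, where a is the mean distance
    to the point's own cluster and b the mean distance to the nearest other cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(fmean(math.dist(p, q) for q in other)
                for m, other in clusters.items() if m != lab)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return fmean(widths)

def pick_k(points, candidates=range(2, 8)):
    """Return the k with the largest average silhouette width, plus all scores."""
    scores = {k: avg_silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy data: two well-separated groups of 2-D points.
raw = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
points = zscore_columns(raw)
print(avg_silhouette(points, kmeans(points, 2)))
```

In the lecture setting each point would be a person's four standardized curve characteristics, and `pick_k` would be run once for news and once for Twitter. In practice one would restart k-means several times per k, since a single random initialization can end in a poor local optimum.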

24 of 34

Cluster analysis


“Blip”

“Silence”

“Rise”

“Decline”

#4 Describing data

25 of 34

Part 3:

Biographic correlates of postmortem memory


26 of 34

A first stab

  • Measure correlation coefficient between outcome (e.g., short-term memory boost) and each biographic property (e.g., gender)
  • Higher correlation ⇒ higher outcome for the respective group of people (e.g., women)
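
For a binary property, this first stab amounts to correlating a 0/1 indicator with the outcome. The sketch below uses the Pearson coefficient, which is an assumption (the slide does not name the coefficient), and the numbers are toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data: indicator for a group (1 = in group) and an outcome per person.
in_group = [0, 0, 1, 1]
outcome = [1.0, 1.2, 2.0, 2.2]
print(pearson(in_group, outcome))  # positive: higher outcome inside the group
```

With a 0/1 indicator, the sign of the coefficient tells which group has the higher average outcome, which is exactly the reading given in the second bullet.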


27 of 34

Problem: Biographic properties are correlated

E.g., compared to artists, leaders (politicians, CEOs, etc.) are

  • more likely to have died old,
  • more likely to have died a natural death,
  • more likely to be men.

Regression analysis allows us to compare averages across subgroups of the data while accounting for correlations among averaged values!


#5 Regression analysis

28 of 34

Linear regression

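\[
y_i = \beta_0 + \beta_1\,\mathrm{premortem\_mention\_freq}_i + \beta_2\,\mathrm{age\_at\_death}_i + \beta_3\,\mathrm{manner\_of\_death}_i + \beta_4\,\mathrm{notability\_type}_i + \beta_5\,\mathrm{language}_i + \beta_6\,\mathrm{gender}_i
\]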


Terms of the model:

  • y_i: outcome for person i (short-term boost or long-term boost)
  • premortem_mention_freq: rank-transformed, then linearly scaled/shifted to [–0.5, 0.5]; i.e., median has value 0
  • age_at_death: 8 discrete levels (dummy-coded): 20–29, 30–39, …, 70–79, 80–89, 90–99
  • manner_of_death: 2 levels: natural, unnatural
  • notability_type: 6 levels: arts, sports, leadership, known for death, general fame, academia/engineering
  • language: 3 levels: anglophone, non-anglophone, unknown
  • gender: 2 levels: male, female
  • β0: avg. outcome for “baseline persona”: male anglophone artist of median premortem popularity who died a natural death at age 70–79
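
A sketch of how the predictors could be encoded so that the intercept corresponds to the baseline persona. The variable names, level labels, column order, and the average-rank treatment of ties are assumptions; only the levels and the baseline come from the slide. Note that each categorical term in the equation stands for a whole group of dummy coefficients, one per non-baseline level.

```python
def rank_scale(values):
    """Rank-transform, then map ranks linearly onto [-0.5, 0.5] (ties share the average rank)."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2
        i = j + 1
    return [r / (n - 1) - 0.5 for r in ranks]

LEVELS = {
    "age": [f"{d}-{d + 9}" for d in range(20, 100, 10)],  # 20-29, ..., 90-99
    "manner": ["natural", "unnatural"],
    "notability": ["arts", "sports", "leadership", "known for death",
                   "general fame", "academia/engineering"],
    "language": ["anglophone", "non-anglophone", "unknown"],
    "gender": ["male", "female"],
}
BASELINE = {"age": "70-79", "manner": "natural", "notability": "arts",
            "language": "anglophone", "gender": "male"}

def design_row(person, premortem_scaled):
    """Intercept, scaled premortem frequency, then one 0/1 dummy per non-baseline level."""
    row = [1.0, premortem_scaled]
    for var, levels in LEVELS.items():
        for level in levels:
            if level != BASELINE[var]:
                row.append(1.0 if person[var] == level else 0.0)
    return row

print(rank_scale([10, 1000, 100]))     # the median value maps to 0
print(design_row(dict(BASELINE), 0.0))  # baseline persona: only the intercept is active
```

Because every dummy is 0 and the scaled premortem frequency is 0 for the baseline persona, the model's prediction for that persona is exactly β0.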

29 of 34

Linear regression results


30 of 34

Age at death vs. postmortem memory

News plays two simultaneous roles (more so than Twitter):

  • Catering to public curiosity stirred by a young or unnatural death
  • But also covering the death of an old person or accomplished leader


31 of 34

Part 4:

Discussion


32 of 34

Summary: The shape of postmortem memory

  • Sharp pulse of media attention with death:
    • Median: +9,400% in news, +28,000% on Twitter
  • Then sharp drop (around 1 month long) toward premortem level
  • Cluster analysis revealed a set of four prototypical memory patterns: “blip”, “silence”, “rise”, “decline”
  • Same patterns in news and Twitter; same person tends to fall into the same cluster across the two media
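
Percentages like the ones above follow from boosts measured on the log10 scale. Assuming a boost is a difference of log10 mention fractions (as in the curve characteristics defined earlier), the conversion is a one-liner; the function name is hypothetical.

```python
def boost_to_percent(log10_boost):
    """Convert a boost on the log10 scale into a percent increase on the linear scale."""
    return (10 ** log10_boost - 1) * 100

print(boost_to_percent(2.0))  # a 100-fold increase
print(boost_to_percent(0.0))  # no change
```

A log10 boost of 2 is a 100-fold increase, i.e. +9,900%, and a boost of about 1.98 is roughly a 95-fold increase, which is the order of magnitude of the medians quoted above.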


33 of 34

Summary: Biographic correlates

  • Notability types: all regression coefficients for long-term boosts negative
    ⇒ All types have lower average long-term boost than default type (artists)
    ⇒ Artists more present in collective memory
  • Low R² (10–20%): human lives/legacies rich, hard to model
    • But all model fits highly significant (F-statistic, p-value)
    • Effects not only significant, but also large: e.g., short-term boost (on linear scale) for unnatural vs. natural death: 4x in news, 2x on Twitter
  • Largest boost:
    • Premortem popular anglophones who died a young, unnatural death
    • Long-term boosts largest for artists, smallest for leaders


34 of 34


Merry Christmas ADA happy New Year!