1 of 14

Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study

Published in Journal of Medical Internet Research (JMIR)

Presented at Population Association of America 2021 Annual Meeting

May 6, 2021

Shahan Ali Memon

New York University || Carnegie Mellon University

shahan@{nyu.edu, cmu.edu}

in collaboration with

Ingmar Weber

Saquib Razak


2 of 14

Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study

Lifestyle Diseases

Surveillance

Population Search Behavior

- Non-communicable diseases such as diabetes, cancer, etc.

- Causes include physical inactivity, unhealthy eating, alcohol use, substance use disorders, and tobacco smoking

- Responsible for about 70% of all deaths globally every year

- Systematic collection, analysis, and interpretation of health-related data to be used by those responsible for preventing and controlling disease and injury

- Traditionally accomplished by surveys and reporting

- Target variables for this study: diabetes, obesity, exercise

- Spatio-temporal prevalence of a given disease or activity


3 of 14

Experimental Setup

[Figure: features X (x1, x2, ..., xn) and target Y]

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$


4 of 14

Data Collection

[Figure: sources from which X (x1, x2, ..., xn) and Y are collected; panels: Temporal Data, Spatial Data]


5 of 14

Literature Gaps

- Hundreds of studies published

- Surveyed 37 studies using Google Trends for Lifestyle Disease Surveillance

Gaps

Methodology

1. Ad-hoc keyword selection

2. Overfitted temporal analysis

3. Spatial analysis without appropriate denormalization

Evaluation

4. Insufficient predictive evaluation (in-sample)

5. Failure to compare to trivial baselines

6. Lack of evidence for generalization

A detailed survey of the literature can be found in the corresponding paper, “Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study”, in the Journal of Medical Internet Research (JMIR): tinyurl.com/trends-jmir


6 of 14

Contributions

Each gap in the literature (left) is paired with the contribution that addresses it (right).

Methodology

1. Ad-hoc keyword selection → Bootstrapping keyword selection

2. Overfitted temporal analysis → Spatio-temporal analysis

3. Spatial analysis without appropriate denormalization → Denormalization framework for Google Trends

Evaluation

4. Insufficient predictive evaluation (in-sample) → Out-of-sample evaluation

5. Failure to compare to trivial baselines → Establish and beat trivial baselines

6. Lack of evidence for generalization → Cross-country generalizability of the model (US to Canada)


7 of 14

Keyword Selection

seed terms: diabetes, diabetic, obesity, obese, exercise

pruning: branded terms, nonsensical search terms

Google Trends Related Queries

Google Correlate

Semantic-Link
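The bootstrapping procedure named on this slide can be sketched as an expand-and-prune loop. This is only a minimal sketch under stated assumptions: `toy_related` stands in for a lookup against a source such as Google Trends Related Queries and is a hypothetical in-memory mapping, and the branded marker, the nonsense term, and the number of rounds are invented for illustration.

```python
# Seed terms as listed on the slide.
SEED_TERMS = ["diabetes", "diabetic", "obesity", "obese", "exercise"]

# Hypothetical stand-in for a related-query source (not real Google Trends output).
TOY_RELATED = {
    "diabetes": ["diabetes symptoms", "acme glucose meter"],
    "exercise": ["exercise bike", "asdfgh"],
    "diabetes symptoms": ["type 2 diabetes symptoms"],
}
BRANDED_MARKERS = {"acme"}   # assumed brand name, for illustration only
NONSENSE_TERMS = {"asdfgh"}  # assumed nonsensical query, for illustration only


def toy_related(term):
    """Return related queries for a term from the toy mapping."""
    return TOY_RELATED.get(term, [])


def toy_is_pruned(term):
    """Prune branded terms and nonsensical search terms."""
    has_brand = any(word in BRANDED_MARKERS for word in term.split())
    return has_brand or term in NONSENSE_TERMS


def bootstrap_keywords(seeds, related, is_pruned, rounds=2):
    """Expand seeds through the related-query source, pruning along the way."""
    selected, seen = [], set()
    frontier = list(seeds)
    for _ in range(rounds + 1):  # round 0 handles the seeds themselves
        next_frontier = []
        for term in frontier:
            if term in seen:
                continue
            seen.add(term)
            if is_pruned(term):
                continue  # pruned terms are neither kept nor expanded
            selected.append(term)
            next_frontier.extend(related(term))
        frontier = next_frontier
    return selected


keywords = bootstrap_keywords(SEED_TERMS, toy_related, toy_is_pruned)
print(keywords)
```

Pruned terms are not expanded further, so a branded query cannot pull additional branded queries into the keyword set.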


8 of 14

Curse of Normalization

Temporal Normalization

Spatial Normalization

- Unlike fast-moving phenomena such as influenza or stock markets, lifestyle diseases are slow-moving and yield sparse temporal resolution (i.e., they are mostly measured across months, seasons, or years).

- Fitting a global temporal-only model using Google Trends is infeasible because the data points are sparse (at most as many as the number of years from 2004 to the present).

- Because spatial data is normalized separately for each year, fitting a spatial-only model across time will not account for changes in overall search volume.

Not comparable
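The comparability problem can be illustrated with a toy calculation. The sketch below assumes a simplified version of the Google Trends convention in which each year's regional values are rescaled so that the top region scores 100. The raw volumes and region names are invented for illustration; Google Trends does not expose raw counts.

```python
def normalize_to_100(raw):
    """Rescale one year's regional values so the largest becomes 100."""
    top = max(raw.values())
    return {region: round(100 * value / top) for region, value in raw.items()}


# Hypothetical raw search volumes: interest quadruples between the two years.
raw_2010 = {"State A": 200, "State B": 100}
raw_2015 = {"State A": 800, "State B": 400}

index_2010 = normalize_to_100(raw_2010)
index_2015 = normalize_to_100(raw_2015)

print(index_2010)
print(index_2015)
```

Both years produce the same indices even though the underlying volume grew fourfold, which is why spatial indices from different years cannot be pooled without a correction.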


9 of 14

Denormalization

Our Google Trends denormalization is explained with examples at tinyurl.com/trends-denormalized. Other robust methods of denormalization include “Calibration of Google Trends Time Series” (g-tab), CIKM 2020, by Robert West: https://arxiv.org/abs/2007.13861

[Equation: the New Spatio-Temporal Index, annotated term by term]

New Spatio-Temporal Index, with the following annotated terms:

- Spatial data value from Google Trends

- Ratio of temporal increase from year r to year y

- Ratio of the sum of spatial data in year r to that of year y

- Ratio of the population size of state n for the reference year r to that for year y

- Ratio of the internet penetration of state n for the reference year r to that for year y
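One plausible reading of the annotated terms above is that they combine multiplicatively. The slide text does not confirm this, so the sketch below is an assumption rather than the authors' formula; the paper and tinyurl.com/trends-denormalized are the authoritative sources. The function name, argument names, ratio directions, and all numbers are hypothetical.

```python
def denormalized_index(spatial_value, temporal_y, temporal_r,
                       spatial_sum_r, spatial_sum_y,
                       population_r, population_y,
                       penetration_r, penetration_y):
    """Combine the annotated terms for one state n, year y, and reference year r.

    ASSUMPTION: the terms are multiplied together, and each ratio is oriented
    as in its slide label (e.g. "year r to year y" taken as r / y).
    """
    temporal_ratio = temporal_y / temporal_r           # temporal increase from r to y
    spatial_sum_ratio = spatial_sum_r / spatial_sum_y  # sum of spatial data, r over y
    population_ratio = population_r / population_y     # population of state n, r over y
    penetration_ratio = penetration_r / penetration_y  # internet penetration, r over y
    return (spatial_value * temporal_ratio * spatial_sum_ratio
            * population_ratio * penetration_ratio)


# Invented inputs chosen to give round numbers; not data from the study.
index = denormalized_index(
    spatial_value=80,
    temporal_y=75, temporal_r=50,
    spatial_sum_r=90, spatial_sum_y=180,
    population_r=1000, population_y=1000,
    penetration_r=0.4, penetration_y=0.8,
)
print(index)
```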


10 of 14

Modeling: Using Linear Regression w/ L1-norm

Trivial Baseline

$\hat{y}_{t,s} = y_{t-1,s}$

where s represents the state and t represents the year

Spatial Model

$\hat{y}_{t,s} = b_0 + b_1 x_{1,t,s} + b_2 x_{2,t,s} + \dots + b_n x_{n,t,s}$

where $x_{m,t,s}$ represents the spatial index (intensity) of feature or keyword m for year t and state s, and $b_m$ represents the coefficient for that feature or keyword

SpatioTemporal Model

$\hat{y}_{t,s} = b_0 + b_1 x'_{1,t,s} + b_2 x'_{2,t,s} + \dots + b_n x'_{n,t,s}$

where $x'_{m,t,s}$ represents the denormalized spatio-temporal index for keyword m

Multivariate SpatioTemporal Model

$\hat{y}_{t,s} = b_0 + b_1 x'_{1,t,s} + b_2 x'_{2,t,s} + \dots + b_n x'_{n,t,s} + b_{n+1} y_{t-1,s}$

Lagged Multivariate SpatioTemporal Model

$\hat{y}_{t,s} = b_0 + b_1 x'_{1,t-1,s} + b_2 x'_{2,t-1,s} + \dots + b_n x'_{n,t-1,s} + b_{n+1} y_{t-1,s}$

Hierarchical Lagged Multivariate SpatioTemporal Model

$\hat{y}_{t,s} = b_0 + b_1 x'_{1,t-1,s} + b_2 x'_{2,t-1,s} + \dots + b_n x'_{n,t-1,s} + b_{n+1} y_{t-1,s} + b_{n+2} I(\text{state } 1) + b_{n+3} I(\text{state } 2) + \dots + b_{n+51} I(\text{state } 50)$

where I(state 1), I(state 2), ..., I(state 50) are the indicator terms capturing the state-level hierarchies in the data
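To make the feature layout of the hierarchical lagged model concrete, the sketch below builds one design-matrix row per (year, state) pair: the previous year's keyword indices, the previous year's target value, and one indicator per state. The dictionary layout, the function name, and all values are assumptions made for illustration; the resulting rows could then be passed to any L1-regularized linear regression.

```python
def build_rows(x_prime, y, states, years):
    """Build lagged feature rows and targets.

    x_prime[(t, s)] is the list of denormalized keyword indices for year t and
    state s; y[(t, s)] is the observed target. Both layouts are hypothetical.
    """
    rows, targets = [], []
    for t in years:
        for s in states:
            lagged_key = (t - 1, s)
            if lagged_key not in x_prime or lagged_key not in y or (t, s) not in y:
                continue  # skip pairs without a full lagged history
            indicators = [1.0 if s == other else 0.0 for other in states]
            rows.append(list(x_prime[lagged_key]) + [y[lagged_key]] + indicators)
            targets.append(y[(t, s)])
    return rows, targets


# Invented toy data for two states and two keywords; not data from the study.
states = ["AL", "AK"]
x_prime = {
    (2010, "AL"): [1.0, 2.0], (2010, "AK"): [3.0, 4.0],
    (2011, "AL"): [1.5, 2.5], (2011, "AK"): [3.5, 4.5],
}
y = {
    (2010, "AL"): 10.0, (2010, "AK"): 7.0,
    (2011, "AL"): 10.5, (2011, "AK"): 7.2,
    (2012, "AL"): 11.0, (2012, "AK"): 7.5,
}

rows, targets = build_rows(x_prime, y, states, years=[2011, 2012])
for row, target in zip(rows, targets):
    print(row, "->", target)
```

Each row has n + 1 + (number of states) columns, matching the coefficients b1 through bn, the lagged-target coefficient, and the state indicator coefficients; the intercept b0 is left to the regression routine.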


11 of 14

Evaluation: Mean Absolute Error

[Table: Mean Absolute Error by model (rows) and target variable (columns)]

Models: Trivial Baseline, Spatial Model, SpatioTemporal Model, Multivariate SpatioTemporal Model, Lagged Multivariate SpatioTemporal Model, Hierarchical Lagged Multivariate SpatioTemporal Model

Target variables: Diabetes, Obesity, Exercise

Legend:

✓ beats the previous method

✓ beats the trivial baseline

best overall
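The comparison against the trivial baseline can be sketched as follows. The baseline predicts each state's value from the previous year, and a model "beats" it when its mean absolute error on held-out data is lower. All numbers are invented for illustration and are not results from the study; `model_predictions` is a hypothetical model output.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def trivial_baseline(y, year, states):
    """Predict each state's value as its value in the previous year."""
    return [y[(year - 1, s)] for s in states]


# Invented prevalence values for two states; not data from the study.
states = ["AL", "AK"]
y = {
    (2011, "AL"): 10.5, (2011, "AK"): 7.2,
    (2012, "AL"): 11.0, (2012, "AK"): 7.5,
}

actual_2012 = [y[(2012, s)] for s in states]
baseline_mae = mean_absolute_error(actual_2012, trivial_baseline(y, 2012, states))

model_predictions = [10.8, 7.4]  # hypothetical out-of-sample predictions
model_mae = mean_absolute_error(actual_2012, model_predictions)

print("baseline MAE:", baseline_mae)
print("model MAE:", model_mae)
print("beats the trivial baseline:", model_mae < baseline_mae)
```

Because lifestyle disease prevalence changes slowly, the previous-year value is a strong predictor, which is what makes this baseline hard to beat.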


12 of 14

Generalizability: Applying US model to Canada

The US model does not beat the trivial baseline, but achieves:

- > 0.60 Pearson correlation for Diabetes

- > 0.91 Pearson correlation for Obesity

$\hat{y}_{t,s} = b_0 + b_1 x'_{1,t-1,s} + b_2 x'_{2,t-1,s} + \dots + b_n x'_{n,t-1,s} + b_{n+1} y_{t-1,s}$
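High correlation and a failure to beat the baseline on absolute error are compatible. The sketch below shows one way this can happen: predictions that track the true values perfectly but sit at a constant offset. It is an illustration with invented numbers, not a claim about why the Canadian results look as they do.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented values: predictions follow the trend but are shifted upward by 3.
actual = [5.0, 6.0, 7.0, 8.0]
predicted = [8.0, 9.0, 10.0, 11.0]

r = pearson(actual, predicted)
mae = mean_absolute_error(actual, predicted)

print("Pearson r:", r)
print("MAE:", mae)
```

Correlation is insensitive to shifts and rescaling, whereas mean absolute error penalizes them directly, so the two metrics answer different questions about a transferred model.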


13 of 14

In this work

  • we test the feasibility of using Google Trends for web-based lifestyle disease surveillance for diabetes, obesity, and exercise.
  • we use a corrective approach to overcome the methodological and evaluation-related shortcomings in the literature.
  • we propose a novel spatio-temporal denormalization scheme that effectively undoes Google’s normalization so that data from different years can be compared correctly.
  • we use time-lagged models with creative debiasing strategies to beat the trivial baselines.
  • we empirically show cross-country generalizability of the models, which is especially relevant for countries with no surveillance data.
  • we conclude that Google Trends has low-to-moderate validity for the surveillance of (slow-moving) lifestyle diseases.

Details about the paper and the (de)normalization in “Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study” in the Journal of Medical Internet Research (JMIR): tinyurl.com/trends-jmir


14 of 14

Shahan Ali Memon

@shahanalimemon

shahan@nyu.edu

Ingmar Weber

@ingmarweber

iweber@hbku.edu.qa

Saquib Razak

srazak@cmu.edu

Thanks!

Questions?

Comments?
