Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study
Published in Journal of Medical Internet Research (JMIR)
Presented at Population Association of America 2021 Annual Meeting
May 6, 2021
Shahan Ali Memon
New York University || Carnegie Mellon University
shahan@{nyu.edu, cmu.edu}
in collaboration with
Ingmar Weber
Saquib Razak
Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study
Lifestyle Diseases
Surveillance
Population Search Behavior
- Non-communicable diseases such as diabetes, cancer, etc.
- Causes include lack of physical activity, unhealthy eating, alcohol use, substance use disorders, and tobacco smoking
- Responsible for about 70% of all deaths globally every year
- Systematic collection, analysis, and interpretation of health-related data to be used by those responsible for preventing and controlling disease and injury
- Traditionally accomplished by surveys and reporting
- Target variables for this study: diabetes, obesity, exercise
- Spatio-temporal prevalence of a given disease or activity
Experimental Setup
[Diagram: features X = (x1, x2, ..., xn) are mapped to the target Y by a linear model]
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$
Data Collection
[Diagram: sources from which the features X = (x1, x2, ..., xn) and the target Y are collected; data types shown: Temporal Data, Spatial Data]
Literature Gaps
- Hundreds of studies published
- Surveyed 37 studies that use Google Trends for lifestyle disease surveillance
Gaps
Methodology
1. Ad-hoc keyword selection
2. Overfitted temporal analysis
3. Spatial analysis without appropriate denormalization
Evaluation
4. Insufficient predictive evaluation (in-sample)
5. Failure to compare to trivial baselines
6. Lack of evidence for generalization
Detailed survey of the literature can be found in the corresponding paper on “Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study” in the Journal of Medical Internet Research (JMIR): tinyurl.com/trends-jmir
Contributions
Gap → Contribution
Methodology
1. Ad-hoc keyword selection → Bootstrapping keyword selection
2. Overfitted temporal analysis → Spatio-temporal analysis
3. Spatial analysis without appropriate denormalization → Denormalization framework for Google Trends
Evaluation
4. Insufficient predictive evaluation (in-sample) → Out-of-sample evaluation
5. Failure to compare to trivial baselines → Establish and beat trivial baselines
6. Lack of evidence for generalization → Cross-country generalizability of the model (US to Canada)
Keyword Selection
- Seed terms: diabetes, diabetic, obesity, obese, exercise
- Expansion sources: Google Trends Related Queries, Google Correlate, Semantic-Link
- Pruning: branded terms and nonsensical search terms are removed
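The keyword selection steps can be sketched roughly as follows. This is a minimal illustration, not the procedure from the paper: the `related` lookup stands in for whichever expansion source is used, and `toy_related`, `toy_is_noise`, the toy term table, and the blocklist are all hypothetical.

```python
def bootstrap_keywords(seeds, related, is_noise, rounds=2):
    """Expand seed terms through a related-terms lookup, pruning unwanted terms.

    `related` and `is_noise` are caller-supplied functions (hypothetical here);
    no real Google Trends API is called.
    """
    selected = list(seeds)
    seen = set(selected)
    frontier = list(selected)
    for _ in range(rounds):
        next_frontier = []
        for term in frontier:
            for candidate in related(term):
                candidate = candidate.strip().lower()
                # Skip terms already collected and terms flagged for pruning.
                if candidate in seen or is_noise(candidate):
                    continue
                seen.add(candidate)
                selected.append(candidate)
                next_frontier.append(candidate)
        frontier = next_frontier
    return selected


# Toy stand-ins for an expansion source and a pruning rule (made-up data).
TOY_RELATED = {
    "diabetes": ["diabetes symptoms", "BrandX glucose meter", "diabetic diet"],
    "obesity": ["obesity rate", "asdf obesity"],
    "diabetes symptoms": ["type 2 diabetes symptoms"],
}
BLOCKLIST = ("brandx", "asdf")


def toy_related(term):
    return TOY_RELATED.get(term, [])


def toy_is_noise(term):
    return any(marker in term for marker in BLOCKLIST)


keywords = bootstrap_keywords(["diabetes", "obesity"], toy_related, toy_is_noise)
```

Each round expands only the terms added in the previous round, so the number of rounds bounds how far the list can drift from the seeds.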
Curse of Normalization
Temporal Normalization
- Unlike fast-moving phenomena such as influenza or stock markets, lifestyle diseases are slow-moving and are measured at sparse temporal resolution (mostly across months, seasons, or years).
- Fitting a global temporal-only model using Google Trends is infeasible because of the sparse set of data points (at most equal to the number of years from 2004 to the present).
Spatial Normalization
- Because spatial data is normalized separately for each year, a spatial-only model fitted across years cannot account for changes in overall search volume; values from different years are not comparable.
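A small numeric illustration of why separately normalized years are not comparable. It assumes a simplified model of the normalization (scaling so the largest value becomes 100); the raw volumes and the state names "A" and "B" are made up.

```python
def normalize_to_100(raw):
    """Scale values so the largest becomes 100 (simplified normalization model)."""
    peak = max(raw.values())
    return {key: round(100 * value / peak) for key, value in raw.items()}


# Hypothetical raw search volumes for two states in two different years.
raw_2010 = {"A": 200, "B": 100}
raw_2020 = {"A": 800, "B": 400}  # overall volume is four times higher

index_2010 = normalize_to_100(raw_2010)
index_2020 = normalize_to_100(raw_2020)

# Both years yield the same normalized values, so the fourfold growth
# in overall volume is invisible in the normalized spatial data.
same_index = index_2010 == index_2020
```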
Denormalization
Our Google Trends denormalization is explained with examples at tinyurl.com/trends-denormalized. Other robust denormalization methods include “Calibration of Google Trends Time Series” (g-tab) by Robert West, CIKM 2020: https://arxiv.org/abs/2007.13861
[Equation with annotated terms]
- New Spatio-Temporal Index
- Spatial data value from Google Trends
- Ratio of temporal increase from year r to year y
- Ratio of the sum of spatial data in year r to that of year y
- Ratio of population size of state n for the reference year r to the year y
- Ratio of internet penetration of state n for the reference year r to the year y
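A minimal sketch of how the annotated terms might combine. It assumes the new index is the product of the spatial value and the four ratios, which the slide text does not state explicitly; the function name, the parameter names, and the numbers are hypothetical. The linked write-up has the actual formula.

```python
def denormalized_index(spatial_value, temporal_ratio, spatial_sum_ratio,
                       population_ratio, internet_ratio):
    """Combine the annotated terms into one index.

    ASSUMPTION: the terms are multiplied together. Each ratio is passed in
    already computed, so the sketch takes no position on its direction.
    """
    return (spatial_value * temporal_ratio * spatial_sum_ratio
            * population_ratio * internet_ratio)


# Made-up inputs for one state and one year.
new_index = denormalized_index(
    spatial_value=60,        # spatial data value from Google Trends
    temporal_ratio=1.5,      # temporal increase between years r and y
    spatial_sum_ratio=0.5,   # sum of spatial data, year r relative to year y
    population_ratio=1.0,    # population size ratio for the state
    internet_ratio=2.0,      # internet penetration ratio for the state
)
```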
Modeling: Linear Regression with an L1-norm Penalty
Notation: s is the state and t is the year. $x_{m,t,s}$ is the spatial index (intensity) of feature or keyword m for year t and state s, $x'_{m,t,s}$ is the denormalized SpatioTemporal index for keyword m, and $b_m$ is the coefficient for that feature or keyword.
- Trivial Baseline: $\hat{y}_{t,s} = y_{t-1,s}$
- Spatial Model: $\hat{y}_{t,s} = b_0 + b_1 x_{1,t,s} + b_2 x_{2,t,s} + \dots + b_n x_{n,t,s}$
- SpatioTemporal Model: $\hat{y}_{t,s} = b_0 + b_1 x'_{1,t,s} + b_2 x'_{2,t,s} + \dots + b_n x'_{n,t,s}$
- Multivariate SpatioTemporal Model: $\hat{y}_{t,s} = b_0 + b_1 x'_{1,t,s} + b_2 x'_{2,t,s} + \dots + b_n x'_{n,t,s} + b_{n+1} y_{t-1,s}$
- Lagged Multivariate SpatioTemporal Model: $\hat{y}_{t,s} = b_0 + b_1 x'_{1,t-1,s} + b_2 x'_{2,t-1,s} + \dots + b_n x'_{n,t-1,s} + b_{n+1} y_{t-1,s}$
- Hierarchical Lagged Multivariate SpatioTemporal Model: $\hat{y}_{t,s} = b_0 + b_1 x'_{1,t-1,s} + b_2 x'_{2,t-1,s} + \dots + b_n x'_{n,t-1,s} + b_{n+1} y_{t-1,s} + b_{n+2} I(\text{state 1}) + b_{n+3} I(\text{state 2}) + \dots + b_{n+51} I(\text{state 50})$, where I(state 1), I(state 2), ..., I(state 50) are the terms capturing the state-level hierarchies in the data
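The lagged multivariate model can be sketched in plain Python as below. This is an illustrative sketch only: the coordinate-descent solver is a textbook L1-penalized least-squares routine rather than the implementation used in the study, all data values are made up, and the names `lagged_rows`, `fit_lasso`, and `predict` are hypothetical. State indicator terms for the hierarchical variant are omitted.

```python
def soft_threshold(value, threshold):
    """Shrink a value toward zero by `threshold` (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(rows, targets, alpha, sweeps=200):
    """Coordinate descent for (1/2n) * sum(residual^2) + alpha * sum(|b_j|).

    The intercept b_0 is not penalized.
    """
    n, p = len(rows), len(rows[0])
    coefs = [0.0] * p
    intercept = 0.0
    for _ in range(sweeps):
        intercept = sum(
            y - sum(b * x for b, x in zip(coefs, row))
            for row, y in zip(rows, targets)
        ) / n
        for j in range(p):
            rho, norm = 0.0, 0.0
            for row, y in zip(rows, targets):
                # Residual with feature j left out of the prediction.
                partial = y - intercept - sum(
                    coefs[k] * row[k] for k in range(p) if k != j
                )
                rho += row[j] * partial
                norm += row[j] ** 2
            coefs[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm else 0.0
    return intercept, coefs


def lagged_rows(index, prevalence, states, years):
    """Build rows [x'_{1,t-1,s}, ..., x'_{n,t-1,s}, y_{t-1,s}] with target y_{t,s}."""
    rows, targets = [], []
    for s in states:
        for t in years:
            if (s, t - 1) in index and (s, t - 1) in prevalence and (s, t) in prevalence:
                rows.append(list(index[(s, t - 1)]) + [prevalence[(s, t - 1)]])
                targets.append(prevalence[(s, t)])
    return rows, targets


def predict(intercept, coefs, row):
    return intercept + sum(b * x for b, x in zip(coefs, row))


# Made-up denormalized indices (one keyword) and prevalence values.
index = {("A", 2018): [10.0], ("A", 2019): [12.0],
         ("B", 2018): [20.0], ("B", 2019): [22.0]}
prevalence = {("A", 2018): 8.0, ("A", 2019): 8.5, ("A", 2020): 9.0,
              ("B", 2018): 11.0, ("B", 2019): 11.4, ("B", 2020): 12.0}

rows, targets = lagged_rows(index, prevalence, ["A", "B"], [2019, 2020])
intercept, coefs = fit_lasso(rows, targets, alpha=0.01)
predictions = [predict(intercept, coefs, row) for row in rows]
```

A larger `alpha` drives more coefficients to exactly zero, which is the reason an L1 penalty suits a setting with many candidate keywords.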
Evaluation: Mean Absolute Error
[Table: mean absolute error of each model for Diabetes, Obesity, and Exercise. Rows: Trivial Baseline, Spatial Model, SpatioTemporal Model, Multivariate SpatioTemporal Model, Lagged Multivariate SpatioTemporal Model, Hierarchical Lagged Multivariate SpatioTemporal Model. Cell markers: ✓ beats the previous method; ✓ beats the trivial baseline; best overall.]
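The comparison against the trivial baseline can be sketched as follows. The prevalence values and `model_predictions` are made up for illustration and are not results from the study; the held-out year is assumed to be excluded from training.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def trivial_baseline(prevalence, states, year):
    """Predict each state's value as its value in the previous year."""
    return [prevalence[(s, year - 1)] for s in states]


# Made-up prevalence values; 2020 plays the role of the held-out year.
prevalence = {("A", 2019): 8.5, ("A", 2020): 9.0,
              ("B", 2019): 11.5, ("B", 2020): 12.0}
states = ["A", "B"]
held_out_year = 2020

actual = [prevalence[(s, held_out_year)] for s in states]
baseline_predictions = trivial_baseline(prevalence, states, held_out_year)
model_predictions = [8.75, 12.25]  # hypothetical out-of-sample model output

baseline_mae = mean_absolute_error(actual, baseline_predictions)
model_mae = mean_absolute_error(actual, model_predictions)
beats_baseline = model_mae < baseline_mae
```

Because slow-moving targets change little from year to year, the previous-year baseline is hard to beat, which is why it is a useful reference point.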
Generalizability: Applying US model to Canada
Model applied: $\hat{y}_{t,s} = b_0 + b_1 x'_{1,t-1,s} + b_2 x'_{2,t-1,s} + \dots + b_n x'_{n,t-1,s} + b_{n+1} y_{t-1,s}$
- Does not beat the trivial baseline, but achieves:
- > 0.60 Pearson correlation for Diabetes
- > 0.91 Pearson correlation for Obesity
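A toy illustration of how predictions can correlate strongly with the truth yet still have a large absolute error, for example when they carry a systematic offset. The numbers are made up and are not the Canadian results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical predictions that track the trend but sit 4 units too high.
predicted = [10.0, 11.0, 12.0, 13.0]
actual = [6.0, 7.0, 8.0, 9.0]

correlation = pearson(predicted, actual)
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

Here the correlation is perfect while the mean absolute error is 4, so a model can rank regions correctly without matching their levels.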
In this work
Details about the paper and the (de)normalization in “Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study” in the Journal of Medical Internet Research (JMIR): tinyurl.com/trends-jmir
Shahan Ali Memon
@shahanalimemon
shahan@nyu.edu
Ingmar Weber
@ingmarweber
iweber@hbku.edu.qa
Saquib Razak
srazak@cmu.edu
Thanks!
Questions?
Comments?
Paper: tinyurl.com/trends-jmir