1 of 42

Covid-19 Severe Outcome Risk Prediction

Changrong Ji

Dr. Mahesh Shukla

Dr. David Patton

Dr. Xue Yang

Dr. Xingguo Zhang

Antonio Linari

Premdutt Gaur

Vance Degen

A3.AI

Private Machine Learning on Medical Records & Social Data

2 of 42

Topics

  • About Us
  • Aims
  • Data
  • Approach
  • Early Findings
  • Future Work

3 of 42

Nonprofit Applied R&D

We are:

Data Scientists

Physicians

Engineers

Advocates

Researchers

Privacy Specialists

Game Developer

Venture & Social Capitalist

A3.AI

4 of 42

Projects

5 of 42

The data, technology, and services used in the generation of these research findings were generously supplied pro bono by the COVID-19 Research Database partners

https://covid19researchdatabase.org/

COVID-19 Project Team

6 of 42

Aims

  1. COVID-19 Severe Outcome Risk Prediction

  2. Social Determinants of Health and Risk Factors of COVID-19

  3. Privacy-preserving Machine Learning

  4. Building Clinical Concept Embeddings under Computing Resource Constraints

7 of 42

Multiple research aims are addressed in detail in the following working papers, current as of 09/2020. Future versions will be published as the research progresses:

  1. Early findings from baseline machine learning models that predict an individual’s risk of hospitalization if infected with COVID-19, based on medical claims and on EHR data: Predicting Risk of Hospitalization among COVID-19 Patients, Mahesh Shukla et al.

  2. Clinical concept embedding is a feature engineering technique that can improve the accuracy of AI models. Building embeddings from large claims data typically requires high computing power. We use a novel approach to build embeddings efficiently under the resource constraints of the COVID-19 Research DB environment: Building Clinical Concept Embeddings under Computing Resource Constraints, Antonio Linari et al.

8 of 42

Data

As of 08/21/2020; new data is added with a 1-week delay

  • Claims
    • 98 million patients with 7 years of medical claims history, totaling over 3 billion claim lines
      • Key attributes: ICD diagnosis codes, CPT procedure codes
    • 200,000+ COVID patients
  • Electronic Health Record (Outpatient)
    • Outpatient EHR records for 36 million patients
      • Diagnoses, procedures, encounters, medications, allergies, social history, etc.
    • 16,000 confirmed COVID patients and 75,000 possible COVID patients
  • Social - Claims - Death linked
    • 242 million people’s Social Data
      • People (demographics, finance, credit, housing, jobs, lifestyle)
      • Behaviors (interests, purchasing, social network activity, charitable giving, health lifestyle)
      • Predictors (motivator, travel, auto, in-market, and economic stats)
    • Death Registry of 80% of US population
      • Died_in_2020
    • 95,000 COVID patients

9 of 42

Attributions to Data Providers

  • AnalyticsIQ

AnalyticsIQ is a leading predictive data and analytics innovator that leverages a blend of publicly available data and custom algorithms informed by cognitive psychology concepts to describe consumers across three areas: People, Behaviors, and Predictors. Headquartered in Atlanta and recently named one of Georgia’s Top 10 most innovative companies, AnalyticsIQ’s team of data analysts, scientists, and cognitive psychologists has over 100 years of collective analytical experience and expertise.

  • Healthjump

Electronic Health Record data including diagnosis, procedures, labs, vitals, medications and histories sourced from participating members of the Healthjump network.

  • De-identified claims data was contributed by a claims clearinghouse.

10 of 42

Machine Learning for Clinical Prognosis

11 of 42

Aim 1

Create machine learning models to predict a patient’s risk of severe clinical outcomes if infected with COVID-19.

  • Hospitalization
  • ICU
  • Intubation
  • Ventilation
  • ECMO (heart-lung bypass)
  • Death, etc.

These personalized risk scores and the associated risk factor analysis can:

  • Help citizens make informed work and lifestyle choices
  • Augment clinical prognosis by physicians
  • Help health care organizations coordinate care and optimize resources
  • Help public health agencies with planning, responding and reopening.

12 of 42

ML Model Development

13 of 42

Feature Engineering

14 of 42

Embedding in NLP

Lower-dimensional space:

  • Words of similar meaning are located near each other in the embedding space

  • Relative location of two words in the space could encode a meaningful relationship
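
The "located near each other" idea is usually measured with cosine similarity. The sketch below uses three invented 3-dimensional vectors purely for illustration; the words, the vector values, and the `cosine_similarity` helper are assumptions and are not taken from the project's embeddings.

```python
import math

# Toy vectors invented for illustration only; real embeddings are learned
# from data and typically have tens to hundreds of dimensions.
embeddings = {
    "fever": [0.9, 0.1, 0.0],
    "pyrexia": [0.85, 0.15, 0.05],
    "fracture": [0.0, 0.2, 0.95],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_close = cosine_similarity(embeddings["fever"], embeddings["pyrexia"])
sim_far = cosine_similarity(embeddings["fever"], embeddings["fracture"])
print(f"fever vs pyrexia:  {sim_close:.3f}")
print(f"fever vs fracture: {sim_far:.3f}")
```

In this toy setup the two near-synonyms score much higher than the unrelated pair, which is the property the bullets describe.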

15 of 42

Clinical Concepts Embedding

16 of 42

Clinical Concepts Embedding

  • Low-dimensional vector representations of medical concepts
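
One common way to use such vectors as model features is to pool the embeddings of all codes in a patient's history into a single fixed-length vector. The sketch below shows mean pooling under stated assumptions: the 4-dimensional vectors are invented, `patient_vector` is a hypothetical helper, and nothing here is claimed to be the project's actual method.

```python
# Hypothetical 4-dimensional embeddings keyed by ICD-10 code.
# The numeric values are invented for illustration.
concept_vectors = {
    "E11.9": [0.8, 0.1, 0.0, 0.1],    # type 2 diabetes
    "I10": [0.6, 0.3, 0.1, 0.0],      # hypertension
    "J45.909": [0.1, 0.0, 0.9, 0.2],  # asthma
}


def patient_vector(codes, vectors, dim=4):
    """Mean-pool the embeddings of the codes that have a known vector."""
    known = [vectors[c] for c in codes if c in vectors]
    if not known:
        # No known codes: fall back to a zero vector of the right length.
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# "Z00.00" has no vector in the toy table and is skipped.
features = patient_vector(["E11.9", "I10", "Z00.00"], concept_vectors)
print(features)
```

The resulting vector has the same length for every patient, so it can sit alongside hand-crafted features in a tabular model.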

17 of 42

[Figure: chart values 15.7%, 6.5%, 0.4%, 0.01%, 3.1%]

18 of 42

Top 20 Procedures for COVID Patients

19 of 42

Top 20 Co-occurring Diagnoses with COVID-19

20 of 42

Baseline Prediction with Claims Data

Goal: Predict hospitalization for COVID-19 patients

Data:

  • Hand-crafted features (~100)
  • Total patients with COVID-19 infection: 170,241
  • Hospitalized patients: 17% of above
  • Train/Test/Validation set split: 60/20/20

Model:

  • Random Forest Classifier
  • Balanced weights
  • 400 estimators with a max_depth of 20
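
The setup above could be reproduced roughly as follows with scikit-learn. This is a sketch on synthetic data, not the team's pipeline: the feature matrix, the label rule, and the random seeds are invented, and mapping "balanced weights" to `class_weight="balanced"` is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the real features are not public.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
# Minority positive class, loosely driven by the first feature.
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 1.05).astype(int)

# 60/20/20 split: hold out 40%, then halve it into test and validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Hyperparameters from the slide; class_weight is the assumed reading
# of "balanced weights".
model = RandomForestClassifier(
    n_estimators=400,
    max_depth=20,
    class_weight="balanced",
    random_state=0,
    n_jobs=-1,
)
model.fit(X_train, y_train)

print(
    classification_report(
        y_test,
        model.predict(X_test),
        target_names=["not hospitalized", "hospitalized"],
    )
)
```

The validation split is left unused here; it would typically be reserved for tuning the decision threshold or hyperparameters.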

algorithm, model performance, feature importance, future improvements

21 of 42

Precision & Recall Refresher

                             Actually Hospitalized (16%)   Actually Not Hospitalized (84%)
Predicted Hospitalized       TP: 5026                      FP: 5531
Predicted Not Hospitalized   FN: 775                       TN: 22727

Precision: 48%

Recall: 87%

Minimize
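
The precision and recall figures on this slide follow directly from the four confusion-matrix counts. The short calculation below uses only the TP, FP, FN, and TN values shown; F1 and accuracy are added for reference and do not appear on the slide.

```python
# Counts as shown on the slide.
tp, fp, fn, tn = 5026, 5531, 775, 22727

# Of everyone predicted hospitalized, how many actually were.
precision = tp / (tp + fp)
# Of everyone actually hospitalized, how many the model caught.
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Share of all predictions that were correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(
    f"precision={precision:.2f} recall={recall:.2f} "
    f"f1={f1:.2f} accuracy={accuracy:.2f}"
)
```

Precision rounds to 0.48 and recall to 0.87, matching the 48% and 87% shown.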

22 of 42

Hospitalization Results

Classification Report:

23 of 42

24 of 42

Social Determinants and Risk Factors

About 90 attributes of social data from AnalyticsIQ are available for over 34 million patients, of whom over 95,000 are COVID-19 patients.

We examined:

  • Impact of different demographics
  • Behaviors as risk factors
  • Predictors (likelihood) as risk factors

25 of 42

Occupation*

26 of 42

Specific Occupations

27 of 42

Ethnicity

28 of 42

Ethnicity

29 of 42

Ethnicity

30 of 42

Ethnicity

31 of 42

Assimilation into US Culture

Least likely Most likely

32 of 42

Future Work: SDOH

Many additional attributes available in the dataset:

BMI, Diet, Location, Profession, Access to Healthcare, Risky behavior, etc.

  • Population Health study
  • Predictive Models
  • Personalized risk scoring

33 of 42

Future Work: Current Use Case

  • Improve model performance
    • Add clinical concept embeddings
    • Add temporal features (time sequence of clinical events)
    • Other machine learning and (lighter weight) deep learning models
  • Add more classification categories (other severe outcome types)
  • Incorporate Electronic Health Record data for prediction

34 of 42

Future Work: Population Health Dashboard

35 of 42

Future Work: New Use Cases As the Pandemic Progresses

  • Drug and COVID-19 vaccine effectiveness and safety
  • Identify clinical trial candidates
  • Long term consequences of COVID-19
    • Personal
    • Societal
    • Health
    • Economic

36 of 42

Private AI

Collaborative Learning with Obfuscation, Aggregation & Knowledge Transfer

Changrong Ji

David Patton, PhD

Vance Degen

Ben Carroll

Greg Ewing, JD

A3.AI

37 of 42

Data Sharing & Healthcare AI Challenges

  • Advanced analytics relies on data, often curated from multiple sources.
  • Data sharing and usage have a complex web of trust with multiple parties: data owners, analytics solutions providers and users.
  • Aggregating data and building models centrally raise concerns about
    • privacy and security,
    • single point of failure,
    • intellectual property and
    • misaligned incentives on the usage and value of the combined data and models
  • CLOAK is a platform and toolkit under development to address some key challenges in collaborative privacy-preserving machine learning.

Specific relevance to the COVID-19 Research DB projects: The person-level medical records and social data, while de-identified, are still vulnerable to attacks such as data linkage and model inversion that leak private information. The following slides highlight a set of techniques to mitigate these privacy risks. A series of papers will be published on this topic, starting with: Privacy-Preserving Machine Learning Techniques 2020, Changrong Ji et al.
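
To make the data linkage risk concrete, the sketch below counts how many de-identified records are unique on a handful of quasi-identifiers, since a unique combination can be matched against an outside dataset. The records, the column names, and the `group_sizes` helper are all invented for illustration; this is a simple k-anonymity style check and is not described in the slides as part of CLOAK.

```python
from collections import Counter

# Invented de-identified records; column choice is a hypothetical example.
records = [
    {"zip3": "303", "birth_year": 1950, "sex": "F"},
    {"zip3": "303", "birth_year": 1950, "sex": "F"},
    {"zip3": "303", "birth_year": 1987, "sex": "M"},
    {"zip3": "940", "birth_year": 1987, "sex": "M"},
]
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")


def group_sizes(rows, keys):
    """Count how many rows share each combination of quasi-identifiers."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


sizes = group_sizes(records, QUASI_IDENTIFIERS)
# The dataset is k-anonymous for k equal to the smallest group size.
k = min(sizes.values())
# Groups of size 1 are the records most exposed to linkage.
unique_rows = sum(1 for n in sizes.values() if n == 1)
print(f"k = {k}; records unique on quasi-identifiers: {unique_rows}")
```

With these toy rows, two of the four records are unique on the chosen columns, so the dataset is only 1-anonymous.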

38 of 42

Copyright © 2020 Changrong Ji

CLOAK PLATFORM (work in progress)

39 of 42

ANALYST

DATA

THREATS

TRUST

COMPUTE

40 of 42

PRIVACY PRESERVING TECHNIQUES 2020

41 of 42

ARCHITECT

DESIGN TRADE-OFFS

EXAMPLES

42 of 42

BUILDER