Disease detection at the speed of life: Near real-time disease surveillance at population scale
Heidi A. Hanson, PhD
Senior Scientist, Advanced Computing for Health Sciences | Computational Sciences and Engineering Division
Oak Ridge, TN
March 05, 2024
ORNL is managed by UT-Battelle LLC for the US Department of Energy
Acknowledgements
Vision: Near real-time analytics at scale and data driven clinical decision-making tools
Conceptual Model: Heidi Hanson
Graphic Design: Brenna Kelley
Investors and technical leads
NCI: National Cancer Institute
DOE: Department of Energy
Cancer driving computing advances
Computing driving cancer advances
Joint Design of Advanced Computing for Cancer (JDACS4C): DOE-NCI Partnership
Enabling the most challenging deep learning problems in cancer research to run on the most capable supercomputers in the DOE
Pilot 3: Dr. Georgia Tourassi and Dr. Lynne Penberthy
Real world evidence: Surveillance, Epidemiology, and End Results (SEER) Registries
SEER Registries: > 850,000 diagnoses annually
https://seer.cancer.gov/registries/
Why is Population Level Cancer Surveillance Important?
On average, there is a two-year lag in National Cancer Reporting.
The MOSSAIC Challenge
Translational AI for better cancer surveillance and ultimately better cancer care.
Why is there a two-year lag in cancer reporting?
More than 90% of all cancers are histologically confirmed.
Site | Sub-site | Histology | Laterality | Behavior |
C50 | C501 | 8000 | 2 | 1 |
SEER*Data Management System
Standard Coding of Records
Abstracted information
https://training.seer.cancer.gov/casefinding/sources/pathology.html
Pathology Report
Co-designing innovative real-world solutions for automatic case classification
Standard data processing workflows across siloed systems for preparing unstructured text
Deep learning for extracting common data elements
https://github.com/DOE-NCI-MOSSAIC
Real World Solution
Available Deep Learning Tools
MT-CNN, MT-HiSAN, Clinical BigBird, PathBigBird, Deep Abstaining Classifier
Research and Development:
Multimodal/longitudinal DNN, Phrase Level Attention DNN, Tools for Bias Detection, Exposomic Data Linkage, Privacy Preserving FL, Synthetic Clinical Reports
BARDI for AI readiness of clinical data
Very firm nodularity 3 cm in circular diameter.
Two amorphous fibrofatty tissue masses
Many foci of calcification are present within
{'input_ids': [34, 12, 11304, …], 'attention_mask': [1, 1, 1, 1, 1, …]}
{'input_ids': [14, 2, 204, …], 'attention_mask': [1, 1, 1, 0, 0, …]}
{'input_ids': [7, 92, 3, 67, …], 'attention_mask': [1, 1, 1, 1, 1, …]}
Efficient Clinical Text Tokenization
Capability to train tokenizers from scratch on custom data (WordPiece, BPE and Unigram).
Streamlined utilization of publicly available, pre-trained tokenizers.
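The `input_ids` and `attention_mask` pairs shown on this slide can be illustrated with a minimal sketch. It assumes a toy whitespace tokenizer rather than WordPiece, BPE, or Unigram, and it is not BARDI's API: the names `build_vocab`, `encode`, `PAD_ID`, and `UNK_ID` are hypothetical. It only shows how padding produces the zeros in the attention mask.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical id used for padding positions
UNK_ID = 1  # hypothetical id for out-of-vocabulary words


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer id to every lowercased whitespace token."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int) -> Dict[str, List[int]]:
    """Map words to ids, truncate to max_len, then pad and build the mask."""
    ids = [vocab.get(w, UNK_ID) for w in text.lower().split()][:max_len]
    mask = [1] * len(ids)          # 1 marks a real token
    pad = max_len - len(ids)       # 0 marks a padding position
    return {"input_ids": ids + [PAD_ID] * pad, "attention_mask": mask + [0] * pad}


reports = [
    "Very firm nodularity 3 cm in circular diameter.",
    "Two amorphous fibrofatty tissue masses",
]
vocab = build_vocab(reports)
batch = [encode(r, vocab, max_len=8) for r in reports]
```

The second report has five words, so its last three positions are padding and its mask ends in zeros. A subword tokenizer would split rare clinical terms into smaller pieces instead of mapping unseen words to a single unknown id.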
Engineering Team: Patrycja Krawczuk and Dakota Murdock
https://github.com/DOE-NCI-MOSSAIC
FrESCO
Malignancy: Yes/No
Phenotype Classification: Site, Subsite, Histology, Behavior, Laterality
Pediatric Cancers: ICCC Classification
Hierarchical Self-Attention Networks
Case-Level Context
Development Team: Adam Spannaus, John Gounley, Zach Fox, Patrycja Krawczuk, Dakota Murdock, and Heidi Hanson.
HiSAN and Case-Level Context: Shang Gao
https://github.com/DOE-NCI-MOSSAIC
Real world evidence: AI for near real-time health surveillance covering 48% of the US population
Surveillance, Epidemiology, and End Results (SEER) Registries: > 850,000 diagnoses annually
https://seer.cancer.gov/registries/
Production implementation: Hierarchical Self-Attention Network (HiSAN) with Deep Abstention
Auto-Extraction from Pathology Reports:
Accuracy: Auto-coding of 23-27% of path reports (N ~ 230,000) with > 98% accuracy across all data elements
Abstention rates can be autotuned to fit the needs of the research project.
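The trade-off between how many reports are auto-coded and how accurate the retained predictions are can be sketched with a simple confidence threshold. This is a stand-in for illustration only: the Deep Abstaining Classifier learns when to abstain during training, whereas this sketch tunes a post hoc cutoff on held-out predictions. The function name and the toy numbers are hypothetical.

```python
from typing import List, Tuple


def tune_abstention_threshold(
    confidences: List[float], correct: List[bool], target_accuracy: float
) -> Tuple[float, float, float]:
    """Return (threshold, coverage, accuracy) for the lowest confidence
    threshold whose retained predictions meet target_accuracy.

    Predictions with confidence below the threshold are abstained on.
    Returns (inf, 0.0, 0.0) if no threshold reaches the target.
    """
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best = (float("inf"), 0.0, 0.0)
    n_correct = 0
    for kept, (conf, ok) in enumerate(ranked, start=1):
        n_correct += ok
        # Only evaluate at the end of a run of tied confidences.
        if kept < len(ranked) and ranked[kept][0] == conf:
            continue
        accuracy = n_correct / kept
        if accuracy >= target_accuracy:
            best = (conf, kept / len(ranked), accuracy)
    return best


# Hypothetical held-out predictions.
confidences = [0.99, 0.95, 0.90, 0.80, 0.60, 0.55]
correct = [True, True, True, False, True, False]
threshold, coverage, accuracy = tune_abstention_threshold(confidences, correct, 0.98)
```

Raising the target accuracy lowers coverage, which is the knob a registry would turn when deciding how many reports to route to manual review.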
Lexington, KY
Pittsburgh, PA
Boise, ID
Providence, RI
Portland, OR
Washington DC
Fargo, ND
Fayetteville, AR
Reproducible Result: Near Real-Time Identification of Veterans with Cancer
Training Data 2006 - 2018: N = 245,000
Testing Data 2019 - 2022: N = 4,200
J. Stringer, K. Rasmussen, V. Patel, C. Li, A. Halwani
Task | Accuracy | F1-Macro |
Site | 0.92 | 0.69 |
Histology | 0.81 | 0.34 |
Behavior | 0.96 | 0.59 |
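The gap between accuracy and macro F1 in the table above is the pattern expected when a few classes dominate. The sketch below uses hypothetical toy labels, not the study data, to show how a classifier can score high accuracy while macro F1 stays low because rare classes are missed.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: one common class and two rare ones.
y_true = ["common"] * 18 + ["rare_a", "rare_b"]
y_pred = ["common"] * 20  # a model that always predicts the common class

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here accuracy is 18/20 while two of the three per-class F1 scores are zero, so the macro average is pulled far below the accuracy. Histology, with many rare codes, is the task where this effect would be expected to be strongest.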
Prevailing Challenges to Achieving Near Real-Time Disease Surveillance
Computational limitations prevent scaling algorithms to the population level and have hindered the development and deployment of population health research tools.
Hospitals produce 50 petabytes of data in a year – 97% goes unused.
https://www.weforum.org/agenda/2019/12/four-ways-data-is-improving-healthcare/
Challenge 1
Charting a path to near real-time data analytics at population scale and data driven clinical decision-making tools: DOE leadership computing
2021-2022 ASCR Leadership Computing Challenge allocation
Title: “Next-Generation Scalable Deep Learning for Medical Natural Language Processing”. 130,000 node hours on OLCF Summit
2022-2023 ASCR Leadership Computing Challenge allocation
Title: “Privacy-preserving Transformer models for clinical natural language processing”. 150,000 node hours on OLCF Summit and 30,000 node hours on OLCF Frontier
2023-2024 ASCR Leadership Computing Challenge allocation
Title: “Privacy enabled tumor classification for near real time population health analytics”. 140,000 node hours on OLCF Frontier
Sustained computing support from DOE over MOSSAIC project lifetime
Total of 560,000 Summit node hours through the ALCC program
Approximately 325,000 additional Summit node hours provided via the Exascale Computing Project, OLCF Director’s Discretionary, and the OLCF Summit Early Science programs
CITADEL
Data Complexity: Heterogeneity in spades
Challenge 2
Harmonizing Unstructured Data with Common Data Models
NCI SEER Data
Other Common Data Models
Common Data Model Harmonization (CDMH) and Open Standards for Evidence Generation
https://aspe.hhs.gov/sites/default/files/private/pdf/259016/CDMH-Final-Report-14August2020.pdf
Automatic Classification for Common Data Models
CITADEL
Mayanka Chandrashekar et al., Path-BigBird: An AI-Driven Transformer Approach to Classification of Cancer Pathology Reports. JCO Clin Cancer Inform 8, e2300148 (2024). DOI: 10.1200/CCI.23.00148
Path-BigBird
Architecture: BERT based BigBird
Data: 2,772,103 pathology reports from six SEER registries
Document Classification: North American Association of Central Cancer Registries
Automatic Classification for Common Data Models: Utilizing longitudinal multimodal patient information for near real-time phenotype classification
Multimodal Ensembles for Identifying Recurrent Disease
61,550 pathology reports
Zach Fox, Patrycja Krawczuk, Dakota Murdock, John Gounley, Heidi Hanson
[Figure: bar chart of # Reports by grouped class (unknown, recurrent, existing disease, new diagnosis); bar labels 20499, 10079, 12504, 18469]
Automatic Classification for Common Data Models: Utilizing longitudinal multimodal patient information for near real-time phenotype classification
Multimodal Ensemble: Trained using oncology specific data from SEER registries
Overall Accuracy: 73.40%
| unknown | recurrent | existing disease | new diagnosis | support |
unknown | 75.89% | 6.58% | 5.42% | 12.11% | 1825 |
recurrent | 7.29% | 66.70% | 10.89% | 15.12% | 946 |
existing disease | 8.94% | 7.92% | 64.47% | 18.67% | 1275 |
new diagnosis | 8.77% | 4.81% | 6.08% | 80.34% | 1892 |
Off the Shelf: Clinical Longformer fine-tuned on the Recurrence Classification Task
Overall Accuracy: 53.40%
| unknown | recurrent | existing disease | new diagnosis | support |
unknown | 62.08% | 7.95% | 9.26% | 20.71% | 1825 |
recurrent | 15.01% | 40.06% | 13.53% | 31.40% | 946 |
existing disease | 13.88% | 14.59% | 39.22% | 32.31% | 1275 |
new diagnosis | 17.97% | 8.30% | 12.47% | 61.26% | 1892 |
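The overall accuracies can be cross-checked against the two confusion matrices above. The sketch assumes each row is normalized over the true class, so the diagonal entry is that class's recall; overall accuracy is then the support-weighted mean of the diagonal. The variable names are hypothetical.

```python
# Diagonal entries (per-class recall, in percent) and per-class support,
# copied from the first and second confusion matrices above.
support = {"unknown": 1825, "recurrent": 946, "existing disease": 1275, "new diagnosis": 1892}
recall_first = {"unknown": 75.89, "recurrent": 66.70, "existing disease": 64.47, "new diagnosis": 80.34}
recall_second = {"unknown": 62.08, "recurrent": 40.06, "existing disease": 39.22, "new diagnosis": 61.26}


def overall_accuracy(recall_pct, support):
    """Support-weighted mean of per-class recall, in percent."""
    total = sum(support.values())
    return sum(recall_pct[c] * support[c] for c in support) / total


acc_first = overall_accuracy(recall_first, support)
acc_second = overall_accuracy(recall_second, support)
```

Rounded to one decimal place, the two values agree with the reported 73.40% and 53.40%, which confirms that the first matrix goes with the higher overall accuracy and the second with the lower one.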
Identifying patients with recurrent disease
Zach Fox, Patrycja Krawczuk, Dakota Murdock, John Gounley, Heidi Hanson
Regulatory hurdles prevent pooling of data across health care institutions in the United States
Challenge 3
Why do we need to pool data?
Near real-time health data analytics with privacy aware federated learning
The National AI Research Resource (NAIRR) Task Force
National Childhood Cancer Registry
DOE ASCR Biopreparedness (BRaVE) Funding
Our current implementation
Citadel
Utah
Seattle
New Jersey
New Mexico
Louisiana
Kentucky
Trusted host
Differential privacy
High privacy
High accuracy
Privacy/accuracy tradeoff
Innovation: SEER data make us completely innovative in this space. We are able to design and test solutions for real-world application at population scale.
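The privacy/accuracy tradeoff named on this slide can be sketched with a generic federated averaging step in which each site's model update is clipped and Gaussian noise is added to the average. This is an illustrative sketch under stated assumptions, not the project's implementation: the function names are hypothetical, the noise scale is a free parameter, and no formal privacy accounting is performed.

```python
import math
import random
from typing import List


def clip(update: List[float], max_norm: float) -> List[float]:
    """Scale an update so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]


def private_federated_average(
    site_updates: List[List[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Average clipped site updates and add Gaussian noise to the mean.

    noise_multiplier = 0 gives plain federated averaging; larger values
    add more noise, trading accuracy for privacy.
    """
    clipped = [clip(u, max_norm) for u in site_updates]
    n_sites = len(clipped)
    sigma = noise_multiplier * max_norm / n_sites
    return [
        sum(u[i] for u in clipped) / n_sites + rng.gauss(0.0, sigma)
        for i in range(len(clipped[0]))
    ]


# Hypothetical two-parameter updates from three sites.
updates = [[3.0, 4.0], [0.0, 0.0], [0.6, 0.8]]
exact = private_federated_average(updates, 1.0, 0.0, random.Random(0))
noisy = private_federated_average(updates, 1.0, 1.0, random.Random(0))
```

Only the clipped, aggregated update leaves the trusted host, so raw reports stay at each registry. Clipping bounds how much any single site can move the average, and the noise multiplier sets where the model sits between high privacy and high accuracy.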
Integration of diverse types of social and environmental determinants of health data across space and time requires
Number of Deaths and Percentage of Disability-Adjusted Life-Years Related to the 17 Leading Risk Factors in the United States, 2016. Negative values (where bars extend left of zero) indicate a protective effect.
The State of US Health, 1990-2016: Burden of Diseases, Injuries, and Risk Factors Among US States. JAMA. 2018;319(14):1444-1472. doi:10.1001/jama.2018.0158
Challenge 4
Air pollution exposure affects spontaneous pregnancy loss
Leiser CL, Hanson HA, et al. Acute effects of air pollutants on spontaneous pregnancy loss: a case-crossover study. Fertility and Sterility 111, 341-347 (2019)
1,398 women with a spontaneous pregnancy loss, 2007 – 2015
https://www.theguardian.com/environment/2019/jan/11/air-pollution-as-bad-as-smoking-in-increasing-risk-of-miscarriage
Exposure to industrial air pollution is associated with decreased male fertility
https://www.epa.gov/rsei/rsei-geographic-microdata-rsei-gm
estimated average daily concentration of PCBs (μg/m3) along the Wasatch Front in 2005
Azoospermia
Ramsay JM, Fendereski K, Horns JJ, VanDerslice JA, Hanson HA, Emery BR, Halpern JA, Aston KI, Ferlic E, Hotaling JM. Environmental exposure to industrial air pollution is associated with decreased male fertility. Fertil Steril. 2023 May 15:S0015-0282(23)00514-9. doi: 10.1016/j.fertnstert.2023.05.143. Epub ahead of print. PMID: 37196750.
SEER Residential History Data
LexisNexis Residential History data
Zaria Tatalovich, Lynne Penberthy, Kevin Henry, David Stinchcomb
estimated average daily concentration of PCBs (μg/m3) along the Wasatch Front in 2005
RSEI Microdata
Air Pollution Data
Indoor Radon
Near real-time analytics at scale and data driven clinical decision-making tools: Medium Range Goals
Investors and technical leads
EHRLICH: Electronic Health Record informed LagrangIan method for preCision public Health
HPC for Biopreparedness
The U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR), as part of the Biopreparedness Research Virtual Environment (BRaVE) initiative.
Program Manager: Dr. Margaret Lentz
Advancing epidemiological modeling capabilities, near real-time disease surveillance and scenario forecasting
Best case scenario
Worst case scenario
ABM for modeling hospital capacity
Implement data-driven epidemiological modeling for near-real-time disease projection and rapid response to large-scale public health threats
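As a rough illustration of how an agent-based model (ABM) can turn transmission assumptions into hospital-load scenarios, here is a toy susceptible-infectious-recovered simulation. Every parameter, the random-mixing contact pattern, and the function name `simulate` are assumptions for illustration; this is not the EHRLICH model.

```python
import random
from typing import Dict, List


def simulate(
    n_agents: int,
    n_seed: int,
    contacts_per_day: int,
    p_transmit: float,
    p_hospital: float,
    days_infectious: int,
    n_days: int,
    seed: int,
) -> List[Dict[str, int]]:
    """Run a toy agent-based SIR model and return daily counts."""
    rng = random.Random(seed)
    state = ["S"] * n_agents          # "S", "I" or "R" for each agent
    days_left = [0] * n_agents        # remaining infectious days
    in_hospital = [False] * n_agents
    for i in rng.sample(range(n_agents), n_seed):  # initial infections
        state[i], days_left[i] = "I", days_infectious
    history = []
    for _ in range(n_days):
        # Each infectious agent meets random agents and may infect them.
        newly_infected = set()
        for i in range(n_agents):
            if state[i] != "I":
                continue
            for j in rng.choices(range(n_agents), k=contacts_per_day):
                if state[j] == "S" and rng.random() < p_transmit:
                    newly_infected.add(j)
        # Existing infections progress toward recovery.
        for i in range(n_agents):
            if state[i] == "I":
                days_left[i] -= 1
                if days_left[i] == 0:
                    state[i], in_hospital[i] = "R", False
        # New infections start; a fraction is hospitalized while infectious.
        for j in sorted(newly_infected):
            state[j], days_left[j] = "I", days_infectious
            in_hospital[j] = rng.random() < p_hospital
        history.append({
            "S": state.count("S"),
            "I": state.count("I"),
            "R": state.count("R"),
            "hospitalized": sum(in_hospital),
        })
    return history


# Two hypothetical scenarios that differ only in transmission probability.
best_case = simulate(2000, 5, 8, 0.02, 0.1, 7, 60, seed=1)
worst_case = simulate(2000, 5, 8, 0.08, 0.1, 7, 60, seed=1)
peak_best = max(day["hospitalized"] for day in best_case)
peak_worst = max(day["hospitalized"] for day in worst_case)
```

Comparing each scenario's peak against a bed count gives a simple capacity check. A production model would replace random mixing with realistic contact networks and calibrate the parameters to surveillance data.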
ENABLE: EHRLICH National-scale Agent-Based Learning Environment
Modular and Interoperable Software for Biopreparedness Decision Support
ChatGPT
Everything should be made as simple as possible, but not simpler ~ Einstein
FrESCO: Near Real-Time Information Retrieval
C-HER: Contextualizing Environments
ARC: Extracting Information from Scientific Publication
UrbanPop: Contact Networks for Contagion Models
UQ for Data
UQ for ABMs
Agent Based Modeling for Rapid Response
Interoperable Systems for AI at Scale: Bringing Compute Closer to the Source
Rapid Data Integration while Protecting Private Information
Standard data processing workflows across siloed systems for preparing unstructured text
Federated Learning for extracting common data elements
Citadel
A
B
C
D
Silo A
Silo B
Silo C
Silo D
Problem: Health Data Silos
Past Solution: Centralization of data or Federated Data Structures
- Slow and costly
- Designed for a single purpose
- Rule based ETL for harmonization
Our Solution:
FrESCO: Near Real-Time Information Retrieval
Improving the accuracy of information retrieval models
Attention Mechanisms in Clinical Text Classification
Structure of the general document classification attention model. The text-encoder segment represents a CNN-, RNN-, or transformer-based architecture. The label-attention layer utilizes implicit (i.e., random embeddings) or explicit (e.g., embedded textual code descriptions) auxiliary information. The multi-label classification models use a sigmoid activation function after the output layer and the binary-cross-entropy objective function.
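The label-attention layer described in the caption above can be sketched in a few lines. The sketch assumes the text encoder has already produced one vector per token, uses plain dot-product attention with one query vector per label, and applies a sigmoid with binary cross-entropy for the multi-label case. All function names and the tiny dimensions are hypothetical, and the published models may differ in detail.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def label_attention(tokens: Matrix, label_emb: Matrix, out_w: Matrix, out_b: Vector) -> Vector:
    """Return one independent probability per label.

    tokens:    encoder outputs, one vector per token
    label_emb: one query vector per label (random or from code descriptions)
    out_w, out_b: per-label output weights and bias
    """
    probs = []
    for query, w, b in zip(label_emb, out_w, out_b):
        weights = softmax([dot(query, h) for h in tokens])  # attention over tokens
        context = [sum(a * h[d] for a, h in zip(weights, tokens)) for d in range(len(query))]
        probs.append(sigmoid(dot(w, context) + b))
    return probs


def binary_cross_entropy(probs: Vector, targets: List[int]) -> float:
    """Mean binary cross-entropy over labels."""
    eps = 1e-12
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    ) / len(probs)


# Hypothetical 2-token, 2-dimensional example with two labels.
tokens = [[1.0, 0.0], [0.0, 1.0]]
label_emb = [[0.0, 0.0], [10.0, 0.0]]
out_w = [[1.0, 1.0], [1.0, -1.0]]
out_b = [0.0, 0.0]
probs = label_attention(tokens, label_emb, out_w, out_b)
loss = binary_cross_entropy(probs, [1, 0])
```

Because each label has its own query, each label attends to different tokens in the same document, and the sigmoid lets several labels be active at once instead of forcing the probabilities to sum to one.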
FrESCO: Near Real-Time Information Retrieval
Christoph Metzner, Shang Gao, Drahomira Herrmannova, Elia Lima-Walton, Heidi Hanson
Accepted IEEE Journal of Biomedical and Health Informatics – Selected for highlight and journal cover
Centralized Health and Environmental Repository (C-HER)
C-HER: Contextualizing Environments
Measures are “spatially joined” to a polygon (hex)
While not perfect… Hexes can scale in and out across different geographies and measures can be aggregated accordingly.
RSEI Microdata converted to Uber hexes
Hexagon size: ~36 km² to 1,770 km²
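The idea of spatially joining measures to hexagons can be sketched on a flat plane. The code assigns each point to the pointy-top hexagon that contains it, using axial coordinates and cube rounding, and then averages the measure per hexagon. It is a planar approximation for illustration, not the Uber H3 library, which indexes latitude and longitude on the globe; all names and sample values are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Hex = Tuple[int, int]  # axial coordinates (q, r)


def _cube_round(q: float, r: float) -> Hex:
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Reset the component with the largest rounding error so q + r + s = 0.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def point_to_hex(x: float, y: float, size: float) -> Hex:
    """Return the pointy-top hexagon (center-to-corner = size) containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _cube_round(q, r)


def hex_center(h: Hex, size: float) -> Tuple[float, float]:
    """Return the planar center of hexagon h."""
    q, r = h
    return (size * math.sqrt(3) * (q + r / 2), size * 1.5 * r)


def join_to_hexes(points: Iterable[Tuple[float, float, float]], size: float) -> Dict[Hex, float]:
    """Average the measure of all (x, y, value) points falling in each hexagon."""
    groups = defaultdict(list)
    for x, y, value in points:
        groups[point_to_hex(x, y, size)].append(value)
    return {h: sum(v) / len(v) for h, v in groups.items()}


# Hypothetical (x, y, concentration) samples in arbitrary planar units.
samples = [(0.1, 0.1, 2.0), (-0.1, 0.2, 4.0), (2.6, -1.5, 10.0)]
by_hex = join_to_hexes(samples, size=1.0)
```

Calling `join_to_hexes` with a larger `size` aggregates the same samples into coarser cells, which is the planar analogue of scaling hexes in and out across geographies.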
Joemy Ramsay, PhD
Centralized Health and Environmental Repository (C-HER)
C-HER: Contextualizing Environments
Currently processing 73 social and environmental determinants of health datasets
2030
Disease detection at the speed of life: Near real-time disease surveillance at population scale
Integrated Health Security Surveillance Response Tools
ChatGPT
Dual Purpose
Rapid Clinical Decision Making
Rapid Response for Population Health
Team science
Computational Scientists, Engineers, Biostatisticians, Biomedical Informaticists, Geneticists, Subject Matter Experts
https://xkcd.com/793/
Challenge 5
MOSSAIC
National Cancer Institute
Lynne Penberthy (PI)
Elizabeth Hsu (Technical Lead)
Valentina Petkov
Serban Negoita
Ola Adeyemi
Sylkk Ansah
Sarah Bonds
IMS
Linda Coyle
Jennifer Stevens
Scott Depuy
Rusty Sheilds
Gary Beverungen
Los Alamos National Laboratory
Jamaludin Mohd Yusof
Sayera Dhaubhadel
Tanmoy Bhattacharya
Oak Ridge National Laboratory
Heidi Hanson (PI)
John Gounley (Technical Lead)
Shelaine Curd (Program Manager)
Georgia Tourassi
Joe Lake
Adam Spannaus
Dakota Murdock
Zachary Fox
Patrycja Krawczuk
Dakotah Maguire
Jordan Miller
Mayanka Chandrashekar
Noah Schaefferkoetter
Sajal Dash
Isaac Lyngaas
Abhishek Shivanna
Robert Bridges
Christopher Stanley
Vandy Tombs
Christoph Metzner
EHRLICH
Oak Ridge National Laboratory
Heidi Hanson (PI)
John Gounley (Technical Lead)
Shelaine Curd (Program Manager)
Adam Spannaus
Zachary Fox
Patrycja Krawczuk
Dakotah Maguire
Mayanka Chandrashekar
Robert Bridges
Christopher Stanley
Vandy Tombs
Sudip Seal
James Nutaro
Sifat Moon
Christoph Metzner
Los Alamos National Laboratory
Jamal Mohd-Yusof
Cristina Garcia Cardona
Argonne National Laboratory
Rick Stevens
Thomas Brettin
University of Utah
James VanDerslice
Joemy Ramsay
University of Arizona
Nirav Merchant
Ravi Tandon
University of Southern California
Rima Habre
Duke
Amanda Randles
University of Chicago/Morehouse School of Medicine
Lilly Immergluck
Discussion