1 of 43

Disease detection at the speed of life: Near real-time disease surveillance at population scale

Heidi A. Hanson, PhD

Senior Scientist, Advanced Computing for Health Sciences | Computational Sciences and Engineering Division

Oak Ridge, TN

March 05, 2024

ORNL is managed by UT-Battelle LLC for the US Department of Energy


2 of 43

Acknowledgements

  • This material is based upon work supported by the following:
    • The U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR), as part of Dr. Margaret Lentz’s Biopreparedness Research Virtual Environment (BRaVE) initiative.
    • UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy.
    • National Cancer Institute (NCI) – DOE Collaboration under the Cancer Moonshot℠ initiative.


3 of 43

Vision: Near real-time analytics at scale and data-driven clinical decision-making tools

Conceptual Model: Heidi Hanson

Graphic Design: Brenna Kelley


4 of 43

Investigators and technical leads

  • Principal Investigators
    • Dr. Lynne Penberthy, Associate Director of the Surveillance Research Program, NCI
    • Dr. Heidi Hanson, Group Leader, Biostatistics and Biomedical Informatics, ORNL
  • Technical Leads
    • Dr. Elizabeth Hsu, Chief, Surveillance Informatics Branch, NCI
    • Dr. John Gounley, Group Leader, Scalable Biomedical Modeling, ORNL

NCI: National Cancer Institute

DOE: Department of Energy

Cancer driving computing advances

Computing driving cancer advances


5 of 43

Joint Design of Advanced Computing Solutions for Cancer (JDACS4C): DOE-NCI Partnership

Enabling the most challenging deep learning problems in cancer research to run on the most capable supercomputers in the DOE

Pilot 3: Dr. Georgia Tourassi and Dr. Lynne Penberthy


6 of 43

Real world evidence: Surveillance, Epidemiology, and End Results (SEER) Registries

Surveillance, Epidemiology, and End Results (SEER) Registries: > 850,000 diagnoses annually

https://seer.cancer.gov/registries/

Why is Population Level Cancer Surveillance Important?

  • Health Security
  • Cancer is the second leading cause of death in the US
  • Supports research that improves cancer outcomes.

On average, there is a two-year lag in National Cancer Reporting.

  • Manual extraction of data elements from clinical text
  • Coding to North American Association of Central Cancer Registries (NAACCR) standards
  • Centralized data management


7 of 43

The MOSSAIC Challenge

Translational AI for better cancer surveillance and ultimately better cancer care.


8 of 43

Why is there a two-year lag in cancer reporting?

More than 90% of all cancers are histologically confirmed.

Site | Sub-site | Histology | Laterality | Behavior
C50 | C501 | 8000 | 2 | 1

SEER*Data Management System

Standard Coding of Records

Abstracted information

https://training.seer.cancer.gov/casefinding/sources/pathology.html

Pathology Report


9 of 43

Co-designing innovative real-world solutions for automatic case classification

Standard data processing workflows across siloed systems for preparing unstructured text

Deep learning for extracting common data elements

https://github.com/DOE-NCI-MOSSAIC

Real World Solution

Available Deep Learning Tools

MT-CNN, MT-HiSAN, Clinical BigBird, Path-BigBird, Deep Abstaining Classifier

Research and Development:

Multimodal/longitudinal DNN, Phrase Level Attention DNN, Tools for Bias Detection, Exposomic Data Linkage, Privacy Preserving FL, Synthetic Clinical Reports


10 of 43

BARDI for AI readiness of clinical data

  • BARDI: our AI-readiness package for clinical data
  • Increasingly, de-identification and generative AI workflows operate on token-level data
  • To better support these workflows, we added built-in tokenization support in BARDI
  • Uses standard libraries such as Hugging Face 🤗
  • Now ready for NAIRR research in de-identification and synthetic data generation

Example clinical text:

Very firm nodularity 3 cm in circular diameter.

Two amorphous fibrofatty tissue masses

Many foci of calcification are present within

Tokenized output:

{'input_ids': [34, 12, 11304, …], 'attention_mask': [1, 1, 1, 1, 1, …]}

{'input_ids': [14, 2, 204, …], 'attention_mask': [1, 1, 1, 0, 0, …]}

{'input_ids': [7, 92, 3, 67, …], 'attention_mask': [1, 1, 1, 1, 1, …]}
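The zeros in the second attention mask above mark padding positions. A minimal sketch of how such a batch could be assembled is given here, assuming right-padding with a hypothetical pad id of 0. The token ids are made up for illustration, and the function `pad_batch` is not part of BARDI or Hugging Face.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad token-id sequences to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)            # padded ids
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three report snippets of different lengths.
batch = pad_batch([[34, 12, 11304, 7, 9], [14, 2, 204], [7, 92, 3, 67, 5]])
print(batch["attention_mask"][1])
```

The model ignores positions where the mask is 0, so short and long reports can share one batch.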

Efficient Clinical Text Tokenization

Capability to train tokenizers from scratch on custom data (WordPiece, BPE and Unigram).

Streamlined utilization of publicly available, pre-trained tokenizers.
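To make "training a tokenizer from scratch" concrete, here is a toy byte-pair encoding (BPE) trainer in plain Python. It is a sketch of the general BPE idea only, not the BARDI or Hugging Face implementation; the function names `train_bpe` and `encode` and the three-line corpus are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def _merge(symbols: Tuple[str, ...], pair: Tuple[str, str]) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from whitespace-split, lowercased words."""
    words: Dict[Tuple[str, ...], int] = Counter(
        tuple(word) for line in corpus for word in line.lower().split()
    )
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken lexicographically for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged: Dict[Tuple[str, ...], int] = Counter()
        for symbols, freq in words.items():
            merged[_merge(symbols, best)] += freq
        words = merged
    return merges


def encode(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word.lower())
    for pair in merges:
        symbols = _merge(symbols, pair)
    return list(symbols)


corpus = [
    "Very firm nodularity 3 cm in circular diameter.",
    "Two amorphous fibrofatty tissue masses",
    "Many foci of calcification are present within",
]
merges = train_bpe(corpus, num_merges=10)
print(encode("calcification", merges))
```

A production tokenizer would add a vocabulary-to-id mapping, special tokens, and byte-level fallback; those are omitted here.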

Engineering Team: Patrycja Krawczuk and Dakota Murdock

https://github.com/DOE-NCI-MOSSAIC


11 of 43

FrESCO

Malignancy: Yes/No

Phenotype Classification: Site, Subsite, Histology, Behavior, Laterality

Pediatric Cancers: ICCC Classification

Hierarchical Self-Attention Networks

Case-Level Context

Development Team: Adam Spannaus, John Gounley, Zach Fox, Patrycja Krawczuk, Dakota Murdock, and Heidi Hanson.

HiSAN and Case-Level Context: Shang Gao

https://github.com/DOE-NCI-MOSSAIC


12 of 43

Real world evidence: AI for near real-time health surveillance covering 48% of the US population

Surveillance, Epidemiology, and End Results (SEER) Registries: > 850,000 diagnoses annually

https://seer.cancer.gov/registries/

Auto-Extraction from Pathology Reports:

Accuracy: Auto-coding of 23-27% of path reports (N ~ 230,000) with > 98% accuracy across all data elements

  • Phenotype classifications:
    • Site = 70 categories
    • Sub-site = 324 categories
    • Histology = 626 categories
    • Laterality = 7 categories
    • Behavior = 4 categories

Abstention rates can be autotuned to fit the needs of the research project.
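One simple way to think about tuning abstention is as a search for the confidence threshold that keeps as many reports as possible while still meeting a target accuracy on the reports that are auto-coded. The sketch below shows that idea on made-up validation data. It is a post-hoc thresholding illustration, not the training procedure of the Deep Abstaining Classifier, and `tune_abstention_threshold` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def tune_abstention_threshold(
    confidences: List[float], correct: List[bool], target_accuracy: float
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, coverage, accuracy) with the largest coverage whose
    retained predictions reach `target_accuracy`, or None if no threshold works.
    Predictions with confidence below the threshold are abstained on."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    best = None
    n_correct = 0
    for kept, (conf, ok) in enumerate(ranked, start=1):
        n_correct += int(ok)
        # Only evaluate where the next confidence value differs (a valid cut point).
        if kept < len(ranked) and ranked[kept][0] == conf:
            continue
        accuracy = n_correct / kept
        if accuracy >= target_accuracy:
            best = (conf, kept / len(ranked), accuracy)
    return best


# Made-up validation set: model confidence and whether the prediction was right.
confidences = [0.99, 0.97, 0.95, 0.90, 0.80, 0.70, 0.60, 0.55]
correct = [True, True, True, True, False, True, False, False]
result = tune_abstention_threshold(confidences, correct, target_accuracy=0.98)
print(result)  # (0.9, 0.5, 1.0): auto-code half of the reports, abstain on the rest
```

Raising or lowering `target_accuracy` trades coverage against accuracy, which is the knob a research project would adjust.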

Production implementation of the Hierarchical Self-Attention Network (HiSAN) with Deep Abstention:

  • Total 16 registries (~31% of US population)
  • Default as part of any new Data Management System installation, regardless of SEER affiliation
  • 5+ new registries anticipated in 2024
  • Testing phase with the Veterans Health Administration


13 of 43

Lexington, KY

Pittsburgh, PA

Boise, ID

Providence, RI

Portland, OR

Washington DC

Fargo, ND

Fayetteville, AR

Reproducible Result: Near Real-Time Identification of Veterans with Cancer

Training Data 2006 - 2018: N = 245,000

Testing Data 2019 - 2022: N = 4,200

J. Stringer, K. Rasmussen, V. Patel, C. Li, A. Halwani

Task | Accuracy | F1-Macro
Site | 0.92 | 0.69
Histology | 0.81 | 0.34
Behavior | 0.96 | 0.59
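The gap between accuracy and macro F1 in the table is what one would expect when a few classes dominate and many rare classes are predicted poorly. The toy example below, using made-up labels rather than the VA data, shows how a classifier can score high accuracy and low macro F1 at the same time.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up imbalanced task: one common class and two rare classes.
y_true = ["common"] * 90 + ["rare_a"] * 5 + ["rare_b"] * 5
y_pred = ["common"] * 100  # a model that never predicts the rare classes

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, macro F1={f1:.2f}")
```

Here the model is right 90% of the time, yet macro F1 is about 0.32 because two of the three classes have an F1 of zero.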


14 of 43

Prevailing Challenges to Achieving Near Real-Time Disease Surveillance

  • Computational limitations prevent scaling algorithms to the population level and have hindered the development and deployment of population health research tools. 
  • Data complexity and regulatory hurdles prevent pooling of data across health care institutions in the US
  • Integration of diverse types of social and environmental determinants of health data across space and time requires advanced analytical methods and computational workflows
  • Team science: Computational Scientists, Engineers, Biostatisticians, Biomedical Informaticists, Subject Matter Experts


15 of 43

Computational limitations prevent scaling algorithms to the population level and have hindered the development and deployment of population health research tools.

Hospitals produce 50 petabytes of data in a year – 97% goes unused.

https://www.weforum.org/agenda/2019/12/four-ways-data-is-improving-healthcare/

Challenge 1


16 of 43

Charting a path to near real-time data analytics at population scale and data-driven clinical decision-making tools: DOE leadership computing

2021-2022 ASCR Leadership Computing Challenge allocation

Title: “Next-Generation Scalable Deep Learning for Medical Natural Language Processing”. 130,000 node hours on OLCF Summit

2022-2023 ASCR Leadership Computing Challenge allocation

Title: “Privacy-preserving Transformer models for clinical natural language processing”. 150,000 node hours on OLCF Summit and 30,000 node hours on OLCF Frontier

  • Use CITADEL, the OLCF secure computing capability
  • Secure file system and compute node access: runs normal jobs on Summit compute nodes in a HIPAA-compliant manner

2023-2024 ASCR Leadership Computing Challenge allocation

Title: “Privacy enabled tumor classification for near real time population health analytics”. 140,000 node hours on OLCF Frontier

Sustained computing support from DOE over MOSSAIC project lifetime

Total of 560,000 Summit node hours through the ALCC program

Approximately 325,000 additional Summit node hours provided via the Exascale Computing Project, OLCF Director’s Discretionary, and the OLCF Summit Early Science programs

CITADEL


17 of 43

Data Complexity: Heterogeneity in spades

  • Heterogeneous data
    • Structured data (tabular information)
    • Unstructured data (clinical notes and diagnostic reports)
    • Imaging
    • Biomarkers
  • Heterogeneous systems
    • Cerner, Epic, and others
  • Heterogeneous data quality

Challenge 2


18 of 43

Harmonizing Unstructured Data with Common Data Models

NCI SEER Data

  • North American Association of Central Cancer Registries (NAACCR)
    • Data standards for hospital and central cancer registries
    • Developed collaboratively by the CDC NPCR, NCI SEER, American College of Surgeons, and the Canadian Council of Cancer Registries

Other Common Data Models

  • Sentinel
  • Patient-Centered Outcomes Research Network (PCORnet)
  • Informatics for Integrating Biology & the Bedside (i2b2)
  • Observational Medical Outcomes Partnership (OMOP)

Common Data Model Harmonization (CDMH) and Open Standards for Evidence Generation

https://aspe.hhs.gov/sites/default/files/private/pdf/259016/CDMH-Final-Report-14August2020.pdf


19 of 43

Automatic Classification for Common Data Models

  • Histology: Compared with our production-level HiSAN model, Path-BigBird shows a small increase in micro F1 (80.69 vs. 79.25) and a moderate increase in macro F1 (37.04 vs. 33.22). HiSAN F1 is higher than that of Clinical BigBird.
  • This suggests that models trained from scratch perform better on rare and hard-to-predict classes
  • Less computationally expensive models trained on domain-specific data have utility in resource-constrained environments

CITADEL

Mayanka Chandrashekar et al., Path-BigBird: An AI-Driven Transformer Approach to Classification of Cancer Pathology Reports. JCO Clin Cancer Inform 8, e2300148 (2024). DOI: 10.1200/CCI.23.00148

Path-BigBird

Architecture: BERT-based BigBird

Data: 2,772,103 pathology reports from six SEER registries

Document Classification: North American Association of Central Cancer Registries


20 of 43

Automatic Classification for Common Data Models

Utilizing longitudinal multimodal patient information for near real-time phenotype classification

Multimodal Ensembles for Identifying Recurrent Disease

61,550 pathology reports

Zach Fox, Patrycja Krawczuk, Dakota Murdock, John Gounley, Heidi Hanson

Figure: number of reports (# Reports) by grouped class. Class labels: unknown, recurrent, existing disease, new diagnosis. Bar values as extracted: 20499, 10079, 12504, 18469.


21 of 43

Automatic Classification for Common Data Models

Utilizing longitudinal multimodal patient information for near real-time phenotype classification

Multimodal Ensemble: Trained using oncology-specific data from SEER registries

Overall Accuracy: 73.40%

Class | unknown | recurrent | existing disease | new diagnosis | support
unknown | 75.89% | 6.58% | 5.42% | 12.11% | 1825
recurrent | 7.29% | 66.70% | 10.89% | 15.12% | 946
existing disease | 8.94% | 7.92% | 64.47% | 18.67% | 1275
new diagnosis | 8.77% | 4.81% | 6.08% | 80.34% | 1892

Off the Shelf: Clinical Longformer fine-tuned on the Recurrence Classification Task

Overall Accuracy: 53.40%

Class | unknown | recurrent | existing disease | new diagnosis | support
unknown | 62.08% | 7.95% | 9.26% | 20.71% | 1825
recurrent | 15.01% | 40.06% | 13.53% | 31.40% | 946
existing disease | 13.88% | 14.59% | 39.22% | 32.31% | 1275
new diagnosis | 17.97% | 8.30% | 12.47% | 61.26% | 1892
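The overall accuracies can be cross-checked against the per-class percentages. The sketch below assumes each matrix row is a true class, that rows are normalized to 100%, and that "support" is the number of test reports in that class; under those assumptions, overall accuracy is the support-weighted average of the diagonal. The helper name `overall_accuracy` is hypothetical.

```python
from typing import List


def overall_accuracy(row_pct: List[List[float]], support: List[int]) -> float:
    """Support-weighted mean of the diagonal of a row-normalized confusion matrix."""
    correct = sum(row[i] / 100.0 * n for i, (row, n) in enumerate(zip(row_pct, support)))
    return correct / sum(support)


# Class order: unknown, recurrent, existing disease, new diagnosis.
support = [1825, 946, 1275, 1892]

multimodal_ensemble = [
    [75.89, 6.58, 5.42, 12.11],
    [7.29, 66.70, 10.89, 15.12],
    [8.94, 7.92, 64.47, 18.67],
    [8.77, 4.81, 6.08, 80.34],
]
clinical_longformer = [
    [62.08, 7.95, 9.26, 20.71],
    [15.01, 40.06, 13.53, 31.40],
    [13.88, 14.59, 39.22, 32.31],
    [17.97, 8.30, 12.47, 61.26],
]

ensemble_acc = overall_accuracy(multimodal_ensemble, support)
longformer_acc = overall_accuracy(clinical_longformer, support)
print(f"ensemble: {ensemble_acc:.3f}, off the shelf: {longformer_acc:.3f}")
```

Both values land within rounding of the reported 73.40% and 53.40%. The diagonal entries are the per-class recalls, which show that the recurrent class is the hardest for both models.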

Identifying patients with recurrent disease

Zach Fox, Patrycja Krawczuk, Dakota Murdock, John Gounley, Heidi Hanson


22 of 43

Regulatory hurdles prevent pooling of data across health care institutions in the United States

Challenge 3


23 of 43

Why do we need to pool data?

  • Rare diseases
  • Situational awareness during a pandemic
  • Data and algorithm bias
  • Real world solutions to health care and surveillance


24 of 43

Near real-time health data analytics with privacy aware federated learning

The National AI Research Resource (NAIRR) Task Force

National Childhood Cancer Registry

DOE ASCR Biopreparedness (BRaVE) Funding

Our current implementation

  • Cross Silo
  • Horizontal
  • Model Centric – Need for Cooperative Agreement between institutions/participants
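In a cross-silo, horizontal setup, every site holds the same kind of records for different patients, trains locally, and shares only model parameters with a trusted host. The sketch below illustrates that pattern with federated averaging (FedAvg) on a tiny linear model. It is an assumption-laden toy, not the project's implementation: the silo data are synthetic, and `local_update`, `federated_average`, and `train` are hypothetical names.

```python
from typing import List, Tuple

Silo = List[Tuple[float, float]]  # (x, y) records that never leave the site


def local_update(weights: List[float], data: Silo, lr: float = 0.05, epochs: int = 20) -> List[float]:
    """Run gradient descent on one silo's data for the model y = w * x + b."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def federated_average(updates: List[List[float]], sizes: List[int]) -> List[float]:
    """Trusted host: average the site models, weighted by local sample count."""
    total = sum(sizes)
    return [sum(u[k] * n for u, n in zip(updates, sizes)) / total for k in range(len(updates[0]))]


def train(silos: List[Silo], rounds: int = 100) -> List[float]:
    global_weights = [0.0, 0.0]
    for _ in range(rounds):
        # Only parameters travel between sites and host; raw records stay local.
        updates = [local_update(global_weights, data) for data in silos]
        global_weights = federated_average(updates, [len(d) for d in silos])
    return global_weights


# Synthetic silos drawn from the same relationship, y = 2x + 1.
silos = [
    [(x, 2 * x + 1) for x in (0.0, 0.5, 1.0)],
    [(x, 2 * x + 1) for x in (1.5, 2.0)],
    [(x, 2 * x + 1) for x in (0.25, 0.75, 1.25, 1.75)],
]
w, b = train(silos)
print(f"w={w:.3f}, b={b:.3f}")
```

The global model recovers the shared relationship even though no site sees another site's data. Differential privacy would add noise to the shared updates on top of this.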

Citadel

Utah

Seattle

New Jersey

New Mexico

Louisiana

Kentucky

Trusted host

Differential privacy

  • “Epsilon Indistinguishability”

 

High privacy

High accuracy

Privacy/accuracy tradeoff
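The tradeoff can be illustrated with the textbook Laplace mechanism. A mechanism M is ε-differentially private if, for any two datasets D and D′ differing in one record and any output set S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]. For a counting query (sensitivity 1), adding Laplace noise with scale 1/ε satisfies this. The sketch below is a generic illustration, not the mechanism used in the project, and the count of 120 is made up.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-DP; a count query has sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, true_count: int = 120, trials: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(abs(private_count(true_count, epsilon, rng) - true_count) for _ in range(trials)) / trials


# Smaller epsilon = stronger privacy = noisier answers.
errors = {eps: mean_abs_error(eps) for eps in (0.1, 1.0, 10.0)}
for eps, err in errors.items():
    print(f"epsilon={eps}: mean absolute error ~ {err:.2f}")
```

The expected absolute error equals 1/ε, so moving from ε = 10 to ε = 0.1 makes the released count roughly 100 times noisier.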

Innovation: SEER data put us in a unique position in this space. We are able to design and test solutions for real-world application at population scale.


25 of 43

Integration of diverse types of social and environmental determinants of health data across space and time requires advanced analytical methods and computational workflows

Number of Deaths and Percentage of Disability-Adjusted Life-Years Related to the 17 Leading Risk Factors in the United States, 2016. Negative values (where bars extend left of zero) indicate a protective effect.

The State of US Health, 1990-2016: Burden of Diseases, Injuries, and Risk Factors Among US States. JAMA. 2018;319(14):1444-1472. doi:10.1001/jama.2018.0158

Challenge 4


26 of 43

Air pollution exposure affects spontaneous pregnancy loss

Leiser CL, Hanson HA, et al. Acute effects of air pollutants on spontaneous pregnancy loss: a case-crossover study. Fertility and Sterility 111, 341-347 (2019)

1,398 women with a spontaneous pregnancy loss, 2007 – 2015

https://www.theguardian.com/environment/2019/jan/11/air-pollution-as-bad-as-smoking-in-increasing-risk-of-miscarriage


27 of 43

Exposure to industrial air pollution is associated with decreased male fertility

https://www.epa.gov/rsei/rsei-geographic-microdata-rsei-gm

Estimated average daily concentration of PCBs (μg/m³) along the Wasatch Front in 2005

Azoospermia

Ramsay JM, Fendereski K, Horns JJ, VanDerslice JA, Hanson HA, Emery BR, Halpern JA, Aston KI, Ferlic E, Hotaling JM. Environmental exposure to industrial air pollution is associated with decreased male fertility. Fertil Steril. 2023 May 15:S0015-0282(23)00514-9. doi: 10.1016/j.fertnstert.2023.05.143. Epub ahead of print. PMID: 37196750.


28 of 43

SEER Residential History Data

LexisNexis Residential History data

  • 11 SEER registries have been linked (3.2 million individuals diagnosed from 2005 - 2022)
  • 15 SEER registries should be linked by the end of 2024
  • Residential history constructed for Louisiana and expected to be expanded to all linked registries this year
  • High quality data from 1995 – 2020
  • 83% are geocoded to the point location
  • IRB approved studies may gain access to the information

Zaria Tatalovich, Lynne Penberthy, Kevin Henry, David Stinchcomb

Estimated average daily concentration of PCBs (μg/m³) along the Wasatch Front in 2005

RSEI Microdata

Air Pollution Data

Indoor Radon


29 of 43

Near real-time analytics at scale and data-driven clinical decision-making tools

Medium Range Goals


30 of 43

Investigators and technical leads

  • Principal Investigator: Dr. Heidi Hanson
    • Group Leader, Biostatistics and Biomedical Informatics, ORNL
  • Technical Lead: Dr. John Gounley
    • Group Leader, Scalable Biomedical Modeling, ORNL

EHRLICH: Electronic Health Record-informed LagrangIan method for preCision public Health

HPC for Biopreparedness

The U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR), as part of the Biopreparedness Research Virtual Environment (BRaVE) initiative.

Program Manager: Dr. Margaret Lentz


31 of 43

Advancing epidemiological modeling capabilities, near real-time disease surveillance and scenario forecasting

Best case scenario

Worst case scenario

ABM for modeling hospital capacity

Implement data-driven epidemiological modeling for near-real-time disease projection and rapid response to large-scale public health threats
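To give a sense of what an agent-based model (ABM) of hospital load involves, here is a deliberately small susceptible-infectious-recovered simulation with random daily contacts. Every parameter value and the function `simulate_hospital_load` are illustrative assumptions; this is not the EHRLICH model.

```python
import random
from typing import List

SUSCEPTIBLE, INFECTIOUS, RECOVERED = 0, 1, 2


def simulate_hospital_load(
    p_transmit: float,
    n_agents: int = 500,
    n_seeds: int = 5,
    contacts_per_day: int = 8,
    days_infectious: int = 7,
    p_hospital: float = 0.1,
    n_days: int = 60,
    seed: int = 0,
) -> List[int]:
    """Return the number of hospitalized agents at the end of each day."""
    rng = random.Random(seed)
    state = [SUSCEPTIBLE] * n_agents
    days_left = [0] * n_agents
    in_hospital = [False] * n_agents

    def infect(agent: int) -> None:
        state[agent] = INFECTIOUS
        days_left[agent] = days_infectious
        in_hospital[agent] = rng.random() < p_hospital  # hospitalized while infectious

    for agent in rng.sample(range(n_agents), n_seeds):
        infect(agent)

    daily_load = []
    for _ in range(n_days):
        infectious = [a for a in range(n_agents) if state[a] == INFECTIOUS]
        exposed = set()
        for _agent in infectious:
            # Each infectious agent meets a few random agents per day.
            for other in rng.choices(range(n_agents), k=contacts_per_day):
                if state[other] == SUSCEPTIBLE and rng.random() < p_transmit:
                    exposed.add(other)
        for agent in infectious:
            days_left[agent] -= 1
            if days_left[agent] == 0:
                state[agent] = RECOVERED
                in_hospital[agent] = False
        for agent in sorted(exposed):
            infect(agent)
        daily_load.append(sum(in_hospital))
    return daily_load


best_case = simulate_hospital_load(p_transmit=0.0)   # no onward transmission
worst_case = simulate_hospital_load(p_transmit=0.1)  # unmitigated spread
print("peak beds, best case:", max(best_case), "| worst case:", max(worst_case))
```

Comparing the peak of each scenario with a bed capacity gives the kind of best-case and worst-case projection named on the slide. A national-scale model would replace random mixing with realistic contact networks.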


32 of 43

ENABLE: EHRLICH National-scale Agent-Based Learning Environment

Modular and Interoperable Software for Biopreparedness Decision Support

ChatGPT

Everything should be made as simple as possible, but not simpler ~ Einstein

FrESCO

Near Real-Time Information Retrieval

C-HER: Contextualizing Environments

ARC: Extracting Information from Scientific Publication

UrbanPop: Contact Networks for Contagion Models

UQ for Data

UQ for ABMs

Agent Based Modeling for Rapid Response


33 of 43

Interoperable Systems for AI at Scale: Bringing Compute Closer to the Source

Rapid Data Integration while Protecting Private Information

Standard data processing workflows across siloed systems for preparing unstructured text

Federated Learning for extracting common data elements

Citadel

Silo A, Silo B, Silo C, Silo D

Problem: Health Data Silos

Past Solution: Centralization of data or Federated Data Structures

- Slow and costly

- Designed for a single purpose

- Rule based ETL for harmonization

Our Solution:

FrESCO

Near Real-Time Information Retrieval


34 of 43

Improving the accuracy of information retrieval models

Attention Mechanisms in Clinical Text Classification

  • We examine how different attention mechanisms affect the document embeddings produced by CNN-, RNN-, and transformer-based text encoders.

  • Initializing an attention mechanism's reference information with explicit label-specific auxiliary information improves classification performance.

  • Incorporating information on code hierarchy solely via the attention mechanism improves accuracy with minimal increase in compute time and without requiring an additional neural network.

Structure of the general document classification attention model. The text-encoder segment represents a CNN-, RNN-, or transformer-based architecture. The label-attention layer utilizes implicit (i.e., random embeddings) or explicit (e.g., embedded textual code descriptions) auxiliary information. The multi-label classification models use a sigmoid activation function after the output layer and the binary-cross-entropy objective function.
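The label-attention layer described in the caption can be sketched in a few lines. Each label has its own query vector, which may be random (implicit) or derived from an embedded code description (explicit). The query attends over the encoder's token states to form a label-specific document vector, and a sigmoid output is trained with binary cross-entropy. The numbers below are made up, and the function names are hypothetical rather than taken from the published code.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention(
    token_states: List[Vector],  # encoder output, one vector per token
    label_queries: List[Vector],  # one query per label (implicit or explicit)
    out_weights: List[Vector],
    out_bias: List[float],
) -> List[float]:
    """Return one sigmoid probability per label (multi-label classification)."""
    probs = []
    for query, w, b in zip(label_queries, out_weights, out_bias):
        attn = softmax([dot(query, h) for h in token_states])
        # Label-specific document embedding: attention-weighted sum of token states.
        context = [sum(a * h[d] for a, h in zip(attn, token_states)) for d in range(len(query))]
        logit = dot(w, context) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


def binary_cross_entropy(probs: List[float], targets: List[int]) -> float:
    eps = 1e-12
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps) for p, t in zip(probs, targets)
    ) / len(probs)


# Three tokens with 2-d states, two labels; all values are illustrative.
token_states = [[0.2, 1.0], [1.5, -0.3], [0.1, 0.4]]
label_queries = [[1.0, 0.0], [0.0, 1.0]]
out_weights = [[0.8, -0.2], [-0.5, 1.1]]
out_bias = [0.0, 0.1]

probs = label_attention(token_states, label_queries, out_weights, out_bias)
loss = binary_cross_entropy(probs, targets=[1, 0])
print(probs, loss)
```

Swapping random queries for embedded code descriptions changes only `label_queries`, which is why this form of auxiliary information adds little compute.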

FrESCO

Near Real-Time Information Retrieval

Christoph Metzner, Shang Gao, Drahomira Herrmannova, Elia Lima-Walton, Heidi Hanson

Accepted IEEE Journal of Biomedical and Health Informatics – Selected for highlight and journal cover


35 of 43

Centralized Health and Environmental Repository (C-HER)

  • Spatially indexed social and environmental determinants of health data
  • Reproducible data following a FAIR approach
  • Standardized metadata
  • ML and DL for derivative datasets
    • In development: Air pollution and indoor radon

C-HER: Contextualizing Environments


36 of 43

Measures are “spatially joined” to a polygon (hex)

While not perfect… Hexes can scale in and out across different geographies and measures can be aggregated accordingly.

RSEI Microdata converted to Uber hexes

Hexagon size: ~36 km² – 1,770 km²
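A spatial join to hexes amounts to mapping each measurement's coordinates to a cell index and aggregating per cell. The sketch below does this on a flat plane with standard axial hex coordinates. It is a toy stand-in, not the H3 library that "Uber hexes" presumably refers to; the quoted sizes appear to match H3 resolutions 6 through 4, but that reading is an assumption. `point_to_hex` and `aggregate_by_hex` are hypothetical helpers.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Hex = Tuple[int, int]  # axial (q, r) index of a pointy-top hexagon


def point_to_hex(x: float, y: float, size: float) -> Hex:
    """Map planar coordinates to the hexagon (circumradius `size`) containing them."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Cube rounding: recompute the coordinate with the largest rounding error.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def aggregate_by_hex(measurements: List[Tuple[float, float, float]], size: float) -> Dict[Hex, float]:
    """Spatially join (x, y, value) measurements to hexes and average per hex."""
    buckets: Dict[Hex, List[float]] = defaultdict(list)
    for x, y, value in measurements:
        buckets[point_to_hex(x, y, size)].append(value)
    return {cell: sum(vals) / len(vals) for cell, vals in buckets.items()}


# Made-up pollutant readings at planar coordinates (arbitrary units).
readings = [(0.1, 0.1, 2.0), (-0.1, 0.0, 4.0), (math.sqrt(3), 0.0, 10.0)]
fine = aggregate_by_hex(readings, size=1.0)     # small hexes
coarse = aggregate_by_hex(readings, size=10.0)  # larger hexes aggregate more points
print(fine, coarse)
```

Re-binning at a larger `size` plays the role of "scaling out" to a coarser geography; a production pipeline would use geodesic cells rather than a flat plane.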

Joemy Ramsay, PhD


37 of 43

Centralized Health and Environmental Repository (C-HER)

C-HER: Contextualizing Environments

Currently processing 73 social and environmental determinants of health datasets


38 of 43

2030

Disease detection at the speed of life: Near real-time disease surveillance at population scale


39 of 43

Integrated Health Security Surveillance Response Tools

  • Data Driven Decision Support Needs
    • Easy management of large amounts of heterogeneous and siloed sensitive information

    • Early warning system for impending outbreaks that could become a public health emergency

    • Data-driven informed decisions for incident handling and response

    • Data-driven tools to guide the execution of mitigation actions in incident response

ChatGPT

Dual Purpose

Rapid Clinical Decision Making

Rapid Response for Population Health


40 of 43

Team science

Computational Scientists, Engineers, Biostatisticians, Biomedical Informaticists, Geneticists, Subject Matter Experts

https://xkcd.com/793/

Challenge 5


41 of 43

MOSSAIC

National Cancer Institute

Lynne Penberthy (PI)

Elizabeth Hsu (Technical Lead)

Valentina Petkov

Serban Negoita

Ola Adeyemi

Sylkk Ansah

Sarah Bonds

IMS

Linda Coyle

Jennifer Stevens

Scott Depuy

Rusty Sheilds

Gary Beverungen

Los Alamos National Laboratory

Jamaludin Mohd Yusof

Sayera Dhaubhadel

Tanmoy Bhattacharya

Oak Ridge National Laboratory

Heidi Hanson (PI)

John Gounley (Technical Lead)

Shelaine Curd (Program Manager)

Georgia Tourassi

Joe Lake

Adam Spannaus

Dakota Murdock

Zachary Fox

Patrycja Krawczuk

Dakotah Maguire

Jordan Miller

Mayanka Chandrashekar

Noah Schaefferkoetter

Sajal Dash

Isaac Lyngaas

Abhishek Shivanna

Robert Bridges

Christopher Stanley

Vandy Tombs

Christoph Metzner


42 of 43

EHRLICH

Oak Ridge National Laboratory

Heidi Hanson (PI)

John Gounley (Technical Lead)

Shelaine Curd (Program Manager)

Adam Spannaus

Zachary Fox

Patrycja Krawczuk

Dakotah Maguire

Mayanka Chandrashekar

Robert Bridges

Christopher Stanley

Vandy Tombs

Sudip Seal

James Nutaro

Sifat Moon

Christoph Metzner

Los Alamos National Laboratory

Jamal Mohd-Yusof

Cristina Garcia Cardona

Argonne National Laboratory

Rick Stevens

Thomas Brettin

University of Utah

James VanDerslice

Joemy Ramsay

University of Arizona

Nirav Merchant

Ravi Tandon

University of Southern California

Rima Habre

Duke

Amanda Randles

University of Chicago/Morehouse School of Medicine

Lilly Immergluck


43 of 43

Discussion
