1 of 21

Term paper

DEPLOYR: a framework for deploying custom real-time ML models into the EMR

By Conor K. Corbin et al.

Presented by Abhijeet Sahdev

2 of 21

Background

  • Machine learning (ML) in healthcare has immense potential, given the vast amount of data in electronic medical records (EMRs); around 250k papers on the topic had been published at the time of the research [1].
  • Yet, many research-based models end up in the “model graveyard” because clinical settings lack a deployment framework.
  • Moreover, high predictive power isn’t sufficient alone for real-world impact.
  • For such impact, models must show:
    • Clear clinical actions that follow from predictions.
    • Improvement in patient outcomes.
    • Fairness
      • Benefits should extend to patients across all demographic groups.
  • Leading institutions were therefore creating frameworks for ML deployment, focused on model safety, reliability, and usefulness across all stages of the model life-cycle.

3 of 21

Background

  • These efforts have met many technical barriers, such as:
    • Legacy infrastructure in hospitals.
    • High overhead of building and maintaining ML-EHR integration.
  • During the time of research, there were two main approaches:
    • EMR-vendor native platforms such as Epic Nebula
      • Benefit: low overhead and easy sharing of models across hospitals with the same EMR system.
      • Downside: custom, research-built models trained on a hospital’s own data cannot be integrated.
    • Custom frameworks built by institutions
      • Relied on daily EMR refreshes (not real-time).
      • Provided flexibility that vendor platforms lack.
      • Required collaboration between data scientists (for model development) and hospital IT staff with deep EMR expertise.
      • Yet these solutions were ad hoc, siloed, and poorly shared, leaving other hospitals behind.

4 of 21

Background

  • Given this state, data scientists at the Stanford School of Medicine, together with IT professionals at Stanford Health Care and Stanford Children’s Health, proposed DEPLOYR.
  • It is a framework for deploying researcher-built ML models directly into the EMR without depending on any vendor-specific platform: it works with real-time EMR data streams, supports real-time ML applications, and provides a blueprint for other institutions.
  • Its core functionality includes:
    • Runs a model when triggered by a clinician action.
    • Pulls the appropriate patient data from the EMR.
    • Runs the ML model in real time for inference, going beyond retrospective to prospective evaluation.
    • Integrates model outputs directly into the EMR for clinicians to view (closing the loop).
    • Monitors deployed models continuously.
  • For demonstration, they silently deployed 12 models that had been validated retrospectively [2].

5 of 21

Concepts & Creative Insights

Figure 1 [3]

  • Core functions of DEPLOYR are:
    • Data Sourcing
    • Inference Triggers
    • EMR Integration
    • Monitoring module
    • Mechanisms that enable silent deployment and prospective evaluation of models.
  • Fig 1 shows an overview of the system.
    • The demonstration is tied to Stanford Health Care EMR vendor (Epic).
    • Thus, the integration logic is vendor-dependent.
  • However, the other subsystems can work with any vendor, provided the institution supplies its own vendor-specific integration logic.
  • The three software applications are:
    • DEPLOYR-dev : a Python package for model development and validation.
    • DEPLOYR-serve : a Python Azure Functions application [4] that exposes trained models as APIs.
    • DEPLOYR-dash : a monitoring dashboard built with the Streamlit Python package [5].

6 of 21

Concepts & Creative Insights

Figure 1 [3]

Blue: research side (Stanford School of Medicine); Orange: clinical side (Stanford Health Care)

  1. De-identified EMR data from STARR used for model training & validation via DEPLOYR-dev.
  2. Model deployed as REST API with DEPLOYR-serve.
  3. EMR triggers inference by sending HTTPS request to the model.
  4. Model retrieves real-time features from EMR transactional database via REST/FHIR APIs, runs inference, and returns results to EMR (sketched below).
  5. Predictions and metadata stored in an inference database.
  6. DEPLOYR-dash monitors performance continuously using these stored results.
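The six-step flow above maps naturally onto a single HTTP-triggered function. Below is a minimal sketch in the style of DEPLOYR-serve, assuming the Azure Functions Python v2 programming model; the route name, payload fields, and the fetch_features/log_inference helpers are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of a DEPLOYR-serve-style inference endpoint
# (Azure Functions, Python v2 programming model).
import json
import pickle

import azure.functions as func

app = func.FunctionApp()

# Load the trained model once per worker process, not per request.
with open("model.pkl", "rb") as f:
    MODEL = pickle.load(f)


def fetch_features(patient_id: str) -> list[float]:
    """Hypothetical helper: pull real-time features for the patient
    from the EMR transactional database via REST/FHIR APIs (step 4)."""
    raise NotImplementedError("wrap the institution's FHIR endpoints here")


def log_inference(patient_id: str, features: list[float], score: float) -> None:
    """Hypothetical helper: persist the inference and its metadata
    (IDs, timestamp, feature vector) to the inference database (step 5)."""


@app.route(route="infer", methods=["POST"])
def infer(req: func.HttpRequest) -> func.HttpResponse:
    payload = req.get_json()              # step 3: HTTPS request from the EMR
    patient_id = payload["patient_id"]
    features = fetch_features(patient_id)
    score = float(MODEL.predict_proba([features])[0, 1])
    log_inference(patient_id, features, score)
    # Return the prediction to the EMR for display in the workflow.
    return func.HttpResponse(
        json.dumps({"patient_id": patient_id, "score": score}),
        mimetype="application/json",
    )
```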

7 of 21

Concepts & Creative Insights

Figure 2 [6]

  • Training Data Source
    • Stanford’s clinical data warehouse - STARR.
    • STARR has de-identified data of over 2.4M unique patients from 2009 to 2021 who visited Stanford Hospital (academic medical center in Palo Alto, CA), ValleyCare hospital (community hospital in Pleasanton, CA) and Stanford University Healthcare Alliance affiliated ambulatory clinics [7].
  • Inference Data Source
    • In production, DEPLOYR uses the EMR’s transactional database, Epic Chronicles, for real-time patient data.
  • Note that the warehouse database is several ETL (extract, transform, load) steps removed from the live EMR data. Hence, features must be mapped at inference time to match those used during training; DEPLOYR-serve handles this, as shown in Fig 2 (a minimal mapping sketch follows).
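To make the mapping concrete, here is a minimal sketch of the kind of translation DEPLOYR-serve must perform between warehouse-derived feature names and real-time FHIR observations; the feature names and LOINC codes below are illustrative examples, not the paper's actual mappings.

```python
# Minimal sketch of mapping real-time FHIR observations onto the
# feature names the model saw during training. The names and LOINC
# codes here are illustrative, not the paper's actual mappings.

# Warehouse (training) feature name -> code returned by the EMR API.
FEATURE_TO_LOINC = {
    "hemoglobin_last": "718-7",   # Hemoglobin [Mass/volume] in Blood
    "sodium_last": "2951-2",      # Sodium [Moles/volume] in Serum or Plasma
}


def map_observations(fhir_observations: list[dict]) -> dict[str, float]:
    """Translate raw FHIR Observation resources into training-time
    feature names; features with no live counterpart stay absent."""
    code_to_feature = {v: k for k, v in FEATURE_TO_LOINC.items()}
    features: dict[str, float] = {}
    for obs in fhir_observations:
        code = obs["code"]["coding"][0]["code"]
        if code in code_to_feature:
            features[code_to_feature[code]] = obs["valueQuantity"]["value"]
    return features
```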

8 of 21

Concepts & Creative Insights

Figure 3 [8]

  • Model Inference Triggers decide when and for whom models run, shaping both workflow integration and the target population.
  • There are two types of triggers in DEPLOYR:
    • Event-based: Activated by a clinical action (e.g., ordering a lab, signing a note, admission/discharge). Implemented as REST APIs via DEPLOYR-serve, triggered by EMR alerts, rules, or button-clicks.
    • Time-based: Run at scheduled intervals (e.g., monitoring for sepsis every 15 min). Implemented with Azure Function cron timers. Patients are selected through EMR vendor APIs (e.g., all patients in a unit).
  • Fig 3 : A to D
    • A clinician action in the EMR sends an HTTPS request to DEPLOYR-serve, which gathers features via Epic APIs, runs the model, and returns the inference back into the EMR for clinical use.
  • Fig 3 : E to H
    • With time-based triggers, DEPLOYR-serve runs at set intervals (e.g., every 15 min), retrieves patient IDs via REST APIs, collects feature vectors, runs inference, and sends results back to the EMR (a minimal timer sketch follows).
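A time-based trigger reduces to a cron-style timer function. The sketch below assumes the Azure Functions Python v2 model and the 15-minute NCRONTAB schedule from the sepsis example; list_unit_patients and score_patient are hypothetical helpers.

```python
# Minimal sketch of a time-based trigger: a 15-minute cron sweep
# over a unit's census (Azure Functions, Python v2 model).
import azure.functions as func

app = func.FunctionApp()


def list_unit_patients(unit: str) -> list[str]:
    """Hypothetical helper: patient IDs for a unit via EMR vendor APIs."""
    return []


def score_patient(patient_id: str) -> None:
    """Hypothetical helper: fetch features, run inference, write back."""


# NCRONTAB "0 */15 * * * *" fires at second 0 of every 15th minute.
@app.timer_trigger(schedule="0 */15 * * * *", arg_name="timer")
def scheduled_inference(timer: func.TimerRequest) -> None:
    for patient_id in list_unit_patients("ICU"):
        score_patient(patient_id)
```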

9 of 21

Concepts & Creative Insights

  • Finally, to close the loop, model outputs must return to clinicians inside the EMR. DEPLOYR (via DEPLOYR-serve) writes them directly into the EMR rather than into a separate application, reducing overhead.
  • These integrations are of two types:
    • Passive (non-interruptive):
      • Write scores to EMR columns in patient lists/schedules (probabilities, flags, feature contributions).
      • Store in flowsheets or Smart Data Values (can trigger downstream support).
      • Send messages through EMR’s internal inbox.
      • Displayed in the background, without interrupting workflow.
    • Active (interruptive):
      • Trigger alerts during clinician actions (e.g., lab order entry).
      • Implemented via Epic Best Practice Advisory web-services.
      • Uses CDA web-services (XML responses) to exchange predictions with DEPLOYR-serve (a hypothetical XML sketch follows this list).
      • These alerts appear within the clinician’s workflow at the point of action.
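For the interruptive path, DEPLOYR-serve must answer the EMR's web-service call with XML. The sketch below only illustrates the shape of such an exchange; the element names are invented, and the real CDA/Best Practice Advisory schema is defined by the EMR vendor.

```python
# Hypothetical sketch of packaging a prediction as an XML reply for an
# interruptive alert web-service. Element names are invented; the real
# CDA/Best Practice Advisory schema is defined by the EMR vendor.
import xml.etree.ElementTree as ET


def prediction_to_xml(patient_id: str, score: float, threshold: float) -> bytes:
    root = ET.Element("PredictionResponse")
    ET.SubElement(root, "PatientID").text = patient_id
    ET.SubElement(root, "Score").text = f"{score:.3f}"
    # The EMR decides whether to interrupt based on the returned flag.
    ET.SubElement(root, "ShowAlert").text = str(score >= threshold).lower()
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```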

10 of 21

Concepts & Creative Insights

  • Continuous Monitoring
    • Models can decay over time due to distribution shifts [9] such as :
      • Covariate Shift : change in input feature distributions.
      • Label Shift : change in outcome distributions.
      • Concept Shift : change in the feature–label relationship.
  • DEPLOYR uses LabelExtractors (in DEPLOYR-serve) to collect ground truth after predictions.
    • Inferences and metadata (IDs, timestamps, features) are stored in Azure Cosmos DB.
    • Predictions are matched to outcomes once they become available via EMR APIs. Example: for 30-day readmission, a LabelExtractor checks whether the patient was readmitted within 30 days.
  • After extracting labels, DEPLOYR tracks model performance.
    • Metrics include threshold-dependent ones like accuracy and precision, as well as threshold-independent ones like AUROC.
    • These are stratified across demographic subgroups for fairness checks (a minimal sketch follows this list).
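As a sketch of both halves of the monitoring loop, the snippet below pairs a hypothetical 30-day-readmission LabelExtractor with subgroup-stratified AUROC. The get_admissions helper and record fields are assumptions; the real extractors query the EMR through its APIs.

```python
# Minimal sketch of label extraction plus subgroup-stratified AUROC.
# get_admissions and the record fields are hypothetical; real
# LabelExtractors in DEPLOYR-serve query the EMR via its APIs.
from datetime import datetime, timedelta

from sklearn.metrics import roc_auc_score


def get_admissions(patient_id: str) -> list[datetime]:
    """Hypothetical helper: admission timestamps from the EMR."""
    return []


def readmission_label(patient_id: str, discharged_at: datetime) -> int:
    """Ground truth: 1 if readmitted within 30 days of discharge."""
    window_end = discharged_at + timedelta(days=30)
    return int(any(discharged_at < a <= window_end
                   for a in get_admissions(patient_id)))


def stratified_auroc(records: list[dict]) -> dict[str, float]:
    """AUROC per demographic subgroup, for fairness checks.
    Each record: {"group": ..., "score": ..., "label": ...}."""
    groups: dict[str, list[dict]] = {}
    for r in records:
        groups.setdefault(r["group"], []).append(r)
    return {
        g: roc_auc_score([r["label"] for r in rs], [r["score"] for r in rs])
        for g, rs in groups.items()
    }
```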

11 of 21

Concepts & Creative Insights

Figure 4 [10]

  • From Fig 4, it is evident that DEPLOYR also tracks feature, label, and prediction distributions over time, visualized in the Streamlit dashboard DEPLOYR-dash (a minimal dashboard sketch follows this list).
  • Silent trials were conducted with models running in the live EMR environment while their outputs were hidden from clinicians; event- and time-based triggers still ran in the background. The purpose of this is to:
    • Validate that data pipelines work correctly.
    • Provide more realistic performance checks than retrospective tests.
    • Detect faulty cohort designs (e.g., exclusion criteria that aren’t observable at inference time).
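In the spirit of DEPLOYR-dash, a few lines of Streamlit suffice to track prediction and feature distributions over time. The load_inference_log helper and its column names are assumptions standing in for the real inference store.

```python
# Minimal Streamlit sketch in the spirit of DEPLOYR-dash: track
# prediction and feature distributions over time. load_inference_log
# and its column names are hypothetical stand-ins for the real store.
import numpy as np
import pandas as pd
import streamlit as st


def load_inference_log() -> pd.DataFrame:
    """Hypothetical helper: one row per inference, with the inference
    'date', the model 'score', and the feature values that were used."""
    return pd.DataFrame({"date": pd.Series(dtype="datetime64[ns]"),
                         "score": pd.Series(dtype=float),
                         "hemoglobin_last": pd.Series(dtype=float)})


df = load_inference_log()
st.title("Model monitoring")

# Daily means: a visible trend flags drift or a broken data pipeline.
st.line_chart(df.groupby("date")[["score", "hemoglobin_last"]].mean())

# Score histogram, for comparison against the training distribution.
counts, edges = np.histogram(df["score"].dropna(), bins=10, range=(0, 1))
st.bar_chart(pd.Series(counts, index=edges[:-1].round(2)))
```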

12 of 21

Concepts & Creative Insights

  • The silent trial case study includes:
    • 12 binary classifiers predicting abnormal lab results with Random Forests (chosen as robust baseline for EMR data).
    • STARR warehouse (2015–2021), ~14,000 orders per task (sampled ~2000/year).
    • Cohorts:
      • CBC tests (4 models : hematocrit, hemoglobin, WBC, platelets).
      • Metabolic panel (7 models : albumin, BUN, calcium, carbon dioxide, creatinine, potassium, sodium).
      • Magnesium test (1 model).
    • Prospective data: real-time EMR orders (Jan–Feb 2023).
    • Features: demographics, diagnosis codes, medications, prior labs (represented as counts).
    • Compared retrospective vs. prospective test sets on AUROC with 95% CIs (bootstrap, 1000 resamples); a minimal CI sketch follows this list.
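The bootstrap comparison is straightforward to reproduce. Below is a minimal sketch of a 95% bootstrap CI for AUROC with 1000 resamples, matching the setup described above; the array names are placeholders.

```python
# Minimal sketch of a bootstrap 95% CI for AUROC (1000 resamples),
# as used to compare retrospective and prospective test sets.
import numpy as np
from sklearn.metrics import roc_auc_score


def auroc_with_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    labels, scores = np.asarray(labels), np.asarray(scores)
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        # Resample prediction/label pairs with replacement.
        idx = rng.integers(0, len(labels), len(labels))
        if labels[idx].min() == labels[idx].max():
            continue  # single-class resample: AUROC undefined, skip
        stats.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(labels, scores), (lo, hi)
```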

13 of 21

Concepts & Creative Insights

  • Observations:
    • AUROC values in prospective tests were several points lower than retrospective results across all 12 models. Example: hemoglobin AUROC of 0.88 (retrospective) vs. 0.83 (prospective) [11].
    • Similar declines observed in ROC, PR, and calibration curves.
    • This was caused by issues such as data drift [12], features inaccessible at inference time, and imperfect mappings.
    • In subgroup analysis, model performance varied with the number of available features; complete feature vectors gave better results.
  • Overall, silent trials revealed gaps that retrospective testing misses, because it works only on offline, historical data.
  • The continuous monitoring system helps detect performance decay due to drifts.
  • Feedback mechanisms: when a model influences care, it changes the data generated in the future.
    • Naive retraining on this biased data can worsen performance.
    • DEPLOYR allows randomization in inference delivery (a minimal randomization sketch follows).
      • Some predictions are shown and others withheld, creating control groups that separate true model effects from feedback artifacts.
    • This supports causal evaluation, akin to randomized controlled trials, inside the EMR.
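A minimal sketch of that randomization follows, assuming delivery is decided by a coin flip per inference and both arms are logged; both helpers are hypothetical.

```python
# Minimal sketch of randomized inference delivery: log every
# prediction, surface only a random subset, and record the arm so
# shown vs. withheld groups can be compared later.
import random


def log_inference_arm(patient_id: str, score: float, arm: str) -> None:
    """Hypothetical helper: persist score and arm assignment."""


def write_back_to_emr(patient_id: str, score: float) -> None:
    """Hypothetical helper: surface the prediction in the EMR."""


def deliver(patient_id: str, score: float, show_fraction: float = 0.5) -> None:
    arm = "shown" if random.random() < show_fraction else "withheld"
    log_inference_arm(patient_id, score, arm)
    if arm == "shown":
        write_back_to_emr(patient_id, score)
```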

14 of 21

Critical Analysis

  • Earlier vendor platforms locked hospitals into their ecosystems (e.g., systems that work only on Epic and no other EMR). DEPLOYR, in contrast, is built with open tools such as Python, Azure Functions, REST APIs, and Streamlit, making it more portable and easier to adopt across institutions.
  • Vendor solutions are optimized for pre-packaged, vendor-approved models only; DEPLOYR removes this limitation.
  • DEPLOYR provides online inference, whereas earlier solutions offered daily EMR refreshes at best.
  • Predictions are integrated directly into the workflow. However, a performance drop is observed in prospective settings relative to retrospective ones, due to data drift and imperfect EMR mappings.
  • It provides dashboards that track metrics such as AUROC, subgroup fairness, and distribution shifts.

15 of 21

Critical Analysis

  • Clearly, building such a system requires immense commitment from people of various backgrounds (IT professionals, data scientists, and clinicians): an overall institutional commitment. This presents a scalability challenge for adopting the framework.
  • The silent trial suggests a cost-efficient cloud setup (under $300 of Azure storage for a month [13]). However, this excludes the GPU costs required for inference with large-scale deep learning models.
  • The framework is model-agnostic. However, it offers no multimodal support, as the paper’s focus was on structured EMR data alone.
  • It provides more transparency than vendor models, as its components are open-sourced [14][15][16].

16 of 21

Relevance

  • 4 Vs of Big Data:
    • Volume: Millions of patients in STARR (~2.4M)
    • Velocity: Real-time EMR streams
    • Variety: Structured EMR data (demographics, labs, medications)
    • Veracity: Data quality challenges, ETL pipelines, and mappings between research and production data.
  • It is built on cloud-native tools.
    • Azure Functions: serverless compute, analogous to AWS Lambda.
    • REST APIs for interoperability, paralleling how distributed systems such as Hadoop/YARN manage data pipelines.
  • It supports core analytics such as regression, classification, and clustering, along with monitoring.
  • It tracks AUROC, calibration curves, and fairness across demographic subgroups, and monitors drift, thereby turning raw analytics into actionable insights.

17 of 21

Summary

  • DEPLOYR is Stanford’s in-house framework for integrating research-based ML models into EMRs in real-time.
  • It avoids vendor lock-in.
    • Extensible to any institution with a clinical data warehouse plus FHIR (Fast Healthcare Interoperability Resources) APIs.
    • Epic-specific APIs were used here, but similar tools exist in other EMRs.
  • Supports live deployment of models.
  • Integrates into clinician workflows.
  • Framework is model-agnostic (binary, multi-class, regression).
  • Enables prospective testing of models, thereby bridging the gap between research and clinical practice.
  • Currently focused on structured EMR data. Future work needed for unstructured notes and multimodal data.

18 of 21

References

[1] C. K. Corbin et al., “Background and significance,” J. Am. Med. Inform. Assoc., Aug. 2023. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436147/. [Accessed: Sept. 28, 2025].

[2] C. K. Corbin et al., “Abstract,” J. Am. Med. Inform. Assoc., Aug. 2023. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436147/. [Accessed: Sept. 28, 2025].

[3] C. K. Corbin et al., “Summary of a DEPLOYR enabled model deployment,” J. Am. Med. Inform. Assoc., Fig. 1, Aug. 2023. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436147/figure/ocad114-F1/. [Accessed: Sept. 28, 2025].

[4] Microsoft, “Azure Functions—Serverless Functions in Computing—Microsoft Azure,” 2023. [Online]. Available: https://azure.microsoft.com/en-us/products/functions/. [Accessed: Sept. 28, 2025].

19 of 21

References

[5] Streamlit, “Streamlit: a faster way to build and share data apps,” 2023. [Online]. Available: https://streamlit.io/. [Accessed: Sept. 28, 2025].

[6] C. K. Corbin et al., “Mappings and Inferences in DEPLOYR-serve,” J. Am. Med. Inform. Assoc., Fig. 2, Aug. 2023. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436147/figure/ocad114-F2/. [Accessed: Sept. 28, 2025].

[7] C. K. Corbin et al., “Training Data Source,” J. Am. Med. Inform. Assoc., Aug. 2023. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436147/#ocad114-B27. [Accessed: Sept. 28, 2025].

[8] C. K. Corbin et al., “Triggering Mechanism,” J. Am. Med. Inform. Assoc., Fig. 3, Aug. 2023. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436147/figure/ocad114-F3/. [Accessed: Sept. 28, 2025].

20 of 21

References

[9] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, “Environment and Distribution shifts,” Dive into Deep Learning. [Online]. Available: https://d2l.ai/chapter_linear-classification/environment-and-distribution-shift.html. [Accessed: Sept. 28, 2025].

[10] C. K. Corbin et al., “DEPLOYR performance monitoring,” J. Am. Med. Inform. Assoc., Fig. 5, Aug. 2023. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436147/figure/ocad114-F5/. [Accessed: Sept. 28, 2025].

[11] C. K. Corbin et al., “Results Table 2,” J. Am. Med. Inform. Assoc., Aug. 2023. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436147/table/ocad114-T2/. [Accessed: Sept. 28, 2025].

[12] Evidently AI, “What is data drift in ML, and how to detect and handle it,” Evidently AI, Jan. 9, 2025. [Online]. Available: https://www.evidentlyai.com/ml-in-production/data-drift. [Accessed: Sept. 28, 2025].

21 of 21

References

[13] C. K. Corbin et al., “Discussion,” J. Am. Med. Inform. Assoc., Aug. 2023. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10436147/. [Accessed: Sept. 28, 2025].

[14] HealthRex Lab, “DEPLOYR-dev,” GitHub repository. [Online]. Available: https://github.com/HealthRex/deployr-dev. [Accessed: Sept. 28, 2025].

[15] HealthRex Lab, “DEPLOYR-dash,” GitHub repository. [Online]. Available: https://github.com/HealthRex/deployr-dash. [Accessed: Sept. 28, 2025].

[16] HealthRex Lab, “DEPLOYR-serve,” GitHub repository. [Online]. Available: https://github.com/HealthRex/deployr-serve. [Accessed: Sept. 28, 2025].