1 of 52

Data Pre-processing

  • Applied IQR-based quartile segmentation to balance the five TAT_Label categories (0–4).
  • Learning rate → 1e-4 (stable convergence)

  • Batch size → 256 (memory efficient)

  • Dropout → 0.3 (to prevent overfitting)

  • Hidden layers → [256, 128, 64] (deep enough for non-linear feature interaction)

  • Used EarlyStopping (patience = 5) to prevent overfitting.

  • Data split: 70% training, 15% validation, 15% testing.
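
Below is a minimal sketch of the ANN configuration listed above, assuming a PyTorch implementation; the input width and anything beyond the stated hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Sketch of the ANN described on this slide: hidden layers [256, 128, 64],
# dropout 0.3, learning rate 1e-4. The input size and any details beyond the
# stated hyperparameters are assumptions.
def build_ann(n_features: int, n_classes: int = 5) -> nn.Module:
    layers, prev = [], n_features
    for width in (256, 128, 64):
        layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(0.3)]
        prev = width
    layers.append(nn.Linear(prev, n_classes))
    return nn.Sequential(*layers)

model = build_ann(n_features=40)  # 40 input features is a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Training would use batches of 256 and stop once validation loss fails to
# improve for 5 consecutive epochs (EarlyStopping, patience = 5).
```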

2 of 52

Data Extraction & Turnaround-Time (TAT) Feature Engineering

IQR Based

*target variable

  • Derived two analytical columns to characterize workflow efficiency and anomalies:

TAT_Label (Categorical Performance Indicator)
    • 0 – Low: TAT_Hours ≤ Q1 (≤ 25th percentile). Very short or same-day prescriptions.
    • 1 – Medium-Low: Q1 < TAT_Hours ≤ Q2 (25th–50th percentile). Slightly below average turnaround — generally efficient.
    • 2 – Medium-High: Q2 < TAT_Hours ≤ Q3 (50th–75th percentile). Slightly slower than average — moderate delay zone.
    • 3 – High: TAT_Hours > Q3 (> 75th percentile). Slowest 25% of prescriptions — potential workflow bottlenecks.
    • 4 – Incomplete Workflow: Missing Ph_verification date.
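
A minimal pandas sketch of the quartile-based labeling above; the column names TAT_Hours and "Rx PhVerifDate" are assumptions based on the slides.

```python
import numpy as np
import pandas as pd

# IQR/quartile-based TAT_Label: 0-3 from quartiles of TAT_Hours,
# 4 when the pharmacist verification date is missing.
def add_tat_label(df: pd.DataFrame) -> pd.DataFrame:
    q1, q2, q3 = df["TAT_Hours"].quantile([0.25, 0.50, 0.75])
    conditions = [
        df["Rx PhVerifDate"].isna(),   # 4 - Incomplete Workflow
        df["TAT_Hours"] <= q1,         # 0 - Low
        df["TAT_Hours"] <= q2,         # 1 - Medium-Low
        df["TAT_Hours"] <= q3,         # 2 - Medium-High
    ]
    df["TAT_Label"] = np.select(conditions, [4, 0, 1, 2], default=3)  # 3 - High
    return df
```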

3 of 52

Prediction Models for the IQR-Based Feature with the ANN Model

ADASYN (Adaptive Synthetic Sampling) + ANN

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Train   0.60       0.59        0.58     0.73          0.58       0.83
ANN-Test    0.54       0.54        0.54     0.70          0.54       0.83

Although the accuracy (~54%) suggests moderate classification ability, the ROC-AUC score (0.83) indicates strong ranking and separability: the ANN learns the relative order of the TAT classes but struggles with precise boundary classification due to class overlap in the IQR-based labels.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

ADASYN generates more synthetic samples near the decision boundaries, which helps reduce overlap between classes.

It improves class separation without overfitting to dense areas.
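
A short sketch of the ADASYN step, assuming the imbalanced-learn implementation and that X_train / y_train hold the encoded features and TAT_Label for the training split only.

```python
from imblearn.over_sampling import ADASYN

# Resample only the training split; ADASYN synthesizes extra minority-class
# points near hard-to-learn regions instead of duplicating dense areas.
ada = ADASYN(random_state=42)
X_train_res, y_train_res = ada.fit_resample(X_train, y_train)
```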

4 of 52

Data Pre-processing

  • Applied RandomOverSampler (simple random oversampling of minority classes) to balance all five TAT_Label categories (0–4) equally.
  • Learning rate → 1e-4 (stable convergence)

  • Batch size → 256 (memory efficient)

  • Dropout → 0.3 (to prevent overfitting)

  • Hidden layers → [256, 128, 64] (deep enough for non-linear feature interaction)

  • Used EarlyStopping (patience = 5) to prevent overfitting.

  • Data split: 70% training, 15% validation, 15% testing (stratified).

5 of 52

Data Extraction & Turnaround-Time (TAT) Feature Engineering (Hour-Based)

*target variable

  • Derived two analytical columns to characterize workflow efficiency and anomalies:

TAT_Label (Categorical Performance Indicator)
    • 0 – Low: Under 24 hrs
    • 1 – Medium-Low: 1 day up to 1 month (24-719 hrs)
    • 2 – Medium-High: 1 month up to 6 months (720-4319 hrs)
    • 3 – High: Anything 6 months or longer
    • 4 – Incomplete Workflow: Missing Ph_verification

Is_TAT_Anomaly (Anomaly Flag)

    • 0 → Normal range
    • 1 → Outlier / unusually long or short TAT
    • 2 → Neutral / Missing data
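
A minimal sketch of the hour-based binning above; TAT_Hours and "Rx PhVerifDate" are assumed column names.

```python
import numpy as np
import pandas as pd

# Hour-based TAT_Label: <24h, 24-719h, 720-4319h, >=4320h, or 4 when the
# verification date is missing.
def hour_based_label(df: pd.DataFrame) -> pd.Series:
    conditions = [
        df["Rx PhVerifDate"].isna(),   # 4 - Incomplete Workflow
        df["TAT_Hours"] < 24,          # 0 - Low
        df["TAT_Hours"] < 720,         # 1 - Medium-Low (1 day up to 1 month)
        df["TAT_Hours"] < 4320,        # 2 - Medium-High (1 month up to 6 months)
    ]
    return pd.Series(np.select(conditions, [4, 0, 1, 2], default=3), index=df.index)
```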

6 of 52

Prediction Models for the Hour-Based Feature with the ANN Model

At the time of entry (with RandomOverSampler)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Train   0.755      0.765       0.755    0.820         0.752      0.942
ANN-Test    0.775      0.786       0.775    0.880         0.773      0.945

The ANN model achieved 77.5% test accuracy with a balanced F1-score of 0.77, indicating robust learning across all TAT categories. The High-TAT category remains challenging (F1 ≈ 0.42) and could improve with additional long-tail samples or engineered time-series features.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

The RandomOverSampler was used to balance minority classes such as High TAT and Incomplete Workflow.

7 of 52

8 of 52

SHAP Diagram

Y-axis:

Features are sorted by their overall importance (top = most influential).

X-axis:

Negative SHAP → pushed prediction toward lower TAT / faster workflow

Positive SHAP → pushed prediction toward higher TAT / possible delay or anomaly.

The farther from zero, the stronger the feature’s effect.

Color:

> Blue: low value of feature

> Red: high value of feature
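
A sketch of how a SHAP summary (beeswarm) plot like this is typically produced; model_predict_proba, X_train, X_test, and feature_names are placeholders, and using KernelExplainer is an assumption rather than the exact code behind the slide.

```python
import shap

# Explain the trained ANN on a sample of test rows; for a multi-class model
# shap_values is a list with one array per TAT class.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(model_predict_proba, background)
shap_values = explainer.shap_values(X_test[:500])

# y-axis: features sorted by mean |SHAP|; x-axis: SHAP value;
# color: feature value (blue = low, red = high).
shap.summary_plot(shap_values, X_test[:500], feature_names=feature_names)
```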

9 of 52

Prediction Models for Binary Class with Missing Verification Time (ANN Model, Class B)

At the time of entry (with RandomOverSampler)

Model      Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Test   0.67       0.6476      0.4644   0.6100        0.5249     0.5653

The ANN model achieved 67% test accuracy with a weighted F1-score of 0.52, indicating below-average predictive performance.

Precision (0.6476) and Recall (0.46) suggest classification performance that is not highly confident across the three classes.

The ROC AUC of 0.56 indicates only weak, near-chance discrimination between normal and anomalous entries.

Specificity (0.61) shows that the model correctly identifies around 61% of non-anomalous workflows.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

The RandomOverSampler was used to balance minority classes such as High TAT and Incomplete Workflow.

10 of 52

11 of 52

SHAP Diagram

Y-axis:

Features are sorted by their overall importance (top = most influential).

X-axis:

Negative SHAP → pushed prediction toward lower TAT / faster workflow

Positive SHAP → pushed prediction toward higher TAT / possible delay or anomaly.

The farther from zero, the stronger the feature’s effect.

Color:

> Blue: low value of feature

> Red: high value of feature

12 of 52

Adding New_Features

Backlog_SameDay_Before

  • For each Rx, at the moment it’s entered, count how many other Rxs were already entered earlier the same day. Bigger number => more work ahead => likely longer TAT.
  • Take each day’s entered Rxs, sort by entry time, and assign the position index (0,1,2,…) to each Rx. That index is Backlog_SameDay_Before for that Rx.
  • That index is literally the number of Rxs that completed entry before this Rx on the same day (i.e., the queue already in front of it).
  • Example: At 2:15 PM, this Rx gets entered and 37 Rxs were entered earlier today -> Backlog_SameDay_Before = 37 -> expect longer turnaround time.
  • Alert if this value is consistently high vs your store’s typical level for that hour (e.g., > mean + 3×std for ≥10 min).

Batch_IsLarge_5m

  • “In the last 5 minutes, did we see an unusually large number of Rx entries?” Output is a simple Yes/No flag.
  • Floor each entry timestamp to a 5-minute bucket, count entries per bucket, and set Batch_IsLarge_5m = 1 if the count ≥ your threshold (3 for a simple rule).
  • A short, fixed window captures arrival/entry bursts. Comparing the 5-minute count to a threshold tells you if the spike is meaningfully above normal rather than random noise.
  • Example: Typical 3–4 PM bucket ≤ 14; current 5-min count = 19 → Batch_IsLarge_5m = 1 (Yes, large batch). If backlog is also rising, it’s volume-driven; if backlog stays flat, the system is absorbing the surge.
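
A pandas sketch of both features, assuming "Rx RxEntered Date" holds the entry timestamp and using the simple threshold of 3 mentioned above.

```python
import pandas as pd

df["_entry"] = pd.to_datetime(df["Rx RxEntered Date"])
df = df.sort_values("_entry")

# Backlog_SameDay_Before: position of this Rx within its entry day (0, 1, 2, ...),
# i.e. how many Rxs were already entered earlier that same day.
df["Backlog_SameDay_Before"] = df.groupby(df["_entry"].dt.date).cumcount()

# Batch_IsLarge_5m: flag 5-minute buckets with an unusually large entry count.
bucket = df["_entry"].dt.floor("5min")
df["Batch_IsLarge_5m"] = (df.groupby(bucket)["_entry"].transform("size") >= 3).astype(int)
```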

13 of 52

Removing features with weak correlation to the target (|r| < 0.05)

PatientVisitFrequency (|r|≈0.0486)

RxTotalRxAmount (|r|≈0.0444)

RxRefillsRemaining (|r|≈0.0276)

RxTotalPrice (|r|≈0.0123)

Price_vs_Total_Ratio (|r|≈0.002)

DAW_Required (|r|≈0.0002)

Positive r: as the feature goes up, label tends to go up.

Negative r: as the feature goes up, label tends to go down.

Remove a feature if its correlation is near zero (|r| < 0.05).
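
A sketch of this filter, assuming the label is numeric so that Pearson r can be computed directly against TAT_Label.

```python
import pandas as pd

# Correlation of every numeric feature with the target, then drop |r| < 0.05.
corr = df.corr(numeric_only=True)["TAT_Label"].drop("TAT_Label")
weak = corr[corr.abs() < 0.05].index.tolist()
print("Dropping near-zero-correlation features:", weak)
df = df.drop(columns=weak)
```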

14 of 52

Prediction Models after Removing the Features with |r| < 0.05

The ANN model achieved 58% test accuracy with a weighted F1-score of 0.56, indicating below-average predictive performance.

Precision (0.565) and Recall (0.564) suggest classification performance that is not highly confident across the three classes.

The ROC AUC of 0.813 indicates reasonably good discrimination between normal and anomalous entries.

Specificity (0.81) shows that the model correctly identifies around 81% of non-anomalous workflows.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

The RandomOverSampler was used to balance minority classes such as High TAT and Incomplete Workflow.

At the time of entry (with RandomOverSampler)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Train   0.578      0.565       0.565    0.866         0.551      0.813
ANN-Test    0.578      0.564       0.564    0.866         0.551      0.814

15 of 52

16 of 52

SHAP Diagram

Y-axis:

Features are sorted by their overall importance (top = most influential).

X-axis:

Negative SHAP → pushed prediction toward lower TAT / faster workflow

Positive SHAP → pushed prediction toward higher TAT / possible delay or anomaly.

The farther from zero, the stronger the feature’s effect.

Color:

> Blue: low value of feature

> Red: high value of feature

17 of 52

Adding New_Features

Backlog_SameDay_Before

  • For each Rx, at the moment it’s entered, count how many other Rxs were already entered earlier the same day. Bigger number => more work ahead => likely longer TAT.
  • Take each day’s entered Rxs, sort by entry time, and assign the position index (0,1,2,…) to each Rx. That index is Backlog_SameDay_Before for that Rx.
  • That index is literally the number of Rxs that completed entry before this Rx on the same day (i.e., the queue already in front of it).
  • Example: At 2:15 PM, this Rx gets entered and 37 Rxs were entered earlier today -> Backlog_SameDay_Before = 37 -> expect longer turnaround time.
  • Alert if this value is consistently high vs your store’s typical level for that hour (e.g., > mean + 3×std for ≥10 min).

Batch_IsLarge_8h : (Working hours)

  • “In the last 8 hours, did we see an unusually large number of Rx entries?” Output is a simple Yes/No flag.
  • Floor each entry timestamp to an 8-hour bucket, count entries per bucket, and set Batch_IsLarge_8h = 1 if the count ≥ your threshold (3 for a simple rule).
  • A fixed window captures arrival/entry bursts. Comparing the 8-hour count to a threshold tells you if the spike is meaningfully above normal rather than random noise.

Batch_IsLarge_24h : (Last Day).

Batch_IsLarge_38h : (Average Workflow).

18 of 52

Adding New_Features

Workload_Ratio_24h

  • Workload_Ratio_24h = Backlog_24h / PrescriptionsPerStaff_AtEntry
  • This describes how many recent (last 24h) Rxs each staff member is carrying on average.
  • Bigger ratio ⇒ heavier load per person → use it to flag Isworkloadonstaff_High.

Isworkloadonstaff_High
  • Binary flag from workload—1 if staff load is high, else 0.
  • Isworkloadonstaff_High = 1 when Average_Workload_24h ≥ threshold (Rx/hour) otherwise 0.
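
A sketch of the two workload features, using the column names from this slide (Backlog_24h, PrescriptionsPerStaff_AtEntry, Average_Workload_24h); the threshold value here is a placeholder to be tuned per store.

```python
import pandas as pd

WORKLOAD_THRESHOLD = 10  # placeholder threshold (Rx/hour)

# Workload_Ratio_24h: recent (24h) backlog divided by prescriptions per staff member.
df["Workload_Ratio_24h"] = (
    df["Backlog_24h"] / df["PrescriptionsPerStaff_AtEntry"].replace(0, pd.NA)
)

# Isworkloadonstaff_High: 1 when the 24h average workload crosses the threshold.
df["Isworkloadonstaff_High"] = (df["Average_Workload_24h"] >= WORKLOAD_THRESHOLD).astype(int)
```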

19 of 52

Correlation between features and the target (numeric)


20 of 52

Correlation between features and the target (categorical)


21 of 52

Correlation between features and the target, with association < 0.05


I decided to remove the categorical features manually, since the numerical features have some correlation with other columns that correlate well with the target variable.

I also removed Insurance_Rejection_Count, Pharmacy_Rejection_Count, Peak_Hours_Rejection, and Weekend_Rejection since all their values are zero.

22 of 52

Prediction Models after Removing the Features with Association < 0.05

The model is consistently around 62–63% accurate on both train and test (no overfitting).

Recall & Precision (macro ~0.62) show balanced performance across classes, while Specificity ~0.88 means it reliably avoids false alarms.

ROC-AUC ~0.88 indicates strong overall ranking/separation of classes. These results are better than the previous ones but still not good enough.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Focal Loss with class weights is used to re-weight the training loss, which improves minority-class detection without creating synthetic samples.
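
A minimal PyTorch sketch of a focal loss with per-class weights, as an illustration of the objective named above (gamma = 2.0 is the common default, not a value stated in the slides).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha: torch.Tensor, gamma: float = 2.0):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma  # alpha: per-class weights

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, target, reduction="none")
        pt = torch.exp(-ce)                     # probability of the true class
        focal = (1 - pt) ** self.gamma * ce     # down-weight easy examples
        return (self.alpha[target] * focal).mean()
```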

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Train   0.6267     0.6231      0.6328   0.8841        0.5905     0.8785
ANN-Test    0.6255     0.6235      0.6322   0.8838        0.5905     0.8776

23 of 52

Sorting by Rx RxEntered Date

2020 – 2024

I sorted all records by Rx RxEntered Date instead of using the existing mixed (unsorted) ordering.

Reason: pharmacy workflows change over days/weeks—so training in time order lets the model see natural patterns (weekday vs. weekend, month-start rush, batch entries, shift changes).

It also prevents future→past leakage that can happen with random splits.

This gives us better results than the model trained on the mixed data.

Rx RxEntered Date

02/01/20 8:48

04/01/20 9:49

04/01/20 :49

04/01/20 11:13

04/01/20 11:42

04/01/20 12:01

04/01/20 12:09

04/01/20 12:12

04/01/20 12:15

04/01/20 12:16

04/01/20 12:16

04/01/20 12:19

04/01/20 13:09

04/01/20 14:22

05/01/20 8:48

05/01/20 9:05

05/01/20 9:08

05/01/20 10:32

05/01/20 10:35

24 of 52

MLP with Class Weights, 5-Fold Cross-Validation (Random)

Good separation (high ROC-AUC), but threshold metrics are middling (F1 ≈ 0.59).

For operations this is useful but not final; we can tune thresholds to trade off recall vs. precision based on cost.

ROC-AUC ~0.87 indicates strong overall ranking/separation of classes. These results are better than the previous ones but still not good enough.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights to handle imbalance

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.6220     0.6222      0.6290   0.8829        0.5900     0.8756
MLP-Test    0.6226     0.6189      0.6286   0.8831        0.5907     0.8747

25 of 52

MLP with Class Weights, 10-Fold Cross-Validation (Non-random)

Low accuracy (0.46–0.48) and only moderate F1 (0.52–0.54), even though Precision/Recall are around 0.60: there are many mistakes at the default threshold.

Strength: ROC-AUC (0.83–0.85) and Specificity ≈ 0.82 show the model separates classes reasonably well and avoids too many false alarms.

Cross-validation design: time-blocked folds (no shuffling). Each fold uses 3 months for training → 3 months for validation → 3 months for testing.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights to handle imbalance

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.4601     0.5894      0.5936   0.8190        0.5210     0.8346
MLP-Test    0.4784     0.6065      0.6096   0.8196        0.5407     0.8473

26 of 52

Model 1 and Model 4

Data split
  Model1 (Code 1): Likely single train/val/test (possibly random)
  Model4 (Code 2): Time-blocked folds (train past → predict future)
  Why Model4 helped: Matches the real use case, reduces leakage, more reliable generalization.

Imbalance handling
  Model1 (Code 1): Class weights only
  Model4 (Code 2): Focal Loss + class weights
  Why Model4 helped: Focal focuses learning on hard/rare cases → better recall/precision at the same threshold.

Batching
  Model1 (Code 1): Standard batches
  Model4 (Code 2): WeightedRandomSampler (or balanced mini-batches)
  Why Model4 helped: Keeps minority class visible every step → steadier gradients, less bias to majority.

Thresholds
  Model1 (Code 1): Default 0.5
  Model4 (Code 2): Validated threshold
  Why Model4 helped: Converts good ranking into better F1/ops metrics by picking a smarter cut-off.

Regularization
  Model1 (Code 1): Dropout/L2 (basic)
  Model4 (Code 2): Dropout + AdamW + early stopping tuned
  Why Model4 helped: Lowers overfit; val and test get closer → more stable metrics.

Feature pipeline
  Model1 (Code 1): Basic scaling/encodings
  Model4 (Code 2): Cleaner timestamps + consistent TZ + better encodings
  Why Model4 helped: Less noise → the model sees clearer patterns (weekday/month, backlog effects).

Architecture details
  Model1 (Code 1): MLP with ReLU
  Model4 (Code 2): Same MLP but better tuned (lr, layers, hidden size, patience)
  Why Model4 helped: Small tuning often lifts F1/AUC without making the net bigger.

27 of 52

MLP with Class Weights (Latest Model)

Consistent and balanced: Accuracy ~0.673 with Precision ≈ Recall ≈ 0.64 and F1 ≈ 0.638 across folds; from this we can say there is no obvious overfitting.

Strong separation, conservative alerts: ROC-AUC ~0.884 and Specificity ~0.894 show solid ranking and a low false-positive rate.

Limitation: performance plateaus around F1 ≈ 0.64.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights to handle imbalance

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.6724     0.6385      0.6455   0.8936        0.6381     0.8851
MLP-Test    0.6729     0.6381      0.6446   0.8938        0.6378     0.8840

28 of 52

MLP with Class Weights (Latest Model)

Consistent and balanced: Accuracy ~0.60 with Precision ≈ Recall ≈ 0.55 and F1 ≈ 0.55 across folds; from this we can say there is no obvious overfitting, but overall performance is not good.

Strong separation, conservative alerts: ROC-AUC ~0.884 and Specificity ~0.87 show solid ranking and low false-positive rate.

Limitation: performance plateaus around F1 ≈ 0.55, with low accuracy of ~0.60.

10-fold rolling-window validation method; stability period from 2022 to 2024.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights to handle imbalance

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Val     0.6002     0.5521      0.5495   0.8669        0.5504     0.8200
MLP-Test    0.5109     0.4910      0.4675   0.8485        0.4570     0.7870

29 of 52

Fold 1:

  • Train: Months 1, 2
  • Validate: Month 3
  • Test: Month 4

Fold 2: (Slide the window forward by 1 month)

  • Train: Months 2, 3
  • Validate: Month 4
  • Test: Month 5

Fold 3: (Slide the window forward by 1 month)

  • Train: Months 3, 4
  • Validate: Month 5
  • Test: Month 6
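
A sketch of generating these rolling monthly folds, assuming "Rx RxEntered Date" is the ordering timestamp; 2 months train → 1 month validate → 1 month test, sliding forward by 1 month per fold.

```python
import pandas as pd

def rolling_month_folds(df: pd.DataFrame, n_folds: int = 3):
    month = pd.to_datetime(df["Rx RxEntered Date"]).dt.to_period("M")
    months = sorted(month.unique())              # needs at least n_folds + 3 months
    for i in range(n_folds):
        train_m, val_m, test_m = months[i:i + 2], months[i + 2], months[i + 3]
        yield (df[month.isin(train_m)],          # e.g. months 1-2
               df[month == val_m],               # month 3
               df[month == test_m])              # month 4

for k, (train_df, val_df, test_df) in enumerate(rolling_month_folds(df), start=1):
    print(f"Fold {k}: train={len(train_df)}, val={len(val_df)}, test={len(test_df)}")
```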

30 of 52

Task 1: Train the model over a specific time period, before the pipe or after.

31 of 52

Data Extraction & Turnaround-Time (TAT) Feature Engineering

In the stability period 2022 to 2023

*target variable

The data consisted of 29,367 rows, covering roughly 1.2 years of data (2022 to 2023). Derived two analytical columns to characterize workflow efficiency and anomalies:

0 – Low: Under 38 hrs

1 – High: More than 38 hrs

2 – Incomplete Workflow: Missing Ph_verification

Task 2: Recalculate the features after splitting the dataset, to prevent data leakage from training into testing (≤ t).

32 of 52

MLP-Model

Stability Period From 2022 to 2023 (Before outlier)

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights

Task 3: 3 months for training and 1 month (fixed) for testing

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.7499     0.7351      0.7324   0.8677        0.7290     0.8843
MLP-Test    0.5952     0.5129      0.5868   0.7908        0.5208     0.6943

33 of 52

MLP-Model

Stability Period From 2022 to 2023 (Before outlier)

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights

Task 3: 4 months for training and 1 month (fixed) for testing

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.7355     0.7265      0.7411   0.8654        0.7319     0.8803
MLP-Test    0.6384     0.6725      0.6467   0.8112        0.6543     0.8214

34 of 52

MLP-Model

Stability Period From 2022 to 2023 (Before outlier)

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights

Task 3: 6 months for training and 1 month (fixed) for testing

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.7248     0.7470      0.6999   0.8474        0.7118     0.8832
MLP-Test    0.6608     0.6997      0.6509   0.7932        0.6711     0.8120

35 of 52

Shap-Diagram

36 of 52

37 of 52

Added Features:

1. Daily_Avg_RxEntered

  • For each prescription, we count how many prescriptions were entered on that same calendar day.
  • Same value for all rows with the same entry date.
  • Example: If 320 prescriptions were entered on 2025-01-10, then Daily_Avg_RxEntered = 320 for every Rx with that entry date.
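
A one-line pandas sketch of this feature; "Rx RxEntered Date" is the assumed entry-timestamp column.

```python
import pandas as pd

# Every Rx entered on the same calendar day gets that day's total entry count.
entry_date = pd.to_datetime(df["Rx RxEntered Date"]).dt.date
df["Daily_Avg_RxEntered"] = df.groupby(entry_date)["Rx RxEntered Date"].transform("size")
```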

38 of 52

Added Features: Staff-Based Trend Features

2. Trend-Based Staff Efficiency Lag Features (1d, 7d, 30d)

  • Content:
  • For each pharmacist, we compute:
    • SE_1d = average Rx/day verified in the last 1 day
    • SE_7d = average Rx/day verified in the last 7 days
    • SE_30d = average Rx/day verified in the last 30 days
    • SE_overall = average Rx/day verified over the pharmacist’s entire history
  • We then define trend-based lag features as ratios:
    • Lag_1d = SE_1d / SE_overall
    • Lag_7d = SE_7d / SE_overall
    • Lag_30d = SE_30d / SE_overall
  • Interpretation:
    • Value ≈ 1 → current performance is normal vs long-term
    • Value > 1 → current performance is better / faster than usual
    • Value < 1 → current performance is worse / slower than usual
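
A sketch of these lag ratios; the pharmacist identifier column ("PharmacistID") and the use of row-wise rolling windows over daily counts are assumptions.

```python
import pandas as pd

ver = df.dropna(subset=["Rx PhVerifDate"]).copy()
ver["ver_date"] = pd.to_datetime(ver["Rx PhVerifDate"]).dt.normalize()

# Daily verified counts per pharmacist, then rolling means over 1/7/30 days.
daily = (ver.groupby(["PharmacistID", "ver_date"]).size()
            .rename("rx_per_day").reset_index().sort_values("ver_date"))
g = daily.groupby("PharmacistID", group_keys=False)
daily["SE_1d"] = g["rx_per_day"].transform(lambda s: s.rolling(1).mean())
daily["SE_7d"] = g["rx_per_day"].transform(lambda s: s.rolling(7, min_periods=1).mean())
daily["SE_30d"] = g["rx_per_day"].transform(lambda s: s.rolling(30, min_periods=1).mean())
daily["SE_overall"] = g["rx_per_day"].transform("mean")

# Ratios: ~1 normal, >1 faster than usual, <1 slower than usual.
for w in ("1d", "7d", "30d"):
    daily[f"Lag_{w}"] = daily[f"SE_{w}"] / daily["SE_overall"]
```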

39 of 52

Added Features: Time-Based Trend Features

4. Delay from Entry to Verification (7d vs Overall)

  • We first compute how many hours it takes for each prescription to go from Rx RxEntered Date → Rx PhVerifDate.
  • We keep only reasonable delays (0 to 7 days) and ignore missing verification dates.
  • For each day, we calculate the average delay, then:
    • Overall average delay across all days
    • 7-day rolling average delay (last 7 days)
  • Our feature is: Delay_EntryToVerif_7d_vsOverall = (7-day avg delay) / (overall avg delay)

5. Daily Volume Trend (7d vs Overall)

  • For each day, we count how many prescriptions were entered (Daily_RxCount).
  • We then compute:
    • Overall average daily volume
    • 7-day rolling average daily volume (last 7 days)
  • Our feature is: DailyVolume_7d_vsOverall = (7-day avg volume) / (overall avg volume)
  • Interpretation for both of these:
    • ≈ 1 → this week’s volume is normal
    • > 1 → this week is busier than usual
    • < 1 → this week is quieter than usual
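
A sketch of both time-based trend ratios, aggregating by entry date; column names follow the slides and the 0-7 day delay filter mentioned above.

```python
import pandas as pd

df["entry_date"] = pd.to_datetime(df["Rx RxEntered Date"]).dt.normalize()
delay_h = (pd.to_datetime(df["Rx PhVerifDate"]) - pd.to_datetime(df["Rx RxEntered Date"])
          ).dt.total_seconds() / 3600
delay_h = delay_h.where((delay_h >= 0) & (delay_h <= 7 * 24))   # keep 0-7 day delays

daily = pd.DataFrame({
    "avg_delay": delay_h.groupby(df["entry_date"]).mean(),
    "rx_count":  df.groupby("entry_date").size(),
}).sort_index()

daily["Delay_EntryToVerif_7d_vsOverall"] = (
    daily["avg_delay"].rolling(7, min_periods=1).mean() / daily["avg_delay"].mean()
)
daily["DailyVolume_7d_vsOverall"] = (
    daily["rx_count"].rolling(7, min_periods=1).mean() / daily["rx_count"].mean()
)
```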

40 of 52

MLP-Model (Class weight + Focal loss)

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights

Task 3: 3 months for training and 1 month (fixed) for testing

From the model's performance there is minor overfitting between train and val/test; if that is addressed, the overall model performance might be better than the present one.

41 of 52

Each class performance:

42 of 52

Shap-Diagram

43 of 52

Why do staff-based features appear in the SHAP plot?

  • In our data, TAT mostly follows staff pressure.
  • When staff are efficient / not overloaded → TAT is usually short (good).
  • When staff are less efficient / overloaded → TAT is usually long (expected).
  • We created many staff features:
  • Staff efficiency and workload over 1 day, 7 days, and 30 days
  • Plus versions like “vs overall average”, lags, and rolling stats
  • Because these staff features strongly explain how TAT normally behaves, the model uses them a lot, so they show up in the Top-10 SHAP features.

44 of 52

45 of 52

46 of 52

Approach:

  • Previously, we created a lot of staff features.
  • Many of these staff features are very similar to each other (they move almost the same over time).
  • Even if they are not >0.8 correlated with the target, some of them are >0.8 correlated with each other, so they are basically duplicates.
  • If two staff features are >0.8 correlated with each other, we keep one and drop the other.
  • All non-staff features stay as they are.
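
A sketch of this pruning step; staff_cols is assumed to list the staff-derived feature columns.

```python
import numpy as np
import pandas as pd

# Drop one feature from every pair of staff features that are >0.8 correlated.
corr = df[staff_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle only
to_drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
df = df.drop(columns=to_drop)
print("Dropped near-duplicate staff features:", to_drop)
```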

47 of 52

Shap-Diagram

48 of 52

After removing features that are >0.8 correlated with each other.

After

Before:

49 of 52

50 of 52

Update

  • Time period:

Window 1: 2022-01-03 to 2023-02-27

Window 2: 2023-06-01 to 2024-12-31

Total number of prescriptions:

64551

51 of 52

The model trained for all 100 epochs as validation metrics showed continuous small improvements, with final test accuracy of 68% exceeding validation performance, confirming good generalization to unseen data.

52 of 52

Each class result: