1 of 52

Data Pre-processing

  • Applied IQR-based quartile segmentation to balance the five TAT_Label categories (0–4).
  • Learning rate → 1e-4 (stable convergence)

  • Batch size → 256 (memory efficient)

  • Dropout → 0.3 (to prevent overfitting)

  • Hidden layers → [256, 128, 64] (deep enough for non-linear feature interaction)

  • Used EarlyStopping (patience = 5) to prevent overfitting.

  • Data split: 70% training, 15% validation, 15% testing.
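
Below is a minimal sketch of the ANN configuration listed above, assuming a PyTorch implementation; the input width and anything beyond the stated hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Sketch of the ANN described on this slide: hidden layers [256, 128, 64],
# dropout 0.3, learning rate 1e-4. The input size and any details beyond the
# stated hyperparameters are assumptions.
def build_ann(n_features: int, n_classes: int = 5) -> nn.Module:
    layers, prev = [], n_features
    for width in (256, 128, 64):
        layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(0.3)]
        prev = width
    layers.append(nn.Linear(prev, n_classes))
    return nn.Sequential(*layers)

model = build_ann(n_features=40)  # 40 input features is a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Training would use batches of 256 and stop once validation loss fails to
# improve for 5 consecutive epochs (EarlyStopping, patience = 5).
```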

2 of 52

Data Extraction & Turnaround-Time (TAT) Feature Engineering

IQR Based

*target variable

  • Derived two analytical columns to characterize workflow efficiency and anomalies:

TAT_Label (Categorical Performance Indicator)
    • 0 – Low: TAT_Hours ≤ Q1 (≤ 25th percentile). Very short or same-day prescriptions.
    • 1 – Medium-Low: Q1 < TAT_Hours ≤ Q2 (25th–50th percentile). Slightly below average turnaround — generally efficient.
    • 2 – Medium-High: Q2 < TAT_Hours ≤ Q3 (50th–75th percentile). Slightly slower than average — moderate delay zone.
    • 3 – High: TAT_Hours > Q3 (> 75th percentile). Slowest 25% of prescriptions — potential workflow bottlenecks.
    • 4 – Incomplete Workflow: Missing Ph_verification date.
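
A minimal pandas sketch of the quartile-based labeling above; the column names TAT_Hours and "Rx PhVerifDate" are assumptions based on the slides.

```python
import numpy as np
import pandas as pd

# IQR/quartile-based TAT_Label: 0-3 from quartiles of TAT_Hours,
# 4 when the pharmacist verification date is missing.
def add_tat_label(df: pd.DataFrame) -> pd.DataFrame:
    q1, q2, q3 = df["TAT_Hours"].quantile([0.25, 0.50, 0.75])
    conditions = [
        df["Rx PhVerifDate"].isna(),   # 4 - Incomplete Workflow
        df["TAT_Hours"] <= q1,         # 0 - Low
        df["TAT_Hours"] <= q2,         # 1 - Medium-Low
        df["TAT_Hours"] <= q3,         # 2 - Medium-High
    ]
    df["TAT_Label"] = np.select(conditions, [4, 0, 1, 2], default=3)  # 3 - High
    return df
```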

3 of 52

Prediction Models for the IQR-Based Feature with the ANN Model

ADASYN (Adaptive Synthetic Sampling) + ANN

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Train   0.60       0.59        0.58     0.73          0.58       0.83
ANN-Test    0.54       0.54        0.54     0.70          0.54       0.83

Although the accuracy (~54%) suggests moderate classification ability, the ROC-AUC score (0.83) indicates strong ranking and separability: the ANN learns the relative order of the TAT classes but struggles with precise boundary classification due to class overlap in the IQR-based labels.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

ADASYN generates more synthetic samples near the decision boundaries, which helps reduce overlap between classes.

It improves class separation without overfitting to dense areas.
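
A short sketch of the ADASYN step, assuming the imbalanced-learn implementation and that X_train / y_train hold the encoded features and TAT_Label for the training split only.

```python
from imblearn.over_sampling import ADASYN

# Resample only the training split; ADASYN synthesizes extra minority-class
# points near hard-to-learn regions instead of duplicating dense areas.
ada = ADASYN(random_state=42)
X_train_res, y_train_res = ada.fit_resample(X_train, y_train)
```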

4 of 52

Data Pre-processing

  • Applied RandomOverSampler (simple random oversampling of minority classes) to balance all five TAT_Label categories (0–4) equally.
  • Learning rate → 1e-4 (stable convergence)

  • Batch size → 256 (memory efficient)

  • Dropout → 0.3 (to prevent overfitting)

  • Hidden layers → [256, 128, 64] (deep enough for non-linear feature interaction)

  • Used EarlyStopping (patience = 5) to prevent overfitting.

  • Data split: 70% training, 15% validation, 15% testing (stratified).

5 of 52

Data Extraction & Turnaround-Time (TAT) Feature Engineering (Hour-Based)

*target variable

  • Derived two analytical columns to characterize workflow efficiency and anomalies:

TAT_Label (Categorical Performance Indicator)
    • 0 – Low: Under 24 hrs
    • 1 – Medium-Low: 1 day up to 1 month (24-719 hrs)
    • 2 – Medium-High: 1 month up to 6 months (720-4319 hrs)
    • 3 – High: Anything 6 months or longer
    • 4 – Incomplete Workflow: Missing Ph_verification

Is_TAT_Anomaly (Anomaly Flag)

    • 0 → Normal range
    • 1 → Outlier / unusually long or short TAT
    • 2 → Neutral / Missing data
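
A minimal sketch of the hour-based binning above; TAT_Hours and "Rx PhVerifDate" are assumed column names.

```python
import numpy as np
import pandas as pd

# Hour-based TAT_Label: <24h, 24-719h, 720-4319h, >=4320h, or 4 when the
# verification date is missing.
def hour_based_label(df: pd.DataFrame) -> pd.Series:
    conditions = [
        df["Rx PhVerifDate"].isna(),   # 4 - Incomplete Workflow
        df["TAT_Hours"] < 24,          # 0 - Low
        df["TAT_Hours"] < 720,         # 1 - Medium-Low (1 day up to 1 month)
        df["TAT_Hours"] < 4320,        # 2 - Medium-High (1 month up to 6 months)
    ]
    return pd.Series(np.select(conditions, [4, 0, 1, 2], default=3), index=df.index)
```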

6 of 52

Prediction Models for the Hour-Based Feature with the ANN Model

At the time of entry (with RandomOverSampler)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Train   0.755      0.765       0.755    0.820         0.752      0.942
ANN-Test    0.775      0.786       0.775    0.880         0.773      0.945

The ANN model achieved 77.5% test accuracy with a balanced F1-score of 0.77, indicating robust learning across all TAT categories. The High-TAT category remains challenging (F1 ≈ 0.42) and could improve with additional long-tail samples or engineered time-series features.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

The RandomOverSampler was used to balance minority classes such as High TAT and Incomplete Workflow.

7 of 52

8 of 52

SHAP Diagram

Y-axis:

Features are sorted by their overall importance (top = most influential).

X-axis:

Negative SHAP → pushed prediction toward lower TAT / faster workflow

Positive SHAP → pushed prediction toward higher TAT / possible delay or anomaly.

The farther from zero, the stronger the feature’s effect.

Color:

> Blue: low value of feature

> Red: high value of feature
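
A sketch of how a SHAP summary (beeswarm) plot like this is typically produced; model_predict_proba, X_train, X_test, and feature_names are placeholders, and using KernelExplainer is an assumption rather than the exact code behind the slide.

```python
import shap

# Explain the trained ANN on a sample of test rows; for a multi-class model
# shap_values is a list with one array per TAT class.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(model_predict_proba, background)
shap_values = explainer.shap_values(X_test[:500])

# y-axis: features sorted by mean |SHAP|; x-axis: SHAP value;
# color: feature value (blue = low, red = high).
shap.summary_plot(shap_values, X_test[:500], feature_names=feature_names)
```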

9 of 52

Prediction Models for Binary Class with Missing Verification Time (ANN Model, Class B)

At the time of entry (with RandomOverSampler)

Model      Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Test   0.67       0.6476      0.4644   0.6100        0.5249     0.5653

The ANN model achieved 67% test accuracy with a weighted F1-score of 0.52, indicating below-average predictive performance.

Precision (0.6476) and Recall (0.46) suggest classification performance that is not highly confident across the three classes.

The ROC AUC of 0.56 indicates only weak, near-chance discrimination between normal and anomalous entries.

Specificity (0.61) shows that the model correctly identifies around 61% of non-anomalous workflows.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

The RandomOverSampler was used to balance minority classes such as High TAT and Incomplete Workflow.

10 of 52

11 of 52

SHAP Diagram

Y-axis:

Features are sorted by their overall importance (top = most influential).

X-axis:

Negative SHAP → pushed prediction toward lower TAT / faster workflow

Positive SHAP → pushed prediction toward higher TAT / possible delay or anomaly.

The farther from zero, the stronger the feature’s effect.

Color:

> Blue: low value of feature

> Red: high value of feature

12 of 52

Adding New_Features

Backlog_SameDay_Before

  • For each Rx, at the moment it’s entered, count how many other Rxs were already entered earlier the same day. Bigger number => more work ahead => likely longer TAT.
  • Take each day’s entered Rxs, sort by entry time, and assign the position index (0,1,2,…) to each Rx. That index is Backlog_SameDay_Before for that Rx.
  • That index is literally the number of Rxs that completed entry before this Rx on the same day (i.e., the queue already in front of it).
  • Example: At 2:15 PM, this Rx gets entered and 37 Rxs were entered earlier today -> Backlog_SameDay_Before = 37 -> expect longer turnaround time.
  • Alert if this value is consistently high vs your store’s typical level for that hour (e.g., > mean + 3×std for ≥10 min).

Batch_IsLarge_5m

  • “In the last 5 minutes, did we see an unusually large number of Rx entries?” Output is a simple Yes/No flag.
  • Floor each entry timestamp to a 5-minute bucket, count entries per bucket, and set Batch_IsLarge_5m = 1 if the count ≥ your threshold (3 for a simple rule).
  • A short, fixed window captures arrival/entry bursts. Comparing the 5-minute count to a threshold tells you if the spike is meaningfully above normal rather than random noise.
  • Example: Typical 3–4 PM bucket ≤ 14; current 5-min count = 19 → Batch_IsLarge_5m = 1 (Yes, large batch). If backlog is also rising, it’s volume-driven; if backlog stays flat, the system is absorbing the surge.
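
A pandas sketch of both features, assuming "Rx RxEntered Date" holds the entry timestamp and using the simple threshold of 3 mentioned above.

```python
import pandas as pd

df["_entry"] = pd.to_datetime(df["Rx RxEntered Date"])
df = df.sort_values("_entry")

# Backlog_SameDay_Before: position of this Rx within its entry day (0, 1, 2, ...),
# i.e. how many Rxs were already entered earlier that same day.
df["Backlog_SameDay_Before"] = df.groupby(df["_entry"].dt.date).cumcount()

# Batch_IsLarge_5m: flag 5-minute buckets with an unusually large entry count.
bucket = df["_entry"].dt.floor("5min")
df["Batch_IsLarge_5m"] = (df.groupby(bucket)["_entry"].transform("size") >= 3).astype(int)
```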

13 of 52

Removing features with weak correlation to the target (|r| < 0.05)

PatientVisitFrequency (|r|≈0.0486)

RxTotalRxAmount (|r|≈0.0444)

RxRefillsRemaining (|r|≈0.0276)

RxTotalPrice (|r|≈0.0123)

Price_vs_Total_Ratio (|r|≈0.002)

DAW_Required (|r|≈0.0002)

Positive r: as the feature goes up, label tends to go up.

Negative r: as the feature goes up, label tends to go down.

Remove a feature if its correlation is near zero (|r| < 0.05).
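
A sketch of this filter, assuming the label is numeric so that Pearson r can be computed directly against TAT_Label.

```python
import pandas as pd

# Correlation of every numeric feature with the target, then drop |r| < 0.05.
corr = df.corr(numeric_only=True)["TAT_Label"].drop("TAT_Label")
weak = corr[corr.abs() < 0.05].index.tolist()
print("Dropping near-zero-correlation features:", weak)
df = df.drop(columns=weak)
```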

14 of 52

Prediction Models after Removing the Features with |r| < 0.05

The ANN model achieved 58% test accuracy with a weighted F1-score of 0.56, indicating below-average predictive performance.

Precision (0.565) and Recall (0.564) suggest classification performance that is not highly confident across the three classes.

The ROC AUC of 0.813 indicates reasonably good discrimination between normal and anomalous entries.

Specificity (0.81) shows that the model correctly identifies around 81% of non-anomalous workflows.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

The RandomOverSampler was used to balance minority classes such as High TAT and Incomplete Workflow.

At the time of entry (with RandomOverSampler)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Train   0.578      0.565       0.565    0.866         0.551      0.813
ANN-Test    0.578      0.564       0.564    0.866         0.551      0.814

15 of 52

16 of 52

SHAP Diagram

Y-axis:

Features are sorted by their overall importance (top = most influential).

X-axis:

Negative SHAP → pushed prediction toward lower TAT / faster workflow

Positive SHAP → pushed prediction toward higher TAT / possible delay or anomaly.

The farther from zero, the stronger the feature’s effect.

Color:

> Blue: low value of feature

> Red: high value of feature

17 of 52

Adding New_Features

Backlog_SameDay_Before

  • For each Rx, at the moment it’s entered, count how many other Rxs were already entered earlier the same day. Bigger number => more work ahead => likely longer TAT.
  • Take each day’s entered Rxs, sort by entry time, and assign the position index (0,1,2,…) to each Rx. That index is Backlog_SameDay_Before for that Rx.
  • That index is literally the number of Rxs that completed entry before this Rx on the same day (i.e., the queue already in front of it).
  • Example: At 2:15 PM, this Rx gets entered and 37 Rxs were entered earlier today -> Backlog_SameDay_Before = 37 -> expect longer turnaround time.
  • Alert if this value is consistently high vs your store’s typical level for that hour (e.g., > mean + 3×std for ≥10 min).

Batch_IsLarge_8h : (Working hours)

  • “In the last 8 hours, did we see an unusually large number of Rx entries?” Output is a simple Yes/No flag.
  • Floor each entry timestamp to an 8-hour bucket, count entries per bucket, and set Batch_IsLarge_8h = 1 if the count ≥ your threshold (3 for a simple rule).
  • A fixed window captures arrival/entry bursts. Comparing the 8-hour count to a threshold tells you if the spike is meaningfully above normal rather than random noise.

Batch_IsLarge_24h : (Last Day).

Batch_IsLarge_38h : (Average Workflow).

18 of 52

Adding New_Features

Workload_Ratio_24h

  • Workload_Ratio_24h = Backlog_24h / PrescriptionsPerStaff_AtEntry
  • This describes how many recent (last 24h) Rxs each staff member is carrying on average.
  • Bigger ratio ⇒ heavier load per person → use it to flag Isworkloadonstaff_High.

Isworkloadonstaff_High
  • Binary flag from workload—1 if staff load is high, else 0.
  • Isworkloadonstaff_High = 1 when Average_Workload_24h ≥ threshold (Rx/hour) otherwise 0.
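
A sketch of the two workload features, using the column names from this slide (Backlog_24h, PrescriptionsPerStaff_AtEntry, Average_Workload_24h); the threshold value here is a placeholder to be tuned per store.

```python
import pandas as pd

WORKLOAD_THRESHOLD = 10  # placeholder threshold (Rx/hour)

# Workload_Ratio_24h: recent (24h) backlog divided by prescriptions per staff member.
df["Workload_Ratio_24h"] = (
    df["Backlog_24h"] / df["PrescriptionsPerStaff_AtEntry"].replace(0, pd.NA)
)

# Isworkloadonstaff_High: 1 when the 24h average workload crosses the threshold.
df["Isworkloadonstaff_High"] = (df["Average_Workload_24h"] >= WORKLOAD_THRESHOLD).astype(int)
```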

19 of 52

Correlation between features and the target (numeric)


20 of 52

Correlation between features and the target (categorical)


21 of 52

Correlation between features and the target, with association < 0.05


I decided to remove the categorical features manually, since the numerical features have some correlation with other columns that correlate well with the target variable.

I also removed Insurance_Rejection_Count, Pharmacy_Rejection_Count, Peak_Hours_Rejection, and Weekend_Rejection since all their values are zero.

22 of 52

Prediction Models after Removing the Features with Association < 0.05

The model is consistently around 62–63% accurate on both train and test (no overfitting).

Recall & Precision (macro ~0.62) show balanced performance across classes, while Specificity ~0.88 means it reliably avoids false alarms.

ROC-AUC ~0.88 indicates strong overall ranking/separation of classes. These results are better than the previous ones but still not good enough.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Focal Loss with class weights is used to re-weight the training loss, which improves minority-class detection without creating synthetic samples.
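
A minimal PyTorch sketch of a focal loss with per-class weights, as an illustration of the objective named above (gamma = 2.0 is the common default, not a value stated in the slides).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha: torch.Tensor, gamma: float = 2.0):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma  # alpha: per-class weights

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, target, reduction="none")
        pt = torch.exp(-ce)                     # probability of the true class
        focal = (1 - pt) ** self.gamma * ce     # down-weight easy examples
        return (self.alpha[target] * focal).mean()
```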

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
ANN-Train   0.6267     0.6231      0.6328   0.8841        0.5905     0.8785
ANN-Test    0.6255     0.6235      0.6322   0.8838        0.5905     0.8776

23 of 52

Sorting by Rx RxEntered Date

2020 – 2024

I sorted all records by Rx RxEntered Date instead of using the existing mixed (unsorted) ordering.

Reason: pharmacy workflows change over days/weeks—so training in time order lets the model see natural patterns (weekday vs. weekend, month-start rush, batch entries, shift changes).

It also prevents future→past leakage that can happen with random splits.

This gives us better results than the model trained on the mixed data.

Rx RxEntered Date

02/01/20 8:48

04/01/20 9:49

04/01/20 :49

04/01/20 11:13

04/01/20 11:42

04/01/20 12:01

04/01/20 12:09

04/01/20 12:12

04/01/20 12:15

04/01/20 12:16

04/01/20 12:16

04/01/20 12:19

04/01/20 13:09

04/01/20 14:22

05/01/20 8:48

05/01/20 9:05

05/01/20 9:08

05/01/20 10:32

05/01/20 10:35

24 of 52

MLP with Class Weights, 5-Fold Cross-Validation (Random)

Good separation (high ROC-AUC), but threshold metrics are middling (F1 ≈ 0.59).

For operations this is useful but not final; we can tune thresholds to trade off recall vs. precision based on cost.

ROC-AUC ~0.87 indicates strong overall ranking/separation of classes. These results are better than the previous ones but still not good enough.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights to handle imbalance

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.6220     0.6222      0.6290   0.8829        0.5900     0.8756
MLP-Test    0.6226     0.6189      0.6286   0.8831        0.5907     0.8747

25 of 52

MLP with Class Weights, 10-Fold Cross-Validation (Non-random)

Low accuracy (0.46–0.48) and only moderate F1 (0.52–0.54), even though Precision/Recall are around 0.60: there are many mistakes at the default threshold.

Strength: ROC-AUC (0.83–0.85) and Specificity ≈ 0.82 show the model separates classes reasonably well and avoids too many false alarms.

Cross-validation design: time-blocked folds (no shuffling). Each fold uses 3 months for training → 3 months for validation → 3 months for testing.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights to handle imbalance

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.4601     0.5894      0.5936   0.8190        0.5210     0.8346
MLP-Test    0.4784     0.6065      0.6096   0.8196        0.5407     0.8473

26 of 52

Model 1 and Model 4

Data split
  Model1 (Code 1): Likely single train/val/test (possibly random)
  Model4 (Code 2): Time-blocked folds (train past → predict future)
  Why Model4 helped: Matches the real use case, reduces leakage, more reliable generalization.

Imbalance handling
  Model1 (Code 1): Class weights only
  Model4 (Code 2): Focal Loss + class weights
  Why Model4 helped: Focal focuses learning on hard/rare cases → better recall/precision at the same threshold.

Batching
  Model1 (Code 1): Standard batches
  Model4 (Code 2): WeightedRandomSampler (or balanced mini-batches)
  Why Model4 helped: Keeps minority class visible every step → steadier gradients, less bias to majority.

Thresholds
  Model1 (Code 1): Default 0.5
  Model4 (Code 2): Validated threshold
  Why Model4 helped: Converts good ranking into better F1/ops metrics by picking a smarter cut-off.

Regularization
  Model1 (Code 1): Dropout/L2 (basic)
  Model4 (Code 2): Dropout + AdamW + early stopping tuned
  Why Model4 helped: Lowers overfit; val and test get closer → more stable metrics.

Feature pipeline
  Model1 (Code 1): Basic scaling/encodings
  Model4 (Code 2): Cleaner timestamps + consistent TZ + better encodings
  Why Model4 helped: Less noise → the model sees clearer patterns (weekday/month, backlog effects).

Architecture details
  Model1 (Code 1): MLP with ReLU
  Model4 (Code 2): Same MLP but better tuned (lr, layers, hidden size, patience)
  Why Model4 helped: Small tuning often lifts F1/AUC without making the net bigger.

27 of 52

MLP with Class Weights (Latest Model)

Consistent and balanced: Accuracy ~0.673 with Precision ≈ Recall ≈ 0.64 and F1 ≈ 0.638 across folds; from this we can say there is no obvious overfitting.

Strong separation, conservative alerts: ROC-AUC ~0.884 and Specificity ~0.894 show solid ranking and a low false-positive rate.

Limitation: performance plateaus around F1 ≈ 0.64.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights to handle imbalance

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.6724     0.6385      0.6455   0.8936        0.6381     0.8851
MLP-Test    0.6729     0.6381      0.6446   0.8938        0.6378     0.8840

28 of 52

MLP with Class Weights (Latest Model)

Consistent and balanced: Accuracy ~0.60 with Precision ≈ Recall ≈ 0.55 and F1 ≈ 0.55 across folds; from this we can say there is no obvious overfitting, but overall performance is not good.

Strong separation, conservative alerts: ROC-AUC ~0.884 and Specificity ~0.87 show solid ranking and low false-positive rate.

Limitation: performance plateaus around F1 ≈ 0.55, with low accuracy of ~0.60.

10-fold rolling-window validation method; stability period from 2022 to 2024.

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights to handle imbalance

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Val     0.6002     0.5521      0.5495   0.8669        0.5504     0.8200
MLP-Test    0.5109     0.4910      0.4675   0.8485        0.4570     0.7870

29 of 52

Fold 1:

  • Train: Months 1, 2
  • Validate: Month 3
  • Test: Month 4

Fold 2: (Slide the window forward by 1 month)

  • Train: Months 2, 3
  • Validate: Month 4
  • Test: Month 5

Fold 3: (Slide the window forward by 1 month)

  • Train: Months 3, 4
  • Validate: Month 5
  • Test: Month 6
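
A sketch of generating these rolling monthly folds, assuming "Rx RxEntered Date" is the ordering timestamp; 2 months train → 1 month validate → 1 month test, sliding forward by 1 month per fold.

```python
import pandas as pd

def rolling_month_folds(df: pd.DataFrame, n_folds: int = 3):
    month = pd.to_datetime(df["Rx RxEntered Date"]).dt.to_period("M")
    months = sorted(month.unique())              # needs at least n_folds + 3 months
    for i in range(n_folds):
        train_m, val_m, test_m = months[i:i + 2], months[i + 2], months[i + 3]
        yield (df[month.isin(train_m)],          # e.g. months 1-2
               df[month == val_m],               # month 3
               df[month == test_m])              # month 4

for k, (train_df, val_df, test_df) in enumerate(rolling_month_folds(df), start=1):
    print(f"Fold {k}: train={len(train_df)}, val={len(val_df)}, test={len(test_df)}")
```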

30 of 52

Task 1: Train the model over a specific time period, before the pipe or after.

31 of 52

Data Extraction & Turnaround-Time (TAT) Feature Engineering

In the stability period 2022 to 2023

*target variable

The data consisted of 29,367 rows, covering roughly 1.2 years of data (2022 to 2023). Derived two analytical columns to characterize workflow efficiency and anomalies:

0 – Low: Under 38 hrs

1 – High: More than 38 hrs

2 – Incomplete Workflow: Missing Ph_verification

Task 2: Recalculate the features after splitting the dataset, to prevent data leakage from training into testing (≤ t).

32 of 52

MLP-Model

Stability Period From 2022 to 2023 (Before outlier)

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights

Task 3: 3 months for training and 1 month (fixed) for testing

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.7499     0.7351      0.7324   0.8677        0.7290     0.8843
MLP-Test    0.5952     0.5129      0.5868   0.7908        0.5208     0.6943

33 of 52

MLP-Model

Stability Period From 2022 to 2023 (Before outlier)

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights

Task 3: 4 months for training and 1 month (fixed) for testing

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.7355     0.7265      0.7411   0.8654        0.7319     0.8803
MLP-Test    0.6384     0.6725      0.6467   0.8112        0.6543     0.8214

34 of 52

MLP-Model

Stability Period From 2022 to 2023 (Before outlier)

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights

Task 3: 6 months for training and 1 month (fixed) for testing

At the time of entry (Focal Loss with class weight)

Model       Accuracy   Precision   Recall   Specificity   F1 Score   ROC AUC
MLP-Train   0.7248     0.7470      0.6999   0.8474        0.7118     0.8832
MLP-Test    0.6608     0.6997      0.6509   0.7932        0.6711     0.8120

35 of 52

Shap-Diagram

36 of 52

37 of 52

Added Features:

1. Daily_Avg_RxEntered

  • For each prescription, we count how many prescriptions were entered on that same calendar day.
  • Same value for all rows with the same entry date.
  • Example: If 320 prescriptions were entered on 2025-01-10, then Daily_Avg_RxEntered = 320 for every Rx with that entry date.
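
A one-line pandas sketch of this feature; "Rx RxEntered Date" is the assumed entry-timestamp column.

```python
import pandas as pd

# Every Rx entered on the same calendar day gets that day's total entry count.
entry_date = pd.to_datetime(df["Rx RxEntered Date"]).dt.date
df["Daily_Avg_RxEntered"] = df.groupby(entry_date)["Rx RxEntered Date"].transform("size")
```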

38 of 52

Added Features: Staff-Based Trend Features

2. Trend-Based Staff Efficiency Lag Features (1d, 7d, 30d)

  • Content:
  • For each pharmacist, we compute:
    • SE_1d = average Rx/day verified in the last 1 day
    • SE_7d = average Rx/day verified in the last 7 days
    • SE_30d = average Rx/day verified in the last 30 days
    • SE_overall = average Rx/day verified over the pharmacist’s entire history
  • We then define trend-based lag features as ratios:
    • Lag_1d = SE_1d / SE_overall
    • Lag_7d = SE_7d / SE_overall
    • Lag_30d = SE_30d / SE_overall
  • Interpretation:
    • Value ≈ 1 → current performance is normal vs long-term
    • Value > 1 → current performance is better / faster than usual
    • Value < 1 → current performance is worse / slower than usual
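
A sketch of these lag ratios; the pharmacist identifier column ("PharmacistID") and the use of row-wise rolling windows over daily counts are assumptions.

```python
import pandas as pd

ver = df.dropna(subset=["Rx PhVerifDate"]).copy()
ver["ver_date"] = pd.to_datetime(ver["Rx PhVerifDate"]).dt.normalize()

# Daily verified counts per pharmacist, then rolling means over 1/7/30 days.
daily = (ver.groupby(["PharmacistID", "ver_date"]).size()
            .rename("rx_per_day").reset_index().sort_values("ver_date"))
g = daily.groupby("PharmacistID", group_keys=False)
daily["SE_1d"] = g["rx_per_day"].transform(lambda s: s.rolling(1).mean())
daily["SE_7d"] = g["rx_per_day"].transform(lambda s: s.rolling(7, min_periods=1).mean())
daily["SE_30d"] = g["rx_per_day"].transform(lambda s: s.rolling(30, min_periods=1).mean())
daily["SE_overall"] = g["rx_per_day"].transform("mean")

# Ratios: ~1 normal, >1 faster than usual, <1 slower than usual.
for w in ("1d", "7d", "30d"):
    daily[f"Lag_{w}"] = daily[f"SE_{w}"] / daily["SE_overall"]
```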

39 of 52

Added Features: Time-Based Trend Features

4. Delay from Entry to Verification (7d vs Overall)

  • We first compute how many hours it takes for each prescription to go from Rx RxEntered Date → Rx PhVerifDate.
  • We keep only reasonable delays (0 to 7 days) and ignore missing verification dates.
  • For each day, we calculate the average delay, then:
    • Overall average delay across all days
    • 7-day rolling average delay (last 7 days)
  • Our feature is: Delay_EntryToVerif_7d_vsOverall = (7-day avg delay) / (overall avg delay)

5. Daily Volume Trend (7d vs Overall)

  • For each day, we count how many prescriptions were entered (Daily_RxCount).
  • We then compute:
    • Overall average daily volume
    • 7-day rolling average daily volume (last 7 days)
  • Our feature is: DailyVolume_7d_vsOverall = (7-day avg volume) / (overall avg volume)
  • Interpretation for both of these:
    • ≈ 1 → this week’s volume is normal
    • > 1 → this week is busier than usual
    • < 1 → this week is quieter than usual
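
A sketch of both time-based trend ratios, aggregating by entry date; column names follow the slides and the 0-7 day delay filter mentioned above.

```python
import pandas as pd

df["entry_date"] = pd.to_datetime(df["Rx RxEntered Date"]).dt.normalize()
delay_h = (pd.to_datetime(df["Rx PhVerifDate"]) - pd.to_datetime(df["Rx RxEntered Date"])
          ).dt.total_seconds() / 3600
delay_h = delay_h.where((delay_h >= 0) & (delay_h <= 7 * 24))   # keep 0-7 day delays

daily = pd.DataFrame({
    "avg_delay": delay_h.groupby(df["entry_date"]).mean(),
    "rx_count":  df.groupby("entry_date").size(),
}).sort_index()

daily["Delay_EntryToVerif_7d_vsOverall"] = (
    daily["avg_delay"].rolling(7, min_periods=1).mean() / daily["avg_delay"].mean()
)
daily["DailyVolume_7d_vsOverall"] = (
    daily["rx_count"].rolling(7, min_periods=1).mean() / daily["rx_count"].mean()
)
```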

40 of 52

MLP-Model (Class weight + Focal loss)

My rule of thumb is a 70% training / 15% validation / 15% testing split.

Loss: Focal Loss + class weights

Task 3: 3 months for training and 1 month (fixed) for testing

From the model's performance there is minor overfitting between train and val/test; if that is addressed, the overall model performance might be better than the present one.

41 of 52

Each class performance:

42 of 52

Shap-Diagram

43 of 52

Why do staff-based features appear in the SHAP plot?

  • In our data, TAT mostly follows staff pressure.
  • When staff are efficient / not overloaded → TAT is usually short (good).
  • When staff are less efficient / overloaded → TAT is usually long (expected).
  • We created many staff features:
  • Staff efficiency and workload over 1 day, 7 days, and 30 days
  • Plus versions like “vs overall average”, lags, and rolling stats
  • Because these staff features strongly explain how TAT normally behaves, the model uses them a lot, so they show up in the Top-10 SHAP features.

44 of 52

45 of 52

46 of 52

Approach:

  • Previously, we created a lot of staff features.
  • Many of these staff features are very similar to each other (they move almost the same over time).
  • Even if they are not >0.8 correlated with the target, some of them are >0.8 correlated with each other, so they are basically duplicates.
  • If two staff features are >0.8 correlated with each other, we keep one and drop the other.
  • All non-staff features stay as they are.
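
A sketch of this pruning step; staff_cols is assumed to list the staff-derived feature columns.

```python
import numpy as np
import pandas as pd

# Drop one feature from every pair of staff features that are >0.8 correlated.
corr = df[staff_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle only
to_drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
df = df.drop(columns=to_drop)
print("Dropped near-duplicate staff features:", to_drop)
```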

47 of 52

Shap-Diagram

48 of 52

After removing features that are >0.8 correlated with each other.

After

Before:

49 of 52

50 of 52

Update

  • Time period:

Window 1: 2022-01-03 to 2023-02-27

Window 2: 2023-06-01 to 2024-12-31

Total number of prescriptions:

64551

51 of 52

The model trained for all 100 epochs as validation metrics showed continuous small improvements, with final test accuracy of 68% exceeding validation performance, confirming good generalization to unseen data.

52 of 52

Each class result: