Identifying Hidden Subgroups
of Chronic Kidney Disease Patients
Using Unsupervised Learning
Natalie Kam · BSTA 142
Dataset: UCI CKD Dataset — 400 patients, 24 clinical features
1
Why Does This Matter?
850M+
People affected
by CKD globally
10%
of global population
has CKD
~$50B
Annual US cost
of CKD care
Health Question:
Do CKD patients form clinically distinct subgroups with different disease profiles beyond the binary CKD / not-CKD label?
The Clinical Gap
Why Unsupervised Learning?
› CKD is staged 1–5, but staging uses only creatinine & GFR — ignoring anemia, glucose, blood pressure, and other markers
› Patients with the same stage can have very different comorbidity profiles (hypertension, diabetes, coronary artery disease)
› Personalized treatment requires understanding which patients cluster together, not just a single severity number
› Unsupervised learning discovers natural groupings from data without imposing clinical assumptions
› Clustering can reveal subtypes that predict treatment response, hospitalization risk, or mortality
› Finding these groups from routine lab data could guide targeted interventions at scale
2
The Dataset
UCI Chronic Kidney Disease Dataset (Rubini et al., 2015)
400
Patients
24
Features
11
Numeric
Features
13
Categorical
Features
~15
PCs for
90% Variance
Key Features Used
| Feature | Type | Clinical Meaning |
| --- | --- | --- |
| Hemoglobin (hemo) | Numeric | Oxygen-carrying capacity → anemia indicator |
| Serum Creatinine (sc) | Numeric | Kidney filtration efficiency |
| Blood Glucose (bgr) | Numeric | Diabetes involvement |
| Blood Pressure (bp) | Numeric | Hypertension marker |
| Sodium / Potassium | Numeric | Electrolyte balance |
| rbc / pc / pcc | Categorical | Urinalysis findings |
| htn / dm / cad | Categorical | Comorbidity flags |
3
Preprocessing Pipeline
1
Type Standardisation
Strip whitespace from string fields. Convert sg / al / su to float. Coerce the remaining numeric columns with pd.to_numeric(), turning unparseable entries (e.g. '?') into NaN.
2
Encoding Categorical Variables
Binary nominals (rbc, pc, htn, dm, cad, pe, ane, appet, pcc, ba) mapped to 0/1 integers using a lookup dictionary — no arbitrary ordinal assumption.
3
Missing Value Imputation
Numeric columns → median imputation (robust to outliers). Encoded binary columns → mode imputation. ~10% missingness per feature; 579 total missing cells.
4
Feature Scaling
StandardScaler (z-score) applied to all 24 features. Essential for K-Means which is distance-based — without scaling, high-variance features dominate cluster assignments.
⚠ Target label ('class': ckd/notckd) was held out entirely — not used in any clustering step. Used only post-hoc to evaluate cluster alignment.
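The four steps above can be sketched in pandas/scikit-learn. This is a minimal toy version — the column names follow the UCI dataset, but the values and the tiny 4-row table are illustrative stand-ins, not real patient records:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the raw UCI table (illustrative values, not real patients)
raw = pd.DataFrame({
    "sc":   ["1.2", " 3.4", "?", "0.9"],   # serum creatinine as strings
    "hemo": [15.4, np.nan, 9.8, 13.0],     # hemoglobin with a missing cell
    "htn":  ["yes", "no", " yes", None],   # binary nominal comorbidity flag
})

# 1. Type standardisation: strip whitespace, coerce numerics (errors -> NaN)
raw["sc"] = pd.to_numeric(raw["sc"].str.strip(), errors="coerce")

# 2. Encode binary nominals with an explicit lookup (no ordinal assumption)
raw["htn"] = raw["htn"].str.strip().map({"yes": 1, "no": 0})

# 3. Impute: median for numeric columns, mode for encoded binaries
for col in ["sc", "hemo"]:
    raw[col] = raw[col].fillna(raw[col].median())
raw["htn"] = raw["htn"].fillna(raw["htn"].mode()[0])

# 4. Z-score all features so no single scale dominates K-Means distances
X = StandardScaler().fit_transform(raw)
print(X.shape)  # one row per patient, one column per feature
```

The target label never enters this pipeline; it is only read back after clustering for the post-hoc ARI check.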
4
Why K-Means + PCA?
K-Means Clustering
✓ Interpretable: each patient belongs to exactly one cluster with a clear centroid profile
✓ Scales well to 400 patients and 24 features
✓ Centroid coordinates are directly readable as clinical feature averages
✓ Widely used in medical subgrouping literature — results are communicable to clinicians
✓ K chosen objectively via Elbow method + Silhouette score, not arbitrary
PCA — Dimensionality Reduction
✓ 24 features cannot be visualised directly — PCA collapses to 2D for scatter plots
✓ Preserves maximum variance in fewest dimensions (~37% in first 2 PCs; ~15 components needed for 90%)
✓ Used only for visualisation, not for clustering — clustering runs on all 24 features
✓ Explained variance plot lets audience assess information retained
✓ Standard, transparent method — avoids the 'black box' criticism of t-SNE/UMAP
Alternatives Considered:
DBSCAN (no fixed k, but sensitive to epsilon); Hierarchical (good for dendrograms, less scalable); GMM (softer assignments, harder to interpret for clinicians)
5
How Much Information Does PCA Capture?
Blue bars: Each bar = how much one component contributes on its own. PC1 alone explains ~29% — the biggest single source of variation.
Red line: Running total. After 2 components we're at ~37%. You need ~15 components to reach 90%.
Why it matters: Our PCA scatter plot only uses 2 components (37% of info). It looks clean, but the real data is more complex — this is an important limitation.
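The bars and the running total above come straight from PCA's explained-variance ratios. A minimal sketch — a random Gaussian matrix stands in for the scaled 400 × 24 patient matrix, so the printed percentages will differ from the deck's:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled 400 x 24 patient matrix
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(400, 24)))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_   # blue bars: per-component share
cumulative = np.cumsum(ratios)           # red line: running total

# How many components are needed to retain 90% of the variance?
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(f"PC1 alone: {ratios[0]:.1%}, first two: {cumulative[1]:.1%}, "
      f"{n_90} components for 90%")
```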
6
Selecting the Optimal Number of Clusters
Decision: k = 2
Highest silhouette score (0.257) at k=2. Score oscillates at k=3–8 (range 0.196–0.233) with no stable improvement. k=2 chosen as most parsimonious: simplest structure with strongest cohesion.
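The elbow + silhouette sweep behind this decision can be sketched as follows. Two synthetic blobs stand in for the scaled patient matrix (so, unlike the real data, the silhouette here will be high):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled patient matrix (illustrative only)
X, _ = make_blobs(n_samples=400, n_features=24, centers=2, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                       # elbow curve input
    silhouettes[k] = silhouette_score(X, km.labels_)

# Pick the k with the strongest cohesion/separation trade-off
best_k = max(silhouettes, key=silhouettes.get)
print(best_k, round(silhouettes[best_k], 3))
```

Inertia always decreases as k grows, which is why the elbow is read jointly with the silhouette rather than on its own.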
7
Two Distinct Patient Subgroups Emerge
Cluster 0 — n = 183 patients (46%)
⚠ CKD dominant: 100% CKD cases
| Feature | Mean Value | Interpretation |
| --- | --- | --- |
| Age | — | CKD-dominant group |
| Hemoglobin | Low | Anemia present |
| Creatinine | High | Impaired filtration |
| Blood Urea | High | Elevated kidney burden |
| Glucose | Elevated | Diabetic involvement |
| Sodium | Low | Mild hyponatremia |
Cluster 1 — n = 217 patients (54%)
◑ Mixed: 31% CKD, 69% not-CKD
| Feature | Mean Value | Interpretation |
| --- | --- | --- |
| CKD% | 31% | Majority not-CKD |
| Hemoglobin | Higher | Better oxygen capacity |
| Creatinine | Lower | Better kidney filtration |
| Blood Urea | Lower | Less kidney burden |
| Glucose | Lower | Closer to normal range |
| Sodium | Higher | More normal electrolytes |
Silhouette Score (k=2) = 0.257 · Adjusted Rand Index vs. true label = 0.441 · n_init = 50 (stable solution)
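The final fit and both validation numbers follow the same recipe. A minimal sketch — synthetic blobs play the role of the CKD / not-CKD structure, so the printed scores will be higher than the deck's 0.257 / 0.441:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the scaled patient matrix and held-out label
X, y_true = make_blobs(n_samples=400, n_features=24, centers=2, random_state=0)

# n_init=50 restarts guard against a bad random initialisation
km = KMeans(n_clusters=2, n_init=50, random_state=0).fit(X)

sil = silhouette_score(X, km.labels_)
# ARI compares clusters to the held-out label strictly post hoc
ari = adjusted_rand_score(y_true, km.labels_)
print(round(sil, 3), round(ari, 3))
```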
8
PCA Projection — Clusters vs. True Labels
LEFT: K-Means separates two groups in 2D PCA space. Cluster 0 (blue/CKD) fans right along PC1; Cluster 1 (red/not-CKD) forms a tight cluster on the far left.
RIGHT: True labels confirm the compact not-CKD group (gray) vs. spread-out CKD patients (red) — validating that K-Means captured genuine clinical structure without using labels.
9
Feature Distributions Confirm Clinical Separation
Hemoglobin: C1 (red/mixed) peaks ~15 g/dL; C0 (blue/CKD) also spans ~13–15 g/dL but has a long left tail down to ~5 — showing anemia in the sickest patients
Creatinine: C0 (blue/CKD) is concentrated at low values on this wide axis but has extreme outliers to 70+ mg/dL — severely impaired kidney filtration in some patients
Glucose: Both clusters peak around 100 mg/dL but C0 (blue/CKD) has a longer right tail to 500 — consistent with diabetic involvement in the CKD cluster
Blood Pressure: Both clusters strongly overlap at 60–90 mmHg — BP alone cannot distinguish CKD from non-CKD patients
10
Cluster Profiles at a Glance
Reading the heatmap:
Red cells: = above-average value for that feature in that cluster (e.g., Cluster 0 has high creatinine & blood urea)
Blue cells: = below-average value (e.g., Cluster 1 has low creatinine and high hemoglobin relative to average)
Raw values: annotated inside each cell — z-scores used only for visual comparison, not for clinical thresholds
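The heatmap input is just the per-cluster feature means expressed as z-scores. A minimal sketch with two hypothetical features (the bimodal toy values are illustrative, not real patient data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the cleaned patient table (hypothetical values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "hemo": np.r_[rng.normal(10, 1.5, 200), rng.normal(15, 1.0, 200)],
    "sc":   np.r_[rng.normal(6, 2.0, 200),  rng.normal(1, 0.3, 200)],
})

X = StandardScaler().fit_transform(df)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Heatmap input: per-cluster feature means as z-scores, so positive (red)
# cells sit above the cohort average and negative (blue) cells below it
profile = pd.DataFrame(X, columns=df.columns).groupby(labels).mean().round(2)
print(profile)
```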
11
Validation, Assumptions & Weaknesses
0.257
Silhouette
Score
Higher = better
Max = 1.0
0.441
Adjusted
Rand Index
Agreement with
true CKD label
n=50
Random
Restarts
Stable, converged
solution
Missing Data (~10% per feature)
Median/mode imputation may mask true biological variation. Some features (sodium, WBC) had up to 13% missing.
K-Means Assumes Spherical Clusters
Real patient clusters may be elongated or non-convex. GMM or DBSCAN could reveal different structure.
Small Dataset (n=400)
Results may not generalise. A dataset of 10,000+ patients from multiple hospitals is needed for clinical validation.
2D PCA Loses ~63% of Variance
PC1+PC2 explain only ~37% of variance; the 90% threshold requires ~15 components. Visual separation looks clean, but the full-dimensional structure is more complex.
ARI Uses True Labels Post-Hoc
We compared to labels after clustering — this is not data leakage, but results should not be used to claim supervised accuracy.
Alternative Explanation
The two groups may simply reflect CKD severity, not distinct clinical subtypes. Higher k (3-4) with more data could reveal finer subgroups.
12
Could Other Methods Give Cleaner Clusters?
Our ARI of 0.441 shows partial alignment — here's what alternative approaches could offer:
Gaussian Mixture
Models (GMM)
✓ K-Means forces hard boundaries — every patient belongs to exactly one cluster. GMM allows soft assignments, so a patient on the border gets a probability of belonging to each group. Better for overlapping clinical profiles.
✗ Harder to explain to clinicians. Assumes data follows a Gaussian distribution which may not hold.
DBSCAN
✓ Doesn't require you to pick k in advance. Finds clusters based on density — so tightly packed patient groups are detected automatically. Also labels outliers explicitly rather than forcing them into a cluster.
✗ Very sensitive to the epsilon parameter. Struggles with clusters of different densities, which is common in clinical data.
Hierarchical
Clustering
✓ Builds a tree (dendrogram) showing how patients merge into groups at different scales. You can cut the tree at any level — useful for exploring whether k=3 or k=4 reveals meaningful CKD subtypes.
✗ Doesn't scale well to large datasets. Merges are permanent — a bad early merge can't be undone.
t-SNE / UMAP
(Visualisation)
✓ Better at preserving local structure for visualisation than PCA. Would likely show tighter, more separated point clouds in 2D — making the clusters look visually cleaner and easier to interpret.
✗ Not a clustering method — only for visualisation. Results change with random seed and parameters, making them hard to reproduce.
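The soft-assignment idea that distinguishes GMM from K-Means is easy to see in code. A minimal sketch on synthetic stand-in data (illustrative only, not the patient matrix):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled patient matrix (illustrative only)
X, _ = make_blobs(n_samples=400, n_features=24, centers=2, random_state=0)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Unlike K-Means' hard labels, each patient gets a membership probability
# per component — borderline patients show probabilities near 0.5
probs = gmm.predict_proba(X)
print(probs.shape, probs[0].round(3))
```

Each row of `probs` sums to 1, which is exactly the "soft boundary" property described above.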
13
KEY TAKEAWAY
What Did We Find?
We gave the algorithm 400 patients' blood test results and said "find me groups" — without ever telling it who had CKD. It found two groups with meaningful differences in lab profiles (silhouette = 0.257), with moderate alignment to true diagnoses (ARI = 0.441).
1
Cluster 0 captured the sickest patients
183 patients, 100% CKD — low hemoglobin, high creatinine, high blood urea. Lab values consistently point to impaired kidney function and anemia.
2
Cluster 1 was a mixed group
217 patients, only 31% CKD — better lab profiles overall. This tells us the algorithm found a real severity gradient, not just sick vs. healthy.
3
ARI of 0.441 — meaningful but not perfect
Moderate agreement with true labels. The clusters capture clinical structure but don't map 1-to-1 with diagnosis — which is expected and honest for unsupervised learning.
14
KEY TAKEAWAY
One Clear Message:
Unsupervised clustering recovers
the clinical CKD/not-CKD divide
from raw lab data alone —
and hints at finer subgroups within CKD patients defined
by anemia severity, glucose control, and kidney function.
Reproducible
Open dataset, public code, fixed seed — any researcher can replicate these results.
Clinically Grounded
Cluster differences align with known CKD pathophysiology (anemia, creatinine, glucose).
Honest
Limitations stated upfront: small n, imputation, spherical cluster assumption, 2D projection.
15