1 of 15

Identifying Hidden Subgroups of Chronic Kidney Disease Patients Using Unsupervised Learning

Natalie Kam · BSTA 142

Dataset: UCI CKD Dataset — 400 patients, 24 clinical features


2 of 15

Why Does This Matter?

850M+: people affected by CKD globally

10%: of the global population has CKD

~$50B: annual US cost of CKD care

Health Question:

Do CKD patients form clinically distinct subgroups with different disease profiles beyond the binary CKD / not-CKD label?

The Clinical Gap

Why Unsupervised Learning?

CKD is staged 1–5, but staging uses only creatinine & GFR — ignoring anemia, glucose, blood pressure, and other markers

Patients with the same stage can have very different comorbidity profiles (hypertension, diabetes, coronary artery disease)

Personalized treatment requires understanding which patients cluster together, not just a single severity number

Unsupervised learning discovers natural groupings from data without imposing clinical assumptions

Clustering can reveal subtypes that predict treatment response, hospitalization risk, or mortality

Finding these groups from routine lab data could guide targeted interventions at scale


3 of 15

The Dataset

UCI Chronic Kidney Disease Dataset (Rubini et al., 2015)

400 patients · 24 features (11 numeric, 13 categorical) · ~15 PCs for 90% variance

Key Features Used

Feature | Type | Clinical Meaning
Hemoglobin (hemo) | Numeric | Oxygen-carrying capacity → anemia indicator
Serum Creatinine (sc) | Numeric | Kidney filtration efficiency
Blood Glucose (bgr) | Numeric | Diabetes involvement
Blood Pressure (bp) | Numeric | Hypertension marker
Sodium / Potassium | Numeric | Electrolyte balance
rbc / pc / pcc | Categorical | Urinalysis findings
htn / dm / cad | Categorical | Comorbidity flags


4 of 15

Preprocessing Pipeline

1. Type Standardisation: Strip whitespace from string fields. Convert sg / al / su to float. Coerce numeric columns with pd.to_numeric(errors='coerce'), so invalid entries become NaN.

2. Encoding Categorical Variables: Binary nominals (rbc, pc, htn, dm, cad, pe, ane, appet, pcc, ba) mapped to 0/1 integers using a lookup dictionary, with no arbitrary ordinal assumption.

3. Missing Value Imputation: Numeric columns → median imputation (robust to outliers). Encoded binary columns → mode imputation. ~10% missingness per feature; 579 missing cells in total.

4. Feature Scaling: StandardScaler (z-score) applied to all 24 features. Essential for K-Means, which is distance-based; without scaling, high-variance features dominate cluster assignments.

⚠ Target label ('class': ckd/notckd) was held out entirely — not used in any clustering step. Used only post-hoc to evaluate cluster alignment.
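The four steps above can be sketched in pandas/scikit-learn. This is a minimal illustration, not the project's exact code: the toy frame, its column names, and the `binary_map` dictionary are assumptions standing in for the real 24-column dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with the kinds of columns found in the CKD data (values illustrative).
df = pd.DataFrame({
    "sc":  ["1.2", " 3.4", "?", "0.9"],   # numeric values stored as strings
    "htn": ["yes", "no", None, "yes"],    # binary nominal with a missing entry
})

# 1. Type standardisation: strip whitespace, coerce to numeric (invalid -> NaN).
df["sc"] = pd.to_numeric(df["sc"].str.strip(), errors="coerce")

# 2. Encode binary nominals via an explicit lookup, no ordinal assumption.
binary_map = {"yes": 1, "no": 0}
df["htn"] = df["htn"].map(binary_map)

# 3. Imputation: median for numeric columns, mode for encoded binaries.
df["sc"] = df["sc"].fillna(df["sc"].median())
df["htn"] = df["htn"].fillna(df["htn"].mode()[0])

# 4. Z-score scaling so no high-variance feature dominates K-Means distances.
X = StandardScaler().fit_transform(df)
print(np.abs(X.mean(axis=0)).max())  # each column now has mean ~0
```

The target label would be dropped from `df` before this pipeline runs, matching the held-out design described above.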


5 of 15

Why K-Means + PCA?

K-Means Clustering

Interpretable: each patient belongs to exactly one cluster with a clear centroid profile

Scales well to 400 patients and 24 features

Centroid coordinates are directly readable as (standardised) clinical feature averages

Widely used in medical subgrouping literature — results are communicable to clinicians

K chosen objectively via Elbow method + Silhouette score, not arbitrary

PCA — Dimensionality Reduction

24 features cannot be visualised directly — PCA collapses to 2D for scatter plots

Preserves maximum variance in fewest dimensions (~37% in the first 2 PCs; ~15 components needed for 90%)

Used only for visualisation, not for clustering — clustering runs on all 24 features

Explained variance plot lets audience assess information retained

Standard, transparent method — avoids the 'black box' criticism of t-SNE/UMAP

Alternatives Considered:

DBSCAN (no fixed k, but sensitive to epsilon); Hierarchical (good for dendrograms, less scalable); GMM (softer assignments, harder to interpret for clinicians)


6 of 15

How Much Information Does PCA Capture?

Blue bars: Each bar = how much one component contributes on its own. PC1 alone explains ~29% — the biggest single source of variation.

Red line: Running total. After 2 components we're at ~37%. You need ~15 components to reach 90%.

Why it matters: Our PCA scatter plot only uses 2 components (37% of info). It looks clean, but the real data is more complex — this is an important limitation.
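The bar-and-line figure can be reproduced directly from a fitted PCA object. A minimal sketch on synthetic data (the real notebook would pass in the scaled 24-feature patient matrix instead):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 24))  # stand-in for the scaled CKD feature matrix

pca = PCA().fit(X)
per_pc = pca.explained_variance_ratio_  # blue bars: each PC's share alone
cumulative = np.cumsum(per_pc)          # red line: running total

# Smallest number of components whose cumulative share reaches 90%.
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(f"PC1 alone: {per_pc[0]:.0%}, first two: {cumulative[1]:.0%}, 90% at {n_90} PCs")
```

On the real data this reports the ~29% / ~37% / ~15-component figures quoted above; on random noise the shares are flatter.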


7 of 15

Selecting the Optimal Number of Clusters

Decision: k = 2

Highest silhouette score (0.257) at k=2. Score oscillates at k=3–8 (range 0.196–0.233) with no stable improvement. k=2 chosen as most parsimonious: simplest structure with strongest cohesion.
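The selection loop behind this decision can be sketched as follows. The two-blob synthetic data (explicit centers are an assumption for illustration) stands in for the scaled patient matrix; on it, as on the CKD features, the silhouette peaks at k = 2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two well-separated blobs as a stand-in for the scaled patient matrix.
X, _ = make_blobs(n_samples=400, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

# Score each candidate k by average silhouette (cohesion vs. separation).
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The elbow plot comes from the same loop by recording each model's `inertia_` alongside the silhouette.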


8 of 15

Two Distinct Patient Subgroups Emerge

Cluster 0 — n = 183 patients (46%)

⚠ CKD dominant: 100% CKD cases

Feature | Mean Value | Interpretation
Age | | CKD-dominant group
Hemoglobin | Low | Anemia present
Creatinine | High | Impaired filtration
Blood Urea | High | Elevated kidney burden
Glucose | Elevated | Diabetic involvement
Sodium | Low | Mild hyponatremia

Cluster 1 — n = 217 patients (54%)

◑ Mixed: 31% CKD, 69% not-CKD

Feature | Mean Value | Interpretation
CKD% | 31% | Majority not-CKD
Hemoglobin | Higher | Better oxygen capacity
Creatinine | Lower | Better kidney filtration
Blood Urea | Lower | Less kidney burden
Glucose | Lower | Closer to normal range
Sodium | Higher | More normal electrolytes

Silhouette Score (k=2) = 0.257 · Adjusted Rand Index vs. true label = 0.441 · n_init = 50 (stable solution)
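These three numbers come from a short evaluation step after fitting. A sketch, again on synthetic stand-in data with a synthetic "true label" (variable names are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Stand-in for the scaled feature matrix and the held-out ckd/notckd label.
X, y_true = make_blobs(n_samples=400, centers=[[-5, -5], [5, 5]],
                       cluster_std=1.0, random_state=1)

# n_init=50 random restarts, as in the slides, for a stable converged solution.
km = KMeans(n_clusters=2, n_init=50, random_state=0).fit(X)

sil = silhouette_score(X, km.labels_)          # cohesion of the clustering itself
ari = adjusted_rand_score(y_true, km.labels_)  # post-hoc agreement with labels
print(f"silhouette={sil:.3f}, ARI={ari:.3f}")
```

Note that `y_true` enters only in the last line, after clustering is complete, mirroring the held-out design.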


9 of 15

PCA Projection — Clusters vs. True Labels

LEFT: K-Means separates two groups in 2D PCA space. Cluster 0 (blue/CKD) fans right along PC1; Cluster 1 (red/not-CKD) forms a tight cluster on the far left.

RIGHT: True labels confirm the compact not-CKD group (gray) vs. spread-out CKD patients (red) — validating that K-Means captured genuine clinical structure without using labels.


10 of 15

Feature Distributions Confirm Clinical Separation

Hemoglobin: C1 (red/mixed, healthier) peaks ~15 g/dL; C0 (blue/CKD-dominant) peaks ~13–15 g/dL but has a long left tail down to ~5 g/dL, showing anemia in the sickest patients

Creatinine: C0 (blue/CKD-dominant) is almost entirely clustered near 0 but with extreme outliers to 70+ mg/dL, reflecting severely impaired kidney filtration in some patients

Glucose: Both clusters peak around 100 mg/dL but C0 (blue/CKD-dominant) has a longer right tail to 500 mg/dL, consistent with diabetic involvement in that cluster

Blood Pressure: Both clusters strongly overlap at 60–90 mmHg; BP alone cannot distinguish CKD from non-CKD patients


11 of 15

Cluster Profiles at a Glance

Reading the heatmap:

Red cells = above-average value for that feature in that cluster (e.g., Cluster 0 has high creatinine & blood urea)

Blue cells = below-average value (e.g., Cluster 1 has low creatinine and high hemoglobin relative to average)

Raw values are annotated inside each cell; z-scores are used only for visual comparison, not for clinical thresholds
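A profile heatmap like this can be built by averaging each feature per cluster and then z-scoring each feature across the cluster means. A minimal sketch with an illustrative toy table (column names and values assumed):

```python
import numpy as np
import pandas as pd

# Toy patient table with cluster assignments (values illustrative).
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "hemo":    [9.0, 10.0, 15.0, 14.5],
    "sc":      [6.0, 7.5, 1.0, 0.9],
})

profile = df.groupby("cluster").mean()          # raw per-cluster means (cell annotations)
z = (profile - profile.mean()) / profile.std()  # z-scores drive the cell colours

print(z.round(2))
```

Passing `z` to a heatmap routine with `profile` as the annotation layer reproduces the "colour by z-score, annotate with raw value" design described above.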


12 of 15

Validation, Assumptions & Weaknesses

Silhouette Score = 0.257 (higher = better; max = 1.0)

Adjusted Rand Index = 0.441 (agreement with true CKD label)

Random restarts: n_init = 50 (stable, converged solution)

Missing Data (~10% per feature)

Median/mode imputation may mask true biological variation. Some features (sodium, WBC) had up to 13% missing.

K-Means Assumes Spherical Clusters

Real patient clusters may be elongated or non-convex. GMM or DBSCAN could reveal different structure.

Small Dataset (n=400)

Results may not generalise. A dataset of 10,000+ patients from multiple hospitals is needed for clinical validation.

2D PCA Loses ~63% of Variance

PC1+PC2 explain only ~37% of variance; the 90% threshold requires ~15 components. Visual separation looks clean, but full-dimensional structure is more complex.

ARI Uses True Labels Post-Hoc

We compared to labels after clustering — this is not data leakage, but results should not be used to claim supervised accuracy.

Alternative Explanation

The two groups may simply reflect CKD severity, not distinct clinical subtypes. Higher k (3-4) with more data could reveal finer subgroups.


13 of 15

Could Other Methods Give Cleaner Clusters?

Our ARI of 0.441 shows partial alignment — here's what alternative approaches could offer:

Gaussian Mixture Models (GMM)

K-Means forces hard boundaries — every patient belongs to exactly one cluster. GMM allows soft assignments, so a patient on the border gets a probability of belonging to each group. Better for overlapping clinical profiles.

Harder to explain to clinicians. Assumes data follows a Gaussian distribution which may not hold.

DBSCAN

Doesn't require you to pick k in advance. Finds clusters based on density — so tightly packed patient groups are detected automatically. Also labels outliers explicitly rather than forcing them into a cluster.

Very sensitive to the epsilon parameter. Struggles with clusters of different densities, which is common in clinical data.

Hierarchical Clustering

Builds a tree (dendrogram) showing how patients merge into groups at different scales. You can cut the tree at any level — useful for exploring whether k=3 or k=4 reveals meaningful CKD subtypes.

Doesn't scale well to large datasets. Merges are permanent — a bad early merge can't be undone.

t-SNE / UMAP (Visualisation)

Better at preserving local structure for visualisation than PCA. Would likely show tighter, more separated point clouds in 2D — making the clusters look visually cleaner and easier to interpret.

Not a clustering method — only for visualisation. Results change with random seed and parameters, making them hard to reproduce.
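The GMM "soft assignment" point can be made concrete: `predict_proba` returns a membership probability per group rather than a hard label, so borderline patients get intermediate probabilities. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two synthetic patient groups standing in for the scaled CKD matrix.
X, _ = make_blobs(n_samples=400, centers=[[-4, -4], [4, 4]],
                  cluster_std=1.5, random_state=0)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
proba = gmm.predict_proba(X)  # soft membership: one probability per cluster

# Each row sums to 1; points near the boundary sit away from 0/1.
print(proba[:3].round(3))
```

Thresholding these probabilities (or reporting them directly) is what would replace K-Means's hard cluster labels in a follow-up analysis.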


14 of 15

KEY TAKEAWAY

What Did We Find?

We gave the algorithm 400 patients' blood test results and said "find me groups" — without ever telling it who had CKD. It found two groups with meaningful differences in lab profiles (silhouette = 0.257), with moderate alignment to true diagnoses (ARI = 0.441).

1

Cluster 0 captured the sickest patients

183 patients, 100% CKD — low hemoglobin, high creatinine, high blood urea. Lab values consistently point to impaired kidney function and anemia.

2

Cluster 1 was a mixed group

217 patients, only 31% CKD — better lab profiles overall. This tells us the algorithm found a real severity gradient, not just sick vs. healthy.

3

ARI of 0.441 — meaningful but not perfect

Moderate agreement with true labels. The clusters capture clinical structure but don't map 1-to-1 with diagnosis — which is expected and honest for unsupervised learning.


15 of 15

KEY TAKEAWAY

One Clear Message:

Unsupervised clustering recovers the clinical CKD/not-CKD divide from raw lab data alone, and hints at finer subgroups within CKD patients defined by anemia severity, glucose control, and kidney function.

Reproducible

Open dataset, public code, fixed seed — any researcher can replicate these results.

Clinically Grounded

Cluster differences align with known CKD pathophysiology (anemia, creatinine, glucose).

Honest

Limitations stated upfront: small n, imputation, spherical cluster assumption, 2D projection.
