1 of 25


XClusters: Explainability-first Clustering

Hyunseung Hwang and Steven Whang

Data Intelligence Lab, KAIST

2 of 25

Explaining Clusters


  • Clusters may not be easy to understand even if they are accurate
  • Examples: credit card transactions, item purchases, Google search trends

C1: (20s, Shopping, Online), (20s, Car Sales, Online), (30s, Shopping, Offline), (40s, Households, Online), …

C2: (20s, Shopping, Offline), (30s, Car Sales, Offline), (40s, Shopping, Offline), (40s, Car Sales, Online), …

3 of 25

Explaining Clusters


  • Clusters may not be easy to understand even if they are accurate
  • Examples: credit card transactions, item purchases, Google search trends

C1: (20s, Shopping, Online), (20s, Car Sales, Online), (30s, Shopping, Offline), (40s, Households, Online), …

C2: (20s, Shopping, Offline), (30s, Car Sales, Offline), (40s, Shopping, Offline), (40s, Car Sales, Online), …

[Figure: decision tree whose internal nodes test Online (True/False) and the purchase category (Car Sales, Shopping), with leaves labeled C1 and C2]

Decision trees provide succinct summaries
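
To make this concrete, here is a minimal sketch of summarizing precomputed cluster labels with a decision tree in scikit-learn. The records and the C1/C2 labels are the illustrative ones from this slide, and the learned splits need not match the figure exactly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative records from the slide: (age group, purchase category, channel)
records = pd.DataFrame(
    [("20s", "Shopping", "Online"),  ("20s", "Car Sales", "Online"),
     ("30s", "Shopping", "Offline"), ("40s", "Households", "Online"),
     ("20s", "Shopping", "Offline"), ("30s", "Car Sales", "Offline"),
     ("40s", "Shopping", "Offline"), ("40s", "Car Sales", "Online")],
    columns=["age", "category", "channel"])
labels = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"]  # given cluster assignments

# One-hot encode the categorical features and fit a shallow tree as a summary
X = pd.get_dummies(records)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=list(X.columns)))  # human-readable rules
```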

4 of 25

Explainability-first Clustering


Explainable Clustering

  • Given a fixed set of clusters, fit a decision tree
  • Considers only cluster quality (distortion)

5 of 25

Explainability-first Clustering


Explainable Clustering

  • Given a fixed set of clusters, fit a decision tree
  • Considers only cluster quality (distortion)

[Figure: decision tree with a root splitting on Online/Offline and children splitting on age groups (20s, 30s, 40s)]

6 of 25

Explainability-first Clustering

Explainable Clustering

  • Given a fixed set of clusters, fit a decision tree
  • Considers only cluster quality (distortion)

[Figure: decision tree with a root splitting on Online/Offline and children splitting on age groups (20s, 30s, 40s)]

Explainability-first Clustering

  • Fit the tree and the clustering together
  • Balance cluster quality (distortion) and explainability (tree size)

[Figure: smaller decision tree whose root splits only on Online/Offline]

7 of 25

Two Types of Features

  • Accuracy Features: used in the original clustering
    • Ex) time-series trends
  • Explainability Features: used for decision tree training
    • Ex) demographics
  • The two sets may overlap


8 of 25

Extend Clustering

  • Assumption: the clustering algorithm can adjust the number of clusters k and uses a distance function for its clustering
  • A-dist: original distance using accuracy features
  • E-dist: another distance using explainability features
    • Clustering with E-dist results in clusters with similar explainability features, which in turn results in a smaller tree
  • New distance function = (1 − α) × A-dist + α × E-dist (sketched below)
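
A rough illustration of the blended distance; the function and variable names are ours, and plain Euclidean distance on each feature set is an assumption (any distance suitable for the clustering algorithm could be plugged in).

```python
import numpy as np

def blended_distance(x_acc, y_acc, x_exp, y_exp, alpha):
    """Blend A-dist (accuracy features) and E-dist (explainability features).

    alpha = 0 reduces to the original A-dist; alpha = 1 clusters purely on the
    explainability features.
    """
    a_dist = np.linalg.norm(np.asarray(x_acc) - np.asarray(y_acc))  # A-dist
    e_dist = np.linalg.norm(np.asarray(x_exp) - np.asarray(y_exp))  # E-dist
    return (1 - alpha) * a_dist + alpha * e_dist
```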


9 of 25

Holistic Optimization

  • Minimize cluster distortion (D) + tree size (N), as sketched after this list
    • Cluster distortion: sum of squared distances between each example and its clustroid
    • Tree size: number of nodes
  • Parameters
    • Number of clusters k
    • Ratio α between A-dist and E-dist
  • Iterating through all combinations of (k, α) is expensive
  • We instead identify and utilize monotonicity properties
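
A minimal sketch of evaluating the objective for one (k, α) setting, assuming k-means on blended features and a scikit-learn decision tree fit on the explainability features. The feature re-weighting, the centroid-based distortion, and all names here are illustrative simplifications, not the paper's exact procedure; λ defaults to 1 as in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def objective(X_acc, X_exp, k, alpha, lam=1.0):
    """Return D(k, alpha) + lam * N(k, alpha) for one parameter setting."""
    # Approximate the blended distance by re-weighting the two feature blocks
    # (squared Euclidean on these blended features blends the squared distances).
    X_blend = np.hstack([np.sqrt(1 - alpha) * X_acc, np.sqrt(alpha) * X_exp])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blend)

    # D: cluster distortion measured on the accuracy features
    # (the cluster mean stands in for the clustroid used on the slide).
    D = sum(np.sum((X_acc[labels == c] - X_acc[labels == c].mean(axis=0)) ** 2)
            for c in range(k))

    # N: size (node count) of the decision tree that explains the clusters
    tree = DecisionTreeClassifier(random_state=0).fit(X_exp, labels)
    N = tree.tree_.node_count
    return D + lam * N
```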


10 of 25

Monotonic Properties

  • We observe monotonic relationships between k, α and N, D
  • Not guaranteed to hold perfectly, but we exploit them as much as possible


[Figure: example clusterings with clusters c1, c2, c3 illustrating the two trends]

As k increases, N increases and D decreases.

As α increases, N decreases and D increases.

11 of 25

Monotonic Optimization

  • Reformulate our problem as monotonic optimization
    • Changing k and α has opposite effects on D and N
    • Rewrite the objective as minimizing F(k,α) - G(k,α)
      • F: captures the decrease of D and N as α and k increase, respectively
      • G: captures the decrease of D and N as α and k decrease, respectively

  • This enables us to use an existing branch-reduce-and-bound algorithm
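
To see why such a split is possible, here is one standard construction of a difference of increasing functions, written as a worked equation; this is only an illustration under the monotonicity assumptions above, and the paper's exact F and G may differ.

```latex
% Objective: h(k,\alpha) = D(k,\alpha) + \lambda N(k,\alpha), where D decreases
% in k and increases in \alpha, while N increases in k and decreases in \alpha.
% For a constant M large enough to dominate those decreases, write
h(k,\alpha)
  = \underbrace{D(k,\alpha) + \lambda N(k,\alpha) + M\,(k+\alpha)}_{F(k,\alpha)\ \text{increasing in }(k,\alpha)}
  \; - \; \underbrace{M\,(k+\alpha)}_{G(k,\alpha)\ \text{increasing in }(k,\alpha)}
```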


12 of 25

Branch-Reduce-and-Bound

  1. Compute lower and upper bounds of the objective for each block
  2. Divide blocks into halves and compute their bounds
  3. Discard blocks whose (lower bound + tolerance) exceeds the smallest upper bound
  4. Terminate when no blocks are left to prune (a sketch of the loop follows below)
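
A minimal sketch of this loop over (k, α) blocks, assuming `lower_bound` and `upper_bound` callables that bound the objective on a block (e.g., using the formulas on the next slide); the block representation, the splitting rule, and the stopping thresholds are illustrative choices.

```python
def branch_reduce_bound(k_range, a_range, lower_bound, upper_bound, eps=0.1):
    """Sketch: a block is ((k_lo, k_hi), (a_lo, a_hi)); returns the best upper bound found."""
    blocks = [(k_range, a_range)]
    best_ub = float("inf")
    while blocks:
        # Branch: split every remaining block along its wider side
        halves = []
        for (k_lo, k_hi), (a_lo, a_hi) in blocks:
            if (k_hi - k_lo) >= (a_hi - a_lo) and (k_hi - k_lo) > 1:
                mid = (k_lo + k_hi) // 2
                halves += [((k_lo, mid), (a_lo, a_hi)), ((mid, k_hi), (a_lo, a_hi))]
            else:
                mid = (a_lo + a_hi) / 2
                halves += [((k_lo, k_hi), (a_lo, mid)), ((k_lo, k_hi), (mid, a_hi))]

        # Bound: compute bounds of the halves and track the smallest upper bound
        bounds = [(b, lower_bound(b), upper_bound(b)) for b in halves]
        best_ub = min([best_ub] + [ub for _, _, ub in bounds])

        # Reduce: discard blocks whose (lower bound + tolerance) exceeds best_ub
        kept = [b for b, lb, _ in bounds if lb + eps <= best_ub]

        # Terminate: keep only blocks that can still be split further
        blocks = [((kl, kh), (al, ah)) for (kl, kh), (al, ah) in kept
                  if (kh - kl) > 1 or (ah - al) > 0.05]
    return best_ub
```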

[Figure: the (k, α) space with k ∈ [1, 8] and α ∈ [0, 1], partitioned into blocks B1–B5]

13 of 25

Computing Upper, Lower Bounds

  • Lower bound
    • On a block [k1, k2] × [α1, α2], the monotonicity properties give:
      F(k,α) - G(k,α) = D(k,α) + λN(k,α)
      ≥ D(k2,α) + λN(k1,α)
      ≥ D(k2,α1) + λN(k1,α2)
  • Upper bound
    • Since we minimize, the objective at any point in the block is an upper bound
    • We use min{D(k1,α2) + λN(k1,α2), D(k2,α1) + λN(k2,α1)} (sketched below)
  • See our paper for the overall algorithm complexity
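
A small sketch of these bound formulas, assuming D and N are callables (e.g., cached distortion and tree-size evaluations) and lam is the weight λ from the objective; the names are ours.

```python
def block_bounds(D, N, block, lam=1.0):
    """Bounds of D + lam*N on block [k1, k2] x [a1, a2] under the monotonicity
    properties: D decreases in k and increases in alpha, while N increases in k
    and decreases in alpha."""
    (k1, k2), (a1, a2) = block
    # Lower bound: D is smallest at (k2, a1) and N is smallest at (k1, a2)
    lb = D(k2, a1) + lam * N(k1, a2)
    # Upper bound: the objective at any point in the block works; take the
    # better of the two corners used on the slide
    ub = min(D(k1, a2) + lam * N(k1, a2),
             D(k2, a1) + lam * N(k2, a1))
    return lb, ub
```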

[Figure: block B = [k1, k2] × [α1, α2] in the (k, α) plane, with corners (k1, α2) and (k2, α1) marked]

14 of 25

Example

  • Let tolerance ε = 0.1

Box   Lower bound   Upper bound
B2    0.3           0.4
B3    0.2           0.5

[Figure: remaining blocks B2 and B3 in the (k, α) space]

15 of 25

Example

  • Let tolerance ε = 0.1

B3's lower bound < B2's lower bound, so B3 is branched into halves B4 and B5.

Before branching:

Box   Lower bound   Upper bound
B2    0.3           0.4
B3    0.2           0.5

After branching B3:

Box   Lower bound   Upper bound
B2    0.3           0.4
B4    0.41          0.6
B5    0.5           0.7

[Figure: block B3 is divided into blocks B4 and B5]

16 of 25

Example

  • Let tolerance ε = 0.1

B3's lower bound < B2's lower bound, so B3 was branched into B4 and B5.

Box   Lower bound   Upper bound
B2    0.3           0.4
B3    0.2           0.5

Box   Lower bound   Upper bound   Pruning check
B2    0.3           0.4
B4    0.41          0.6           0.41 + ε > 0.4, discard
B5    0.5           0.7           0.5 + ε > 0.4, discard

Discard B4 and B5. B2 is the best block; return B2's upper bound.

17 of 25

Experiments

We demonstrate:

  1. How XClusters balances clustering quality and explainability
  2. How XClusters improves on existing explainable clustering


18 of 25

Experiment Settings

  • Methods compared
    • 2-Step: performs clustering, then decision tree training; sets k with the elbow method
    • Grid Search (GS): performs clustering for k ∈ [3, 11] and α ∈ {0, 0.05, 0.10, …, 1.0}
    • Bayesian Optimization (BO): performs hyperparameter tuning using 10 to 30 initial points and iterations
    • XClusters: performs monotonic optimization; λ = 1
  • Datasets
    • Credit Card dataset: 5.7B transactions amounting to $120B USD, 2,551 demographics
    • Other real-world datasets with similar results: DS4C and Contracts


19 of 25

Monotonicity Observations

  • Averaged D and N values over all α and k values
  • The monotonicity properties mostly hold for all three datasets


[Figure: plots of k versus D, k versus N, α versus D, and α versus N]

20 of 25

Explainability and Distortion Results

Our algorithm achieves accurate and explainable results with a small runtime

Dataset       Method      D+λN    D       N       Runtime (s)
Credit Card   2-Step      1.005   0.166   0.840   0.426
              GS          0.625   0.391   0.234   64.455
              BO          0.657   0.364   0.293   16.595
              XClusters   0.656   0.385   0.271   6.336

21 of 25

Explainability and Distortion Results

Our algorithm achieves accurate and explainable results with a small runtime

2-Step yields high N and low D.

Dataset       Method      D+λN    D       N       Runtime (s)
Credit Card   2-Step      1.005   0.166   0.840   0.426
              GS          0.625   0.391   0.234   64.455
              BO          0.657   0.364   0.293   16.595
              XClusters   0.656   0.385   0.271   6.336

22 of 25

Explainability and Distortion Results

XClusters achieves close to the lowest objective value with a small runtime

GS and BO have high runtimes.

Dataset       Method      D+λN    D       N       Runtime (s)
Credit Card   2-Step      1.005   0.166   0.840   0.426
              GS          0.625   0.391   0.234   64.455
              BO          0.657   0.364   0.293   16.595
              XClusters   0.656   0.385   0.271   6.336

23 of 25

Comparison with Explainable Clustering

XClusters performs better due to its holistic optimization

Dataset       Method                               D+λN    D       N
Credit Card   Explainable k-Means [1]              1.081   0.253   0.828
              XClusters with Explainable k-Means   0.708   0.565   0.143

[1] Moshkovitz et al., "Explainable k-Means and k-Medians Clustering," ICML 2020.

24 of 25

Conclusion

  • We propose the new paradigm of explainability-first clustering
    • Holistic optimization that minimizes clustering distortion and decision tree size
  • We identify and utilize monotonicity properties for efficient optimization
  • Experiments show that XClusters outperforms existing explainable clustering


25 of 25

Conclusion

  • We propose the new paradigm of explainability-first clustering
    • Holistic optimization that minimizes clustering distortion and decision tree size
  • We identify and utilize monotonicity properties for efficient optimization
  • Experiments show that XClusters outperforms existing explainable clustering


Thank you!