1 of 25


XClusters: Explainability-first Clustering

Hyunseung Hwang and Steven Whang

Data Intelligence Lab, KAIST

2 of 25

Explaining Clusters


  • Clusters may not be easy to understand even if they are accurate
  • Examples: credit card transactions, item purchases, Google search trends

C1: (20s, Shopping, Online), (20s, Car Sales, Online), (30s, Shopping, Offline), (40s, Households, Online), …

C2: (20s, Shopping, Offline), (30s, Car Sales, Offline), (40s, Shopping, Offline), (40s, Car Sales, Online), …

3 of 25

Explaining Clusters


  • Clusters may not be easy to understand even if they are accurate
  • Examples: credit card transactions, item purchases, Google search trends

C1: (20s, Shopping, Online), (20s, Car Sales, Online), (30s, Shopping, Offline), (40s, Households, Online), …

C2: (20s, Shopping, Offline), (30s, Car Sales, Offline), (40s, Shopping, Offline), (40s, Car Sales, Online), …

[Figure: decision tree whose internal nodes test Online (True/False) and the purchase category (Car Sales, Shopping), with leaves labeled C1 and C2]

Decision trees provide succinct summaries
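
To make this concrete, here is a minimal sketch of summarizing precomputed cluster labels with a decision tree in scikit-learn. The records and the C1/C2 labels are the illustrative ones from this slide, and the learned splits need not match the figure exactly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative records from the slide: (age group, purchase category, channel)
records = pd.DataFrame(
    [("20s", "Shopping", "Online"),  ("20s", "Car Sales", "Online"),
     ("30s", "Shopping", "Offline"), ("40s", "Households", "Online"),
     ("20s", "Shopping", "Offline"), ("30s", "Car Sales", "Offline"),
     ("40s", "Shopping", "Offline"), ("40s", "Car Sales", "Online")],
    columns=["age", "category", "channel"])
labels = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"]  # given cluster assignments

# One-hot encode the categorical features and fit a shallow tree as a summary
X = pd.get_dummies(records)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=list(X.columns)))  # human-readable rules
```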

4 of 25

Explainability-first Clustering


Explainable Clustering

  • Given a fixed set of clusters, fit a decision tree
  • Considers only cluster quality (distortion)

5 of 25

Explainability-first Clustering


Explainable Clustering

  • Given a fixed set of clusters, fit a decision tree
  • Considers only cluster quality (distortion)

[Figure: decision tree with a root splitting on Online/Offline and children splitting on age groups (20s, 30s, 40s)]

6 of 25

Explainability-first Clustering

Explainable Clustering

  • Given a fixed set of clusters, fit a decision tree
  • Considers only cluster quality (distortion)

[Figure: decision tree with a root splitting on Online/Offline and children splitting on age groups (20s, 30s, 40s)]

Explainability-first Clustering

  • Fit the tree and the clustering together
  • Balance cluster quality (distortion) and explainability (tree size)

[Figure: smaller decision tree whose root splits only on Online/Offline]

7 of 25

Two Types of Features

  • Accuracy Features: used in the original clustering
    • Ex) time-series trends
  • Explainability Features: used for decision tree training
    • Ex) demographics
  • The two sets may overlap


8 of 25

Extend Clustering

  • Assumption: the clustering algorithm can adjust the number of clusters k and uses a distance function for its clustering
  • A-dist: original distance using accuracy features
  • E-dist: another distance using explainability features
    • Clustering with E-dist results in clusters with similar explainability features, which in turn results in a smaller tree
  • New distance function = (1 − α) × A-dist + α × E-dist (sketched below)
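
A rough illustration of the blended distance; the function and variable names are ours, and plain Euclidean distance on each feature set is an assumption (any distance suitable for the clustering algorithm could be plugged in).

```python
import numpy as np

def blended_distance(x_acc, y_acc, x_exp, y_exp, alpha):
    """Blend A-dist (accuracy features) and E-dist (explainability features).

    alpha = 0 reduces to the original A-dist; alpha = 1 clusters purely on the
    explainability features.
    """
    a_dist = np.linalg.norm(np.asarray(x_acc) - np.asarray(y_acc))  # A-dist
    e_dist = np.linalg.norm(np.asarray(x_exp) - np.asarray(y_exp))  # E-dist
    return (1 - alpha) * a_dist + alpha * e_dist
```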


9 of 25

Holistic Optimization

  • Minimize cluster distortion (D) + tree size (N), as sketched after this list
    • Cluster distortion: sum of squared distances between each example and its clustroid
    • Tree size: number of nodes
  • Parameters
    • Number of clusters k
    • Ratio α between A-dist and E-dist
  • Iterating through all combinations of (k, α) is expensive
  • We instead identify and utilize monotonicity properties
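
A minimal sketch of evaluating the objective for one (k, α) setting, assuming k-means on blended features and a scikit-learn decision tree fit on the explainability features. The feature re-weighting, the centroid-based distortion, and all names here are illustrative simplifications, not the paper's exact procedure; λ defaults to 1 as in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def objective(X_acc, X_exp, k, alpha, lam=1.0):
    """Return D(k, alpha) + lam * N(k, alpha) for one parameter setting."""
    # Approximate the blended distance by re-weighting the two feature blocks
    # (squared Euclidean on these blended features blends the squared distances).
    X_blend = np.hstack([np.sqrt(1 - alpha) * X_acc, np.sqrt(alpha) * X_exp])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blend)

    # D: cluster distortion measured on the accuracy features
    # (the cluster mean stands in for the clustroid used on the slide).
    D = sum(np.sum((X_acc[labels == c] - X_acc[labels == c].mean(axis=0)) ** 2)
            for c in range(k))

    # N: size (node count) of the decision tree that explains the clusters
    tree = DecisionTreeClassifier(random_state=0).fit(X_exp, labels)
    N = tree.tree_.node_count
    return D + lam * N
```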


10 of 25

Monotonic Properties

  • We observe monotonic relationships between k, α and N, D
  • Not guaranteed to hold perfectly, but we exploit them as much as possible


[Figure: example clusterings with clusters c1, c2, c3 illustrating the two trends]

As k increases, N increases and D decreases.

As α increases, N decreases and D increases.

11 of 25

Monotonic Optimization

  • Reformulate our problem as monotonic optimization
    • Changing k and α has opposite effects on D and N
    • Rewrite the objective as minimizing F(k,α) - G(k,α)
      • F: captures the decrease of D and N as α and k increase, respectively
      • G: captures the decrease of D and N as α and k decrease, respectively

  • This enables us to use an existing branch-reduce-and-bound algorithm
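
To see why such a split is possible, here is one standard construction of a difference of increasing functions, written as a worked equation; this is only an illustration under the monotonicity assumptions above, and the paper's exact F and G may differ.

```latex
% Objective: h(k,\alpha) = D(k,\alpha) + \lambda N(k,\alpha), where D decreases
% in k and increases in \alpha, while N increases in k and decreases in \alpha.
% For a constant M large enough to dominate those decreases, write
h(k,\alpha)
  = \underbrace{D(k,\alpha) + \lambda N(k,\alpha) + M\,(k+\alpha)}_{F(k,\alpha)\ \text{increasing in }(k,\alpha)}
  \; - \; \underbrace{M\,(k+\alpha)}_{G(k,\alpha)\ \text{increasing in }(k,\alpha)}
```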


12 of 25

Branch-Reduce-and-Bound

  1. Compute lower and upper bounds of the objective for each block
  2. Divide blocks into halves and compute their bounds
  3. Discard blocks whose (lower bound + tolerance) exceeds the smallest upper bound
  4. Terminate when no blocks are left to prune (a sketch of the loop follows below)
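
A minimal sketch of this loop over (k, α) blocks, assuming `lower_bound` and `upper_bound` callables that bound the objective on a block (e.g., using the formulas on the next slide); the block representation, the splitting rule, and the stopping thresholds are illustrative choices.

```python
def branch_reduce_bound(k_range, a_range, lower_bound, upper_bound, eps=0.1):
    """Sketch: a block is ((k_lo, k_hi), (a_lo, a_hi)); returns the best upper bound found."""
    blocks = [(k_range, a_range)]
    best_ub = float("inf")
    while blocks:
        # Branch: split every remaining block along its wider side
        halves = []
        for (k_lo, k_hi), (a_lo, a_hi) in blocks:
            if (k_hi - k_lo) >= (a_hi - a_lo) and (k_hi - k_lo) > 1:
                mid = (k_lo + k_hi) // 2
                halves += [((k_lo, mid), (a_lo, a_hi)), ((mid, k_hi), (a_lo, a_hi))]
            else:
                mid = (a_lo + a_hi) / 2
                halves += [((k_lo, k_hi), (a_lo, mid)), ((k_lo, k_hi), (mid, a_hi))]

        # Bound: compute bounds of the halves and track the smallest upper bound
        bounds = [(b, lower_bound(b), upper_bound(b)) for b in halves]
        best_ub = min([best_ub] + [ub for _, _, ub in bounds])

        # Reduce: discard blocks whose (lower bound + tolerance) exceeds best_ub
        kept = [b for b, lb, _ in bounds if lb + eps <= best_ub]

        # Terminate: keep only blocks that can still be split further
        blocks = [((kl, kh), (al, ah)) for (kl, kh), (al, ah) in kept
                  if (kh - kl) > 1 or (ah - al) > 0.05]
    return best_ub
```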

[Figure: the (k, α) space with k ∈ [1, 8] and α ∈ [0, 1], partitioned into blocks B1–B5]

13 of 25

Computing Upper, Lower Bounds

  • Lower bound
    • On a block [k1, k2] × [α1, α2], the monotonicity properties give:
      F(k,α) - G(k,α) = D(k,α) + λN(k,α)
      ≥ D(k2,α) + λN(k1,α)
      ≥ D(k2,α1) + λN(k1,α2)
  • Upper bound
    • Since we minimize, the objective at any point in the block is an upper bound
    • We use min{D(k1,α2) + λN(k1,α2), D(k2,α1) + λN(k2,α1)} (sketched below)
  • See our paper for the overall algorithm complexity
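
A small sketch of these bound formulas, assuming D and N are callables (e.g., cached distortion and tree-size evaluations) and lam is the weight λ from the objective; the names are ours.

```python
def block_bounds(D, N, block, lam=1.0):
    """Bounds of D + lam*N on block [k1, k2] x [a1, a2] under the monotonicity
    properties: D decreases in k and increases in alpha, while N increases in k
    and decreases in alpha."""
    (k1, k2), (a1, a2) = block
    # Lower bound: D is smallest at (k2, a1) and N is smallest at (k1, a2)
    lb = D(k2, a1) + lam * N(k1, a2)
    # Upper bound: the objective at any point in the block works; take the
    # better of the two corners used on the slide
    ub = min(D(k1, a2) + lam * N(k1, a2),
             D(k2, a1) + lam * N(k2, a1))
    return lb, ub
```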

[Figure: block B = [k1, k2] × [α1, α2] in the (k, α) plane, with corners (k1, α2) and (k2, α1) marked]

14 of 25

Example

  • Let tolerance ε = 0.1

Box   Lower bound   Upper bound
B2    0.3           0.4
B3    0.2           0.5

[Figure: remaining blocks B2 and B3 in the (k, α) space]

15 of 25

Example

  • Let tolerance ε = 0.1

B3's lower bound < B2's lower bound, so B3 is branched into halves B4 and B5.

Before branching:

Box   Lower bound   Upper bound
B2    0.3           0.4
B3    0.2           0.5

After branching B3:

Box   Lower bound   Upper bound
B2    0.3           0.4
B4    0.41          0.6
B5    0.5           0.7

[Figure: block B3 is divided into blocks B4 and B5]

16 of 25

Example

  • Let tolerance ε = 0.1

B3's lower bound < B2's lower bound, so B3 was branched into B4 and B5.

Box   Lower bound   Upper bound
B2    0.3           0.4
B3    0.2           0.5

Box   Lower bound   Upper bound   Pruning check
B2    0.3           0.4
B4    0.41          0.6           0.41 + ε > 0.4, discard
B5    0.5           0.7           0.5 + ε > 0.4, discard

Discard B4 and B5. B2 is the best block; return B2's upper bound.

17 of 25

Experiments

We demonstrate:

  1. How XClusters balances clustering quality and explainability
  2. How XClusters improves on existing explainable clustering


18 of 25

Experiment Settings

  • Methods compared
    • 2-Step: performs clustering, then decision tree training; sets k with the elbow method
    • Grid Search (GS): performs clustering for k ∈ [3, 11] and α ∈ {0, 0.05, 0.10, …, 1.0}
    • Bayesian Optimization (BO): performs hyperparameter tuning using 10 to 30 initial points and iterations
    • XClusters: performs monotonic optimization; λ = 1
  • Datasets
    • Credit Card dataset: 5.7B transactions amounting to $120B USD, 2,551 demographics
    • Other real-world datasets with similar results: DS4C and Contracts


19 of 25

Monotonicity Observations

  • Averaged D and N values over all α and k values
  • The monotonicity properties mostly hold for all three datasets


[Figure: plots of k versus D, k versus N, α versus D, and α versus N]

20 of 25

Explainability and Distortion Results

Our algorithm achieves accurate and explainable results with a small runtime

Dataset       Method      D+λN    D       N       Runtime (s)
Credit Card   2-Step      1.005   0.166   0.840   0.426
              GS          0.625   0.391   0.234   64.455
              BO          0.657   0.364   0.293   16.595
              XClusters   0.656   0.385   0.271   6.336

21 of 25

Explainability and Distortion Results

Our algorithm achieves accurate and explainable results with a small runtime

2-Step yields high N and low D.

Dataset       Method      D+λN    D       N       Runtime (s)
Credit Card   2-Step      1.005   0.166   0.840   0.426
              GS          0.625   0.391   0.234   64.455
              BO          0.657   0.364   0.293   16.595
              XClusters   0.656   0.385   0.271   6.336

22 of 25

Explainability and Distortion Results

XClusters achieves close to the lowest objective value with a small runtime

GS and BO have high runtimes.

Dataset       Method      D+λN    D       N       Runtime (s)
Credit Card   2-Step      1.005   0.166   0.840   0.426
              GS          0.625   0.391   0.234   64.455
              BO          0.657   0.364   0.293   16.595
              XClusters   0.656   0.385   0.271   6.336

23 of 25

Comparison with Explainable Clustering

XClusters performs better due to its holistic optimization

Dataset       Method                               D+λN    D       N
Credit Card   Explainable k-Means [1]              1.081   0.253   0.828
              XClusters with Explainable k-Means   0.708   0.565   0.143

[1] Moshkovitz et al., "Explainable k-Means and k-Medians Clustering," ICML 2020.

24 of 25

Conclusion

  • We propose the new paradigm of explainability-first clustering
    • Holistic optimization that minimizes clustering distortion and decision tree size
  • We identify and utilize monotonicity properties for efficient optimization
  • Experiments show that XClusters outperforms existing explainable clustering


25 of 25

Conclusion

  • We propose the new paradigm of explainability-first clustering
    • Holistic optimization that minimizes clustering distortion and decision tree size
  • We identify and utilize monotonicity properties for efficient optimization
  • Experiments show that XClusters outperforms existing explainable clustering


Thank you!