
A Scalable Mixed-Integer Programming Based Framework for Optimal Decision Trees
Haoran Zhu, University of Wisconsin-Madison
August 16, 2019

IBM Research

© 2019 IBM Corporation


Introduction to optimal decision tree

  • Classification trees and regression trees are the two main types of decision trees;
  • Learning an optimal decision tree is known to be NP-complete under several notions of optimality;
  • Heuristic algorithms are the main approach in practice for training decision trees: CART, ID3, C4.5, etc.;
  • Interpretability is a major advantage of decision trees.
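For contrast with the optimality discussion later in the deck, the greedy node-splitting heuristic that CART/ID3/C4.5-style methods rely on can be sketched as follows (a minimal illustration of the idea only, not any of those algorithms verbatim; the `best_split` helper and toy data are mine):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search over axis-aligned (feature, threshold) splits,
    minimising weighted Gini impurity -- locally optimal only."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best

X = np.array([[1.0, 5.0], [2.0, 4.0], [7.0, 1.0], [8.0, 2.0]])
y = np.array([0, 0, 1, 1])
j, t, score = best_split(X, y)   # splits perfectly on feature 0 at threshold 2
```

Applied greedily node by node, this rule is fast but carries no guarantee of global optimality, which is what motivates the optimal-tree formulations below.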


Main considerations

Interpretability

Scalability

Training accuracy

Testing accuracy


Why “ODT”

    • Main goal
    • Main issue

Interpretability

Scalability

Training accuracy

Testing accuracy




Common methods for learning “ODT”

  • Due to the combinatorial structure of decision trees, the following two approaches to training an ODT dominate the literature:
    • Boolean formula-based models;
    • MIP (mixed-integer programming) based formulations.
  • Starting from (Bertsimas and Dunn: Machine Learning, 2017; Menickelly et al.: COR@L Technical Report, 2016), MIP formulations have been widely studied for training optimal decision trees.
  • Pros:
    • Great freedom in formulating any specific tree structure or model-simplicity requirement;
    • Can leverage state-of-the-art MIP solvers;
    • Noted by (Bertsimas and Dunn: Machine Learning, 2017): “solving the decision tree problem to optimality yields trees that better reflect the ground truth in the data, refuting the belief that such optimal methods will simply overfit to the training set and not generalize well.”
  • Cons:
    • Computationally intractable for large-scale datasets → remedy: do data selection before training;
    • Generalization of the trained ODT is not significantly better than that of heuristics → remedy: a different formulation.
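To make the MIP approach concrete, here is a toy instance (my own sketch under big-M assumptions, not the formulation of this talk or of the cited papers) of the misclassification-counting trick common to MIP-ODT models, solved with `scipy.optimize.milp`: choose one threshold b on a single feature so that class-0 points route left and class-1 points route right, counting misrouted points exactly.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
n, M = len(x), 20.0

# Variables: [b, z_1..z_n]; z_i = 1 iff point i is routed right (x_i >= b).
# Minimising sum_{y_i=0} z_i + sum_{y_i=1} (1 - z_i) counts misrouted points;
# the constant part of the objective is added back at the end.
c = np.concatenate([[0.0], np.where(y == 0, 1.0, -1.0)])

rows, ub = [], []
for i in range(n):
    r = np.zeros(n + 1); r[0] = -1.0; r[1 + i] = -M
    rows.append(r); ub.append(-x[i])          # x_i - b <= M * z_i
    r = np.zeros(n + 1); r[0] = 1.0; r[1 + i] = M
    rows.append(r); ub.append(x[i] + M)       # b - x_i <= M * (1 - z_i)

res = milp(c,
           constraints=LinearConstraint(np.array(rows), ub=np.array(ub)),
           integrality=np.concatenate([[0], np.ones(n)]),   # b continuous, z binary
           bounds=Bounds(np.concatenate([[0.0], np.zeros(n)]),
                         np.concatenate([[M], np.ones(n)])))
errors = res.fun + (y == 1).sum()             # number of misclassified points
```

Big-M constraints of this kind are a well-known source of weak LP relaxations in MIP-ODT models, which is part of why formulation choices matter so much here.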


Our motivation

  • Come up with a novel MIP formulation that yields an ODT with strong generalization ability:
    • The ODT in (Bertsimas and Dunn: Machine Learning, 2017) outperforms CART on only 60% of datasets, with an average improvement of around 1-5%.
  • Do data selection before constructing the MIP formulation used to train the ODT:
    • In the literature, all MIP-based ODT training methods can handle at most around 5,000 data points, and in most cases on such datasets the ODT obtained after 2 hours of training is simply the warm-start solution provided initially.


Our contribution

  • A novel MIP formulation for training ODTs, which achieves much better testing accuracy (7-18% on average) than its counterparts in the literature.
    • With mild modification, this MIP formulation can also train directly on datasets with categorical features, without first transforming them into numerical features via, e.g., one-hot encoding;
    • Numerical tests suggest this formulation is more compact than most of its counterparts;
    • We also provide cutting planes to further speed up computation.

  • A novel LP-based data-selection method.
    • Each LP subproblem is easy to solve, and the subproblems can be solved in parallel;
    • Comparisons with other data-selection methods show that the data subsets selected by our method are generally of higher quality.


Related state of the art

  • MIP formulations for training ODTs:
    • (Bertsimas and Dunn: Machine Learning, 2017);
    • (Gunluk et al.: CoRR, 2018);
    • (Verwer and Zhang, 2017); (Verwer and Zhang: AAAI, 2019);
    • (Aghaei, Azizi, and Vayanos: AAAI, 2019);
    • (Rhuggenaath et al.: IEEE, 2018).

  • Different scenarios for data selection:
    • Computational geometry: selecting point subsets (coresets) that approximate the extent measures of the whole point set;
    • Efficiently selecting a subset of data for SVM training;
    • In query learning and active learning, labeling a subset of the most informative data points;
    • Selecting a data subset that maximizes some utility function (data likelihood, information gain, etc.).


Tree structure

  • We focus on decision trees that use multivariate (hyperplane) splits.

  • Our MIP formulation is built on a balanced binary tree; however, any pre-specified tree topology can easily be encoded in MIP constraints.

  • Our ODT is essentially characterized by the branching hyperplane at each branch node and the class label assigned to each leaf node.
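As a concrete (hypothetical) illustration of that characterization, classifying a point in a balanced multivariate-split tree needs only the hyperplane (a, b) stored at each branch node and a label per leaf; the helper below is my own sketch, not the talk's notation:

```python
import numpy as np

def predict(x, branches, leaf_labels, depth):
    """Route x through a balanced binary tree of the given depth.
    branches[t] = (a, b): go left from node t when a @ x < b."""
    node = 0
    for _ in range(depth):
        a, b = branches[node]
        node = 2 * node + 1 if a @ x < b else 2 * node + 2
    return leaf_labels[node - (2 ** depth - 1)]  # leaves form the last layer

# Depth-1 example: branch on x0 + x1 < 1; left leaf labelled 0, right leaf 1.
branches = {0: (np.array([1.0, 1.0]), 1.0)}
labels = [0, 1]
```

The MIP formulation's decision variables correspond exactly to these hyperplanes and leaf labels, which is why any fixed topology is easy to encode.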


MIP-ODT formulation


MIP-ODT formulation

 


Heuristic behind the formulation

 


Further strengthening the formulation

 


Speeding up the computation

 


Modification for categorical features

  • Add constraints
  • Replace constraints


LP-based data-selection



  • General mixed-binary LP formulation:


Selecting extreme points
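The extracted slide lost its formulas, so as a hedged stand-in, one natural LP test for extremeness (my reading of the idea, not necessarily the talk's exact subproblem) checks whether a point can be written as a convex combination of the remaining points; points that cannot are extreme points of the convex hull and natural candidates to keep:

```python
import numpy as np
from scipy.optimize import linprog

def is_extreme(X, i):
    """True if X[i] is NOT a convex combination of the other rows of X."""
    others = np.delete(X, i, axis=0)
    # Feasibility LP: lam >= 0, others.T @ lam = X[i], sum(lam) = 1.
    A_eq = np.vstack([others.T, np.ones(len(others))])
    b_eq = np.append(X[i], 1.0)
    res = linprog(c=np.zeros(len(others)), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    return res.status != 0   # infeasible -> no such combination -> extreme

# Corners of the unit square are extreme; its centre is not.
pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5]], dtype=float)
flags = [is_extreme(pts, i) for i in range(len(pts))]
```

Each such LP involves only the data and is independent of the others, consistent with the deck's claim that the subproblems are easy to solve and parallelisable.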

 


Soft Convex Combination Constraint (CCC)

  • In many cases, especially in high-dimensional spaces or when the data contain random error, it is uncommon for one data point to be expressible as a convex combination of the other points;
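One standard way to soften the convex-combination constraint (again my own sketch of the idea; the talk's exact soft-CCC model is not shown in the extraction) is to add nonnegative slack s and minimise its total size, so every point gets a hull-distance-like score instead of a hard infeasibility:

```python
import numpy as np
from scipy.optimize import linprog

def soft_ccc_error(X, i):
    """Minimal total slack needed to write X[i] as a convex combination of
    the other rows; 0 means X[i] already lies in their convex hull."""
    others = np.delete(X, i, axis=0)
    m, d = others.shape
    c = np.concatenate([np.zeros(m), np.ones(d)])       # minimise sum of slacks
    I = np.eye(d)
    A_ub = np.block([[others.T, -I], [-others.T, -I]])  # |others.T @ lam - x| <= s
    b_ub = np.concatenate([X[i], -X[i]])
    A_eq = np.concatenate([np.ones(m), np.zeros(d)]).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=(0, None), method="highs")
    return res.fun

pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5]], dtype=float)
inside_err = soft_ccc_error(pts, 4)   # centre of the square: zero slack
corner_err = soft_ccc_error(pts, 0)   # corner: strictly positive slack
```

The slack LP is always feasible, so it returns a usable score even for noisy, high-dimensional data where the exact test would reject almost every point.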


Balanced data-selection

  • Sometimes the data points inside the convex hull better depict the whole picture of the entire dataset:

  • Example: f = 1, g = 1. However, the previous equivalence to another simple LP no longer holds;


Initial clustering matters



Iterative ODT training algorithm

 


Numerical results: medium-sized dataset


Numerical results: medium-sized dataset (contd.)


Numerical results: medium-sized dataset (bad instances)


Further comparison: one more formulation in (Verwer and Zhang, 2019)


Numerical results: large-sized datasets


Numerical results: even larger datasets


Tree depth D=3


Numerical reports from (Rhuggenaath et al.: IEEE, 2018)


Comparison of different data-selections


IBM data (timeseries_rca)


IBM data (FPAtable__oil_well__as__15D__60D)

1.6 : 1, 892, 6h

                 MIP-ODT-DS   CART   CART(entire)
Training acc.    0.76         0.70   0.97
Testing acc.     0.88         0.89   0.97
Recall           0.39         0.31   0
Precision        0.10         0.09   0


IBM data (FPAtable__oil_well__as__15D__60D)

3.2 : 1, 405, 1h

                 MIP-ODT-DS   CART   CART(entire)
Training acc.    0.82         0.76   0.97
Testing acc.     0.75         0.75   0.97
Recall           0.42         0.39   0
Precision        0.05         0.05   0


Thank you!


Appendix


Appendix (Segmentation; D=3)


Appendix (Dermatology; D=3)
