1 of 63

BASIC MACHINE LEARNING I

2 of 63

WHAT IS MACHINE LEARNING

  • Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data
  • A part of artificial intelligence
  • Machine learning algorithms are used in a wide variety of applications
    • Medicine
    • Neuroscience
    • Data science
    • Robotics
    • Statistics

3 of 63

FRAMEWORK OF MACHINE LEARNING

[Figure: training data feed a training step that produces a model; testing data and new queries are passed to the model (a black box) to produce the output]

4 of 63

TRAINING AND TESTING

  • Training set is for feeding the information (features) to the machine for building the model
  • Testing set is for evaluating the accuracy of the built model; it should be totally separate from the training set
  • The model is specific to the selected features
  • If other features may influence a new query, the training needs to be re-initiated with the new features
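The separation described above can be sketched in plain Python; the function name, the 80/20 ratio and the fixed seed are illustrative choices, not from the slides.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Randomly split a dataset into disjoint training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_ratio)
    test_idx = set(indices[:n_test])
    train = [x for i, x in enumerate(data) if i not in test_idx]
    test = [x for i, x in enumerate(data) if i in test_idx]
    return train, test

train, test = train_test_split(list(range(100)), test_ratio=0.2)
```

Keeping the two sets disjoint is the point: any overlap would let the model be evaluated on samples it has already seen.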

5 of 63

MACHINE LEARNING TYPES

  • Supervised learning
    • Needs known labels for the input data
    • Mostly used for classification
  • Unsupervised learning
    • Does not need labels
    • Mostly used for data grouping (clustering)
  • Reinforcement learning
    • Learns from mistakes
    • Suited to reward-based problems

6 of 63

MACHINE LEARNING TYPES

  • Supervised learning

7 of 63

MACHINE LEARNING TYPES

  • Unsupervised learning

8 of 63

MACHINE LEARNING TYPES

  • Reinforcement learning

9 of 63

Patient classification

A predictor assigns a new case to an existing group

[Figure: a predictor is trained on protein-expression profiles of known patient groups and is then used to classify new cases]

10 of 63

Proteins that best distinguish the groups of patients

Use predictor to find a prognostic signature

[Figure: a predictor trained on protein-expression data identifies the proteins (e.g. Protein A and Protein B) that best distinguish the patient groups, yielding a prognostic signature]

11 of 63

INPUT DATA

  • Supervised learning
    • The data are a set of inputs X (features) paired with labels Y
  • Unsupervised learning
    • The input data go to the algorithm without labels
  • Reinforcement learning
    • No raw data are given; the agent learns from trial and error

12 of 63

DATA BALANCE VS. IMBALANCE

  • Balanced dataset — the number of positive samples is approximately the same as the number of negative samples
  • Imbalanced dataset — a large difference between the numbers of positive and negative samples

13 of 63

EXAMPLE OF IMBALANCED DATASET

  • Transcriptional terminators: 2,000
  • Whole nucleotides: 4,000,000
  • P: 2,000; N: 3,998,000
  • After training, the model predicts negative for almost every new query

14 of 63

USING GOOD EVALUATION METRICS

  • Sensitivity = TP / (TP + FN)

  • Specificity = TN / (TN + FP)

  • Accuracy = (TP + TN) / (TP + FP + FN + TN)

  • These metrics are strongly influenced by an imbalanced dataset

15 of 63

USING GOOD EVALUATION METRICS

  • Precision = TP / (TP + FP)

  • Recall = TP / (TP + FN)

  • F1 score = 2 × Precision × Recall / (Precision + Recall)

  • Good for imbalanced datasets (few positives)
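The metrics above can be computed directly from confusion-matrix counts; this minimal sketch uses the counts of the patient example in this deck (TP = 1, FP = 0, FN = 9, TN = 990).

```python
def metrics(tp, fp, fn, tn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # same quantity as recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return sensitivity, specificity, accuracy, precision, recall, f1

# Patient example: 10 patients, 990 normal, one true positive prediction
sens, spec, acc, prec, rec, f1 = metrics(tp=1, fp=0, fn=9, tn=990)
```

Note how accuracy (99.1%) looks excellent while recall (10%) exposes the imbalance problem.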

16 of 63

                           Patients/P     Normal/N
Total                      10             990
Prediction                 1              999
True positive/negative     1              990
False positive/negative    0              9

Sensitivity   1/10 = 10%
Specificity   990/990 = 100%
Accuracy      (1+990)/(10+990) = 99.1%
Precision     1/1 = 100%
Recall        1/10 = 10%
F1 score      2×1/(2×1+0+9) = 18.2%

17 of 63

OVER-SAMPLING (UP SAMPLING)

  • Balance by increasing the number of rare (minority) samples
  • Advantages
    • No loss of information
    • Synthetic sampling can mitigate the overfitting caused by simple duplication
  • Disadvantages
    • Risk of overfitting, since minority samples are repeated

18 of 63

UNDER-SAMPLING (DOWN SAMPLING)

  • Balance by reducing the number of majority samples
  • Advantages
    • Run-time can be improved by decreasing the size of the training dataset
    • Helps in solving memory problems
  • Disadvantages
    • Critical information may be lost
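Both resampling strategies can be sketched in a few lines; random duplication and random removal are the simplest variants, shown here on a hypothetical 10-versus-990 dataset.

```python
import random

def oversample(minority, n_target, seed=0):
    """Up-sample the minority class by random duplication until it reaches n_target."""
    rng = random.Random(seed)
    return minority + [rng.choice(minority) for _ in range(n_target - len(minority))]

def undersample(majority, n_target, seed=0):
    """Down-sample the majority class by randomly keeping only n_target samples."""
    rng = random.Random(seed)
    return rng.sample(majority, n_target)

minority = ["pos"] * 10
majority = ["neg"] * 990
balanced_up = oversample(minority, len(majority))    # 990 positives by duplication
balanced_down = undersample(majority, len(minority)) # 10 negatives by selection
```

Duplication keeps all information but repeats the same positives (the overfitting risk above); selection is cheap but discards 980 negatives.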

19 of 63

FEATURE SELECTION

  • Need to select features that are highly related to the target question
  • Learning the wrong lesson
    • An image classifier trained only on pictures of brown horses and black cats might conclude that all brown patches are likely to be horses
  • Different features may contribute to the predictions differently

20 of 63

FEATURE FILTER

  • Evaluates the importance of features from the variation of the features and the target
  • Remove features with low variance
    • If the training set is composed entirely of males, gender is a useless feature with low variance
  • Use correlation to remove features that are highly related to each other
    • e.g. age and mobility are highly correlated
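A low-variance filter can be sketched without any library; the feature matrix below is hypothetical, with a constant first column (e.g. "all male") that the filter removes.

```python
# Hypothetical numeric feature matrix: rows = samples, columns = features.
# Column 0 is constant, columns 1 and 2 vary.
data = [
    [1.0, 30, 170],
    [1.0, 35, 172],
    [1.0, 40, 168],
    [1.0, 25, 171],
]

def column(rows, j):
    return [row[j] for row in rows]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Keep only features whose variance exceeds a small threshold;
# the constant column carries no information and is dropped.
threshold = 1e-9
kept = [j for j in range(len(data[0])) if variance(column(data, j)) > threshold]
```

A correlation filter works the same way, dropping one of each pair of columns whose correlation exceeds a chosen threshold.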

21 of 63

FEATURE WRAPPER

  • Train a base model multiple times, removing the feature with the lowest weight in each run
  • The training thereby removes the features that contribute least

22 of 63

FEATURE EMBEDDING

  • Similar to the feature wrapper
  • Feature removal occurs during the learning process itself
  • Removes features that contribute little and have no relationship with the target

23 of 63

FEATURE WEIGHT

  • Some methods produce feature-weight information, but some do not

24 of 63

HYPERPARAMETERS

  • Hyperparameters are parameters set before the training process that influence how the model learns and performs.
    • Learning rate determines the speed at which the model updates weights.
    • Depth of a decision tree affects its complexity.
  • Optimization
    • Grid searching
    • Random searching
    • Bayesian searching

25 of 63

HYPERPARAMETER OPTIMIZATION

Fruit    Shape     Size     Smell      Color     Score
         Circle    Middle   Scent      Red
         Ellipse   Big      Scent      Yellow
         Ellipse   Big      Odor       Yellow
         Circle    Middle   No smell   Green
         Circle    Small    No smell   Purple

26 of 63

HYPERPARAMETER OPTIMIZATION

Fruit    Shape   Size   Smell   Color   Score
         1       2      2       1       6
         2       3      2       3       10
         2       3      3       3       11
         1       2      1       2       6
         1       1      1       4       7

27 of 63

HYPERPARAMETER OPTIMIZATION

Fruit    Shape    Size     Smell    Color    Score
         1 × a    2 × b    2 × c    1 × d    6
         2 × a    3 × b    2 × c    3 × d    10
         2 × a    3 × b    3 × c    3 × d    11
         1 × a    2 × b    1 × c    2 × d    6
         1 × a    1 × b    1 × c    4 × d    7

28 of 63

HYPERPARAMETER OPTIMIZATION

Fruit    Shape    Size     Smell    Color      Score
         1 × 1    2 × 2    2 × 1    1 × 0.5    7.5
         2 × 1    3 × 2    2 × 1    3 × 0.5    11.5
         2 × 1    3 × 2    3 × 1    3 × 0.5    12.5
         1 × 1    2 × 2    1 × 1    2 × 0.5    7
         1 × 1    1 × 2    1 × 1    4 × 0.5    6
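The fruit tables above score each row as a weighted sum of the encoded features. A minimal sketch of that scoring, using the unit weights and the tuned weights (1, 2, 1, 0.5) from the tables:

```python
# Encoded fruit features from the tables above: [Shape, Size, Smell, Color]
fruits = [
    [1, 2, 2, 1],
    [2, 3, 2, 3],
    [2, 3, 3, 3],
    [1, 2, 1, 2],
    [1, 1, 1, 4],
]

def score(features, weights):
    """Weighted sum of the encoded feature values."""
    return sum(f * w for f, w in zip(features, weights))

unit_scores = [score(f, [1, 1, 1, 1]) for f in fruits]       # all weights equal
tuned_scores = [score(f, [1, 2, 1, 0.5]) for f in fruits]    # tuned weights
```

Changing the weights reorders the scores, which is exactly what hyperparameter optimization searches over.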

29 of 63

GRID SEARCHING

  • It systematically tests all possible combinations of specified hyperparameter values to find the best configuration for a model.
  • Each combination is evaluated using a chosen performance metric, often through cross-validation, and the one yielding the best result is selected.
  • While effective, grid search can be computationally expensive, especially when the hyperparameter space is large, and the result also depends on how the data are split
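Grid search amounts to enumerating the Cartesian product of the hyperparameter values; the grid and the toy objective below are hypothetical stand-ins for a real model and its cross-validated accuracy.

```python
from itertools import product

# Hypothetical hyperparameter grid.
grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

def evaluate(learning_rate, depth):
    """Toy objective for illustration only; peaks at learning_rate=0.1, depth=4."""
    return -(learning_rate - 0.1) ** 2 - (depth - 4) ** 2 / 100

# Systematically test every combination and keep the best-scoring one.
candidates = [
    {"learning_rate": lr, "depth": d}
    for lr, d in product(grid["learning_rate"], grid["depth"])
]
best = max(candidates, key=lambda p: evaluate(**p))
```

With 3 values per hyperparameter this is 9 evaluations; the cost grows multiplicatively with each added hyperparameter, which is the expense noted above.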

30 of 63

RANDOM SEARCHING

  • Random search selects random combinations within the specified ranges.
  • This approach allows for broader exploration of the hyperparameter space while often being more efficient, especially when only a few hyperparameters significantly impact model performance.
  • By testing a smaller, random subset of combinations, random search can achieve good results with less computational effort, but the global optimum may not be found
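Random search replaces the exhaustive grid with a fixed budget of random draws; the ranges, the budget of 20 trials and the toy objective below are illustrative assumptions.

```python
import random

rng = random.Random(0)

def evaluate(learning_rate, depth):
    """Toy objective for illustration only."""
    return -(learning_rate - 0.1) ** 2 - (depth - 4) ** 2 / 100

# Sample a fixed budget of random combinations within the specified ranges.
trials = [
    {"learning_rate": 10 ** rng.uniform(-3, 0), "depth": rng.randint(1, 10)}
    for _ in range(20)
]
best = max(trials, key=lambda p: evaluate(**p))
```

Sampling the learning rate on a log scale is a common choice when a hyperparameter spans several orders of magnitude.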

31 of 63

BAYESIAN SEARCHING

  • Unlike grid or random search, which test combinations without considering past results, Bayesian search builds a probabilistic model of the objective function based on previous evaluations.
  • Bayesian search is particularly useful for expensive-to-train models, where each evaluation of a hyperparameter combination takes significant computational resources.

32 of 63

HYPERPARAMETER OPTIMIZATION

33 of 63

BAYESIAN OPTIMIZATION

ID    Age   Gender   Income   Marriage   Children   Buy
1     33    M        5.3W     Y          Y          Y
2     37    F        4.5W     Y          Y          Y
3     22    F        3W       N          N          N
4     35    M        7.8W     Y          Y          Y
5     60    M        6W       N          Y          N
6     63    F        5.5W     Y          N          N
7     55    F        6W       Y          Y          N
8     18    F        2.8W     Y          N          N
9     21    M        3.5W     N          N          N
10    46    F        7W       N          N          N

34 of 63

BAYESIAN OPTIMIZATION

ID    Age   Gender   Income   Marriage   Children   Buy
1     33    M        5.3W     Y          Y          Y
2     37    F        4.5W     Y          Y          Y
3     22    F        3W       N          N          N
4     35    M        7.8W     Y          Y          Y
5     60    M        6W       N          Y          N
6     63    F        5.5W     Y          N          N
7     55    F        6W       Y          Y          N
8     18    F        2.8W     Y          N          N
9     21    M        3.5W     N          N          N
10    46    F        7W       N          N          N
11    38    M        8.3W     Y          Y          3/10 ?

35 of 63

BAYESIAN OPTIMIZATION

ID    Age   Gender   Income   Marriage   Children   Buy
1     33    M        5.3W     Y          Y          Y
2     37    F        6.5W     Y          Y          Y
4     35    M        7.8W     Y          Y          Y
3     22    F        3W       N          N          N
5     60    M        5W       N          Y          N
6     63    F        5.5W     Y          N          N
7     55    F        6W       Y          Y          N
8     18    F        2.8W     Y          N          N
9     21    M        3.5W     N          N          N
10    46    F        7W       N          N          N
11    38    M        8.3W     Y          Y          Maybe Y (3/10 ?)

36 of 63

COST-SENSITIVE LEARNING

  • Takes the misclassification costs (and possibly other types of cost) into consideration during learning
  • The goal of this type of learning is to minimize the total cost
  • This technique avoids pre-selection of parameters and auto-adjusts the decision hyperplane
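The quantity being minimized can be sketched as a cost-weighted sum over the confusion matrix. The counts follow the confusion-matrix example in this deck; the per-outcome unit costs are hypothetical.

```python
# Confusion-matrix counts and hypothetical unit costs for each outcome.
counts = {"TP": 990, "FP": 9, "FN": 0, "TN": 1}
costs = {"TP": 0, "FP": 1, "FN": 10, "TN": 0}   # missing a case costs 10x a false alarm

# Total cost instead of plain error count: the learner's objective.
total_cost = sum(counts[k] * costs[k] for k in counts)
```

With asymmetric costs like these, a classifier that trades a few extra false positives for fewer false negatives can lower the total cost even though its plain accuracy drops.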

37 of 63

                      Actual positive    Actual negative
Predicted positive    TP = 990           FP = 9
Predicted negative    FN = 0             TN = 1

With a cost attached to each outcome:

                      Actual positive    Actual negative
Predicted positive    TP = 990, cost     FP = 9, cost
Predicted negative    FN = 0, cost       TN = 1, cost

38 of 63

ENSEMBLE LEARNING

  • By splitting the dataset randomly into subsets to balance the dataset
  • By using many constructed models to recover incorrect classifications
  • By using different machine learning models/methods to integrate the optimized results
  • Types
    • Stacking
    • Bagging
    • Boosting

39 of 63

STACKING

  • Uses the identical input dataset
  • Trains different classifiers such as KNN, SVM, etc.
  • Uses the output of each classifier to decide the final output

40 of 63

BAGGING

  • Split the whole training set into small subsets
  • Use the subsets to construct classifiers
  • The majority vote is the final output
  • Because of the random subset selection, background noise can be reduced
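The two bagging ingredients, random subsets and majority voting, can be sketched in plain Python; the subset sizes and the example votes are illustrative.

```python
import random

rng = random.Random(1)
training_set = list(range(100))   # stand-in for 100 training samples

def random_subsets(data, n_models, subset_size):
    """Draw one random subset of the training set per classifier."""
    return [rng.sample(data, subset_size) for _ in range(n_models)]

def majority_vote(predictions):
    """The final output is the label most classifiers agree on."""
    return max(set(predictions), key=predictions.count)

subsets = random_subsets(training_set, n_models=5, subset_size=30)
final = majority_vote(["Y", "N", "Y", "Y", "N"])   # 3-vs-2 vote
```

Each classifier is trained on its own subset independently, so the five trainings could run in parallel.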

41 of 63

BOOSTING

42 of 63

BOOSTING

  • Combines multiple weak classifiers
  • The classifiers are connected in sequence
  • Weight from the incorrect outputs is carried into the next round of training
  • Keys
    • How to modify the weights for training
    • How to integrate the weak classifiers
  • Needs the accuracy of the classifiers to be known beforehand

43 of 63

A boy invites a girl on a date. She asks her friends whether she should date him or not. Afterwards, she decides to date the boy (Stacking).

Several months pass, and the boy wants to buy a present for the girl, but he has no idea which present would be a good one. So he chooses some items (a ring, a necklace and a bag) and asks his friends for their opinions. Since most of them choose the bag, he buys a great bag for the girl (Bagging).

The girl hates the bag and feels the boy doesn't love her. So she breaks up with the boy. Because of this, the boy will not buy a bag as a present for a girlfriend in the future (Boosting).

44 of 63

BAGGING VS. STACKING VS. BOOSTING

  • Bagging
    • Each training is from random selected subsets (not weighting differences)
    • Train the models independently
  • Stacking
    • Training set is the same
    • Train the models independently, but the output uses for the final training.
  • Boosting
    • Training set is the same
    • Each training is connected, the classifiers learn from the last training
    • Weighting is based on the error classifications

45 of 63

MULTIPLE CLASSIFIERS

Conditions:

  1. The accuracy of every model should be higher than 0.5
  2. The models should be sufficiently different from each other

46 of 63

ADABOOST

  • Reassign the weights of the incorrect classifications from the models for the next training

[Figure: initial stage with equal sample weights; the first classifier reaches an error rate of 0.25; the misclassified samples receive updated, larger weights for the next round]
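The weight update can be sketched with the standard AdaBoost formula α = ½ ln((1 − ε)/ε); the four-sample example below matches the error rate of 0.25 shown above and is otherwise illustrative.

```python
import math

def adaboost_update(weights, correct, error_rate):
    """Re-weight samples after one weak classifier:
    misclassified samples gain weight, correct ones lose weight."""
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    z = sum(new)                       # normalisation constant
    return alpha, [w / z for w in new]

# Four samples with equal initial weight 0.25; the last one is misclassified,
# so the (weighted) error rate is 0.25.
alpha, updated = adaboost_update([0.25] * 4, [True, True, True, False], 0.25)
```

After normalisation the single misclassified sample carries half of the total weight, so the next weak classifier is forced to focus on it.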

47 of 63

ADABOOST


48 of 63

ADABOOST

  • Integration of the classifiers

49 of 63

OVERFITTING

  • Definition – the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably
  • Situations that may cause overfitting
    • Imbalanced data
    • Too small a dataset
    • Over-training, e.g. too many features

50 of 63

HOW TO SOLVE OVERFITTING

  • Increasing the size of the training set
  • Reducing the number of features
  • Reducing the number of layers in a neural network
  • Regularization
  • Dropout

51 of 63

REGULARIZATION

  • Applies a penalty to parameters/features that have large weights
  • Weight decay adds a penalty term to the loss function, normally proportional to the weight
    • Large weights will decrease strongly
    • Small weights will decrease slightly or not at all
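One gradient step with L2 weight decay can be sketched as follows; the weights, learning rate and decay strength are hypothetical, and the data gradients are zeroed to isolate the penalty's effect.

```python
# One gradient step with weight decay: each weight shrinks in proportion
# to its own magnitude, so large weights are penalised most.
weights = [5.0, 0.1]
learning_rate = 0.1
decay = 0.5          # hypothetical regularisation strength

gradients = [0.0, 0.0]   # pure-decay step, no data gradient
weights = [w - learning_rate * (g + decay * w) for w, g in zip(weights, gradients)]
```

Both weights shrink by the same 5% factor, but in absolute terms the large weight drops by 0.25 while the small one barely moves, which is the behaviour described above.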

52 of 63

DROPOUT

  • Uses a Bernoulli distribution to randomly remove neurons from the training, to fix over-training
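A dropout layer can be sketched as one Bernoulli trial per neuron; the drop probability of 0.5 and the inverted-dropout rescaling of survivors are common conventions, assumed here rather than taken from the slide.

```python
import random

def dropout(activations, p_drop, rng):
    """Zero each neuron with probability p_drop (a Bernoulli trial per neuron),
    scaling survivors by 1/(1 - p_drop) so the expected activation is unchanged."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
out = dropout([1.0] * 10, p_drop=0.5, rng=rng)   # each neuron survives as 2.0 or dies as 0.0
```

At test time dropout is switched off; the rescaling during training is what makes that possible without changing the expected output.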

53 of 63

VALIDATION

  • In order to measure the real accuracy of the built model, validation needs to be performed
  • Methods
    • K-fold cross validation
    • Leave-one-out cross validation (Jackknife)
    • Leave-one-group-out cross validation
    • Nested cross validation
    • Time series cross validation

54 of 63

HOLDOUT WITH TRAIN-TEST

  • Create an additional holdout set, often 10% of the data, which is not used in any of your processing/validation steps

55 of 63

K-FOLD CROSS VALIDATION
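K-fold cross validation partitions the samples into k folds; each fold serves once as the validation set while the rest are used for training. A minimal sketch with contiguous folds (a shuffled or stratified split is common in practice):

```python
def kfold_indices(n_samples, k):
    """Split sample indices into k folds; yield (train, validation) index pairs."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples  # last fold takes the remainder
        val = indices[start:end]
        train = indices[:start] + indices[end:]
        folds.append((train, val))
    return folds

folds = kfold_indices(10, k=5)   # 5 rounds, each validating on 2 of 10 samples
```

Every sample appears in exactly one validation fold, so the k accuracy scores together use the whole dataset for evaluation.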

56 of 63

LEAVE-ONE-OUT CROSS VALIDATION

57 of 63

LEAVE-ONE-GROUP-OUT CROSS VALIDATION

58 of 63

NESTED CROSS VALIDATION

59 of 63

TIME SERIES CROSS VALIDATION

60 of 63

COMPARING MODELS

  • For evaluating the performances of the models, proper statistical methods need to be used
  • Methods
    • Wilcoxon signed-rank test
    • McNemar’s Test
    • 5x2CV paired t-test

61 of 63

WILCOXON SIGNED-RANK TEST

  • The Wilcoxon signed-rank test suits small sample sizes and data that do not follow a normal distribution
  • It is normally applied for comparisons with k-fold cross validation

62 of 63

MCNEMAR’S TEST

  • Checks whether the predictions of one model and another match or not
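The test statistic depends only on the discordant cases, i.e. the samples on which exactly one of the two models is correct. A sketch with the common continuity-corrected formula; the counts b = 6 and c = 2 are hypothetical.

```python
def mcnemar_statistic(b, c):
    """McNemar's chi-squared statistic with continuity correction, where b and c
    count the samples on which exactly one of the two models is correct."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical discordant counts: model A alone correct 6 times, model B alone 2 times.
stat = mcnemar_statistic(6, 2)
```

The statistic is compared against a chi-squared distribution with one degree of freedom; cases where both models agree do not enter the formula at all.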

63 of 63

5X2CV PAIRED T-TEST