1 of 63

BASIC MACHINE LEARNING I

2 of 63

WHAT IS MACHINE LEARNING

  • Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data
  • A part of artificial intelligence
  • Machine learning algorithms are used in a wide variety of applications
    • Medicine
    • Neuroscience
    • Data science
    • Robotics
    • Statistics

3 of 63

FRAMEWORK OF MACHINE LEARNING

[Figure: training data feed a training step that produces a model; testing data and new queries are passed to the model (a black box) to produce the output]

4 of 63

TRAINING AND TESTING

  • Training set is for feeding the information (features) to the machine for building the model
  • Testing set is for evaluating the accuracy of the built model; it should be totally separate from the training set
  • The model is specific to the selected features
  • If other features may influence a new query, the training needs to be re-initiated with the new features
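The separation described above can be sketched in plain Python; the function name, the 80/20 ratio and the fixed seed are illustrative choices, not from the slides.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Randomly split a dataset into disjoint training and testing sets."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_ratio)
    test_idx = set(indices[:n_test])
    train = [x for i, x in enumerate(data) if i not in test_idx]
    test = [x for i, x in enumerate(data) if i in test_idx]
    return train, test

train, test = train_test_split(list(range(100)), test_ratio=0.2)
```

Keeping the two sets disjoint is the point: any overlap would let the model be evaluated on samples it has already seen.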

5 of 63

MACHINE LEARNING TYPES

  • Supervised learning
    • Needs known labels for the input data
    • Mostly used for classification
  • Unsupervised learning
    • Does not need labels
    • Mostly used for data grouping (clustering)
  • Reinforcement learning
    • Learns from mistakes
    • Suited to reward-based problems

6 of 63

MACHINE LEARNING TYPES

  • Supervised learning

7 of 63

MACHINE LEARNING TYPES

  • Unsupervised learning

8 of 63

MACHINE LEARNING TYPES

  • Reinforcement learning

9 of 63

Patient classification

A predictor assigns a new case to an existing group

[Figure: a predictor is trained on protein-expression profiles of known patient groups and is then used to classify new cases]

10 of 63

Proteins that best distinguish the groups of patients

Use predictor to find a prognostic signature

[Figure: a predictor trained on protein-expression data identifies the proteins (e.g. Protein A and Protein B) that best distinguish the patient groups, yielding a prognostic signature]

11 of 63

INPUT DATA

  • Supervised learning
    • The data are a set of inputs X (features) paired with labels Y
  • Unsupervised learning
    • The input data go to the algorithm without labels
  • Reinforcement learning
    • No raw data are given; the agent learns from trial and error

12 of 63

DATA BALANCE VS. IMBALANCE

  • Balanced dataset — the number of positive samples is approximately the same as the number of negative samples
  • Imbalanced dataset — a large difference between the numbers of positive and negative samples

13 of 63

EXAMPLE OF IMBALANCED DATASET

  • Transcriptional terminators: 2,000
  • Whole nucleotides: 4,000,000
  • P: 2,000; N: 3,998,000
  • After training, the model predicts negative for almost every new query

14 of 63

USING GOOD EVALUATION METRICS

  • Sensitivity = TP / (TP + FN)

  • Specificity = TN / (TN + FP)

  • Accuracy = (TP + TN) / (TP + FP + FN + TN)

  • These metrics are strongly influenced by an imbalanced dataset

15 of 63

USING GOOD EVALUATION METRICS

  • Precision = TP / (TP + FP)

  • Recall = TP / (TP + FN)

  • F1 score = 2 × Precision × Recall / (Precision + Recall)

  • Good for imbalanced datasets (few positives)
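The metrics above can be computed directly from confusion-matrix counts; this minimal sketch uses the counts of the patient example in this deck (TP = 1, FP = 0, FN = 9, TN = 990).

```python
def metrics(tp, fp, fn, tn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # same quantity as recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return sensitivity, specificity, accuracy, precision, recall, f1

# Patient example: 10 patients, 990 normal, one true positive prediction
sens, spec, acc, prec, rec, f1 = metrics(tp=1, fp=0, fn=9, tn=990)
```

Note how accuracy (99.1%) looks excellent while recall (10%) exposes the imbalance problem.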

16 of 63

                           Patients/P     Normal/N
Total                      10             990
Prediction                 1              999
True positive/negative     1              990
False positive/negative    0              9

Sensitivity   1/10 = 10%
Specificity   990/990 = 100%
Accuracy      (1+990)/(10+990) = 99.1%
Precision     1/1 = 100%
Recall        1/10 = 10%
F1 score      2×1/(2×1+0+9) = 18.2%

17 of 63

OVER-SAMPLING (UP SAMPLING)

  • Balance by increasing the number of rare (minority) samples
  • Advantages
    • No loss of information
    • Synthetic sampling can mitigate the overfitting caused by simple duplication
  • Disadvantages
    • Risk of overfitting, since minority samples are repeated

18 of 63

UNDER-SAMPLING (DOWN SAMPLING)

  • Balance by reducing the number of majority samples
  • Advantages
    • Run-time can be improved by decreasing the size of the training dataset
    • Helps in solving memory problems
  • Disadvantages
    • Critical information may be lost
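Both resampling strategies can be sketched in a few lines; random duplication and random removal are the simplest variants, shown here on a hypothetical 10-versus-990 dataset.

```python
import random

def oversample(minority, n_target, seed=0):
    """Up-sample the minority class by random duplication until it reaches n_target."""
    rng = random.Random(seed)
    return minority + [rng.choice(minority) for _ in range(n_target - len(minority))]

def undersample(majority, n_target, seed=0):
    """Down-sample the majority class by randomly keeping only n_target samples."""
    rng = random.Random(seed)
    return rng.sample(majority, n_target)

minority = ["pos"] * 10
majority = ["neg"] * 990
balanced_up = oversample(minority, len(majority))    # 990 positives by duplication
balanced_down = undersample(majority, len(minority)) # 10 negatives by selection
```

Duplication keeps all information but repeats the same positives (the overfitting risk above); selection is cheap but discards 980 negatives.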

19 of 63

FEATURE SELECTION

  • Need to select features that are highly related to the target question
  • Learning the wrong lesson
    • An image classifier trained only on pictures of brown horses and black cats might conclude that all brown patches are likely to be horses
  • Different features may contribute to the predictions differently

20 of 63

FEATURE FILTER

  • Evaluates the importance of features from the variation of the features and the target
  • Remove features with low variance
    • If the training set is composed entirely of males, gender is a useless feature with low variance
  • Use correlation to remove features that are highly related to each other
    • e.g. age and mobility are highly correlated
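A low-variance filter can be sketched without any library; the feature matrix below is hypothetical, with a constant first column (e.g. "all male") that the filter removes.

```python
# Hypothetical numeric feature matrix: rows = samples, columns = features.
# Column 0 is constant, columns 1 and 2 vary.
data = [
    [1.0, 30, 170],
    [1.0, 35, 172],
    [1.0, 40, 168],
    [1.0, 25, 171],
]

def column(rows, j):
    return [row[j] for row in rows]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Keep only features whose variance exceeds a small threshold;
# the constant column carries no information and is dropped.
threshold = 1e-9
kept = [j for j in range(len(data[0])) if variance(column(data, j)) > threshold]
```

A correlation filter works the same way, dropping one of each pair of columns whose correlation exceeds a chosen threshold.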

21 of 63

FEATURE WRAPPER

  • Train a base model multiple times, removing the feature with the lowest weight in each run
  • The training thereby removes the features that contribute least

22 of 63

FEATURE EMBEDDING

  • Similar to the feature wrapper
  • Feature removal occurs during the learning process itself
  • Removes features that contribute little and have no relationship with the target

23 of 63

FEATURE WEIGHT

  • Some methods produce feature-weight information, but some do not

24 of 63

HYPERPARAMETERS

  • Hyperparameters are parameters set before the training process that influence how the model learns and performs.
    • Learning rate determines the speed at which the model updates weights.
    • Depth of a decision tree affects its complexity.
  • Optimization
    • Grid searching
    • Random searching
    • Bayesian searching

25 of 63

HYPERPARAMETER OPTIMIZATION

Fruit    Shape     Size     Smell      Color     Score
         Circle    Middle   Scent      Red
         Ellipse   Big      Scent      Yellow
         Ellipse   Big      Odor       Yellow
         Circle    Middle   No smell   Green
         Circle    Small    No smell   Purple

26 of 63

HYPERPARAMETER OPTIMIZATION

Fruit    Shape   Size   Smell   Color   Score
         1       2      2       1       6
         2       3      2       3       10
         2       3      3       3       11
         1       2      1       2       6
         1       1      1       4       7

27 of 63

HYPERPARAMETER OPTIMIZATION

Fruit    Shape    Size     Smell    Color    Score
         1 × a    2 × b    2 × c    1 × d    6
         2 × a    3 × b    2 × c    3 × d    10
         2 × a    3 × b    3 × c    3 × d    11
         1 × a    2 × b    1 × c    2 × d    6
         1 × a    1 × b    1 × c    4 × d    7

28 of 63

HYPERPARAMETER OPTIMIZATION

Fruit    Shape    Size     Smell    Color      Score
         1 × 1    2 × 2    2 × 1    1 × 0.5    7.5
         2 × 1    3 × 2    2 × 1    3 × 0.5    11.5
         2 × 1    3 × 2    3 × 1    3 × 0.5    12.5
         1 × 1    2 × 2    1 × 1    2 × 0.5    7
         1 × 1    1 × 2    1 × 1    4 × 0.5    6
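The fruit tables above score each row as a weighted sum of the encoded features. A minimal sketch of that scoring, using the unit weights and the tuned weights (1, 2, 1, 0.5) from the tables:

```python
# Encoded fruit features from the tables above: [Shape, Size, Smell, Color]
fruits = [
    [1, 2, 2, 1],
    [2, 3, 2, 3],
    [2, 3, 3, 3],
    [1, 2, 1, 2],
    [1, 1, 1, 4],
]

def score(features, weights):
    """Weighted sum of the encoded feature values."""
    return sum(f * w for f, w in zip(features, weights))

unit_scores = [score(f, [1, 1, 1, 1]) for f in fruits]       # all weights equal
tuned_scores = [score(f, [1, 2, 1, 0.5]) for f in fruits]    # tuned weights
```

Changing the weights reorders the scores, which is exactly what hyperparameter optimization searches over.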

29 of 63

GRID SEARCHING

  • It systematically tests all possible combinations of specified hyperparameter values to find the best configuration for a model.
  • Each combination is evaluated using a chosen performance metric, often through cross-validation, and the one yielding the best result is selected.
  • While effective, grid search can be computationally expensive, especially when the hyperparameter space is large, and the result also depends on how the data are split
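Grid search amounts to enumerating the Cartesian product of the hyperparameter values; the grid and the toy objective below are hypothetical stand-ins for a real model and its cross-validated accuracy.

```python
from itertools import product

# Hypothetical hyperparameter grid.
grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

def evaluate(learning_rate, depth):
    """Toy objective for illustration only; peaks at learning_rate=0.1, depth=4."""
    return -(learning_rate - 0.1) ** 2 - (depth - 4) ** 2 / 100

# Systematically test every combination and keep the best-scoring one.
candidates = [
    {"learning_rate": lr, "depth": d}
    for lr, d in product(grid["learning_rate"], grid["depth"])
]
best = max(candidates, key=lambda p: evaluate(**p))
```

With 3 values per hyperparameter this is 9 evaluations; the cost grows multiplicatively with each added hyperparameter, which is the expense noted above.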

30 of 63

RANDOM SEARCHING

  • Random search selects random combinations within the specified ranges.
  • This approach allows for broader exploration of the hyperparameter space while often being more efficient, especially when only a few hyperparameters significantly impact model performance.
  • By testing a smaller, random subset of combinations, random search can achieve good results with less computational effort, but the global optimum may not be found
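Random search replaces the exhaustive grid with a fixed budget of random draws; the ranges, the budget of 20 trials and the toy objective below are illustrative assumptions.

```python
import random

rng = random.Random(0)

def evaluate(learning_rate, depth):
    """Toy objective for illustration only."""
    return -(learning_rate - 0.1) ** 2 - (depth - 4) ** 2 / 100

# Sample a fixed budget of random combinations within the specified ranges.
trials = [
    {"learning_rate": 10 ** rng.uniform(-3, 0), "depth": rng.randint(1, 10)}
    for _ in range(20)
]
best = max(trials, key=lambda p: evaluate(**p))
```

Sampling the learning rate on a log scale is a common choice when a hyperparameter spans several orders of magnitude.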

31 of 63

BAYESIAN SEARCHING

  • Unlike grid or random search, which test combinations without considering past results, Bayesian search builds a probabilistic model of the objective function based on previous evaluations.
  • Bayesian search is particularly useful for expensive-to-train models, where each evaluation of a hyperparameter combination takes significant computational resources.

32 of 63

HYPERPARAMETER OPTIMIZATION

33 of 63

BAYESIAN OPTIMIZATION

ID    Age   Gender   Income   Marriage   Children   Buy
1     33    M        5.3W     Y          Y          Y
2     37    F        4.5W     Y          Y          Y
3     22    F        3W       N          N          N
4     35    M        7.8W     Y          Y          Y
5     60    M        6W       N          Y          N
6     63    F        5.5W     Y          N          N
7     55    F        6W       Y          Y          N
8     18    F        2.8W     Y          N          N
9     21    M        3.5W     N          N          N
10    46    F        7W       N          N          N

34 of 63

BAYESIAN OPTIMIZATION

ID    Age   Gender   Income   Marriage   Children   Buy
1     33    M        5.3W     Y          Y          Y
2     37    F        4.5W     Y          Y          Y
3     22    F        3W       N          N          N
4     35    M        7.8W     Y          Y          Y
5     60    M        6W       N          Y          N
6     63    F        5.5W     Y          N          N
7     55    F        6W       Y          Y          N
8     18    F        2.8W     Y          N          N
9     21    M        3.5W     N          N          N
10    46    F        7W       N          N          N
11    38    M        8.3W     Y          Y          3/10 ?

35 of 63

BAYESIAN OPTIMIZATION

ID    Age   Gender   Income   Marriage   Children   Buy
1     33    M        5.3W     Y          Y          Y
2     37    F        6.5W     Y          Y          Y
4     35    M        7.8W     Y          Y          Y
3     22    F        3W       N          N          N
5     60    M        5W       N          Y          N
6     63    F        5.5W     Y          N          N
7     55    F        6W       Y          Y          N
8     18    F        2.8W     Y          N          N
9     21    M        3.5W     N          N          N
10    46    F        7W       N          N          N
11    38    M        8.3W     Y          Y          Maybe Y (3/10 ?)

36 of 63

COST-SENSITIVE LEARNING

  • Takes the misclassification costs (and possibly other types of cost) into consideration during learning
  • The goal of this type of learning is to minimize the total cost
  • This technique avoids pre-selection of parameters and auto-adjusts the decision hyperplane
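The quantity being minimized can be sketched as a cost-weighted sum over the confusion matrix. The counts follow the confusion-matrix example in this deck; the per-outcome unit costs are hypothetical.

```python
# Confusion-matrix counts and hypothetical unit costs for each outcome.
counts = {"TP": 990, "FP": 9, "FN": 0, "TN": 1}
costs = {"TP": 0, "FP": 1, "FN": 10, "TN": 0}   # missing a case costs 10x a false alarm

# Total cost instead of plain error count: the learner's objective.
total_cost = sum(counts[k] * costs[k] for k in counts)
```

With asymmetric costs like these, a classifier that trades a few extra false positives for fewer false negatives can lower the total cost even though its plain accuracy drops.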

37 of 63

                      Actual positive    Actual negative
Predicted positive    TP = 990           FP = 9
Predicted negative    FN = 0             TN = 1

With a cost attached to each outcome:

                      Actual positive    Actual negative
Predicted positive    TP = 990, cost     FP = 9, cost
Predicted negative    FN = 0, cost       TN = 1, cost

38 of 63

ENSEMBLE LEARNING

  • By splitting the dataset randomly into subsets to balance the dataset
  • By using many constructed models to recover incorrect classifications
  • By using different machine learning models/methods to integrate the optimized results
  • Types
    • Stacking
    • Bagging
    • Boosting

39 of 63

STACKING

  • Uses the identical input dataset
  • Trains different classifiers such as KNN, SVM, etc.
  • Uses the output of each classifier to decide the final output

40 of 63

BAGGING

  • Split the whole training set into small subsets
  • Use the subsets to construct classifiers
  • The majority vote is the final output
  • Because of the random subset selection, background noise can be reduced
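The two bagging ingredients, random subsets and majority voting, can be sketched in plain Python; the subset sizes and the example votes are illustrative.

```python
import random

rng = random.Random(1)
training_set = list(range(100))   # stand-in for 100 training samples

def random_subsets(data, n_models, subset_size):
    """Draw one random subset of the training set per classifier."""
    return [rng.sample(data, subset_size) for _ in range(n_models)]

def majority_vote(predictions):
    """The final output is the label most classifiers agree on."""
    return max(set(predictions), key=predictions.count)

subsets = random_subsets(training_set, n_models=5, subset_size=30)
final = majority_vote(["Y", "N", "Y", "Y", "N"])   # 3-vs-2 vote
```

Each classifier is trained on its own subset independently, so the five trainings could run in parallel.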

41 of 63

BOOSTING

42 of 63

BOOSTING

  • Combines multiple weak classifiers
  • The classifiers are connected in sequence
  • Weight from the incorrect outputs is carried into the next round of training
  • Keys
    • How to modify the weights for training
    • How to integrate the weak classifiers
  • Needs the accuracy of the classifiers to be known beforehand

43 of 63

A boy invites a girl on a date. She asks her friends whether she should date him or not. Afterwards, she decides to date the boy (Stacking).

Several months pass, and the boy wants to buy a present for the girl, but he has no idea which present would be a good one. So he chooses some items (a ring, a necklace and a bag) and asks his friends for their opinions. Since most of them choose the bag, he buys a great bag for the girl (Bagging).

The girl hates the bag and feels the boy doesn't love her. So she breaks up with the boy. Because of this, the boy will not buy a bag as a present for a girlfriend in the future (Boosting).

44 of 63

BAGGING VS. STACKING VS. BOOSTING

  • Bagging
    • Each training is from random selected subsets (not weighting differences)
    • Train the models independently
  • Stacking
    • Training set is the same
    • Train the models independently, but the output uses for the final training.
  • Boosting
    • Training set is the same
    • Each training is connected, the classifiers learn from the last training
    • Weighting is based on the error classifications

45 of 63

MULTIPLE CLASSIFIERS

Conditions:

  1. The accuracy of every model should be higher than 0.5
  2. The models should be sufficiently different from each other

46 of 63

ADABOOST

  • Reassign the weights of the incorrect classifications from the models for the next training

[Figure: initial stage with equal sample weights; the first classifier reaches an error rate of 0.25; the misclassified samples receive updated, larger weights for the next round]
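The weight update can be sketched with the standard AdaBoost formula α = ½ ln((1 − ε)/ε); the four-sample example below matches the error rate of 0.25 shown above and is otherwise illustrative.

```python
import math

def adaboost_update(weights, correct, error_rate):
    """Re-weight samples after one weak classifier:
    misclassified samples gain weight, correct ones lose weight."""
    alpha = 0.5 * math.log((1 - error_rate) / error_rate)
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    z = sum(new)                       # normalisation constant
    return alpha, [w / z for w in new]

# Four samples with equal initial weight 0.25; the last one is misclassified,
# so the (weighted) error rate is 0.25.
alpha, updated = adaboost_update([0.25] * 4, [True, True, True, False], 0.25)
```

After normalisation the single misclassified sample carries half of the total weight, so the next weak classifier is forced to focus on it.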

47 of 63

ADABOOST


48 of 63

ADABOOST

  • Integration of the classifiers

49 of 63

OVERFITTING

  • Definition – the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably
  • Situations that may cause overfitting
    • Imbalanced data
    • Too small a dataset
    • Over-training, e.g. too many features

50 of 63

HOW TO SOLVE OVERFITTING

  • Increasing the size of the training set
  • Reducing the number of features
  • Reducing the number of layers in a neural network
  • Regularization
  • Dropout

51 of 63

REGULARIZATION

  • Applies a penalty to parameters/features that have large weights
  • Weight decay adds a penalty term to the loss function, normally proportional to the weight
    • Large weights will decrease strongly
    • Small weights will decrease slightly or not at all
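One gradient step with L2 weight decay can be sketched as follows; the weights, learning rate and decay strength are hypothetical, and the data gradients are zeroed to isolate the penalty's effect.

```python
# One gradient step with weight decay: each weight shrinks in proportion
# to its own magnitude, so large weights are penalised most.
weights = [5.0, 0.1]
learning_rate = 0.1
decay = 0.5          # hypothetical regularisation strength

gradients = [0.0, 0.0]   # pure-decay step, no data gradient
weights = [w - learning_rate * (g + decay * w) for w, g in zip(weights, gradients)]
```

Both weights shrink by the same 5% factor, but in absolute terms the large weight drops by 0.25 while the small one barely moves, which is the behaviour described above.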

52 of 63

DROPOUT

  • Uses a Bernoulli distribution to randomly remove neurons from the training, to fix over-training
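A dropout layer can be sketched as one Bernoulli trial per neuron; the drop probability of 0.5 and the inverted-dropout rescaling of survivors are common conventions, assumed here rather than taken from the slide.

```python
import random

def dropout(activations, p_drop, rng):
    """Zero each neuron with probability p_drop (a Bernoulli trial per neuron),
    scaling survivors by 1/(1 - p_drop) so the expected activation is unchanged."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
out = dropout([1.0] * 10, p_drop=0.5, rng=rng)   # each neuron survives as 2.0 or dies as 0.0
```

At test time dropout is switched off; the rescaling during training is what makes that possible without changing the expected output.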

53 of 63

VALIDATION

  • In order to measure the real accuracy of the built model, validation needs to be performed
  • Methods
    • K-fold cross validation
    • Leave-one-out cross validation (Jackknife)
    • Leave-one-group-out cross validation
    • Nested cross validation
    • Time series cross validation

54 of 63

HOLDOUT WITH TRAIN-TEST

  • Create an additional holdout set, often 10% of the data, which is not used in any of your processing/validation steps

55 of 63

K-FOLD CROSS VALIDATION
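K-fold cross validation partitions the samples into k folds; each fold serves once as the validation set while the rest are used for training. A minimal sketch with contiguous folds (a shuffled or stratified split is common in practice):

```python
def kfold_indices(n_samples, k):
    """Split sample indices into k folds; yield (train, validation) index pairs."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples  # last fold takes the remainder
        val = indices[start:end]
        train = indices[:start] + indices[end:]
        folds.append((train, val))
    return folds

folds = kfold_indices(10, k=5)   # 5 rounds, each validating on 2 of 10 samples
```

Every sample appears in exactly one validation fold, so the k accuracy scores together use the whole dataset for evaluation.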

56 of 63

LEAVE-ONE-OUT CROSS VALIDATION

57 of 63

LEAVE-ONE-GROUP-OUT CROSS VALIDATION

58 of 63

NESTED CROSS VALIDATION

59 of 63

TIME SERIES CROSS VALIDATION

60 of 63

COMPARING MODELS

  • For evaluating the performances of the models, proper statistical methods need to be used
  • Methods
    • Wilcoxon signed-rank test
    • McNemar’s Test
    • 5x2CV paired t-test

61 of 63

WILCOXON SIGNED-RANK TEST

  • The Wilcoxon signed-rank test suits small sample sizes and data that do not follow a normal distribution
  • It is normally applied for comparisons with k-fold cross validation

62 of 63

MCNEMAR’S TEST

  • Checks whether the predictions of one model and another match or not
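The test statistic depends only on the discordant cases, i.e. the samples on which exactly one of the two models is correct. A sketch with the common continuity-corrected formula; the counts b = 6 and c = 2 are hypothetical.

```python
def mcnemar_statistic(b, c):
    """McNemar's chi-squared statistic with continuity correction, where b and c
    count the samples on which exactly one of the two models is correct."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical discordant counts: model A alone correct 6 times, model B alone 2 times.
stat = mcnemar_statistic(6, 2)
```

The statistic is compared against a chi-squared distribution with one degree of freedom; cases where both models agree do not enter the formula at all.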

63 of 63

5X2CV PAIRED T-TEST