
Classification

Machine learning using Galaxy webinar

22 - 26 June, 2020

Institute of Applied Biosciences (INAB)

Centre for Research and Technology Hellas (CERTH)

Thessaloniki, Greece


The ML taxonomy

Supervised Learning

  • Applies what has been learned in the past to new data, using labeled examples to predict future events.
  • Starting from the analysis of a known training dataset, the learning algorithm produces a prediction model that can provide targets for any new input (after sufficient training).
  • The learning algorithm can also compare its output with the correct, intended output and find errors, in order to modify and improve the prediction model accordingly.

Classification

Two steps:

  • Building the classifier (model):
    • the learning step, in which the classification algorithm builds the classifier.
    • the classifier is built from the training set, made up of database samples and their associated class labels.
    • each sample that constitutes the training set belongs to one class, given by its label.
  • Applying the classifier to a classification task:
    • the classifier is used to classify new, unseen data.
    • test data is used to estimate the accuracy of the classification rules.
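The two steps can be sketched in code. Below is a minimal illustration using a nearest-centroid classifier, one of the simplest possible models; the data and function names are invented for the example, not taken from the course material.

```python
# Minimal sketch of the two classification steps with a nearest-centroid
# classifier; data and names are invented for illustration.

def build_classifier(training_set):
    """Learning step: compute the mean (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify(centroids, features):
    """Application step: assign the class whose centroid is closest."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist2(centroids[label]))

# Training set: (feature vector, associated class label) pairs
train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"),
         ([5.0, 5.0], "B"), ([4.8, 5.2], "B")]
model = build_classifier(train)          # step 1: build the classifier
print(classify(model, [1.1, 0.9]))       # step 2: classify new data -> "A"
```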

Classification vs Regression

https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/

Classification                                   | Regression
-------------------------------------------------|-------------------------------------------------
Discrete, categorical output variable            | Continuous output (real number range)
Supervised learning problem                      | Supervised learning problem
Assign the output to a class (a label)           | Predict the output value using training data
e.g. predict tumor type (harmful vs not harmful) | e.g. predict a house price or survival time


Linear regression

  • Regression algorithms are used when some continuous value needs to be computed, in contrast to classification, where the output is categorical.
  • Whenever there is a need to predict some future value of a process that is currently running, a regression algorithm can be used.
  • Operating on a two-dimensional set of observations (two continuous variables), simple linear regression attempts to fit a line through the data points as well as possible.
  • The regression line (our model) becomes a tool that can help uncover underlying trends in our dataset.
  • The regression line, when properly fitted, can serve as a predictive model for new events.
  • Linear regression is, however, unstable when features are redundant, i.e. when there is multicollinearity.
  • An example where linear regression can be used:
    • predicting (or classifying) tumor types from gene expression data.
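As a sketch of the idea, simple linear regression can be fitted with the closed-form least-squares formulas; the data points below are made up for illustration.

```python
# Simple linear regression by ordinary least squares (pure-Python sketch;
# the data points are invented for illustration).

def fit_line(xs, ys):
    """Fit y = slope * x + intercept, minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]                # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)

def predict(x):
    """The fitted line serves as a predictive model for new events."""
    return slope * x + intercept

print(slope, intercept, predict(5.0))    # 2.0 1.0 11.0
```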

Decision Trees (Supervised)

  • Single trees are used very rarely, but in combination with many others they build very efficient algorithms such as Random Forest or Gradient Tree Boosting.
  • Decision trees easily handle feature interactions, and they are non-parametric, so there is no need to worry about outliers or whether the data is linearly separable.
  • Disadvantages are:
    • the tree often needs to be rebuilt when new examples come in.
    • decision trees easily overfit, but ensemble methods like random forests (or boosted trees) mitigate this problem.
    • they can also take a lot of memory (the more features you have, the deeper and larger your decision tree is likely to be).
  • Trees are excellent tools for helping to choose between several courses of action.
    • Example: Classification of genomic islands using decision trees and ensemble algorithms
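To make the splitting idea concrete, here is the smallest possible decision tree, a single-split "stump", in plain Python. The two features, the threshold search, and the yes/no labels are invented (loosely echoing the genomic-island example above).

```python
# The smallest possible decision tree: a one-level "stump" that picks the
# single feature/threshold split with the fewest training errors.
# Data, feature meanings, and labels are invented for illustration.

def best_stump(samples):
    """samples: list of (features, label); labels are 'yes' / 'no'."""
    best, best_errors = None, len(samples) + 1
    for f in range(len(samples[0][0])):
        for threshold in sorted({feats[f] for feats, _ in samples}):
            for below, above in [("no", "yes"), ("yes", "no")]:
                errors = sum(
                    (below if feats[f] <= threshold else above) != label
                    for feats, label in samples
                )
                if errors < best_errors:
                    best, best_errors = (f, threshold, below, above), errors
    return best

def stump_predict(stump, features):
    f, threshold, below, above = stump
    return below if features[f] <= threshold else above

# e.g. feature 0 = GC content, feature 1 = length (kb); label = island or not
data = [([0.2, 1.0], "no"), ([0.3, 2.0], "no"),
        ([0.7, 1.5], "yes"), ([0.8, 0.5], "yes")]
stump = best_stump(data)
print(stump_predict(stump, [0.75, 1.0]))   # "yes": feature 0 above threshold
```

A real decision tree applies this split search recursively to each resulting subset; ensemble methods then combine many such trees.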

Random Forest (Supervised)

  • Random Forest is an ensemble of decision trees.
  • It can solve both regression and classification problems with large data sets.
  • It also helps identify the most significant variables among thousands of input variables.
  • Random Forest is highly scalable to any number of dimensions and generally achieves quite acceptable performance.
  • However, with Random Forest, learning may be slow (depending on the parameterization) and it is not possible to iteratively improve the generated models.
  • Random Forest can be used in real-world applications such as:
    • identifying patients at high risk for certain diseases.
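The ensemble idea can be sketched as bootstrap-sampled trees combined by majority vote. The toy version below uses depth-1 stumps as ensemble members and invented data; a full Random Forest (e.g. scikit-learn's RandomForestClassifier) grows deeper trees and also samples a random subset of features at each split.

```python
import random

# Toy Random Forest: many small trees, each trained on a bootstrap sample,
# combined by majority vote. The "trees" here are depth-1 stumps and the
# data are invented for illustration.

def train_stump(samples):
    """Best single-feature threshold split (labels are 0 / 1)."""
    best, best_err = None, len(samples) + 1
    for f in range(len(samples[0][0])):
        for t in sorted({x[f] for x, _ in samples}):
            for below, above in [(0, 1), (1, 0)]:
                err = sum((below if x[f] <= t else above) != y
                          for x, y in samples)
                if err < best_err:
                    best, best_err = (f, t, below, above), err
    return best

def stump_predict(stump, x):
    f, t, below, above = stump
    return below if x[f] <= t else above

def train_forest(samples, n_trees=25, seed=0):
    rng = random.Random(seed)
    return [train_stump([rng.choice(samples) for _ in samples])  # bootstrap
            for _ in range(n_trees)]

def forest_predict(forest, x):
    votes = sum(stump_predict(s, x) for s in forest)   # number of 1-votes
    return 1 if 2 * votes >= len(forest) else 0

data = [([0.10, 0.90], 0), ([0.20, 0.80], 0), ([0.30, 0.70], 0), ([0.35, 0.75], 0),
        ([0.70, 0.20], 1), ([0.80, 0.30], 1), ([0.90, 0.10], 1), ([0.75, 0.25], 1)]
forest = train_forest(data)
print(forest_predict(forest, [0.15, 0.85]), forest_predict(forest, [0.85, 0.15]))
```

The majority vote is what makes the ensemble robust: individual stumps trained on different bootstrap samples disagree, but their combined prediction is stable.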

Support Vector Machines (Supervised)

  • Support Vector Machine (SVM) is a supervised machine learning technique that is widely used in pattern recognition and classification problems, typically when the data has exactly two classes.
  • Advantages include high accuracy and even if the data is not linearly separable in the base feature space, SVM can work well with an appropriate kernel.
  • However SVMs are memory-intensive, hard to interpret, and difficult to tune.
  • SVM is especially popular in text classification problems where very high-dimensional spaces are the norm.
  • SVM can be used in real-world bioinformatics applications such as:
    • detecting persons with common diseases such as diabetes
    • Classification of genomic islands
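A linear SVM can be sketched as minimizing the hinge loss with L2 regularization by subgradient descent. The data, labels, and hyperparameters below are invented; real applications would use an optimized library (and a kernel, as noted above, when the data is not linearly separable).

```python
# Bare-bones linear SVM trained by subgradient descent on the hinge loss
# with L2 regularization; data and hyperparameters are illustrative.

def train_linear_svm(samples, lam=0.01, epochs=200, lr=0.1):
    """samples: list of (features, label) with labels in {-1, +1}."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # point inside the margin: hinge-loss update
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:            # correctly classified: only regularization pull
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def svm_predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Two separable classes, e.g. "disease" = +1 vs "healthy" = -1
data = [([2.0, 2.0], 1), ([2.5, 1.5], 1),
        ([-2.0, -2.0], -1), ([-1.5, -2.5], -1)]
model = train_linear_svm(data)
print(svm_predict(model, [2.2, 1.8]), svm_predict(model, [-2.1, -1.9]))
```

The kernel trick replaces the dot products above with a kernel function, which is what lets SVMs separate classes that are not linearly separable in the base feature space.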

Validation of supervised ML algorithms results

  • To test the performance of the learning system:
    • The system can be tested with sequences where the labels are known (and were excluded from the training set because they were intended to be used for this purpose).
    • Based on the results of the test data, the performance of the learning system can be assessed.
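The held-out evaluation described above amounts to: set labeled examples aside, train without them, then score predictions against the withheld labels. A deliberately trivial sketch (the "classifier" just predicts the commonest training label; the data is invented):

```python
# Held-out evaluation sketch: train on part of the labeled data, then
# measure accuracy on labels that were withheld from training.

def majority_class(training_labels):
    """A trivially simple 'classifier': always predict the commonest label."""
    return max(set(training_labels), key=training_labels.count)

labels = ["island", "island", "not", "island", "not", "island", "not", "island"]
train_labels, test_labels = labels[:6], labels[6:]   # withheld test set

predicted = majority_class(train_labels)             # "island"
accuracy = sum(predicted == y for y in test_labels) / len(test_labels)
print(accuracy)   # 0.5
```

Any real classifier slots into the same pattern: fit on the training portion, predict on the withheld portion, compare against the known labels.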

Some terms

  • Confusion matrix
  • Precision
  • Specificity
  • Recall / Sensitivity
  • Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC)
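These quantities all derive from the 2x2 confusion matrix of actual vs predicted labels. A small sketch with invented predictions (the formulas are the standard definitions):

```python
# Confusion-matrix-derived metrics from invented binary predictions.

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives

precision   = tp / (tp + fp)   # 3 / 5  = 0.6
recall      = tp / (tp + fn)   # sensitivity: 3 / 4 = 0.75
specificity = tn / (tn + fp)   # 4 / 6  ~ 0.667
print(precision, recall, specificity)
```

An ROC curve is then traced by plotting sensitivity against 1 − specificity as the classifier's decision threshold varies; the AUC summarizes the curve as a single number.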

https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

[ROC curve figure: y-axis = Sensitivity, x-axis = 1 - Specificity]