
Classification

Machine learning using Galaxy webinar

22 - 26 June, 2020

Institute of Applied Biosciences (INAB)

Centre for Research and Technology Hellas (CERTH)

Thessaloniki, Greece


The ML taxonomy

Supervised Learning

  • Applies what has been learned in the past to new data, using labeled examples to predict future events.
  • Starting from the analysis of a known training dataset, the learning algorithm produces a prediction model that can provide targets for any new input (after sufficient training).
  • The learning algorithm can also compare its output with the correct, intended output and find errors, in order to modify and improve the prediction model accordingly.

Classification

Two steps:

  • Building the classifier (model):
    • the learning step, in which the classification algorithm builds the classifier.
    • the classifier is built from the training set, made up of database samples and their associated class labels.
    • each sample that constitutes the training set belongs to one class, given by its label.
  • Applying the classifier to a classification task:
    • the classifier is used to classify new, unseen data.
    • test data is used to estimate the accuracy of the classification rules.
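The two steps can be sketched in code. Below is a minimal illustration using a nearest-centroid classifier, one of the simplest possible models; the data and function names are invented for the example, not taken from the course material.

```python
# Minimal sketch of the two classification steps with a nearest-centroid
# classifier; data and names are invented for illustration.

def build_classifier(training_set):
    """Learning step: compute the mean (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def classify(centroids, features):
    """Application step: assign the class whose centroid is closest."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist2(centroids[label]))

# Training set: (feature vector, associated class label) pairs
train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"),
         ([5.0, 5.0], "B"), ([4.8, 5.2], "B")]
model = build_classifier(train)          # step 1: build the classifier
print(classify(model, [1.1, 0.9]))       # step 2: classify new data -> "A"
```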

Classification vs Regression

https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/

Classification                                   | Regression
-------------------------------------------------|-------------------------------------------------
Discrete, categorical output variable            | Continuous output (real number range)
Supervised learning problem                      | Supervised learning problem
Assign the output to a class (a label)           | Predict the output value using training data
e.g. predict tumor type (harmful vs not harmful) | e.g. predict a house price or survival time


Linear regression

  • Regression algorithms are used when some continuous value needs to be computed, in contrast to classification, where the output is categorical.
  • Whenever there is a need to predict some future value of a process that is currently running, a regression algorithm can be used.
  • Operating on a two-dimensional set of observations (two continuous variables), simple linear regression attempts to fit a line through the data points as well as possible.
  • The regression line (our model) becomes a tool that can help uncover underlying trends in our dataset.
  • The regression line, when properly fitted, can serve as a predictive model for new events.
  • Linear regression is, however, unstable when features are redundant, i.e. when there is multicollinearity.
  • An example where linear regression can be used:
    • predicting (or classifying) tumor types from gene expression data.
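As a sketch of the idea, simple linear regression can be fitted with the closed-form least-squares formulas; the data points below are made up for illustration.

```python
# Simple linear regression by ordinary least squares (pure-Python sketch;
# the data points are invented for illustration).

def fit_line(xs, ys):
    """Fit y = slope * x + intercept, minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]                # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)

def predict(x):
    """The fitted line serves as a predictive model for new events."""
    return slope * x + intercept

print(slope, intercept, predict(5.0))    # 2.0 1.0 11.0
```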

Decision Trees (Supervised)

  • Single trees are used very rarely, but in combination with many others they build very efficient algorithms such as Random Forest or Gradient Tree Boosting.
  • Decision trees easily handle feature interactions, and they are non-parametric, so there is no need to worry about outliers or whether the data is linearly separable.
  • Disadvantages are:
    • the tree often needs to be rebuilt when new examples come in.
    • decision trees easily overfit, but ensemble methods like random forests (or boosted trees) mitigate this problem.
    • they can also take a lot of memory (the more features you have, the deeper and larger your decision tree is likely to be).
  • Trees are excellent tools for helping to choose between several courses of action.
    • Example: Classification of genomic islands using decision trees and ensemble algorithms
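To make the splitting idea concrete, here is the smallest possible decision tree, a single-split "stump", in plain Python. The two features, the threshold search, and the yes/no labels are invented (loosely echoing the genomic-island example above).

```python
# The smallest possible decision tree: a one-level "stump" that picks the
# single feature/threshold split with the fewest training errors.
# Data, feature meanings, and labels are invented for illustration.

def best_stump(samples):
    """samples: list of (features, label); labels are 'yes' / 'no'."""
    best, best_errors = None, len(samples) + 1
    for f in range(len(samples[0][0])):
        for threshold in sorted({feats[f] for feats, _ in samples}):
            for below, above in [("no", "yes"), ("yes", "no")]:
                errors = sum(
                    (below if feats[f] <= threshold else above) != label
                    for feats, label in samples
                )
                if errors < best_errors:
                    best, best_errors = (f, threshold, below, above), errors
    return best

def stump_predict(stump, features):
    f, threshold, below, above = stump
    return below if features[f] <= threshold else above

# e.g. feature 0 = GC content, feature 1 = length (kb); label = island or not
data = [([0.2, 1.0], "no"), ([0.3, 2.0], "no"),
        ([0.7, 1.5], "yes"), ([0.8, 0.5], "yes")]
stump = best_stump(data)
print(stump_predict(stump, [0.75, 1.0]))   # "yes": feature 0 above threshold
```

A real decision tree applies this split search recursively to each resulting subset; ensemble methods then combine many such trees.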

Random Forest (Supervised)

  • Random Forest is an ensemble of decision trees.
  • It can solve both regression and classification problems with large data sets.
  • It also helps identify the most significant variables among thousands of input variables.
  • Random Forest is highly scalable to any number of dimensions and generally achieves quite acceptable performance.
  • However, with Random Forest, learning may be slow (depending on the parameterization) and it is not possible to iteratively improve the generated models.
  • Random Forest can be used in real-world applications such as:
    • identifying patients at high risk for certain diseases.
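The ensemble idea can be sketched as bootstrap-sampled trees combined by majority vote. The toy version below uses depth-1 stumps as ensemble members and invented data; a full Random Forest (e.g. scikit-learn's RandomForestClassifier) grows deeper trees and also samples a random subset of features at each split.

```python
import random

# Toy Random Forest: many small trees, each trained on a bootstrap sample,
# combined by majority vote. The "trees" here are depth-1 stumps and the
# data are invented for illustration.

def train_stump(samples):
    """Best single-feature threshold split (labels are 0 / 1)."""
    best, best_err = None, len(samples) + 1
    for f in range(len(samples[0][0])):
        for t in sorted({x[f] for x, _ in samples}):
            for below, above in [(0, 1), (1, 0)]:
                err = sum((below if x[f] <= t else above) != y
                          for x, y in samples)
                if err < best_err:
                    best, best_err = (f, t, below, above), err
    return best

def stump_predict(stump, x):
    f, t, below, above = stump
    return below if x[f] <= t else above

def train_forest(samples, n_trees=25, seed=0):
    rng = random.Random(seed)
    return [train_stump([rng.choice(samples) for _ in samples])  # bootstrap
            for _ in range(n_trees)]

def forest_predict(forest, x):
    votes = sum(stump_predict(s, x) for s in forest)   # number of 1-votes
    return 1 if 2 * votes >= len(forest) else 0

data = [([0.10, 0.90], 0), ([0.20, 0.80], 0), ([0.30, 0.70], 0), ([0.35, 0.75], 0),
        ([0.70, 0.20], 1), ([0.80, 0.30], 1), ([0.90, 0.10], 1), ([0.75, 0.25], 1)]
forest = train_forest(data)
print(forest_predict(forest, [0.15, 0.85]), forest_predict(forest, [0.85, 0.15]))
```

The majority vote is what makes the ensemble robust: individual stumps trained on different bootstrap samples disagree, but their combined prediction is stable.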

Support Vector Machines (Supervised)

  • Support Vector Machine (SVM) is a supervised machine learning technique that is widely used in pattern recognition and classification problems, typically when the data has exactly two classes.
  • Advantages include high accuracy and even if the data is not linearly separable in the base feature space, SVM can work well with an appropriate kernel.
  • However SVMs are memory-intensive, hard to interpret, and difficult to tune.
  • SVM is especially popular in text classification problems where very high-dimensional spaces are the norm.
  • SVM can be used in real-world bioinformatics applications such as:
    • detecting persons with common diseases such as diabetes
    • Classification of genomic islands
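A linear SVM can be sketched as minimizing the hinge loss with L2 regularization by subgradient descent. The data, labels, and hyperparameters below are invented; real applications would use an optimized library (and a kernel, as noted above, when the data is not linearly separable).

```python
# Bare-bones linear SVM trained by subgradient descent on the hinge loss
# with L2 regularization; data and hyperparameters are illustrative.

def train_linear_svm(samples, lam=0.01, epochs=200, lr=0.1):
    """samples: list of (features, label) with labels in {-1, +1}."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # point inside the margin: hinge-loss update
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:            # correctly classified: only regularization pull
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def svm_predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Two separable classes, e.g. "disease" = +1 vs "healthy" = -1
data = [([2.0, 2.0], 1), ([2.5, 1.5], 1),
        ([-2.0, -2.0], -1), ([-1.5, -2.5], -1)]
model = train_linear_svm(data)
print(svm_predict(model, [2.2, 1.8]), svm_predict(model, [-2.1, -1.9]))
```

The kernel trick replaces the dot products above with a kernel function, which is what lets SVMs separate classes that are not linearly separable in the base feature space.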

Validation of supervised ML algorithms results

  • To test the performance of the learning system:
    • The system can be tested with sequences where the labels are known (and were excluded from the training set because they were intended to be used for this purpose).
    • Based on the results of the test data, the performance of the learning system can be assessed.
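The held-out evaluation described above amounts to: set labeled examples aside, train without them, then score predictions against the withheld labels. A deliberately trivial sketch (the "classifier" just predicts the commonest training label; the data is invented):

```python
# Held-out evaluation sketch: train on part of the labeled data, then
# measure accuracy on labels that were withheld from training.

def majority_class(training_labels):
    """A trivially simple 'classifier': always predict the commonest label."""
    return max(set(training_labels), key=training_labels.count)

labels = ["island", "island", "not", "island", "not", "island", "not", "island"]
train_labels, test_labels = labels[:6], labels[6:]   # withheld test set

predicted = majority_class(train_labels)             # "island"
accuracy = sum(predicted == y for y in test_labels) / len(test_labels)
print(accuracy)   # 0.5
```

Any real classifier slots into the same pattern: fit on the training portion, predict on the withheld portion, compare against the known labels.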

Some terms

  • Confusion matrix
  • Precision
  • Specificity
  • Recall / Sensitivity
  • Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC)
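These quantities all derive from the 2x2 confusion matrix of actual vs predicted labels. A small sketch with invented predictions (the formulas are the standard definitions):

```python
# Confusion-matrix-derived metrics from invented binary predictions.

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives

precision   = tp / (tp + fp)   # 3 / 5  = 0.6
recall      = tp / (tp + fn)   # sensitivity: 3 / 4 = 0.75
specificity = tn / (tn + fp)   # 4 / 6  ~ 0.667
print(precision, recall, specificity)
```

An ROC curve is then traced by plotting sensitivity against 1 − specificity as the classifier's decision threshold varies; the AUC summarizes the curve as a single number.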

https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

[ROC curve figure: y-axis = Sensitivity, x-axis = 1 - Specificity]