1 of 35

Machine Learning

DeapSECURE module 3

https://deapsecure.gitlab.io/deapsecure-lesson03-ml/

2 of 35

Machine Learning at a Glance

Face unlock in Smartphones

Natural language processing

E-mail spam alert

Play Store recommendations

Autonomous driving

3 of 35

DeepFake

4 of 35

Types of Machine Learning Models

Regression

Classification

Supervised Learning

Unsupervised Learning

Semi-supervised Learning

Reinforcement Learning

5 of 35

Supervised Learning

Regression

Classification

  • Fit the training data to a continuous function
  • Predict results within a continuous output
  • E.g., predicting the price of a house from its size (see the sketch below)
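
A minimal supervised-regression sketch with scikit-learn; the house sizes and prices below are made-up numbers, purely for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Illustrative data: house size in m^2 vs. price in $1000s (made up).
    sizes = np.array([[50.0], [80.0], [100.0], [120.0]])
    prices = np.array([150.0, 240.0, 290.0, 360.0])

    model = LinearRegression()
    model.fit(sizes, prices)          # fit a continuous function to the data
    print(model.predict([[90.0]]))    # predict the price of a 90 m^2 house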

6 of 35

Unsupervised/Semi-supervised Learning

Unsupervised Learning

Semi-supervised Learning

  • Training on data without labels
  • E.g., clustering algorithms (see the sketch below)
  • E.g., DeepFake
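
A minimal unsupervised-learning sketch: k-means clustering groups made-up 2-D points without ever seeing a label:

    import numpy as np
    from sklearn.cluster import KMeans

    # Two visually obvious groups of points (illustrative data).
    points = np.array([[1, 1], [1.5, 2], [1, 2], [8, 8], [8.5, 9], [9, 8]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)    # cluster assignments, found without any labels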

7 of 35

Reinforcement Learning

  • Getting an agent to act in an environment so as to maximize its rewards
  • Goal: figure out the next correct action that takes the agent to the next step of the process (see the toy sketch below)
  • E.g., Go (AlphaGo), video game AI, robot navigation
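
A toy sketch of the idea, assuming a made-up 5-state corridor where the agent earns a reward only for reaching the rightmost state (tabular Q-learning; the states, rewards, and hyperparameters are all illustrative):

    import random

    n_states, goal = 5, 4
    Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[state][action]: 0=left, 1=right
    alpha, gamma, epsilon = 0.5, 0.9, 0.1

    for episode in range(200):
        s = 0
        while s != goal:
            # Epsilon-greedy action choice (random on ties, for early exploration).
            if random.random() < epsilon or Q[s][0] == Q[s][1]:
                a = random.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            r = 1.0 if s2 == goal else 0.0
            # Q-learning update: nudge Q toward reward + discounted best future value.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2

    print(Q)    # "move right" ends up with the higher value in every state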

8 of 35

General Process in Machine Learning

Involves data preprocessing

Need to determine the model parameters

9 of 35

What Is a Machine Learning Model?

Illustration of an ML model

ML Model = mathematical function to perform a certain task

  • Continuous vs Discrete

  • Regression vs Classification

Parameters yet to be determined (example: w(1), w(2), …)

What makes machine learning powerful is that the model contains parameters that can be systematically improved according to a prescribed algorithm.
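
A minimal sketch of this idea on made-up data: the model is a function with two parameters (w and b), and gradient descent is the prescribed algorithm that systematically improves them:

    import numpy as np

    # Illustrative data generated by y = 2x + 1.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.0, 5.0, 7.0, 9.0])

    w, b, lr = 0.0, 0.0, 0.05                  # parameters start out "undetermined"
    for step in range(2000):
        pred = w * x + b                       # the model: a mathematical function
        grad_w = 2 * np.mean((pred - y) * x)   # gradient of the mean squared error
        grad_b = 2 * np.mean(pred - y)
        w -= lr * grad_w                       # the systematic improvement rule
        b -= lr * grad_b

    print(w, b)    # converges near w = 2, b = 1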

10 of 35

Phases for Machine Learning Models

Training Phase

Testing Phase & Deployment

11 of 35

Case Study: Classification of Smartphone Applications

Dataset: SherLock

Detailed phone information

  • System stats
  • App stats
  • Network-related stats
  • Sensor readings (including GPS)

Goal for this workshop: develop simple machine learning models to predict the name of the smartphone app.

12 of 35

Data Preparation!

Data Wrangling & Visualization

13 of 35

Data Wrangling

The importance of good data:

  • Data => analysis result => actionable decision
  • Bad data => bad analysis => bad decision

Data Wrangling:

  • Understand the nature of data (type, meaning, values)
  • Identify & address issues in data
  • Make data suitable for processing

Goal: clean, consistent, and processable data ⇒ input for further analysis such as machine learning

Understand each Feature

Remove Bad Data

Deal with Missing Data

14 of 35

Types of Data — A Data Scientist’s View

Numerical vs. Categorical

  • Numerical
    • Number is assigned as a quantitative value
    • Ex: memory usage, number of threads
  • Categorical
    • Defined by the classes (categories) into which its values may fall
    • Ex: application name, eye color

Discrete vs. Continuous

  • Discrete
    • Can only take on certain discrete values
  • Continuous
    • Can take on any value between the highest and lowest point on the scale

Qualitative vs. Quantitative

  • Qualitative
    • Values or categories are not described as numbers, but as verbal groupings
  • Quantitative
    • Described with numerical quantities
    • Numerical differences among values have quantitative meaning
    • Ex: price, temperature

15 of 35

Issues with Data

  • Missing data
  • Irrelevant data
  • Outliers
  • Duplicate data (features / rows)
  • Inconsistent data
  • Formatting / representational issues

Address issues in data (see the pandas sketch below):

  • Replace missing/bad values (imputation)
  • Remove rows/columns with missing/bad values
  • Important: must avoid introducing bias in the data & analysis
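
A small pandas sketch of the two main remedies, on a made-up DataFrame:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"cpu": [0.2, np.nan, 0.4], "mem": [100, 120, np.nan]})

    print(df.isna().sum())           # count missing values per column
    dropped = df.dropna()            # remedy 1: remove incomplete rows
    imputed = df.fillna(df.mean())   # remedy 2: impute with column means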

16 of 35

Visualization

Powerful aid for discovery and comprehension:

  • Patterns, relationships, trends in data
  • Issues in data

Helps you “see” many numbers in a quick glance!

17 of 35

About SherLock “2-apps” dataset

  • Resource utilization dataset for two apps: Facebook and WhatsApp
  • Nearly 800k rows
  • 14 columns

18 of 35

Questions to Explore

  1. Statistical properties of the individual features: mean, spread, distribution
  2. Identify issues with the data:
    • useless features
    • missing data
    • duplicate data
    • outliers
  3. Extract some basic statistics about the applications:
    • which app uses more CPU cycles on average?
    • which app uses more memory on average?
  4. Are there correlations among the features in the dataset? If so, what do they look like?

19 of 35

Hands-on

Go & Explore the SherLock Dataset!

20 of 35

Obtaining Hands-On Files

You have a home directory => your own storage

/home/YOUR_MIDAS_ID

Copying Hands-On Files:

  • Create a new folder: CItraining
  • Enter the CItraining folder

/shared/DeapSECURE/install-modules

From Jupyter: prepend “!” to run shell commands
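
As a rough sketch (assuming the path above is an executable setup script, which is not verified here), the steps could be run from a Jupyter cell; "!" sends a line to the shell, and "%cd" changes the notebook's working directory:

    !mkdir -p ~/CItraining
    %cd ~/CItraining
    !/shared/DeapSECURE/install-modules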

21 of 35

Machine Learning Workflow

22 of 35

Practical Tool to Build Machine Learning Models

https://scikit-learn.org/stable/

23 of 35

Data Preprocessing

  • Remove Irrelevant Features: use the .drop() function
  • Deal with Missing Data: use the .isna().sum() and .dropna() functions
  • Remove Duplicate Features
  • Separate Labels from Features
  • Data Normalization: use sklearn.preprocessing.StandardScaler() and .fit()
  • Label Encoding / One-Hot Encoding: use sklearn.preprocessing.LabelEncoder() and .fit()
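
A minimal preprocessing sketch tying these steps together; the file name ("sherlock_2apps.csv") and column names ("Unused", "ApplicationName") are placeholders, not the real SherLock schema:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    df = pd.read_csv("sherlock_2apps.csv")            # hypothetical file name

    df = df.drop(columns=["Unused"])                  # remove an irrelevant feature
    df = df.dropna()                                  # drop rows with missing data
    labels = df["ApplicationName"]                    # separate labels ...
    features = df.drop(columns=["ApplicationName"])   # ... from features

    scaler = StandardScaler().fit(features)           # normalize: zero mean, unit variance
    features_scaled = scaler.transform(features)

    encoder = LabelEncoder().fit(labels)              # encode app names as integers
    labels_encoded = encoder.transform(labels)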

24 of 35

Machine Learning Steps

  • Split data into “Train” and “Test” sets: sklearn.model_selection.train_test_split(...)
  • Build model (see next slides), e.g.: model_lr = LogisticRegression(solver="lbfgs")
  • Train model: model_lr.fit(train_features, train_labels)
  • Evaluate model: sklearn.metrics.accuracy_score(), sklearn.metrics.confusion_matrix()
  • Use the model: model_lr.predict(new_data_features)
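
A runnable end-to-end sketch of these five steps; it uses scikit-learn's built-in iris data as a stand-in, since the real inputs would be the preprocessed SherLock features and labels:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)    # stand-in dataset

    # 1. split, 2. build, 3. train
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model_lr = LogisticRegression(solver="lbfgs", max_iter=1000)
    model_lr.fit(X_train, y_train)

    # 4. evaluate on the held-out test set
    pred = model_lr.predict(X_test)
    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))

    # 5. use the trained model on new data
    print(model_lr.predict(X_test[:3]))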

25 of 35

Logistic Regression

  • Models the probabilities for classification problems
  • The logistic function squeezes the output between 0 and 1
  • Can be extended to the multi-class case as multinomial regression

  • Advantages:
    • In a low dimensional dataset, it is less prone to over-fitting
  • Disadvantages:
    • Constructs only a linear decision surface, so it cannot capture nonlinear class boundaries

  • Application: Identify Spam / not Spam

  • Scikit-learn class: sklearn.linear_model.LogisticRegression

Figure: Logistic regression of iris plant species based on two features. (Source: scikit-learn)

26 of 35

Decision Tree

  • Predicts by learning simple decision rules
  • Transforms the data into a tree representation
  • Each internal node denotes an attribute, and each leaf node denotes a class label

  • Advantages:
    • The output of decision trees is easy to interpret
    • Classifies using both numerical and categorical variables
    • Resistant to outliers, hence requires little data preprocessing
  • Disadvantages:
    • Computationally expensive and time-consuming
    • Prone to overfitting

  • Application: Recommendation system

  • Scikit-learn class: sklearn.tree.DecisionTreeClassifier
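
A minimal decision-tree sketch on made-up points; export_text() prints the learned if/then rules, which is why trees are easy to interpret:

    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[0, 0], [1, 1], [0, 1], [2, 2], [3, 2], [2, 3]]
    y = [0, 0, 0, 1, 1, 1]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree))           # the interpretable decision rules
    print(tree.predict([[2.5, 2.5]]))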

27 of 35

Logistic Regression or Decision Tree?

  • Logistic regression is better suited to classification tasks that are linearly separable in the feature space, while a decision tree handles nonlinear class boundaries better. So if you are confident the dataset is linearly separable, try logistic regression first; otherwise try a decision tree first.

  • If the output is a category, both models apply, and a decision tree is a reasonable first try. If the output is a continuous quantity, use a regression model such as linear regression instead; despite its name, logistic regression is a classifier, not a regressor.

28 of 35

Support Vector Machine

  • Maximizes the margin of the classifier

  • Creates the best line or decision boundary to segregate the n-dimensional feature space into classes
  • Works really well when there is a clear margin of separation between classes

  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient

  • Does not perform well on large datasets, because the required training time grows quickly
  • Also does not perform well when the dataset is noisy, i.e., when the target classes overlap (see the sketch below)

29 of 35

Metrics to Assess the Machine Learning Models

Confusion Matrix

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

  • An N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes.

  • For a binary classification problem:
    • The target variable has two values: Positive or Negative
    • The columns represent the actual values
    • The rows represent the predicted values

Precision tells us how many of the cases predicted as positive are actually positive.

Recall tells us how many of the actual positive cases we were able to predict correctly with our model.

Accuracy tells us what fraction of all cases are predicted correctly.
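
A sketch computing these metrics with scikit-learn on made-up labels (1 = positive, 0 = negative); note that scikit-learn's confusion_matrix puts actual values in rows and predictions in columns:

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score)

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

    print(confusion_matrix(y_true, y_pred))
    print(accuracy_score(y_true, y_pred))    # (TP + TN) / all cases
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)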

30 of 35

Metrics to Assess the Machine Learning Models

Bias: how far the average prediction is from the actual values

Variance: how scattered the predicted values are around their average

[Figure: the four combinations of high/low bias and high/low variance]

31 of 35

Hands-On

Build and Train a Machine Learning Model!

32 of 35

Feature Selection

Histograms

  • Use .hist() function

Correlation

  • Use .corr() function

Simple Group Analysis

  • Use .groupby() function

Data Visualization

  • Use plt.hist() function
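
A sketch of these four exploration steps on a small made-up DataFrame; the column names ("app", "cpu", "mem") are placeholders, not the SherLock schema:

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.DataFrame({"app": ["fb", "wa", "fb", "wa"],
                       "cpu": [0.4, 0.1, 0.5, 0.2],
                       "mem": [300, 120, 320, 130]})

    df.hist(column="cpu")             # histogram of one feature
    print(df[["cpu", "mem"]].corr())  # pairwise feature correlations
    print(df.groupby("app").mean())   # simple group analysis per app
    plt.hist(df["cpu"], bins=4)       # the same histogram via matplotlib
    plt.show()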

33 of 35

Thank You!

The DeapSECURE team

PI: Dr. Hongyi Wu (ECE), Dr. Masha Sosonkina (CMSE), Dr. Wirawan Purwanto (ITS)

Assessor: Dr. Karina Arcaute

Assistants:

Qiao Zhang, Liuwan Zhu, Jacob Strother, Rosby Asiamah, Yuming He

Funding: NSF OAC grant #1829771

34 of 35

Regression

Linear Regression:

  • Interpolates between the data points
  • Minimizes the distances between the points and the fitted hyperplane
  • Can do linear interpolation and the corresponding prediction
  • Not suitable for classification, as it does not output probabilities
  • Scikit-learn class: sklearn.linear_model.LinearRegression

Logistic Regression:

  • Models the probabilities for classification problems
  • The logistic function squeezes the output between 0 and 1
  • Can be extended to the multi-class case as multinomial regression
  • Has a linear decision surface
  • On a low-dimensional dataset, it is less prone to over-fitting
  • Scikit-learn class: sklearn.linear_model.LogisticRegression

35 of 35

Support Vector Machines

Strengths:

  • Finds boundaries: either a linear hyperplane or a complex curved surface
  • Separates categories with a hyperplane, even in high-dimensional feature spaces