1 of 35

Machine Learning

DeapSECURE module 3

https://deapsecure.gitlab.io/deapsecure-lesson03-ml/

2 of 35

Machine Learning at a Glance

Face unlock in Smartphones

Natural language processing

E-mail spam alert

Play Store recommendations

Autonomous driving

3 of 35

DeepFake

4 of 35

Types of Machine Learning Models

Regression

Classification

Supervised Learning

Unsupervised Learning

Semi-supervised Learning

Reinforcement Learning

5 of 35

Supervised Learning

Regression

Classification

  • Fit the training data to a continuous function
  • Predict results within a continuous output
  • E.g., predicting the price of a house from its size (see the sketch below)
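
A minimal supervised-regression sketch with scikit-learn; the house sizes and prices below are made-up numbers, purely for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Illustrative data: house size in m^2 vs. price in $1000s (made up).
    sizes = np.array([[50.0], [80.0], [100.0], [120.0]])
    prices = np.array([150.0, 240.0, 290.0, 360.0])

    model = LinearRegression()
    model.fit(sizes, prices)          # fit a continuous function to the data
    print(model.predict([[90.0]]))    # predict the price of a 90 m^2 house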

6 of 35

Unsupervised/Semi-supervised Learning

Unsupervised Learning

Semi-supervised Learning

  • Training on data without labels
  • E.g., clustering algorithms (see the sketch below)
  • E.g., DeepFake
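
A minimal unsupervised-learning sketch: k-means clustering groups made-up 2-D points without ever seeing a label:

    import numpy as np
    from sklearn.cluster import KMeans

    # Two visually obvious groups of points (illustrative data).
    points = np.array([[1, 1], [1.5, 2], [1, 2], [8, 8], [8.5, 9], [9, 8]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)    # cluster assignments, found without any labels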

7 of 35

Reinforcement Learning

  • Getting an agent to act in an environment so as to maximize its rewards
  • Goal: figure out the next correct action that takes the agent to the next step of the process (see the toy sketch below)
  • E.g., Go (AlphaGo), video game AI, robot navigation
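
A toy sketch of the idea, assuming a made-up 5-state corridor where the agent earns a reward only for reaching the rightmost state (tabular Q-learning; the states, rewards, and hyperparameters are all illustrative):

    import random

    n_states, goal = 5, 4
    Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[state][action]: 0=left, 1=right
    alpha, gamma, epsilon = 0.5, 0.9, 0.1

    for episode in range(200):
        s = 0
        while s != goal:
            # Epsilon-greedy action choice (random on ties, for early exploration).
            if random.random() < epsilon or Q[s][0] == Q[s][1]:
                a = random.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            r = 1.0 if s2 == goal else 0.0
            # Q-learning update: nudge Q toward reward + discounted best future value.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2

    print(Q)    # "move right" ends up with the higher value in every state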

8 of 35

General Process in Machine Learning

Involves data preprocessing

Need to determine the model parameters

9 of 35

What Is a Machine Learning Model?

Illustration of an ML model

ML Model = mathematical function to perform a certain task

  • Continuous vs Discrete

  • Regression vs Classification

Parameters yet to be determined (example: w(1), w(2), …)

What makes machine learning powerful is that the model contains parameters that can be systematically improved according to a prescribed algorithm.
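
A minimal sketch of this idea on made-up data: the model is a function with two parameters (w and b), and gradient descent is the prescribed algorithm that systematically improves them:

    import numpy as np

    # Illustrative data generated by y = 2x + 1.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.0, 5.0, 7.0, 9.0])

    w, b, lr = 0.0, 0.0, 0.05                  # parameters start out "undetermined"
    for step in range(2000):
        pred = w * x + b                       # the model: a mathematical function
        grad_w = 2 * np.mean((pred - y) * x)   # gradient of the mean squared error
        grad_b = 2 * np.mean(pred - y)
        w -= lr * grad_w                       # the systematic improvement rule
        b -= lr * grad_b

    print(w, b)    # converges near w = 2, b = 1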

10 of 35

Phases for Machine Learning Models

Training Phase

Testing Phase & Deployment

11 of 35

Case Study: Classification of Smartphone Applications

Dataset: SherLock

Detailed phone information

  • System stats
  • App stats
  • Network-related stats
  • Sensor readings (including GPS)

Goal for this workshop: develop simple machine learning models to predict the name of the smartphone app.

12 of 35

Data Preparation!

Data Wrangling & Visualization

13 of 35

Data Wrangling

The importance of good data:

  • Data => analysis result => actionable decision
  • Bad data => bad analysis => bad decision

Data Wrangling:

  • Understand the nature of data (type, meaning, values)
  • Identify & address issues in data
  • Make data suitable for processing

Goal: clean, consistent, and processable data ⇒ input for further analysis such as machine learning

Understand each Feature

Remove Bad Data

Deal with Missing Data

14 of 35

Types of Data — A Data Scientist’s View

Numerical vs. Categorical

  • Numerical
    • Number is assigned as a quantitative value
    • Ex: memory usage, number of threads
  • Categorical
    • Defined by the classes (categories) into which its values may fall
    • Ex: application name, eye color

Discrete vs. Continuous

  • Discrete
    • Can only take on certain discrete values
  • Continuous
    • Can take on any value between the highest and lowest point on the scale

Qualitative vs. Quantitative

  • Qualitative
    • Values or categories are not described as numbers, but as verbal groupings
  • Quantitative
    • Described with numerical quantities
    • Numerical differences among values have quantitative meaning
    • Ex: price, temperature

15 of 35

Issues with Data

  • Missing data
  • Irrelevant data
  • Outliers
  • Duplicate data (features / rows)
  • Inconsistent data
  • Formatting / representational issues

Address issues in data (see the pandas sketch below):

  • Replace missing/bad values (imputation)
  • Remove rows/columns with missing/bad values
  • Important: must avoid introducing bias in the data & analysis
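
A small pandas sketch of the two main remedies, on a made-up DataFrame:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"cpu": [0.2, np.nan, 0.4], "mem": [100, 120, np.nan]})

    print(df.isna().sum())           # count missing values per column
    dropped = df.dropna()            # remedy 1: remove incomplete rows
    imputed = df.fillna(df.mean())   # remedy 2: impute with column means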

16 of 35

Visualization

Powerful aid for discovery and comprehension:

  • Patterns, relationships, trends in data
  • Issues in data

Helps you “see” many numbers in a quick glance!

17 of 35

About SherLock “2-apps” dataset

  • Resource utilization dataset for two apps: Facebook and WhatsApp
  • Nearly 800k rows
  • 14 columns

18 of 35

Questions to Explore

  1. Statistical properties of the individual features: mean, spread, distribution
  2. Identify issues with the data:
    • useless features
    • missing data
    • duplicate data
    • outliers
  3. Extract some basic statistics about the applications:
    • which app uses more CPU cycles on average?
    • which app uses more memory on average?
  4. Are there correlations among the features in the dataset? If so, what do they look like?

19 of 35

Hands-on

Go & Explore the SherLock Dataset!

20 of 35

Obtaining Hands-On Files

You have a home directory => your own storage

/home/YOUR_MIDAS_ID

Copying Hands-On Files:

  • Create a new folder: CItraining
  • Enter the CItraining folder

/shared/DeapSECURE/install-modules

From Jupyter: prepend “!” to run shell commands
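
As a rough sketch (assuming the path above is an executable setup script, which is not verified here), the steps could be run from a Jupyter cell; "!" sends a line to the shell, and "%cd" changes the notebook's working directory:

    !mkdir -p ~/CItraining
    %cd ~/CItraining
    !/shared/DeapSECURE/install-modules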

21 of 35

Machine Learning Workflow

22 of 35

Practical Tool to Build Machine Learning Models

https://scikit-learn.org/stable/

23 of 35

Data Preprocessing

  • Remove Irrelevant Features: use the .drop() function
  • Deal with Missing Data: use the .isna().sum() and .dropna() functions
  • Remove Duplicate Features
  • Separate Labels from Features
  • Data Normalization: use sklearn.preprocessing.StandardScaler() and .fit()
  • Label Encoding / One-Hot Encoding: use sklearn.preprocessing.LabelEncoder() and .fit()
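
A minimal preprocessing sketch tying these steps together; the file name ("sherlock_2apps.csv") and column names ("Unused", "ApplicationName") are placeholders, not the real SherLock schema:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    df = pd.read_csv("sherlock_2apps.csv")            # hypothetical file name

    df = df.drop(columns=["Unused"])                  # remove an irrelevant feature
    df = df.dropna()                                  # drop rows with missing data
    labels = df["ApplicationName"]                    # separate labels ...
    features = df.drop(columns=["ApplicationName"])   # ... from features

    scaler = StandardScaler().fit(features)           # normalize: zero mean, unit variance
    features_scaled = scaler.transform(features)

    encoder = LabelEncoder().fit(labels)              # encode app names as integers
    labels_encoded = encoder.transform(labels)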

24 of 35

Machine Learning Steps

  • Split data into “Train” and “Test” sets: sklearn.model_selection.train_test_split(...)
  • Build model (see next slides), e.g.: model_lr = LogisticRegression(solver="lbfgs")
  • Train model: model_lr.fit(train_features, train_labels)
  • Evaluate model: sklearn.metrics.accuracy_score(), sklearn.metrics.confusion_matrix()
  • Use the model: model_lr.predict(new_data_features)
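
A runnable end-to-end sketch of these five steps; it uses scikit-learn's built-in iris data as a stand-in, since the real inputs would be the preprocessed SherLock features and labels:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)    # stand-in dataset

    # 1. split, 2. build, 3. train
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model_lr = LogisticRegression(solver="lbfgs", max_iter=1000)
    model_lr.fit(X_train, y_train)

    # 4. evaluate on the held-out test set
    pred = model_lr.predict(X_test)
    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))

    # 5. use the trained model on new data
    print(model_lr.predict(X_test[:3]))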

25 of 35

Logistic Regression

  • Models the probabilities for classification problems
  • The logistic function squeezes the output between 0 and 1
  • Can be extended to the multi-class case as multinomial regression

  • Advantages:
    • In a low dimensional dataset, it is less prone to over-fitting
  • Disadvantages:
    • Constructs only a linear decision surface, so it cannot capture nonlinear class boundaries

  • Application: Identify Spam / not Spam

  • Scikit-learn class: sklearn.linear_model.LogisticRegression

Figure: Logistic regression of iris plant species based on two features. (Source: scikit-learn)

26 of 35

Decision Tree

  • Predicts by learning simple decision rules
  • Transforms the data into a tree representation
  • Each internal node denotes an attribute, and each leaf node denotes a class label

  • Advantages:
    • The output of decision trees is easy to interpret
    • Classifies using both numerical and categorical variables
    • Resistant to outliers, hence requires little data preprocessing
  • Disadvantages:
    • Computationally expensive and time-consuming
    • Prone to overfitting

  • Application: Recommendation system

  • Scikit-learn class: sklearn.tree.DecisionTreeClassifier
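
A minimal decision-tree sketch on made-up points; export_text() prints the learned if/then rules, which is why trees are easy to interpret:

    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[0, 0], [1, 1], [0, 1], [2, 2], [3, 2], [2, 3]]
    y = [0, 0, 0, 1, 1, 1]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree))           # the interpretable decision rules
    print(tree.predict([[2.5, 2.5]]))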

27 of 35

Logistic Regression or Decision Tree?

  • Logistic regression is better suited to classification tasks that are linearly separable in the feature space, while a decision tree handles nonlinear class boundaries better. So if you are confident the dataset is linearly separable, try logistic regression first; otherwise try a decision tree first.

  • If the output is a category, both models apply, and a decision tree is a reasonable first try. If the output is a continuous quantity, use a regression model such as linear regression instead; despite its name, logistic regression is a classifier, not a regressor.

28 of 35

Support Vector Machine

  • Maximizes the margin of the classifier

  • Creates the best line or decision boundary to segregate the n-dimensional feature space into classes
  • Works really well when there is a clear margin of separation between classes

  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient

  • Does not perform well on large datasets, because the required training time grows quickly
  • Also does not perform well when the dataset is noisy, i.e., when the target classes overlap (see the sketch below)

29 of 35

Metrics to Assess the Machine Learning Models

Confusion Matrix

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

  • An N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes.

  • For a binary classification problem:
    • The target variable has two values: Positive or Negative
    • The columns represent the actual values
    • The rows represent the predicted values

Precision tells us how many of the cases predicted as positive are actually positive.

Recall tells us how many of the actual positive cases we were able to predict correctly with our model.

Accuracy tells us what fraction of all cases are predicted correctly.
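
A sketch computing these metrics with scikit-learn on made-up labels (1 = positive, 0 = negative); note that scikit-learn's confusion_matrix puts actual values in rows and predictions in columns:

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score)

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

    print(confusion_matrix(y_true, y_pred))
    print(accuracy_score(y_true, y_pred))    # (TP + TN) / all cases
    print(precision_score(y_true, y_pred))   # TP / (TP + FP)
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)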

30 of 35

Metrics to Assess the Machine Learning Models

Bias: how far the average prediction is from the actual values

Variance: how scattered the predicted values are around their average

[Figure: the four combinations of high/low bias and high/low variance]

31 of 35

Hands-On

Build and Train a Machine Learning Model!

32 of 35

Feature Selection

Histograms

  • Use .hist() function

Correlation

  • Use .corr() function

Simple Group Analysis

  • Use .groupby() function

Data Visualization

  • Use plt.hist() function
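
A sketch of these four exploration steps on a small made-up DataFrame; the column names ("app", "cpu", "mem") are placeholders, not the SherLock schema:

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.DataFrame({"app": ["fb", "wa", "fb", "wa"],
                       "cpu": [0.4, 0.1, 0.5, 0.2],
                       "mem": [300, 120, 320, 130]})

    df.hist(column="cpu")             # histogram of one feature
    print(df[["cpu", "mem"]].corr())  # pairwise feature correlations
    print(df.groupby("app").mean())   # simple group analysis per app
    plt.hist(df["cpu"], bins=4)       # the same histogram via matplotlib
    plt.show()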

33 of 35

Thank You!

The DeapSECURE team

PI: Dr. Hongyi Wu (ECE), Dr. Masha Sosonkina (CMSE), Dr. Wirawan Purwanto (ITS)

Assessor: Dr. Karina Arcaute

Assistants:

Qiao Zhang, Liuwan Zhu, Jacob Strother, Rosby Asiamah, Yuming He

Funding: NSF OAC grant #1829771

34 of 35

Regression

Linear Regression:

  • Interpolates between the data points
  • Minimizes the distances between the points and the fitted hyperplane
  • Can do linear interpolation and the corresponding prediction
  • Not suitable for classification, as it does not output probabilities
  • Scikit-learn class: sklearn.linear_model.LinearRegression

Logistic Regression:

  • Models the probabilities for classification problems
  • The logistic function squeezes the output between 0 and 1
  • Can be extended to the multi-class case as multinomial regression
  • Has a linear decision surface
  • On a low-dimensional dataset, it is less prone to over-fitting
  • Scikit-learn class: sklearn.linear_model.LogisticRegression

35 of 35

Support Vector Machines

Strengths:

  • Finds boundaries: either a linear hyperplane or a complex curved surface
  • Separates categories with a hyperplane, even in high-dimensional feature spaces