1 of 110

Data Science Application of Artificial Intelligence/Machine Learning

Saurabh Srivastava

saurabhnitkian@gmail.com

+xxxxxxxxxxxxxxx

vscentrum.be

2 of 110

Part-1:

Data Science


3 of 110

DATA


There is not a single big industry that does not rely on data and the insights gained through them

4 of 110

Part-1:

The Theory Behind


5 of 110

Introduction

    • Machine Learning: a subfield of Artificial Intelligence and Computer Science
      • It has imported much relevant knowledge from statistics and probability theory
      • It has made computer scientists better able to handle data-analysis problems


[Diagram: Data Science at the intersection of Statistics, Big Data (incremental user data, incremental sensor data, large distributed databases), Artificial Intelligence, and Machine Learning.]

6 of 110

Relation: Data Science and Big Data

  • Machine Learning mostly depends on inferential statistics, which draws conclusions about populations from studies of samples, in contrast to descriptive statistics, which primarily summarizes samples.
  • Data Science, as for Statistics, is assumed to cover:
    • Data collection/data capturing/data harvesting
    • Data modeling
    • Data maintenance
    • Data analysis/data processing
    • Visualization/presentation of data, and decision-making based on data
  • Big Data (TB/ZB; variety, quality, speed) primarily refers to the storage, maintenance, and access to data.
    • The Big Data area builds on more traditional areas such as very large databases, data warehousing, and distributed databases.


7 of 110

Some Background….

  • Artificial Intelligence has 62-year-old roots
    • The area was named and defined at a summer workshop in 1956.
    • This happened little more than a decade after the advent of the first computer.
    • A small group of computer scientists gathered at Dartmouth College in New Hampshire, US.

Agenda: "The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.

An attempt will be made to find out how to make machines:

  • use language
  • form abstractions and concepts
  • solve kinds of problems now reserved for humans and improve themselves"


8 of 110

Some Background….


Founding Fathers of Artificial Intelligence in 1956

Claude Shannon: Founder of Information and Communication Theory
D.M. Mackay: British researcher in Information Theory and brain organization
Julian Bigelow: Chief engineer for the von Neumann computer at Princeton in 1946
Nathaniel Rochester: Author of the first assembler for the first commercial computer
Oliver Selfridge: Named 'the father of Machine Perception'
Ray Solomonoff: Inventor of Algorithmic Probability
John Holland: The inventor of Genetic Algorithms
Marvin Minsky: Key MIT researcher in the early development of AI
Allen Newell: Champion of symbolic AI and inventor of central AI techniques
Herbert Simon: Pioneer in decision-making theory and a Nobel Prize winner
John McCarthy: Inventor of the LISP programming language

9 of 110

Some Background….

  • In 1955 Allen Newell, Herbert A. Simon and Cliff Shaw created the Logic Theorist, the first program deliberately engineered to mimic the problem-solving skills of a human being. It is called "the first Artificial Intelligence program". It could prove theorems in Whitehead and Russell's Principia Mathematica, and it introduced key artificial intelligence techniques such as list processing and heuristic search.
  • In 1958 John McCarthy created the first version of LISP, based on Lambda Calculus and using list processing; it is the second-oldest high-level programming language in widespread use today (only Fortran is older, by one year).
  • Oliver Selfridge created the Pandemonium architecture in 1959, one of the first computational models of pattern recognition for images.
  • In 1959 Simon, Newell and Shaw created the General Problem Solver (GPS), a computer program intended to work as a universal problem-solving machine. Any problem that can be expressed as a set of well-formed formulas (WFFs) or Horn clauses can, in principle, be solved by GPS.

  • Arthur Samuel coined the term Machine Learning in 1959.
  • McCulloch and Pitts introduced Neural Networks as a model of computation as early as 1943.
  • Marvin Minsky and Dean Edmonds built SNARC, the first Neural Network machine able to learn, in 1951.
  • Frank Rosenblatt invented the Perceptron in 1957.


10 of 110

Some Background….


LG has launched the ThinQ AI-focused TV brand.

Huawei, Samsung and Qualcomm launch AI-powered smartphones.

Burger King boosts 'AI-written' ads.

A majority of real current Artificial Intelligence success stories relate to the application of Machine Learning only!

A majority of current Machine Learning success stories relate to image and speech processing!

11 of 110

ML application sectors


General application sectors:

  • Medical diagnosis, personalized treatments and drug design
  • Driverless vehicles and household robots
  • Personal assistants, recommender systems and navigators
  • Adapting Communications and Social media services
  • Marketing and sales
  • Optimization of technical processes
  • Monitoring and surveillance
  • Financial services
  • Cyber security
  • Machine translation

Specific categories of data analysis:

  1. Image Recognition – Computer vision (image analysis for diagnosis of breast cancer)
  2. Speech Recognition (filing medical records)
  3. Data-mining for Large Datasets (large clinical databases)
  4. Text-mining of Large Document Collections (new medical publications to update medical expert systems)
  5. Dynamic adaptation of technical systems (training of robot movements for surgical robots)

12 of 110

Data Analysis for ML


Data Analysis

The End-to-end process for Real-World problems

In a typical machine learning application, practitioners must apply the appropriate:

  • Data harvesting from potentially heterogeneous sources
  • Pre-processing of data (e.g. from analog to digital form)
  • Model or theory support
  • Feature engineering
  • Algorithm selection 
  • Tailoring conditions for algorithms

(hyper-parameter settings, language biases, complexity) 

  • Core analysis phase
  • Post-processing of acquired knowledge
  • Visualization and preparation of material for online updating and decision making.


13 of 110

Regression & Classification

  • Main Scenarios for Data Analysis
    • Regression: establishing prognosis of future states
    • Classification: establishing concepts for classifying in future situations
  • Regression is a technique from statistics used to predict values of a desired target quantity when the target quantity is continuous.
  • Classification predicts discrete values: the data is categorized under different labels according to some parameters, and the labels are then predicted for new data.
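The contrast can be made concrete with a minimal scikit-learn sketch (assuming scikit-learn is installed; the four-point toy data and the 'small'/'large' labels are made up for illustration):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]

# Regression: the target is continuous, the prediction is a number.
y_cont = [1.0, 2.0, 3.0, 4.0]
reg = LinearRegression().fit(X, y_cont)
print(reg.predict([[5]]))      # a continuous prognosis, close to 5.0

# Classification: the target is a discrete label, the prediction is a label.
y_lab = ["small", "small", "large", "large"]
clf = LogisticRegression().fit(X, y_lab)
print(clf.predict([[5]]))      # a label for the new input
```

The same input data serves both scenarios; only the type of target quantity differs.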


14 of 110

Objects & Features

  • Object: Thing, Entity, Observation, Data, Data-item, Record, Tuple, Instance, Example
  • Feature: Property, Attribute, Characteristic, Variable, Output Variable, Predictor, Target, Category
  • The Object (Data, Observation) Language is the chosen language (formalism) in which objects and features are described.
  • Types of Features:
    • Ordinal (binary)
    • Discrete numerical (integers)
    • Continuous numerical (real numbers)
    • Symbolic
    • Structural (e.g. graphs or lists)

    • ZOO dataset (from UCI ML repository):
    • Naive and partial classification of animals
    • 107 objects characterized by 18 features classified in 7 categories


Category structure (Animal)

Mammal(#1)

Bird(#2)

Reptile(#3)

Fish(#4)

Amphibian(#5)

Insect(#6)

Invertebrate(#7)

15 of 110

Object and Feature

Synonyms for Object: Thing, Entity, Observation, Data, Data-item, Record, Tuple, Row, Vector, Instance (training instance), Example (training example)

Synonyms for Feature: Property, Attribute, Characteristic, Field, Column, Variable (Output Variable), Independent variable (Predictor Variable), Target or Category feature



16 of 110


Object space: also called Instance space or Population.

Subsets of the object space available for learning: Sample (Training sample, Statistical sample); Data-set (Table, Array); Training example set.

17 of 110


Example from the ZOO Dataset

The Object space or population is the set of all potential feature vectors with feature values as can be expressed in the ZOO object language.

The Sample or Data-set is the whole set of ZOO feature vectors.

The Extension of the Concept of buffalo is the set of all buffalos in real life.

18 of 110

Objects & Features


Features

animal_name, hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize, class_type.

All features are Boolean except animal_name, which is text, and class_type and legs, which are integers.

Example from the ZOO Dataset: buffalo,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1

Example of Object and Feature vector

The Object Language in this case is the specific formalism for specifying Feature vectors

animal_name buffalo category symbolic feature

Hair 1 predictor ordinal feature

Feathers 0 predictor ordinal feature

Eggs 0 predictor ordinal feature

Milk 1 predictor ordinal feature

Airborne 0 predictor ordinal feature

Aquatic 0 predictor ordinal feature

Predator 0 predictor ordinal feature

Toothed 1 predictor ordinal feature

Backbone 1 predictor ordinal feature

Breathes 1 predictor ordinal feature

Venomous 0 predictor ordinal feature

Fins 0 predictor ordinal feature

Legs 4 predictor discrete numerical feature

Tail 1 predictor ordinal feature

Domestic 0 predictor ordinal feature

Catsize 1 predictor ordinal feature

class_type 1 category discrete numerical feature

19 of 110

Generalization


[Diagram: an Object is an instance of a Category definition. The subset of the Data-set consistent with the category definition is, in turn, a subset of the subset of the Object space consistent with the category definition (relations shown: instance-of, element-of, subset-of).]

20 of 110


Example from the ZOO Dataset

Example of a concept definition. The Hypothesis Language is the same as the Object Language, apart from the introduction of a wildcard (?) for ordinal feature values.

fish 0,0,1,0,0,1,?,1,1,0,?,1,0,1,?,?,4

tuna, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4

stingray, 0,0,1,0,0,1,1,1,1,0,1,1,0,1,0,1,4

seahorse, 0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4

pike, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4

piranha, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4

herring, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4

haddock, 0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4

dogfish, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4

chub, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4

catfish, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4

carp, 0,0,1,0,0,1,0,1,1,0,0,1,0,1,1,0,4

bass, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4

21 of 110

The ZOO dataset (107 Objects)


aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1 antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1 boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 buffalo,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 calf,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,1,1 carp,0,0,1,0,0,1,0,1,1,0,0,1,0,1,1,0,4 catfish,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 cavy,1,0,0,1,0,0,0,1,1,1,0,0,4,0,1,0,1 cheetah,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 chicken,0,1,1,0,1,0,0,0,1,1,0,0,2,1,1,0,2 chub,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 clam,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,7 crab,0,0,1,0,0,1,1,0,0,0,0,0,4,0,0,0,7 crayfish,0,0,1,0,0,1,1,0,0,0,0,0,6,0,0,0,7 crow,0,1,1,0,1,0,1,0,1,1,0,0,2,1,0,0,2 deer,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 dogfish,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4 dolphin,0,0,0,1,0,1,1,1,1,1,0,1,0,1,0,1,1 dove,0,1,1,0,1,0,0,0,1,1,0,0,2,1,1,0,2 duck,0,1,1,0,1,1,0,0,1,1,0,0,2,1,0,0,2 elephant,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 flamingo,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,1,2 flea,0,0,1,0,0,0,0,0,0,1,0,0,6,0,0,0,6 frog,0,0,1,0,0,1,1,1,1,1,0,0,4,0,0,0,5 frog,0,0,1,0,0,1,1,1,1,1,1,0,4,0,0,0,5 fruitbat,1,0,0,1,1,0,0,1,1,1,0,0,2,1,0,0,1 giraffe,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 girl,1,0,0,1,0,0,1,1,1,1,0,0,2,0,1,1,1 gnat,0,0,1,0,1,0,0,0,0,1,0,0,6,0,0,0,6 goat,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,1,1 gorilla,1,0,0,1,0,0,0,1,1,1,0,0,2,0,0,1,1 gull,0,1,1,0,1,1,1,0,1,1,0,0,2,1,0,0,2 haddock,0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4 hamster,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,0,1 hare,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,0,1 hawk,0,1,1,0,1,0,1,0,1,1,0,0,2,1,0,0,2 herring,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 honeybee,1,0,1,0,1,0,0,0,0,1,1,0,6,0,1,0,6 housefly,1,0,1,0,1,0,0,0,0,1,0,0,6,0,0,0,6 kiwi,0,1,1,0,0,0,1,0,1,1,0,0,2,1,0,0,2 ladybird,0,0,1,0,1,0,1,0,0,1,0,0,6,0,0,0,6 lark,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2 leopard,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 lion,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 lobster,0,0,1,0,0,1,1,0,0,0,0,0,6,0,0,0,7 lynx,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 mink,1,0,0,1,0,1,1,1,1,1,0,0,4,1,0,1,1 
mole,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,0,1 mongoose,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 moth,1,0,1,0,1,0,0,0,0,1,0,0,6,0,0,0,6 newt,0,0,1,0,0,1,1,1,1,1,0,0,4,1,0,0,5 octopus,0,0,1,0,0,1,1,0,0,0,0,0,8,0,0,1,7 opossum,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,0,1 oryx,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 ostrich,0,1,1,0,0,0,0,0,1,1,0,0,2,1,0,1,2 parakeet,0,1,1,0,1,0,0,0,1,1,0,0,2,1,1,0,2 penguin,0,1,1,0,0,1,1,0,1,1,0,0,2,1,0,1,2 pheasant,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2 pike,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4 piranha,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 pitviper,0,0,1,0,0,0,1,1,1,1,1,0,0,1,0,0,3 platypus,1,0,1,1,0,1,1,0,1,1,0,0,4,1,0,1,1 polecat,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 pony,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,1,1 porpoise,0,0,0,1,0,1,1,1,1,1,0,1,0,1,0,1,1 puma,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 pussycat,1,0,0,1,0,0,1,1,1,1,0,0,4,1,1,1,1 raccoon,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 reindeer,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,1,1 rhea,0,1,1,0,0,0,1,0,1,1,0,0,2,1,0,1,2 scorpion,0,0,0,0,0,0,1,0,0,1,1,0,8,1,0,0,7 seahorse,0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4 seal,1,0,0,1,0,1,1,1,1,1,0,1,0,0,0,1,1 sealion,1,0,0,1,0,1,1,1,1,1,0,1,2,1,0,1,1 seasnake,0,0,0,0,0,1,1,1,1,0,1,0,0,1,0,0,3 seawasp,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,7 skimmer,0,1,1,0,1,1,1,0,1,1,0,0,2,1,0,0,2 skua,0,1,1,0,1,1,1,0,1,1,0,0,2,1,0,0,2 slowworm,0,0,1,0,0,0,1,1,1,1,0,0,0,1,0,0,3 slug,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,7 sole,0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4 sparrow,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2 squirrel,1,0,0,1,0,0,0,1,1,1,0,0,2,1,0,0,1 starfish,0,0,1,0,0,1,1,0,0,0,0,0,5,0,0,0,7 stingray,0,0,1,0,0,1,1,1,1,0,1,1,0,1,0,1,4 swan,0,1,1,0,1,1,0,0,1,1,0,0,2,1,0,1,2 termite,0,0,1,0,0,0,0,0,0,1,0,0,6,0,0,0,6 toad,0,0,1,0,0,1,0,1,1,1,0,0,4,0,0,0,5 tortoise,0,0,1,0,0,0,0,0,1,1,0,0,4,1,0,1,3 tuatara,0,0,1,0,0,0,1,1,1,1,0,0,4,1,0,0,3 tuna,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4 vampire,1,0,0,1,1,0,0,1,1,1,0,0,2,1,0,0,1 vole,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,0,1 vulture,0,1,1,0,1,0,1,0,1,1,0,0,2,1,0,1,2 wallaby,1,0,0,1,0,0,0,1,1,1,0,0,2,1,0,1,1 
wasp,1,0,1,0,1,0,0,0,0,1,1,0,6,0,0,0,6 wolf,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 worm,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,7 wren,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2

22 of 110

Classification Task

  • In terms of Features, Feature vectors, and the Object (Feature) Space, the classical way of viewing a scenario for a learning task is to:
    • Define an appropriate set of Features
    • View each data-item as a Feature vector
    • Consider the Feature (Object) Space spanned by the Features
    • Populate the Feature space with the Feature vectors (data-items)
    • Find optimal multi-dimensional surfaces (hyperplanes) in the Object Space that circumscribe the extensions of all concepts involved
  • The engineering of Features is crucial for the complexity of the Object Space and, as a consequence, also crucial for the complexity of the learning problem.
  • Very often, data-items are of a non-digital nature, and relevant features need to be extracted from the data-items as a separate process.


23 of 110

Dimensionality Reduction


[Diagram: a transformation maps the original Features (Dimension-1 … Dimension-k) to new Features' (Dimension-1' … Dimension-k').]

Principal Component Analysis: Dimension Reduction (compression etc.)

24 of 110

PCA Example


[Plot: principal components ordered by variance. PC1 has the largest variance (most information); PC7 has the smallest variance (least information).]
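A minimal PCA sketch with scikit-learn, assuming 7-dimensional data as in the plot above (the random data here is a made-up stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))   # 100 objects, 7 features
X[:, 0] *= 10                   # give one direction much larger variance

# PCA orders components by decreasing variance: PC1 carries the most
# information, the last PC the least.
pca = PCA(n_components=7).fit(X)
ratios = pca.explained_variance_ratio_
print(ratios)                   # fractions of total variance, decreasing

# Keeping only the first few components compresses the data.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)          # (100, 2)
```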

25 of 110

Feature Selection/ Reduction


Each image is a Data-item

Feature Selection:

Features can be derived in a variety of manners, ranging from totally manual, via manual/automatic hybrids, to totally automated.

In the automated case, every non-digital form of representation demands its own specialized techniques.

Dimensionality/feature reduction serves:

        • making models easier for human users to interpret
        • avoiding the curse of dimensionality
        • reducing the risk of overfitting
        • shortening the computation times of learning processes

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (typically with hundreds or thousands of dimensions)

26 of 110

Over-Fitting Vs. Under-Fitting

  • Over-fitting is the production of a model that corresponds too closely or exactly to a particular data-set, and may therefore fail to fit additional data or predict future observations reliably. An over-fitted model is a model that contains more features than can be justified by the data-set.
  • Under-fitting occurs when a set of features cannot adequately capture the available data-set. An under-fitted model is a model where some features that would appear in a correctly specified model are missing. Such a model will tend to have poor predictive performance.


27 of 110

Feature Selection Vs. Feature Extraction

  • Feature selection is the process of selecting a subset of relevant features from the original set. The three main criteria for selecting a feature are:
    • Informativeness
    • Relevance
    • Non-redundancy

  • Feature extraction is the process of deriving new features either as simple combinations of original features or as a more complex mapping from the original set to the new set.

  • In both cases, the learning task is supposed to be more tractable in the resulting feature space than in the original.
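The two approaches can be sketched side by side with scikit-learn (assumed available; the data and the choice of SelectKBest for selection and PCA for extraction are illustrative, not prescribed by the slides):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only two features carry signal

# Feature selection: keep a subset of the ORIGINAL features.
selector = SelectKBest(f_classif, k=2).fit(X, y)
print(selector.get_support())            # boolean mask over the 5 features

# Feature extraction: derive NEW features as combinations of the originals.
X_new = PCA(n_components=2).fit_transform(X)
print(X_new.shape)                       # (60, 2)
```

In both sketches the learning task continues in a 2-dimensional feature space instead of the original 5-dimensional one.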


28 of 110

Machine Learning Tasks

  • Supervised learning
    • Regression: predict numerical values
    • Classification: predict categorical values, i.e., labels
  • Unsupervised learning
    • Clustering: group data according to "distance"
    • Association: find frequent co-occurrences
    • Link prediction: discover relationships in data
    • Data reduction: project features to fewer features
  • Reinforcement learning
    • Reward-based state prediction of agent in an environment, to maximize the cumulative rewards.


29 of 110

Regression

Colorize B&W images automatically

https://tinyclouds.org/colorize/


30 of 110

Classification


31 of 110

Reinforcement learning


32 of 110

Clustering


33 of 110

Applications in Science


34 of 110

Machine Learning Algorithms


35 of 110


  • Classification techniques predict categorical responses, for example, whether an email is genuine or spam, or whether a tumor is cancerous or benign. Classification models classify input data into categories. Typical applications include medical imaging, image and speech recognition, and credit scoring.
  • Regression techniques predict continuous responses, for example, changes in temperature or fluctuations in power demand. Typical applications include electricity load forecasting and algorithmic trading.
  • Unsupervised learning finds hidden patterns or intrinsic structures in data. It is used to draw inferences from datasets consisting of input data without labeled responses. Clustering is the most common unsupervised learning technique. It is used for exploratory data analysis to find hidden patterns or groupings in data. Applications for clustering include gene sequence analysis, market research, and object recognition.

36 of 110

Not Always Perfect…..

  • Many machine learning/AI projects fail (Gartner claims 85 %)

  • Ethics, e.g., Amazon has/had sub-par employees fired by an AI automatically


37 of 110

Failure Reasons

  • Asking the wrong question
  • Trying to solve the wrong problem
  • Not having enough data
  • Not having the right data
  • Having too much data
  • Hiring the wrong people
  • Using the wrong tools
  • Not having the right model
  • Not having the right yardstick


38 of 110

Implementation

  • Programming languages
    • Python
    • R
    • C++
    • ...
  • Many libraries
    • scikit-learn
    • PyTorch
    • TensorFlow
    • Keras


scikit-learn: classic machine learning. PyTorch, TensorFlow, Keras: deep learning frameworks.

Fast-evolving ecosystem!

39 of 110

Scikit-learn

  • Nice end-to-end framework
    • Data exploration (+ pandas + holoviews)
    • Data preprocessing (+ pandas)
      • Cleaning/missing values
      • Normalization
    • Training
    • Testing
    • Application
  • "Classic" machine learning only
  • https://scikit-learn.org/stable/


40 of 110

Keras (TensorFlow)

  • High-level framework for deep learning
  • TensorFlow backend
  • Layer types
    • Dense
    • Convolutional
    • Pooling
    • Embedding
    • Recurrent
    • Activation
  • https://keras.io/


41 of 110

Procedure

  • Data ingestion
    • CSV/JSON/XML/H5 files, RDBMS, NoSQL, HTTP,...
  • Data cleaning
    • Outliers/invalid values? → filter
    • Missing values? → impute
  • Data transformation
    • Scaling/Normalization


Must be done systematically

42 of 110

Supervised Learning: Methodology

  • Select model, e.g., random forest, (deep) neural network, ...
  • Train model, i.e., determine parameters
    • Data: input + output
      • Training data → determine model parameters
      • Validation data → yardstick to avoid overfitting
  • Test model
    • Data: input + output
      • Testing data → final scoring of the model
  • Production
    • Data: input → predict output


43 of 110

From Neurons to ANNs

[Diagram: a biological neuron as the inspiration for an artificial neuron, in which many weighted inputs are summed and passed through an activation function.]

44 of 110

From ANNs to DNNs


How to determine weights?

45 of 110

Training: Backpropagation

  • Initialize weights "randomly"
  • For all training epochs
    • for all input-output in training set
      • using input, compute output (forward)
      • compare computed output with training output
      • adapt weights (backward) to improve output
    • if accuracy is good enough, stop


Example: a dataset with 200 samples (rows of data), a batch size of 5, and 1,000 epochs.

  • This means that the dataset will be divided into 40 batches, each with five samples.
  • The model weights will be updated after each batch of five samples.
  • This also means that one epoch will involve 40 batches or 40 updates to the model.
  • With 1,000 epochs, the model will be exposed to or pass through the whole dataset 1,000 times. That is a total of 40,000 batches during the entire training process.
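The batch/epoch arithmetic above can be spelled out in a few lines (no assumptions beyond the numbers in the example):

```python
# Numbers taken from the example: 200 samples, batch size 5, 1,000 epochs.
samples, batch_size, epochs = 200, 5, 1000

batches_per_epoch = samples // batch_size   # the dataset splits into 40 batches
updates_per_epoch = batches_per_epoch       # one weight update per batch
total_batches = batches_per_epoch * epochs  # 40,000 batches over all training

print(batches_per_epoch, updates_per_epoch, total_batches)
```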

46 of 110

Deep neural networks

  • Many layers
  • Features are learned, not given
  • Low-level features combined into�high-level features

  • Special types of layers
    • Convolutional
    • Drop-out
    • Recurrent
    • ...


47 of 110

Convolutional neural networks


 

48 of 110

Convolution examples


 

 

 

 

49 of 110

Sentiment Classification

  • Input data
    • movie review (English)
  • Output data

  • Training examples
  • Test examples



<start> this film was just brilliant casting location

scenery story direction everyone's really suited the part

they played and you could just imagine being there Robert

redford's is an amazing actor and now the same being director

norman's father came from the same scottish island as myself

so i loved the fact there was a real connection with this

film the witty remarks throughout the film were great it was

just brilliant so much that i bought the film as soon as it

50 of 110

Quill Bot

  • Represent words as one-hot vectors, length = vocabulary size

  • Word embeddings
    • dense vector
    • vector distance ≈ semantic distance

  • Training
    • use context
    • discover relations with surrounding words


Issues with one-hot vectors:

  • Unwieldy (large in size)
  • no semantics

51 of 110

Part-2:

A working example (with Python)


52 of 110

Working Example

      • Python: high-level, interpreted, general-purpose programming language
      • Jupyter notebook: a web application for creating and sharing computational documents.
      • Python libraries

pandas: functions for analyzing, cleaning, exploring, and manipulating data

numpy: the fundamental package for scientific computing in Python

matplotlib.pyplot: a collection of functions that make matplotlib work like MATLAB

seaborn: a library for making statistical graphics in Python; it builds on top of matplotlib and integrates closely with pandas data structures


Python commands

  • import pandas as pd
  • import numpy as np
  • import matplotlib.pyplot as plt
  • import seaborn as sns

In Python, an alias (e.g. pd) is an alternate name for referring to the same thing.

53 of 110

Importing the Dataset

  • Import the heart disease dataset from the link:

https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data

  • cols = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
  • data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data', names = cols)

Open Jupyter Notebook


54 of 110


55 of 110


Click on “new”

56 of 110


Select Python 3 (ipykernel)

New Tab/Window appears

57 of 110


After entering commands (one at a time) here, click on the “Run” button

58 of 110

  • NumPy: fundamental package for scientific computing. It includes functionality for
    • Multidimensional arrays
    • High-level mathematical functions (linear algebra, Fourier transform, pseudorandom number generation)
    • In scikit-learn (sklearn), the NumPy array is the fundamental data structure.
    • Scikit-learn provides clean datasets and takes data in the form of NumPy arrays (all data needs to be converted into NumPy arrays).

    • SciPy: collection of functions for scientific computing in Python (advanced linear algebra, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions)
      • When do we require SciPy? For example, when a 2-D array with a lot of zeros (a sparse array) needs to be stored.

59 of 110

  • Convert a NumPy array to a SciPy sparse matrix in CSR (Compressed Sparse Row) format
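The conversion shown on the slide (screenshot lost) can be sketched like this; the 4×4 identity matrix is just an illustrative sparse array:

```python
import numpy as np
from scipy.sparse import csr_matrix

eye = np.eye(4)               # 4x4 identity: 16 cells, only 4 non-zero
sparse = csr_matrix(eye)      # CSR stores just the non-zero entries
print(sparse.nnz)             # number of stored (non-zero) values
print(sparse.toarray())      # back to a dense array when needed
```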


60 of 110

  • Matplotlib: Primary scientific plotting library in python. It provides functions for:
      • Publication-quality visualizations (line charts, histograms, scatter plots etc.)


%matplotlib is a Jupyter magic command.

%matplotlib inline displays static images of plots embedded in the notebook.

%matplotlib notebook gives interactive plots embedded within the notebook.

With a magic command, plt.show() is not required.

61 of 110


[Screenshot: the interactive matplotlib toolbar: reset original view; back; forward; pan (left button pans, right button zooms; x/y fixes an axis, CTRL fixes aspect); zoom to rectangle (x/y fixes axis); download plot.]

62 of 110

  • Pandas: Python library for data wrangling and analysis.
    • Built around the data structure called DataFrame.
    • Similar to a table in an Excel spreadsheet.
    • Pandas provides operations on tables, where each column can have a different type (not possible in NumPy).


Data wrangling: process of removing errors and combining complex data sets to make them more accessible and easier to analyze.

63 of 110

Simple illustration with scikit-learn iris dataset

  • Toy datasets (6 in total) can be found in sklearn.datasets


load_iris returns a Bunch object instead of a tabular format. A Bunch has keys (for lookup) and values, similar to a dictionary. iris_dataset has 8 keys.

'data' (all the feature data in a NumPy array) & 'target' (the variable to predict, in a NumPy array)
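The notebook steps described above (screenshots lost) amount to something like the following, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()               # a dictionary-like Bunch object
print(list(iris_dataset.keys()))         # keys available for lookup
print(iris_dataset['data'].shape)        # (150, 4): 150 objects, 4 features
print(iris_dataset['target'][:5])        # class labels encoded as 0/1/2
print(iris_dataset['target_names'])      # 0=setosa, 1=versicolor, 2=virginica
```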

64 of 110

The ‘data’ key & ‘target’ key


150 rows (entries), each with 4 attributes (features) => define specific ‘target’ key (classification)

0 means setosa;

1 means versicolor;

2 means virginica

65 of 110


66 of 110


67 of 110

Importing a Custom Dataset (Excel worksheet)


Pickle is used for serializing and de-serializing Python object structures, also called marshalling or flattening. Serialization refers to the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network.

68 of 110


https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data

69 of 110


Index of /ml/machine-learning-databases/heart-disease

Download: processed.cleveland.data

Open (with File/Open):

processed.cleveland.data

70 of 110


Dataset features

71 of 110


Used Predictors

All Predictors

Dataset Specifications

72 of 110


Out of 76, 14 attributes (features) used

‘age’= 63.0, ‘sex’ =1.0, ‘cp’=1.0, ‘trestbps’= 145.0, ‘chol’ = 233.0, ‘fbs’ = 1.0, ‘restecg’ = 2.0, ‘thalach’ = 150.0, ‘exang’ = 0.0, ‘oldpeak’= 2.3, ‘slope’= 3.0, ‘ca’=0.0, ‘thal’=6.0, ‘num’=0

Seaborn: Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

73 of 110


Read the dataset

Determine dataset type

Predictors

74 of 110


Last 10 Values of the dataset

75 of 110


Shape (rows, columns) of dataset

76 of 110


Describe statistical characteristics of dataset

77 of 110


Check for missing values (if any)

78 of 110


Check for missing ‘?’ values

‘?’ values in ‘ca’ and ‘thal’ predictors

79 of 110


Handle missing ‘?’ values using SimpleImputer

Definition

Impute: to attribute (assign) a substitute value to a missing entry

80 of 110


Now no ‘?’ values in ‘ca’ and ‘thal’

81 of 110


Again check for missing values (if any)

4 missing values in ‘ca’

2 missing values in ‘thal’

82 of 110


Replace missing values with the mean value

The imputer returns a NumPy array, not a DataFrame

83 of 110


Convert the numpy.ndarray back to a pandas DataFrame

# while using pd.read_csv() we use names = cols

# while using pd.DataFrame() we use columns = cols
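The cleaning steps of the last few slides, in one self-contained sketch. The three-row frame below is a made-up stand-in for the heart-disease data (one clean column plus a 'ca' column containing a '?' marker):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cols = ['age', 'ca']
data = pd.DataFrame([[63.0, '0.0'], [67.0, '?'], [41.0, '3.0']], columns=cols)

# '?' markers become NaN so the imputer recognizes them as missing values.
data = data.replace('?', np.nan).astype(float)

# Replace missing values with the column mean; this returns a NumPy array...
imputer = SimpleImputer(strategy='mean')
arr = imputer.fit_transform(data)

# ...so convert back to a DataFrame (note columns=cols, not names=cols).
data = pd.DataFrame(arr, columns=cols)
print(data['ca'].tolist())   # the '?' became the column mean (0.0+3.0)/2 = 1.5
```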

84 of 110


No missing values, '?' or NaN remain

The data is again a pandas DataFrame

(Examining and Cleaning Data)

85 of 110


# unique values (Classes) in predictor num

86 of 110


5 Classes converted to 2 classes (Binary classification)

Heart Disease: Yes /No

Binary Classification
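Collapsing the five 'num' classes to two (screenshot lost) can be sketched as follows; 0 stays "no heart disease" and 1-4 all become 1 ("heart disease"):

```python
import pandas as pd

# Made-up sample of the 'num' target column with values 0-4.
data = pd.DataFrame({'num': [0, 2, 1, 0, 4, 3]})

# Any non-zero class indicates heart disease -> map to 1.
data['num'] = (data['num'] > 0).astype(int)
print(sorted(data['num'].unique()))   # only the two classes 0 and 1 remain
```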

87 of 110


Check data for binary class

88 of 110


Split data into X (feature matrix) and y (target vector)

data.iloc[:,0:-1] => all rows, all columns except the last.

data.iloc[:,-1] => all rows, and only the last column.

89 of 110


Split into:

Training data

Testing data

90 of 110


Import Classifiers from corresponding model libraries in Scikit Learn

91 of 110


Build Classifier (Algorithms) Models

92 of 110


Fit models on training data

93 of 110


Determine score, i.e. accuracy on test-data
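The split/fit/score steps of these slides (screenshots lost) amount to something like the sketch below; the iris data stands in for the heart-disease frame so the example needs no download, and the choice of a k-NN classifier is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X: feature matrix, y: target vector.
X, y = load_iris(return_X_y=True)

# Hold out test data; fit on the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier().fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out test data
```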

94 of 110


Boxplot of variables

A box plot is a graphical rendition of statistical data based on the minimum, first quartile, median, third quartile, and maximum (additionally whiskers and outliers)

95 of 110


Import Scaler and Pipeline

Fit training data to pipeline

Calculate score (after scaling and pipelining)

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting the parameters of the various steps using their names and the parameter name separated by a '__'.
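A sketch of scaling inside a Pipeline (screenshot lost); iris again stands in for the heart data, and the SVC step is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Named steps: parameters are addressed as <step>__<param>, e.g. svc__C.
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(C=1.0))])
pipe.fit(X_train, y_train)           # the scaler is fit on training data only
print(pipe.score(X_test, y_test))    # score after scaling and pipelining
```

Fitting the scaler inside the pipeline prevents information from the test data leaking into the preprocessing step.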

96 of 110

Part-3:

Working example (with MATLAB)

97 of 110

Classification: Fisher’s Iris Data (with MATLAB)

  • load fisheriris
  • f = figure;
  • gscatter(meas(:,1), meas(:,2), species,'rgb','osd');
  • xlabel('Sepal length');
  • ylabel('Sepal width');


>> size(meas)

ans =

150 4

98 of 110

  • The fitcdiscr function can perform classification using different types of discriminant analysis. First, classify the data using the default linear discriminant analysis (LDA).


lda = fitcdiscr(meas(:,1:2),species)

lda =

ClassificationDiscriminant

ResponseName: 'Y'

CategoricalPredictors: []

ClassNames: {'setosa' 'versicolor' 'virginica'}

ScoreTransform: 'none'

NumObservations: 150

DiscrimType: 'linear'

Mu: [3×2 double]

Coeffs: [3×3 struct]

ldaClass = resubPredict(lda); % predict the class labels for the training data

99 of 110

The observations with known class labels are usually called the training data. Now compute the resubstitution error, which is the misclassification error (the proportion of misclassified observations) on the training set.

>> ldaResubErr = resubLoss(lda)

ldaResubErr =

0.2000

You can also compute the confusion matrix on the training set. A confusion matrix contains information about known class labels and predicted class labels. Generally speaking, the (i,j) element in the confusion matrix is the number of samples whose known class label is class i and whose predicted class is j. The diagonal elements represent correctly classified observations.

Of the 150 training observations, 20% or 30 observations are misclassified by the linear discriminant function.

100 of 110

Total misclassifications

1+14+15 = 30 or 20%

Confusion Matrix

figure

ldaResubCM = confusionchart(species,ldaClass);

101 of 110

figure(f)

bad = ~strcmp(ldaClass,species);

hold on;

plot(meas(bad,1), meas(bad,2), 'kx');

hold off;

102 of 110

The function has separated the plane into regions divided by lines, and assigned different regions to different species. One way to visualize these regions is to create a grid of (x,y) values and apply the classification function to that grid.

[x,y] = meshgrid(4:.1:8,2:.1:4.5);

x = x(:);

y = y(:);

j = classify([x y],meas(:,1:2),species);

gscatter(x,y,j,'grb','sod')

103 of 110

For some data sets, the regions for the various classes are not well separated by lines. When that is the case, linear discriminant analysis is not appropriate. Instead, you can try quadratic discriminant analysis (QDA) on this data.

Compute the resubstitution error for quadratic discriminant analysis.

qda = fitcdiscr(meas(:,1:2),species,'DiscrimType','quadratic');

qdaResubErr = resubLoss(qda)

qdaResubErr =

0.2000

This is still only the resubstitution error. A better measure of performance is the test error (also referred to as generalization error), which is the expected prediction error on an independent data set.

104 of 110

  • In this case you don't have another labeled data set, but you can simulate one by cross-validation.
  • A 10-fold cross-validation is a popular choice for estimating the test error on classification algorithms.
  • It randomly divides the training set into 10 disjoint subsets.
  • Each subset has roughly equal size and roughly the same class proportions as in the training set.
  • Remove one subset, train the classification model using the other nine subsets, and use the trained model to classify the removed subset.
  • This is repeated by removing each of the ten subsets one at a time.

Because cross-validation randomly divides data, its outcome depends on the initial random seed. To reproduce the exact results in this example, execute the following command:

105 of 110

rng(0,'twister');

cp = cvpartition(species,'KFold',10)

cp =

K-fold cross validation partition

NumObservations: 150

NumTestSets: 10

TrainSize: 135 135 135 135 135 135 135 135 135 135

TestSize: 15 15 15 15 15 15 15 15 15 15

The crossval and kfoldLoss methods can estimate the misclassification error for both LDA and QDA using the given data partition cp.

Estimate the true test error for LDA using 10-fold stratified cross-validation.

106 of 110

cvlda = crossval(lda,'CVPartition',cp);

ldaCVErr = kfoldLoss(cvlda)

ldaCVErr =

0.2000

The LDA cross-validation error has the same value as the LDA resubstitution error on this data.

Estimate the true test error for QDA using 10-fold stratified cross-validation.

cvqda = crossval(qda,'CVPartition',cp);

qdaCVErr = kfoldLoss(cvqda)

qdaCVErr =

0.2200

QDA has a slightly larger cross-validation error than LDA. This shows that a simpler model can get comparable or better performance than a more complicated one.

107 of 110

Naive Bayes classifiers are among the most popular classifiers

The fitcnb function can be used to create a more general type of naive Bayes classifier.

First model each variable in each class using a Gaussian distribution. Then, you can compute the resubstitution error and the cross-validation error.

nbGau = fitcnb(meas(:,1:2), species);

nbGauResubErr = resubLoss(nbGau)

nbGauResubErr =

0.2200

nbGauCV = crossval(nbGau, 'CVPartition',cp);

nbGauCVErr = kfoldLoss(nbGauCV)

labels = predict(nbGau, [x y]);

gscatter(x,y,labels,'grb','sod')

108 of 110

We assumed the variables in each class to follow a multivariate normal distribution, but sometimes that assumption is not valid. Now try to model each variable in each class using a kernel density estimate, a more flexible nonparametric technique, setting the kernel to a box kernel:

nbKD = fitcnb(meas(:,1:2), species, 'DistributionNames','kernel', 'Kernel','box');

nbKDResubErr = resubLoss(nbKD)

nbKDResubErr = 0.2067

109 of 110

nbKDCV = crossval(nbKD, 'CVPartition',cp);

nbKDCVErr = kfoldLoss(nbKDCV)

nbKDCVErr = 0.2133

labels = predict(nbKD, [x y]);

gscatter(x,y,labels,'rgb','osd')

For this data set, the naive Bayes classifier with kernel density estimation gets smaller resubstitution error and cross-validation error than the naive Bayes classifier with a Gaussian distribution.

110 of 110