Data Science: Application of Artificial Intelligence/Machine Learning
Saurabh Srivastava
saurabhnitkian@gmail.com | +xxxxxxxxxxxxxxx
vscentrum.be
Part-1:
Data Science
2
DATA
3
There is not a single big industry that does not rely on data and the insights gained from it.
Part-1:
The Theory Behind
4
Introduction
5
[Diagram: Data Science at the intersection of Statistics, Big Data, Artificial Intelligence, and Machine Learning, fed by sources such as incremental user data, large distributed databases, and incremental sensor data.]
Relation: Data Science, Big-data
- Visualization/presentation of data and decision-making based on data.
- The Big Data area is based on more traditional areas such as very large databases, data warehousing, and distributed databases.
6
Some Background….
Agenda: "The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves." (1955 proposal for the Dartmouth Summer Research Project on Artificial Intelligence)
7
Some Background….
8
Founding Fathers of Artificial Intelligence in 1956
Claude Shannon - Founder of Information and Communication Theory
D.M. MacKay - British researcher in Information Theory and brain organization
Julian Bigelow - Chief engineer for the von Neumann computer at Princeton in 1946
Nathaniel Rochester - Author of the first assembler for the first commercial computer
Oliver Selfridge - Named 'the father of Machine Perception'
Ray Solomonoff - Inventor of Algorithmic Probability
John Holland - Inventor of Genetic Algorithms
Marvin Minsky - Key MIT researcher in the early development of AI
Allen Newell - Champion of symbolic AI and inventor of central AI techniques
Herbert Simon - Pioneer in decision-making theory and a Nobel Prize winner
John McCarthy - Inventor of the LISP programming language
Some Background….
9
Some Background….
10
LG has launched the ThinQ AI-focused TV brand.
Huawei, Samsung, and Qualcomm launch AI-powered smartphones.
Burger King boosts 'AI-written' ads. …
A majority of real, current Artificial Intelligence success stories come from the application of Machine Learning alone!
A majority of current Machine Learning success stories relate to image and speech processing!
ML application sectors
11
General application sectors:
Specific categories of data analysis:
Data Analysis for ML
12
Data Analysis
The End-to-end process for Real-World problems
In a typical machine learning application, practitioners must apply the appropriate machine learning approach (hyper-parameter settings, language biases, complexity).
Regression & Classification
13
Objects & Features
14
Category structure (Animal)
Mammal(#1)
Bird(#2)
Reptile(#3)
Fish(#4)
Amphibian(#5)
Insect(#6)
Invertebrate(#7)
Object and Feature: synonyms

Object (synonyms)              Feature (synonyms)
Thing                          Property
Entity                         Attribute
Observation                    Characteristic
Data, Data-item                Variable
Record, Tuple                  Field
Row, Vector                    Column
Instance, training instance    Independent variable, Predictor variable
Example, training example      Output variable, Target or Category feature
15
16
Object space (synonyms): Instance space; Population
Subsets of the object space available for learning (synonyms): Sample, Training sample, Statistical sample; Data-set, Table, Array; Training example set
17
Example from the ZOO Dataset
The Object space or population is the set of all potential feature vectors that can be expressed in the ZOO object language.
The Sample or Data-set is the whole set of ZOO feature vectors.
The Extension of the Concept of buffalo is the set of all buffalos in real life.
Objects & Features
18
Features
animal_name, hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize, class_type.
All features are Boolean except animal_name, which is text, and class_type and legs, which are integers.
Example from the ZOO Dataset: buffalo,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
Example of an Object and its Feature vector
The Object Language in this case is the specific formalism for specifying Feature vectors:

animal_name  buffalo  category   symbolic feature
hair         1        predictor  ordinal feature
feathers     0        predictor  ordinal feature
eggs         0        predictor  ordinal feature
milk         1        predictor  ordinal feature
airborne     0        predictor  ordinal feature
aquatic      0        predictor  ordinal feature
predator     0        predictor  ordinal feature
toothed      1        predictor  ordinal feature
backbone     1        predictor  ordinal feature
breathes     1        predictor  ordinal feature
venomous     0        predictor  ordinal feature
fins         0        predictor  ordinal feature
legs         4        predictor  discrete numerical feature
tail         1        predictor  ordinal feature
domestic     0        predictor  ordinal feature
catsize      1        predictor  ordinal feature
class_type   1        category   discrete numerical feature
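As an illustration, here is a minimal pandas sketch (not from the slides) that loads the ZOO data and retrieves the buffalo feature vector shown above; the file name zoo.data is an assumption (the UCI distribution of the dataset).

# Minimal sketch: load the ZOO dataset and look up one object.
import pandas as pd

cols = ["animal_name", "hair", "feathers", "eggs", "milk", "airborne",
        "aquatic", "predator", "toothed", "backbone", "breathes",
        "venomous", "fins", "legs", "tail", "domestic", "catsize",
        "class_type"]

zoo = pd.read_csv("zoo.data", names=cols)   # the raw file has no header row
print(zoo[zoo.animal_name == "buffalo"])    # the feature vector shown above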
Generalization
19
[Diagram: an Object Category Definition picks out two sets: the subset of the Data-set consistent with the category definition, and the subset of the Object space consistent with the category definition. The relations involved are Instance-of, Element-of, and Subset-of.]
20
Example from the ZOO Dataset
Example of a Concept Definition. The Hypothesis Language is the same as the Object Language, apart from the introduction of a wildcard (?) for ordinal feature values.
fish 0,0,1,0,0,1,?,1,1,0,?,1,0,1,?,?,4
tuna, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4
stingray, 0,0,1,0,0,1,1,1,1,0,1,1,0,1,0,1,4
seahorse, 0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4
pike, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4
piranha, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
herring, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
haddock, 0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4
dogfish, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4
chub, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
catfish, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
carp, 0,0,1,0,0,1,0,1,1,0,0,1,0,1,1,0,4
bass, 0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
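A small sketch (assuming nothing beyond the slide's wildcard convention) of how an instance can be tested against such a concept definition:

# Minimal sketch: test whether an instance is consistent with a concept
# definition that uses '?' as a wildcard for ordinal feature values.
fish_concept = "0,0,1,0,0,1,?,1,1,0,?,1,0,1,?,?,4".split(",")

def matches(concept, instance):
    # True if every non-wildcard concept value equals the instance value.
    return all(c == "?" or c == v for c, v in zip(concept, instance))

tuna = "0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4".split(",")
print(matches(fish_concept, tuna))  # True: tuna is an instance of the fish concept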
The ZOO dataset (107 Objects)
21
aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1 antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1 boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 buffalo,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 calf,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,1,1 carp,0,0,1,0,0,1,0,1,1,0,0,1,0,1,1,0,4 catfish,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 cavy,1,0,0,1,0,0,0,1,1,1,0,0,4,0,1,0,1 cheetah,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 chicken,0,1,1,0,1,0,0,0,1,1,0,0,2,1,1,0,2 chub,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 clam,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,7 crab,0,0,1,0,0,1,1,0,0,0,0,0,4,0,0,0,7 crayfish,0,0,1,0,0,1,1,0,0,0,0,0,6,0,0,0,7 crow,0,1,1,0,1,0,1,0,1,1,0,0,2,1,0,0,2 deer,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 dogfish,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4 dolphin,0,0,0,1,0,1,1,1,1,1,0,1,0,1,0,1,1 dove,0,1,1,0,1,0,0,0,1,1,0,0,2,1,1,0,2 duck,0,1,1,0,1,1,0,0,1,1,0,0,2,1,0,0,2 elephant,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 flamingo,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,1,2 flea,0,0,1,0,0,0,0,0,0,1,0,0,6,0,0,0,6 frog,0,0,1,0,0,1,1,1,1,1,0,0,4,0,0,0,5 frog,0,0,1,0,0,1,1,1,1,1,1,0,4,0,0,0,5 fruitbat,1,0,0,1,1,0,0,1,1,1,0,0,2,1,0,0,1 giraffe,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 girl,1,0,0,1,0,0,1,1,1,1,0,0,2,0,1,1,1 gnat,0,0,1,0,1,0,0,0,0,1,0,0,6,0,0,0,6 goat,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,1,1 gorilla,1,0,0,1,0,0,0,1,1,1,0,0,2,0,0,1,1 gull,0,1,1,0,1,1,1,0,1,1,0,0,2,1,0,0,2 haddock,0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4 hamster,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,0,1 hare,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,0,1 hawk,0,1,1,0,1,0,1,0,1,1,0,0,2,1,0,0,2 herring,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 honeybee,1,0,1,0,1,0,0,0,0,1,1,0,6,0,1,0,6 housefly,1,0,1,0,1,0,0,0,0,1,0,0,6,0,0,0,6 kiwi,0,1,1,0,0,0,1,0,1,1,0,0,2,1,0,0,2 ladybird,0,0,1,0,1,0,1,0,0,1,0,0,6,0,0,0,6 lark,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2 leopard,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 lion,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 lobster,0,0,1,0,0,1,1,0,0,0,0,0,6,0,0,0,7 lynx,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 mink,1,0,0,1,0,1,1,1,1,1,0,0,4,1,0,1,1 mole,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,0,1 mongoose,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 moth,1,0,1,0,1,0,0,0,0,1,0,0,6,0,0,0,6 newt,0,0,1,0,0,1,1,1,1,1,0,0,4,1,0,0,5 octopus,0,0,1,0,0,1,1,0,0,0,0,0,8,0,0,1,7 opossum,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,0,1 oryx,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1 ostrich,0,1,1,0,0,0,0,0,1,1,0,0,2,1,0,1,2 parakeet,0,1,1,0,1,0,0,0,1,1,0,0,2,1,1,0,2 penguin,0,1,1,0,0,1,1,0,1,1,0,0,2,1,0,1,2 pheasant,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2 pike,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4 piranha,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4 pitviper,0,0,1,0,0,0,1,1,1,1,1,0,0,1,0,0,3 platypus,1,0,1,1,0,1,1,0,1,1,0,0,4,1,0,1,1 polecat,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 pony,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,1,1 porpoise,0,0,0,1,0,1,1,1,1,1,0,1,0,1,0,1,1 puma,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 pussycat,1,0,0,1,0,0,1,1,1,1,0,0,4,1,1,1,1 raccoon,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 reindeer,1,0,0,1,0,0,0,1,1,1,0,0,4,1,1,1,1 rhea,0,1,1,0,0,0,1,0,1,1,0,0,2,1,0,1,2 scorpion,0,0,0,0,0,0,1,0,0,1,1,0,8,1,0,0,7 seahorse,0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4 seal,1,0,0,1,0,1,1,1,1,1,0,1,0,0,0,1,1 sealion,1,0,0,1,0,1,1,1,1,1,0,1,2,1,0,1,1 seasnake,0,0,0,0,0,1,1,1,1,0,1,0,0,1,0,0,3 seawasp,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,7 skimmer,0,1,1,0,1,1,1,0,1,1,0,0,2,1,0,0,2 skua,0,1,1,0,1,1,1,0,1,1,0,0,2,1,0,0,2 slowworm,0,0,1,0,0,0,1,1,1,1,0,0,0,1,0,0,3 slug,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,7 sole,0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,4 sparrow,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2 squirrel,1,0,0,1,0,0,0,1,1,1,0,0,2,1,0,0,1 starfish,0,0,1,0,0,1,1,0,0,0,0,0,5,0,0,0,7 
stingray,0,0,1,0,0,1,1,1,1,0,1,1,0,1,0,1,4 swan,0,1,1,0,1,1,0,0,1,1,0,0,2,1,0,1,2 termite,0,0,1,0,0,0,0,0,0,1,0,0,6,0,0,0,6 toad,0,0,1,0,0,1,0,1,1,1,0,0,4,0,0,0,5 tortoise,0,0,1,0,0,0,0,0,1,1,0,0,4,1,0,1,3 tuatara,0,0,1,0,0,0,1,1,1,1,0,0,4,1,0,0,3 tuna,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4 vampire,1,0,0,1,1,0,0,1,1,1,0,0,2,1,0,0,1 vole,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,0,1 vulture,0,1,1,0,1,0,1,0,1,1,0,0,2,1,0,1,2 wallaby,1,0,0,1,0,0,0,1,1,1,0,0,2,1,0,1,1 wasp,1,0,1,0,1,0,0,0,0,1,1,0,6,0,0,0,6 wolf,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1 worm,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,7 wren,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2
Classification Task
22
Dimensionality Reduction
23
[Diagram: a Transformation maps the original Features (Dimension-1 … Dimension-k) to new Features' (Dimension-1' … Dimension-k'), typically with k' < k. Principal Component Analysis is one such transformation, used for dimension reduction (compression, etc.).]
PCA Example
24
PC1 has the largest variance (most information); PC7 has the smallest variance (least information).
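A minimal PCA sketch with scikit-learn (an assumed library choice, not shown on the slide): the explained variance ratio is sorted from PC1 (largest) down to the last component (smallest), matching the remark above.

# Minimal PCA sketch on toy data with 7 features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))         # toy data: 100 samples, 7 features
pca = PCA(n_components=7).fit(X)
print(pca.explained_variance_ratio_)  # sorted from PC1 (largest) to PC7 (smallest)
X_reduced = PCA(n_components=2).fit_transform(X)  # keep the 2 strongest components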
Feature Selection/ Reduction
25
Each image is a Data-item
Feature Selection:
Features can be derived in a variety of ways, ranging from fully manual, through manual/automatic hybrids, to fully automated.
In the automated case, every non-digital form of representation demands its own specialized techniques.
Dimensionality/ Feature reduction:
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (typically with hundreds or thousands of dimensions)
Over-Fitting Vs. Under-Fitting
26
Feature Selection Vs. Feature Extraction
27
Machine Learning Tasks
28
Regression
Colorize B&W images automatically
29
Classification
30
Reinforcement learning
Learning to play Breakout
31
Clustering
Crime prediction using k-means clustering
http://www.grdjournals.com/uploads/article/GRDJE/V02/I05/0176/GRDJEV02I050176.pdf
32
Applications in Science
33
Machine Learning Algorithms
34
35
Not Always Perfect…..
36
Failure Reasons
37
Implementation
38
classic machine learning
deep learning frameworks
Fast-evolving ecosystem!
Scikit-learn
39
Keras (TensorFlow)
40
Procedure
41
Must be done systematically
Supervised Learning: Methodology
42
From Neurons to ANN’s
43
[Diagram: a biological neuron as the inspiration for the artificial neuron: weighted inputs are summed and passed through an activation function.]
From ANNs to DNNs
44
How to determine weights?
Training: Backpropagation
45
Example: a dataset with 200 samples (rows of data), a batch size of 5, and 1,000 epochs. Each epoch then consists of 200 / 5 = 40 batches, i.e. 40 weight updates, so training performs 40 × 1,000 = 40,000 updates in total.
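A hedged Keras sketch of these settings; the toy model and random data are purely illustrative, not the slide's example.

# Minimal sketch: batch size 5 and 1,000 epochs on 200 samples.
import numpy as np
from tensorflow import keras

X = np.random.rand(200, 10)            # 200 samples, 10 features
y = np.random.randint(0, 2, size=200)  # toy binary labels

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, batch_size=5, epochs=1000, verbose=0)  # backpropagation runs once per batch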
Deep neural networks
46
Convolutional neural networks
47
[Figure: convolution (⊗) of an input image with a kernel.]
Convolution examples
48
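A small 2-D convolution sketch (an assumed example, not the slide's own): scipy.signal.convolve2d applies a kernel across an image, here a simple edge-detection kernel.

# Minimal sketch: convolve a toy image with an edge-detection kernel.
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(8, 8)            # toy grayscale "image"
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])       # classic edge-detection kernel
edges = convolve2d(image, kernel, mode="same")
print(edges.shape)                      # (8, 8): same size as the input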
Sentiment Classification
49
<start> this film was just brilliant casting location
scenery story direction everyone's really suited the part
they played and you could just imagine being there Robert
redford's is an amazing actor and now the same being director
norman's father came from the same scottish island as myself
so i loved the fact there was a real connection with this
film the witty remarks throughout the film were great it was
just brilliant so much that i bought the film as soon as it
…
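A minimal sentiment-classification sketch, assuming the Keras IMDB dataset (which contains reviews like the excerpt above); the tiny model is illustrative, not the slide's exact architecture.

# Minimal sketch: binary sentiment classification on the IMDB reviews.
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=10000)
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=256)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=256)

model = keras.Sequential([
    keras.layers.Embedding(10000, 16),            # learn a 16-d word embedding
    keras.layers.GlobalAveragePooling1D(),        # average over the review
    keras.layers.Dense(1, activation="sigmoid"),  # positive vs. negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=512, validation_split=0.2)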
QuillBot
50
Issues:
Part-2:
A working example (with Python)
51
Working Example
– pandas: functions for analyzing, cleaning, exploring, and manipulating data.
– numpy: the fundamental package for scientific computing in Python.
– matplotlib.pyplot: a collection of functions that make matplotlib work like MATLAB.
– seaborn: a library for making statistical graphics in Python; it builds on top of matplotlib and integrates closely with pandas data structures.
52
Python commands
In Python, an alias (e.g. pd) is an alternate name for referring to the same module.
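A minimal sketch of the standard aliases used in the rest of this example:

# Import the four libraries under their conventional aliases.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns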
Importing the Dataset
https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
Open Jupyter Notebook
53
54
55
Click on “new”
56
Select Python 3 (ipykernel)
New Tab/Window appears
57
After entering commands (one at a time) here, click on the “Run” button
58
Use a sparse-matrix representation when a 2-D array with a lot of zeros (a sparse array) needs to be stored.
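A minimal sketch with SciPy (an assumed library choice): a mostly-zero 2-D array is stored compactly in Compressed Sparse Row (CSR) format.

# Minimal sketch: store a sparse array compactly.
import numpy as np
from scipy import sparse

eye = np.eye(4)                  # 4x4 identity matrix: mostly zeros
sparse_eye = sparse.csr_matrix(eye)
print(sparse_eye)                # only the 4 non-zero entries are stored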
59
60
%matplotlib is a magic command.
%matplotlib inline displays static figures embedded in the notebook.
%matplotlib notebook gives interactive plots embedded within the notebook.
With a magic command active, plt.show() is not required.
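A short notebook usage sketch (the magics are Jupyter-only, so they appear here as comments):

# In a Jupyter cell, choose one magic first:
#   %matplotlib inline     -> static figures
#   %matplotlib notebook   -> interactive figures
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [1, 4, 9])   # rendered automatically; no plt.show() needed in the notebook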
61
Reset original view
Back
Forward
Left button pans, Right button zooms x/y fixes axis, CTRL fixes aspect
Zoom to rectangle x/y fixes axis
Download plot
62
Data wrangling: the process of removing errors and combining complex data sets to make them more accessible and easier to analyze.
Simple illustration with scikit-learn iris dataset
63
load_iris returns a Bunch object instead of a tabular format. A Bunch has keys (for lookup) and values, similar to a dictionary. iris_dataset has 8 keys.
'data' (all the feature data in a NumPy array) & 'target' (the variable to predict, in a NumPy array)
The ‘data’ key & ‘target’ key
64
150 rows (entries), each with 4 attributes (features); each row has a corresponding 'target' value, the class to predict (classification).
0 means setosa;
1 means versicolor;
2 means virginica
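A minimal sketch of inspecting the Bunch returned by load_iris:

# Minimal sketch: explore the iris Bunch object.
from sklearn.datasets import load_iris

iris_dataset = load_iris()
print(iris_dataset.keys())            # dictionary-like lookup keys
print(iris_dataset["data"].shape)     # (150, 4): 150 rows, 4 features
print(iris_dataset["target"][:5])     # 0 = setosa, 1 = versicolor, 2 = virginica
print(iris_dataset["target_names"])   # ['setosa' 'versicolor' 'virginica']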
65
66
Importing a Custom Dataset (Excel worksheet)
67
Pickle is used for serializing and de-serializing Python object structures, also called marshalling or flattening. Serialization refers to the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network.
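A minimal pickle sketch of that round trip; the file name obj.pkl is an assumption.

# Minimal sketch: serialize an object to disk and load it back.
import pickle

obj = {"model": "demo", "accuracy": 0.9}
with open("obj.pkl", "wb") as f:
    pickle.dump(obj, f)           # object in memory -> byte stream on disk
with open("obj.pkl", "rb") as f:
    restored = pickle.load(f)     # byte stream -> object in memory
print(restored == obj)            # True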
68
https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleveland.data
69
Index of /ml/machine-learning-databases/heart-disease
Download: processed.cleveland.data
Open (with File/Open):
processed.cleveland.data
70
Dataset features
71
Used Predictors
All Predictors
Dataset Specifications
72
Out of 76, 14 attributes (features) used
‘age’= 63.0, ‘sex’ =1.0, ‘cp’=1.0, ‘trestbps’= 145.0, ‘chol’ = 233.0, ‘fbs’ = 1.0, ‘restecg’ = 2.0, ‘thalach’ = 150.0, ‘exang’ = 0.0, ‘oldpeak’= 2.3, ‘slope’= 3.0, ‘ca’=0.0, ‘thal’=6.0, ‘num’=0
Seaborn: Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
73
Read the dataset
Determine dataset type
Predictors
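A hedged sketch of reading the downloaded file with pandas; the 14 column names follow the attribute list above ('num' is the target).

# Minimal sketch: read the processed Cleveland data and inspect its types.
import pandas as pd

cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
data = pd.read_csv("processed.cleveland.data", names=cols)  # or the UCI URL above
print(type(data))    # pandas.core.frame.DataFrame
print(data.dtypes)   # data type of each predictor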
74
Last 10 Values of the dataset
75
Shape (rows, columns) of dataset
76
Describe statistical characteristics of dataset
77
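Continuing the read sketch above, the last three inspection steps in code:

# Minimal sketch: inspect the tail, shape, and summary statistics.
print(data.tail(10))     # last 10 rows of the dataset
print(data.shape)        # (rows, columns)
print(data.describe())   # count, mean, std, min, quartiles, max per column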
Check for missing values (if any)
78
Check for missing ‘?’ values
‘?’ values in ‘ca’ and ‘thal’ predictors
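A sketch of both checks, continuing from the sketch above: NaN counts per column, then counts of the '?' placeholder used by this file.

# Minimal sketch: look for missing values and '?' placeholders.
print(data.isnull().sum())   # missing (NaN) values per column
print((data == "?").sum())   # '?' appears only in 'ca' and 'thal'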
79
Handle missing ‘?’ values using SimpleImputer
Definition of "impute": to attribute a substitute value to a missing entry (a sort of assignment).
80
Now no ‘?’ values in ‘ca’ and ‘thal’
81
Again check for missing values (if any)
4 missing values in ‘ca’
2 missing values in ‘thal’
82
Replace missing values with the mean value.
The imputer returns a NumPy array, not a DataFrame.
83
Convert the numpy.ndarray back to a pandas DataFrame
# while using pd.read_csv() we use names = cols
# while using pd.DataFrame() we use columns = cols
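A hedged sketch of the whole imputation flow, continuing from the read sketch above: turn '?' into NaN, impute the column mean with SimpleImputer, then rebuild the DataFrame (the imputer returns a NumPy array).

# Minimal sketch: mean-impute the '?' placeholders.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = data.replace("?", np.nan).astype(float)  # placeholders -> NaN, all columns numeric
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(data)           # numpy.ndarray, not a DataFrame
data = pd.DataFrame(imputed, columns=cols)      # columns=cols, as noted above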
84
No missing value or ‘?’ or ‘nan’
Data also panda data frame
(Examining and Cleaning Data)
85
# unique values (Classes) in predictor num
86
5 Classes converted to 2 classes (Binary classification)
Heart Disease: Yes/No
Binary Classification
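A sketch of collapsing the 5 classes of 'num' (0 = no disease, 1 to 4 = presence of disease) into a binary heart-disease label:

# Minimal sketch: convert the 5-class target into a binary target.
print(data["num"].unique())                  # the 5 original classes
data["num"] = (data["num"] > 0).astype(int)  # 0 stays 0; 1-4 all become 1
print(data["num"].value_counts())            # check the two binary classes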
87
Check data for binary class
88
Split data into X (feature matrix) and y (target vector)
data.iloc[:,0:-1] => all rows, all columns except the last.
data.iloc[:,-1] => all rows, only the last column.
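The split described above, in code:

# Minimal sketch: feature matrix X and target vector y via iloc.
X = data.iloc[:, 0:-1]   # all rows, every column except the last (the predictors)
y = data.iloc[:, -1]     # all rows, only the last column (the target 'num')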
89
Split into:
Training data
Testing data
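A sketch of the split with scikit-learn; the 80/20 ratio and random_state are assumptions, not the slide's exact settings.

# Minimal sketch: hold out 20% of the data for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)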
90
Import Classifiers from corresponding model libraries in Scikit Learn
91
Build Classifier (Algorithms) Models
92
Fit models on training data
93
Determine score, i.e. accuracy on test-data
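A hedged sketch of the four steps above (import, build, fit, score); the particular classifiers are assumptions, since the slides do not list them.

# Minimal sketch: build several classifiers, fit on training data, score on test data.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)               # fit on the training data
    print(name, model.score(X_test, y_test))  # accuracy on the test data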
94
Boxplot of variables
A box plot is a graphical rendition of statistical data based on the minimum, first quartile, median, third quartile, and maximum (plus whiskers and outliers).
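A sketch of the variable boxplot with pandas/matplotlib:

# Minimal sketch: one box (median, quartiles, whiskers, outliers) per column.
import matplotlib.pyplot as plt

data.boxplot(figsize=(12, 6))
plt.show()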
95
Import Scaler and Pipeline
Fit training data to pipeline
Calculate score (after scaling and pipelining)
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting the parameters of the various steps using their names and the parameter name.
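A hedged sketch of scaling plus classification in one Pipeline; the step names ("scaler", "clf") and the classifier choice are assumptions.

# Minimal sketch: standardize, then classify, inside a single Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([
    ("scaler", StandardScaler()),      # standardize the features first
    ("clf", KNeighborsClassifier()),   # then classify
])
pipe.fit(X_train, y_train)             # fit the whole pipeline on the training data
print(pipe.score(X_test, y_test))      # score after scaling and pipelining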
96
Part-3:
A working example (with MATLAB)
Classification: Fisher's Iris Data (with MATLAB)
97
>> size(meas)
ans =
150 4
98
lda = fitcdiscr(meas(:,1:2),species)
lda =
ClassificationDiscriminant
ResponseName: 'Y'
CategoricalPredictors: []
ClassNames: {'setosa' 'versicolor' 'virginica'}
ScoreTransform: 'none'
NumObservations: 150
DiscrimType: 'linear'
Mu: [3×2 double]
Coeffs: [3×3 struct]
ldaClass = resubPredict(lda)   % predicted class labels for the training observations
99
The observations with known class labels are usually called the training data. Now compute the resubstitution error, which is the misclassification error (the proportion of misclassified observations) on the training set.
>> ldaResubErr = resubLoss(lda)
ldaResubErr =
0.2000
You can also compute the confusion matrix on the training set. A confusion matrix contains information about known class labels and predicted class labels. Generally speaking, the (i,j) element in the confusion matrix is the number of samples whose known class label is class i and whose predicted class is j. The diagonal elements represent correctly classified observations.
Of the 150 training observations, 20% or 30 observations are misclassified by the linear discriminant function.
100
Total misclassifications
1+14+15 = 30 or 20%
Confusion Matrix
figure
ldaResubCM = confusionchart(species,ldaClass);
101
figure(f)                               % reuse the scatter-plot figure from before
bad = ~strcmp(ldaClass,species);        % logical index of the misclassified observations
hold on;
plot(meas(bad,1), meas(bad,2), 'kx');   % mark misclassified points with black crosses
hold off;
102
The function has separated the plane into regions divided by lines, and assigned different regions to different species. One way to visualize these regions is to create a grid of (x,y) values and apply the classification function to that grid.
[x,y] = meshgrid(4:.1:8,2:.1:4.5);         % grid covering the range of the two features
x = x(:);                                  % flatten to column vectors
y = y(:);
j = classify([x y],meas(:,1:2),species);   % classify every grid point
gscatter(x,y,j,'grb','sod')                % color the grid by predicted species
103
For some data sets, the regions for the various classes are not well separated by lines. When that is the case, linear discriminant analysis is not appropriate. Instead, you can try quadratic discriminant analysis (QDA) for our data.
Compute the resubstitution error for quadratic discriminant analysis.
qda = fitcdiscr(meas(:,1:2),species,'DiscrimType','quadratic');
qdaResubErr = resubLoss(qda)
qdaResubErr =
0.2000
The test error (also referred to as generalization error) is the expected prediction error on an independent set.
104
Because cross-validation randomly divides data, its outcome depends on the initial random seed. To reproduce the exact results in this example, execute the following command:
105
rng(0,'twister');
cp = cvpartition(species,'KFold',10)
cp =
K-fold cross validation partition
NumObservations: 150
NumTestSets: 10
TrainSize: 135 135 135 135 135 135 135 135 135 135
TestSize: 15 15 15 15 15 15 15 15 15 15
The crossval and kfoldLoss methods can estimate the misclassification error for both LDA and QDA using the given data partition cp.
Estimate the true test error for LDA using 10-fold stratified cross-validation.
106
cvlda = crossval(lda,'CVPartition',cp);
ldaCVErr = kfoldLoss(cvlda)
ldaCVErr =
0.2000
The LDA cross-validation error has the same value as the LDA resubstitution error on this data.
Estimate the true test error for QDA using 10-fold stratified cross-validation.
cvqda = crossval(qda,'CVPartition',cp);
qdaCVErr = kfoldLoss(cvqda)
qdaCVErr =
0.2200
QDA has a slightly larger cross-validation error than LDA, which shows that a simpler model may achieve comparable or better performance than a more complicated one.
107
Naive Bayes classifiers are among the most popular classifiers
The fitcnb function can be used to create a more general type of naive Bayes classifier.
First, model each variable in each class using a Gaussian distribution; then compute the resubstitution error and the cross-validation error.
nbGau = fitcnb(meas(:,1:2), species);
nbGauResubErr = resubLoss(nbGau)
nbGauResubErr =
0.2200
nbGauCV = crossval(nbGau, 'CVPartition',cp);
nbGauCVErr = kfoldLoss(nbGauCV)
labels = predict(nbGau, [x y]);
gscatter(x,y,labels,'grb','sod')
108
We assumed the variables in each class to have a multivariate normal distribution, but sometimes that assumption is not valid. Now try modeling each variable in each class with a kernel density estimate, a more flexible nonparametric technique, setting the kernel to 'box'.
nbKD = fitcnb(meas(:,1:2), species, 'DistributionNames','kernel', 'Kernel','box');
nbKDResubErr = resubLoss(nbKD)
nbKDResubErr = 0.2067
109
nbKDCV = crossval(nbKD, 'CVPartition',cp);
nbKDCVErr = kfoldLoss(nbKDCV)
nbKDCVErr = 0.2133
labels = predict(nbKD, [x y]);
gscatter(x,y,labels,'rgb','osd')
For this data set, the naive Bayes classifier with kernel density estimation gets smaller resubstitution error and cross-validation error than the naive Bayes classifier with a Gaussian distribution.
110