Data Science for All 2023
Lecture 1: Classification
July 17, 2023

Heart Data: logistic estimation


  • There are various ways to fit a logistic model to this data set in Python. The most straightforward route in sklearn is linear_model.LogisticRegression, as in the sketch below.
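  • A minimal sketch (assuming the Heart data is in a pandas DataFrame with a MaxHR column and an AHD column coded Yes/No; the file and column names are assumptions, so adjust them to your copy of the data):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    heart = pd.read_csv("Heart.csv")              # hypothetical file name
    X = heart[["MaxHR"]]                          # single predictor
    y = (heart["AHD"] == "Yes").astype(int)       # 1 = heart disease, 0 = none

    # A large C means very weak regularization, close to a plain maximum-likelihood fit.
    model = LogisticRegression(C=1e6).fit(X, y)
    print(model.intercept_, model.coef_)          # estimated intercept and slope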


Using Logistic Regression for Classification


Multiple Logistic Regression

  • It is simple to illustrate logistic regression when there is just one predictor variable.
  • But the approach generalizes 'easily' to the situation where there are multiple predictors.
  • Many of the same considerations as in linear regression apply to logistic regression: interactions can be included, overfitting is still a concern, and so on.
  • So how do we correct for such problems?
  • Regularization, together with checking performance on train/test splits and via cross-validation (a sketch follows this list).
  • We will get into the details of this, along with other extensions of logistic regression.
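  • A hedged sketch of such a check, using sklearn's regularization strength C together with a train/test split and cross-validation (X and y are the predictors and binary response from the earlier sketch):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    for C in [0.01, 0.1, 1, 10, 100]:             # smaller C = stronger regularization
        model = LogisticRegression(C=C, max_iter=1000)
        cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
        test_acc = model.fit(X_train, y_train).score(X_test, y_test)
        print(C, round(cv_acc, 3), round(test_acc, 3))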


Classifier with two predictors

  • How can we estimate a classifier, based on logistic regression, for the following plot?
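  • One possible answer, as a sketch: fit a logistic regression on both predictors; the points where the predicted probability equals 0.5 form a straight-line decision boundary. (X2 and y2 are hypothetical names for the plotted two-column feature array and its binary labels.)

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression().fit(X2, y2)        # X2 has two columns (two predictors)
    b0, (b1, b2) = clf.intercept_[0], clf.coef_[0]

    # Boundary: b0 + b1*x1 + b2*x2 = 0, i.e. x2 = -(b0 + b1*x1) / b2
    x1_grid = np.linspace(X2[:, 0].min(), X2[:, 0].max(), 100)
    x2_boundary = -(b0 + b1 * x1_grid) / b2       # line to overlay on the scatter plot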


Fitting Multiple Logistic Regression


Interpreting Multiple Logistic Regression: an Example


  • Let's get back to the Heart Data. We are attempting to predict whether someone has heart disease (HD) based on MaxHR and on whether the person is female or male. The simultaneous effect of these two predictors can be captured in a single model.

  • Fitting a separate simple (single-predictor) logistic regression to each predictor gives the following estimated models:


  • The results for the multiple logistic regression model are:
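  • The fitted coefficients are shown in the slide figure and are not reproduced here. As a hedged sketch of how such a fit could be produced (assuming the DataFrame from the earlier sketch, with MaxHR and Sex columns):

    import statsmodels.formula.api as smf

    heart["HD"] = (heart["AHD"] == "Yes").astype(int)     # numeric binary response
    fit = smf.logit("HD ~ MaxHR + C(Sex)", data=heart).fit()
    print(fit.summary())      # one coefficient per predictor, plus the intercept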


Classification boundaries


Geometry of Data


  • Recall: logistic regression works best for building classification boundaries when:
    • the classes are well-separated in the feature space, and
    • the classification boundary has a simple (e.g., linear) geometry.


  • Question: How about these?


  • Notice that in all of the datasets the classes are still well-separated in the feature space, but the decision boundaries cannot easily be described by single equations:


Interpretable Models


  • People have long used interpretable models to differentiate between classes of objects and phenomena:

Decision Trees

  • It turns out that the simple flow charts in our examples can be formulated as mathematical models for classification, and these models have the properties we desire. They are:
    • interpretable by humans,
    • able to express sufficiently complex decision boundaries,
    • built from locally linear pieces, so each component of the decision boundary is simple to describe mathematically.

Tree-based Models

  • Use trees to partition the data into different regions and make predictions

[Figure: decision tree on the customer data]

    age?
    ├── <=30   → student?        ── no → no;  yes → yes
    ├── 31..40 → yes
    └── >40    → credit rating?  ── excellent → no;  fair → yes

    (root node: age?; internal nodes: student?, credit rating?; leaf nodes: the class labels)


The Geometry of Flow Charts

  • Flow charts whose graph is a tree (connected and with no cycles) represent a model called a decision tree.
  • Formally, a decision tree model is one in which the final outcome of the model is based on a series of comparisons of the values of predictors against threshold values.
  • In a graphical representation (flow chart),
    • the internal nodes of the tree represent attribute tests,
    • branching at the next level is determined by the attribute value (e.g., yes/no),
    • terminal leaf nodes represent class assignments.


  • A path from the root to a leaf node corresponds to a rule.
    • E.g., if age <= 30 and student = no, then target value = no.


  • Every flow chart tree corresponds to a partition of the feature space by axis-aligned lines or (hyper)planes. Conversely, every such partition can be written as a flow chart tree.


  • Each comparison and branching represents splitting a region in the feature space on a single feature. Typically, at each iteration, we split once along one dimension (one predictor). Why?


Learning the Model

  • Given a training set, learning a decision tree model for binary classification means:
    • producing an optimal partition of the feature space with axis-aligned linear boundaries (very interpretable!), where
    • each region is assigned the class label of the majority class among the training points it contains, and that label is used when making predictions.


  • Learning the smallest 'optimal' decision tree for any given data set is computationally very hard. Instead, we seek a reasonably good model using a greedy algorithm:
    • Start with an empty decision tree (undivided feature space).
    • Choose the 'optimal' predictor on which to split and the 'optimal' threshold value for the split.
    • Recurse on each new node until a stopping condition is met.
  • Now we need only define our splitting criterion and stopping condition (a sketch of a single greedy split is given after this list).
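  • A minimal sketch of a single greedy split, scoring candidate (predictor, threshold) pairs by weighted Gini impurity (the function names are illustrative, not from the lecture code):

    import numpy as np

    def gini_impurity(y):
        """Gini impurity of a set of binary labels y."""
        if len(y) == 0:
            return 0.0
        p = y.mean()
        return 2 * p * (1 - p)

    def best_split(X, y):
        """Return the (feature index, threshold) with the lowest weighted impurity."""
        best_j, best_t, best_score = None, None, np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                score = (left.sum() * gini_impurity(y[left]) +
                         (~left).sum() * gini_impurity(y[~left])) / len(y)
                if score < best_score:
                    best_j, best_t, best_score = j, t, score
        return best_j, best_t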


The key is to partition data into smaller regions

  • Numerical variable
    • Use a threshold to partition the data
  • Categorical variable
    • Use the possible values to partition the data


Guideline of Splitting

  • While there is no single 'correct' way to define an optimal split, there are some commonsense guidelines for every splitting criterion:
    • the regions in the feature space should grow progressively more pure with the number of splits; that is, each region should 'specialize' towards a single class.
    • we shouldn't end up with empty regions - regions containing no training points.


Which attribute to choose?

[Figure: two candidate splits on 14 labeled records]

    Ages:          <=30 → {Yes, Yes, No, No, No}    31…40 → {Yes, Yes, Yes, Yes}    >40 → {Yes, Yes, Yes, No, No}
    vs.
    Credit_Rating: Excellent → {Yes, Yes, Yes, No, No, No}    Fair → {Yes, Yes, Yes, Yes, Yes, Yes, No, No}

Q: Which attribute is better for the classification task?


How to define purity of each region?

  • Classification error
  • Gini Index
  • Entropy
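  • The formulas themselves appear on the next slides as figures; as an illustrative sketch (not lecture code), here are the three measures for a two-class region whose class-1 proportion is p:

    import numpy as np

    def classification_error(p):
        """Error rate if the region predicts its majority class."""
        return 1 - max(p, 1 - p)

    def gini(p):
        """Gini index of a two-class region."""
        return 2 * p * (1 - p)

    def entropy(p):
        """Entropy (in bits) of a two-class region."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    for p in (0.0, 0.25, 0.5):
        print(p, classification_error(p), gini(p), entropy(p))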


Classification Error


Stopping Conditions & Pruning

  • If we don't terminate the decision tree learning algorithm, the tree will continue to grow until each region defined by the model contains exactly one training point (and the model attains 100% training accuracy).
  • To prevent this from happening, we can simply stop the algorithm at a particular depth.
  • But how do we determine the appropriate depth?


Variance vs Bias

  • We make some observations about our models:
  • (High Bias) A tree of depth 4 is not a good fit for the training data - it’s unable to capture the nonlinear boundary separating the two classes.
  • (Low Bias) With an extremely high depth, we can obtain a model that correctly classifies all points on the boundary (by zig-zagging around each point).
  • (Low Variance) The tree of depth 4 is robust to slight perturbations in the training data - the square carved out by the model is stable if you move the boundary points a bit.
  • (High Variance) Trees of high depth are sensitive to perturbations in the training data, especially to changes in the boundary points.
  • Not surprisingly, complex trees have low bias (able to capture more complex geometry in the data) but high variance (can overfit). Complex trees are also harder to interpret and more computationally expensive to train.


Stopping Conditions

  • Common simple stopping conditions:
    • Don't split a region if all instances in the region belong to the same class.
    • Don't split a region if the number of instances in a sub-region would fall below a pre-defined threshold (min_samples_leaf).
    • Don't split a region if the total number of leaves in the tree would exceed a pre-defined threshold.
  • The appropriate thresholds can be determined by evaluating the model on a held-out data set or, better yet, via cross-validation (a sketch follows this list).
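  • A hedged sketch of how these stopping conditions map onto sklearn's DecisionTreeClassifier, with the thresholds chosen by cross-validation (X and y are assumed to be the training features and labels):

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    # max_depth, min_samples_leaf and max_leaf_nodes mirror the conditions above.
    grid = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": [2, 4, 8, None],
                    "min_samples_leaf": [1, 5, 20],
                    "max_leaf_nodes": [None, 10, 50]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)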


Alternative to Using Stopping Conditions

  • What is the major issue with pre-specifying a stopping condition?
    • You may stop too early or stop too late.
  • How can we fix this issue?
    • Choose several stopping criteria and cross-validate to find which is best.
  • What is an alternative approach to this issue?
    • Don't stop. Instead, grow the full tree and prune it back (a sketch follows this list).
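  • sklearn exposes this grow-then-prune idea as cost-complexity pruning; a hedged sketch (the alpha grid is arbitrary, and X, y are as before):

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Larger ccp_alpha prunes the fully grown tree back more aggressively.
    for alpha in [0.0, 0.001, 0.01, 0.05]:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
        print(alpha, cross_val_score(tree, X, y, cv=5).mean())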


The Scalable Analytics Institute (ScAI)
Department of Computer Science
University of California, Los Angeles (UCLA)

Instructor: Jeehyun Hwang

Lecture 2: Clustering

Data Science for All 2023


Image Segmentation


Unsupervised Learning


WHAT IS CLUSTERING ANALYSIS?


What is Cluster Analysis?

  • Cluster: A collection of data objects
    • similar (or related) to one another within the same group
    • dissimilar (or unrelated) to the objects in other groups
  • Cluster analysis (or clustering, data segmentation, …)
    • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
  • Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
  • Typical applications
    • As a stand-alone tool to get insight into data distribution
    • As a preprocessing step for other algorithms


What is Cluster Analysis? - KNN


Applications of Cluster Analysis

  • Prediction based on groups
    • Cluster & find characteristics/patterns for each group
  • Finding K-nearest Neighbors
    • Localizing search to one or a small number of clusters
  • Outlier detection: Outliers are often viewed as those “far away” from any cluster

[Figures: clustering application examples - species taxonomy; marketing customer segments]

Clustering: Application Examples

  • Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
  • Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
  • Information retrieval: document clustering
  • Land use: identification of areas of similar land use in an earth observation database
  • City planning: identifying groups of houses according to their house type, value, and geographical location
  • Earthquake studies: observed earthquake epicenters should be clustered along continental faults
  • Climate: understanding the Earth's climate by finding patterns in atmospheric and ocean data


K-means objective

  • Partitioning method: partition a dataset D of n objects into a set of k clusters such that the sum of squared distances from each object to its cluster centroid is minimized (where cj is the centroid of cluster Cj; the objective is written out below)

  • Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
    • Global optimum: exhaustively enumerate all partitions
    • Heuristic method: k-means (MacQueen '67, Lloyd '57/'82), where each cluster is represented by its center
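  • Written out (a standard rendering of the objective described above, using the notation in the bullet):

    E = Σ_{j=1..k} Σ_{x in Cj} dist(x, cj)² ,   where cj is the centroid of cluster Cj.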


  • Q: suppose there are 5 data points in the data set, and we want to cluster them into 2 clusters. In total, how many ways are there for us to partition the data into two clusters?
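  • A worked answer, as a small counting argument: each of the 5 points can go to one of the 2 clusters, giving 2^5 = 32 assignments; dropping the 2 assignments that leave a cluster empty and dividing by 2 (the cluster labels are interchangeable) gives (32 − 2) / 2 = 15 distinct partitions.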


K-means objective – Iterative Process

  • Step-1: Select the number K of clusters.
  • Step-2: Select K random points as the initial centroids (they need not be points from the input dataset).
  • Step-3: Assign each data point to its closest centroid, forming the K clusters.
  • Step-4: Recompute the centroid of each cluster and reassign each data point to the new closest centroid.
  • Step-5: If any reassignment occurred, go back to Step-4; otherwise go to FINISH.
  • Step-6: The model is ready (a sketch of these steps follows this list).
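  • A minimal NumPy sketch of these steps (illustrative only, and it assumes no cluster ever becomes empty; in practice sklearn.cluster.KMeans does this with better initialization):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # Step 2
        for _ in range(n_iter):
            # Step 3: assign each point to its closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 4: recompute centroids; Step 5: stop when nothing changes
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids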

K-means Example

[Figures: step-by-step k-means example, and k-means applied to image segmentation]

Hierarchical Clustering

  • Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it does need a termination condition - WHY?

[Figure: agglomerative (AGNES) vs. divisive (DIANA) hierarchical clustering of points a, b, c, d, e]

    agglomerative (AGNES), bottom-up: step 0 {a} {b} {c} {d} {e} → {a,b}, {d,e} → {c,d,e} → step 4 {a,b,c,d,e}
    divisive (DIANA), top-down: the same steps in reverse, starting from {a,b,c,d,e}


Pseudo Code

  • Initialization: Place each data point into its own cluster and compute distance matrix between clusters
  • Repeat:
    • Merge the two closest clusters
    • Update the distance matrix for the affected entries
  • Until: all the data are merged into a single cluster
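  • In practice the loop above is handled by library routines; a hedged sketch using SciPy (X is assumed to be an n-by-d data matrix):

    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    Z = linkage(X, method="single")                    # repeatedly merge the two closest clusters
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the hierarchy into 3 clusters
    dendrogram(Z)                                      # visualize the merge history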

Hierarchical Clustering

[Figures: the agglomerative merging process shown step by step on an example dataset]

Density-Based Clustering Methods

  • Clustering based on density (a local cluster criterion), such as density-connected points
  • Major features:
    • Discovers clusters of arbitrary shape
    • Handles noise
    • One scan over the data
    • Needs density parameters as a termination condition
  • Several interesting studies:
    • DBSCAN: Ester et al. (KDD '96)
    • OPTICS*: Ankerst et al. (SIGMOD '99)
    • DENCLUE*: Hinneburg & Keim (KDD '98)
    • CLIQUE*: Agrawal et al. (SIGMOD '98) (more grid-based)


DBSCAN: Basic Concepts

  • Two parameters:
    • Eps: maximum radius of the neighborhood
    • MinPts: minimum number of points in an Eps-neighborhood of a point
  • Eps-neighborhood: NEps(q) = {p in D | dist(p, q) ≤ Eps}
  • q is a core point if |NEps(q)| ≥ MinPts
  • Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
    • q is a core point, and
    • p belongs to NEps(q)
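  • A small sketch that applies these two definitions directly (illustrative; X is assumed to be an n-by-d array of points):

    import numpy as np

    def core_points(X, eps, min_pts):
        """Boolean mask: True where |N_Eps(q)| >= MinPts (each point counts itself)."""
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        return (dists <= eps).sum(axis=1) >= min_pts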

[Figure: p is directly density-reachable from core point q; Eps = 1 cm, MinPts = 5]

[Figures: core, border, and noise points illustrated with Eps = 1 cm, MinPts = 5]

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

  • Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
  • Noise: an object not contained in any cluster is noise
  • Discovers clusters of arbitrary shape in spatial databases with noise (a usage sketch follows this list)
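  • A hedged usage sketch with sklearn (the parameter values are placeholders; sklearn labels noise points as -1):

    from sklearn.cluster import DBSCAN

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(n_clusters, "clusters;", (labels == -1).sum(), "noise points")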


DBSCAN: The Algorithm


DBSCAN: can detect clusters with arbitrary shapes


DBSCAN: sensitive to the minimum number of points (MinPts)


DBSCAN: sensitive to Eps


K-means vs DBSCAN

[Figures: side-by-side comparison of K-means and DBSCAN clustering results on several datasets]

Summary

  • Cluster analysis groups objects based on their similarity and has wide applications.
  • The k-means algorithm is a popular partitioning-based clustering algorithm.
  • Hierarchical clustering algorithms build a nested hierarchy of clusters (agglomerative or divisive).
  • DBSCAN is a density-based clustering algorithm.
