
Introduction to Data Science

By

S.V.V.D.Jagadeesh

Sr. Assistant Professor

Dept of Artificial Intelligence & Data Science

LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING

Previously Discussed Topics

  • Unit-II Outcomes
  • Machine Learning
  • Applications of ML in Data Science
  • Benefits and Challenges of Data Science
  • Python tools used in ML
  • Packages for Working With Data in Memory
  • Optimizing Operations


Session Outcomes

At the end of this session, students will be able to:

  • Understand the machine learning modeling process (Understand - L2)


Machine Learning Modeling Process

  • The modeling phase consists of four steps:

1. Feature engineering and model selection
2. Training the model
3. Model validation and selection
4. Applying the trained model to unseen data


Feature Engineering And Model Selection

  • When engineering features, you must come up with and create possible predictors for the model.
  • This is one of the most important steps in the process because a model recombines these features to achieve its predictions.
  • Often you may need to consult an expert or the appropriate literature to come up with meaningful features.
  • Certain features are simply the variables you get from a data set and can be used as-is.
  • In practice you’ll need to find most features yourself, and they may be scattered among different data sets, as the sketch below illustrates.
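
A minimal sketch, in Python with pandas, of pulling together features that are scattered across data sets. The file names and column names (customers.csv, orders.csv, customer_id, amount) are hypothetical and serve only to illustrate the idea:

    # Assemble features scattered across two (hypothetical) data sets.
    import pandas as pd

    customers = pd.read_csv("customers.csv")   # e.g. customer_id, age, region
    orders = pd.read_csv("orders.csv")         # e.g. customer_id, amount, date

    # Derive a candidate predictor from one data set: total spending per customer.
    spending = (orders.groupby("customer_id")["amount"]
                .sum()
                .rename("total_spent")
                .reset_index())

    # Combine it with the variables already available for each customer.
    features = customers.merge(spending, on="customer_id", how="left")
    features["total_spent"] = features["total_spent"].fillna(0)
    print(features.head())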



  • Often you’ll need to apply a transformation to an input before it becomes a good predictor or to combine multiple inputs.
  • An example of combining multiple inputs would be interaction variables: the impact of either single variable is low, but if both are present their impact becomes immense, as the sketch after this list illustrates.
  • This is especially true in chemical and medical environments.
  • Sometimes you have to use modeling techniques to derive features: the output of a model becomes part of another model.
  • This isn’t uncommon, especially in text mining.
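
A minimal sketch of an interaction feature. The dose columns (dose_a, dose_b) and their values are made up purely for illustration; the effect is only large when both inputs are present:

    # Interaction variable: the product of two inputs becomes a new predictor.
    import pandas as pd

    df = pd.DataFrame({
        "dose_a": [0.0, 1.0, 0.0, 1.0],
        "dose_b": [0.0, 0.0, 1.0, 1.0],
        "effect": [0.1, 0.2, 0.2, 5.0],   # large effect only when both doses are present
    })

    # The interaction term captures the combined impact of both variables.
    df["dose_a_x_dose_b"] = df["dose_a"] * df["dose_b"]
    print(df)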



  • One of the biggest mistakes in model construction is the availability bias: your features are only the ones that you could easily get your hands on and your model consequently represents this one-sided “truth.”
  • Models suffering from availability bias often fail when they’re validated because it becomes clear that they’re not a valid representation of the truth.
  • Once the initial features are created, a model can be trained on the data.


Training the Model

  • With the right predictors in place and a modeling technique in mind, you can progress to model training.
  • In this phase you present your model with data from which it can learn.
  • The most common modeling techniques have industry-ready implementations in almost every programming language, including Python.
  • These enable you to train your models by executing a few lines of code, as the sketch below shows.
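
A minimal sketch of those few lines, training a scikit-learn logistic regression on a synthetic data set; the data and the choice of model are assumptions made only for the example:

    # Train a model in a few lines with scikit-learn, on synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=5, random_state=42)

    model = LogisticRegression()
    model.fit(X, y)                  # training: the model learns from the presented data
    print(model.predict(X[:5]))      # predictions for the first five observations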



  • For more state-of-the-art data science techniques, you’ll probably end up doing heavy mathematical calculations and implementing them with modern computer science techniques.
  • Once a model is trained, it’s time to test whether it can be extrapolated to reality: model validation.


Validating a Model

  • Data science has many modeling techniques, and the question is which one is the right one to use.
  • A good model has two properties: it has good predictive power and it generalizes well to data it hasn’t seen.
  • To achieve this you define an error measure (how wrong the model is) and a validation strategy.
  • Two common error measures in machine learning are the classification error rate for classification problems and the mean squared error for regression problems.



  • The classification error rate is the percentage of observations in the test data set that your model mislabeled; lower is better.
  • The mean squared error measures how big the average squared error of your predictions is.
  • Squaring the error has two consequences: you can’t cancel out a wrong prediction in one direction with a faulty prediction in the other direction, and bigger errors get weighted more heavily than smaller ones. Both measures are computed in the sketch below.
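
A minimal sketch of both error measures, computed with NumPy on made-up true values and predictions:

    import numpy as np

    # Classification error rate: share of mislabeled observations (lower is better).
    y_true = np.array([0, 1, 1, 0, 1])
    y_pred = np.array([0, 1, 0, 0, 1])
    print(np.mean(y_true != y_pred))     # 0.2 -> 20% of the labels are wrong

    # Mean squared error: average of the squared prediction errors.
    t = np.array([3.0, 5.0, 2.5])        # true values
    p = np.array([2.5, 5.0, 4.0])        # predictions
    print(np.mean((t - p) ** 2))         # (0.25 + 0.0 + 2.25) / 3 = 0.83...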



  • Many validation strategies exist, including the following common ones:

■ Dividing your data into a training set with X% of the observations and keeping the rest as a holdout data set (a data set that’s never used for model creation)—This is the most common technique.

■ K-folds cross validation—This strategy divides the data set into k parts and uses each part one time as a test data set while using the others as a training data set.

This has the advantage that you use all the data available in the data set. Both strategies are illustrated in the sketch below.
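
A minimal sketch of the holdout split and k-folds cross validation with scikit-learn; the synthetic data set and the logistic regression model are assumptions made only for the example:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    model = LogisticRegression()

    # Holdout: keep 20% of the observations aside and never train on them.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model.fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))

    # K-folds cross validation: every observation is used exactly once as test data.
    scores = cross_val_score(model, X, y, cv=5)
    print("5-fold accuracies:", scores)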



■ Leave-1 out—This approach is the same as k-folds cross validation, but with k equal to the number of observations, so each fold holds out a single observation.

You always leave one observation out and train on the rest of the data.

This is used only on small data sets, so it’s more valuable to people evaluating laboratory experiments than to big data analysts.

■ Another popular term in machine learning is regularization.

When applying regularization, you incur a penalty for every extra variable used to construct the model. The sketch below shows both leave-one-out validation and a regularized model.
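
A minimal sketch of leave-one-out validation and of a regularized (Lasso) regression, on a synthetic data set in which only a few of the candidate variables actually drive the outcome; the data and the model choices are assumptions made only for the example:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    # Small data set: 8 candidate features, only 3 of them drive the outcome.
    X, y = make_regression(n_samples=30, n_features=8, n_informative=3,
                           noise=5.0, random_state=1)

    # Leave-one-out: as many folds as observations, each fold holds out one row.
    scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    print("leave-one-out MSE:", -scores.mean())

    # Regularization: the alpha penalty shrinks the coefficients of variables
    # that add little, discouraging the model from using extra predictors.
    lasso = Lasso(alpha=1.0).fit(X, y)
    print("lasso coefficients:", lasso.coef_.round(2))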


Predicting New Observations

  • The process of applying your model to new data is called model scoring.
  • In fact, model scoring is something you implicitly did during validation, only now you don’t know the correct outcome.
  • Model scoring involves two steps.
  • First, you prepare a data set that has features exactly as defined by your model (this repeats the data preparation you did in step one of the modeling process, but for the new data set).
  • Then you apply the model to this new data set, and this results in a prediction, as the sketch below shows.
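
A minimal sketch of model scoring with scikit-learn: train on historical data (synthetic here), prepare new observations with exactly the same features, then apply the model; the data, feature values, and model choice are assumptions made only for the example:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Train on historical data (synthetic here).
    X, y = make_classification(n_samples=200, n_features=4, random_state=7)
    model = LogisticRegression().fit(X, y)

    # New, unseen observations prepared with the same four features in the same order.
    X_new = np.array([[0.2, -1.3, 0.8, 0.1],
                      [1.5, 0.4, -0.7, 2.0]])

    print(model.predict(X_new))          # predicted class labels (the "score")
    print(model.predict_proba(X_new))    # predicted class probabilities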


Summary

  • Session Outcomes
  • Machine Learning Modeling Process
  • Feature Engineering and Model Selection
  • Training the Model
  • Validating a Model
  • Predicting New Observations
