
4 of 46


INTRODUCTION TO REGRESSION ANALYSIS

5 of 46

LEARNING OBJECTIVES

  • Define data modeling and simple linear regression
  • Build a linear regression model with the scikit-learn library, using a dataset that meets the linearity assumption
  • Understand and identify multicollinearity in a multiple regression

INTRODUCTION TO REGRESSION ANALYSIS

6 of 46

INTRODUCTION TO REGRESSION ANALYSIS

PRE-WORK

7 of 46

  • Effectively show correlations between an independent variable x and a dependent variable y
  • Be familiar with the get_dummies function in pandas
  • Understand the difference between vectors, matrices, Series, and DataFrames
  • Understand the concepts of outliers and distance
  • Be able to interpret p-values and confidence intervals

PRE-WORK REVIEW

8 of 46

OPENING

INTRODUCTION TO REGRESSION ANALYSIS

9 of 46

  • Data has been acquired and parsed.

  • Today we’ll refine the data and build models.

  • We’ll also use plots to represent the results.

WHERE ARE WE IN THE DATA SCIENCE WORKFLOW?

10 of 46

INTRODUCTION

SIMPLE LINEAR REGRESSION

11 of 46

  • Definition: a model that explains a continuous dependent variable using a series of independent variables

  • The simplest version is just a line of best fit: y = mx + b

  • It explains the relationship between x and y using the starting point (intercept) b and the slope m.
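The line of best fit can be computed directly. A minimal sketch with NumPy, using made-up data (the values below are illustrative, not from the lesson's dataset):

```python
import numpy as np

# hypothetical data that roughly follows y = 2x + 1 plus a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# a degree-1 polynomial fit returns the slope m and intercept b
m, b = np.polyfit(x, y, 1)
print(m, b)  # slope near 2, intercept near 1
```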

SIMPLE LINEAR REGRESSION

12 of 46

  • More generally, linear regression uses linear algebra to explain the relationship between multiple x’s and y.

  • The more sophisticated version: y = X * beta + alpha (+ error)

  • It explains the relationship between a matrix of predictors X and a dependent vector y using a y-intercept alpha and a vector of relative coefficients beta.

SIMPLE LINEAR REGRESSION

13 of 46

  • Linear regression works best when:

    • The data is roughly normally distributed (but doesn’t have to be)

    • The x’s significantly explain y (have low p-values)

    • The x’s are independent of each other (low multicollinearity)

    • The residuals satisfy the linearity assumption (depends upon the problem)

  • If the data is not normally distributed, we could introduce bias into the model.

SIMPLE LINEAR REGRESSION

14 of 46

DEMO

REGRESSING AND NORMAL DISTRIBUTIONS

15 of 46

  • Follow along with your starter code notebook while I walk through these examples.

  • The first plot shows a relationship between two values, though not a linear one.

  • Note that lmplot() always fits and draws a straight regression line.

  • However, we can log-transform both variables (a log-log transform) to obtain a linear relationship.
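Why the log-log transform works can be shown with a small sketch. The power-law data below is made up for illustration; after taking logs of both variables, the relationship is exactly linear, so a straight-line fit (like the one lmplot() draws) becomes appropriate:

```python
import numpy as np

# hypothetical power-law data: y = 3 * x**2, which is curved on a raw scale
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 3.0 * x ** 2

# after a log-log transform the relationship is linear:
# log(y) = log(3) + 2 * log(x)
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print(slope, intercept)  # slope 2, intercept log(3)
```

In the notebook, the same idea amounts to calling seaborn's lmplot() on log-transformed columns instead of the raw ones.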

DEMO: REGRESSING AND NORMAL DISTRIBUTIONS

16 of 46

GUIDED PRACTICE

USING SEABORN TO GENERATE SIMPLE LINEAR MODEL PLOTS

17 of 46

EXERCISE

  1. Update and complete the code in the starter notebook to use lmplot and display correlations between body weight and two dependent variables: sleep_rem and awake.

Two plots

DELIVERABLE

DIRECTIONS (15 minutes)

ACTIVITY: GENERATE SINGLE VARIABLE LINEAR MODEL PLOTS

18 of 46

INTRODUCTION

SIMPLE REGRESSION ANALYSIS IN SKLEARN

19 of 46

  • Sklearn defines models as objects (in the OOP sense).

  • You can use the following principles:

    • All sklearn modeling classes are based on the base estimator. This means all models take a similar form.

    • All estimators take a matrix X, either sparse or dense.

    • Supervised estimators also take a vector y (the response).

    • Estimators can be customized by setting the appropriate parameters.

SIMPLE LINEAR REGRESSION ANALYSIS IN SKLEARN

20 of 46

  • Classes are an abstraction for a complex set of ideas, e.g. human.

  • Specific instances of classes can be created as objects.
    • john_smith = human()

  • Objects have properties. These are attributes or other information.
    • john_smith.age
    • john_smith.gender

  • Objects have methods. These are procedures associated with a class/object.
    • john_smith.breathe()
    • john_smith.walk()
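The human example above can be written out as a minimal Python class (the attribute values are placeholders):

```python
# a sketch of the "human" class from the bullets above
class Human:
    def __init__(self, age, gender):
        # properties: attributes attached to each instance
        self.age = age
        self.gender = gender

    # methods: procedures associated with the class
    def breathe(self):
        return "inhale, exhale"

    def walk(self):
        return "one foot in front of the other"

# a specific instance of the class, created as an object
john_smith = Human(age=30, gender="male")
print(john_smith.age)        # accessing a property
print(john_smith.breathe())  # calling a method
```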

CLASSES AND OBJECTS IN OBJECT ORIENTED PROGRAMMING

21 of 46

  • General format for sklearn model classes and methods

  # generate an instance of an estimator class
  estimator = base_models.AnySKLearnObject()
  # fit your data
  estimator.fit(X, y)
  # score it with the default scoring method (recommended to use the metrics module in the future)
  estimator.score(X, y)
  # predict a new set of data
  estimator.predict(new_X)
  # transform a new X if changes were made to the original X while fitting
  estimator.transform(new_X)

  • LinearRegression() doesn’t have a transform method

  • With this information, we can build a simple process for linear regression.
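The fit/score/predict pattern above can be run end to end with a concrete estimator. A minimal sketch using LinearRegression on made-up data (the numbers are illustrative only):

```python
import numpy as np
from sklearn import linear_model

# made-up data that exactly follows y = 2*x0 + 3*x1 + 5
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]], dtype=float)
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

estimator = linear_model.LinearRegression()
estimator.fit(X, y)                     # fit your data
r2 = estimator.score(X, y)              # default scoring for regression is R^2
preds = estimator.predict([[3.0, 2.0]])  # predict a new set of data
print(r2, preds[0])
```

Note that, as the slide says, LinearRegression has no transform step; fit, score, and predict are the whole workflow here.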

SIMPLE LINEAR REGRESSION ANALYSIS IN SKLEARN

22 of 46

DEMO

SIGNIFICANCE IS KEY

23 of 46

  • Follow along with your starter code notebook while I walk through these examples.

  • What does the residual plot tell us?

  • How can we use the linear assumption?

DEMO: SIGNIFICANCE IS KEY

24 of 46

GUIDED PRACTICE

USING THE LINEAR REGRESSION OBJECT

25 of 46

EXERCISE

  • With a partner, generate two more models using the log-transformed data to see how this transform changes the model’s performance.
  • Use the code on the following slide to complete #1.

Two new models

DELIVERABLE

DIRECTIONS (15 minutes)

ACTIVITY: USING THE LINEAR REGRESSION OBJECT

26 of 46

EXERCISE

  X =
  y =
  loop = []
  for boolean in loop:
      print('y-intercept:', boolean)
      lm = linear_model.LinearRegression(fit_intercept=boolean)
      get_linear_model_metrics(X, y, lm)
      print()

Two new models

DELIVERABLE

DIRECTIONS (15 minutes)

ACTIVITY: USING THE LINEAR REGRESSION OBJECT

27 of 46

INDEPENDENT PRACTICE

BASE LINEAR REGRESSION CLASSES

28 of 46

EXERCISE

  • Experiment with the model evaluation function we have (get_linear_model_metrics) with the following sklearn estimator classes.

    • linear_model.Lasso()
    • linear_model.Ridge()
    • linear_model.ElasticNet()

Note: We’ll cover these new regression techniques in a later class.

New models and evaluation metrics

DELIVERABLE

DIRECTIONS (20 minutes)

ACTIVITY: BASE LINEAR REGRESSION CLASSES

29 of 46

INTRODUCTION

MULTIPLE REGRESSION ANALYSIS

30 of 46

  • Simple linear regression with one variable can explain some variance, but using multiple variables can be much more powerful.

  • We want our multiple variables to be mostly independent to avoid multicollinearity.

  • Multicollinearity, when two or more predictors in a regression are highly correlated, makes the coefficient estimates unstable and hard to interpret.
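Checking pairwise correlations is the quickest way to spot this. A small sketch with synthetic data (the variables here are invented to make the effect obvious): a near-duplicate feature shows a correlation close to 1, while an independent one sits near 0.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # nearly a copy of x1 -> collinear
x3 = rng.normal(size=200)                   # generated independently

# correlation near 1 flags multicollinearity; near 0 is safe to include
r_collinear = np.corrcoef(x1, x2)[0, 1]
r_independent = np.corrcoef(x1, x3)[0, 1]
print(r_collinear, r_independent)
```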

MULTIPLE REGRESSION ANALYSIS

31 of 46

  • We can look at a correlation matrix of our bike data.

  • Even if adding correlated variables to the model improves the overall explained variance, it can introduce problems when explaining the output of your model.

  • What happens if we use a second variable that isn't highly correlated with temperature?

BIKE DATA EXAMPLE

32 of 46

GUIDED PRACTICE

MULTICOLLINEARITY WITH DUMMY VARIABLES

33 of 46

EXERCISE

  • Load the bike data.
  • Run through the code on the following slide.
  • What happens to the coefficients when you include all weather situations instead of just including all except one?

Two models’ output

DELIVERABLE

DIRECTIONS (15 minutes)

ACTIVITY: MULTICOLLINEARITY WITH DUMMY VARIABLES

34 of 46

EXERCISE

  lm = linear_model.LinearRegression()
  weather = pd.get_dummies(bike_data.weathersit)
  get_linear_model_metrics(weather[[1, 2, 3, 4]], y, lm)
  print()
  # drop the least significant, weather situation = 4
  get_linear_model_metrics(weather[[1, 2, 3]], y, lm)

Two models’ output

DELIVERABLE

DIRECTIONS (15 minutes)

ACTIVITY: MULTICOLLINEARITY WITH DUMMY VARIABLES

35 of 46

GUIDED PRACTICE

COMBINING FEATURES INTO A BETTER MODEL

36 of 46

EXERCISE

  • With a partner, complete the code on the following slide.
  • Visualize the correlations of all the numerical features built into the dataset.
  • Add the three significant weather situations into our current model.
  • Find two more features that are not correlated with the current features, but could be strong indicators for predicting guest riders.

DIRECTIONS (15 minutes)

ACTIVITY: COMBINING FEATURES INTO A BETTER MODEL

Visualization of correlations, new models

DELIVERABLE

37 of 46

EXERCISE

  lm = linear_model.LinearRegression()
  bikemodel_data = bike_data.join()  # add in the three weather situations

  cmap = sns.diverging_palette(220, 10, as_cmap=True)
  correlations =  # what are we getting the correlations of?
  print(correlations)
  sns.heatmap(correlations, cmap=cmap)

  columns_to_keep = []  # [which_variables?]
  final_feature_set = bikemodel_data[columns_to_keep]

  get_linear_model_metrics(final_feature_set, y, lm)

DIRECTIONS (15 minutes)

ACTIVITY: COMBINING FEATURES INTO A BETTER MODEL

Visualization of correlations, new models

DELIVERABLE

38 of 46

INDEPENDENT PRACTICE

BUILDING MODELS FOR OTHER Y VARIABLES

39 of 46

EXERCISE

  • Build a new model using a new y variable: registered riders.
  • Pay attention to the following:
    • the distribution of riders (should we rescale the data?)
    • checking correlations between the variables and y variable
    • choosing features to avoid multicollinearity
    • model complexity vs. explanation of variance
    • the linear assumption
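On the first checkpoint, rescaling: ridership counts are typically right-skewed, and one common fix is a log transform. A small sketch on invented counts (log1p is used so that zero counts remain valid):

```python
import numpy as np

# hypothetical skewed count data, e.g. daily registered riders
riders = np.array([5, 12, 30, 80, 200, 900], dtype=float)

# log1p compresses the long right tail while preserving the ordering
rescaled = np.log1p(riders)
print(rescaled)
```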

A new model and evaluation metrics

DELIVERABLE

DIRECTIONS (25 minutes)

ACTIVITY: BUILDING MODELS FOR OTHER Y VARIABLES

  • Which variables make sense to dummy?
  • What features might explain ridership but aren’t included? Can you build these features with the included data and pandas?

BONUS

40 of 46

CONCLUSION

TOPIC REVIEW

41 of 46

  • You should now be able to answer the following questions:

    • What is simple linear regression?

    • What makes multi-variable regressions more useful?

    • What challenges do they introduce?

    • How do you dummy a categorical variable?

    • How do you avoid a singular matrix?
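The last two questions connect: dummying every category of a variable produces columns that always sum to 1, so together with an intercept the design matrix is singular (the "dummy trap"). Dropping one category, as in the weathersit exercise, removes the dependence. A minimal sketch with invented category codes:

```python
import pandas as pd

weather = pd.Series([1, 2, 3, 1, 2, 4], name="weathersit")

# one column per category; every row sums to 1, so with an
# intercept the design matrix is singular
all_dummies = pd.get_dummies(weather)

# dropping one category breaks the linear dependence
safe_dummies = pd.get_dummies(weather, drop_first=True)
print(all_dummies.shape, safe_dummies.shape)
```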

CONCLUSION

42 of 46

WEEK 3 : LESSON 6

UPCOMING WORK

43 of 46

Week 4 : Lesson 8

  • Project: Final Project, Deliverable 1

UPCOMING WORK

44 of 46

Q & A

INTRODUCTION TO REGRESSION ANALYSIS

45 of 46

EXIT TICKET

INTRODUCTION TO REGRESSION ANALYSIS

DON’T FORGET TO FILL OUT YOUR EXIT TICKET!

46 of 46

THANKS!
