1 of 80

R-Radiant

2 of 80

3 of 80

heights.csv

4 of 80

Visualize->scatter

5 of 80

Create a new variable to denote who is taller in a mother-daughter pair

6 of 80

Change its type to factor

7 of 80

8 of 80

Create another variable to break the data into 3 tiers depending on the mother's height

9 of 80

Change its type to factor

Need to reorder levels to something more natural

10 of 80

Reorder levels

Much better

11 of 80

Interpret!

12 of 80

13 of 80

Simple model = linear regression model

f(x1, x2, x3, x4, …) = y

14 of 80

Sales [M$] By State

Nature.rda

15 of 80

Total Deviation: Total Sum of Squares (SST)

16 of 80

Sales [M$] by advertising expenditures & state

17 of 80

A linear relationship?

18 of 80

A linear relationship?

Fitted line: ŷ = b0 + b1x (intercept b0; slope b1 = change in ŷ per unit change in x)

19 of 80

Unexplained deviation (“errors”)

Expected sales at $0.8M Advertising?

Actual sales at $0.8M advertising

Unexplained deviation “Error”

ŷ = b0 + b1(0.8)

Predicted value

20 of 80

Total deviation = explained deviation + error

[Figure: at $0.8M advertising, the total deviation of actual sales from the mean splits into the explained deviation plus the unexplained deviation ("error")]

21 of 80

Unexplained deviation (“errors” or “residuals”)

22 of 80

Unexplained deviation: Sum of Squared Errors (SSE)

23 of 80

Unexplained deviation: Sum of Squared Errors (SSE)

Regression (OLS) determines the line that minimizes the Sum of Squared Errors, i.e., b0 and b1 are determined such that they minimize Σi (yi − b0 − b1xi)²

Note: Also known as Residual Sum of Squares
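The minimization above has a closed-form solution in the simple (one-predictor) case: b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄. A minimal sketch on made-up numbers (not the Nature.rda data):

```python
# Closed-form simple OLS: b1 = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)², b0 = ȳ − b1·x̄.
# The data below are hypothetical, for illustration only.
xs = [0.2, 0.4, 0.6, 0.8, 1.0]  # advertising ($M), assumed
ys = [80, 95, 105, 120, 130]    # sales ($M), assumed

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
```

Radiant computes these for you; the point is only that the line is fully determined by the data.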

24 of 80

Explained deviation

25 of 80

Explained deviation: Sum of Squares Regression(SSR)

26 of 80

Adding up deviations:

SST = SSR + SSE

(Total Sum of Squares = Sum of Squares Regression + Sum of Squared Errors)
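The identity SST = SSR + SSE can be checked numerically on a tiny hypothetical dataset with its OLS-fitted line:

```python
# Check SST = SSR + SSE on made-up data. The identity holds exactly
# because (b0, b1) is the OLS fit for this toy data.
xs = [0.2, 0.4, 0.6, 0.8, 1.0]
ys = [80, 95, 105, 120, 130]
b0, b1 = 68.5, 62.5  # OLS estimates for this toy data

y_bar = sum(ys) / len(ys)
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total deviation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained
assert abs(sst - (ssr + sse)) < 1e-6
```

Note the identity only holds for the least-squares line; an arbitrary line would not split the deviations this cleanly.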

27 of 80

Simple Linear Regression – The Basic Model

  • Observed data: (x1, y1), (x2, y2),…, (xn, yn)
    • x1, x2,…, xn are n given values of the independent variable
    • y1, y2,…, yn are the corresponding n observations of the dependent variable

  • Model of the population: Yi = β0+ β1xi+ Ɛi
    • Ɛ1 , Ɛ2 ,…, Ɛn are i.i.d. random variables, N(0, σ)
    • This is the true relation between Y and x, yet we do not know β0 and β1 and have to estimate them based on the observed data.

28 of 80

Population Model: Yi = β0 + β1xi + Ɛi, with Ɛi ~ N(0, σ)

[Figure: regression line with intercept β0 and slope β1]

29 of 80

Comments

  • Relationship is linear – described by a “line”
  • β0 = “baseline” value of Y
    • (i.e. value of Y if x is 0)
  • β1 = “slope” of line
    • (average change in Y per unit change in x)

E(Yi|xi) = β0 + β1xi

SD(Yi|xi) = σ

30 of 80

Notation

  • Regression coefficients: b0 and b1 are estimates of β0 and β1

  • Regression estimate for Y at xi (prediction): ŷi = b0 + b1xi

  • Residuals: ei = yi - ŷi

31 of 80

Realization: yi = b0+ b1xi+ ei

[Figure: observation y5 with residual e5 around the fitted value ŷ5 = b0 + b1x5, on the line with intercept b0 and slope b1]

32 of 80

Multiple Linear Regression

  • Observed data:
    • (x11, x21,…, xk1), (x12, x22,…, xk2),…, (x1n, x2n,…, xkn) are records of given values of the explanatory variables
    • y1, y2,…, yn are the n observations of the dependent variable

  • Population model:
    • Yi = β0+ β1x1i+ … + βkxki + Ɛi
    • Ɛ1 , Ɛ2 , … Ɛn are i.i.d. random variables, N(0, σ)

33 of 80

Multiple Linear Regression

  • Regression coefficients: b0,b1,…, bk are estimates of β0 , β1,..., βk
  • Prediction for Y at xi: ŷi = b0 + b1x1i + … + bkxki
  • Residual (error): ei = yi - ŷi
  • Goal: choose b0, b1,…, bk to minimize the sum of squared errors
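One way to see what "minimize the sum of squared errors" means computationally is the normal equations (XᵀX)b = Xᵀy, which the least-squares coefficients satisfy. A self-contained sketch on made-up data with k = 2 predictors (Radiant does all of this for you):

```python
# Sketch: multiple regression via the normal equations (XᵀX) b = Xᵀy.
# All numbers are hypothetical, for illustration only.

def solve(A, v):
    """Solve A b = v by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    b = [0.0] * m
    for r in range(m - 1, -1, -1):
        b[r] = (M[r][m] - sum(M[r][c] * b[c] for c in range(r + 1, m))) / M[r][r]
    return b

# Design matrix: an intercept column of 1s plus k = 2 predictors
X = [[1, 0.2, 1.0], [1, 0.4, 0.8], [1, 0.6, 1.2], [1, 0.8, 0.9], [1, 1.0, 1.5]]
y = [80, 95, 105, 120, 130]

n, p = len(X), len(X[0])
XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
b = solve(XtX, Xty)  # [b0, b1, b2] — the estimates of β0, β1, β2
```

A defining property of the least-squares solution is that the residuals are orthogonal to every column of X, which is what the normal equations encode.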

34 of 80

Estimating a regression model

in Radiant

Nature.rda

35 of 80

Plot or perish!

36 of 80

Do the ‘signs’ make sense?

37 of 80

Regression output

38 of 80

Regression output: Coefficients

b0, b1,…, bk are estimates of the true parameters β0, β1,…, βk

39 of 80

Interpreting regression coefficients

  • Regression coefficients: b0, b1,…, bk are estimates of β0, β1,…, βk. Fact: E(bi) = βi

  • Example:
    • b0 = 65.70 (Intersection with Y-axis. Interpret with care or ignore)
    • b1= 48.98 ($1 million increase in advertising is expected to increase sales by $49 million, keeping all else constant)
    • b2= 59.65 ($1 million increase in promotions is expected to increase sales by $59.7 million, keeping all else constant)
    • b3= -1.84 ($1 million increase in competitor sales is expected to decrease sales by $1.8 million, keeping all else constant)
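Plugging the slide's coefficients into the fitted equation turns them into a prediction. The input values below are hypothetical, chosen only to show the arithmetic:

```python
# Prediction from the estimated equation on this slide:
#   ŷ = 65.70 + 48.98·advertising + 59.65·promotions − 1.84·competitor_sales
b0, b1, b2, b3 = 65.70, 48.98, 59.65, -1.84

def predict(adv, promo, comp):
    return b0 + b1 * adv + b2 * promo + b3 * comp

# Hypothetical inputs: $0.8M advertising, $0.5M promotions, $30M competitor sales
sales_hat = predict(0.8, 0.5, 30)  # predicted sales in $M
```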

40 of 80

Regression output: Standard Error (an estimate of σ)

41 of 80

Regression output: Standard Error

  • Standard error s: an estimate of σ, the standard deviation of the Ɛi. It is a measure of the spread of the data around the line and represents the amount of “noise” in the model.
    • Example: s = 17.60
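s is computed from the SSE and the residual degrees of freedom, s = √(SSE / (n − k − 1)), where k is the number of predictors. A sketch with made-up numbers (not the slide's s = 17.60):

```python
import math

# s estimates σ: s = sqrt(SSE / (n − k − 1)).
# SSE, n, and k below are hypothetical.
sse, n, k = 7.5, 5, 1
s = math.sqrt(sse / (n - k - 1))  # degrees of freedom = n − k − 1
```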

42 of 80

Regression output: Sum of Squares

SST: “Total Sum of Squares”

SSE: “Sum of Squared Errors”

SSR: “Sum of Squares Regression”

43 of 80

Regression output: R²

Coefficient of determination (R²)

A measure of the overall strength of the relationship between the Response and Predictor variables

How much of the variation has been explained?

44 of 80

Understanding regression output

45 of 80

Coefficient of determination: R²

  • R² = 1: x values account for all of the variation in the Y values
  • R² = 0: x values account for none of the variation in the Y values
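R² is the explained fraction of the total variation: R² = SSR / SST, equivalently 1 − SSE / SST. A one-line check with hypothetical sums of squares:

```python
# R² = SSR / SST = 1 − SSE / SST (hypothetical sums of squares)
sst, ssr, sse = 1570.0, 1562.5, 7.5
r2 = ssr / sst
assert abs(r2 - (1 - sse / sst)) < 1e-12  # the two forms agree
```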

46 of 80

Coefficient of determination: R²

  • R² = 0.8: x values account for most of the variation in the Y values
  • R² = 0.4: x values account for some of the variation in the Y values

47 of 80

Standard errors (standard deviations) of the coefficients

48 of 80

T values for the coefficients

The "distance" of each coefficient from 0, measured in standard errors: t.value = coeff. / std.error
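The t-value is just this ratio; the coefficient and standard error below are hypothetical:

```python
# t.value = coeff. / std.error (hypothetical coefficient and SE)
coeff, std_error = 48.98, 10.66
t_value = coeff / std_error  # ≈ 4.59 standard errors away from 0
```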

49 of 80

P values for the coefficients

The p.value is the probability of observing a t.value at least this large in magnitude if the null hypothesis (βi = 0) is true

50 of 80

95% confidence interval for the coefficients.

If the interval does not contain 0, our confidence that the variable has explanatory power is (at least) 95%.
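The interval itself is simple arithmetic: b ± t*·SE, where t* is the 0.975 quantile of the t distribution with n − k − 1 degrees of freedom (≈ 1.96 for large samples). The coefficient and SE below are hypothetical:

```python
# 95% CI for a coefficient: b ± t* · SE.
# t* ≈ 1.96 (large-sample value); b and SE are hypothetical.
b, se, t_star = 48.98, 10.66, 1.96
lo, hi = b - t_star * se, b + t_star * se
contains_zero = lo <= 0 <= hi  # False → significantly different from 0
```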

51 of 80

Interpreting confidence intervals

If the interval DOES NOT contain 0 we conclude that βi is significantly different from zero

[Figure: a "good" confidence interval lies entirely away from 0; a "bad" one contains 0]

52 of 80

Interpreting confidence intervals

53 of 80

Let’s practice with the diamond data!!

54 of 80

Split data for validation

Training data for building model

Test data for validation of the model

-Create a new variable to flag training/test rows

-Training rows get 1

-Test rows get 0
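The split described above can be sketched outside Radiant as a random 0/1 flag per row. The row count and seed are assumptions, not values from the diamond data:

```python
import random

# Sketch of a 70/30 split: draw a random 0/1 'training' flag per row.
# n_rows and the seed are hypothetical.
random.seed(42)
n_rows = 3000
training = [1 if random.random() < 0.7 else 0 for _ in range(n_rows)]

train_rows = [i for i, t in enumerate(training) if t == 1]
test_rows = [i for i, t in enumerate(training) if t == 0]
print(len(train_rows), len(test_rows))  # roughly 2100 / 900
```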

55 of 80

Data->transform->transformation type-> training variable

56 of 80

70% of the rows get a 1 in the ‘training’ variable, via a random function; this enables various selections.

2,100 of the 3,000 rows have training = 1

57 of 80

data -> view -> filter -> training ==1 -> store in ‘diamond_training’

data -> view -> filter -> training ==0 -> store in ‘diamond_test’

58 of 80

Select ‘diamond_training’ and build linear regression

59 of 80

The p.value for x is 0.65, so there is no evidence that x contributes to price

60 of 80

Let’s look at the data and variables.

  • carat = weight of the diamond (0.2–3.00)
  • depth = total depth percentage = z / mean(x, y) = 2 * z / (x + y) (54.2–70.80)
  • table = width of top of diamond relative to widest point (50–69)
  • x = length in mm (3.73–9.42)
  • y = width in mm (3.71–9.29)
  • z = depth in mm (2.33–5.58)

These variables all describe the diamond’s ‘size’
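The depth variable is derived directly from the size variables, which is worth seeing in code. The example stone's dimensions are hypothetical:

```python
# depth is the total depth percentage, derived from x, y, z:
#   depth (%) = 100 * z / mean(x, y) = 100 * 2 * z / (x + y)
def depth_pct(x, y, z):
    return 100 * 2 * z / (x + y)

print(depth_pct(4.0, 4.0, 2.4))  # → 60.0 for a hypothetical stone
```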

61 of 80

[Figure: the x, y, z, and table dimensions of a diamond]

  • depth = total depth percentage = z / mean(x, y) = 2 * z / (x + y) (54.2–70.80)

62 of 80

63 of 80

x, y, and z are strongly correlated with one another, which is why depth (a ratio of them) is less correlated.

64 of 80

D,E->A

Change type!! (as a factor)

Just in case, make a variable for color D

65 of 80

Color F+D(yes)??

66 of 80

Something wrong?

67 of 80

Carat vs color?

68 of 80

[Diagram: Color → new variable, via a Neural-Network-style method]

Model: f(x1, x2, x3, …) = a

a · f(x1, x2, x3, …) = price

69 of 80

Adjust the setup to reduce the amount of computation

70 of 80

71 of 80

Depth, table evaluation

72 of 80

Exclude table, and play with “size” parameters

73 of 80

Can you predict the price of a diamond with x > y (at various depths) using your model?

74 of 80

75 of 80

Finalize the model and run the prediction!

Prediction->data->choose test data->run and save the data

Don’t forget to create the color_new variable in the test data before running the prediction!!

D,E->A

Change type!! (as a factor)

76 of 80

77 of 80

78 of 80

79 of 80

Additional analysis! Look at the data again

Compared with all data, the pred < 0 rows have:

  • smaller carat (smaller x, y, z)
  • higher depth (depth = z / mean(x, y) = 2 * z / (x + y))

80 of 80

Segmentation! Let’s build a new linear regression for carat < 0.5.

How many segments can you make?

data

  • carat < 0.5: color = D, E / other colors
  • carat > 0.5: color = D, E / other colors
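The 2 × 2 segmentation (carat cut-off × color group) can be sketched as a labeling function. The (carat, color) records below are hypothetical:

```python
# Sketch of a 2 x 2 segmentation: carat cut-off at 0.5 crossed with
# color group D/E vs. other. The records are made up for illustration.
rows = [(0.3, "D"), (0.7, "E"), (0.4, "G"), (1.1, "H"), (0.9, "D")]

def segment(carat, color):
    size = "small" if carat < 0.5 else "large"
    grade = "DE" if color in ("D", "E") else "other"
    return f"{size}/{grade}"

segments = [segment(c, col) for c, col in rows]
print(segments)
```

In Radiant the same split would be done with filters (e.g. carat < 0.5 and a recoded color variable), fitting one regression per segment.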