1 of 80

R-Radiant

2 of 80

3 of 80

heights.csv

4 of 80

Visualize->scatter

5 of 80

Create a new variable to denote who is taller in a mother-daughter pair

6 of 80

Change its type to factor

7 of 80

8 of 80

Create another variable to break the data into 3 tiers depending on the mother's height

9 of 80

Change its type to factor

Need to reorder levels to something more natural

10 of 80

Reorder levels

Much better

11 of 80

Interpret!

12 of 80

13 of 80

Simple model = linear regression model

f(x1, x2, x3, x4, …) = y

14 of 80

Sales [M$] By State

Nature.rda

15 of 80

Total Deviation: Total Sum of Squares (SST)

16 of 80

Sales [M$] by advertising expenditures & state

17 of 80

A linear relationship?

18 of 80

A linear relationship?

Fitted line: ŷ = b0 + b1x (intercept b0; slope b1 = change in ŷ per unit change in x)

19 of 80

Unexplained deviation (“errors”)

Expected sales at $0.8M Advertising?

Actual sales at $0.8M advertising

Unexplained deviation “Error”

ŷ = b0 + b1(0.8)

Predicted value

20 of 80

Total deviation = explained deviation + error

[Figure: at $0.8M advertising, the total deviation of actual sales from the mean splits into the explained deviation plus the unexplained deviation ("error")]

21 of 80

Unexplained deviation (“errors” or “residuals”)

22 of 80

Unexplained deviation: Sum of Squared Errors (SSE)

23 of 80

Unexplained deviation: Sum of Squared Errors (SSE)

Regression (OLS) determines the line that minimizes the Sum of Squared Errors, i.e., b0 and b1 are determined such that they minimize Σi (yi − b0 − b1xi)²

Note: Also known as Residual Sum of Squares
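The minimization above has a closed-form solution in the simple (one-predictor) case: b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄. A minimal sketch on made-up numbers (not the Nature.rda data):

```python
# Closed-form simple OLS: b1 = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)², b0 = ȳ − b1·x̄.
# The data below are hypothetical, for illustration only.
xs = [0.2, 0.4, 0.6, 0.8, 1.0]  # advertising ($M), assumed
ys = [80, 95, 105, 120, 130]    # sales ($M), assumed

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
```

Radiant computes these for you; the point is only that the line is fully determined by the data.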

24 of 80

Explained deviation

25 of 80

Explained deviation: Sum of Squares Regression(SSR)

26 of 80

Adding up deviations:

SST = SSR + SSE

(Total Sum of Squares = Sum of Squares Regression + Sum of Squared Errors)
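The identity SST = SSR + SSE can be checked numerically on a tiny hypothetical dataset with its OLS-fitted line:

```python
# Check SST = SSR + SSE on made-up data. The identity holds exactly
# because (b0, b1) is the OLS fit for this toy data.
xs = [0.2, 0.4, 0.6, 0.8, 1.0]
ys = [80, 95, 105, 120, 130]
b0, b1 = 68.5, 62.5  # OLS estimates for this toy data

y_bar = sum(ys) / len(ys)
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total deviation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained
assert abs(sst - (ssr + sse)) < 1e-6
```

Note the identity only holds for the least-squares line; an arbitrary line would not split the deviations this cleanly.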

27 of 80

Simple Linear Regression – The Basic Model

  • Observed data: (x1, y1), (x2, y2),…, (xn, yn)
    • x1, x2,…, xn are n given values of the independent variable
    • y1, y2,…, yn are the corresponding n observations of the dependent variable

  • Model of the population: Yi = β0+ β1xi+ Ɛi
    • Ɛ1 , Ɛ2 ,…, Ɛn are i.i.d. random variables, N(0, σ)
    • This is the true relation between Y and x, yet we do not know β0 and β1 and have to estimate them based on the observed data.

28 of 80

Population Model: Yi = β0 + β1xi + Ɛi, with Ɛi ~ N(0, σ)

[Figure: regression line with intercept β0 and slope β1]

29 of 80

Comments

  • Relationship is linear – described by a “line”
  • β0 = “baseline” value of Y
    • (i.e. value of Y if x is 0)
  • β1 = “slope” of line
    • (average change in Y per unit change in x)

E(Yi|xi) = β0 + β1xi

SD(Yi|xi) = σ

30 of 80

Notation

  • Regression coefficients: b0 and b1 are estimates of β0 and β1

  • Regression estimate for Y at xi (prediction): ŷi = b0 + b1xi

  • Residuals: ei = yi - ŷi

31 of 80

Realization: yi = b0+ b1xi+ ei

[Figure: observation y5 with residual e5 around the fitted value ŷ5 = b0 + b1x5, on the line with intercept b0 and slope b1]

32 of 80

Multiple Linear Regression

  • Observed data:
    • (x11, x21,…, xk1), (x12, x22,…, xk2),…, (x1n, x2n,…, xkn) are records of given values of the explanatory variables
    • y1, y2,…, yn are the n observations of the dependent variable

  • Population model:
    • Yi = β0+ β1x1i+ … + βkxki + Ɛi
    • Ɛ1 , Ɛ2 , … Ɛn are i.i.d. random variables, N(0, σ)

33 of 80

Multiple Linear Regression

  • Regression coefficients: b0,b1,…, bk are estimates of β0 , β1,..., βk
  • Prediction for Y at xi: ŷi = b0 + b1x1i + … + bkxki
  • Residual (error): ei = yi - ŷi
  • Goal: choose b0, b1,…, bk to minimize the sum of squared errors
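One way to see what "minimize the sum of squared errors" means computationally is the normal equations (XᵀX)b = Xᵀy, which the least-squares coefficients satisfy. A self-contained sketch on made-up data with k = 2 predictors (Radiant does all of this for you):

```python
# Sketch: multiple regression via the normal equations (XᵀX) b = Xᵀy.
# All numbers are hypothetical, for illustration only.

def solve(A, v):
    """Solve A b = v by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    b = [0.0] * m
    for r in range(m - 1, -1, -1):
        b[r] = (M[r][m] - sum(M[r][c] * b[c] for c in range(r + 1, m))) / M[r][r]
    return b

# Design matrix: an intercept column of 1s plus k = 2 predictors
X = [[1, 0.2, 1.0], [1, 0.4, 0.8], [1, 0.6, 1.2], [1, 0.8, 0.9], [1, 1.0, 1.5]]
y = [80, 95, 105, 120, 130]

n, p = len(X), len(X[0])
XtX = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
Xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
b = solve(XtX, Xty)  # [b0, b1, b2] — the estimates of β0, β1, β2
```

A defining property of the least-squares solution is that the residuals are orthogonal to every column of X, which is what the normal equations encode.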

34 of 80

Estimating a regression model

in Radiant

Nature.rda

35 of 80

Plot or perish!

36 of 80

Do the ‘signs’ make sense?

37 of 80

Regression output

38 of 80

Regression output: Coefficients

b0, b1,…, bk are estimates of the true parameters β0, β1,…, βk

39 of 80

Interpreting regression coefficients

  • Regression coefficients: b0, b1,…, bk are estimates of β0, β1,…, βk. Fact: E(bi) = βi

  • Example:
    • b0 = 65.70 (Intersection with Y-axis. Interpret with care or ignore)
    • b1= 48.98 ($1 million increase in advertising is expected to increase sales by $49 million, keeping all else constant)
    • b2= 59.65 ($1 million increase in promotions is expected to increase sales by $59.7 million, keeping all else constant)
    • b3= -1.84 ($1 million increase in competitor sales is expected to decrease sales by $1.8 million, keeping all else constant)
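Plugging the slide's coefficients into the fitted equation turns them into a prediction. The input values below are hypothetical, chosen only to show the arithmetic:

```python
# Prediction from the estimated equation on this slide:
#   ŷ = 65.70 + 48.98·advertising + 59.65·promotions − 1.84·competitor_sales
b0, b1, b2, b3 = 65.70, 48.98, 59.65, -1.84

def predict(adv, promo, comp):
    return b0 + b1 * adv + b2 * promo + b3 * comp

# Hypothetical inputs: $0.8M advertising, $0.5M promotions, $30M competitor sales
sales_hat = predict(0.8, 0.5, 30)  # predicted sales in $M
```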

40 of 80

Regression output: Standard Error (an estimate of σ)

41 of 80

Regression output: Standard Error

  • Standard error s: an estimate of σ, the standard deviation of the Ɛi. It is a measure of the spread of the data around the line and represents the amount of “noise” in the model.
    • Example: s = 17.60
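s is computed from the SSE and the residual degrees of freedom, s = √(SSE / (n − k − 1)), where k is the number of predictors. A sketch with made-up numbers (not the slide's s = 17.60):

```python
import math

# s estimates σ: s = sqrt(SSE / (n − k − 1)).
# SSE, n, and k below are hypothetical.
sse, n, k = 7.5, 5, 1
s = math.sqrt(sse / (n - k - 1))  # degrees of freedom = n − k − 1
```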

42 of 80

Regression output: Sum of Squares

SST: “Total Sum of Squares”

SSE: “Sum of Squared Errors”

SSR: “Sum of Squares Regression”

43 of 80

Regression output: R²

Coefficient of determination (R²)

A measure of the overall strength of the relationship between the Response and Predictor variables

How much of the variation has been explained?

44 of 80

Understanding regression output

45 of 80

Coefficient of determination: R²

  • R² = 1: x values account for all of the variation in the Y values
  • R² = 0: x values account for none of the variation in the Y values
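R² is the explained fraction of the total variation: R² = SSR / SST, equivalently 1 − SSE / SST. A one-line check with hypothetical sums of squares:

```python
# R² = SSR / SST = 1 − SSE / SST (hypothetical sums of squares)
sst, ssr, sse = 1570.0, 1562.5, 7.5
r2 = ssr / sst
assert abs(r2 - (1 - sse / sst)) < 1e-12  # the two forms agree
```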

46 of 80

Coefficient of determination: R²

  • R² = 0.8: x values account for most of the variation in the Y values
  • R² = 0.4: x values account for some of the variation in the Y values

47 of 80

Standard errors (standard deviations) of the coefficients

48 of 80

T values for the coefficients

The "distance" of each coefficient from 0, measured in standard errors: t.value = coeff. / std.error
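The t-value is just this ratio; the coefficient and standard error below are hypothetical:

```python
# t.value = coeff. / std.error (hypothetical coefficient and SE)
coeff, std_error = 48.98, 10.66
t_value = coeff / std_error  # ≈ 4.59 standard errors away from 0
```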

49 of 80

P values for the coefficients

The p.value is the probability of observing a t.value at least this large in magnitude if the null hypothesis (βi = 0) is true

50 of 80

95% confidence interval for the coefficients.

If the interval does not contain 0, our confidence that the variable has explanatory power is (at least) 95%.
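The interval itself is simple arithmetic: b ± t*·SE, where t* is the 0.975 quantile of the t distribution with n − k − 1 degrees of freedom (≈ 1.96 for large samples). The coefficient and SE below are hypothetical:

```python
# 95% CI for a coefficient: b ± t* · SE.
# t* ≈ 1.96 (large-sample value); b and SE are hypothetical.
b, se, t_star = 48.98, 10.66, 1.96
lo, hi = b - t_star * se, b + t_star * se
contains_zero = lo <= 0 <= hi  # False → significantly different from 0
```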

51 of 80

Interpreting confidence intervals

If the interval DOES NOT contain 0 we conclude that βi is significantly different from zero

[Figure: a "good" confidence interval lies entirely away from 0; a "bad" one contains 0]

52 of 80

Interpreting confidence intervals

53 of 80

Let’s practice with the diamond data!!

54 of 80

Split data for validation

Training data for building model

Test data for validation of the model

-Create a new variable to flag training/test rows

-Training rows get 1

-Test rows get 0
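The split described above can be sketched outside Radiant as a random 0/1 flag per row. The row count and seed are assumptions, not values from the diamond data:

```python
import random

# Sketch of a 70/30 split: draw a random 0/1 'training' flag per row.
# n_rows and the seed are hypothetical.
random.seed(42)
n_rows = 3000
training = [1 if random.random() < 0.7 else 0 for _ in range(n_rows)]

train_rows = [i for i, t in enumerate(training) if t == 1]
test_rows = [i for i, t in enumerate(training) if t == 0]
print(len(train_rows), len(test_rows))  # roughly 2100 / 900
```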

55 of 80

Data->transform->transformation type-> training variable

56 of 80

70% of the rows get a 1 in the ‘training’ variable, via a random function; this enables various selections.

2,100 of the 3,000 rows have training = 1

57 of 80

data -> view -> filter -> training ==1 -> store in ‘diamond_training’

data -> view -> filter -> training ==0 -> store in ‘diamond_test’

58 of 80

Select ‘diamond_training’ and build linear regression

59 of 80

The p.value for x is 0.65, so there is no evidence that x contributes to price

60 of 80

Let’s look at the data and variables.

  • carat = weight of the diamond (0.2–3.00)
  • depth = total depth percentage = z / mean(x, y) = 2 * z / (x + y) (54.2–70.80)
  • table = width of top of diamond relative to widest point (50–69)
  • x = length in mm (3.73–9.42)
  • y = width in mm (3.71–9.29)
  • z = depth in mm (2.33–5.58)

These variables all describe the diamond’s ‘size’
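The depth variable is derived directly from the size variables, which is worth seeing in code. The example stone's dimensions are hypothetical:

```python
# depth is the total depth percentage, derived from x, y, z:
#   depth (%) = 100 * z / mean(x, y) = 100 * 2 * z / (x + y)
def depth_pct(x, y, z):
    return 100 * 2 * z / (x + y)

print(depth_pct(4.0, 4.0, 2.4))  # → 60.0 for a hypothetical stone
```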

61 of 80

[Figure: the x, y, z, and table dimensions of a diamond]

  • depth = total depth percentage = z / mean(x, y) = 2 * z / (x + y) (54.2–70.80)

62 of 80

63 of 80

x, y, and z are strongly correlated with one another, which is why depth (a ratio of them) is less correlated.

64 of 80

D,E->A

Change type!! (as a factor)

Just in case, make a variable for color D

65 of 80

Color F+D(yes)??

66 of 80

Something wrong?

67 of 80

Carat vs color?

68 of 80

[Diagram: Color → new variable, via a Neural-Network-style method]

Model: f(x1, x2, x3, …) = a

a · f(x1, x2, x3, …) = price

69 of 80

Adjust the setup to reduce the amount of computation

70 of 80

71 of 80

Depth, table evaluation

72 of 80

Exclude table, and play with “size” parameters

73 of 80

Can you predict the price of a diamond with x > y (at various depths) using your model?

74 of 80

75 of 80

Finalize the model and run the prediction!

Prediction->data->choose test data->run and save the data

Don’t forget to create the color_new variable in the test data before running the prediction!!

D,E->A

Change type!! (as a factor)

76 of 80

77 of 80

78 of 80

79 of 80

Additional analysis! Look at the data again

Compared with all data, the pred < 0 rows have:

  • smaller carat (smaller x, y, z)
  • higher depth (depth = z / mean(x, y) = 2 * z / (x + y))

80 of 80

Segmentation! Let’s build a new linear regression for carat < 0.5.

How many segments can you make?

data

  • carat < 0.5: color = D, E / other colors
  • carat > 0.5: color = D, E / other colors
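The 2 × 2 segmentation (carat cut-off × color group) can be sketched as a labeling function. The (carat, color) records below are hypothetical:

```python
# Sketch of a 2 x 2 segmentation: carat cut-off at 0.5 crossed with
# color group D/E vs. other. The records are made up for illustration.
rows = [(0.3, "D"), (0.7, "E"), (0.4, "G"), (1.1, "H"), (0.9, "D")]

def segment(carat, color):
    size = "small" if carat < 0.5 else "large"
    grade = "DE" if color in ("D", "E") else "other"
    return f"{size}/{grade}"

segments = [segment(c, col) for c, col in rows]
print(segments)
```

In Radiant the same split would be done with filters (e.g. carat < 0.5 and a recoded color variable), fitting one regression per segment.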