R-Radiant
heights.csv
Visualize->scatter
Create a new variable to denote who is taller in a mother-daughter pair
Change its type to factor
Create another variable to break data into 3 tiers depending on mother height
Change its type to factor
Need to reorder levels to something more natural
Reorder levels
Much better
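The same steps can be sketched outside Radiant. This is a minimal illustration, assuming made-up heights in inches (not the contents of heights.csv) and hypothetical names (`taller`, `tier`, `TIER_LEVELS`); the tier cutoffs are taken as terciles of the mothers' heights, which is an assumption, not something the slides specify.

```python
# Hypothetical mother/daughter heights in inches (not taken from heights.csv).
pairs = [(62.0, 64.5), (66.5, 65.0), (59.5, 61.0),
         (64.0, 64.0), (68.0, 66.5), (61.0, 63.5)]

def taller(mother, daughter):
    """Label who is taller in one mother-daughter pair."""
    if mother > daughter:
        return "mother"
    if daughter > mother:
        return "daughter"
    return "same"

# Listing the levels explicitly plays the role of reordering factor levels.
TIER_LEVELS = ["short", "medium", "tall"]

def tier(mother, cut_low, cut_high):
    """Assign one of three tiers from the mother's height."""
    if mother < cut_low:
        return "short"
    if mother < cut_high:
        return "medium"
    return "tall"

# Assumed cutoffs: tercile boundaries of the mothers' heights.
mothers = sorted(m for m, _ in pairs)
cut_low = mothers[len(mothers) // 3]
cut_high = mothers[2 * len(mothers) // 3]

rows = [(m, d, taller(m, d), tier(m, cut_low, cut_high)) for m, d in pairs]

# Sort by the natural level order; alphabetical order would put "medium" first.
rows.sort(key=lambda r: TIER_LEVELS.index(r[3]))
for row in rows:
    print(row)
```

Without an explicit level order, the tiers would sort alphabetically (medium, short, tall), which is the unnatural ordering the slide complains about.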
Interpret!
Simple model = linear regression model
f(x1,x2,x3,x4,…..)=y
Sales [M$] By State
Nature.rda
Total Deviation: Total Sum of Squares (SST)
Sales [M$] by advertising expenditures & state
A linear relationship?
ŷ = b0 + b1x (b0: intercept; b1: slope, the change in ŷ per 1-unit increase in x)
Unexplained deviation (“errors”)
Expected sales at $0.8M Advertising?
Actual sales at $0.8M advertising
Unexplained deviation “Error”
ŷ = b0 + b1 × 0.8
Predicted value
Total deviation = explained deviation + error
Explained deviation
Total deviation
Unexplained deviation (“errors” or “residuals”)
Unexplained deviation: Sum of Squared Errors (SSE)
Regression (OLS) determines the line that minimizes the Sum of Squared Errors, i.e., b0 and b1 are chosen to minimize:
SSE = Σ (yi − ŷi)² = Σ (yi − b0 − b1xi)²
Note: Also known as Residual Sum of Squares
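The minimization has a closed-form solution for one predictor. The sketch below assumes made-up advertising and sales figures (in M$), not the values in Nature.rda, and hypothetical helper names `ols_fit` and `sse`.

```python
def ols_fit(x, y):
    """Return (b0, b1) that minimize the sum of squared errors for y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared errors (residual sum of squares) for a given line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data: advertising and sales, both in M$.
advertising = [0.2, 0.4, 0.6, 0.8, 1.0]
sales = [1.1, 1.9, 3.2, 3.8, 5.0]

b0, b1 = ols_fit(advertising, sales)
print("b0 =", b0, "b1 =", b1)
print("expected sales at $0.8M advertising:", b0 + b1 * 0.8)

# Any other line has a larger SSE than the OLS line.
print(sse(advertising, sales, b0, b1) < sse(advertising, sales, b0 + 0.1, b1))
```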
Explained deviation
Explained deviation: Sum of Squares Regression(SSR)
Adding up deviations
SST (Total Sum of Squares) = SSR (Sum of Squares Regression) + SSE (Sum of Squared Errors)
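The decomposition can be checked numerically. A minimal sketch, assuming made-up advertising and sales figures (in M$) rather than the Nature.rda data:

```python
# Hypothetical data: advertising and sales, both in M$.
advertising = [0.2, 0.4, 0.6, 0.8, 1.0]
sales = [1.1, 1.9, 3.2, 3.8, 5.0]

n = len(sales)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n

# OLS estimates for the line y-hat = b0 + b1*x.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))
      / sum((x - x_bar) ** 2 for x in advertising))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in advertising]

sst = sum((y - y_bar) ** 2 for y in sales)               # total deviation
ssr = sum((f - y_bar) ** 2 for f in fitted)              # explained deviation
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))   # unexplained deviation

print("SST =", sst)
print("SSR + SSE =", ssr + sse)
```

The identity holds only for the OLS line (with an intercept); for any other line the cross term does not vanish.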
Simple Linear Regression – The Basic Model
Population model: Yi = β0 + β1xi + εi
εi ~ N(0, σ)
β0: intercept, β1: slope
Comments
E(Yi|xi) = β0 + β1xi
SD(Yi|xi) = σ
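A small simulation illustrates what the population model says about Y at a fixed x. The parameter values below (β0 = 1, β1 = 5, σ = 2, x = 0.8) are assumptions chosen for illustration only.

```python
import random
import statistics

random.seed(42)

# Hypothetical population parameters.
beta0, beta1, sigma = 1.0, 5.0, 2.0
x = 0.8

# Draw many realizations of Y at the same x: Y = beta0 + beta1*x + epsilon,
# with epsilon normally distributed with mean 0 and standard deviation sigma.
draws = [beta0 + beta1 * x + random.gauss(0.0, sigma) for _ in range(20000)]

sample_mean = statistics.mean(draws)
sample_sd = statistics.stdev(draws)

print("E(Y|x) =", beta0 + beta1 * x, " sample mean =", sample_mean)
print("SD(Y|x) =", sigma, " sample SD =", sample_sd)
```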
Notation
Realization: yi = b0 + b1xi + ei
Fitted value: ŷ5 = b0 + b1x5
Residual: e5 = y5 − ŷ5
b0: estimated intercept, b1: estimated slope
Multiple Linear Regression
Estimating a regression model in Radiant
Nature.rda
Plot or perish!
Do the ‘signs’ make sense?
Regression output
Regression output: Coefficients
b0, b1,…, bk are estimates of the true parameters β0, β1,…, βk
Interpreting regression coefficients
Regression output: Standard Error
σ
Regression output: Sum of Squares
SST: “Total Sum of Squares”
SSE: “Sum of Squared Errors”
SSR: “Sum of Squares Regression”
Regression output: R²
Coefficient of determination (R²)
A measure of the overall strength of the relationship between the Response and Predictor variables
How much of the variation has been explained?
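In terms of the sums of squares, R² = SSR / SST = 1 − SSE / SST. A minimal sketch with a hypothetical helper `r_squared` and made-up values shows the two extremes:

```python
def r_squared(actual, fitted):
    """Share of the total variation in `actual` explained by `fitted`: 1 - SSE/SST."""
    mean = sum(actual) / len(actual)
    sst = sum((y - mean) ** 2 for y in actual)
    sse = sum((y - f) ** 2 for y, f in zip(actual, fitted))
    return 1.0 - sse / sst

# Hypothetical response values.
sales = [1.1, 1.9, 3.2, 3.8, 5.0]
mean_sales = sum(sales) / len(sales)

# A model that reproduces every observation explains all of the variation.
print(r_squared(sales, sales))

# A model that always predicts the mean explains none of it.
print(r_squared(sales, [mean_sales] * len(sales)))
```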
Understanding regression output
Coefficient of determination: R²
Standard errors of the coefficients
t-values for the coefficients
“Distance” of the coefficients from 0, measured in standard errors, i.e., t.value = coefficient / std.error
p-values for the coefficients
The p.value is the probability of finding a t.value at least this large (in absolute value) if the null hypothesis (βi = 0) is true
95% confidence interval for the coefficients
If the interval does not contain 0, the coefficient is significantly different from 0 at the 5% level.
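For a single predictor, these quantities can be computed by hand. The sketch below assumes made-up advertising and sales figures and uses a critical value of 3.182 (two-sided 95%, Student's t with n − 2 = 3 degrees of freedom, read from a t table); it does not compute the p-value, which needs the t distribution itself.

```python
import math

# Hypothetical data: advertising and sales, both in M$.
advertising = [0.2, 0.4, 0.6, 0.8, 1.0]
sales = [1.1, 1.9, 3.2, 3.8, 5.0]

n = len(sales)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n
sxx = sum((x - x_bar) ** 2 for x in advertising)

# OLS estimates.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales)) / sxx
b0 = y_bar - b1 * x_bar

# Residual standard error: the estimate of sigma, using n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(advertising, sales))
s = math.sqrt(sse / (n - 2))

# Standard error and t-value of the slope.
se_b1 = s / math.sqrt(sxx)
t_value = b1 / se_b1

# Assumed constant: 97.5th percentile of Student's t with 3 degrees of freedom.
T_CRIT_DF3 = 3.182
ci_low = b1 - T_CRIT_DF3 * se_b1
ci_high = b1 + T_CRIT_DF3 * se_b1

print("b1 =", b1, " std.error =", se_b1, " t.value =", t_value)
print("95% CI for the slope:", (ci_low, ci_high))
print("interval excludes 0:", ci_low > 0 or ci_high < 0)
```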
Interpreting confidence intervals
If the interval DOES NOT contain 0 we conclude that βi is significantly different from zero
🗴 Bad :
✓ Good :
Let’s practice with the diamond data!
Split data for validation
Training data for building model
Test data for validation of the model
- Create a new variable for the training/test split
- For training rows, assign 1.
- For test rows, assign 0.
Data->transform->transformation type-> training variable
70% of data will have 1 in ‘training’ variable
Random function
This enables various selections
2,100 of the 3,000 rows have training == 1
data -> view -> filter -> training ==1 -> store in ‘diamond_training’
data -> view -> filter -> training ==0 -> store in ‘diamond_test’
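The split can be sketched in a few lines. This is an illustration of the idea, not Radiant's implementation; the seed value and the use of row indices in place of real diamond records are assumptions.

```python
import random

random.seed(1234)  # assumed seed, only to make the split reproducible

n_rows = 3000
n_train = round(0.7 * n_rows)  # 70% of the rows go to training

# Indicator variable: 1 = training, 0 = test, assigned at random.
training = [1] * n_train + [0] * (n_rows - n_train)
random.shuffle(training)

# Row indices stand in for the actual diamond records.
rows = list(range(n_rows))
diamond_training = [r for r, t in zip(rows, training) if t == 1]
diamond_test = [r for r, t in zip(rows, training) if t == 0]

print(len(diamond_training), len(diamond_test))
```

The model is then fitted on `diamond_training` only, and `diamond_test` is held back to check how well the model predicts rows it has not seen.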
Select ‘diamond_training’ and build linear regression
p-value for x ≈ 0.65: we cannot reject the hypothesis that x does not contribute to price (given the other variables in the model)
Let’s look at the data and variables.
These are all talking about ‘size’
x
y
z
table
x, y, and z are strongly correlated with each other; depth is a ratio of them (z relative to x and y), which is why it is less correlated.
D,E->A
Change type!! (as a factor)
Just in case, make a variable for color D
Color F+D(yes)??
Something wrong?
Carat vs color?
Color
New variable
Neural Network Method
Color
Model
f(x1,x2,x3,..)=a
a*f(x1,x2,x3,..)=price
Requires setup to reduce the amount of computation
Depth, table evaluation
Exclude table, and play with “size” parameters
Can you predict the price of a diamond with x > y (at various depths) using your model?
Finalize the model and run the prediction!
Prediction->data->choose test data->run and save the data
Don’t forget to create the color_new variable in the test data before running the prediction!
D,E->A
Change type!! (as a factor)
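One way to avoid forgetting the step is to apply the same recoding function to both data sets. The sketch below reads "D,E->A" as merging colors D and E into one level "A" and leaving other colors unchanged; that reading, the helper name `add_color_new`, and the sample rows are all assumptions.

```python
def add_color_new(row):
    """Return a copy of the row with color_new: D and E merged into 'A' (assumed rule)."""
    out = dict(row)
    out["color_new"] = "A" if row["color"] in ("D", "E") else row["color"]
    return out

# Hypothetical rows standing in for diamond_training and diamond_test.
diamond_training = [{"carat": 0.31, "color": "D"}, {"carat": 0.90, "color": "G"}]
diamond_test = [{"carat": 0.42, "color": "E"}, {"carat": 1.10, "color": "J"}]

# Apply the identical transformation to both sets so the model sees the same variable.
diamond_training = [add_color_new(r) for r in diamond_training]
diamond_test = [add_color_new(r) for r in diamond_test]

print([r["color_new"] for r in diamond_training])
print([r["color_new"] for r in diamond_test])
```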
Additional analysis! Look at the data again
All data
Data with predicted price < 0
Smaller carat
(smaller x,y,z)
Higher depth
(depth = z / mean(x, y) = 2 * z / (x + y))
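The depth formula is easy to verify on a single stone. The dimensions below are made up, and the function returns a plain ratio as written on the slide; if the data set stores depth as a percentage, multiply by 100.

```python
def depth(x, y, z):
    """Depth as on the slide: z divided by the mean of x and y."""
    return 2 * z / (x + y)

# Hypothetical dimensions in mm.
x, y, z = 4.0, 4.0, 2.48

print(depth(x, y, z))
# The two forms of the formula agree.
print(z / ((x + y) / 2))
```

A stone with small x and y but a comparatively large z gets a high depth, which matches the pattern seen in the rows with negative predictions.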
Segmentation! Let’s build a new linear regression for carat <0.5.
How many segments can you use?
data
  Carat>0.5: Color=D,E | Other color
  Carat<0.5: Color=D,E | Other color
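The effect of segmenting by carat can be sketched with one predictor. The carat and price values are made up, `ols_fit` is a hypothetical helper, and assigning carat exactly equal to 0.5 to the upper segment is an assumption (the slide leaves that boundary open).

```python
def ols_fit(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

# Hypothetical (carat, price) pairs with a steeper price slope for larger stones.
data = [(0.2, 400), (0.3, 600), (0.4, 800), (0.6, 1500), (0.8, 2500), (1.0, 3500)]

# One pooled model for all rows.
pooled = ols_fit([c for c, _ in data], [p for _, p in data])

# Separate models per segment (assumed rule: carat == 0.5 goes to the upper segment).
small = [(c, p) for c, p in data if c < 0.5]
large = [(c, p) for c, p in data if c >= 0.5]
fit_small = ols_fit([c for c, _ in small], [p for _, p in small])
fit_large = ols_fit([c for c, _ in large], [p for _, p in large])

def predict(fit, carat):
    b0, b1 = fit
    return b0 + b1 * carat

# For a very small stone the pooled line predicts a negative price;
# the model fitted on the small-carat segment does not.
print("pooled:   ", predict(pooled, 0.1))
print("segmented:", predict(fit_small, 0.1))
```

Each additional split (for example by color within each carat segment) leaves fewer rows per model, so the number of useful segments is limited by how much data each segment retains.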