PREDICTIONS | |
Linear regression y = mx + n | uses a known correlation to predict other values; assumes a linear relationship, which need not actually exist; the line of "best fit" is the one with the smallest total (squared) distance to the points → "predicts continuous numeric value" |
import statsmodels.formula.api as smf lm = smf.ols(formula="Weight~Height",data=df).fit() | ols = ordinary least squares = linear regression model; formula = "what is to be predicted ~ what you want to predict it from"; data = the DataFrame the model is based on; .fit() estimates the actual regression line from the input data |
intercept, slope = lm.params | lm.params provides the intercept (n) and the slope (m), in that order |
plt.plot(x,y expressed by mx + n, linestyle, color) plt.plot(df["Height"],slope*df["Height"]+intercept,"-",color="red") | draws the regression line y = mx + n as a solid red line; needs import matplotlib.pyplot as plt |
lm.summary() | returns a table of the most important statistical measures of the regression line |
from sklearn.linear_model import LinearRegression import numpy as np lm = LinearRegression() | |
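data = np.asarray(df[['Mortality','Exposure']]) x = data[:,1:] y = data[:,0] | read in the data as a NumPy array; x = Exposure, kept 2-D by the slice 1: (scikit-learn expects 2-D input); y = Mortality as a 1-D array |
lm.fit(x,y) | fits the linear regression model to the data x and y |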
lm.coef_[0] lm.intercept_ | slope intercept |
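plt.plot(x,lm.coef_[0]*x+lm.intercept_) | plots the fitted line; intercept_ is a single number here because y is 1-D, so it is not indexed |
lm.predict([[x_value]]) | returns the predicted y for the input x; the input must be 2-D (one row per sample, one column per variable) |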
Multiple regression | each variable x has its own slope coefficient |
lm = smf.ols(formula="DEP_DELAY ~ LATE_AIRCRAFT_DELAY + CARRIER_DELAY + NAS_DELAY", data=df).fit() | predicts departure delay from the variables late aircraft delay, carrier delay and NAS delay |
Simple logistic regression Classification | used to explore the connection between a set of values and a binary outcome; fits the data to a logistic (S-shaped) curve instead of a straight line; predicts the probability that an event occurs → "predicts discrete categorical value"; the algorithm maps each input to a value between 0 and 1, which is then used to classify the data |
from sklearn.linear_model import LogisticRegression import numpy as np | |
lm = LogisticRegression() x = np.asarray(dataset[['attribute1','attribute2','attribute3']]) y = np.asarray(dataset['target']) | x takes three attributes (column names); y takes one target column |
lm = lm.fit(x,y) | fits the model |
lm.score(x,y) | gives the mean accuracy of the model on the given x and y (here the training data) |
lm.coef_ | gives the coefficient (the slope on the log-odds scale) for each of the attributes |
lm.intercept_ | gives the intercept of the regression curve |
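lm.predict([[0,1,0]]) | predicts the target class for one sample in which only the second attribute is set; the input must be 2-D (one row per sample) |
lm.predict_log_proba([[0,0,1]]) | returns the log of the predicted probability for each class, so two values per sample for a binary target |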
Multiple logistic regression | used to explore the connection between several sets of values and a binary outcome |
Overfitting | Model is adjusted too closely to the training data [A shirt of size 6 can be used to find all people of size 6 (category), unless one person (model) gets a tailor to adjust it to their shape only (overfitting); then it can no longer be used on another person (new dataset) to identify other size 6 people] |
Underfitting | Model is not good enough to distinguish between different categories of data [one-size-fits-all shirt -- everyone fits in, so it doesn't help in distinguishing between differently sized people (categories)] |
Application | |
Data modelling | approximating and attempting to depict reality |
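
The .fit() and .predict() rows above can be unpacked by hand. The following is a minimal sketch in plain Python, assuming a single predictor: fit_line computes the least-squares slope m and intercept n of y = mx + n, and logistic_probability shows how a logistic model turns a linear score into a value between 0 and 1. The function names, the sample heights and weights, and the coefficient values are all made up for illustration; they are not taken from statsmodels or scikit-learn.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope m, intercept n)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # the best-fit line always passes through the point (mean_x, mean_y)
    n = mean_y - m * mean_x
    return m, n


def predict(m, n, x):
    """Predicted continuous value on the line y = mx + n."""
    return m * x + n


def logistic_probability(coef, intercept, x):
    """Logistic curve: squashes the linear score coef*x + intercept into (0, 1)."""
    z = coef * x + intercept
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical sample data (illustrative only)
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 66, 74, 80]

m, n = fit_line(heights, weights)
print("slope:", m, "intercept:", n)
print("predicted weight at 175:", predict(m, n, 175))

# Hypothetical logistic coefficients; a probability above 0.5 would be classed as 1
print("probability:", logistic_probability(0.5, -1.0, 2.0))
```

With these sample numbers the slope comes out at 0.72 and the intercept at about -56.4. For a single predictor, smf.ols(...).fit() and LinearRegression().fit() should arrive at the same line.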