CheatSheet_Predictions_Regression

PREDICTIONS

Linear regression

y = mx + n

given a known linear relationship between two variables, you can predict unseen values

assumes a linear relationship, which may not actually exist in the data

finds the line of "best fit": the line with the smallest total (squared) distance to the data points

→ "predicts a continuous numeric value"

import statsmodels.formula.api as smf

lm = smf.ols(formula="Weight~Height",data=df).fit()

ols = ordinary least squares = linear regression model

formula="what is to be predicted ~ what to predict it from"

data = what data to base it on

.fit()  -- creates the actual regression line based on the input data

lm.params = intercept, slope

provides intercept (n) and slope (m)

plt.plot(x,y expressed by mx + n, linestyle, color)

import matplotlib.pyplot as plt

intercept, slope = lm.params

plt.plot(df["Height"],slope*df["Height"]+intercept,"-",color="red")

lm.summary()

prints the most important statistical measures of the regression line
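Putting the statsmodels steps above together, a minimal runnable sketch (the height/weight numbers are made up for illustration; they lie exactly on a line so the fitted parameters are easy to check):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative toy data: heights (cm) and weights (kg)
df = pd.DataFrame({
    "Height": [150, 160, 170, 180, 190],
    "Weight": [50, 57, 64, 71, 78],
})

# Fit an OLS model: predict Weight from Height
lm = smf.ols(formula="Weight ~ Height", data=df).fit()

intercept = lm.params["Intercept"]  # n in y = mx + n
slope = lm.params["Height"]         # m in y = mx + n

print(intercept, slope)
```

Since the toy data follows Weight = 0.7 * Height - 55 exactly, the fit recovers those two parameters.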

from sklearn.linear_model import LinearRegression

import numpy as np

lm = LinearRegression()

data = np.asarray(df[['Mortality','Exposure']])

x = data[:,1:]

y = data[:,0]

read in data as numpy array

lm.fit(x,y)

fits the linear regression model to the data x and y

lm.coef_[0]  -- slope

lm.intercept_  -- intercept

plt.plot(x,lm.coef_[0]*x+lm.intercept_)

(lm.intercept_ is a scalar when y is one-dimensional, so no [0] index)

lm.predict([[x_value]])

returns the predicted y for the given x (input must be 2-D: one row per sample)
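The scattered sklearn lines above, combined into one runnable sketch. The mortality/exposure numbers are invented and lie exactly on a line, so the fitted slope and intercept are easy to verify:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: column 0 = Mortality (target), column 1 = Exposure (feature)
data = np.asarray([[100.0, 1.0],
                   [120.0, 2.0],
                   [140.0, 3.0],
                   [160.0, 4.0]])

x = data[:, 1:]   # features must be 2-D: shape (n_samples, n_features)
y = data[:, 0]    # target is 1-D

lm = LinearRegression()
lm.fit(x, y)

print(lm.coef_[0])          # slope
print(lm.intercept_)        # intercept (a scalar, because y is 1-D)
print(lm.predict([[5.0]]))  # predicted Mortality for Exposure = 5
```

The toy data follows Mortality = 80 + 20 * Exposure, so the model recovers slope 20 and intercept 80.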

Multiple regression

each predictor variable x has its own slope coefficient

lm = smf.ols(formula="DEP_DELAY ~ LATE_AIRCRAFT_DELAY + CARRIER_DELAY + NAS_DELAY", data=df).fit()

predicts departure delay based on the variables late aircraft delay, carrier delay, and NAS delay
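A self-contained sketch of the multiple-regression call above, using synthetic delay data (the numbers are invented; each row's departure delay is the exact sum of the three component delays, so every coefficient comes out as 1):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic toy data; column names borrowed from the example above
df = pd.DataFrame({
    "LATE_AIRCRAFT_DELAY": [10, 0, 5, 20, 0, 15],
    "CARRIER_DELAY":       [0, 10, 5, 0, 20, 5],
    "NAS_DELAY":           [5, 5, 0, 10, 0, 10],
    "DEP_DELAY":           [15, 15, 10, 30, 20, 30],
})

lm = smf.ols("DEP_DELAY ~ LATE_AIRCRAFT_DELAY + CARRIER_DELAY + NAS_DELAY",
             data=df).fit()

# One intercept plus one slope coefficient per predictor variable
print(lm.params)
```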

Simple logistic regression

Classification

used to explore the connection between a set of values and a binary outcome

fits the data to a logistic (S-shaped) curve instead of a straight line

predicts the probability of the occurrence of an event

→ "predicts discrete categorical value"

the model maps each input to a probability between 0 and 1, which is then thresholded into a class
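The logistic curve mentioned above is the sigmoid function, which squashes any real-valued score into the open interval (0, 1). A minimal sketch:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # exactly 0.5: the decision boundary
print(sigmoid(5))    # close to 1
print(sigmoid(-5))   # close to 0
```

Logistic regression applies this function to the linear score m·x + n, turning it into a probability.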

from sklearn.linear_model import LogisticRegression

import numpy as np

lm = LogisticRegression()

x = np.asarray(dataset[['attribute1', 'attribute2', 'attribute3']])

y = np.asarray(dataset['target'])

takes three attributes (column names)

takes one target

lm = lm.fit(x,y)

fits the model

lm.score(x,y)

gives the mean accuracy of the model on data x against the true labels y

lm.coef_

gives the learned coefficient (weight) for each of the attributes

lm.intercept_

gives the intercept of the regression curve

lm.predict([[0, 1, 0]])

predicts the target class for one sample whose three attribute values are 0, 1 and 0 (input must be 2-D)

lm.predict_log_proba([[0, 0, 1]])

returns the log probability for each of the two classes (class 0 and class 1)
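A runnable sketch of the whole logistic-regression workflow above, on invented binary data (the target is deliberately made equal to the first attribute, so the classes are cleanly separable):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative binary attributes and a binary target
x = np.asarray([[0, 0, 1],
                [0, 1, 0],
                [1, 0, 0],
                [1, 1, 0],
                [0, 1, 1],
                [1, 0, 1]])
y = np.asarray([0, 0, 1, 1, 0, 1])  # equals the first attribute (column 0)

lm = LogisticRegression().fit(x, y)

print(lm.score(x, y))            # accuracy on the training data
print(lm.predict([[0, 1, 0]]))   # class for one sample with attributes 0, 1, 0
print(lm.predict_proba([[0, 0, 1]]))      # [P(class 0), P(class 1)]
print(lm.predict_log_proba([[0, 0, 1]]))  # log of those two probabilities
```

predict_proba (and its log variant) returns one column per class, and each row sums to 1 (in probability space).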

Multiple logistic regression

used to explore the connection between several sets of values and a binary outcome

Overfitting

Model is too well adjusted to training data

[Shirt of size 6 can be used to find all people of size 6 (category), unless one person (model) gets a tailor to adjust it to their shape only (overfitting); then it can no longer be used on other people (new dataset) to identify other size-6 people]

Underfitting

Model is not good enough to distinguish between different categories of data

[one-size-fits-all shirt -- everyone fits in, so it doesn't help in distinguishing between differently sized people (categories)]
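The shirt analogies can be made concrete with a small experiment: fit polynomials of different degrees to noisy samples of a parabola (all data here is synthetic). A straight line (degree 1) underfits; a degree-9 polynomial through 10 points overfits by threading every noisy training point exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Reality" is a parabola; the training and test samples are noisy
x_train = np.linspace(-3, 3, 10)
y_train = x_train ** 2 + rng.normal(0, 1, size=x_train.size)
x_test = np.linspace(-2.5, 2.5, 10)
y_test = x_test ** 2 + rng.normal(0, 1, size=x_test.size)

def fit_mse(degree, x_eval, y_eval):
    """Fit a polynomial of the given degree to the training data,
    then return its mean squared error on (x_eval, y_eval)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_eval)
    return float(np.mean((pred - y_eval) ** 2))

# Degree 1 underfits (a line cannot follow a parabola),
# degree 2 matches the underlying shape,
# degree 9 memorizes the training noise (train error ~0).
for degree in (1, 2, 9):
    print(degree,
          "train:", fit_mse(degree, x_train, y_train),
          "test:", fit_mse(degree, x_test, y_test))
```

The typical pattern: training error always shrinks as the model gets more flexible, but test error is what reveals under- and overfitting.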

Application

Data modelling

approximating and attempting to depict reality