CheatSheet_Predictions

PREDICTIONS
Linear regression y = mx + n	predicts other values from a known correlation; assumes a linear relationship, which need not actually exist; the "best fit" line is the one with the smallest total squared distance to the points → "predicts continuous numeric value"
import statsmodels.formula.api as smf; lm = smf.ols(formula="Weight ~ Height", data=df).fit()	ols = ordinary least squares = linear regression model; formula = "what is to be predicted ~ what to predict it from"; data = the DataFrame to base the model on; .fit() computes the actual regression line from the input data
intercept, slope = lm.params	lm.params holds the intercept (n) first, then the slope (m)
plt.plot(df["Height"], slope*df["Height"] + intercept, "-", color="red")	plt.plot(x, y expressed as m*x + n, linestyle, color) draws the regression line
lm.summary()	prints the most important statistical measures of the regression line
from sklearn.linear_model import LinearRegression; import numpy as np; lm = LinearRegression()
data = np.asarray(df[['Mortality','Exposure']]); x = data[:,1:]; y = data[:,0]	reads the data into a NumPy array; x = Exposure as a 2-D column, y = Mortality as a 1-D target
lm.fit(x,y)	fits the linear regression model to the data x and y
lm.coef_[0]; lm.intercept_	slope; intercept (a single number when y is 1-D)
plt.plot(x, lm.coef_[0]*x + lm.intercept_)
lm.predict([[x_value]])	returns the predicted y for an input x (the input must be a 2-D array)
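To make the "best fit" idea concrete, the slope and intercept that both libraries report can be computed by hand for the one-variable case. This is a minimal sketch, not how statsmodels or scikit-learn are implemented internally; the function name fit_line and the height/weight numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope m = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sample data (cm, kg), invented for this example
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 66, 73, 81]

slope, intercept = fit_line(heights, weights)
predicted = slope * 175 + intercept  # y = m*x + n for a new height of 175
print(slope, intercept, predicted)
```

The last line plays the role of lm.predict: once m and n are known, a prediction is just m*x + n.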
Multiple regression	each variable x has its own slope coefficient
lm = smf.ols(formula="DEP_DELAY ~ LATE_AIRCRAFT_DELAY + CARRIER_DELAY + NAS_DELAY", data=df).fit()	predicts departure delay from the variables late aircraft delay, carrier delay and NAS delay
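The "one coefficient per variable" idea can be checked with plain NumPy. The sketch below assumes made-up delay values generated from known coefficients with no noise, so the fit should recover them. It uses np.linalg.lstsq as a stand-in for the ols call above, and the array names are illustrative only.

```python
import numpy as np

# Made-up delays in minutes; not real flight data
late_aircraft = np.array([0.0, 10.0, 5.0, 20.0, 0.0, 15.0])
carrier = np.array([5.0, 0.0, 10.0, 0.0, 20.0, 5.0])
nas = np.array([0.0, 5.0, 0.0, 10.0, 5.0, 20.0])

# Target built from known coefficients so the result can be verified
dep_delay = 2.0 + 1.0 * late_aircraft + 0.5 * carrier + 0.25 * nas

# Design matrix: a column of ones for the intercept, then one column per variable
X = np.column_stack([np.ones_like(late_aircraft), late_aircraft, carrier, nas])

# Least-squares solution: intercept first, then one slope per variable
coefs, *_ = np.linalg.lstsq(X, dep_delay, rcond=None)
intercept, b_late, b_carrier, b_nas = coefs
print(intercept, b_late, b_carrier, b_nas)
```

With real, noisy data the recovered coefficients would only approximate the underlying relationship.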
Simple logistic regression (classification)	explores the connection between a set of values and a binary outcome; fits the data to a logistic curve (instead of a straight line); predicts the probability that an event occurs → "predicts discrete categorical value"; the algorithm maps the inputs to a value between 0 and 1, which is then used to classify the data
from sklearn.linear_model import LogisticRegression; import numpy as np
lm = LogisticRegression(); x = np.asarray(dataset[['attribute1','attribute2','attribute3']]); y = np.asarray(dataset['target'])	takes three attributes (column names) as input and one target column
lm = lm.fit(x,y)	fits the model
lm.score(x,y)	gives the accuracy of the model on the data passed in (here, the training data)
lm.coef_	gives the coefficient (weight on the log-odds) for each of the attributes
lm.intercept_	gives the intercept of the regression curve
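lm.predict([[0,1,0]])	predicts the target for one sample in which only the second attribute is 1 (the input must be a 2-D array)
lm.predict_log_proba([[0,0,1]])	returns the log of the predicted probability of each class; two values per sample for a binary target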
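How coefficients, intercept, probabilities and the predicted class relate in the binary case can be sketched in pure Python. The coefficient and intercept values below are hypothetical, not taken from a fitted model, and the helper names are invented for illustration. The logistic (sigmoid) function itself is the standard one.

```python
import math


def sigmoid(z):
    """Logistic curve: maps any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, coefs, intercept):
    """Return [P(class 0), P(class 1)] for a single sample."""
    # linear score, the same form as in linear regression
    z = intercept + sum(c * f for c, f in zip(coefs, features))
    p1 = sigmoid(z)
    return [1.0 - p1, p1]


# Hypothetical fitted values, standing in for lm.coef_ and lm.intercept_
coefs = [0.8, -1.2, 0.5]
intercept = -0.3

proba = predict_proba([0, 1, 0], coefs, intercept)
log_proba = [math.log(p) for p in proba]  # analogous to predict_log_proba
label = 1 if proba[1] >= 0.5 else 0       # analogous to predict
print(proba, log_proba, label)
```

This is why the log-probability call gives two values per sample: one for each class, and the two underlying probabilities sum to 1.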

Multiple logistic regression	used to explore the connection between several sets of values and a binary outcome

Overfitting	Model is adjusted too closely to the training data [A shirt of size 6 can be used to find all people of size 6 (category), unless one person (model) gets a tailor to adjust it to their shape only (overfitting); then it can no longer be used on another person (new dataset) to identify other size 6 people]
Underfitting	Model is not good enough to distinguish between different categories of data [one-size-fits-all shirt -- everyone fits in, so it doesn't help in distinguishing between differently sized people (categories)]
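The difference shows up when accuracy on the training data is compared with accuracy on unseen data. The toy sketch below uses invented height/size pairs. The memorising lookup is only a stand-in for an overfitted model and the constant answer a stand-in for an underfitted one, and all names are illustrative.

```python
# Invented (height in cm, shirt size) pairs
train = [(150, "S"), (155, "S"), (160, "S"), (170, "L"), (175, "L"), (180, "L")]
test = [(152, "S"), (158, "S"), (172, "L"), (178, "L")]


def accuracy(model, data):
    """Share of samples for which the model returns the correct label."""
    return sum(model(x) == label for x, label in data) / len(data)


def underfit(x):
    # ignores the input entirely: the one-size-fits-all shirt
    return "S"


lookup = dict(train)


def overfit(x):
    # memorises the exact training heights; any unseen height falls back to "S"
    return lookup.get(x, "S")


def threshold(x):
    # one simple rule that captures the actual pattern
    return "S" if x < 165 else "L"


for name, model in [("underfit", underfit), ("overfit", overfit), ("threshold", threshold)]:
    print(name, accuracy(model, train), accuracy(model, test))
```

The memorising model is perfect on the training data but no better than the constant answer on new data, while the simple threshold rule does well on both.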
Application
Data modelling	approximating and attempting to depict reality