Model | Task | Properties | Main sklearn models (bold = recommended) | Should scale features | Should scale target (regression only) | Multi-target/label | Deterministic | Has partial_fit (suitable for online learning) | Has predict_proba | Has feature_importances_ | Key sklearn hyperparameters | Typical loss function | To increase regularization (or similar effect) | Typical evaluation metric | Training complexity | Prediction complexity | Space complexity | Embarrassingly parallel | Supports NaNs | Restriction | Preference | Remarks |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Linear regression | Regression | Linear, deterministic regressor | linear_model.LinearRegression (no regularization), linear_model.Lasso (L1 regularization), linear_model.Ridge (L2 regularization), linear_model.ElasticNet (L1 and L2), linear_model.ElasticNetCV, linear_model.SGDRegressor | Yes, if regularization is applied | Yes, if regularization is applied | Yes | Yes | No (use SGDRegressor instead) | No | No (inspect coef_ instead, and only if the features are scaled) | alpha | Mean squared error | Increase alpha (usually a squared L2 and/or L1 penalty) | R², MSE, RMSE | O(p²n + p³) for n examples and p features | O(p) | O(p) | No | No | Linear or linearized relationships | | Check residuals for normality and homoscedasticity |
Logistic regression | Classification | Binary classifier (multiclass via one-vs-rest); determinism depends on the solver | linear_model.LogisticRegression, linear_model.SGDClassifier | Yes, if regularization is applied | Not applicable | No | Yes | No (use SGDClassifier with log loss instead) | Yes | No | penalty, C | Cross-entropy (aka log loss, logistic loss, deviance) | Decrease C (usually an L2 penalty), or increase alpha for SGDClassifier | Likelihood ratio, weighted F1 | O(np) | O(p) | O(p) | No | No | | | |
Ridge regression classification | Classification | Linear classifier that fits a ridge regression to ±1-encoded labels (multiclass via one-vs-rest) | linear_model.RidgeClassifier | Yes | Not applicable | No | Depends on solver | No | No | No | alpha | Squared error on the ±1-encoded labels | Increase alpha | Weighted F1 | O(p²n + p³) for n examples and p features | O(p) | O(p) | No | No | Continuous features | | |
k-nearest neighbours | Classification or Regression | Instance-based, non-parametric, multiclass classifier | neighbors.KNeighborsClassifier, neighbors.KNeighborsRegressor | Yes | No | Yes | Yes | No | Yes | No | n_neighbors | None | Increase n_neighbors | Weighted F1 (classification) or R², MSE, RMSE (regression) | O(1) for brute force, or O(pn log n) to build a k-d tree, for n examples and p features | O(npk) for brute force with k neighbours, or O(k log n) for a k-d tree | O(np) (the training data is stored) | No | No | Small datasets, low dimensionality, dense data | | Fast to train, slow to predict |
Linear support vector machine | Classification or Regression | Linear decision boundary, binary classifier (multiclass via one-vs-rest) | svm.SVC, svm.SVR, svm.LinearSVC, svm.LinearSVR, svm.NuSVC, svm.NuSVR, linear_model.SGDClassifier | Yes | Yes | No | Only if probability=False | No (use the SGD variants instead) | Only if probability=True | No | C (or alpha for SGDClassifier) | Hinge loss | Decrease C (squared L2 penalty), or increase alpha for SGDClassifier | Weighted F1 (classification) or R², MSE, RMSE (regression) | O(n²p) for n examples and p features | O(sp) for s support vectors and p features | O(sp) for s support vectors and p features | No | No | Linearly separable classes | | |
Nonlinear support vector machine | Classification or Regression | Kernel-based, non-linear decision boundary, binary classifier (multiclass via one-vs-rest) | svm.SVC, svm.SVR, svm.NuSVC, svm.NuSVR | Yes | Yes | No | Only if probability=False | No (use the SGD variants instead) | Only if probability=True | No | kernel, gamma, C | Hinge loss | Decrease C (squared L2 penalty) | Weighted F1 (classification) or R², MSE, RMSE (regression) | O(n²p + n³) for n examples and p features | O(sp) for s support vectors and p features | O(sp) for s support vectors and p features | No | No | | | |
Decision tree | Classification or Regression | Non-parametric, multiclass classifier | tree.DecisionTreeClassifier, tree.DecisionTreeRegressor | No | No | Yes | Yes | No | Yes | Yes | max_features, max_depth, min_samples_leaf, min_samples_split | Gini impurity (per split, not global, so not strictly a loss function) | Decrease max_depth or max_features, or increase min_samples_split or min_samples_leaf | Weighted F1 (classification) or R², MSE, RMSE (regression) | O(nzp) for n examples, p features, and maximum depth z | O(z) for maximum depth z | O(z) | Yes | No | | | Prone to overfitting |
Random forest | Classification or Regression | Stochastic, ensemble-based multiclass classifier | ensemble.RandomForestClassifier, ensemble.RandomForestRegressor | No | No | Yes | No | No | Yes | Yes | n_estimators, max_features, max_depth, min_samples_leaf, min_samples_split | Gini impurity (per split, not global, so not strictly a loss function) | Decrease max_depth or max_features, or increase min_samples_split or min_samples_leaf | Weighted F1 (classification) or R², MSE, RMSE (regression) | O(nzpt) for n examples, p features, maximum depth z, and t trees | O(zt) | O(zt) | Yes | No | | | |
Extremely randomized trees | Classification or Regression | Stochastic, ensemble multiclass classifier (ExtraTrees is to ExtraTree as RandomForest is to DecisionTree) | ensemble.ExtraTreesClassifier, ensemble.ExtraTreesRegressor | No | No | Yes | No | No | Yes | Yes | n_estimators, max_features, max_depth, min_samples_leaf, min_samples_split | Gini impurity (per split, not global, so not strictly a loss function) | Decrease max_depth or max_features, or increase min_samples_split or min_samples_leaf | Weighted F1 (classification) or R², MSE, RMSE (regression) | O(nzpt) for n examples, p features, maximum depth z, and t trees | O(zt) | O(zt) | Yes | No | | | |
Gradient boosted trees | Classification or Regression | Stochastic ensemble of sequentially boosted trees, multiclass classifier | ensemble.GradientBoostingClassifier, ensemble.GradientBoostingRegressor, ensemble.HistGradientBoostingClassifier, ensemble.HistGradientBoostingRegressor | No | No | No | No | No | Yes | Yes | n_estimators, learning_rate, max_features, max_depth, min_samples_leaf, min_samples_split | Cross-entropy (classification) or squared error (regression) | Decrease max_depth, max_features, or learning_rate, or increase min_samples_split or min_samples_leaf | Weighted F1 (classification) or R², MSE, RMSE (regression) | O(nzpt) for n examples, p features, maximum depth z, and t trees | O(zt) | O(zt) | No (boosting rounds are sequential) | Yes (histogram-based variants only) | | | |
Gaussian process | Classification or Regression | Non-parametric, probabilistic classifier/regressor | gaussian_process.GaussianProcessClassifier, gaussian_process.GaussianProcessRegressor (not including the kernels) | Sensitivity to scaling depends on the kernel (RBF is sensitive); ensure inputs are approximately Gaussian | No | No | No | No | Yes | No | kernel, alpha | NA (fitting maximizes the log marginal likelihood) | Increase the kernel length scale, or increase the noise level alpha | Weighted F1 (classification) or R², MSE, RMSE (regression) | O(n³) for n examples | O(n²) for n examples | O(n²) for n examples | No | No | | | |
Naive Bayes | Classification | Probabilistic, maximum a posteriori classifier with a strong assumption of feature independence (the naivety) | naive_bayes.GaussianNB, naive_bayes.MultinomialNB (e.g. for text), naive_bayes.BernoulliNB (binary features), naive_bayes.CategoricalNB (discrete features), naive_bayes.ComplementNB (for imbalanced data) | No | Not applicable | No | Yes | Yes | Yes | No | priors | NA | Increase the smoothing parameter alpha, or strengthen the priors | Classification metrics, e.g. weighted F1 | O(np) | O(cp) for c classes and p parameters | O(p) | No | No | Independent features | | Often used in NLP |
Multilayer perceptron | Classification or Regression | Deep feedforward artificial neural network | neural_network.MLPClassifier, neural_network.MLPRegressor | Yes | Yes | Yes | No | Yes | Yes | No | hidden_layer_sizes, activation, alpha, learning_rate_init, max_iter | Cross-entropy (classification) or squared error (regression) | Increase alpha (L2 penalty) | Weighted F1 (classification) or R², MSE, RMSE (regression) | 😬 (roughly O(enp) for e epochs, n examples, and p weights) | O(p) for p parameters (weights) | O(p) | No | No | | | |
Stochastic gradient descent | Classification or Regression | Regularized linear models fitted with stochastic gradient descent; by default the classifier fits a linear SVM and the regressor an L2-regularized linear model | linear_model.SGDClassifier, linear_model.SGDRegressor | Yes | Yes | Yes | No | Yes | Only for loss="log_loss" or "modified_huber" | No | loss, alpha, penalty | Depends on loss; the classifier defaults to hinge loss and the regressor to squared error | Increase alpha | Weighted F1 (classification) or R², MSE, RMSE (regression) | O(knp) for k iterations over n examples with p features | O(p) | O(p) | | | | | |
Convolutional neural network | Feature extraction + classification | Deep neural network with convolutional layers | None (not available in sklearn) | Yes | Not applicable | Yes | No | NA | NA | NA | NA | Cross-entropy (classification) or squared error (regression) | | | O(n·L^d·c·k^d) per layer, for n examples of length L in each of d dimensions and c kernels of length k in each of d dimensions | O(L^d·c·k^d) per layer, per example | O(p) overall, i.e. O(c·k^d) per layer | Yes | No | Spatially correlated data | | Often coupled to an MLP for classification |
Restricted Boltzmann machine | Feature extraction | Stochastic, generative neural network with binary units | neural_network.BernoulliRBM | Yes | Not applicable (unsupervised) | | | Yes | | | n_components, learning_rate | | | | | | | No | No | | | |
Principal component analysis | Dimensionality reduction | Linear projection onto the directions of maximal variance | decomposition.PCA | Yes | Not applicable (unsupervised) | | Yes | Yes (via decomposition.IncrementalPCA.partial_fit) | | | n_components | Rayleigh quotient (maximized) | | | O(2nd² + d³ + n + nd) for n examples and d features, using truncated SVD | O(np + p²) | | No | No | | | |
Independent component analysis | Dimensionality reduction | Linear decomposition into statistically independent components | decomposition.FastICA | | Not applicable (unsupervised) | | Yes | | | | n_components | | Sparsity penalty (L1) | | | | | No | No | | | |
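A few short usage sketches follow, illustrating API behaviours referenced in the table. All of them run on synthetic data invented purely for illustration.

First, the linear-regression rows say to scale features *and* target once a penalty (alpha) is applied. A minimal sketch of that, assuming Ridge as the regularized model and StandardScaler for both scalings:

```python
# Sketch: regularized linear regression with feature and target scaling,
# as the table recommends when a penalty (alpha) is applied.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # synthetic features
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Scale features inside the pipeline; wrap the whole thing to scale the target too.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    transformer=StandardScaler(),
)
model.fit(X, y)
print(model.predict(X[:3]))                        # predictions on the original y scale
```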
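The logistic-regression row points to SGDClassifier with log loss for online learning, since LogisticRegression itself has no partial_fit. A sketch of that incremental route (loss="log_loss" assumes sklearn ≥ 1.1; older versions spell it "log"):

```python
# Sketch: online learning via partial_fit on a stream of mini-batches.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="log_loss", alpha=1e-4)   # logistic loss => probabilistic
classes = np.array([0, 1])                         # must be declared up front

for _ in range(10):                                # pretend these arrive over time
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict_proba(rng.normal(size=(2, 4))))  # available because loss is log_loss
```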
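The SVM rows note that predict_proba exists only with probability=True, and that this switch also costs determinism (an internal cross-validated calibration is fitted). A sketch with an RBF kernel and the feature scaling the table calls for:

```python
# Sketch: probability estimates from a kernel SVM require probability=True.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = make_pipeline(
    StandardScaler(),                              # SVMs are scale-sensitive
    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True, random_state=0),
)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))                    # would raise without probability=True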
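For the tree-ensemble rows, a sketch showing the feature_importances_ attribute and the regularization knobs from the "to increase regularization" column; the dataset shape and hyperparameter values are arbitrary:

```python
# Sketch: a random forest regularized by depth and leaf-size constraints,
# with per-feature importances and no feature scaling needed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(
    n_estimators=200,
    max_depth=6,                                   # shrink to regularize
    min_samples_leaf=5,                            # raise to regularize
    n_jobs=-1,                                     # trees fit embarrassingly in parallel
    random_state=0,
)
forest.fit(X, y)
print(forest.feature_importances_.round(3))
```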
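The boosting row marks the histogram-based variants as the table's only native NaN support. A sketch that deliberately punches holes in the inputs:

```python
# Sketch: HistGradientBoostingRegressor handles NaNs in its split finder,
# so no imputation step is required.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=400)
X[rng.random(X.shape) < 0.1] = np.nan              # knock out ~10% of entries

model = HistGradientBoostingRegressor(max_depth=4, random_state=0)
model.fit(X, y)                                    # trains despite the NaNs
print(model.predict(X[:3]))
```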
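For the Gaussian-process row, a sketch whose regularization knobs are exactly the two the table names, the kernel length scale and the noise term alpha; the O(n³) fit is kept tiny on purpose:

```python
# Sketch: GP regression with an RBF kernel; length_scale and alpha
# are the smoothing/regularization controls.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gpr.fit(X, y)                                      # O(n³): fine for 40 points
mean, std = gpr.predict(X[:3], return_std=True)    # probabilistic output
print(mean.round(2), std.round(2))
```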
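The Naive Bayes row's "often used in NLP" remark usually means MultinomialNB on token counts. A toy sketch (the four documents are invented):

```python
# Sketch: MultinomialNB over bag-of-words counts, the classic NLP pairing;
# alpha is the additive-smoothing regularizer.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["good great fine", "bad awful poor", "great good", "poor bad"]
labels = [1, 0, 1, 0]

clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(docs, labels)
print(clf.predict(["good fine"]), clf.predict_proba(["bad poor"]))
```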
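Finally, the PCA row's partial_fit pointer refers to IncrementalPCA, which fits in chunks and so supports out-of-core data. A sketch with in-memory chunks standing in for batches read from disk:

```python
# Sketch: incremental PCA via partial_fit on successive chunks.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
ipca = IncrementalPCA(n_components=3)
for _ in range(20):                                # e.g. chunks streamed from disk
    chunk = rng.normal(size=(100, 10))             # chunk size must be >= n_components
    ipca.partial_fit(chunk)

print(ipca.transform(rng.normal(size=(2, 10))).shape)  # -> (2, 3)
```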