Knowing the basics of statistics is essential for analysing data and drawing insights from it. Below is a very brief description of the topics in layman's terms. Feel free to correct me & provide a better explanation of any topic.
#1 STATISTICAL LEARNING
Types:
Descriptive statistics : Simply a way to describe our data - summarisation of data via calculations, graphs, tables etc. Works on data from the entire population.
Inferential Statistics : Uses hypothesis testing to support or reject a statement about the population. Works on data from a sample of the population.
Variables:
1. Categorical (nominal): Qualitative data in which the values are assigned to a set of distinct groups or categories. Example: age group, sex, race, educational level etc.
2. Numerical: Values that represent quantities.
a) Discrete: Countable values. Example: number of coin flips, number of students etc.
b) Continuous: Can take any value within a range. Example: length, weight, volume, time.
Measure of Central Tendency: Mean, Mode, Median
Measure of Dispersion: Range, Variance, Standard deviation, Percentile, Quartile (Q1, Q2, Q3), Interquartile range (Q3-Q1)
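A quick way to see most of these descriptive measures at once is pandas' describe(); a minimal sketch with made-up values (the column names are just for illustration):
import pandas as pd

df = pd.DataFrame({
    "age_group": ["18-25", "26-35", "18-25", "36-45"],   # categorical (nominal)
    "weight_kg": [61.5, 72.0, 68.3, 80.1],               # numerical (continuous)
})

print(df.describe())                    # count, mean, std, min, Q1, median, Q3, max
print(df["age_group"].value_counts())   # frequency table for the categorical column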
#2 STATISTICAL LEARNING
Measure of Central Tendency : Mean, Median, Mode
I) MEAN
a) Sample Mean (x bar): Simple average, sum(x) / n, where
n = number of elements in the given sample.
b) Population Mean (mu): Simple average, sum(x) / N, where
N = number of elements in the complete population.
c) Weighted Mean (x bar): A mean in which some data points contribute more towards the end result than others, sum(wi * xi) / sum(wi), where
wi = weights & xi = data points
II) MEDIAN :
Middle value of the given data set when sorted in ascending or descending order.
a) Even number of data points: Arrange the data in ascending order and take the average of the two middle terms; this value is the median of the data set.
b) Odd number of data points: The exact centre value is the median.
III) MODE : The value which appears most frequently in the data set.
• A data set can have 1 mode, more than 1 mode, or no mode at all.
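A minimal sketch of all three measures using Python's standard library and numpy (the data points and weights below are made up for illustration):
import statistics as st
import numpy as np

data = [4, 8, 6, 5, 3, 8, 9, 7, 8]

print(st.mean(data))     # simple mean: sum(x) / n
print(st.median(data))   # middle value after sorting (average of the two middle terms if n is even)
print(st.mode(data))     # most frequent value -> 8

values = [70, 80, 90]
weights = [0.2, 0.3, 0.5]                   # assumed weights
print(np.average(values, weights=weights))  # weighted mean: sum(wi * xi) / sum(wi)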
Tomorrow: Measure of Dispersion
#3 STATISTICAL LEARNING
Topic : Measure of Dispersion - the extent to which a distribution is stretched or squeezed. It indicates the scattering of the data.
TYPES :
1) Algebraic Measure
2) Graphical Measure
1. ALGEBRAIC MEASURE:
I) Absolute Measure
II) Relative Measure
I) Absolute Measure
a) Range(R): Simplest of all. Difference between largest & smallest of given distribution.
R = Largest item(L) - Smallest item(S)
b) Interquartile Range: Difference between upper quartile & lower quartile of distribution
IR = Upper quartile(Q3)- Lower quartile(Q1)
c) Quartile deviation (QD): Also known as the semi-interquartile range; half of the interquartile range.
QD = (Q3 - Q1) / 2
d) Mean Deviation: Arithmetic mean (average) of the absolute deviations of the observations from a central value such as the mean or median.
e) Standard Deviation (SD): Tells us about the consistency of the data; the smaller the SD, the closer the observations sit to the mean.
If the observations come from a normal distribution then:
68% of observations lie between mean +/- 1 SD
95% of observations lie between mean +/- 2 SD
99.7% of observations lie between mean +/- 3 SD (six-sigma concept)
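A minimal sketch of these absolute measures with numpy (the observations are made up for illustration):
import numpy as np

data = np.array([12, 15, 14, 10, 18, 20, 16, 14, 13, 17])

r = data.max() - data.min()               # Range = L - S
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                             # Interquartile range = Q3 - Q1
qd = iqr / 2                              # Quartile (semi-interquartile) deviation
md = np.mean(np.abs(data - data.mean()))  # Mean deviation about the mean
sd = data.std(ddof=1)                     # Sample standard deviation
print(r, iqr, qd, md, sd)

# Empirical-rule check: share of observations within mean +/- 1 SD (~68% for normal data)
print(np.mean(np.abs(data - data.mean()) <= sd))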
#4 STATISTICAL LEARNING
.DISTRIBUTION OF SHAPE.
UNIMODAL : A probability distribution which has a single peak, i.e. possesses a single unique mode.
BIMODAL : A probability distribution which has two peaks, i.e. two different modes.
NORMAL : A probability distribution where observations cluster symmetrically around a central peak. Also known as the Gaussian distribution.
SKEWNESS : Used to measure the symmetry of the data.
S = 3 * (mean - median) / sigma (standard deviation)
Normal distribution : The mean, mode & median are the same & the skewness is 0.
Negatively skewed (-ve) : The mean of the distribution is less than the median.
Positively skewed (+ve) : The mean of the distribution is greater than the median.
KURTOSIS : Used to measure the tailedness of the distribution. It measures the amount of probability in the tails (heavy/light tails) relative to a normal distribution.
<Caution: You'll find definitions that describe kurtosis as a measure of peakedness/flatness, but that is an ambiguous interpretation, not the actual definition.>
- The kurtosis of any univariate normal distribution is 3, so the values below are stated as excess kurtosis (kurtosis - 3).
Mesokurtic (excess kurtosis = 0): Same tailedness as a normal distribution.
Leptokurtic (excess kurtosis > 0): Heavy/long tails, or outliers.
Platykurtic (excess kurtosis < 0): Light/short tails, or a lack of outliers.
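A minimal sketch of skewness and kurtosis with scipy (the two simulated samples are just for illustration):
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal_like = rng.normal(size=10_000)        # roughly symmetric
right_skewed = rng.exponential(size=10_000)  # long right tail

print(skew(normal_like), skew(right_skewed))  # ~0 vs clearly positive

# scipy's kurtosis() returns EXCESS kurtosis by default (fisher=True), i.e. kurtosis - 3,
# so the normal sample gives ~0 (mesokurtic) and the skewed sample gives > 0 (leptokurtic)
print(kurtosis(normal_like), kurtosis(right_skewed))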
# 5 STATISTICAL LEARNING
HYPOTHESIS SELECTION
Step 1: Check the data for normality; if it is normal, select a parametric test, else choose a non-parametric test (a minimal normality-check sketch follows).
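One common way to run this normality check is the Shapiro-Wilk test; a minimal sketch (data1 is assumed to be a 1-D sample, as in the snippets below):
from scipy.stats import shapiro
stats, p = shapiro(data1)
print(stats, p)   # p > 0.05 -> treat as normal (parametric test), else non-parametric test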
NON-PARAMETRIC TEST:
a) Wilcoxon signed-rank test: Used to compare two paired (related) samples where the data is continuous (e.g. the same subjects measured under two different situations).
from scipy.stats import wilcoxon
stats, p = wilcoxon(data1, data2)
b) Friedman test : Used to compare more than two paired samples (the same subjects measured under more than two situations).
from scipy.stats import friedmanchisquare
stats, p = friedmanchisquare(data1, data2, data3)
c) The Mann-Whitney U test: Used to compare two independent data samples.
from scipy.stats import mannwhitneyu
stats, p = mannwhitneyu(data1, data2)
d) Kruskal-Wallis H Test: Used to compare more than two independent samples.
from scipy.stats import kruskal
stats, p = kruskal(data1, data2, data3)
e) Chi-square test : Used to check for dependence between variables. Used only when both variables are categorical.
import pandas as pd
from scipy.stats import chi2_contingency
chitable = pd.crosstab(data1, data2)   # contingency table of the two categorical variables
stats, p, dof, expected = chi2_contingency(chitable)
print(stats, p)
# 6 STATISTICAL LEARNING
CHOOSING A TEST FOR GIVEN DATA
Step 1: Check the data for normality; if it is normal, select a parametric test, else choose a non-parametric test (see the Shapiro-Wilk sketch in the previous post).
PARAMETRIC TEST:
a) One-sample t-test: Used to compare a sample mean with a population mean (the population mean can be known or assumed for comparison purposes).
from scipy.stats import ttest_1samp
stats, p = ttest_1samp(data, population_mean)   # population_mean: known or assumed value
print(stats, p)
b) Two-sample paired t-test: Used to compare the means of two paired (rel = related) samples.
from scipy.stats import ttest_rel
stats, p = ttest_rel(data1, data2)
print(stats, p)
c) Two-sample independent t-test: Used to compare the means of two independent (separate) samples.
from scipy.stats import ttest_ind
stats, p = ttest_ind(data1, data2)
print(stats, p)
d) F-test with one factor (One-Way ANOVA): Dependent variable - continuous; one independent variable - categorical.
e) F-test with two factors (Two-Way ANOVA): Dependent variable - continuous; two independent variables - categorical.
(A minimal one-way ANOVA sketch follows.)
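A minimal one-way ANOVA sketch with scipy (data1, data2, data3 are assumed to be the continuous samples for three categories, as in the snippets above):
from scipy.stats import f_oneway
stats, p = f_oneway(data1, data2, data3)
print(stats, p)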
Next: Steps used in Critical Value Approach.
We will discuss more on ANOVA & ANCOVA topics in further discussions.
# 7 STATISTICAL LEARNING
HYPOTHESIS TESTING: A methodology to support or reject a statement with the help of statistical analysis.
STEPS IN- CRITICAL VALUE APPROACH
STEP-1 : Formulate the NULL (H0) hypothesis & ALTERNATE (H1) hypothesis
NULL HYPOTHESIS (H0) : The default assumption of no effect / no difference. Its statement always contains equality, i.e. symbols such as "=" , "<=", ">=".
ALTERNATE HYPOTHESIS (H1/Ha): Exactly the opposite of the null hypothesis. It never contains equality; it uses symbols such as "!=", "<", ">".
STEP-2 : Select the appropriate test statistics
One tail test : When the rejection region is on one side of the sample distribution.
a) Right Tail: Rejection region on the right side of the sample distribution.
b) Left Tail: Rejection region on the left side of the sample distribution.
Two Tail test: Rejection region is on both sides of the sample distribution.
a) Z-test: Used when sigma (the population standard deviation) is known.
Z(stat) = (xbar - mu) / (sigma / sqrt(n))
xbar = sample mean
mu = population mean
sigma = population standard deviation
n = sample size
b) T-test: Used when sigma (the population standard deviation) is unknown. Here S (the sample standard deviation) is used in the formula.
t(stat) = (xbar - mu) / (S / sqrt(n))
xbar = sample mean
mu = population mean
S = sample standard deviation
n = sample size
Step-3 : Choose the level of significance, the confidence interval (+/- range), and the degrees of freedom.
a) Level of significance (alpha) : Also known as the error rate (the probability of rejecting a true null hypothesis); it depends on how much error can be tolerated. Generally taken as 0.05 (5%) if not given.
b) Degrees of freedom : Depends on the test; for example, n - 1 for a one-sample t-test, or (rows - 1) x (columns - 1) for a chi-square test of independence.
Step-4 : Compute the calculated value of the test statistics.
Using the appropriate formula, based on the type of test, find the value for statistics. (Zstat / Tstat)
Step-5 : Find the critical value of the test statistic from the suitable table.
For a t-test, use the t-table with the known alpha and degrees-of-freedom values to find the +/- critical range.
Step-6 : Now compare the calculated value with the values from the table.
Step-7 : Make the statistical decision & state the managerial conclusion.
If the calculated value falls inside the +/- critical range - fail to reject (accept) the null hypothesis
If it falls outside the range - reject the null hypothesis
(A worked sketch of these steps follows.)
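A minimal worked sketch of the critical value approach for a one-sample, two-tailed t-test (the sample values and the assumed population mean are made up for illustration):
import numpy as np
from scipy.stats import t

data = np.array([5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.4, 4.7])
mu0 = 5.0                                 # Step-1: H0 assumes the population mean is 5.0

n = len(data)
xbar = data.mean()
s = data.std(ddof=1)                      # sample standard deviation
tstat = (xbar - mu0) / (s / np.sqrt(n))   # Step-4: calculated value

alpha = 0.05                              # Step-3: level of significance
dof = n - 1                               # degrees of freedom for a one-sample t-test
tcrit = t.ppf(1 - alpha / 2, dof)         # Step-5: table (critical) value, two-tailed

# Step-6/7: compare and decide
if abs(tstat) <= tcrit:
    print("Fail to reject H0")            # calculated value inside the +/- critical range
else:
    print("Reject H0")
print(tstat, tcrit)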
# 8 STATISTICAL LEARNING
ANOVA : Stands for Analysis of Variance
It is a way to compare the variation within groups with the variation between groups; basically we are testing the groups to see if there is a significant difference between their means.
In other words, it helps you figure out whether you need to reject the null hypothesis or accept the alternate hypothesis.
Example:
-A manufacturer has two different processes to make light bulbs. They want to know if one process is better than the other.
-Students from different colleges take the same exam. You want to see if one college outperforms the other.
Dependent variable (DV): Always the target, i.e. the output variable, and is continuous.
Independent variable (IDV): Always the input variable, and is categorical.
TYPES
A deciding factor in ANOVA is eta^2 (eta squared).
eta^2 is an effect-size measure: it tells us how much of the variation in the dependent variable (output) is explained by the independent variable (input).
The Formula for ANOVA is:
F = MST / MSE
where:
F=ANOVA coefficient
MST= Mean sum of squares due to treatment
MSE= Mean sum of squares due to error
F-test : Used when the independent variable is categorical (and the dependent variable is continuous). A sketch computing F and eta^2 by hand follows.
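A minimal sketch computing F = MST/MSE and eta^2 by hand, cross-checked against scipy (the three groups of exam scores are made up for illustration):
import numpy as np
from scipy.stats import f_oneway

g1 = np.array([78, 85, 82, 88, 75])
g2 = np.array([80, 83, 79, 85, 81])
g3 = np.array([90, 92, 88, 95, 91])
groups = [g1, g2, g3]

allx = np.concatenate(groups)
grand = allx.mean()

ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # treatment sum of squares
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)        # error sum of squares

mst = ss_between / (len(groups) - 1)          # mean sum of squares due to treatment
mse = ss_within / (len(allx) - len(groups))   # mean sum of squares due to error

F = mst / mse                                   # ANOVA coefficient
eta_sq = ss_between / (ss_between + ss_within)  # share of output variance explained by the groups
print(F, eta_sq)
print(f_oneway(g1, g2, g3))                     # scipy gives the same F statistic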