Knowing the basics of statistics is very important for analysing data and building a basic understanding of it. Below is a very brief description of each topic in layman's terms. Feel free to correct me & provide a better understanding of the topics.

#1 STATISTICAL LEARNING

Types:

Descriptive statistics : Simply a way to describe our data, summarising it via calculations, graphs, tables etc.   :star-revolving: Uses data from the entire population.

Inferential statistics : The use of hypotheses to prove or disprove a statement.   :star-revolving: Uses data from a sample of the population.

Variables:

1. Categorical (nominal) : Qualitative data in which the values are assigned to a set of distinct groups or categories. Example: age group, sex, race, educational level etc.

2. Numerical : Values that represent quantities.

a) Discrete : Countable, with a finite set of possible values. Example: number of coin flips, number of students etc.

b) Continuous : Can take any value within a range. Example: length, date & time, weight, volume.

Measure of Central Tendency:  Mean, Mode, Median

Measure of Dispersion: Range, Variance, Standard deviation, Percentile, Quartile (Q1, Q2, Q3), Interquartile range (Q3-Q1)
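All of these measures can be computed with Python's standard statistics module; a minimal sketch with made-up numbers:

```python
# Central tendency and dispersion with the standard library (data is made up).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # 5.0
median = statistics.median(data)        # 4.5
mode = statistics.mode(data)            # 4

data_range = max(data) - min(data)      # 9 - 2 = 7
variance = statistics.pvariance(data)   # population variance = 4.0
std_dev = statistics.pstdev(data)       # population SD = 2.0

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles Q1, Q2, Q3
iqr = q3 - q1                                 # interquartile range Q3 - Q1
print(mean, median, mode, data_range, variance, std_dev, iqr)
```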

#2 STATISTICAL LEARNING

Measure of Central Tendency : Mean, Median, Mode

I) MEAN

   

  a) Sample Mean ( x bar ) : Simple average, sigma(x) / n, where

            n = number of elements in the sample taken.

  b) Population Mean ( mu ) : Simple average, sigma(x) / N, where

            N = number of elements in the complete population.

  c) Weighted Mean ( x bar ) : A category-wise mean in which some data points contribute more towards the end result: sigma(wi•xi) / sigma(wi), where

            wi = weights & xi = data points
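The formulas above in code; the scores and weights are made up for illustration:

```python
# Simple vs weighted mean; scores/weights are illustrative only.
import statistics

scores = [80, 90, 70]        # xi
weights = [0.5, 0.3, 0.2]    # wi, e.g. assignment weightings

simple_mean = statistics.mean(scores)                      # sigma(x) / n = 80.0
weighted_mean = (sum(w * x for w, x in zip(weights, scores))
                 / sum(weights))                           # sigma(wi*xi) / sigma(wi)
print(simple_mean, weighted_mean)  # 80.0 81.0
```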

II) MEDIAN :

Middle value of the data set when sorted in ascending or descending order.

a) Even number of data points : Normally arranged in ascending order, and the average of the two middle terms is calculated; this value is the median of the data set.

b) Odd number of data points : There is an exact centre value, which is the median.

III) MODE : The value which appears most frequently in the data set.

  • A data set can have 1 mode, more than 1 mode, or no mode at all.
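statistics.multimode shows all three cases on small illustrative lists; note that when every value ties, Python returns all of them, which corresponds to the "no mode" case:

```python
# The three mode cases.
from statistics import multimode

print(multimode([1, 2, 2, 3]))     # one mode: [2]
print(multimode([1, 1, 2, 2, 3]))  # two modes: [1, 2]
print(multimode([1, 2, 3]))        # all tie -> no single mode: [1, 2, 3]
```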

:point_right:Tomorrow: Measure of Dispersion :v:

#3  STATISTICAL LEARNING

Topic : Measure of Dispersion - the extent to which a distribution is stretched or squeezed. It indicates the scattering of the data.

TYPES :

1) Algebraic Measure

2) Graphical Measure

1. ALGEBRAIC MEASURE:

I) Absolute Measure

II) Relative Measure

I) Absolute Measure

a) Range(R): Simplest of all. Difference between largest & smallest of given distribution.      

            R = Largest item(L) - Smallest item(S)

b) Interquartile Range: Difference between upper quartile & lower quartile of distribution

                            IR = Upper quartile(Q3)- Lower quartile(Q1)

c) Quartile deviation: Known as semi-Interquartile-Range. Half of the Interquartile Range.  

            QD = (Q3 -Q1) / 2

d) Mean Deviation: The arithmetic mean (average) of the deviations of the observations from a central value such as the mean or median.

e) Standard Deviation (SD):  It tells us about the consistency of data.

If observations are from a normal distribution then:

68%   Observations lie between mean +/- 1 SD

95%   Observations lie between mean +/- 2 SD

99.7% Observations lie between mean +/- 3 SD   (the six-sigma concept)
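The 68-95-99.7 rule can be verified from the standard normal CDF (assuming scipy is available):

```python
# Empirical rule check: P(mu - k*SD < X < mu + k*SD) for a normal distribution.
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {coverage:.3%}")
# within 1 SD: 68.269%
# within 2 SD: 95.450%
# within 3 SD: 99.730%
```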

#4  STATISTICAL LEARNING

SHAPE OF A DISTRIBUTION

UNIMODAL : A probability distribution which has a single peak, i.e. possesses a single unique mode.

BIMODAL : A probability distribution which has two peaks, i.e. two different modes.

NORMAL : A probability distribution where observations cluster symmetrically around the Central peak. Also known as Gaussian Distribution.

SKEWNESS : A measure of the asymmetry of the data.

            S = 3(mean - median) / sigma   (Pearson's second skewness coefficient)

Normal distribution : The mean, mode & median are same & the skewness is 0.

Negatively skewed (-ve) : The mean of the distribution is less than the median.

Positively skewed (+ve) : The mean of the distribution is greater than the median.
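A quick check of the skewness formula on a made-up right-skewed sample:

```python
# Pearson's second skewness coefficient S = 3(mean - median) / SD
# on a small sample with a long right tail (data is made up).
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 10]
mean = statistics.mean(data)      # 3.5
median = statistics.median(data)  # 3.0
sd = statistics.stdev(data)       # sample standard deviation
skew = 3 * (mean - median) / sd
print(skew > 0)  # True: mean > median, so positively (right) skewed
```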

KURTOSIS : A measure of the tailedness of the distribution: the amount of probability in the tails (heavy/light tailed) relative to a normal distribution.

:alert: Caution : You'll find definitions that describe kurtosis as a measure of peakedness/flatness, but that is an ambiguous interpretation, not the actual definition. :alert:

- The kurtosis of any univariate normal distribution is 3, so its excess kurtosis (kurtosis - 3) is 0; the labels below refer to excess kurtosis.

Mesokurtic (0) : Kurtosis equal to that of a normal distribution.

Leptokurtic (>0) : High kurtosis, with heavy/long tails or outliers.

Platykurtic (<0) : Low kurtosis, with light/short tails or few outliers.
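One caveat when computing this: scipy.stats.kurtosis returns excess kurtosis (kurtosis minus 3) by default, so a normal sample comes out near 0, matching the labels above. A sketch on simulated data:

```python
# Excess vs raw kurtosis on a large simulated normal sample.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)

excess = kurtosis(sample)             # fisher=True default: ~0 (mesokurtic)
raw = kurtosis(sample, fisher=False)  # ~3 for a normal distribution
print(round(excess, 1), round(raw, 1))
```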

# 5  STATISTICAL LEARNING    :dance_4: :dance_4::dance_4:

HYPOTHESIS SELECTION

Step 1 - Check the data for normality: if it is normal, select a parametric test, else choose a non-parametric test.
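Step 1 could be sketched with the Shapiro-Wilk normality test (scipy assumed; the data here is simulated, not real):

```python
# Normality check: Shapiro-Wilk. p > 0.05 -> no evidence against normality
# -> parametric test; otherwise prefer a non-parametric test.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=5, size=200)  # simulated sample

stat, p = shapiro(data)
family = "parametric" if p > 0.05 else "non-parametric"
print(f"p = {p:.3f} -> use a {family} test")
```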

NON-PARAMETRIC TEST:

a) Wilcoxon signed-rank test : Used to compare two paired (related) samples of continuous data. (same subjects under two different conditions)

from scipy.stats import wilcoxon

stats, p = wilcoxon(data1, data2)  

b) Friedman test : Used to compare more than two paired samples. (same subjects under more than two conditions)

from scipy.stats import friedmanchisquare

stats, p = friedmanchisquare(data1, data2, data3)

c) The Mann-Whitney U test: Used to compare two independent data samples.

from scipy.stats import mannwhitneyu

stats, p = mannwhitneyu(data1, data2)

d) Kruskal-Wallis H Test: Used to compare more than two independent samples.

from scipy.stats import kruskal

stats, p = kruskal(data1, data2, data3)

e) Chi-square test : Used to check the dependency between variables.   :alert: Used only when the variables are categorical.

import pandas as pd

from scipy.stats import chi2_contingency

chitable = pd.crosstab(data1, data2)

stats, p, dof, expected = chi2_contingency(chitable)

print(stats, p)

# 6  STATISTICAL LEARNING   :dance_4::dance_4::dance_4:  :fire: :fire: :fire:

CHOOSING A TEST FOR GIVEN DATA

Step 1 - Check the data for normality: if it is normal, select a parametric test, else choose a non-parametric test.

PARAMETRIC TEST:

a) One-sample t-test: Used to compare a sample mean with the population mean (this mean can be calculated, or assumed for comparison purposes).

from scipy.stats import ttest_1samp

stats, p = ttest_1samp(data, population_mean)

print(stats, p)

b) Two-sample paired t-test: Used to compare the means of two paired samples. (rel = related samples)

from scipy.stats import ttest_rel  

stats,p = ttest_rel(data1, data2)

print(stats,p)

c) Two-sample independent (separate) t-test: Used to compare the means of two independent samples.

from scipy.stats import ttest_ind

stats,p = ttest_ind(data1, data2)

print(stats, p)

d) One-way f-test (One Way ANOVA): Dependent variable - continuous; one independent variable - categorical.

e) Two-way f-test (Two Way ANOVA): Dependent variable - continuous; two independent variables - categorical.
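A minimal one-way ANOVA sketch with scipy's f_oneway; the three groups are made up:

```python
# One-way ANOVA: one continuous DV measured across three categorical groups.
from scipy.stats import f_oneway

group_a = [23, 25, 28, 30, 27]
group_b = [31, 33, 35, 30, 32]
group_c = [22, 21, 24, 23, 25]

stat, p = f_oneway(group_a, group_b, group_c)
print(stat, p)  # small p -> at least one group mean differs
```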

:muscle: Next: Steps used in Critical Value Approach.

We will discuss more on ANOVA & ANCOVA topics in further discussions.

# 7  STATISTICAL LEARNING     :fire: :fire: :fire:

HYPOTHESIS : A methodology to prove or disprove a statement with the help of statistical analysis.

STEPS IN-   CRITICAL VALUE APPROACH

STEP-1 : Formulate the NULL (H0) hypothesis & ALTERNATE (H1) hypothesis

NULL HYPOTHESIS (H0) : The default assumption of no effect or no difference. It is stated with symbols such as "=" , "<=" , ">=" .

ALTERNATE HYPOTHESIS (H1/Ha) : Exactly the opposite of the null hypothesis, stated with the strict symbols "<" , ">" , "!=" .

STEP-2 : Select the appropriate test statistics

One tail test : When the rejection region is on one side of the sample distribution.

a) Right Tail: Rejection region on the right side of the sample distribution.

b) Left Tail: Rejection region on the left side of the sample distribution.

Two Tail test: Rejection region is on both sides of the sample distribution.

a) Z-Test: Used when sigma (the population standard deviation) is known.

Z(stat) = (xbar - mu) / (sigma / sqrt(n))

xbar - Sample Mean

mu - Population mean

Sigma- Population Standard deviation

n - sample size (number of observations)

b) T-Test: Used when sigma (the population standard deviation) is unknown; S (the sample standard deviation) is used in the formula instead.

t(stat) = (xbar - mu) / (S / sqrt(n))

xbar - Sample Mean

mu - Population mean

S - Sample standard deviation

n - sample size (number of observations)
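The t-statistic formula above, computed by hand on made-up numbers:

```python
# t = (xbar - mu) / (S / sqrt(n)) with S the sample standard deviation.
import math

data = [52, 51, 53, 50, 54, 49, 52, 53]  # made-up sample
mu = 50                                  # hypothesised population mean

n = len(data)
xbar = sum(data) / n                                          # sample mean
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))   # sample SD
t_stat = (xbar - mu) / (s / math.sqrt(n))
print(round(t_stat, 3))  # 2.966
```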

Step-3 : Choose level of significance, confidence interval (+- range), Degree of freedom

a) Level of significance (alpha) : Also known as the error rate; depends on the required quality of the result. Generally taken as 0.05 (5% error) if not given.

b) Degrees of freedom : Depends on the test, e.g. n - 1 for a one-sample t-test, or (rows - 1) x (columns - 1) for a chi-square test of a contingency table.

Step-4 : Compute the calculated value of the test statistics.

Using the appropriate formula, based on the type of test, find the value for statistics. (Zstat / Tstat)

Step-5 : Find the critical value of the test statistic from the suitable table.

Here, use the Z- or t-table to find the +/- range from the known alpha and degrees-of-freedom values.

Step-6 : Now compare the calculated value with the values from the table.

Step-7 : Make the statistical decision & state the managerial conclusion.

If the calculated value falls inside the acceptance range - fail to reject the null hypothesis.

If it falls outside the range - reject the null hypothesis.
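The whole critical-value approach, sketched end-to-end for a two-tailed one-sample t-test (alpha = 0.05; scipy assumed; the sample is made up):

```python
# Steps 1-7 for H0: mean = 50 vs H1: mean != 50, two-tailed, alpha = 0.05.
from scipy.stats import t, ttest_1samp

data = [52, 51, 53, 50, 54, 49, 52, 53]   # made-up sample
mu = 50
alpha = 0.05
df = len(data) - 1                         # degrees of freedom: n - 1

t_stat, p = ttest_1samp(data, mu)          # step 4: calculated value
t_crit = t.ppf(1 - alpha / 2, df)          # step 5: table value (~2.365 for df=7)

# steps 6-7: compare and decide
decision = "Reject H0" if abs(t_stat) > t_crit else "Fail to reject H0"
print(decision, round(t_crit, 3))
```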

 

# 8 STATISTICAL LEARNING :fire::fire::fire:

ANOVA : Stands for Analysis of Variance

It is a way to compare variation within and between groups; basically, we test the groups to see if there is a difference between them.

In other words, it helps you figure out whether you need to reject the null hypothesis in favour of the alternate hypothesis.

Example:

-A manufacturer has two different processes to make light bulbs. They want to know if one process is better than the other.

-Students from different colleges take the same exam. You want to see if one college outperforms the others.

Dependent variable (DV) : Always the target variable, i.e. the output variable; it is continuous.

Independent variable (IDV) : Always an input variable; it is categorical.

TYPES

  • One way ANOVA:    1 Dependent variable   1 Independent variable.
  • Two way ANOVA:    1 Dependent variable   2 Independent variables.
  • Multi way ANOVA:  1 Dependent variable   >2 Independent variables.

The deciding factor in ANOVA is eta^2.

eta^2 tells us how much the independent variable (input) affects the dependent variable (output).

The Formula for ANOVA is:

F = MST / MSE

where:

F = ANOVA coefficient (the F-ratio)

MST = Mean sum of squares due to treatment

MSE = Mean sum of squares due to error

F-Test : Used when the Independent variable is categorical.

  • If no true variance exists between the groups, the ANOVA's F-ratio should be close to 1.
  • A one-way ANOVA is used for three or more groups of data, to gain information about the relationship between the dependent and independent variables.
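The F-ratio can be computed by hand from the mean squares and checked against scipy; eta^2 falls out of the same sums of squares (the groups are made up):

```python
# F = MST / MSE computed manually, plus eta^2 = SS_between / SS_total.
from scipy.stats import f_oneway

groups = [[23, 25, 28, 30, 27], [31, 33, 35, 30, 32], [22, 21, 24, 23, 25]]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(g) / len(g) for g in groups]

ss_between = sum(len(g) * (m - grand_mean) ** 2
                 for g, m in zip(groups, group_means))
ss_within = sum((x - m) ** 2
                for g, m in zip(groups, group_means) for x in g)

k, n = len(groups), len(all_values)
mst = ss_between / (k - 1)   # mean square due to treatment
mse = ss_within / (n - k)    # mean square due to error
f_manual = mst / mse

f_scipy, p = f_oneway(*groups)
eta_sq = ss_between / (ss_between + ss_within)  # share of variance explained
print(round(f_manual, 3), round(f_scipy, 3), round(eta_sq, 3))
```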