Databases and Data Mining
Statistics: the basics
dr. Jamolbek Mattiev
Outline
Basic definitions
Distributions
Probability
Patterns
About statistics …
Definition:
Introduction to Machine Learning and Data Mining
Statistics: the basics
“Statistical” statements – examples
Thus, statistics …
But, is also important …
Let's take a look at some examples 🡪
Example no. 1
“Statistical” finding/result:
Due to a new commercial campaign in May the sales of ice cream XYZ went up 30% in the next 3 months.
Ice cream sales go up in the summer months (June, July, August) regardless of the campaign.
“Historical effect” – attributing a result to one variable when in reality it depends on another; in our case, time (the season).
Example no. 2
“Statistical” finding/result:
The higher the number of churches in a city, the higher the crime rate. Hence: churches lead to crime.
Both the number of churches and the crime rate can be tied to the size of a city's population – bigger city, more churches, more crime.
“Third variable effect” – we wrongly assume that there is a causal connection between two variables when in fact a third variable is affecting both of them.
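The third variable effect can be reproduced in a small simulation. The sketch below uses made-up cities: the population range, the per-capita rates and the noise levels are all assumptions chosen for illustration, not real data. Churches and crime are both generated from population only, yet they correlate strongly; once the population effect is removed from both, the correlation largely disappears.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical cities: population drives both quantities; churches and crime
# never influence each other anywhere in this simulation.
population = [rng.uniform(10_000, 1_000_000) for _ in range(200)]
churches = [p / 2_000 + rng.gauss(0, 20) for p in population]
crimes = [p / 100 + rng.gauss(0, 500) for p in population]

r_raw = pearson(churches, crimes)
r_controlled = pearson(residuals(churches, population),
                       residuals(crimes, population))

print(f"churches vs. crime, raw:                r = {r_raw:.2f}")
print(f"churches vs. crime, population removed: r = {r_controlled:.2f}")
```

Comparing the two printed values shows how much of the apparent relationship is carried by the third variable.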
Example no. 3
“Statistical” finding/result:
This year there are 75% more interracial marriages than 25 years ago.
What if 25 years ago 1% of marriages were interracial, and this year 1.75% are (75% more)? Is this really such a drastic increase? And what about the fluctuations in the years in between?
Lack of data – we simply do not have enough data to draw sound conclusions.
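The arithmetic behind this example can be checked in a few lines. The sketch uses the two shares from the example (1% and 1.75%); the variable names are only illustrative. It separates the relative increase (in percent) from the absolute increase (in percentage points).

```python
share_then = 1.00  # percent of all marriages 25 years ago (from the example)
share_now = 1.75   # percent of all marriages this year (from the example)

# Relative change: how much larger the new share is compared to the old one.
relative_increase = (share_now - share_then) / share_then * 100

# Absolute change: the difference between the two shares, in percentage points.
absolute_increase = share_now - share_then

print(f"relative increase: {relative_increase:.0f}%")
print(f"absolute increase: {absolute_increase:.2f} percentage points")
```

The same data give "75% more" or "0.75 percentage points more", depending on which of the two is reported.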
Why is it important to know statistics?
Basic terminology and definitions
Descriptive statistics
Interestingly, Americans pay more to the people who take care of their teeth and feet than to those who protect and educate their children.
(Is Slovenia different?)
Salary (USD) | Occupation |
$112,760 | pediatricians |
$106,130 | dentists |
$100,090 | podiatrists |
$76,140 | physicists |
$53,410 | architects |
$49,720 | psychologists |
$47,910 | hostesses |
$39,560 | elementary school teachers |
$38,710 | policemen |
$18,980 | florists |
Inferential statistics
How to choose a sample? sampling
Rule:
The sample has to be representative, i.e. it has to reflect the properties of the population. Also beware of the sample size!
sample bias
Sampling – examples (1)
Example:
Among the Slovenian population aged 19 to 35, we survey only those individuals whose last name begins with the letter “Z”, and of those only every hundredth person.
What is the problem?
Sampling – examples (2)
Example:
We infer the probabilities of a fair coin toss coming out heads or tails from tossing such a coin 10 times.
What is the problem?
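One way to see how little 10 tosses reveal is to compute the exact binomial probabilities for a fair coin. This is a minimal sketch; the function name and the cut-off for a "lopsided" result (at most 3 or at least 7 heads) are choices made here for illustration.

```python
from math import comb

def prob_heads(k, n=10, p=0.5):
    """Probability of exactly k heads in n independent tosses (binomial)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Full distribution of the number of heads in 10 tosses of a fair coin.
for k in range(11):
    print(f"{k:2d} heads: {prob_heads(k):.3f}")

# Chance that the sample matches the true 50/50 split exactly.
p_exact_half = prob_heads(5)

# Chance of a clearly lopsided sample: at most 3 or at least 7 heads.
p_lopsided = sum(prob_heads(k) for k in (0, 1, 2, 3, 7, 8, 9, 10))

print(f"exactly 5 heads: {p_exact_half:.3f}")
print(f"lopsided result: {p_lopsided:.3f}")
```

Even with a perfectly fair coin, exactly 5 heads turns up in only about a quarter of such experiments, and roughly a third of them would suggest a probability of 0.3 or less, or 0.7 or more.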
Sampling – examples (3)
Example:
When testing the effect of a drug, we split a sample of people into 2 groups. To one group (the controls) we give the placebo, to the other the actual drug. We then observe whether there are differences between the two groups.
What could be the problem?
Sampling – examples (4)
Example:
There are 1000 balls in the basket (population): 70% are red, 20% are green and 10% are blue. The property used for stratification is thus the color of the balls.
How to sample this population to get a representative sample?
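One possible answer is proportional stratified sampling: draw from each color separately, in proportion to its share of the population. The sketch below assumes a sample of 100 balls; the function `stratified_sample` and its parameters are hypothetical names introduced for this illustration.

```python
import random

def stratified_sample(population, key, sample_size, rng):
    """Draw a sample whose strata shares match those of the population."""
    # Group the population into strata by the stratification property.
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)

    sample = []
    for members in strata.values():
        # Proportional allocation: each stratum contributes its population share.
        k = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

# The basket from the example: 1000 balls, 70% red, 20% green, 10% blue.
basket = ["red"] * 700 + ["green"] * 200 + ["blue"] * 100

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = stratified_sample(basket, key=lambda ball: ball, sample_size=100, rng=rng)

counts = {color: sample.count(color) for color in ("red", "green", "blue")}
print(counts)  # {'red': 70, 'green': 20, 'blue': 10}
```

With shares that do not divide the sample size evenly, the rounding step can make the total differ from `sample_size` by a ball or two, so a production version would need to handle the remainder explicitly.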
Variables / attributes
Percentiles
Say you took a test of motor abilities and scored 35 points out of a possible 50. What does this tell you about your motor abilities? How do you compare to the other participants in the test?
A more informative indicator would be: “what percentage of people are less capable (in motor terms) than me?” 🡪 this percentage is called a percentile.
If your score is in the 65th percentile, this means that 65% of all people taking the test scored worse than you. In your case, the 65th percentile = 35.
3 definitions of a percentile
Definition 1:
The Nth percentile is the lowest value that is strictly greater than N% of all values.
Definition 2:
The Nth percentile is the lowest value that is greater than or equal to N% of all values.
Definition 3:
A weighted average of the percentiles from the first two definitions (the most accurate definition, and the one we are going to use).
Percentile definitions – example
Value | Rank |
3 | 1 |
5 | 2 |
7 | 3 |
8 | 4 |
9 | 5 |
11 | 6 |
13 | 7 |
15 | 8 |
25th percentile:
Definition 1 = 7
Definition 2 = 5
Definition 3 = 5.5
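The three results can be reproduced with a short sketch. Definitions 1 and 2 follow the wording of the slides directly. For Definition 3 the slides do not spell out the weights, so the code assumes one common reading: linear interpolation between neighbouring values at rank N/100 × (number of values + 1). That reading yields 5.5 for this example; the function names are illustrative.

```python
import math

def percentile_def1(values, p):
    """Lowest value that is strictly greater than p% of all values."""
    ordered = sorted(values)
    n = len(ordered)
    for v in ordered:
        below = sum(1 for x in ordered if x < v)
        if below * 100 >= p * n:
            return v
    return ordered[-1]  # no value qualifies (very high p): fall back to the maximum

def percentile_def2(values, p):
    """Lowest value that is greater than or equal to p% of all values."""
    ordered = sorted(values)
    n = len(ordered)
    for v in ordered:
        at_or_below = sum(1 for x in ordered if x <= v)
        if at_or_below * 100 >= p * n:
            return v
    return ordered[-1]

def percentile_def3(values, p):
    """Interpolated percentile at rank p/100 * (n + 1) (assumed reading)."""
    ordered = sorted(values)
    n = len(ordered)
    rank = p / 100 * (n + 1)
    whole = math.floor(rank)  # integer part of the rank
    frac = rank - whole       # fractional part, used as the weight
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + frac * (upper - lower)

values = [3, 5, 7, 8, 9, 11, 13, 15]
print(percentile_def1(values, 25))  # 7
print(percentile_def2(values, 25))  # 5
print(percentile_def3(values, 25))  # 5.5
```

For the 25th percentile the rank is 0.25 × 9 = 2.25, so the result lies a quarter of the way from the value at rank 2 (5) to the value at rank 3 (7).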
How do we measure things?
= basis of data collection / errors
discrete
continuous
Information
Distributions of discrete variables
Frequency distribution:
color | no. of M&Ms |
brown | 17 |
red | 18 |
yellow | 7 |
green | 7 |
blue | 2 |
orange | 4 |
Probability distribution:
color | probability |
brown | 0.31 |
red | 0.33 |
yellow | 0.13 |
green | 0.13 |
blue | 0.03 |
orange | 0.07 |
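The probability distribution follows from the frequency distribution by dividing each count by the total number of M&Ms. A minimal sketch, using the counts from the frequency table (variable names are illustrative):

```python
# Counts taken from the frequency table.
counts = {"brown": 17, "red": 18, "yellow": 7, "green": 7, "blue": 2, "orange": 4}

total = sum(counts.values())  # 55 M&Ms in total

# Relative frequency of each color, used as its estimated probability.
probabilities = {color: n / total for color, n in counts.items()}

for color, p in probabilities.items():
    print(f"{color:<7} {p:.2f}")

print(f"sum     {sum(probabilities.values()):.2f}")
```

Rounded to two decimals, blue comes out as 0.04 (2/55 ≈ 0.036) rather than the 0.03 listed in the table; with 0.03 the rounded column adds up to exactly 1.00.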
Distributions of continuous variables
Time in ms |
568 |
577 |
581 |
640 |
641 |
645 |
657 |
673 |
696 |
703 |
720 |
728 |
729 |
777 |
808 |
824 |
825 |
865 |
875 |
1007 |
Interval | Frequency |
500-600 | 3 |
600-700 | 6 |
700-800 | 5 |
800-900 | 5 |
900-1000 | 0 |
1000-1100 | 1 |
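For a continuous variable, the frequency table is obtained by grouping the measurements into intervals. The sketch below bins the 20 times into intervals of width 100 ms. It assumes half-open intervals, so a value exactly on a boundary (for example 600) would fall into the upper interval; the slides do not say how boundary values are treated.

```python
from collections import Counter

# The 20 measured times in milliseconds, from the table.
times_ms = [568, 577, 581, 640, 641, 645, 657, 673, 696, 703,
            720, 728, 729, 777, 808, 824, 825, 865, 875, 1007]

width = 100  # interval width in ms

# Map every time to the lower bound of its interval and count per interval.
bins = Counter((t // width) * width for t in times_ms)

for low in range(500, 1100, width):
    # Counter returns 0 for intervals that contain no measurement.
    print(f"{low}-{low + width} | {bins[low]}")
```

The printed counts match the Interval / Frequency table, including the empty 900-1000 interval.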
Probability density
Linear transformations