1 of 26

Databases and Data Mining

Statistics: the basics

Dr. Jamolbek Mattiev

2 of 26

Outline

Basic definitions

Distributions

Probability

Patterns

3 of 26

About statistics …

Definition:

  1. Statistics is a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation. (from: Wikipedia)
  2. Statistics is a form of mathematical analysis that uses quantified models, representations and synopses for a given set of experimental data or real-life studies. (from: Investopedia)

Introduction to Machine Learning and Data Mining

Statistics: the basics

3

4 of 26

“Statistical” statements – examples

  • The most violent earthquake measured 9.2 on the Richter scale.
  • A murderer is 10 times more likely to be a man than a woman.
  • Every eighth South African is infected with HIV.
  • In the year 2020 there will be 15 people older than 64 for each newborn.

5 of 26

Thus, statistics …

  • … uses mathematical calculations,
  • … deals with numbers.

But it is also important

  • … how we choose those numbers,
  • … how we interpret the results of calculations.

Let's take a look at some examples 🡪

6 of 26

Example no. 1

“Statistical” finding/result:

Due to a new advertising campaign in May, the sales of ice cream XYZ went up 30% in the next 3 months.

The sales of ice cream in the summer months (June, July, August) go up regardless of the advertising.

“Historical effect” – attributing a result to one variable when in reality it depends on another variable, in our case time.

7 of 26

Example no. 2

“Statistical” finding/result:

The higher the number of churches in a city, the higher the crime rate. Hence: churches lead to crime.

Both the increase in the number of churches and the increase in the crime rate can be tied to the growth of a city's population – bigger city, more churches, more crime.

“Third variable effect” – we wrongly assume that there is a connection between two variables when in fact a third variable is affecting both of them.

8 of 26

Example no. 3

“Statistical” finding/result:

This year there are 75% more interracial marriages than 25 years ago.

What if interracial marriages made up 1% of all marriages 25 years ago and 1.75% this year (75% more)? Is this really such a drastic increase? And what about the fluctuations in the years in between?

Lack of data – we simply do not have enough data to draw sound conclusions.
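
The difference between a relative and an absolute change in this example can be checked with a few lines of Python. This is only a sketch: the 1% and 1.75% shares are the hypothetical figures from the question above, and the function names are illustrative.

```python
def relative_change(old, new):
    """Relative change from old to new, as a percentage of old."""
    return (new - old) / old * 100


def absolute_change(old, new):
    """Absolute change from old to new, in percentage points."""
    return new - old


# Hypothetical shares of interracial marriages (in percent) from the example.
share_then = 1.0
share_now = 1.75

print(f"relative change: {relative_change(share_then, share_now):.0f}%")
print(f"absolute change: {absolute_change(share_then, share_now):.2f} percentage points")
```

The same data yields "75% more" and "0.75 percentage points more", which read very differently.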

9 of 26

Why is it important to know statistics?

  • We hear “statistical” statements, similar to those on the previous slides, every day
    • We can believe some of them
    • But most of them can be deceiving

  • Knowing statistics enables us to tell truth from deception

  • Statistics is an introduction to Data Mining

10 of 26

Basic terminology and definitions

  • Descriptive statistics
  • Inferential statistics
    • sampling
  • Variables/attributes
  • Percentiles
  • Measuring
    • How to choose a measure?
    • Data collection basics
  • (probabilistic) Distributions
  • Linear transformations

11 of 26

Descriptive statistics

  • Describe the data at hand
  • Do not “make conclusions” based on this data

  • Example – table representing the average annual income of people in the US by occupation for the year 1999:

    Occupation                    Average annual income
    pediatricians                 $112,760
    dentists                      $106,130
    podiatrists                   $100,090
    physicists                    $76,140
    architects                    $53,410
    psychologists                 $49,720
    hostesses                     $47,910
    elementary school teachers    $39,560
    policemen                     $38,710
    florists                      $18,980

  • Descriptive statistic:

Interesting: Americans pay more to the people who take care of their teeth and feet than to those who protect and educate their children.

(Is Slovenia different?)


12 of 26

Inferential statistics

  • From the properties of a sample we try to draw conclusions about the whole population

    • How to choose a “good” / random sample?

    • What is a sample's bias?

13 of 26

How to choose a sample? sampling

Rule:

The sample has to be representative, i.e. it has to reflect the properties of the population. Beware of the sample size!

  • Types of sampling:
    • (simple) random sampling
    • advanced sampling methods:
      • random assignment
      • stratified sampling

sample bias

14 of 26

Sampling – examples (1)

  • Random sampling:
    • each individual from the population has to have the same probability of being chosen (in the sample)
    • The selection of one individual must not affect the selection of the others = independence

Example:

From the Slovenian population aged 19 to 35, we survey individuals whose last name begins with the letter “Z”, but only every hundredth such person.

What is the problem?

15 of 26

Sampling – examples (2)

  • The size of a sample:
    • Small samples are often non-representative = they do not represent the properties of the entire population

Example:

We infer the probabilities of a fair coin toss coming up heads or tails from tossing such a coin 10 times.

What is the problem?
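
A small simulation can illustrate why 10 tosses are too few. The sketch below uses Python's standard `random` module with an arbitrarily chosen seed; the function name and the number of repetitions are illustrative assumptions, not part of the original example.

```python
import random


def estimate_heads_probability(num_tosses, rng):
    """Estimate P(heads) as the fraction of heads in num_tosses fair tosses."""
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses


rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability

# Repeat the 10-toss experiment several times: each run gives its own estimate,
# and with so few tosses the estimates can differ noticeably from run to run.
small_estimates = [estimate_heads_probability(10, rng) for _ in range(5)]
print("10 tosses each:", small_estimates)

# A much larger sample typically gives an estimate close to the true value 0.5.
large_estimate = estimate_heads_probability(100_000, rng)
print("100000 tosses: ", round(large_estimate, 3))
```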

16 of 26

Sampling – examples (3)

  • Random assignment:
    • there is no actual population; we deal with a hypothetical population
    • the sample from this hypothetical population is randomly split in 2 or more groups = the individuals from the sample get randomly assigned to groups

Example:

When testing the effect of a drug, we split a sample of people into 2 groups. To one group (the controls) we give the placebo, to the other the actual drug. We then observe whether there are differences between the two groups.

What could be the problem?
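
Random assignment itself can be sketched as a shuffle followed by a split. The participant names, the seed and the equal group sizes below are hypothetical choices made only for illustration.

```python
import random


def randomly_assign(participants, rng):
    """Shuffle the sample and split it into a control and a treatment group."""
    shuffled = list(participants)  # copy, so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


participants = [f"person_{i}" for i in range(1, 21)]  # hypothetical sample
rng = random.Random(42)  # arbitrary seed for repeatability
control, treatment = randomly_assign(participants, rng)

print("control (placebo):", control)
print("treatment (drug): ", treatment)
```

Because chance alone decides the group, neither the experimenter nor the participants can systematically influence who receives the drug.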

17 of 26

Sampling – examples (4)

  • Stratified sampling:
    • We sample in layers (stratum = layer) based on some property of the population

Example:

There are 1000 balls in the basket (population), 70% are red, 20% are green and 10% are blue. The property used for stratification is thus the color of the balls.

How to sample this population to get a representative sample?
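
One possible answer is proportional allocation: draw from each color in proportion to its share of the population. The sketch below assumes a sample size of 100 (not given in the example) and uses an illustrative function name.

```python
import random

# Population from the example: 1000 balls, 70% red, 20% green, 10% blue.
population = ["red"] * 700 + ["green"] * 200 + ["blue"] * 100


def stratified_sample(population, sample_size, rng):
    """Draw from each stratum (color) in proportion to its population share."""
    strata = {}
    for ball in population:
        strata.setdefault(ball, []).append(ball)
    sample = []
    for members in strata.values():
        # Proportional allocation. The rounding is exact for these numbers;
        # in general the rounded quotas may not add up to sample_size.
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample


rng = random.Random(1)  # arbitrary seed
sample = stratified_sample(population, 100, rng)  # assumed sample size: 100
counts = {color: sample.count(color) for color in ("red", "green", "blue")}
print(counts)  # {'red': 70, 'green': 20, 'blue': 10}
```

A simple random sample of 100 balls would match these proportions only on average; the stratified sample matches them by construction.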

18 of 26

Variables / attributes

  • Also: properties, attributes, classes, …

  • They can be:
    • independent, dependent
    • qualitative, quantitative
    • discrete, continuous

  • More – a bit later in "measuring things"

19 of 26

Percentiles

  • What is a percentile? – example:

Say you took a test of motor skills and scored 35 points out of a total of 50. What does this tell you about your motor skills? How do they compare to those of the other participants in the test?

A more informative indicator would be: “what percentage of people are less capable (in motor skills) than me?” 🡪 this percentage is called a percentile.

If your score is in the 65th percentile, this means that 65% of all people taking the test scored worse than you. In your case the 65th percentile = 35.

20 of 26

3 definitions of a percentile

Definition 1:

The Nth percentile is the lowest value that is strictly greater than N% of all values.

Definition 2:

The Nth percentile is the lowest value that is greater than or equal to N% of all values.

Definition 3:

A weighted average of the percentiles from the first two definitions (the most accurate definition that we are going to use)

21 of 26

Percentile definitions – example

    Rank     1    2    3    4    5    6    7    8
    Value    3    5    7    8    9   11   13   15

25th percentile:

    Definition 1    7
    Definition 2    5
    Definition 3    5.5
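
The three results can be reproduced with a short Python sketch. Definitions 1 and 2 follow the wording of the slides directly. For Definition 3 the slides only say "weighted average", so the code assumes one common convention: compute the rank R = N/100 × (n + 1) for n values and interpolate linearly between the values at the neighbouring integer ranks. All function names are illustrative.

```python
def percentile_def1(values, pct):
    """Definition 1: lowest value that is greater than pct% of all values."""
    data = sorted(values)
    n = len(data)
    for v in data:
        strictly_below = sum(x < v for x in data)
        if strictly_below * 100 >= pct * n:
            return v
    return data[-1]


def percentile_def2(values, pct):
    """Definition 2: lowest value greater than or equal to pct% of all values."""
    data = sorted(values)
    n = len(data)
    for v in data:
        at_or_below = sum(x <= v for x in data)
        if at_or_below * 100 >= pct * n:
            return v
    return data[-1]


def percentile_def3(values, pct):
    """Definition 3 (assumed form): interpolate at rank pct/100 * (n + 1)."""
    data = sorted(values)
    n = len(data)
    rank = pct / 100 * (n + 1)
    whole = int(rank)    # integer part of the rank
    frac = rank - whole  # fractional part, used as the interpolation weight
    if whole < 1:
        return data[0]
    if whole >= n:
        return data[-1]
    return data[whole - 1] + frac * (data[whole] - data[whole - 1])


values = [3, 5, 7, 8, 9, 11, 13, 15]
print(percentile_def1(values, 25))  # 7
print(percentile_def2(values, 25))  # 5
print(percentile_def3(values, 25))  # 5.5
```

For the 25th percentile the rank is 0.25 × 9 = 2.25, so Definition 3 lies a quarter of the way from the value at rank 2 (5) to the value at rank 3 (7).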

22 of 26

How do we measure things?

  • In science, data often come from measurements
  • How can we measure?
    • Nominal (descriptive) values
    • Ordinal (ordered) values
    • Interval values
    • Ratio values
  • Transformations between different types

= basis of data collection / errors

[Diagram labels: discrete, continuous, information]

23 of 26

Distributions of discrete variables

Frequency distribution:

    color     no. of M&Ms
    brown     17
    red       18
    yellow    7
    green     7
    blue      2
    orange    4

Probability distribution:

    color     probability
    brown     0.31
    red       0.33
    yellow    0.13
    green     0.13
    blue      0.03
    orange    0.07
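
A probability distribution of this kind can be obtained from the frequency distribution by dividing each count by the total number of M&Ms. The sketch below assumes that this is how the probabilities were derived; the variable names are illustrative.

```python
# Frequency distribution from the table: color -> number of M&Ms.
counts = {"brown": 17, "red": 18, "yellow": 7, "green": 7, "blue": 2, "orange": 4}

total = sum(counts.values())  # 55 M&Ms in total

# Relative frequency of each color, used as an estimate of its probability.
probabilities = {color: count / total for color, count in counts.items()}

for color, p in probabilities.items():
    print(f"{color:<7} {p:.3f}")

# The unrounded probabilities sum to 1 (up to floating-point error); values
# rounded to two decimals need not add up to exactly 1.
print(sum(probabilities.values()))
```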

24 of 26

Distributions of continuous variables

  • Grouped frequency distribution
    • graphic 🡪 histogram

Time in ms:

568, 577, 581, 640, 641, 645, 657, 673, 696, 703, 720, 728, 729, 777, 808, 824, 825, 865, 875, 1007

    Interval     Frequency
    500-600      3
    600-700      6
    700-800      5
    800-900      5
    900-1000     0
    1000-1100    1
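
The grouped frequency table can be rebuilt from the raw times by counting how many values fall into each 100 ms interval. The sketch assumes half-open intervals (a value exactly on a boundary would go to the upper interval); the slides do not say how boundaries are handled, and no value here lies on one.

```python
from collections import Counter

times_ms = [568, 577, 581, 640, 641, 645, 657, 673, 696, 703,
            720, 728, 729, 777, 808, 824, 825, 865, 875, 1007]


def grouped_frequencies(values, width, start, stop):
    """Count the values in half-open intervals [low, low + width)."""
    bin_counts = Counter((v - start) // width for v in values)
    return {(low, low + width): bin_counts[(low - start) // width]
            for low in range(start, stop, width)}


table = grouped_frequencies(times_ms, width=100, start=500, stop=1100)
for (low, high), freq in table.items():
    print(f"{low}-{high}: {freq}")
```

Plotting these frequencies as bars over the intervals gives the histogram.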

25 of 26

Probability density

26 of 26

Linear transformations

  • Transformation = to change/transform
  • Linear = using only multiplication by a constant and/or addition of a constant
    • if “original” and transformed values are depicted as a scatter plot, we “observe” a linear function.
  • Examples:
    • Transformation of inches into centimeters (x 2.54)
    • Transformation from ºC into ºF (x 9/5 + 32)
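
Both examples have the form y = a·x + b. A minimal Python sketch, with illustrative function names: converting inches to centimeters uses a = 2.54 and b = 0, and multiplying a Celsius temperature by 9/5 and adding 32 gives the temperature in Fahrenheit.

```python
def linear_transform(x, multiplier, offset=0.0):
    """Apply the linear transformation y = multiplier * x + offset."""
    return multiplier * x + offset


def inches_to_cm(inches):
    """Inches to centimeters: multiply by 2.54, add nothing."""
    return linear_transform(inches, 2.54)


def celsius_to_fahrenheit(celsius):
    """Celsius to Fahrenheit: multiply by 9/5, then add 32."""
    return linear_transform(celsius, 9 / 5, 32)


print(inches_to_cm(10))
print(celsius_to_fahrenheit(0))
print(celsius_to_fahrenheit(100))
```

The reverse conversion, from Fahrenheit to Celsius, is also linear: y = (x − 32) × 5/9 = 5/9·x − 160/9.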
