
DATA SCIENCE USING R

VIII SEMESTER

DS-427T

UNIT-4

Department of Computer Science and Engineering, BVCOE New Delhi Subject: Data Science Using R , Instructor: Dr.Shyla

1

12/10/24


Obtaining Data

Data collection is the process of gathering information from relevant sources to answer a given statistical question. It is the first and most important step in a statistical investigation: it is what allows us to make informed decisions, spot trends, and measure progress.



Sampling Data

What are sampling methods?

  • Sampling methods are ways to select a sample of data from a given population (every individual in the whole group).
  • It is unrealistic to collect data from the entire population because it:
      • is too big
      • takes too much time
      • costs too much money
  • We therefore take an appropriately sized sample to represent the population.
  • Depending on the situation, some sampling methods will be more suitable than others. Whichever sampling method is used, it is important to justify why that method was chosen.
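In R, drawing a sample from a population is typically done with the built-in sample() function. A minimal sketch, assuming a hypothetical population of 10,000 numbered individuals:

```r
# Hypothetical population: ID numbers of 10,000 individuals
population <- 1:10000

set.seed(42)                           # make the draw reproducible
srs <- sample(population, size = 100)  # simple random sample, without replacement

length(srs)   # sample size: 100

# Systematic sampling sketch: every 100th individual after a random start
start <- sample(1:100, 1)
systematic <- population[seq(from = start, to = length(population), by = 100)]
length(systematic)   # 100
```

Other designs (stratified, cluster) build on the same function, applied within groups.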




Measuring Statistics

Normally, when one hears the term measurement, they may think in terms of measuring the length of something (e.g., the length of a piece of wood) or measuring a quantity of something (e.g., a cup of flour). This represents a limited use of the term measurement. In statistics, the term measurement is used more broadly and is more appropriately termed scales of measurement. Scales of measurement refer to ways in which variables/numbers are defined and categorized. Each scale of measurement has certain properties which in turn determine the appropriateness of certain statistical analyses. The four scales of measurement are nominal, ordinal, interval, and ratio.

Nominal: Categorical data and numbers that are simply used as identifiers or names represent a nominal scale of measurement. Numbers on the back of a baseball jersey (St. Louis Cardinals 1 = Ozzie Smith) and your social security number are examples of nominal data. If I conduct a study and I'm including gender as a variable, I will code Female as 1 and Male as 2, or vice versa, when I enter my data into the computer. Thus, I am using the numbers 1 and 2 to represent categories of data.

Ordinal: An ordinal scale of measurement represents an ordered series of relationships or rank order. Individuals competing in a contest may achieve first, second, or third place. First, second, and third place represent ordinal data. If Roscoe takes first and Wilbur takes second, we do not know whether the competition was close; we only know that Roscoe outperformed Wilbur. Likert-type scales (such as "On a scale of 1 to 10, with one being no pain and ten being high pain, how much pain are you in today?") also represent ordinal data. Fundamentally, these scales do not represent a measurable quantity. An individual may respond 8 to this question and be in less pain than someone else who responded 5. A person who responds 4 is not necessarily in half as much pain as one who responds 8. All we know from this data is that an individual who responds 6 is in less pain than if they had responded 8 and in more pain than if they had responded 4. Therefore, Likert-type scales only represent a rank ordering.

Interval: A scale which represents quantity and has equal units, but for which zero represents simply an additional point of measurement, is an interval scale. The Fahrenheit scale is a clear example of the interval scale of measurement. Thus, 60 degrees Fahrenheit or -10 degrees Fahrenheit are interval data. Measurement of sea level is another example of an interval scale. With each of these scales there is a direct, measurable quantity with equality of units. In addition, zero does not represent the absolute lowest value. Rather, it is a point on the scale with numbers both above and below it (for example, -10 degrees Fahrenheit).

Ratio: The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has equality of units. However, this scale also has an absolute zero (no numbers exist below zero). Very often, physical measures will represent ratio data (for example, height and weight). If one is measuring the length of a piece of wood in centimeters, there is quantity, equal units, and that measure cannot go below zero centimeters. A negative length is not possible.
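A short sketch of how these scales map onto R data types, using hypothetical values that follow the examples above (gender coding, contest placings, Fahrenheit temperatures, lengths):

```r
# Nominal: unordered categories; 1 = Female, 2 = Male, as coded in the text
gender <- factor(c(1, 2, 2, 1), levels = c(1, 2), labels = c("Female", "Male"))
is.ordered(gender)      # FALSE: no rank order among the labels

# Ordinal: ordered categories, e.g. contest placings
place <- factor(c("first", "second", "third"),
                levels = c("third", "second", "first"), ordered = TRUE)
place[1] > place[2]     # TRUE: first outranks second

# Interval and ratio data are both stored as plain numeric vectors;
# whether zero is absolute is a matter of interpretation, not storage.
temps_f <- c(-10, 32, 60)   # interval: values below zero are meaningful
lengths <- c(2.0, 3.5, 10)  # ratio: lengths cannot go below zero
```

The practical payoff: R will refuse comparisons like `gender[1] > gender[2]` on an unordered factor but allow them on an ordered one, mirroring which operations each scale supports.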




The Empirical Rule

  • The Empirical Rule, also known as the 68-95-99.7 Rule, is a statistical guideline describing the distribution of data in a normal distribution: in a bell-shaped curve, approximately 68% of the data falls within one standard deviation of the mean, about 95% within two standard deviations, and nearly 99.7% within three standard deviations.
  • The rule provides a quick way to understand the spread of data and is used in many fields for analyzing and interpreting distributions.

Empirical Rule Formula is as follows:

  • One Standard Deviation (µ ± σ): about 68% of the data
  • Two Standard Deviations (µ ± 2σ): about 95% of the data
  • Three Standard Deviations (µ ± 3σ): about 99.7% of the data
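The three percentages can be checked in R with pnorm(), the cumulative distribution function of the normal distribution:

```r
# P(mu - k*sigma < X < mu + k*sigma) for a normal random variable X
coverage <- function(k) pnorm(k) - pnorm(-k)

round(coverage(1), 4)  # 0.6827
round(coverage(2), 4)  # 0.9545
round(coverage(3), 4)  # 0.9973
```

Note the exact values (68.27%, 95.45%, 99.73%) are slightly different from the rounded 68-95-99.7 figures the rule is named after.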


This rule applies to data that follows a normal distribution, represented by a bell-shaped curve. It provides a guideline for understanding the distribution of data points around the mean. The Empirical Rule is useful for analyzing and interpreting data, helping to identify trends, outliers, and patterns in datasets.

For example: Suppose we have a dataset representing the heights of students in a class, and the data follow a normal distribution with a mean height of 65 inches and a standard deviation of 3 inches.

About 68% of students' heights fall within one standard deviation of the mean. Using Empirical Rule, we can calculate that the range of heights within one standard deviation of mean is from 62 inches to 68 inches (65 ± 3). So, approximately 68% of the students have heights between 62 inches and 68 inches.

Approximately 95% of students' heights fall within two standard deviations of the mean. With a standard deviation of 3 inches, the range within two standard deviations of the mean is from 59 inches to 71 inches (65 ± 2 × 3). Therefore, nearly 95% of the students have heights between 59 inches and 71 inches.

Nearly 99.7% of students' heights fall within three standard deviations of the mean. The range within three standard deviations of the mean is from 56 inches to 74 inches (65 ± 3 × 3). Hence, almost all students, about 99.7%, have heights between 56 inches and 74 inches.
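The three ranges from the height example can be computed directly:

```r
# Height example: mean 65 inches, standard deviation 3 inches
mu    <- 65
sigma <- 3
k     <- 1:3

lower <- mu - k * sigma   # 62 59 56
upper <- mu + k * sigma   # 68 71 74

data.frame(k, lower, upper, share = c("~68%", "~95%", "~99.7%"))
```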



Point Estimates

  • In inferential statistics, we take our sample data and we calculate our sample statistics.
  • We can then use those sample statistics to estimate the population parameter, which is often what we are really trying to understand.
  • These sample statistics are used within the concept of an estimate, of which there are two types: point estimates and interval estimates.
  • A point estimate is a type of estimation that uses a single value, often a sample statistic, to infer information about the population parameter as a single value or point.
  • An interval estimate is a type of estimation that uses a range (or interval) of values, based on sampling information, to “capture” or “cover” the true population parameter being inferred.


  • Point Estimate for the Population Mean
  • So let’s say we’ve recently purchased 5,000 widgets to be consumed in our next manufacturing order, and we require that the average length of the 5,000 widgets is 2 inches.
  • Instead of measuring all 5,000 units, which would be extremely time consuming and costly, and in other cases possibly destructive, we can take a sample from that population and measure the average length of the sample.
  • As you know, the sample mean can be calculated by simply summing up the individual values and dividing by the number of samples measured.

The point estimate of the population variance and standard deviation is simply the sample variance and sample standard deviation.
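A minimal sketch of the widget example in R, assuming a hypothetical sample of 30 measured lengths (simulated here, since no real measurements are given):

```r
set.seed(1)
# Hypothetical sample of 30 widget lengths, in inches
widget_lengths <- rnorm(30, mean = 2, sd = 0.05)

x_bar <- mean(widget_lengths)  # point estimate of the population mean
s2    <- var(widget_lengths)   # point estimate of the population variance
s     <- sd(widget_lengths)    # point estimate of the population std. deviation
```

mean(), var(), and sd() are base R; var() and sd() use the n − 1 denominator, which is what makes the sample variance an unbiased estimator of the population variance.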



Sampling Distributions

  • A sampling distribution is the probability distribution of a sample statistic, such as a sample mean (x̄) or a sample sum (Σx).
  • Here’s a quick example:
  • Imagine trying to estimate the mean income of commuters who take the New Jersey Transit rail system into New York City. More than one hundred thousand commuters take these trains each day, so there’s no way you can survey every rider. Instead, you draw a random sample of 80 commuters from this population and ask each person in the sample what their household income is. You find that the mean household income for the sample is x̄₁ = $92,382. This figure is a sample statistic. It’s a number that summarizes your sample data, and you can use it to estimate the population parameter. In this case, the population parameter you are interested in is the mean income of all commuters who use the New Jersey Transit rail system to get to New York City.
  • Now that you’ve drawn one sample, say you draw 99 more. You now have 100 random samples of sample size n = 80, and for each sample, you can calculate a sample mean. We’ll denote these means as x̄₁, x̄₂, …, x̄₁₀₀, where the subscript indicates the sample for which the mean was calculated. The value of these means will vary. For the first sample, we found a mean income of $92,382, but in another sample, the mean may be higher or lower depending on who gets sampled. In this way, the sample statistic x̄ becomes its own random variable with its own probability distribution. Tallying the values of the sample means and plotting them on a relative frequency histogram gives you the sampling distribution of x̄ (the sampling distribution of the sample mean).
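The commuter example can be sketched as a simulation in R. The population parameters below are assumptions (a right-skewed log-normal income distribution), since the true population is unknown:

```r
set.seed(7)
# Hypothetical income population: right-skewed, like real incomes
population <- rlnorm(100000, meanlog = 11.4, sdlog = 0.4)

# Draw 100 random samples of size n = 80 and record each sample mean
x_bars <- replicate(100, mean(sample(population, size = 80)))

length(x_bars)   # 100 sample means, one per sample
hist(x_bars, main = "Sampling distribution of the sample mean")
```

Each run of sample() plays the role of one survey of 80 commuters; the histogram of the 100 means is the (approximate) sampling distribution.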




Types of Sampling Distributions

  • Sampling distributions can be constructed for any random-sample-based statistic, so there are many types of sampling distributions. We’ll end this section by briefly exploring the characteristics of two of the most commonly used sampling distributions: the sampling distribution of sample means and the sampling distribution of sample sums. Both of these sampling distributions approach a normal distribution with a particular mean and standard deviation. The standard deviation of a sampling distribution is called the standard error.
  • If the central limit theorem holds, the sampling distribution of sample means will approach a normal distribution with a mean equal to the population mean, μ, and a standard error equal to the population standard deviation divided by the square root of the sample size, σ/√n.
  • The fact that the distribution of sample means is centered around the population mean is an important one. It means that the expectation of a sample mean is the true population mean, μ, and, using the empirical rule, we can assert that if large enough samples of size n are drawn with replacement, 99.7% of the sample means will fall within 3 standard errors of the population mean. Lastly, the sampling distribution of means allows you to use z-transformations to make probability statements about the likelihood that a sample mean, x̄, calculated from a sample of size n, will fall between, above, or below some value(s).
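The two claims above (mean of the sampling distribution ≈ μ, standard error ≈ σ/√n) can be verified by simulation. This sketch assumes a hypothetical normal population with μ = 50 and σ = 10:

```r
set.seed(123)
mu    <- 50
sigma <- 10
n     <- 64

# 10,000 sample means, each from a sample of size n = 64
x_bars <- replicate(10000, mean(rnorm(n, mu, sigma)))

mean(x_bars)   # close to mu = 50
sd(x_bars)     # close to sigma / sqrt(n) = 10 / 8 = 1.25
```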



Confidence Intervals

  • The confidence interval formula is used in statistics for describing the amount of uncertainty associated with a sample estimate of a population parameter. It is used to describe the uncertainty associated with a sampling method. 
  • To recall, the confidence interval is a range within which most possible values would occur. To calculate the confidence interval, one needs to set the confidence level as 90%, 95%, or 99%, etc. A 90% confidence level means that we would expect 90% of the interval estimates to include the population parameter; 95% of the intervals would include the parameter and so on. 
  • A confidence interval gives a range within which the true value of the parameter is expected to lie, with a stated probability. The confidence level (in percentage) is selected by the investigator. The higher the confidence level, the wider the confidence interval (and the less precise the estimate). Before learning the confidence interval, one must understand the basic statistics formulas and the z-score formula. The formula for the confidence interval is given below:
  • Confidence Interval Formulas:
  • If n ≥ 30, Confidence Interval = x̄ ± zc(σ/√n)
  • If n<30, Confidence Interval = x̄ ± tc(S/√n)
  • Where,
  • n = Number of terms
  • x̄ = Sample Mean
  • σ = Population Standard Deviation
  • S = Sample Standard Deviation
  • zc = Value corresponding to the confidence level in the z table
  • tc = Value corresponding to the confidence level in the t table (with n − 1 degrees of freedom)
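Both formulas applied in R to a hypothetical sample of n = 50 values (so the z-interval applies; the sample standard deviation S stands in for σ, which is rarely known in practice):

```r
set.seed(5)
x <- rnorm(50, mean = 100, sd = 15)   # hypothetical sample, n = 50

n     <- length(x)
x_bar <- mean(x)
z_c   <- qnorm(0.975)                 # 95% confidence: z_c ≈ 1.96

# n >= 30: z-interval, x̄ ± zc·(S/√n)
ci_z <- x_bar + c(-1, 1) * z_c * sd(x) / sqrt(n)

# For small n, t.test() returns the t-interval x̄ ± tc·(S/√n) directly
ci_t <- t.test(x, conf.level = 0.95)$conf.int
```

Because tc > zc for any finite sample size, the t-interval is always slightly wider than the z-interval built from the same data.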



Hypothesis Tests

  • Hypothesis testing is a structured method used to determine if the findings of a study provide evidence to support a specific theory relevant to a larger population.
  • Hypothesis Testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to estimate the relationship between two statistical variables.
  • Let's discuss a few examples of statistical hypotheses from real life:
  • A teacher assumes that 60% of his college's students come from lower-middle-class families.
  • A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.
  • Statistical analysts validate assumptions by collecting and evaluating a representative sample from the data set under study. 
  • The process of hypothesis testing involves four key steps: defining the hypotheses, developing a plan for analysis, examining the sample data, and interpreting the final results.


  • The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no bearing on the study's outcome unless it is rejected.
  • H0 is the symbol for it, and it is pronounced H-naught.
  • The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.
  • Let's understand this with an example.
  • A sanitizer manufacturer claims that its product kills 95 percent of germs on average. 
  • To put this company's claim to the test, create a null and alternate hypothesis.
  • H0 (Null Hypothesis): Average = 95%.
  • Alternative Hypothesis (H1): The average is less than 95%.
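The sanitizer example can be sketched as a one-sample, one-sided t-test in R. The kill-rate measurements below are hypothetical (simulated with an assumed true mean below the claim), since no real data is given:

```r
set.seed(9)
# Hypothetical kill rates (%) measured on 25 product samples
kill_rate <- rnorm(25, mean = 94, sd = 2)

# H0: mean = 95   vs   H1: mean < 95
result <- t.test(kill_rate, mu = 95, alternative = "less")

result$p.value   # reject H0 at alpha = 0.05 when p < 0.05
```

alternative = "less" encodes the one-sided alternative H1; with the default "two.sided" the test would instead ask whether the mean differs from 95% in either direction.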




THANKS….
