DATA SCIENCE USING R
VIII SEMESTER
DS-427T
UNIT-4
Department of Computer Science and Engineering, BVCOE New Delhi. Subject: Data Science Using R, Instructor: Dr. Shyla
12/10/24
Obtaining Data
Data collection is the process of gathering information from relevant sources to answer a given statistical inquiry. It is the first step in a statistical investigation, and an essential one, because the collected data are what let us make informed decisions, spot trends, and measure progress.
Sampling Data
What are sampling methods?
Measuring Statistics
Normally, when one hears the term measurement, they may think of measuring the length of something (e.g., the length of a piece of wood) or a quantity of something (e.g., a cup of flour). This is a limited use of the term. In statistics, measurement is used more broadly and is more appropriately termed scales of measurement. Scales of measurement refer to the ways in which variables or numbers are defined and categorized. Each scale has certain properties, which in turn determine which statistical analyses are appropriate. The four scales of measurement are nominal, ordinal, interval, and ratio.
Nominal: Categorical data, and numbers used simply as identifiers or names, represent a nominal scale of measurement. Numbers on the back of a baseball jersey (St. Louis Cardinals 1 = Ozzie Smith) and your social security number are examples of nominal data. If I conduct a study that includes gender as a variable, I might code Female as 1 and Male as 2, or vice versa, when I enter the data into the computer. The numbers 1 and 2 then only label categories; they carry no order or quantity.
Ordinal: An ordinal scale of measurement represents an ordered series of relationships, or rank order. Individuals competing in a contest may be fortunate enough to achieve first, second, or third place. First, second, and third place represent ordinal data. If Roscoe takes first and Wilbur takes second, we do not know whether the competition was close; we only know that Roscoe outperformed Wilbur. Likert-type scales (such as "On a scale of 1 to 10, with one being no pain and ten being high pain, how much pain are you in today?") also represent ordinal data. Fundamentally, these scales do not represent a measurable quantity. One individual may respond 8 to this question and be in less pain than someone else who responded 5. A person who responds 4 is not necessarily in half as much pain as they would be if they responded 8. All we know from this data is that an individual who responds 6 is in less pain than if they had responded 8, and in more pain than if they had responded 4. Therefore, Likert-type scales only represent a rank ordering.
Interval: A scale that represents quantity and has equal units, but for which zero is simply another point of measurement, is an interval scale. The Fahrenheit scale is a clear example: 60 degrees Fahrenheit and -10 degrees Fahrenheit are both interval data. Elevation measured relative to sea level is another example of an interval scale. With each of these scales there is a direct, measurable quantity with equality of units. However, zero does not represent the absolute lowest value. Rather, it is a point on the scale with numbers both above and below it (for example, -10 degrees Fahrenheit).
Ratio: The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has equality of units. However, this scale also has an absolute zero (no values exist below zero). Very often, physical measures represent ratio data (for example, height and weight). If one is measuring the length of a piece of wood in centimeters, there is quantity, there are equal units, and the measure cannot go below zero centimeters. A negative length is not possible.
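One practical consequence of the interval/ratio distinction can be sketched in code: ratios of interval-scale values are not meaningful, because they change when the unit changes, while ratios of ratio-scale values survive a change of unit. The sketch below is only an illustration; the specific temperatures, lengths, and helper function names are assumptions made for the example and are not from the slides.

```python
import math

def fahrenheit_to_celsius(f):
    """Convert an interval-scale temperature between two units."""
    return (f - 32) * 5 / 9

def inches_to_cm(x):
    """Convert a ratio-scale length between two units."""
    return x * 2.54

# Interval scale (hypothetical temperatures): zero is arbitrary,
# so "60 F is twice as hot as 30 F" does not hold up in Celsius.
ratio_f = 60.0 / 30.0
ratio_c = fahrenheit_to_celsius(60.0) / fahrenheit_to_celsius(30.0)

# Ratio scale (hypothetical lengths): zero is absolute,
# so "twice as long" holds in any unit.
ratio_in = 20.0 / 10.0
ratio_cm = inches_to_cm(20.0) / inches_to_cm(10.0)

print("temperature ratio in F:", ratio_f)
print("temperature ratio in C:", round(ratio_c, 2))
print("length ratio in inches:", ratio_in)
print("length ratio in cm:", round(ratio_cm, 2))
```

The temperature ratio changes sign and size under the unit change, while the length ratio stays at 2. This is why statements such as "twice as much" are reserved for ratio data.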
The Empirical Rule
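The Empirical Rule formula (also called the 68-95-99.7 rule) for a normal distribution with mean $\mu$ and standard deviation $\sigma$ is as follows:
About 68% of the data fall within $\mu \pm \sigma$.
About 95% of the data fall within $\mu \pm 2\sigma$.
About 99.7% of the data fall within $\mu \pm 3\sigma$.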
This rule applies to data that follows a normal distribution, represented by a bell-shaped curve. It provides a guideline for understanding the distribution of data points around the mean. The Empirical Rule is useful for analyzing and interpreting data, helping to identify trends, outliers, and patterns in datasets.
For example, suppose we have a dataset of the heights of students in a class, and the data follow a normal distribution with a mean height of 65 inches and a standard deviation of 3 inches.
About 68% of students' heights fall within one standard deviation of the mean. Using the Empirical Rule, the range within one standard deviation of the mean is from 62 inches to 68 inches (65 ± 3). So approximately 68% of the students have heights between 62 inches and 68 inches.
About 95% of students' heights fall within two standard deviations of the mean. With a standard deviation of 3 inches, the range within two standard deviations of the mean is from 59 inches to 71 inches (65 ± 2 × 3). Therefore, approximately 95% of the students have heights between 59 inches and 71 inches.
About 99.7% of students' heights fall within three standard deviations of the mean. The range within three standard deviations of the mean is from 56 inches to 74 inches (65 ± 3 × 3). Hence almost all students, about 99.7%, have heights between 56 inches and 74 inches.
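The height example can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as the model of the normal distribution; the function name `empirical_ranges` is a hypothetical helper introduced for illustration. It computes the three ranges for a mean of 65 inches and a standard deviation of 3 inches, then the exact share of a normal distribution that falls inside each range.

```python
from statistics import NormalDist

def empirical_ranges(mean, sd):
    """Return {k: (low, high)} for k = 1, 2, 3 standard deviations."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# Values taken from the worked example: mean 65 inches, SD 3 inches.
heights = NormalDist(mu=65, sigma=3)
ranges = empirical_ranges(65, 3)

# Exact normal coverage of each range, for comparison with 68/95/99.7.
coverage = {
    k: heights.cdf(high) - heights.cdf(low)
    for k, (low, high) in ranges.items()
}

for k in (1, 2, 3):
    low, high = ranges[k]
    print(f"within {k} SD: {low} to {high} inches, "
          f"coverage {coverage[k]:.4f}")
```

The computed coverages are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is stated with "about" and "approximately".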
Point Estimates
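The point estimates of the population variance and standard deviation are simply the sample variance and sample standard deviation: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $s = \sqrt{s^2}$, where $\bar{x}$ is the sample mean and $n$ is the sample size.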
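As a rough illustration, the point estimates can be computed with Python's standard-library `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) denominator. The sample values below are hypothetical and chosen only to keep the arithmetic easy to follow; the manual calculation is included to show that it matches the library result.

```python
from statistics import mean, variance, stdev
import math

# Hypothetical sample of five student heights in inches (illustrative only).
sample = [62, 64, 65, 66, 68]
n = len(sample)

# Point estimate of the population mean: the sample mean.
x_bar = mean(sample)

# Point estimates of the population variance and standard deviation:
# the sample variance and sample standard deviation (n - 1 denominator).
s2 = variance(sample)
s = stdev(sample)

# The same sample variance computed by hand from its definition.
s2_manual = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

print("sample mean:", x_bar)
print("sample variance:", s2)
print("sample standard deviation:", round(s, 3))
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by n instead, which is appropriate only when the data are the entire population rather than a sample.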
Sampling Distributions
Types of Sampling Distributions
Confidence Intervals
Hypothesis Tests