Descriptive Statistics
Atul Nag
Associate Professor
KISS-DU
Overview of Using Data: Definitions and Goals
Overview of Using Data: Definitions and Goals
Overview of Using Data: Definitions and Goals
Variables
Observations
Overview of Using Data: Definitions and Goals
Overview of Using Data: Definitions and Goals
Types of Data
Population and Sample Data
It is very important to collect sample data that are representative of the population data so that generalizations can be made from them.
Data can be categorized in several ways based on how they are collected, and the type collected.
Not feasible to collect data from the population of all elements of interest.
we collect data from a subset of the population known as a sample. E.g. Dow
Population and Sample Data
In most cases (although not true of the Dow), a representative sample can be gathered by random sampling from the population data.
Dealing with populations and samples can introduce subtle differences in how we calculate and interpret summary statistics.
In almost all practical applications of business analytics, we will be dealing with sample data.
Quantitative and Categorical Data
Data are considered quantitative data if numeric and arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed on them. E.g. sum the values for Volume in the Dow data
If arithmetic operations cannot be performed on the data, they are considered categorical data.
Cross-Sectional and Time Series Data
Cross-sectional data are collected from several entities at the same, or approximately the same, point in time.
Time series data are collected over several time periods.
Cross-Sectional and Time Series Data
Sources of Data
Experimental Study
Experimental Study
Experimental Study
The dosage level of the new drug is controlled by giving different dosages to the different groups of individuals.
Before and after the study, data on blood pressure are collected for each group.
Statistical analysis of these experimental data can help determine how the new drug affects blood pressure.
Nonexperimental, or observational
Nonexperimental, or observational, studies make no attempt to control the variables of interest.
A survey is perhaps the most common type of observational study.
Nonexperimental, or observational
Sources of Data
Sources of Data
Anyone who wants to use data and statistical analysis to aid in decision making must be aware of the time and cost required to obtain the data.
The use of existing data sources is desirable when data must be obtained in a relatively short period of time.
Sources of Data
If important data are not readily available from a reliable existing source, the additional time and cost involved in obtaining the data must be considered.
1
In all cases, the decision-maker should consider the potential contribution of the statistical analysis to the decision-making process.
2
The cost of data acquisition and the subsequent statistical analysis should not exceed the savings generated by using the information to make a better decision
3
Generating Data
Generating Data
Government agencies are another important source of existing data.
For instance, the website data.gov.in was launched by the Indian government in 2012 to make it easier for the public to access data.
The data.gov.in website includes over 150,000 data sets from a variety of Indian departments and agencies, but many other federal agencies maintain their own websites and data repositories.
Many state and local governments are also now providing data sets online. For example, the state of Odisha maintains open data portals at odisha.data.gov.in.
In general, the Internet is an important source of data and statistical information. One can obtain access to stock quotes, meal prices at restaurants, salary data, and a wide array of other information simply by performing an Internet search.
Modifying Data in Excel
Sorting and Filtering Data in Excel
Per cent Change
Sort on 2018 sales
Sort on 2018 sales
Filter
Conditional Formatting of Data in Excel
Conditional formatting in Excel can make it easy to identify data that satisfy certain conditions in a data set.
For instance, suppose that we wanted to quickly identify the automobile models for which sales had decreased from February 2018 to February 2019. We can quickly highlight these models:
Conditional Formatting of Data in Excel
Step 1. Starting with the original data, select cells F1:F21
Step 2. Click the Home tab in the Ribbon
Step 3. Click Conditional Formatting in the Styles group
Step 4. Select Highlight Cells Rules, and click Less Than . . . from the dropdown menu
Step 5. Enter 0% in the Format cells that are LESS THAN: box
Step 6. Click OK
Conditional Formatting of Data in Excel
Conditional Formatting of Data in Excel
Conditional Formatting of Data in Excel
Data bars are essentially a bar chart input into the cells that shows the magnitude of the cell values.
The widths of the bars in this display are comparable to the values of the variable for which the bars have been drawn; a value of 20 creates a bar twice as wide as that for a value of 10.
Again, we can easily see which models had decreasing sales, but Data
Bars also provide us with a visual representation of the magnitude of the change in sales.
Creating Distributions from Data
Creating Distributions from Data
Frequency Distributions for Categorical Data
Frequency Distributions for Categorical Data
Frequency Distributions for Categorical Data
Frequency Distributions for Categorical Data
Relative Frequency and Percent Frequency Distributions
Relative Frequency and Percent Frequency Distributions
Relative Frequency and Percent Frequency Distributions
Frequency Distributions for Quantitative Data
The three steps necessary to define the classes for a frequency distribution with quantitative data are as follows:
Determine the number of nonoverlapping bins.
Determine the width of each bin.
Determine the bin limits.
We can also create frequency distributions for quantitative data, but we must be more careful in defining the nonoverlapping bins to be used in the frequency distribution.
Frequency Distributions for Quantitative Data
Frequency Distributions for Quantitative Data
Number of Bins
Bins are formed by specifying the ranges used to group the data.
As a general guideline, use from 5 to 20 bins.
Width of the Bins
As a general guideline, we recommend that the width be the same for each bin.
Frequency Distributions for Quantitative Data
Frequency Distributions for Quantitative Data in Excel
Frequency Distributions for Quantitative Data in Excel
Histograms
Histograms
Histograms
Data Analysis Toolpak
Histograms
Histograms
We have also removed the gaps between the columns in the histogram in Excel to match the traditional format of histograms.
To remove the gaps between the columns in the histogram created by Excel, follow these steps:
Histograms
Skewness
One of the most important uses of a histogram is to provide information about the shape, or form, of a distribution.
Skewness, or the lack of symmetry, is an important characteristic of the shape of a distribution.
Skewness
Skewness
Panel A shows the histogram for a set of data moderately skewed to the left. A histogram is said to be skewed to the left if its tail extends farther to the left than to the right.
Panel B shows the histogram for a set of data moderately skewed to the right. A histogram is said to be skewed to the right if its tail extends farther to the right than to the left.
Skewness
Panel C shows a symmetric histogram, in which the left tail mirrors the shape of the right tail.
Panel D shows a histogram highly skewed to the right.
Cumulative Distributions
A variation of the frequency distribution that provides another tabular summary of quantitative data is the cumulative frequency distribution, which uses widths, and class limits developed for the frequency
However, rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class..
Cumulative Distribution
Cumulative Distribution
cumulative relative frequencies in column 3 by dividing the cumulative frequencies in column 2 by the total number of items ( n = 20). The cumulative percent frequencies were again computed by multiplying the relative frequencies by 100.
Measures of Location
Mean (Arithmetic Mean)
Data
Mean (Arithmetic Mean)
Mean (Arithmetic Mean)
Median
Median
Mean vs Median
Mode
Mode
If data contain at least two modes, we say that they are multimodal.
A special case of multimodal data occurs when the data contain exactly two modes; in such cases, we say that the data are bimodal.
In multimodal cases when there are more than two modes, the mode is almost never reported because listing three or more modes is not particularly helpful in describing a location for the data.
Also, if no value in the data occurs more than once, we say the data have no mode.
Mode
The Excel MODE.SNGL function will return only a single most-often-occurring value.
For multimodal distributions, we must use the MODE.MULT command in Excel to return more than one mode.
For example, two selling prices occur twice in HomeSales Data: $138,000 and $254,000. Hence, these data are bimodal.
To find both of the modes in Excel, we take these steps:
Measures of variability
Dataset
Variability of Dataset
Range
Variance
Variance
Variance
Variance
Standard Deviation
Coefficient of Variation
Coefficient of Variation
Coefficient of Variation
Analysing Distributions
Distributions are very useful for interpreting and analyzing data.
A distribution describes the overall variability of the observed values of a variable.
In this section we introduce additional ways of analyzing distributions.
Percentile
Percentile
Percentile
Percentile
Percentile
Quartiles
Quartiles
Quartiles
Quartiles
Quartiles
Quartiles
Quartiles
z-Scores
Z-scores
Z-scores
Z-scores
Empirical Rule
Empirical Rule
Identifying Outliers
Identifying Outliers
Boxplots
Constructing a boxplot
Constructing a boxplot
Constructing a boxplot
Boxplots
Boxplot outliers
Boxplot in Excel
Boxplot in Excel
Measures of Association Between Two Variables
We may be interested in the relationship between two variables
Dataset
Scatter Charts
Covariance
Covariance
Covariance in Excel
Scatter Charts and Associated Covariance Values for Different Variable Relationships
Problems with Covariance
Correlation Coefficient
Correlation Coefficient
Correlation Coefficient
Correlation Coefficient
Correlation Coefficient
Data Cleansing
The data in a data set are often said to be “dirty” and “raw” before they have been put into a form that is best suited for investigation, analysis, and modelling.
Missing Data
Remedial Measures
Missing Data Strategy
Missing completely at random (MCAR)
If the tendency for an observation to be missing the value for some variable is entirely random, then whether data are missing does not depend on either the value of the missing data or the value of any other variable in the data.
Missing at Random (MAR)
If the tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data
Missing not at random (MNAR)
Data is MNAR if the tendency for the value of a variable to be missing is related to the value that is missing.
Missing Data Strategy
If a variable has observations for which the missing values are MCAR or MAR and only a relatively small number of observations are missing values, the observations that are missing values can be ignored.
1
If a variable has observations for which the missing values are MNAR, the observation with missing values cannot be ignored because any analysis that includes the variable with MNAR values will be biased.
2
Determining the missing value
If the missing values cannot be determined and ignoring missing values or removing a variable with missing values from consideration is not an option, imputation (systematic replacement of missing values with values that seems reasonable) may be useful.
Options for replacing the missing entries for a variable include replacing the missing value with the variable’s mode, mean, or median.
Determining the missing value
Imputing values in this manner is truly valid only if variable values are MCAR; otherwise, we may be introducing misleading information into the data.
If missing values are particularly troublesome and MAR, it may be possible to build a model to predict a variable with missing values and then to use these predictions in place of the missing entries.
How to deal with missing values is fairly subjective, and caution must be used to not induce bias by replacing missing values.
Blakely Tires
Tire tread depth
Count Blanks
Count Blanks
Missing Values
Missing Values
Missing Values
Missing Values
Identification of Erroneous Outliers and Other Erroneous Values
Identification of Erroneous Outliers and Other Erroneous Values
Identification of Erroneous Outliers and Other Erroneous Values
Identification of Erroneous Outliers and Other Erroneous Values
Identification of Erroneous Outliers and Other Erroneous Values
Identification of Erroneous Outliers and Other Erroneous Values
Identification of Erroneous Outliers and Other Erroneous Values
Identification of Erroneous Outliers and Other Erroneous Values
Identification of Erroneous Outliers and Other Erroneous Values
Identification of Erroneous Outliers and Other Erroneous Values