1 of 154

Descriptive Statistics

Atul Nag

Associate Professor

KISS-DU

2 of 154

Overview of Using Data: Definitions and Goals

  • Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.
    • Table below shows a data set containing information for stocks in the Dow Jones Industrial Index (or simply “the Dow”) on June 25, 2019.

3 of 154

Overview of Using Data: Definitions and Goals

  • The Dow is tracked by many financial advisors and investors as an indication of the state of the overall financial markets and the economy in the United States.
  • The share prices for the 30 companies listed are the basis for computing the Dow Jones Industrial Average (DJI), which is tracked continuously by virtually every financial publication.
  • The index is named for Charles Dow and Edward Jones who first began calculating the DJI in 1896.

4 of 154

Overview of Using Data: Definitions and Goals

  • A characteristic or a quantity of interest that can take on different values is known as a variable; for the data in Table, the variables are Symbol, Industry, Share Price, and Volume.
  • An observation is a set of values corresponding to a set of variables; each row in the Table corresponds to an observation.
  • We are concerned with how the value of a variable can vary; variation is the difference in a variable measured over observations (time, customers, items, etc.).

5 of 154

Variables

Observations

6 of 154

Overview of Using Data: Definitions and Goals

  • The role of descriptive analytics is to collect and analyze data to gain a better understanding of variation and its impact on the business setting.
  • The values of some variables are under the direct control of the decision-maker (these are often called decision variables).
  • The values of other variables may fluctuate with uncertainty because of factors outside the direct control of the decision-maker.

7 of 154

Overview of Using Data: Definitions and Goals

  • In general, a quantity whose values are not known with certainty is called a random variable, or uncertain variable.
  • When we collect data, we are gathering past observed values or realizations of a variable.
  • By collecting these past realizations of one or more variables, our goal is to learn more about the variation of a particular business situation.

8 of 154

Types of Data

9 of 154

Population and Sample Data

It is very important to collect sample data that are representative of the population data so that generalizations can be made from them.

Data can be categorized in several ways based on how they are collected, and the type collected.

Not feasible to collect data from the population of all elements of interest.

we collect data from a subset of the population known as a sample. E.g. Dow

10 of 154

Population and Sample Data

In most cases (although not true of the Dow), a representative sample can be gathered by random sampling from the population data.

Dealing with populations and samples can introduce subtle differences in how we calculate and interpret summary statistics.

In almost all practical applications of business analytics, we will be dealing with sample data.

11 of 154

Quantitative and Categorical Data

Data are considered quantitative data if numeric and arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed on them. E.g. sum the values for Volume in the Dow data

If arithmetic operations cannot be performed on the data, they are considered categorical data.

    • We can summarize categorical data by counting the number of observations or computing the proportions of observations in each category. E.g. count the number of companies in the Dow that are in the telecommunications industry.

12 of 154

Cross-Sectional and Time Series Data

Cross-sectional data are collected from several entities at the same, or approximately the same, point in time.

    • The data in Table are cross-sectional because they describe the 30 companies that comprise the Dow at the same point in time (June 2019)

Time series data are collected over several time periods.

    • Graphs of time series data are frequently found in business and economic publications.
    • Such graphs help analysts understand what happened in the past, identify trends over time, and project future levels for the time series.

13 of 154

Cross-Sectional and Time Series Data

14 of 154

Sources of Data

  • Data necessary to analyse a business problem or opportunity can often be obtained with an appropriate study; such statistical studies can be classified as either experimental or observational

15 of 154

Experimental Study

  • In an experimental study, a variable of interest is first identified.
  • Then one or more other variables are identified and controlled or manipulated to obtain data about how these variables influence the variable of interest.

16 of 154

Experimental Study

  • For example, if a pharmaceutical firm conducts an experiment to learn about how a new drug affects blood pressure, then blood pressure is the variable of interest.
  • The dosage level of the new drug is another variable that is hoped to have a causal effect on blood pressure.
  • To obtain data about the effect of the new drug, researchers select a sample of individuals.

17 of 154

Experimental Study

The dosage level of the new drug is controlled by giving different dosages to the different groups of individuals.

Before and after the study, data on blood pressure are collected for each group.

Statistical analysis of these experimental data can help determine how the new drug affects blood pressure.

18 of 154

Nonexperimental, or observational

Nonexperimental, or observational, studies make no attempt to control the variables of interest.

A survey is perhaps the most common type of observational study.

    • For instance, in a personal interview survey, research questions are first identified.
    • Then a questionnaire is designed and administered to a sample of individuals.
    • Some restaurants use observational studies to obtain data about customer opinions regarding the quality of food, quality of service, atmosphere, and so on.

19 of 154

Nonexperimental, or observational

20 of 154

Sources of Data

  • In some cases, the data needed for a particular application exist from an experimental or observational study that has already been conducted.
    • For example, companies maintain a variety of databases about their employees, customers, and business operations.
    • Data on employee salaries, ages, and years of experience can usually be obtained from internal personnel records.
    • Other internal records contain data on sales, advertising expenditures, distribution costs, inventory levels, and production quantities.
    • Most companies also maintain detailed data about their customer

21 of 154

Sources of Data

Anyone who wants to use data and statistical analysis to aid in decision making must be aware of the time and cost required to obtain the data.

The use of existing data sources is desirable when data must be obtained in a relatively short period of time.

22 of 154

Sources of Data

If important data are not readily available from a reliable existing source, the additional time and cost involved in obtaining the data must be considered.

1

In all cases, the decision-maker should consider the potential contribution of the statistical analysis to the decision-making process.

2

The cost of data acquisition and the subsequent statistical analysis should not exceed the savings generated by using the information to make a better decision

3

23 of 154

Generating Data

  • Organizations that specialize in collecting and maintaining data make available substantial amounts of business and economic data.
  • Companies can access these external data sources through leasing arrangements or by purchase.
  • Dun & Bradstreet, Bloomberg, and Dow Jones & Company are three firms that provide extensive business database services to clients.
  • Nielsen and Ipsos are two companies that have built successful businesses collecting and processing data that they sell to advertisers and product manufacturers.
  • Data are also available from a variety of industry associations and special-interest organizations.

24 of 154

Generating Data

Government agencies are another important source of existing data.

For instance, the website data.gov.in was launched by the Indian government in 2012 to make it easier for the public to access data.

The data.gov.in website includes over 150,000 data sets from a variety of Indian departments and agencies, but many other federal agencies maintain their own websites and data repositories.

Many state and local governments are also now providing data sets online. For example, the state of Odisha maintains open data portals at odisha.data.gov.in.

In general, the Internet is an important source of data and statistical information. One can obtain access to stock quotes, meal prices at restaurants, salary data, and a wide array of other information simply by performing an Internet search.

25 of 154

Modifying Data in Excel

26 of 154

Sorting and Filtering Data in Excel

  • Excel contains many useful features for sorting and filtering data so that one can more easily identify patterns.

27 of 154

Per cent Change

  • The per cent change in sales for each model from February 2018 to February 2019 has been calculated.
  • This is done by entering the formula =(D2-E2)/E2 in cell F2 and then copying the contents of this cell to cells F3 to F20.

28 of 154

Sort on 2018 sales

  • To do this, we use Excel’s Sort function, as shown in the following steps.
    • Step 1. Select cells A1:F21
    • Step 2. Click the Data tab in the Ribbon
    • Step 3. Click Sort in the Sort & Filter group
    • Step 4. Select the check box for My data has headers
    • Step 5. In the first Sort by dropdown menu, select Sales (February 2018)
    • Step 6. In the Order dropdown menu, select Largest to Smallest
    • Step 7. Click OK

29 of 154

Sort on 2018 sales

30 of 154

Filter

  • Suppose that we are interested only in seeing the sales of models made by Nissan. We can do this using Excel’s Filter function:
    • Step 1. Select cells A1:F21
    • Step 2. Click the Data tab in the Ribbon
    • Step 3. Click Filter in the Sort & Filter group
    • Step 4. Click on the Filter Arrow in column B, next to Manufacturer
    • Step 5. If all choices are checked, you can easily deselect all choices by unchecking (Select All). Then select only the check box for Nissan.
    • Step 6. Click OK

31 of 154

Conditional Formatting of Data in Excel

Conditional formatting in Excel can make it easy to identify data that satisfy certain conditions in a data set.

For instance, suppose that we wanted to quickly identify the automobile models for which sales had decreased from February 2018 to February 2019. We can quickly highlight these models:

32 of 154

Conditional Formatting of Data in Excel

Step 1. Starting with the original data, select cells F1:F21

Step 2. Click the Home tab in the Ribbon

Step 3. Click Conditional Formatting in the Styles group

Step 4. Select Highlight Cells Rules, and click Less Than . . . from the dropdown menu

Step 5. Enter 0% in the Format cells that are LESS THAN: box

Step 6. Click OK

33 of 154

Conditional Formatting of Data in Excel

34 of 154

Conditional Formatting of Data in Excel

  • Instead of highlighting only models with decreasing sales, we could instead choose Data Bars from the Conditional Formatting dropdown menu in the Styles group of the Home tab in the Ribbon.

35 of 154

Conditional Formatting of Data in Excel

Data bars are essentially a bar chart input into the cells that shows the magnitude of the cell values.

The widths of the bars in this display are comparable to the values of the variable for which the bars have been drawn; a value of 20 creates a bar twice as wide as that for a value of 10.

Again, we can easily see which models had decreasing sales, but Data

Bars also provide us with a visual representation of the magnitude of the change in sales.

36 of 154

Creating Distributions from Data

37 of 154

Creating Distributions from Data

  • Distributions help summarize many characteristics of a data set by describing how often certain values for a variable appear in that data set.
  • Distributions can be created for both categorical and quantitative data, and they assist the analyst in gauging variation

38 of 154

Frequency Distributions for Categorical Data

  • It is often useful to create a frequency distribution for a data set.
  • A frequency distribution is a summary of data that shows the number (frequency) of observations in each of several non-overlapping classes, typically referred to as bins.

39 of 154

Frequency Distributions for Categorical Data

40 of 154

Frequency Distributions for Categorical Data

  • We can use Excel to calculate the frequency of categorical observations occurring in a data set using the COUNTIF function.
  • In cell E2, we enter the formula =COUNTIF($A$2:$B$26, D2), where A2:B26 is the range for the sample data, and D2 is the bin (Coca-Cola) that we are trying to match.
  • The COUNTIF function in Excel counts the number of times a certain value appears in the indicated range

41 of 154

Frequency Distributions for Categorical Data

42 of 154

Relative Frequency and Percent Frequency Distributions

  • A frequency distribution shows the number (frequency) of items in each of several non-overlapping bins.
  • However, we are often interested in the proportion, or percentage, of items in each bin.
  • The relative frequency of a bin equals the fraction or proportion of items belonging to a class.
  • For a data set with n observations, the relative frequency of each bin can be determined as follows:

43 of 154

Relative Frequency and Percent Frequency Distributions

  • A relative frequency distribution is a tabular summary of data showing the relative frequency for each bin.
  • A percent frequency distribution summarizes the percent frequency of the data for each bin.

44 of 154

Relative Frequency and Percent Frequency Distributions

45 of 154

Frequency Distributions for Quantitative Data

The three steps necessary to define the classes for a frequency distribution with quantitative data are as follows:

Determine the number of nonoverlapping bins.

Determine the width of each bin.

Determine the bin limits.

We can also create frequency distributions for quantitative data, but we must be more careful in defining the nonoverlapping bins to be used in the frequency distribution.

46 of 154

Frequency Distributions for Quantitative Data

  • These data show the time in days required to complete year-end audits for a sample of 20 clients

47 of 154

Frequency Distributions for Quantitative Data

Number of Bins

Bins are formed by specifying the ranges used to group the data.

As a general guideline, use from 5 to 20 bins.

Width of the Bins

As a general guideline, we recommend that the width be the same for each bin.

48 of 154

Frequency Distributions for Quantitative Data

  • Bin Limits
    • Bin limits must be chosen so that each data item belongs to one and only one class.
    • The lower bin limit identifies the smallest possible data value assigned to the bin.
    • The upper bin limit identifies the largest possible data value assigned to the class.

49 of 154

Frequency Distributions for Quantitative Data in Excel

  • The sample of 20 audit times is contained in cells A2:A21.
  • The upper limits of the defined bins are in cells C2:C6.
  • We can use the FREQUENCY function in Excel to count the number of observations in each bin.
    • Step 1. Select cells D2:D6
    • Step 2. Type the formula =FREQUENCY(A2:A21, C2:C6). The range A2:A21 defines the data set, and the range C2:C6 defines the bins.
    • Step 3. Press CTRL+SHIFT+ENTER after typing the formula in Step 2

50 of 154

Frequency Distributions for Quantitative Data in Excel

51 of 154

Histograms

  • A common graphical presentation of quantitative data is a histogram.
  • This graphical summary can be prepared for data previously summarized in either a frequency, a relative frequency, or a percent frequency distribution.
  • A histogram is constructed by placing the variable of interest on the horizontal axis and the selected frequency measure (absolute frequency, relative frequency, or percent frequency) on the vertical axis.

52 of 154

Histograms

  • The frequency measure of each class is shown by drawing a rectangle whose base is the class limits on the horizontal axis and whose height is the corresponding frequency measure.

53 of 154

Histograms

  • Histograms can be created in Excel using the Data Analysis ToolPak.
  • Step 1. Click the Data tab in the Ribbon
  • Step 2. Click Data Analysis in the Analyze group
  • Step 3. When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK
    • In the Input Range: box, enter A2:A21
    • In the Bin Range: box, enter C2:C6
    • Under Output Options:, select New Worksheet Ply: Select the check box for Chart Output �Click OK

54 of 154

Data Analysis Toolpak

  • If Data Analysis does not appear in your Analyze group then you need to include the Data Analysis ToolPak Add-In.
  • To do so, click on the File tab in the Ribbon and choose Options.
  • When the Excel Options dialog box opens, click Add-Ins.
  • At the bottom of the Excel Options dialog box, where it says Manage: Excel Add-ins, click Go.…
  • Select the check box for Analysis ToolPak, and click OK.

55 of 154

Histograms

56 of 154

Histograms

We have also removed the gaps between the columns in the histogram in Excel to match the traditional format of histograms.

To remove the gaps between the columns in the histogram created by Excel, follow these steps:

    • Step 1. Right-click on one of the columns in the histogram. Select Format Data Series
    • Step 2. When the Format Data Series pane opens, click the Series Options button, Set the Gap Width to 0%

57 of 154

Histograms

58 of 154

Skewness

One of the most important uses of a histogram is to provide information about the shape, or form, of a distribution.

Skewness, or the lack of symmetry, is an important characteristic of the shape of a distribution.

59 of 154

Skewness

60 of 154

Skewness

Panel A shows the histogram for a set of data moderately skewed to the left. A histogram is said to be skewed to the left if its tail extends farther to the left than to the right.

    • This histogram is typical for exam scores, with no scores above 100%, most of the scores above 70%, and only a few really low scores.

Panel B shows the histogram for a set of data moderately skewed to the right. A histogram is said to be skewed to the right if its tail extends farther to the right than to the left.

    • An example of this type of histogram would be for data such as housing prices; a few expensive houses create skewness in the right tail.

61 of 154

Skewness

Panel C shows a symmetric histogram, in which the left tail mirrors the shape of the right tail.

    • Histograms for data found in applications are never perfectly symmetric, but the histogram for many applications may be roughly symmetric. Data for SAT scores, the heights and weights of people, and so on lead to histograms that are roughly symmetric.

Panel D shows a histogram highly skewed to the right.

    • This histogram was constructed from data on the amount of customer purchases in one day at a women’s apparel store. Data from applications in business and economics often lead to histograms that are skewed to the right. For instance, data on housing prices, salaries, purchase amounts, and so on often result in histograms skewed to the right.

62 of 154

Cumulative Distributions

A variation of the frequency distribution that provides another tabular summary of quantitative data is the cumulative frequency distribution, which uses widths, and class limits developed for the frequency

However, rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class..

63 of 154

Cumulative Distribution

  • As a final point, a cumulative relative frequency distribution shows the proportion of data items, and a cumulative percent frequency distribution shows the percentage of data items with values less than or equal to the upper limit of each class.
  • The cumulative relative frequency distribution can be computed either by summing the relative frequencies in the relative frequency distribution or by dividing the cumulative frequencies by the total number of items.

64 of 154

Cumulative Distribution

cumulative relative frequencies in column 3 by dividing the cumulative frequencies in column 2 by the total number of items ( n = 20). The cumulative percent frequencies were again computed by multiplying the relative frequencies by 100.

65 of 154

Measures of Location

66 of 154

Mean (Arithmetic Mean)

  •  

67 of 154

Data

68 of 154

Mean (Arithmetic Mean)

69 of 154

Mean (Arithmetic Mean)

  •  

70 of 154

Median

  • The median, another measure of the central location, is the value in the middle when the data are arranged in ascending order (smallest to largest value).
    • With an odd number of observations, the median is the middle value.
    • An even number of observations has no single middle value.

71 of 154

Median

  •  

72 of 154

Mean vs Median

  • Although the mean is the more commonly used measure of the central location, in some situations the median is preferred.
  • The mean is influenced by extremely small and large data values.
  • We can generalize, saying that whenever a data set contains extreme values or is severely skewed, the median is often the preferred measure of central location.

73 of 154

Mode

  • mode, is the value that occurs most frequently in a dataset.
  • Consider the sample of five class sizes: 32 42 46 46 54
  • The only value that occurs more than once is 46.
  • Because this value, occurring with a frequency of 2, has the greatest frequency, it is the mode.

74 of 154

Mode

If data contain at least two modes, we say that they are multimodal.

A special case of multimodal data occurs when the data contain exactly two modes; in such cases, we say that the data are bimodal.

In multimodal cases when there are more than two modes, the mode is almost never reported because listing three or more modes is not particularly helpful in describing a location for the data.

Also, if no value in the data occurs more than once, we say the data have no mode.

75 of 154

Mode

The Excel MODE.SNGL function will return only a single most-often-occurring value.

For multimodal distributions, we must use the MODE.MULT command in Excel to return more than one mode.

For example, two selling prices occur twice in HomeSales Data: $138,000 and $254,000. Hence, these data are bimodal.

To find both of the modes in Excel, we take these steps:

    • Step 1. Select cells E4 and E5
    • Step 2. Type the formula =MODE.MULT(B2:B13)
    • Step 3. Press CTRL+SHIFT+ENTER after typing the formula in Step 2.

76 of 154

Measures of variability

77 of 154

Dataset

  • It is often desirable to consider measures of variability, or dispersion

78 of 154

Variability of Dataset

79 of 154

Range

  •  

80 of 154

Variance

  •  

81 of 154

Variance

  • To illustrate the computation of the sample variance, we will use the data on class size for the sample of five college classes.
  • A summary of the data, including the computation of the deviations about the mean and the squared deviations about the mean, is shown

82 of 154

Variance

  •  

83 of 154

Variance

  •  

84 of 154

Standard Deviation

  •  

85 of 154

Coefficient of Variation

  • In some situations we may be interested in a descriptive statistic that indicates how large the standard deviation is relative to the mean.
  • This measure is called the coefficient of variation and is usually expressed as a percentage.

86 of 154

Coefficient of Variation

  •  

87 of 154

Coefficient of Variation

  • In general, the coefficient of variation is a useful statistic for comparing the relative variability of different variables, each with different standard deviations and different means.

88 of 154

Analysing Distributions

Distributions are very useful for interpreting and analyzing data.

A distribution describes the overall variability of the observed values of a variable.

In this section we introduce additional ways of analyzing distributions.

89 of 154

Percentile

  • A percentile is the value of a variable at which a specified (approximate) percentage of observations are below that value.
  • The pth percentile tells us the point in the data where approximately p% of the observations have values less than the pth percentile; hence, approximately (100 − p)% of the observations have values greater than the pth percentile.

90 of 154

Percentile

  •  

91 of 154

Percentile

  •  

92 of 154

Percentile

  •  

93 of 154

Percentile

  •  

94 of 154

Quartiles

  • It is often desirable to divide data into four parts, with each part containing approximately one-fourth, or 25 percent, of the observations.
  • These division points are referred to as the quartiles and are defined as follows:
    • Q1 = first quartile, or 25th percentile
    • Q2 = second quartile, or 50th percentile (also the median)
    • Q3 = third quartile, or 75th percentile

95 of 154

Quartiles

  • To demonstrate quartiles, the home sales data are again arranged in ascending order.

  • We already identified Q2, the second quartile (median), as 203,750. To find Q1 and Q3 we must find the 25th and 75th percentiles.

96 of 154

Quartiles

  •  

97 of 154

Quartiles

  •  

98 of 154

Quartiles

  • The quartiles divide the home sales data into four parts, with each part containing 25% of the observations.

99 of 154

Quartiles

  •  

100 of 154

Quartiles

  •  

101 of 154

z-Scores

  •  

102 of 154

Z-scores

  •  

103 of 154

Z-scores

104 of 154

Z-scores

  •  

105 of 154

Empirical Rule

  • When the distribution of data exhibits a symmetric bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that are within a specified number of standard deviations of the mean.
  • Many, but not all, distributions of data found in practice exhibit a symmetric bell-shaped distribution.

106 of 154

Empirical Rule

  • For data having a bell-shaped distribution:
    • Approximately 68% of the data values will be within 1 standard deviation of the mean.
    • Approximately 95% of the data values will be within 2 standard deviations of the mean.
    • Almost all of the data values will be within 3 standard deviations of the mean.

107 of 154

Identifying Outliers

  • Sometimes a data set will have one or more observations with unusually large or unusually small values. These extreme values are called outliers.
  • Reasons:
    • An outlier may be a data value that has been incorrectly recorded; if so, it can be corrected before the data are analyzed further.
    • An outlier may also be from an observation that doesn’t belong to the population we are studying and was incorrectly included in the data set; if so, it can be removed.
    • Finally, an outlier may be an unusual data value that has been recorded correctly and is a member of the population we are studying. In such cases, the observation should remain.

108 of 154

Identifying Outliers

  • Standardized values (z-scores) can be used to identify outliers.
  • Recall that the empirical rule allows us to conclude that for data with a bell-shaped distribution, almost all the data values will be within 3 standard deviations of the mean.
  • Hence, in using z-scores to identify outliers, we recommend treating any data value with a z-score less than −3 or greater than +3 as an outlier.
  • Such data values can then be reviewed to determine their accuracy and whether they belong in the data set.

109 of 154

Boxplots

  • A boxplot is a graphical summary of the distribution of data. A boxplot is developed from the quartiles for a data set.

110 of 154

Constructing a boxplot

  •  

111 of 154

Constructing a boxplot

  •  

112 of 154

Constructing a boxplot

  1. The dashed lines are called whiskers. The whiskers are drawn from the ends of the box to the smallest and largest values inside the limits computed in Step 3. Thus, the whiskers end at home sales values of 108,000 and 298,000.
  2. Finally, the location of each outlier is shown with an asterisk (*). In the data, we see one outlier, 456,250.

113 of 154

Boxplots

  • Boxplots are also very useful for comparing different data sets. For instance, if we want to compare home sales from several different communities, we could create boxplots for recent home sales in each community.

114 of 154

Boxplot outliers

  • Note that boxplots use a different definition of an outlier than what we described for using z-scores because the distribution of the data in a boxplot is not assumed to follow a bell-shaped curve.
  • However, the interpretation is the same.
  • The outliers in a boxplot are extreme values that should be investigated to ensure data accuracy.

115 of 154

Boxplot in Excel

  • First we will create a boxplot for a single variable using the HomeSales file.
  • Step 1. Select cells B1:B13
  • Step 2. Click the Insert tab on the Ribbon
    • Click the Insert Statistic Chart button in the Charts group
    • Choose the Box and Whisker chart from the drop-down menu
  • Excel orients the boxplot vertically, and by default, it also includes a marker for the mean.

116 of 154

Boxplot in Excel

  • Next we will use the HomeSalesComparison file to create boxplots in Excel for multiple variables.
  • Step 1. Select cells B1:F11
  • Step 2. Click the Insert tab on the Ribbon
    • Click the Insert Statistic Chart button in the Charts group
    • Choose the Box and Whisker chart from the drop-down menu
  • Excel again orients the boxplot vertically.
  • The different selling locations are shown in the Legend at the top of the figure, and different colours are used for each boxplot.

117 of 154

Measures of Association Between Two Variables

We may be interested in the relationship between two variables

118 of 154

Dataset

  • Sales manager of Queensland Amusement Park, who is in charge of ordering bottled water to be purchased by park customers.
  • The sales manager believes that daily bottled water sales in the summer are related to the outdoor temperature.
  • The table shows data for high temperatures and bottled water sales for 14 summer days.
  • The data have been sorted by high temperature from lowest value to highest value.

119 of 154

Scatter Charts

  • A scatter chart is a useful graph for analysing the relationship between two variables.
  • The scatter chart in the figure suggests that higher daily high temperatures are associated with higher bottled water sales.
  • This is an example of a positive relationship because when one variable (high temperature) increases, the other variable (sales of bottled water) generally also increases.
  • The scatter chart also suggests that a straight line could be used as an approximation for the relationship between high temperature and sales of bottled water.

120 of 154

Covariance

  •  

121 of 154

Covariance

  •  

122 of 154

Covariance in Excel

  •  

123 of 154

Scatter Charts and Associated Covariance Values for Different Variable Relationships

124 of 154

Problems with Covariance

  •  

125 of 154

Correlation Coefficient

  • The correlation coefficient measures the relationship between two variables, and, unlike covariance, the relationship between two variables is not affected by the units of measurement for x and y.

126 of 154

Correlation Coefficient

  •  

 

 

 

127 of 154

Correlation Coefficient

  •  

128 of 154

Correlation Coefficient

  •  

129 of 154

Correlation Coefficient

  •  

130 of 154

Data Cleansing

The data in a data set are often said to be “dirty” and “raw” before they have been put into a form that is best suited for investigation, analysis, and modelling.

131 of 154

Missing Data

  • Data sets commonly include observations with missing values for one or more variables.
  • In some cases missing data naturally occur; these are called legitimately missing data.
  • In other cases missing data occur for different reasons; these are called illegitimately missing data.
    • These cases can result for a variety of reasons,
      • a respondent electing not to answer a question that she or he is expected to answer,
      • a respondent dropping out of a study before its completion or sensors
      • other electronic data collection equipment failing during a study.

132 of 154

Remedial Measures

  • The primary options for addressing such missing data are
  • to discard observations (rows) with any missing values,
  • to discard any variable (column) with missing values,
  • to fill in missing entries with estimated values,
  • to apply a data-mining algorithm (such as classification and regression trees) that can handle missing values.

133 of 154

Missing Data Strategy

Missing completely at random (MCAR)

If the tendency for an observation to be missing the value for some variable is entirely random, then whether data are missing does not depend on either the value of the missing data or the value of any other variable in the data.

Missing at Random (MAR)

If the tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data

Missing not at random (MNAR)

Data is MNAR if the tendency for the value of a variable to be missing is related to the value that is missing.

134 of 154

Missing Data Strategy

If a variable has observations for which the missing values are MCAR or MAR and only a relatively small number of observations are missing values, the observations that are missing values can be ignored.

1

If a variable has observations for which the missing values are MNAR, the observation with missing values cannot be ignored because any analysis that includes the variable with MNAR values will be biased.

2

135 of 154

Determining the missing value

If the missing values cannot be determined and ignoring missing values or removing a variable with missing values from consideration is not an option, imputation (systematic replacement of missing values with values that seems reasonable) may be useful.

Options for replacing the missing entries for a variable include replacing the missing value with the variable’s mode, mean, or median.

136 of 154

Determining the missing value

Imputing values in this manner is truly valid only if variable values are MCAR; otherwise, we may be introducing misleading information into the data.

If missing values are particularly troublesome and MAR, it may be possible to build a model to predict a variable with missing values and then to use these predictions in place of the missing entries.

How to deal with missing values is fairly subjective, and caution must be used to not induce bias by replacing missing values.

137 of 154

Blakely Tires

  • In an attempt to learn about the conditions of its tires on automobiles in Texas, Blakely has obtained information for each of the four tires from 116 automobiles with Blakely brand tires.
  • The data obtained by Blakely includes the position of the tire on the automobile (left front, left rear, right front, right rear), age of the tire, mileage on the tire, and depth of the remaining tread on the tire.
  • Before Blakely management attempts to learn more about its tires on automobiles in Texas, it wants to assess the quality of these data.

138 of 154

Tire tread depth

  • The tread depth of a tire is a vertical measurement between the top of the tread rubber to the bottom of the tire’s deepest grooves and is measured in 32nds of an inch in the United States.
  • New Blakely brand tires have a tread depth of 10/32nds of an inch, and a tire’s tread depth is considered insufficient if it is 2/32nds of an inch or less.
  • Shallow tread depth is dangerous as it results in poor traction and so makes steering the automobile more difficult.
  • Blakely’s tires generally last for four to five years or 40,000 to 60,000 miles.

139 of 154

Count Blanks

  • We begin assessing the quality of these data by determining which (if any) observations have missing values for any of the variables in the TreadWear data. We can do so using Excel’s COUNTBLANK function.
  • After opening the file TreadWear
    • Step 1. Enter the heading # of Missing Values in cell G2
    • Step 2. Enter the heading Life of Tire (Months) in cell H1
    • Step 3. Enter =COUNTBLANK(C2:C457) in cell H2
  • Repeat this process for the remaining quantitative variables in the data (Tread Depth and Miles) in columns I and J

140 of 154

Count Blanks

141 of 154

Missing Values

  • Next we sort all of Blakely’s data on Miles from smallest to largest value to determine which observation is missing its value of this variable.
  • Because only one of the 456 observations is missing its value for Miles, this is likely MCAR and so ignoring the observation would not likely bias any analysis we wish to undertake with these data.

142 of 154

Missing Values

  • However, we may be able to salvage this observation by logically determining a reasonable value to substitute for this missing value.
  • It is sensible to suspect that the value of Miles for the left front tire of the automobile with the ID Number 3354942 would be identical to the value of miles for the other three tires on this automobile, so we sort all the data on ID number and scroll through the data to find the four tires that belong to the automobile with the ID Number 3354942.

143 of 154

Missing Values

144 of 154

Missing Values

145 of 154

Identification of Erroneous Outliers and Other Erroneous Values

  • Examining the variables in the data set by use of summary statistics, frequency distributions, bar charts and histograms, z-scores, scatter charts, correlation coefficients, and other tools can uncover data-quality issues and outliers.
  • For example, finding the minimum or maximum value for Tread Depth in the TreadWear data may reveal unrealistic values—perhaps even negative values—for Tread Depth, which would indicate a problem for the value of Tread Depth for any such observation.

146 of 154

Identification of Erroneous Outliers and Other Erroneous Values

  • It is important to note here that many software, including Excel, ignore missing values when calculating various summary statistics such as the mean, standard deviation, minimum, and maximum. However, if missing values in a data set are indicated with a unique value (such as 9999999), these values may be used by software when calculating various summary statistics such as the mean, standard deviation, minimum, and maximum.
  • Both cases can result in misleading values for summary statistics, which is why many analysts prefer to deal with missing data issues prior to using summary statistics to attempt to identify erroneous outliers and other erroneous values in the data.

147 of 154

Identification of Erroneous Outliers and Other Erroneous Values

  • We calculate the mean and standard deviation of each variable (age of the tire, mileage on the tire, and depth of the remaining tread on the tire) to assess whether the values of these variables are reasonable in general.
  • Return to the file TreadWear and complete the following steps:
    • Step 1. Enter the heading Mean in cell G3
    • Step 2. Enter the heading Standard Deviation in cell G4
    • Step 3. Enter =AVERAGE(C2:C457) in cell H3
    • Step 4. Enter =STDEV.S(C2:C457) in cell H4
  • By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the mean and standard deviation.

148 of 154

Identification of Erroneous Outliers and Other Erroneous Values

149 of 154

Identification of Erroneous Outliers and Other Erroneous Values

  • Summary statistics only provide an overall perspective on the data.
  • We also need to attempt to determine if there are any erroneous individual values for our three variables.
  • We start by finding the minimum and maximum values for each variable.
  • Return again to the file TreadWear and complete the following steps:
    • Step 1. Enter the heading Minimum in cell G5
    • Step 2. Enter the heading Maximum in cell G6
    • Step 3. Enter =MIN(C2:C457) in cell H5
    • Step 4. Enter =MAX(C2:C457) in cell H6

150 of 154

Identification of Erroneous Outliers and Other Erroneous Values

151 of 154

Identification of Erroneous Outliers and Other Erroneous Values

  • Not all erroneous values in a data set are extreme; these erroneous values are much more difficult to find.
  • However, if the variable with suspected erroneous values has a relatively strong relationship with another variable in the data, we can use this knowledge to look for erroneous values.
  • Here we will consider the variables Tread Depth and Miles; because more miles driven should lead to less tread depth on an automobile tire, we expect these two variables to have a negative relationship.

152 of 154

Identification of Erroneous Outliers and Other Erroneous Values

  • A scatter chart will enable us to see whether any of the tires in the data set have values for Tread Depth and Miles that are counter to this expectation.

153 of 154

Identification of Erroneous Outliers and Other Erroneous Values

  • The points that lie outside of this ellipse have values for at least one of these variables that is inconsistent with the negative relationship exhibited by the points inside the ellipse.
  • If we position the cursor over one of the points outside the ellipse, Excel will generate a pop-up box that shows that the values of Tread Depth and Miles for this point are 1.0 and 1472.1, respectively.
  • The tire represented by this point has very little tread and has been driven relatively few miles, which suggests that the value of one or both of these two variables for this tire may be inaccurate and should be investigated.

154 of 154

Identification of Erroneous Outliers and Other Erroneous Values

  • Closer examination of outliers and potential erroneous values may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis.
  • A conservative approach is to create two data sets, one with and one without outliers and potentially erroneous values, and then construct a model on both data sets.
  • If a model’s implications depend on the inclusion or exclusion of outliers and erroneous values, then you should spend additional time to track down the cause of the outliers.