1 of 54

Descriptive Analysis

Data Analysis

2 of 54

Flowchart: which type of data analysis is it?

Questions asked in the flowchart:

  • Did they summarize the data?
  • Did they report the summaries without interpretation?
  • Are they trying to predict measurement(s) for individuals?
  • Did they quantify whether the discoveries are likely to hold in a new sample?
  • Are they trying to figure out how changing the average of one measurement affects another?
  • Is the effect they are looking for an average effect or a deterministic effect?

Possible outcomes:

  • Not a data analysis
  • Descriptive
  • Exploratory
  • Inferential
  • Predictive
  • Causal
  • Mechanistic

Branch labels: Yes / No on the first five questions; Average / Deterministic on the last.

3 of 54

4 of 54

The data set

You

5 of 54

6 of 54

Size

Missingness

Central Tendency

Shape

Variability

Descriptive Analysis

7 of 54

2010 US Census Data Summary Table

(broken down by age)

8 of 54

... and stratified by sex

9 of 54

10 of 54

## install and load package

install.packages("ggplot2")

library(ggplot2)

## assign to object `df`

df <- msleep

11 of 54

Size

12 of 54

The Environment tab will tell you the size of your data frame

13 of 54

dim() tells us the number of rows and the number of columns
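
As a minimal sketch, assuming the ggplot2 package is installed so that the msleep data set is available, the size of the data frame can be checked from the console:

```r
library(ggplot2)   # provides the msleep data set

df <- msleep

dim(df)    # number of rows, then number of columns
nrow(df)   # rows only
ncol(df)   # columns only
```

For msleep, dim(df) should report 83 rows and 11 columns.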

14 of 54

15 of 54

Size of data frame

Variable names

Class of each variable

First few values of each variable

16 of 54

Size of data frame

Class of each variable

First few values of each variable

Variable names
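
The annotated outputs from these two slides did not survive extraction, so the function that produced them is not known. One base R function that reports all four pieces of information (size, variable names, classes, first few values) is str(); using it here is an assumption, not necessarily what the slides showed.

```r
library(ggplot2)   # provides the msleep data set

df <- msleep

# compact overview: dimensions, then one line per variable
# giving its name, class, and first few values
str(df)
```

If the dplyr package is installed, glimpse(df) prints a similar overview in a slightly different layout.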

17 of 54

Missingness

18 of 54

27 observations of brainwt are missing

That’s 32.5% of the observations in the dataset
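
The count and percentage above can be reproduced with is.na(). This is a sketch assuming df holds msleep as assigned earlier; the variable names n_missing and pct_missing are illustrative.

```r
library(ggplot2)   # provides the msleep data set

df <- msleep

# is.na() returns TRUE for each missing value;
# summing counts them, averaging gives the proportion
n_missing   <- sum(is.na(df$brainwt))
pct_missing <- 100 * mean(is.na(df$brainwt))

n_missing               # 27
round(pct_missing, 1)   # 32.5

# missing-value counts for every variable at once
colSums(is.na(df))
```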

19 of 54

## install naniar package

install.packages("naniar")

library(naniar)

## visualize missingness

vis_miss(df)

20 of 54

## visualize missingness

vis_miss(df)

21 of 54

## visualize relative missingness

gg_miss_var(df) + theme_bw()

22 of 54

Shape

23 of 54

A Normal Distribution

Curve is symmetric around middle

most values are near the central value

24 of 54

except for this bump here...

Approximately Normal, with most values centered around 10

ggplot(df, aes(sleep_total)) +

geom_density()

25 of 54

Although this tail is slightly longer than the other tail

ggplot(iris, aes(Sepal.Width)) +

geom_density()

26 of 54

A Skewed Distribution

skewed left

skewed right

most values fall to one extreme within the range
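
One quick numeric check for skew is to compare the mean with the median: a long right tail tends to pull the mean above the median, and a long left tail tends to pull it below. The sketch below uses simulated data (an assumption made for illustration, not data from the slides).

```r
set.seed(1)   # make the simulation reproducible

# exponential draws have a long right tail
right_skewed <- rexp(10000, rate = 1)

# flipping the sign puts the long tail on the left
left_skewed <- -right_skewed

mean(right_skewed) > median(right_skewed)   # TRUE: tail pulls the mean up
mean(left_skewed)  < median(left_skewed)    # TRUE: tail pulls the mean down
```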

27 of 54

28 of 54

A Uniform Distribution

distribution of values is constant over the range of the variable

29 of 54

outlier

30 of 54

1000 people between the ages of 18 and 65

1 person aged 95 years old

1 person aged 1 year old

Your sample

31 of 54

Age

1 person aged 1 year old

1 person aged 95 years old

1000 people between the ages of 18 and 65

32 of 54

Caution: Observations should only be removed from your dataset if you have a valid reason to do so.

33 of 54

Outliers can occur due to...

  • Data entry errors
  • Poor sampling procedures
  • Technical or mechanical error
  • Unexpected changes in weather

34 of 54

library(ggplot2)

ggplot(iris, aes(Petal.Length))+

geom_density()

35 of 54

ggplot(iris, aes(Species, Petal.Length))+

geom_boxplot()

36 of 54

IQR

outliers
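
A common convention, and the one geom_boxplot() uses to decide which points to draw individually, flags values more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule by hand to a small made-up vector; the vector and variable names are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 50)   # made-up values with one extreme point

q1  <- quantile(x, 0.25, names = FALSE)   # first quartile: 2.5
q3  <- quantile(x, 0.75, names = FALSE)   # third quartile: 5.5
iqr <- q3 - q1                            # interquartile range: 3

lower_fence <- q1 - 1.5 * iqr   # -2
upper_fence <- q3 + 1.5 * iqr   # 10

# values outside the fences are flagged as potential outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers   # 50
```

Flagging a value this way only marks it for a closer look; it is not by itself a reason to remove the observation.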

37 of 54

Central Tendency

38 of 54

1 2 3 4 5 6

The mean is 3.5

Calculating the mean:

  1. Sum all values

1 + 2 + 3 + 4 + 5 + 6 = 21

2. Divide sum by the number of observations (6)

21/6 = 3.5

39 of 54

1 2 3 3 4 5 6

The mean is approximately 3.43

Calculating the mean:

  1. Sum all values

1 + 2 + 3 + 3 + 4 + 5 + 6 = 24

2. Divide sum by the number of observations (7)

24/7 ≈ 3.43
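
The two-step calculation can be checked in R, either by hand with sum() and length() or directly with mean(). The vector names x and y are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6)
sum(x) / length(x)   # 21 / 6 = 3.5
mean(x)              # same result

y <- c(1, 2, 3, 3, 4, 5, 6)
sum(y)               # 24
length(y)            # 7 observations
round(mean(y), 2)    # 24 / 7, rounded to 3.43
```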

40 of 54

41 of 54

1 2 3 4 5 6

The median is 3.5

42 of 54

1 2 3 4 5 6

The median is 3.5

1 2 3 3 4 5 6

The median is 3
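
The median is the middle value of the sorted data. With an even number of observations there is no single middle value, so the two middle values are averaged. A short check in R, with illustrative variable names:

```r
even_n <- c(1, 2, 3, 4, 5, 6)      # six values: middle pair is 3 and 4
odd_n  <- c(1, 2, 3, 3, 4, 5, 6)   # seven values: middle value is the 4th

median(even_n)   # (3 + 4) / 2 = 3.5
median(odd_n)    # 3
```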

43 of 54

44 of 54

ggplot(df, aes(bodywt)) +

geom_histogram()

A few mammals with extremely large body weights pull the mean far above the median

mean = 166

median = 1.67
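
The two values above can be computed directly. This is a sketch assuming df holds msleep as assigned earlier; na.rm = TRUE is added only as a precaution in case a variable contains missing values.

```r
library(ggplot2)   # provides the msleep data set

df <- msleep

mean_bodywt   <- mean(df$bodywt, na.rm = TRUE)
median_bodywt <- median(df$bodywt, na.rm = TRUE)

round(mean_bodywt)        # about 166
round(median_bodywt, 2)   # 1.67
```

The large gap between the two is itself a sign of a strongly right-skewed variable.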

45 of 54

46 of 54

The mode is the most frequent category

47 of 54

The mode is the most frequent category

ggplot(df, aes(order)) +

geom_bar() +

theme(axis.text.x = element_text(angle = 90,

hjust = 1,

vjust = 0.5))
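
Base R has no built-in function for the statistical mode (its mode() function reports how an object is stored instead). A minimal sketch using table() is shown below; stat_mode() is a hypothetical helper name, not a function from any package.

```r
# stat_mode() is a hypothetical helper: it returns the most
# frequent value(s) of x, including all values tied for first
stat_mode <- function(x) {
  counts <- table(x)                        # frequency of each category
  is_top <- as.vector(counts == max(counts))
  names(counts)[is_top]
}

stat_mode(c("a", "b", "b", "c"))   # "b"
```

With df assigned as earlier, stat_mode(df$order) would return the order with the tallest bar in the plot above.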

48 of 54

Variability

49 of 54

A single large value increases the variance

Variance is zero when every value is the same

50 of 54

The standard deviation is the square root of the variance
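
These properties can be checked with var() and sd() on small made-up vectors. Note that R's var() computes the sample variance, which divides by n - 1 rather than n.

```r
x <- c(1, 2, 3, 4, 5, 6)

var(x)         # sample variance: 17.5 / 5 = 3.5
sd(x)          # standard deviation
sqrt(var(x))   # same value as sd(x)

# variance is zero when every value is the same
var(rep(5, 6))   # 0

# replacing one value with a much larger one increases the variance
x_large <- c(1, 2, 3, 4, 5, 60)
var(x_large) > var(x)   # TRUE
```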

51 of 54

central tendency

variability

shape

missingness

52 of 54

53 of 54

The number (and %) of females and males in each group
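
Counts and percentages like these can be produced with table() and prop.table(). The sketch below uses a small made-up data frame; the column names group and sex and all values are assumptions for illustration only.

```r
# made-up data: 3 people in group A, 5 in group B
d <- data.frame(
  group = c("A", "A", "A", "B", "B", "B", "B", "B"),
  sex   = c("F", "M", "F", "M", "M", "F", "M", "M")
)

# number of females and males in each group
counts <- table(group = d$group, sex = d$sex)
counts

# percentage within each group (margin = 1 makes each row sum to 100)
pct <- round(100 * prop.table(counts, margin = 1), 1)
pct
```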

54 of 54

Size

Missingness

Central Tendency

Shape

Variability

Descriptive Analysis