Descriptive Analysis
Data Analysis
[Flowchart: classifying the type of a data analysis]
Decision questions:
Did they summarize the data?
Did they report the summaries without interpretation?
Did they quantify whether the discoveries are likely to hold in a new sample?
Are they trying to predict measurement(s) for individuals?
Are they trying to figure out how changing the average of one measurement affects another?
Is the effect they are looking for an average effect or a deterministic effect?
Possible outcomes: Not a data analysis, Descriptive, Exploratory, Inferential, Predictive, Causal, Mechanistic
Size
Missingness
Central Tendency
Shape
Variability
Descriptive Analysis
2010 US Census Data Summary Table
(broken down by age)
... and stratified by sex
## install and load package
install.packages("ggplot2")
library(ggplot2)
## assign the msleep dataset (included with ggplot2) to object `df`
df <- msleep
Size
The Environment tab in RStudio shows the size of your data frame
dim() returns the number of rows and the number of columns
[Annotated output: size of data frame, variable names, class of each variable, first few values of each variable]
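As a minimal sketch of the size checks described above, the base R calls below report the dimensions and structure of a data frame. It uses the built-in `iris` data frame (which these slides also plot) purely so that it runs without extra packages; the same calls should work on `df`.

```r
## built-in data frame, used here only so the example is self-contained
dat <- iris

## number of rows, then number of columns
size <- dim(dat)
print(size)

## the same information, one piece at a time
n_rows <- nrow(dat)
n_cols <- ncol(dat)

## variable names and the class of each variable
var_names <- names(dat)
var_classes <- sapply(dat, class)
print(var_classes)

## compact overview: size, names, classes, and first few values
str(dat)
```

`str()` is a convenient single call here because it shows all four pieces of information at once.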
Missingness
27 observations of brainwt are missing
That’s 32.5% of the observations in the dataset
## install naniar package
install.packages("naniar")
library(naniar)
## visualize missingness
vis_miss(df)
## visualize relative missingness
gg_miss_var(df) + theme_bw()
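The count and percentage of missing values can also be computed directly in base R. The sketch below uses a small hypothetical vector (`brainwt_example` is made up for illustration, not taken from `df`) so that it runs on its own.

```r
## hypothetical measurements with some missing values (NA)
brainwt_example <- c(0.0155, NA, 0.423, NA, 0.07, 0.0064, NA, 1.32)

## is.na() marks each missing value as TRUE; summing counts them
n_missing <- sum(is.na(brainwt_example))

## the mean of a logical vector is the proportion of TRUE values
pct_missing <- 100 * mean(is.na(brainwt_example))

print(n_missing)
print(pct_missing)
```

Applied to the slides' data, `sum(is.na(df$brainwt))` and `100 * mean(is.na(df$brainwt))` would give the count and percentage for `brainwt`, and `colSums(is.na(df))` does the same count for every column at once.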
Shape
A Normal Distribution
Curve is symmetric around middle
most values are near the central value
except for this bump here...
Approximately Normal, with most values centered around 10
ggplot(df, aes(sleep_total)) +
geom_density()
Although this tail is slightly longer than the other tail
ggplot(iris, aes(Sepal.Width)) +
geom_density()
A Skewed Distribution
skewed left
skewed right
most values fall to one extreme within the range
A Uniform Distribution
distribution of values is constant over the range of the variable
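To make the three shapes concrete, the sketch below simulates one sample of each with base R and compares the mean and median. The sample sizes, parameters, and the choice of an exponential distribution as the right-skewed example are assumptions made for illustration.

```r
## fix the random seed so the simulated values are reproducible
set.seed(42)

normal_x  <- rnorm(10000, mean = 10, sd = 2)  # symmetric around 10
skewed_x  <- rexp(10000, rate = 1)            # long tail to the right
uniform_x <- runif(10000, min = 0, max = 1)   # constant density over [0, 1]

## in a symmetric distribution the mean and median nearly coincide;
## in a right-skewed one the long tail pulls the mean above the median
shape_summary <- data.frame(
  distribution = c("normal", "skewed right", "uniform"),
  mean   = c(mean(normal_x), mean(skewed_x), mean(uniform_x)),
  median = c(median(normal_x), median(skewed_x), median(uniform_x))
)
print(shape_summary)
```

Passing any of these vectors to `plot(density(...))` or `hist(...)` draws the corresponding shape.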
outlier
1000 people between the ages of 18 and 65
1 person aged 95 years old
1 person aged 1 year old
Your sample
Age
Caution: Observations should only be removed from your dataset if you have a valid reason to do so.
Outliers can occur due to...
library(ggplot2)
ggplot(iris, aes(Petal.Length))+
geom_density()
ggplot(iris, aes(Species, Petal.Length))+
geom_boxplot()
IQR
outliers
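A boxplot draws points as outliers when they lie more than 1.5 times the IQR beyond the box. The sketch below applies that rule by hand to a small hypothetical vector of ages; the values are made up, and `quantile()` with its default settings may differ slightly from the hinges a boxplot uses.

```r
## hypothetical ages: nine typical values and one extreme value
ages <- c(10, 12, 12, 13, 14, 15, 15, 16, 18, 95)

## first and third quartiles, and the interquartile range between them
q1  <- quantile(ages, 0.25, names = FALSE)
q3  <- quantile(ages, 0.75, names = FALSE)
iqr <- q3 - q1

## fences: anything outside these limits is flagged as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- ages[ages < lower_fence | ages > upper_fence]
print(outliers)
```

Flagging a value this way only identifies it for inspection; whether to remove it still depends on having a valid reason.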
Central Tendency
1 2 3 4 5 6
The mean is 3.5
Calculating the mean:
1. Add up all the values:
1 + 2 + 3 + 4 + 5 + 6 = 21
2. Divide the sum by the number of observations (6):
21 / 6 = 3.5
1 2 3 3 4 5 6
The mean is 3.43
Calculating the mean:
1. Add up all the values:
1 + 2 + 3 + 3 + 4 + 5 + 6 = 24
2. Divide the sum by the number of observations (7):
24 / 7 = 3.43
1 2 3 4 5 6
The median is 3.5
1 2 3 3 4 5 6
The median is 3
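The worked examples above can be reproduced in base R. This is a small sketch using the same two sets of values; the variable names are chosen here for illustration.

```r
x_even <- c(1, 2, 3, 4, 5, 6)      # six observations
x_odd  <- c(1, 2, 3, 3, 4, 5, 6)   # seven observations

## mean by hand: sum of the values divided by the number of observations
mean_even <- sum(x_even) / length(x_even)   # 21 / 6
mean_odd  <- mean(x_odd)                    # 24 / 7, same calculation via mean()

## median: the middle value, or the average of the two middle values
median_even <- median(x_even)
median_odd  <- median(x_odd)

print(c(mean_even, round(mean_odd, 2)))
print(c(median_even, median_odd))
```

With an even number of observations there is no single middle value, so `median()` averages the two central ones (3 and 4 here).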
ggplot(df, aes(bodywt)) +
geom_histogram()
A few mammals with very large body weights pull the mean far above the median
mean = 166
median = 1.67
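The gap between the mean and the median comes from the mean's sensitivity to extreme values. The sketch below shows the effect on a hypothetical set of body weights (the numbers are invented for illustration and are not taken from `df`).

```r
## hypothetical body weights (kg): four small animals and one very large one
body_wt_example <- c(0.02, 0.5, 2, 14, 5000)

mean_all   <- mean(body_wt_example)
median_all <- median(body_wt_example)

## the same summaries with the largest value left out
without_largest <- body_wt_example[body_wt_example < max(body_wt_example)]
mean_small   <- mean(without_largest)
median_small <- median(without_largest)

## the mean changes dramatically; the median barely moves
print(c(mean_all = mean_all, median_all = median_all))
print(c(mean_small = mean_small, median_small = median_small))
```

On the slides' data, `mean(df$bodywt)` and `median(df$bodywt)` give the two values reported above.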
The mode is the most frequent category
ggplot(df, aes(order)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5))
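The bar chart shows the mode visually; it can also be computed by counting categories. Base R has no built-in statistical mode (its `mode()` function reports how an object is stored), so the sketch below defines a small helper. `most_frequent` and `orders_example` are hypothetical names and data introduced here for illustration.

```r
## return the most frequent value(s) of a vector; ties return all winners
most_frequent <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

## hypothetical categorical variable
orders_example <- c("Rodentia", "Primates", "Rodentia",
                    "Carnivora", "Primates", "Rodentia")

mode_value <- most_frequent(orders_example)
print(mode_value)
```

Calling `most_frequent(df$order)` would apply the same counting to the variable plotted above.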
Variability
A single large value increases the variance
The variance is zero when every value is the same
The standard deviation is the square root of the variance
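The three statements above can be checked with `var()` and `sd()` in base R, which compute the sample variance (dividing by n - 1) and its square root. The vectors below are small made-up examples.

```r
same_values      <- c(5, 5, 5, 5)          # no spread at all
spread_values    <- c(1, 2, 3, 4, 5, 6)    # moderate spread
with_large_value <- c(1, 2, 3, 4, 5, 60)   # one value far from the rest

var_same   <- var(same_values)        # zero: every value equals the mean
var_spread <- var(spread_values)
var_large  <- var(with_large_value)   # inflated by the single large value

## the standard deviation is the square root of the variance
sd_spread <- sd(spread_values)

print(c(var_same, var_spread, var_large))
print(c(sd_spread, sqrt(var_spread)))
```

Because the standard deviation is in the same units as the data, it is usually the easier of the two to interpret.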
central tendency
variability
shape
missingness
The number (and %) of females and males in each group