Exploratory Data Analysis
Using R
- Satishkumar L. Varma
BDA
Introduction to Data Science: Exploratory data analysis
Satishkumar Varma, PCE
dim() is used to obtain the dimensions of a data frame (number of rows and number of columns). The output is a vector of length 2.
> dim(InsectSprays)
[1] 72 2
nrow() and ncol() return the number of rows and the number of columns, respectively. The same information can be obtained by extracting the first and second elements of the vector returned by dim().
> nrow(InsectSprays)  # same as dim(InsectSprays)[1]
[1] 72
> ncol(InsectSprays)  # same as dim(InsectSprays)[2]
[1] 2
head() is used to obtain the first n observations and tail() the last n observations; by default, n = 6.
These commands give an intuitive idea of what the data look like without printing the entire data set, which could have millions of rows and thousands of columns.
> head(InsectSprays, n = 5)
count spray
1 10 A
2 7 A
3 20 A
4 14 A
5 14 A
Let s be the number of observations. If a negative number is given for the “n” argument of head(), the first s+n observations are returned; that is, all but the last |n|.
Example: since s = 72 and n = -62, the following command returns the first 10 observations, because s+n = 72 + (-62) = 10.
> head(InsectSprays, n = -62)
count spray
1 10 A
2 7 A
3 20 A
4 14 A
5 14 A
6 12 A
7 10 A
8 23 A
9 17 A
10 20 A
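The s+n rule above can be checked directly in R. This is a minimal sketch; the variable names s, n and first_rows are chosen here for illustration and are not part of the original slides.

```r
s <- nrow(InsectSprays)   # number of observations (72)
n <- -62                  # negative n: drop the last 62 rows

first_rows <- head(InsectSprays, n = n)

# Asking for the first s + n rows directly should give the same data frame
identical(first_rows, head(InsectSprays, n = s + n))
nrow(first_rows)          # 10
```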
Analogously, if a negative number is given for the “n” argument of tail(), the last s+n observations are returned; that is, all but the first |n|.
Example: the following command returns the last 10 observations.
> tail(InsectSprays, n = -62)
count spray
63 15 F
64 22 F
65 15 F
66 16 F
67 13 F
68 10 F
69 26 F
70 26 F
71 24 F
72 13 F
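Note that the output keeps the original row labels (63 to 72) rather than renumbering from 1. A short sketch that checks this, with last_rows as an illustrative variable name:

```r
last_rows <- tail(InsectSprays, n = -62)   # drop the first 62 rows

nrow(last_rows)                            # 10
rownames(last_rows)                        # original row labels are kept

# Same result as asking for the last 10 rows directly
identical(last_rows, tail(InsectSprays, n = 10))
```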
The names() function returns the column headers.
> names(InsectSprays)
[1] "count" "spray"
The str() function displays a compact summary of the data frame's structure, including the dimensions and column names shown above and the data type of each column.
Example: “num” denotes that the variable “count” is numeric (continuous), and “Factor” denotes that the variable “spray” is categorical with 6 categories, or levels.
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
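str() only prints its result. When the column types are needed as values, for example to branch on them in a script, one possible approach is sketched below; col_classes is an illustrative name, not something from the slides.

```r
# Class of each column, returned as a named character vector
col_classes <- sapply(InsectSprays, class)
col_classes

is.numeric(InsectSprays$count)   # TRUE for the "num" column
is.factor(InsectSprays$spray)    # TRUE for the "Factor" column
nlevels(InsectSprays$spray)      # number of levels reported by str()
```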
To obtain all of the categories or levels of a categorical variable, use the levels() function.
> levels(InsectSprays$spray)
[1] "A" "B" "C" "D" "E" "F"
For a data frame, the summary() function is applied to each column, and the results for all columns are shown together.
For a continuous (numeric) variable like “count”, it returns the minimum, first quartile, median, mean, third quartile and maximum (the 5-number summary plus the mean).
Note that fivenum() and summary() can return different values: fivenum() reports Tukey's hinges, whereas summary() reports quartiles computed by quantile().
If there are any missing values (denoted by “NA”), summary() also reports how many there are.
Example: there are no missing values for “count”, so no count of NA's is displayed.
For a categorical variable like “spray”, it returns the levels and the number of observations in each level.
> summary(InsectSprays)
     count       spray
 Min.   : 0.00   A:12
 1st Qu.: 3.00   B:12
 Median : 7.00   C:12
 Mean   : 9.50   D:12
 3rd Qu.:14.25   E:12
 Max.   :26.00   F:12
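The difference between fivenum() and summary(), and the way missing values are reported, can be explored with a short sketch. The copy counts_na and its single NA are introduced here purely for illustration; the original data set has no missing values.

```r
counts <- InsectSprays$count

five <- fivenum(counts)   # min, lower hinge, median, upper hinge, max
summ <- summary(counts)   # min, 1st quartile, median, mean, 3rd quartile, max

five
summ                      # compare the hinges with the quartiles

# Hypothetical missing value, set on a copy so InsectSprays is unchanged
counts_na <- counts
counts_na[1] <- NA
summary(counts_na)        # now includes an "NA's" entry
```

For this data set the upper hinge and the third quartile differ slightly, because the two functions interpolate between the same pair of sorted values in different ways.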
x[i]
x[i, j]
x[[i]]
x[[i, j]]
x$a
x$"a"
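The six forms above are R's subsetting operators. As a hedged illustration, here they are applied to InsectSprays, with x standing for the data frame and the column “count” standing in for the generic name a; the result names (col_df, cell and so on) are assumptions made for this sketch.

```r
x <- InsectSprays

col_df   <- x[2]         # x[i]: data frame holding only the second column
cell     <- x[3, 1]      # x[i, j]: value in row 3, column 1
col_vec  <- x[[1]]       # x[[i]]: first column as a plain vector
cell2    <- x[[3, 1]]    # x[[i, j]]: single element in row 3, column 1
by_name  <- x$count      # x$a: column selected by name
by_quote <- x$"count"    # x$"a": same, with the name quoted

is.data.frame(col_df)    # single brackets keep the data frame structure
identical(col_vec, by_name)
```

Single brackets with one index keep the data frame wrapper, while double brackets and $ extract the column itself.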