1 of 41

Big Data Modeling with R

2 of 41

About Me

Elijah Appiah from Ghana.

Ph.D. Economics at NIDA in Bangkok, Thailand.

Economist by profession and Data Scientist by passion.

Enthusiastic about working with data daily.

Technical skills in LaTeX, Microsoft Office (Word, Excel, PowerPoint), SPSS, Stata, EViews, Python, R, MATLAB, Power BI, Tableau, and Google TensorFlow.

3 of 41

IDE and Packages for this Lecture

  • Main IDE for R is RStudio.
  • Packages:

benchmarkme

tidyverse

data.table

sparklyr

tidymodels

GGally

ggExtra

4 of 41

What is Big Data?

  • The three V’s of big data:
    • Volume: amount of data.
    • Variety: number of types of data.
    • Velocity: speed of data processing.

[Source: https://www.techtarget.com/whatis/definition/3Vs]

5 of 41

Why is Big Data Important?

  • Netflix: uses subscription and customer-rating data for recommendation purposes.
  • Target: a big retailer in the US, uses data mining techniques to predict which of its shoppers are pregnant, and sends them sale booklets for baby clothes, diapers, etc.
  • Car insurance companies analyze big data to understand how well their customers actually drive and how much to charge each customer to make a profit.
  • In education, big data includes student records, teacher observations, assessment results, and more.
  • New technologies, such as facial recognition software and biometric sensors, generate visual and audio data.

6 of 41

Analyzing Big Data - Steps

  1. Read and extract the data.
  2. Extract a subset, sample, or summary from the big data.
  3. Repeat computation (e.g. fit a model) for many subgroups of the data.
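The three steps above can be sketched in base R. This is a minimal illustration using the built-in `mtcars` dataset as a stand-in for a large file:

```r
# 1. Read/extract the data (for a real file: read.csv(), fread(), etc.).
dat <- mtcars

# 2. Extract a subset, sample, or summary from the data.
samp <- dat[sample(nrow(dat), 10), ]                  # random sample of 10 rows
summ <- aggregate(mpg ~ cyl, data = dat, FUN = mean)  # grouped summary

# 3. Repeat a computation (fit a model) for each subgroup.
fits <- lapply(split(dat, dat$cyl), function(d) lm(mpg ~ wt, data = d))
sapply(fits, function(m) coef(m)["wt"])  # slope of mpg on wt per cylinder group
```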

7 of 41

Analyzing Big Data

  • You may need to store big data in a data warehouse (local database or cloud).

  • Then pass subsets of data from the warehouse to the local machine where the data are being analyzed.

  • What are the right tools for analyzing big data?

8 of 41

Analyzing Big Data

9 of 41

Analyzing Big Data

  • 18 Best Open Source Tools for Big Data

https://www.datastackhub.com/top-tools/open-source-big-data-tools/

10 of 41

Analyzing Big Data – Suggestions

  • Obtain a powerful computer (more and faster CPU cores, more memory).

  • If memory is a problem, then access the data differently or split up the data.

  • Preview/visualize a subset of the big data programmatically, rather than opening the entire raw dataset.

  • Consider parallel computing and cloud computing.

  • Profile big tasks (in R) to cut down on computational time.

11 of 41

Now, let’s practice

12 of 41

Analyzing Big Data - Benchmarking

  • Assess the speed and performance of your system and compare the results with other systems.

benchmark_std() {benchmarkme}

benchmark_io() {benchmarkme}
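A short sketch of how these functions are typically called (assumes the `benchmarkme` package is installed; the benchmarks take a little while to run):

```r
# install.packages("benchmarkme")
library(benchmarkme)

get_ram()  # report installed RAM
get_cpu()  # report CPU model

# Standard CPU benchmarks (matrix calculations, programming loops).
res_cpu <- benchmark_std(runs = 1)

# I/O benchmark: time reading/writing a ~5 MB file.
res_io <- benchmark_io(runs = 1, size = 5)
```

The returned objects can be plotted to compare your machine against results uploaded by other users.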

13 of 41

Now, let’s practice

14 of 41

Analyzing Big Data – Exploratory Data Analysis (EDA)

Framework for data science:

Grolemund & Wickham (2017)

15 of 41

Analyzing Big Data – Data Wrangling with `data.table` Package

  • Data wrangling is a general term that refers to transforming data.
  • It involves subsetting, recoding, and transforming variables.

data.table {data.table}

16 of 41

Analyzing Big Data

Source: https://tinyurl.com/y366kvfx

17 of 41

Analyzing Big Data – Data Wrangling with `data.table` Package

  • This is how data.table works on a dataframe:

Source: https://tinyurl.com/yyepwjpt
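The general form is `DT[i, j, by]`: filter rows with `i`, compute with `j`, and group with `by`. A minimal sketch using `mtcars` (assumes `data.table` is installed):

```r
# install.packages("data.table")
library(data.table)

dt <- as.data.table(mtcars)

dt[cyl == 4]                  # i: subset rows
dt[, .(avg_mpg = mean(mpg))]  # j: compute a summary

# All three at once: among cars with mpg > 20, average weight by cylinders.
out <- dt[mpg > 20, .(avg_wt = mean(wt)), by = cyl]
out
```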

18 of 41

Now, let’s practice

19 of 41

Analyzing Big Data – The `sparklyr` Package

  • Spark is a cluster computing platform.
  • Data and computations spread across several machines.
    • Removes the single-machine limit on the size of your datasets.

20 of 41

Analyzing Big Data – The `sparklyr` Package

  • `sparklyr` is R’s package that helps you to access Spark.
  • You get R’s easy-to-write syntax plus Spark’s large-scale data handling; `sparklyr` uses `dplyr` syntax.

21 of 41

Analyzing Big Data – The `sparklyr` Package

  • To use Apache Spark, install “Java 8 JDK”.

https://www.oracle.com/java/technologies/downloads/?er=221886#java8

22 of 41

Analyzing Big Data – The `sparklyr` Package

  • Install the package on R.

install.packages("sparklyr")

library(sparklyr)

    • Then run this code:

spark_install()

  • Connect-work-disconnect
    • Connect using spark_connect()
    • Work with it
    • Disconnect using spark_disconnect()
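The connect-work-disconnect pattern looks like this. A sketch only: it requires Java 8 and a local Spark installation (via `spark_install()`):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")    # 1. connect

cars_tbl <- copy_to(sc, mtcars, "cars")  # copy an R data frame into Spark

cars_tbl %>%                             # 2. work: dplyr verbs run in Spark
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                              # pull the results back into R

spark_disconnect(sc)                     # 3. disconnect
```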

23 of 41

Analyzing Big Data – The `sparklyr` Package

The `dplyr` syntax

  • Select columns with select()
  • Filter rows with filter()
  • Arrange rows with arrange()
  • Change or add columns with mutate()
  • Calculate summaries using summarize()
    • Can be used with group_by() for grouped summaries.
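The verbs above chain together with the pipe. A minimal sketch on `mtcars` (assumes `dplyr` is installed; the `wt` column is in units of 1000 lb):

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, wt, hp) %>%     # keep only these columns
  filter(hp > 100) %>%             # rows with more than 100 horsepower
  mutate(wt_kg = wt * 453.6) %>%   # add a column (convert 1000 lb to kg)
  arrange(desc(mpg)) %>%           # sort by fuel efficiency
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))   # grouped summary

result
```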

24 of 41

Now, let’s practice

25 of 41

Visualizing Big Data – The `ggplot2` Package

  • 7 Layers of `ggplot2` Package:
    1. Data
    2. Aesthetics
    3. Geometries
    4. Facets
    5. Coordinates
    6. Statistics
    7. Themes
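Each layer is added with `+`. A sketch touching all seven layers (assumes `ggplot2` is installed):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # 1. data, 2. aesthetics
  geom_point(aes(color = factor(cyl))) +      # 3. geometries
  facet_wrap(~ am) +                          # 4. facets
  coord_cartesian(ylim = c(10, 35)) +         # 5. coordinates
  geom_smooth(method = "lm", se = FALSE) +    # 6. statistics (fitted line)
  theme_minimal()                             # 7. themes

# print(p)  # render the plot
```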

26 of 41

Visualizing Big Data – The `ggplot2` Package

  • Some plots include:
    • Marginal plots – distribution of individual variables in dataset.
    • Conditional plots – examine relationship between continuous variables.
    • Plots for examining correlations – scatterplot matrix, correlation matrix
    • Plots for ordinal/categorical variables – barplots, alluvium plots

  • Other packages to supplement our visualization

{GGally}, {ggExtra}, {ggalluvial}

27 of 41

Now, let’s practice

28 of 41

Modeling Big Data – Machine Learning (ML)

  • Traditional modeling techniques can struggle to find information from big data.
  • Many turn to ML to generate insights from big data.

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions.

Source: Wikipedia

29 of 41

Modeling Big Data – Machine Learning (ML)

  • Main Types of Machine Learning
    • Supervised ML: learns from labeled data. The algorithm is given both input and output variables (the “training data”), learns patterns from them, and makes inferences on unseen data (the “test data”).

    • Unsupervised ML: learns from unlabeled data. The algorithm is given only input data, from which it identifies patterns on its own.
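A base-R illustration of the two types on the built-in `iris` data, using a linear model as the supervised learner and k-means as the unsupervised one:

```r
# Supervised: inputs plus a known output to learn from.
sup_model <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
preds <- predict(sup_model, newdata = iris)  # predictions from learned patterns

# Unsupervised: inputs only, no labels; k-means searches for 3 clusters.
set.seed(42)
unsup_model <- kmeans(iris[, 1:4], centers = 3)
table(unsup_model$cluster)  # cluster sizes found by the algorithm
```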

30 of 41

Machine Learning Algorithms

Supervised ML

  • Regression
    • Linear Regression
    • Support Vector Regression
    • Decision Tree Regression
    • Random Forest Regression
  • Classification
    • Logistic Regression
    • K-Nearest Neighbors (K-NN)
    • Support Vector Machine (SVM)
    • Naïve Bayes
    • Decision Tree Classification
    • Random Forest Classification

Unsupervised ML

  • Clustering
    • K-Means Clustering
    • Hierarchical Clustering
  • Anomaly Detection

31 of 41

Model Evaluation - Classification

Confusion Matrix for Classification Problems

  • Confusion matrix: a matrix of counts of all combinations of actual and predicted outcomes.

  • Accuracy is the proportion of all cases that were correctly classified.

  • Sensitivity is the proportion of all positive cases that were correctly classified.

  • Specificity is the proportion of all negative cases that were correctly classified.

  • False positive rate (FPR) is the proportion of actual negative cases incorrectly classified as positive (FPR = 1 − specificity).
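These metrics follow directly from the four cells of a 2×2 confusion matrix. A base-R sketch with hypothetical counts:

```r
# Hypothetical confusion-matrix counts.
TP <- 40; FN <- 10   # actual positives: correctly / incorrectly classified
FP <- 5;  TN <- 45   # actual negatives: incorrectly / correctly classified

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
fpr         <- FP / (FP + TN)   # = 1 - specificity

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, fpr = fpr)
```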

32 of 41

Model Evaluation - Classification

Confusion Matrix for Classification Problems

33 of 41

Model Evaluation - Classification

  • The Receiver Operating Characteristic (ROC) curve visualizes model performance across classification probability thresholds.

  • The ROC curve is summarized by the area under the curve (ROC-AUC).
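The AUC equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. A base-R sketch on simulated scores (the rank/Mann-Whitney formulation):

```r
set.seed(1)
actual <- c(rep(1, 50), rep(0, 50))
scores <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # simulated model scores

pos <- scores[actual == 1]
neg <- scores[actual == 0]

# Fraction of positive-negative pairs ranked correctly (ties count half).
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # well above 0.5, since the groups are separated
```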

34 of 41

Model Evaluation – Regression

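The slide's formulas did not survive extraction; below is a base-R sketch of commonly used regression evaluation metrics (MAE, MSE, RMSE, R²) on hypothetical values:

```r
actual    <- c(3.0, 5.0, 7.5, 9.0, 11.0)
predicted <- c(2.8, 5.5, 7.0, 9.4, 10.6)

err  <- actual - predicted
mae  <- mean(abs(err))   # Mean Absolute Error
mse  <- mean(err^2)      # Mean Squared Error
rmse <- sqrt(mse)        # Root Mean Squared Error
r2   <- 1 - sum(err^2) / sum((actual - mean(actual))^2)  # R-squared

c(MAE = mae, RMSE = rmse, R2 = r2)
```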

35 of 41

Typical Machine Learning Pipeline

36 of 41

Key Issues in ML

  • Generalization is how well an ML model performs on unseen (test) data.
  • Overfitting
    • A model that performs too well on the training data but poorly on the test data.
  • Underfitting
    • A model that does not learn the training data well and therefore does not generalize well to the test data.
  • Both overfitting and underfitting may cause poor performance of ML algorithms.

  • Some Remedies:
    • Resampling technique or cross-validation
    • Validation dataset
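Cross-validation estimates out-of-sample performance by repeatedly holding out part of the data. A base-R sketch of 5-fold cross-validation on `mtcars`:

```r
set.seed(123)
k <- 5
# Randomly assign each row to one of k folds.
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

rmse_per_fold <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]  # fit on all folds except fold i
  test  <- mtcars[folds == i, ]  # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

mean(rmse_per_fold)  # cross-validated RMSE estimate
```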

37 of 41

Modeling Big Data – Machine Learning (ML)

  • Main Package for ML in R:

tidymodels
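A minimal `tidymodels` workflow, sketched on `mtcars` (assumes the package is installed): split the data, specify and fit a model, then evaluate on the test set.

```r
library(tidymodels)

set.seed(123)
data_split <- initial_split(mtcars, prop = 0.8)  # train/test split
train_df   <- training(data_split)
test_df    <- testing(data_split)

model <- linear_reg() %>%        # model specification
  set_engine("lm") %>%           # computational engine
  fit(mpg ~ wt + hp, data = train_df)

preds <- predict(model, new_data = test_df)  # predictions on test data

res <- bind_cols(test_df, preds) %>%
  metrics(truth = mpg, estimate = .pred)     # rmse, rsq, mae
res
```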

38 of 41

39 of 41

Big Data Modeling – Project Time

TIME FOR A PROJECT!

40 of 41

Now, let’s practice

41 of 41

THANK YOU