1 of 41

Big Data Modeling with R

2 of 41

About Me

Elijah Appiah from Ghana.

Ph.D. Economics at NIDA in Bangkok, Thailand.

Economist by profession and Data Scientist by passion.

Enthusiastic about working with data daily.

Technical skills in LaTeX, Microsoft Office (Word, Excel, PowerPoint), SPSS, Stata, EViews, Python, R, MATLAB, Power BI, Tableau, and Google TensorFlow.

3 of 41

IDE and Packages for this Lecture

  • Main IDE for R is RStudio.
  • Packages:

benchmarkme

tidyverse

data.table

sparklyr

tidymodels

GGally

ggExtra

4 of 41

What is Big Data?

  • The three V’s of big data:
    • Volume: amount of data.
    • Variety: number of types of data.
    • Velocity: speed of data processing.

[Source: https://www.techtarget.com/whatis/definition/3Vs]

5 of 41

Why is Big Data Important?

  • Netflix: uses subscription and customer-rating data for recommendation purposes.
  • Target: a big retailer in the US, uses data mining techniques to predict which of its shoppers are pregnant, and sends them sale booklets for baby clothes, diapers, etc.
  • Car insurance companies analyze big data to understand how well their customers actually drive and how much to charge each customer to make a profit.
  • In education, big data includes student records, teacher observations, assessment results, and more.
  • New technologies, such as facial recognition software and biometric sensors, generate visual and audio data.

6 of 41

Analyzing Big Data - Steps

  1. Read and extract the data.
  2. Extract a subset, sample, or summary from the big data.
  3. Repeat computation (e.g. fit a model) for many subgroups of the data.
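The three steps above can be sketched in base R. This is a minimal illustration using the built-in `mtcars` dataset as a stand-in for a large file:

```r
# 1. Read/extract the data (for a real file: read.csv(), fread(), etc.).
dat <- mtcars

# 2. Extract a subset, sample, or summary from the data.
samp <- dat[sample(nrow(dat), 10), ]                  # random sample of 10 rows
summ <- aggregate(mpg ~ cyl, data = dat, FUN = mean)  # grouped summary

# 3. Repeat a computation (fit a model) for each subgroup.
fits <- lapply(split(dat, dat$cyl), function(d) lm(mpg ~ wt, data = d))
sapply(fits, function(m) coef(m)["wt"])  # slope of mpg on wt per cylinder group
```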

7 of 41

Analyzing Big Data

  • You may need to store big data in a data warehouse (local database or cloud).

  • Then pass subsets of data from the warehouse to the local machine where the data are being analyzed.

  • What are the right tools for analyzing big data?

8 of 41

Analyzing Big Data

9 of 41

Analyzing Big Data

  • 18 Best Open Source Tools for Big Data

https://www.datastackhub.com/top-tools/open-source-big-data-tools/

10 of 41

Analyzing Big Data – Suggestions

  • Obtain a powerful computer (more and faster CPU cores, more memory).

  • If memory is a problem, then access the data differently or split up the data.

  • Preview/visualize a subset of the big data programmatically, rather than opening the entire raw dataset.

  • Consider parallel computing and cloud computing.

  • Profile big tasks (in R) to cut down on computational time.

11 of 41

Now, let’s practice

12 of 41

Analyzing Big Data - Benchmarking

  • Assess the speed and performance of your system and compare the results with other systems.

benchmark_std() {benchmarkme}

benchmark_io() {benchmarkme}
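A short sketch of how these functions are typically called (assumes the `benchmarkme` package is installed; the benchmarks take a little while to run):

```r
# install.packages("benchmarkme")
library(benchmarkme)

get_ram()  # report installed RAM
get_cpu()  # report CPU model

# Standard CPU benchmarks (matrix calculations, programming loops).
res_cpu <- benchmark_std(runs = 1)

# I/O benchmark: time reading/writing a ~5 MB file.
res_io <- benchmark_io(runs = 1, size = 5)
```

The returned objects can be plotted to compare your machine against results uploaded by other users.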

13 of 41

Now, let’s practice

14 of 41

Analyzing Big Data – Exploratory Data Analysis (EDA)

Framework for data science:

Grolemund & Wickham (2017)

15 of 41

Analyzing Big Data – Data Wrangling with `data.table` Package

  • Data wrangling is a general term that refers to transforming data.
  • It involves subsetting, recoding, and transforming variables.

data.table {data.table}

16 of 41

Analyzing Big Data

Source: https://tinyurl.com/y366kvfx

17 of 41

Analyzing Big Data – Data Wrangling with `data.table` Package

  • This is how data.table works on a dataframe:

Source: https://tinyurl.com/yyepwjpt
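The general form is `DT[i, j, by]`: filter rows with `i`, compute with `j`, and group with `by`. A minimal sketch using `mtcars` (assumes `data.table` is installed):

```r
# install.packages("data.table")
library(data.table)

dt <- as.data.table(mtcars)

dt[cyl == 4]                  # i: subset rows
dt[, .(avg_mpg = mean(mpg))]  # j: compute a summary

# All three at once: among cars with mpg > 20, average weight by cylinders.
out <- dt[mpg > 20, .(avg_wt = mean(wt)), by = cyl]
out
```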

18 of 41

Now, let’s practice

19 of 41

Analyzing Big Data – The `sparklyr` Package

  • Spark is a cluster computing platform.
  • Data and computations spread across several machines.
    • Removes the single-machine limit on the size of your datasets.

20 of 41

Analyzing Big Data – The `sparklyr` Package

  • `sparklyr` is R’s package that helps you to access Spark.
  • You get R’s easy-to-write syntax plus Spark’s large-scale data handling; `sparklyr` uses `dplyr` syntax.

21 of 41

Analyzing Big Data – The `sparklyr` Package

  • To use Apache Spark, install “Java 8 JDK”.

https://www.oracle.com/java/technologies/downloads/?er=221886#java8

22 of 41

Analyzing Big Data – The `sparklyr` Package

  • Install the package on R.

install.packages("sparklyr")

library(sparklyr)

    • Then run this code:

spark_install()

  • Connect-work-disconnect
    • Connect using spark_connect()
    • Work with it
    • Disconnect using spark_disconnect()
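The connect-work-disconnect pattern looks like this. A sketch only: it requires Java 8 and a local Spark installation (via `spark_install()`):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")    # 1. connect

cars_tbl <- copy_to(sc, mtcars, "cars")  # copy an R data frame into Spark

cars_tbl %>%                             # 2. work: dplyr verbs run in Spark
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                              # pull the results back into R

spark_disconnect(sc)                     # 3. disconnect
```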

23 of 41

Analyzing Big Data – The `sparklyr` Package

The `dplyr` syntax

  • Select columns with select()
  • Filter rows with filter()
  • Arrange rows with arrange()
  • Change or add columns with mutate()
  • Calculate summaries using summarize()
    • Can be used with group_by() for grouped summaries.
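The verbs above chain together with the pipe. A minimal sketch on `mtcars` (assumes `dplyr` is installed; the `wt` column is in units of 1000 lb):

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, wt, hp) %>%     # keep only these columns
  filter(hp > 100) %>%             # rows with more than 100 horsepower
  mutate(wt_kg = wt * 453.6) %>%   # add a column (convert 1000 lb to kg)
  arrange(desc(mpg)) %>%           # sort by fuel efficiency
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg))   # grouped summary

result
```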

24 of 41

Now, let’s practice

25 of 41

Visualizing Big Data – The `ggplot2` Package

  • 7 Layers of `ggplot2` Package:
    1. Data
    2. Aesthetics
    3. Geometries
    4. Facets
    5. Coordinates
    6. Statistics
    7. Themes
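Each layer is added with `+`. A sketch touching all seven layers (assumes `ggplot2` is installed):

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # 1. data, 2. aesthetics
  geom_point(aes(color = factor(cyl))) +      # 3. geometries
  facet_wrap(~ am) +                          # 4. facets
  coord_cartesian(ylim = c(10, 35)) +         # 5. coordinates
  geom_smooth(method = "lm", se = FALSE) +    # 6. statistics (fitted line)
  theme_minimal()                             # 7. themes

# print(p)  # render the plot
```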

26 of 41

Visualizing Big Data – The `ggplot2` Package

  • Some plots include:
    • Marginal plots – distribution of individual variables in dataset.
    • Conditional plots – examine relationship between continuous variables.
    • Plots for examining correlations – scatterplot matrix, correlation matrix
    • Plots for ordinal/categorical variables – barplots, alluvium plots

  • Other packages to supplement our visualization

{GGally}, {ggExtra}, {ggalluvial}

27 of 41

Now, let’s practice

28 of 41

Modeling Big Data – Machine Learning (ML)

  • Traditional modeling techniques can struggle to find information from big data.
  • Many turn to ML to generate insights from big data.

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions.

Source: Wikipedia

29 of 41

Modeling Big Data – Machine Learning (ML)

  • Main Types of Machine Learning
    • Supervised ML: learns from labeled data. The algorithm is given both input and output variables (the “training data”), learns patterns from them, and makes inferences on unseen data (the “test data”).

    • Unsupervised ML: learns from unlabeled data. The algorithm is given only input data, from which it identifies patterns on its own.
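A base-R illustration of the two types on the built-in `iris` data, using a linear model as the supervised learner and k-means as the unsupervised one:

```r
# Supervised: inputs plus a known output to learn from.
sup_model <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = iris)
preds <- predict(sup_model, newdata = iris)  # predictions from learned patterns

# Unsupervised: inputs only, no labels; k-means searches for 3 clusters.
set.seed(42)
unsup_model <- kmeans(iris[, 1:4], centers = 3)
table(unsup_model$cluster)  # cluster sizes found by the algorithm
```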

30 of 41

Machine Learning Algorithms

Supervised ML

  • Regression
    • Linear Regression
    • Support Vector Regression
    • Decision Tree Regression
    • Random Forest Regression
  • Classification
    • Logistic Regression
    • K-Nearest Neighbors (K-NN)
    • Support Vector Machine (SVM)
    • Naïve Bayes
    • Decision Tree Classification
    • Random Forest Classification

Unsupervised ML

  • Clustering
    • K-Means Clustering
    • Hierarchical Clustering
  • Anomaly Detection

31 of 41

Model Evaluation - Classification

Confusion Matrix for Classification Problems

  • Confusion matrix: a matrix of counts of all combinations of actual and predicted outcomes.

  • Accuracy is the proportion of all cases that were correctly classified.

  • Sensitivity is the proportion of all positive cases that were correctly classified.

  • Specificity is the proportion of all negative cases that were correctly classified.

  • False positive rate (FPR) is the proportion of actual negative cases incorrectly classified as positive (FPR = 1 − specificity).
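These metrics follow directly from the four cells of a 2×2 confusion matrix. A base-R sketch with hypothetical counts:

```r
# Hypothetical confusion-matrix counts.
TP <- 40; FN <- 10   # actual positives: correctly / incorrectly classified
FP <- 5;  TN <- 45   # actual negatives: incorrectly / correctly classified

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
fpr         <- FP / (FP + TN)   # = 1 - specificity

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, fpr = fpr)
```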

32 of 41

Model Evaluation - Classification

Confusion Matrix for Classification Problems

33 of 41

Model Evaluation - Classification

  • The Receiver Operating Characteristic (ROC) curve visualizes model performance across classification probability thresholds.

  • The ROC curve is summarized by the area under the curve (ROC-AUC).
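The AUC equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. A base-R sketch on simulated scores (the rank/Mann-Whitney formulation):

```r
set.seed(1)
actual <- c(rep(1, 50), rep(0, 50))
scores <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # simulated model scores

pos <- scores[actual == 1]
neg <- scores[actual == 0]

# Fraction of positive-negative pairs ranked correctly (ties count half).
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # well above 0.5, since the groups are separated
```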

34 of 41

Model Evaluation – Regression

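The slide's formulas did not survive extraction; below is a base-R sketch of commonly used regression evaluation metrics (MAE, MSE, RMSE, R²) on hypothetical values:

```r
actual    <- c(3.0, 5.0, 7.5, 9.0, 11.0)
predicted <- c(2.8, 5.5, 7.0, 9.4, 10.6)

err  <- actual - predicted
mae  <- mean(abs(err))   # Mean Absolute Error
mse  <- mean(err^2)      # Mean Squared Error
rmse <- sqrt(mse)        # Root Mean Squared Error
r2   <- 1 - sum(err^2) / sum((actual - mean(actual))^2)  # R-squared

c(MAE = mae, RMSE = rmse, R2 = r2)
```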

35 of 41

Typical Machine Learning Pipeline

36 of 41

Key Issues in ML

  • Generalization is how well an ML model performs on unseen (test) data.
  • Overfitting
    • A model that performs too well on the training data but poorly on the test data.
  • Underfitting
    • A model that does not learn the training data well and therefore does not generalize well to the test data.
  • Both overfitting and underfitting may cause poor performance of ML algorithms.

  • Some Remedies:
    • Resampling technique or cross-validation
    • Validation dataset
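Cross-validation estimates out-of-sample performance by repeatedly holding out part of the data. A base-R sketch of 5-fold cross-validation on `mtcars`:

```r
set.seed(123)
k <- 5
# Randomly assign each row to one of k folds.
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

rmse_per_fold <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]  # fit on all folds except fold i
  test  <- mtcars[folds == i, ]  # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

mean(rmse_per_fold)  # cross-validated RMSE estimate
```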

37 of 41

Modeling Big Data – Machine Learning (ML)

  • Main Package for ML in R:

tidymodels
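A minimal `tidymodels` workflow, sketched on `mtcars` (assumes the package is installed): split the data, specify and fit a model, then evaluate on the test set.

```r
library(tidymodels)

set.seed(123)
data_split <- initial_split(mtcars, prop = 0.8)  # train/test split
train_df   <- training(data_split)
test_df    <- testing(data_split)

model <- linear_reg() %>%        # model specification
  set_engine("lm") %>%           # computational engine
  fit(mpg ~ wt + hp, data = train_df)

preds <- predict(model, new_data = test_df)  # predictions on test data

res <- bind_cols(test_df, preds) %>%
  metrics(truth = mpg, estimate = .pred)     # rmse, rsq, mae
res
```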

38 of 41

39 of 41

Big Data Modeling – Project Time

TIME FOR A PROJECT!

40 of 41

Now, let’s practice

41 of 41

THANK YOU