1 of 11

Conformal Prediction in Spark

Tutorial session - COPA 2017

Marco Capuccini

PharmB.io

Uppsala University, Sweden

2 of 11

Who am I?

Background

  • Computer Science
  • Bioinformatics

PhD student – Uppsala University

Department of Information Technology

Department of Pharmaceutical Biosciences

[Map: Rome → Uppsala]

3 of 11

Today’s plan

  1. Introduction to Apache Spark
  2. Demo: CP in Spark using Scala-CP
    1. GitHub: https://github.com/mcapuccini/scala-cp
    2. M. Capuccini, L. Carlsson, U. Norinder and O. Spjuth, "Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence," 2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), Limassol, 2015, pp. 61-67.
  3. Hands-on/Hackathon

Takeaways: how to build large-scale CP, and how to run large-scale interactive analysis and visualization
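
Ahead of the Scala-CP demo, the idea behind inductive (split) conformal prediction can be sketched in plain Scala, with no Spark or Scala-CP dependency. All names here are illustrative stand-ins, and the nonconformity score (distance to the class mean) is a toy substitute for the model-based scores a real library would use:

```scala
// Toy inductive (split) conformal classifier — no external libraries.
// Nonconformity score: distance to the mean of the hypothesised class.

case class Example(x: Double, label: Int)

// Synthetic 1-D data: class 0 centred near 0.0, class 1 near 5.0
val properTraining = Seq(
  Example(-0.2, 0), Example(0.1, 0), Example(0.3, 0),
  Example(4.8, 1), Example(5.1, 1), Example(5.3, 1))
val calibration = Seq(
  Example(0.2, 0), Example(-0.1, 0), Example(5.0, 1), Example(4.9, 1))

// Underlying "model": the mean of each class in the proper training set
def classMean(label: Int): Double = {
  val xs = properTraining.filter(_.label == label).map(_.x)
  xs.sum / xs.size
}

// Nonconformity: how far x sits from the centre of class `label`
def alpha(x: Double, label: Int): Double = math.abs(x - classMean(label))

// p-value of a test object under a hypothesised label, using the
// calibration-set nonconformity scores
def pValue(x: Double, label: Int): Double = {
  val calScores = calibration.map(e => alpha(e.x, e.label))
  val a = alpha(x, label)
  (calScores.count(_ >= a) + 1).toDouble / (calScores.size + 1)
}

// Prediction region at significance level eps: labels with p-value > eps
def predict(x: Double, eps: Double): Set[Int] =
  Set(0, 1).filter(label => pValue(x, label) > eps)
```

Because the calibration scores are computed on held-out data, the region `predict(x, eps)` misses the true label with probability at most `eps` (under exchangeability) — the property that carries over unchanged when the same computation is distributed with Spark.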

5 of 11

Why Apache Spark?

Apache Spark is the most active open-source large-scale data-processing engine

1000+ contributors from over 250 organizations

Originally created to overcome MapReduce's lack of dataset caching

Spark: Cluster Computing with Working Sets, Zaharia et al. (2010)

It allows for interactive analysis
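
The caching idea can be illustrated in plain Scala, without Spark: a MapReduce-style pipeline re-reads and re-parses its input on every query, while a cached ("persisted") dataset is materialised once and reused by later interactive queries. The names and the tiny in-memory "input" below are purely illustrative:

```scala
// Counts how many times the expensive parsing step runs
var parseCount = 0

// Stand-in for raw input on disk
val rawLines = Seq("1,a", "2,b", "3,a", "4,b")

// The expensive step: in real life, a full disk read + parse
def parse(): Seq[(Int, String)] = {
  parseCount += 1
  rawLines.map { l =>
    val parts = l.split(",")
    (parts(0).toInt, parts(1))
  }
}

// Without caching: every query repeats the expensive step
def queryNoCache(tag: String): Int = parse().count(_._2 == tag)
val countA1 = queryNoCache("a")
val countB1 = queryNoCache("b")
val parsesWithoutCache = parseCount

// With caching: materialise once, reuse across queries
// (conceptually what RDD.cache() does across a cluster's memory)
parseCount = 0
lazy val cached = parse()
def queryCached(tag: String): Int = cached.count(_._2 == tag)
val countA2 = queryCached("a")
val countB2 = queryCached("b")
val parsesWithCache = parseCount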

6 of 11

A unified computing engine

[Diagram: Spark Core with the RDD API at the centre; higher-level libraries Spark SQL, Spark Streaming, MLlib and GraphX built on top; pluggable data sources and deployment environments underneath]

7 of 11

Apache Spark architecture (1)

Standalone cluster mode

  • Spark Master: acts as the cluster manager; it maintains the worker quorum and manages cluster resources
  • Spark Worker: receives instructions from the Spark Master and launches Spark Executors

[Diagram: the Driver Program, holding a SparkContext, connects over the network to the Spark Master; the Master manages Spark Workers, each hosting a Spark Executor]

8 of 11

Apache Spark architecture (2)

Execution model

  • Driver Program: the program written by the Spark developer; it allocates a SparkContext, which is the entry point to all of Spark's functionality
  • Spark Executor: a container with an allocated amount of cores and memory; it executes Tasks and stores Data Partitions
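
This execution model can be sketched conceptually in plain Scala: the driver splits a dataset into partitions, runs one task per partition, and merges the partial results. In real Spark the tasks are shipped over the network to executors via the SparkContext; here they simply run locally, and all names are illustrative:

```scala
// Conceptual sketch of Spark's execution model, without Spark.

// The "dataset", split by the driver into partitions
val data = (1 to 12).toVector
val numPartitions = 3
val partitions: Vector[Vector[Int]] =
  data.grouped(data.size / numPartitions).toVector

// A "task": work applied independently to one partition
def task(partition: Vector[Int]): Int = partition.map(x => x * x).sum

// "Executors" would each run one task per partition; here the driver
// runs them itself and then reduces the partial results
val partialResults = partitions.map(task)
val total = partialResults.sum
```

Because each task touches only its own partition, the tasks can run on different executors with no coordination until the final reduce — the property Spark exploits to scale out.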

[Diagram: the Driver Program, holding a SparkContext, connects over the network to the Spark Master; the Master manages Spark Workers, each hosting a Spark Executor]

11 of 11

Questions?