1 of 11

Conformal Prediction in Spark

Tutorial session - COPA 2017

Marco Capuccini

PharmB.io

Uppsala University, Sweden

2 of 11

Who am I?

Background

  • Computer Science
  • Bioinformatics

PhD student – Uppsala University

Department of Information Technology

Department of Pharmaceutical Biosciences

[Map: Rome → Uppsala]

3 of 11

Today’s plan

  1. Introduction to Apache Spark
  2. Demo: CP in Spark using Scala-CP
    1. GitHub: https://github.com/mcapuccini/scala-cp
    2. M. Capuccini, L. Carlsson, U. Norinder and O. Spjuth, "Conformal Prediction in Spark: Large-Scale Machine Learning with Confidence," 2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), Limassol, 2015, pp. 61-67.
  3. Hands-on/Hackathon

Takeaways: how to build large-scale CP, and how to run large-scale interactive analysis and visualization
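
Ahead of the Scala-CP demo, the idea behind inductive (split) conformal prediction can be sketched in plain Scala, with no Spark or Scala-CP dependency. All names here are illustrative stand-ins, and the nonconformity score (distance to the class mean) is a toy substitute for the model-based scores a real library would use:

```scala
// Toy inductive (split) conformal classifier — no external libraries.
// Nonconformity score: distance to the mean of the hypothesised class.

case class Example(x: Double, label: Int)

// Synthetic 1-D data: class 0 centred near 0.0, class 1 near 5.0
val properTraining = Seq(
  Example(-0.2, 0), Example(0.1, 0), Example(0.3, 0),
  Example(4.8, 1), Example(5.1, 1), Example(5.3, 1))
val calibration = Seq(
  Example(0.2, 0), Example(-0.1, 0), Example(5.0, 1), Example(4.9, 1))

// Underlying "model": the mean of each class in the proper training set
def classMean(label: Int): Double = {
  val xs = properTraining.filter(_.label == label).map(_.x)
  xs.sum / xs.size
}

// Nonconformity: how far x sits from the centre of class `label`
def alpha(x: Double, label: Int): Double = math.abs(x - classMean(label))

// p-value of a test object under a hypothesised label, using the
// calibration-set nonconformity scores
def pValue(x: Double, label: Int): Double = {
  val calScores = calibration.map(e => alpha(e.x, e.label))
  val a = alpha(x, label)
  (calScores.count(_ >= a) + 1).toDouble / (calScores.size + 1)
}

// Prediction region at significance level eps: labels with p-value > eps
def predict(x: Double, eps: Double): Set[Int] =
  Set(0, 1).filter(label => pValue(x, label) > eps)
```

Because the calibration scores are computed on held-out data, the region `predict(x, eps)` misses the true label with probability at most `eps` (under exchangeability) — the property that carries over unchanged when the same computation is distributed with Spark.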

5 of 11

Why Apache Spark?

Apache Spark is the most active open-source large-scale data-processing engine

1000+ contributors from over 250 organizations

Originally created to overcome MapReduce's lack of dataset caching

Spark: Cluster Computing with Working Sets, Zaharia et al. (2010)

It allows for interactive analysis
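
The caching idea can be illustrated in plain Scala, without Spark: a MapReduce-style pipeline re-reads and re-parses its input on every query, while a cached ("persisted") dataset is materialised once and reused by later interactive queries. The names and the tiny in-memory "input" below are purely illustrative:

```scala
// Counts how many times the expensive parsing step runs
var parseCount = 0

// Stand-in for raw input on disk
val rawLines = Seq("1,a", "2,b", "3,a", "4,b")

// The expensive step: in real life, a full disk read + parse
def parse(): Seq[(Int, String)] = {
  parseCount += 1
  rawLines.map { l =>
    val parts = l.split(",")
    (parts(0).toInt, parts(1))
  }
}

// Without caching: every query repeats the expensive step
def queryNoCache(tag: String): Int = parse().count(_._2 == tag)
val countA1 = queryNoCache("a")
val countB1 = queryNoCache("b")
val parsesWithoutCache = parseCount

// With caching: materialise once, reuse across queries
// (conceptually what RDD.cache() does across a cluster's memory)
parseCount = 0
lazy val cached = parse()
def queryCached(tag: String): Int = cached.count(_._2 == tag)
val countA2 = queryCached("a")
val countB2 = queryCached("b")
val parsesWithCache = parseCount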

6 of 11

A unified computing engine

[Diagram: Spark Core with the RDD API at the centre; higher-level libraries Spark SQL, Spark Streaming, MLlib and GraphX built on top; pluggable data sources and deployment environments underneath]

7 of 11

Apache Spark architecture (1)

Standalone cluster mode

  • Spark Master: acts as the cluster manager; it maintains the worker quorum and manages cluster resources
  • Spark Worker: receives instructions from the Spark Master and launches Spark Executors

[Diagram: the Driver Program, holding a SparkContext, connects over the network to the Spark Master; the Master manages Spark Workers, each hosting a Spark Executor]

8 of 11

Apache Spark architecture (2)

Execution model

  • Driver Program: the program written by the Spark developer; it allocates a SparkContext, which is the entry point to all of Spark's functionality
  • Spark Executor: a container with an allocated amount of cores and memory; it executes Tasks and stores Data Partitions
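
This execution model can be sketched conceptually in plain Scala: the driver splits a dataset into partitions, runs one task per partition, and merges the partial results. In real Spark the tasks are shipped over the network to executors via the SparkContext; here they simply run locally, and all names are illustrative:

```scala
// Conceptual sketch of Spark's execution model, without Spark.

// The "dataset", split by the driver into partitions
val data = (1 to 12).toVector
val numPartitions = 3
val partitions: Vector[Vector[Int]] =
  data.grouped(data.size / numPartitions).toVector

// A "task": work applied independently to one partition
def task(partition: Vector[Int]): Int = partition.map(x => x * x).sum

// "Executors" would each run one task per partition; here the driver
// runs them itself and then reduces the partial results
val partialResults = partitions.map(task)
val total = partialResults.sum
```

Because each task touches only its own partition, the tasks can run on different executors with no coordination until the final reduce — the property Spark exploits to scale out.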

[Diagram: the Driver Program, holding a SparkContext, connects over the network to the Spark Master; the Master manages Spark Workers, each hosting a Spark Executor]

11 of 11

Questions?