1 of 36

Plug into Big Data

with Juju

2 of 36

Terms and Definitions

  • Service: A service is something that can be consumed by another service or by an entity external to the Juju environment.
  • Relation: A relation represents the ability of two services to communicate with each other. A relation is exposed by a service on one end (describing how it is to be consumed) and consumed by the counterpart on the other end.
  • Unit: A unit is the atomic representation of an element of a service. It may be a VM, a container, a bare-metal server, etc.

These 3 concepts are enough to represent any application. It’s Juju, it’s magic!

3 of 36

Juju

4 of 36

What is Juju?

  • Modeling language for service oriented environments
    • Deploy, relate, scale
    • Reliability
    • Repeatability
    • Observability

5 of 36

What is Juju? (contd)

  • Juju has two key components:
    • Charms model how a service is deployed, scaled and integrated
      • Written in any language (Big Data charms are mostly python)
    • Bundles represent a set of charms integrated together
      • Solutions

6 of 36

Challenges of building Big Data solutions

7 of 36

Big Data ecosystem

  • Apache Core Hadoop Services
    • HDFS: Hadoop Distributed File System, manages data
    • YARN: Hadoop Resource Manager, manages jobs
    • Compute Slaves: Hadoop data processing units, run jobs
  • Apache Spark
    • In-memory data processing unit, integrates with YARN
  • Additional components
    • Data Ingestion: Flume, Kafka, etc.
    • Data Analysis: Spark, Hive, Pig, etc.
    • Data Visualization: Hue, Zeppelin, etc.

8 of 36

Challenges of building Big Data solutions

  • Many Hadoop distributions
  • Many Apache projects to integrate into solutions

9 of 36

Hadoop distributions

  • Similar to Linux, Hadoop has many distributions
    • Top commercial offerings: Cloudera, MapR, Hortonworks, IBM BigInsights
    • Open source distribution: Apache Hadoop
  • Issues
    • Each distribution has different packaging style
    • Each distribution has different installation blueprints
      • e.g., users, install locations, etc.
    • Different dependencies
      • e.g., IBM BigInsights requires IBM JAVA
    • Different hardware
      • e.g., POWER, x86, ARM

10 of 36

Big Data Solution Components

  • Core
  • Data Ingestion
  • Data Processing
  • Data Visualization

11 of 36

Pluggable model to enable the Big Data ecosystem

12 of 36

Pluggable Stack

  • Uses standard interfaces (dfs, map-reduce)
  • Enables diverse solutions regardless of core and surrounding services
    • Swappable components means rapid development at every layer
      • Core Infrastructure
      • Data Ingestion
      • Data Analysis
      • Data Visualizations

13 of 36

Pluggable Installation

  • Operating System independence
    • Tarballs
    • Eliminate OS packaging dependencies
  • Architecture independence
    • Determine requirements at deployment time
  • Example from Hive

resources:
  hive-ppc64le:
    url: http://<url>/apache-hive-0.13.0-bin.tar.gz
    hash: 4c835644eb72a08df059b86c45fb159b95df08e831334cb57e24654ef078e7ee
    hash_type: sha256
  hive-x86_64:
    url: http://<url>/apache-hive-1.0.0-bin.tar.gz
    hash: b8e121f435defeb94d810eb6867d2d1c27973e4a3b4099f2716dbffafb274184
    hash_type: sha256
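
The charm can then resolve which of these resources applies at deployment time. A minimal sketch of the idea in Python (illustrative only, not the actual charm code; it simply maps the unit's CPU architecture to the matching entry above):

import platform
import yaml

def pick_hive_resource(resources_path='resources.yaml'):
    """Return the resource entry (url, hash, hash_type) matching this machine."""
    arch = platform.machine()                      # e.g. 'x86_64' or 'ppc64le'
    with open(resources_path) as f:
        resources = yaml.safe_load(f)['resources']
    key = 'hive-{}'.format(arch)
    if key not in resources:
        raise RuntimeError('No Hive resource defined for {}'.format(arch))
    return resources[key]

The returned url/hash pair can then be downloaded and verified before unpacking the tarball.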

14 of 36

Pluggable Configuration

  • Vendor properties
    • Provide sane defaults and allow fine-tuning
    • Allow vendor-specific configuration
  • Example from Hive

vendor: 'apache'
hadoop_version: '2.4.1'
packages:
  - 'libmysql-java'
  - 'mysql-client'
groups:
  - 'hadoop'
users:
  hive:
    groups: ['hadoop']
dirs:
  hive:
    path: '/usr/lib/hive'
    owner: 'hive'
    group: 'hadoop'
ports:
  hive:
    port: 10000
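
In the same spirit, the charm code can walk this description and create the groups, users and directories it needs, so one code path serves different vendors. A rough sketch (illustrative only, using standard tools rather than the real charm helpers):

import subprocess
import yaml

def apply_dist_config(path='dist.yaml'):
    """Create the groups, users and directories described by the vendor config."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for group in cfg.get('groups', []):
        subprocess.check_call(['groupadd', '-f', group])
    for name, spec in cfg.get('users', {}).items():
        # Sketch only: assumes the users do not already exist.
        subprocess.check_call(['useradd', '-m', '-G', ','.join(spec['groups']), name])
    for name, spec in cfg.get('dirs', {}).items():
        subprocess.check_call(['mkdir', '-p', spec['path']])
        subprocess.check_call(['chown', '{}:{}'.format(spec['owner'], spec['group']), spec['path']])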

15 of 36

Plugin Charm

  • Single, simplified connection point to Hadoop
  • Relating to the plugin installs and manages:
    • Java Runtime
    • Access to interact with the data set
      • Hadoop API and CLI
      • Hadoop config (/etc/hadoop/conf, /etc/environment)
  • Allows charm reusability across Hadoop versions and distributions

includes: ['interface:hadoop-plugin']

@when('hadoop.yarn.ready', 'hadoop.hdfs.ready')
def setup_pig(hadoop, *args):
    pig.install()
    pig.configure()

16 of 36

Hadoop Core

17 of 36

Big Data Core

  • Apache Hadoop and Spark Data Processing solutions provide the following fully configured services:
    • HDFS
      • Primary distributed storage used by Hadoop applications.
    • YARN
      • Architectural center of Hadoop that allows multiple data processing engines, such as MapReduce, Spark, real-time Spark Streaming, and other processing tools, to process data stored in HDFS.
    • Spark
      • Fast and general engine for large-scale data processing.

18 of 36

Apache Hadoop Core

Hadoop Core Batch Processing

juju quickstart apache-core-batch-processing

Provides a distributed Hadoop cluster ready for batch processing, with plugin capabilities for adding further functionality

19 of 36

Data Ingest

20 of 36

Add Apache Flume

Data Ingest with Apache Flume

juju quickstart u/bigdata-dev/apache-ingestion-flume

Provides:

  • Hadoop Cluster
  • Flume for data ingest into HDFS

21 of 36

Add Apache Kafka

Data Ingest with Apache Kafka

juju quickstart u/bigdata-dev/apache-flume-ingestion-kafka

Provides:

  • Hadoop Cluster
  • Kafka for data ingest

22 of 36

Data Analysis

23 of 36

Add Apache Pig

Data Analysis with the Apache Pig language

juju quickstart apache-analytics-pig

Provides:

  • Hadoop Cluster
  • Pig for Data Analysis

24 of 36

Add Apache Hive

Data Analytics with MySQL

juju quickstart apache-analytics-sql

Provides:

  • Hadoop Cluster
  • SQL-like queries with Hive
  • MySQL DataStore

25 of 36

Data Visualization

26 of 36

Add iPy Notebook

Hadoop Core + Spark with Notebook Viz

juju quickstart apache-hadoop-spark-notebook

Provides:

  • Hadoop Cluster
  • Spark for real-time streaming
  • IPython Notebook for visualization

27 of 36

Add Apache Zeppelin

Hadoop Core + Spark with Notebook Viz

juju quickstart apache-hadoop-spark-zeppelin

Provides:

  • Hadoop Cluster
  • Spark for real-time streaming
  • Apache Zeppelin for visualization

28 of 36

+ Spark

29 of 36

Spark + Hadoop

  • Spark uses HDFS: it can use any Hadoop data source
  • Spark runs on YARN: it can run on the same cluster as MapReduce jobs, Hive, Pig, etc. (see the sketch below)
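
For example, a minimal PySpark job (illustrative paths and application name) can read data from HDFS and run on the YARN cluster deployed by the bundle:

from pyspark import SparkConf, SparkContext

# Run against the existing YARN cluster; the apache-spark charm ships a
# pre-configured Hadoop environment, so 'yarn-client' finds the ResourceManager.
conf = SparkConf().setAppName('wordcount-example').setMaster('yarn-client')
sc = SparkContext(conf=conf)

# Any HDFS path can be used as a data source (the paths here are illustrative).
lines = sc.textFile('hdfs:///user/ubuntu/input/sample.txt')
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile('hdfs:///user/ubuntu/output/wordcount')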

30 of 36

Spark Service

  • Our Apache Hadoop pluggable model provides four ways to interact with the Spark service
    1. pyspark
      • Deployed and configured as part of the apache-spark charm
      • Preconfigured to connect to the HDFS and YARN services
    2. spark-shell
      • Deployed and configured as part of the apache-spark charm
      • Preconfigured to connect to the HDFS and YARN services
    3. spark-submit
      • Deployed and configured as part of the apache-spark charm
      • Preconfigured to connect to the HDFS and YARN services
    4. Spark API (e.g. SparkStreaming, SparkSQL, SparkHive, SparkR, MLlib, ...)
      • Deploy Spark-related application charms as subordinates to the Spark charm
        • Apache Zeppelin, IPython Notebook for Spark, etc.
  • Spark charm details: https://jujucharms.com/apache-spark/

31 of 36

Spark with Layers

https://github.com/johnsca/layer-apache-spark

Refactoring the Spark charm using layers to make it easier to extend

@when('bootstrapped')
@when_not('spark.installed')
def install_spark():
    spark = Spark()
    if spark.verify_resources():
        hookenv.status_set('maintenance', 'Installing Apache Spark')
        spark.install()
        set_state('spark.installed')


@when('spark.installed', 'hadoop.yarn.ready', 'hadoop.hdfs.ready')
def start_spark(*args):
    # Runs only once Spark is installed and the Hadoop plugin reports
    # that HDFS and YARN are ready.
    hookenv.status_set('maintenance', 'Setting up Apache Spark')
    spark = Spark()
    spark.configure()
    spark.start()
    spark.open_ports()
    set_state('spark.started')
    hookenv.status_set('active', 'Ready')

32 of 36

Build Your Solution

33 of 36

Build and Share Your Solution

Real-time Syslog Analytics

juju quickstart realtime-syslog-analytics

Provides:

  • Hadoop Cluster
  • Spark for real-time streaming (see the sketch below)
  • Zeppelin for visualization
  • Flume for Syslog Data ingest
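
As a sketch of the streaming piece referenced above (illustrative HDFS path, assuming Flume lands syslog events as files in that directory), a small Spark Streaming job could tally severities as new data arrives:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='syslog-stream-example')
ssc = StreamingContext(sc, batchDuration=30)   # 30-second micro-batches

# Watch the HDFS directory where Flume writes new syslog files (path is illustrative).
events = ssc.textFileStream('hdfs:///user/flume/syslog')

# Count a few example keywords in each batch and print the per-batch totals.
keywords = events.flatMap(lambda line: [w for w in line.split() if w in ('error', 'warning', 'info')])
counts = keywords.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()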

34 of 36

Ecosystem Solutions

35 of 36

References and Contact Info

36 of 36

Thanks!