1 of 36

Plug into Big Data

with Juju

2 of 36

Terms and Definitions

  • Service: A service is something that can be consumed by another service or by an entity external to the Juju environment.
  • Relation: A relation represents the ability of two services to communicate with each other. A relation is exposed by a service on one end (describing how it is to be consumed) and consumed by the counterpart on the other end.
  • Unit: A unit is the atomic representation of an element of a service. It may be a VM, a container, a bare-metal server, etc.

These 3 concepts are enough to represent any application. It’s Juju, it’s magic!

3 of 36

Juju

4 of 36

What is Juju?

  • Modeling language for service oriented environments
    • Deploy, relate, scale
    • Reliability
    • Repeatability
    • Observability

5 of 36

What is Juju? (contd)

  • Juju has two key components:
    • Charms model how a service is deployed, scaled and integrated
      • Written in any language (Big Data charms are mostly python)
    • Bundles represent a set of charms integrated together
      • Solutions

6 of 36

Challenges of building Big Data solutions

7 of 36

Big Data ecosystem

  • Apache Core Hadoop Services
    • HDFS: Hadoop Distributed File System, manages data
    • YARN: Hadoop Resource Manager, manages jobs
    • Compute Slaves: Hadoop data processing units, run jobs
  • Apache Spark
    • In-memory data processing unit, integrates with YARN
  • Additional components
    • Data Ingestion: Flume, Kafka, etc.
    • Data Analysis: Spark, Hive, Pig, etc.
    • Data Visualization: Hue, Zeppelin, etc.

8 of 36

Challenges of building Big Data solutions

  • Many Hadoop distributions
  • Many Apache projects to integrate into solutions

9 of 36

Hadoop distributions

  • Similar to Linux, Hadoop has many distributions
    • Top commercial offerings: Cloudera, MapR, Hortonworks, IBM BigInsights
    • Open source distribution: Apache Hadoop
  • Issues
    • Each distribution has different packaging style
    • Each distribution has different installation blueprints
      • e.g., users, install locations, etc.
    • Different dependencies
      • e.g., IBM BigInsights requires IBM JAVA
    • Different hardware
      • e.g., POWER, x86, ARM

10 of 36

Big Data Solution Components

  • Core
  • Data Ingestion
  • Data Processing
  • Data Visualization

11 of 36

Pluggable model to enable the Big Data ecosystem

12 of 36

Pluggable Stack

  • Uses standard interfaces (dfs, map-reduce)
  • Enables diverse solutions regardless of core and surrounding services
    • Swappable components means rapid development at every layer
      • Core Infrastructure
      • Data Ingestion
      • Data Analysis
      • Data Visualizations

13 of 36

Pluggable Installation

  • Operating System independence
    • Tarballs
    • Eliminate OS packaging dependencies
  • Architecture independence
    • Determine requirements at deployment time
  • Example from Hive

resources:
  hive-ppc64le:
    url: http://<url>/apache-hive-0.13.0-bin.tar.gz
    hash: 4c835644eb72a08df059b86c45fb159b95df08e831334cb57e24654ef078e7ee
    hash_type: sha256
  hive-x86_64:
    url: http://<url>/apache-hive-1.0.0-bin.tar.gz
    hash: b8e121f435defeb94d810eb6867d2d1c27973e4a3b4099f2716dbffafb274184
    hash_type: sha256
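
The charm can then resolve which of these resources applies at deployment time. A minimal sketch of the idea in Python (illustrative only, not the actual charm code; it simply maps the unit's CPU architecture to the matching entry above):

import platform
import yaml

def pick_hive_resource(resources_path='resources.yaml'):
    """Return the resource entry (url, hash, hash_type) matching this machine."""
    arch = platform.machine()                      # e.g. 'x86_64' or 'ppc64le'
    with open(resources_path) as f:
        resources = yaml.safe_load(f)['resources']
    key = 'hive-{}'.format(arch)
    if key not in resources:
        raise RuntimeError('No Hive resource defined for {}'.format(arch))
    return resources[key]

The returned url/hash pair can then be downloaded and verified before unpacking the tarball.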

14 of 36

Pluggable Configuration

  • Vendor properties
    • Provide sane defaults and allow fine-tuning
    • Allow vendor-specific configuration
  • Example from Hive

vendor: 'apache'
hadoop_version: '2.4.1'
packages:
  - 'libmysql-java'
  - 'mysql-client'
groups:
  - 'hadoop'
users:
  hive:
    groups: ['hadoop']
dirs:
  hive:
    path: '/usr/lib/hive'
    owner: 'hive'
    group: 'hadoop'
ports:
  hive:
    port: 10000
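
In the same spirit, the charm code can walk this description and create the groups, users and directories it needs, so one code path serves different vendors. A rough sketch (illustrative only, using standard tools rather than the real charm helpers):

import subprocess
import yaml

def apply_dist_config(path='dist.yaml'):
    """Create the groups, users and directories described by the vendor config."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for group in cfg.get('groups', []):
        subprocess.check_call(['groupadd', '-f', group])
    for name, spec in cfg.get('users', {}).items():
        # Sketch only: assumes the users do not already exist.
        subprocess.check_call(['useradd', '-m', '-G', ','.join(spec['groups']), name])
    for name, spec in cfg.get('dirs', {}).items():
        subprocess.check_call(['mkdir', '-p', spec['path']])
        subprocess.check_call(['chown', '{}:{}'.format(spec['owner'], spec['group']), spec['path']])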

15 of 36

Plugin Charm

  • Single, simplified connection point to Hadoop
  • Relating to the plugin installs and manages:
    • Java Runtime
    • Access to interact with the data set
      • Hadoop API and CLI
      • Hadoop config (/etc/hadoop/conf, /etc/environment)
  • Allows charm reusability across Hadoop versions and distributions

includes: ['interface:hadoop-plugin']

@when('hadoop.yarn.ready', 'hadoop.hdfs.ready')
def setup_pig(hadoop, *args):
    pig.install()
    pig.configure()

16 of 36

Hadoop Core

17 of 36

Big Data Core

  • Apache Hadoop and Spark Data Processing solutions provide the following fully configured services:
    • HDFS
      • Primary distributed storage used by Hadoop applications.
    • YARN
      • Architectural center of Hadoop that allows multiple data processing engines, such as MapReduce, Spark, real-time Spark Streaming, and other processing tools, to process data stored in HDFS.
    • Spark
      • Fast and general engine for large-scale data processing.

18 of 36

Apache Hadoop Core

Hadoop Core Batch Processing

juju quickstart apache-core-batch-processing

Provides a distributed Hadoop cluster ready for batch processing, with plugin capabilities for adding further functionality

19 of 36

Data Ingest

20 of 36

Add Apache Flume

Data Ingest with Apache Flume

juju quickstart u/bigdata-dev/apache-ingestion-flume

Provides:

  • Hadoop Cluster
  • Flume for data ingest into HDFS

21 of 36

Add Apache Kafka

Data Ingest with Apache Kafka

juju quickstart u/bigdata-dev/apache-flume-ingestion-kafka

Provides:

  • Hadoop Cluster
  • Kafka for data ingest

22 of 36

Data Analysis

23 of 36

Add Apache Pig

Data Analysis with the Apache Pig language

juju quickstart apache-analytics-pig

Provides:

  • Hadoop Cluster
  • Pig for Data Analysis

24 of 36

Add Apache Hive

Data Analytics with MySQL

juju quickstart apache-analytics-sql

Provides:

  • Hadoop Cluster
  • SQL-like queries with Hive
  • MySQL DataStore

25 of 36

Data Visualization

26 of 36

Add iPy Notebook

Hadoop Core + Spark with Notebook Viz

juju quickstart apache-hadoop-spark-notebook

Provides:

  • Hadoop Cluster
  • Spark for real-time streaming
  • IPython Notebook for visualization

27 of 36

Add Apache Zeppelin

Hadoop Core + Spark with Notebook Viz

juju quickstart apache-hadoop-spark-zeppelin

Provides:

  • Hadoop Cluster
  • Spark for real-time streaming
  • Apache Zeppelin for visualization

28 of 36

+ Spark

29 of 36

Spark + Hadoop

  • Spark uses HDFS: it can use any Hadoop data source
  • Spark runs on YARN: it can run on the same cluster as MapReduce jobs, Hive, Pig, etc. (see the sketch below)
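
For example, a minimal PySpark job (illustrative paths and application name) can read data from HDFS and run on the YARN cluster deployed by the bundle:

from pyspark import SparkConf, SparkContext

# Run against the existing YARN cluster; the apache-spark charm ships a
# pre-configured Hadoop environment, so 'yarn-client' finds the ResourceManager.
conf = SparkConf().setAppName('wordcount-example').setMaster('yarn-client')
sc = SparkContext(conf=conf)

# Any HDFS path can be used as a data source (the paths here are illustrative).
lines = sc.textFile('hdfs:///user/ubuntu/input/sample.txt')
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile('hdfs:///user/ubuntu/output/wordcount')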

30 of 36

Spark Service

  • Our Apache Hadoop pluggable model provides four ways to interact with the Spark service
    1. pyspark
      • Deployed and configured as part of the apache-spark charm
      • Preconfigured to connect to the HDFS and YARN services
    2. spark-shell
      • Deployed and configured as part of the apache-spark charm
      • Preconfigured to connect to the HDFS and YARN services
    3. spark-submit
      • Deployed and configured as part of the apache-spark charm
      • Preconfigured to connect to the HDFS and YARN services
    4. Spark API (e.g. SparkStreaming, SparkSQL, SparkHive, SparkR, MLlib, ...)
      • Deploy Spark-related application charms as subordinates to the Spark charm
        • Apache Zeppelin, IPython Notebook for Spark, etc.
  • Spark charm details: https://jujucharms.com/apache-spark/

31 of 36

Spark with Layers

https://github.com/johnsca/layer-apache-spark

Refactoring the Spark charm using layers to make it easier to extend

@when('bootstrapped')
@when_not('spark.installed')
def install_spark():
    spark = Spark()
    if spark.verify_resources():
        hookenv.status_set('maintenance', 'Installing Apache Spark')
        spark.install()
        set_state('spark.installed')


@when('spark.installed', 'hadoop.yarn.ready', 'hadoop.hdfs.ready')
def start_spark(*args):
    # Runs only once Spark is installed and the Hadoop plugin reports
    # that HDFS and YARN are ready.
    hookenv.status_set('maintenance', 'Setting up Apache Spark')
    spark = Spark()
    spark.configure()
    spark.start()
    spark.open_ports()
    set_state('spark.started')
    hookenv.status_set('active', 'Ready')

32 of 36

Build Your Solution

33 of 36

Build and Share Your Solution

Real-time Syslog Analytics

juju quickstart realtime-syslog-analytics

Provides:

  • Hadoop Cluster
  • Spark for real-time streaming (see the sketch below)
  • Zeppelin for visualization
  • Flume for Syslog Data ingest
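
As a sketch of the streaming piece referenced above (illustrative HDFS path, assuming Flume lands syslog events as files in that directory), a small Spark Streaming job could tally severities as new data arrives:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='syslog-stream-example')
ssc = StreamingContext(sc, batchDuration=30)   # 30-second micro-batches

# Watch the HDFS directory where Flume writes new syslog files (path is illustrative).
events = ssc.textFileStream('hdfs:///user/flume/syslog')

# Count a few example keywords in each batch and print the per-batch totals.
keywords = events.flatMap(lambda line: [w for w in line.split() if w in ('error', 'warning', 'info')])
counts = keywords.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()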

34 of 36

Ecosystem Solutions

35 of 36

References and Contact Info

36 of 36

Thanks!