Big Data in the Cloud?

Yes, you can do it in OpenStack

Hello!

I am Obed N Muñoz

I am here because I love to give presentations.

-vvvv

Who am I?

Software Engineer

Musician

Fast driver

Agenda

  • Introduction: Cloud and OpenStack
  • Data-processing
  • Sahara Project

Introduction

Cloud Computing and OpenStack

1

Cloud and XaaS Era

Everything as a Service

The term "cloud computing" covers a variety of services and applications that users access on demand over the Internet, rather than running them on premises.

OpenStack

OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard, CLI, RESTful API ...
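As a rough illustration of the RESTful API mentioned above, the sketch below builds the JSON body that the OpenStack Identity service (Keystone) v3 accepts for password authentication at `POST /v3/auth/tokens`. The user, password, project, and domain values are placeholders, and the helper name `build_token_request` is hypothetical. Nothing here contacts a real cloud.

```python
import json


def build_token_request(user, password, project, domain="Default"):
    """Return the JSON-serialisable body for a Keystone v3 password login."""
    return {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": user,
                        "domain": {"name": domain},
                        "password": password,
                    }
                },
            },
            # Scope the token to one project so other services accept it.
            "scope": {
                "project": {"name": project, "domain": {"name": domain}}
            },
        }
    }


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    body = build_token_request("demo", "secret", "demo-project")
    print(json.dumps(body, indent=2))
```

On success, Keystone returns the token in the `X-Subject-Token` response header. Clients then send it as `X-Auth-Token` on calls to the other OpenStack services.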

Architecture

Data-Processing

Data-Processing in the Cloud

2

What’s around Data-Processing?

  • Big Data
  • Data Science
  • Cloud
  • Machine Learning
  • Pattern Recognition
  • Neural Networks
  • Etc ...

Data-Processing Technologies

Sahara Project

Data-Processing in OpenStack

3

OpenStack Sahara

The Sahara project provides a simple means to provision a data-intensive application cluster (Spark or Hadoop) on top of OpenStack.

https://wiki.openstack.org/wiki/Sahara

Architecture

Getting Started

  • Clusters
  • Templates
  • Provisioning Plugins
  • Image Registry
  • Data Processing Frameworks
  • Elastic Data Processing (EDP)

http://docs.openstack.org/developer/sahara/userdoc/edp.html
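To make the relationship between templates and clusters more concrete, here is a minimal sketch of how node group templates and a cluster template could expand into the instances of a cluster. It is a simplified model, not the Sahara API: the class names, field names, flavor names, and hostname pattern are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeGroupTemplate:
    """Describes one kind of node: its flavor and the processes it runs."""
    name: str
    flavor: str
    processes: List[str]


@dataclass
class ClusterTemplate:
    """Combines node group templates with instance counts for one plugin."""
    name: str
    plugin: str
    # Maps a node group template name to the number of instances wanted.
    groups: Dict[str, int] = field(default_factory=dict)


def expand(cluster_name, template, node_groups):
    """List the instances a cluster built from `template` would contain."""
    by_name = {ng.name: ng for ng in node_groups}
    instances = []
    for group_name, count in template.groups.items():
        ng = by_name[group_name]
        for i in range(1, count + 1):
            instances.append({
                # Hypothetical naming scheme: <cluster>-<group>-<index>.
                "hostname": f"{cluster_name}-{group_name}-{i}",
                "flavor": ng.flavor,
                "processes": list(ng.processes),
            })
    return instances


if __name__ == "__main__":
    master = NodeGroupTemplate("master", "m1.medium", ["namenode", "resourcemanager"])
    worker = NodeGroupTemplate("worker", "m1.small", ["datanode", "nodemanager"])
    tmpl = ClusterTemplate("hadoop-small", "vanilla", {"master": 1, "worker": 3})
    for inst in expand("demo", tmpl, [master, worker]):
        print(inst["hostname"], inst["processes"])
```

In this model, scaling a cluster amounts to changing a count in `groups` and expanding again. Real scaling also has to start or decommission services on the affected nodes.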

More Features ...

  • OpenStack Block Storage support
  • Cluster Scaling
  • Data locality
  • Distributed Mode
  • Hadoop HDFS High Availability
  • Orchestration support

Clusters (Hadoop)

Data-Processing Frameworks

  • Hadoop
  • Spark
  • Storm

Provisioning Plugins

  • Vanilla - Vanilla Apache Hadoop
  • Ambari - Hortonworks Data Platform
  • Spark - Apache Spark with Cloudera HDFS
  • MapR Distribution - MapR plugin with MapR File System
  • Cloudera - Cloudera Hadoop

Elastic Data Processing (EDP)

EDP allows the execution of jobs on clusters created by Sahara. It supports:

  • Hive, Pig, MapReduce, MapReduce.Streaming, Java, and Shell job types on Hadoop clusters
  • Spark jobs
  • Storage of job binaries in the Object Storage service (swift), the Shared File Systems service (manila), or Sahara's own database
  • Access to input and output data sources in:
    • HDFS
    • Swift
    • Manila

http://docs.openstack.org/developer/sahara/userdoc/edp.html
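The pairing of a job with its input and output data sources can be sketched as follows, assuming each data source is addressed by a URL whose scheme names the storage back end (`hdfs://`, `swift://`, `manila://`). The function names and example URLs are hypothetical, and the code only classifies URLs. It does not talk to Sahara.

```python
from urllib.parse import urlparse

# Storage back ends listed above, keyed by the assumed URL scheme.
BACKENDS = {"hdfs": "HDFS", "swift": "Swift", "manila": "Manila"}


def classify_data_source(url):
    """Return the back end implied by the URL scheme, or raise ValueError."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported data source: {url!r}")
    return BACKENDS[scheme]


def describe_job(job_type, input_url, output_url):
    """Summarise a job as (type, input back end, output back end)."""
    return (
        job_type,
        classify_data_source(input_url),
        classify_data_source(output_url),
    )


if __name__ == "__main__":
    # Hypothetical container and namenode names.
    print(describe_job(
        "Pig",
        "swift://demo-container/input",
        "hdfs://namenode:8020/user/demo/output",
    ))
```

Checking the scheme up front lets an unsupported location fail before a cluster spends time on the job.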

Resources

http://docs.openstack.org/developer/sahara/userdoc/edp.html

http://hackathon.openstackgdl.org/

Q & A

CONCLUSION

Thanks!

Any questions?

You can find me at:

  • @obedmr
  • obed.n.munoz@gmail.com