1 of 26

Overview of NIST Big Data Public Working Group (NBD-PWG): The 10 Use Case Patterns

  • Geoffrey Fox, Indiana University

  • Based on work of NIST Big Data Public Working Group (NBD-PWG) June-September 2013 http://bigdatawg.nist.gov/

  • Leaders of activity
  • Wo Chang, NIST
  • Robert Marcus, ET-Strategies
  • Chaitanya Baru, UC San Diego

E434/534 Big Data Use Cases from NIST Analysis

2 of 26

TYPICAL DATA INTERACTION SCENARIOS

These consist of multiple data systems, including classic DB, streaming, archives, Hive, analytics, workflow, and different user interfaces (events to visualization)

From Bob Marcus (ET Strategies) http://bigdatawg.nist.gov/_uploadfiles/M0311_v2_2965963213.pdf

We list the 10 use cases and then go through each in more detail. These slides are based on those produced by Bob Marcus at the link above

3 of 26

10 Generic Data Processing Use Cases

  1. Multiple users performing interactive queries and updates on a database with basic availability and eventual consistency (BASE: Basically Available, Soft state, Eventual consistency, as opposed to ACID: Atomicity, Consistency, Isolation, Durability)
  2. Perform real time analytics on data source streams and notify users when specified events occur
  3. Move data from external data sources into a highly horizontally scalable data store, transform it using highly horizontally scalable processing (e.g. MapReduce), and return it to the horizontally scalable data store (ELT: Extract, Load, Transform)
  4. Perform batch analytics on the data in a highly horizontally scalable data store using highly horizontally scalable processing (e.g. MapReduce) with a user-friendly interface (e.g. SQL-like)
  5. Perform interactive analytics on data in analytics-optimized database
  6. Visualize data extracted from horizontally scalable Big Data store
  7. Move data from a highly horizontally scalable data store into a traditional Enterprise Data Warehouse (EDW)
  8. Extract, process, and move data from data stores to archives
  9. Combine data from Cloud databases and on premise data stores for analytics, data mining, and/or machine learning
  10. Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager

4 of 26

1. Multiple users performing interactive queries and updates on a database with basic availability and eventual consistency

Generate a SQL Query

Process SQL Query (RDBMS Engine, Hive, Hadoop, Drill)

Data Storage: RDBMS, HDFS, HBase

Data, Streaming, Batch …

Includes access to traditional ACID database
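The "generate a SQL query" and "process SQL query" steps on this slide can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for the query engine, the `readings` table and its columns are hypothetical, and SQLite is an ACID store, so it shows the traditional-database end of this use case rather than a BASE system.

```python
import sqlite3


def run_interactive_session():
    """Run one update and one query against a small in-memory database."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table standing in for the shared database.
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a", 1.0), ("a", 3.0), ("b", 10.0)],
    )
    # An interactive update issued by one user.
    conn.execute("UPDATE readings SET value = value + 1.0 WHERE sensor = ?", ("b",))
    conn.commit()
    # An interactive query issued by another user.
    rows = conn.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for sensor, mean_value in run_interactive_session():
        print(sensor, mean_value)
```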

5 of 26

2. Perform real time analytics on data source streams and notify users when specified events occur

Storm, Kafka, HBase, ZooKeeper

Streaming Data (×3)

Posted Data

Identified Events

Filter Identifying Events

Repository

Specify filter

Archive

Post Selected Events

Fetch streamed Data
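The flow named by the labels above (fetch streamed data, apply a user-specified filter, post selected events, archive) can be sketched in plain Python. This is a single-process illustration, not how Storm or Kafka are programmed; the function name `process_stream`, the record fields, and the threshold are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def process_stream(
    stream: Iterable[Record],
    event_filter: Callable[[Record], bool],
    notify: Callable[[Record], None],
) -> List[Record]:
    """Archive every fetched record and post those the filter identifies."""
    archive: List[Record] = []
    for record in stream:
        archive.append(record)      # every fetched record goes to the archive
        if event_filter(record):    # user-specified filter identifies events
            notify(record)          # post the selected event to the user
    return archive


if __name__ == "__main__":
    readings = [
        {"sensor": "s1", "temp": 20.0},
        {"sensor": "s2", "temp": 35.5},
        {"sensor": "s1", "temp": 31.0},
    ]
    process_stream(readings, lambda r: r["temp"] > 30.0, print)
```

Passing the filter and the notification hook as arguments mirrors the "Specify filter" and "Post Selected Events" labels: the user supplies both, and the processing loop stays unchanged.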

6 of 26

3. Move data from external data sources into a highly horizontally scalable data store, transform it using highly horizontally scalable processing (e.g. MapReduce), and return it to the horizontally scalable data store (ELT)

http://www.dzone.com/articles/hadoop-t-etl

ELT is Extract, Load, Transform (in contrast to ETL: Extract, Transform, Load)

Streaming Data

OLTP Database

Web Services

Transform with Hadoop, Spark, Giraph …

Data Storage: HDFS, HBase

Enterprise Data Warehouse
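The ordering that distinguishes this use case (load the raw data first, transform it inside the store, write the result back to the same store) can be sketched as below. A Python dict stands in for the horizontally scalable data store, and the function names, keys, and record layout are hypothetical.

```python
from typing import Dict, Iterable, List


def extract(source_lines: Iterable[str]) -> List[str]:
    """Pull raw records from an external source without changing them."""
    return list(source_lines)


def load(store: Dict[str, list], key: str, records: list) -> None:
    """Write records into the store as-is."""
    store[key] = records


def transform(store: Dict[str, list], raw_key: str, out_key: str) -> None:
    """Read raw records from the store, clean them, and write them back."""
    cleaned = []
    for line in store[raw_key]:
        name, value = line.split(",")
        cleaned.append({"name": name.strip().lower(), "value": float(value)})
    store[out_key] = cleaned


if __name__ == "__main__":
    store: Dict[str, list] = {}
    load(store, "raw", extract([" Alice ,1.5", "BOB,2"]))
    transform(store, "raw", "clean")
    print(store["clean"])
```

Because the raw records remain in the store after the transform step, a different transformation can be run later without going back to the external source.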

7 of 26

4. Perform batch analytics on the data in a highly horizontally scalable data store using highly horizontally scalable processing (e.g. MapReduce) with a user-friendly interface (e.g. SQL-like)

Hadoop, Spark, Giraph, Pig …

Data Storage: HDFS, HBase

Data, Streaming, Batch …

Hive

Mahout, R

SQL Query General Analytics

HCatalog
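To make the link between a SQL-like interface and MapReduce-style batch processing concrete, here is a single-process sketch of the map, shuffle, and reduce phases for a group-by count. It is an illustration only, not Hadoop or Hive code; the `page` field and the function names are hypothetical. The query in the comment is roughly what a SQL-like front end would let a user write for the same result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Roughly equivalent SQL-like query (hypothetical table name):
#   SELECT page, COUNT(*) FROM page_views GROUP BY page


def map_phase(records: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Emit one (key, 1) pair per input record."""
    for record in records:
        yield record["page"], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(records: Iterable[dict]) -> Dict[str, int]:
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    views = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
    print(count_by_page(views))
```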

8 of 26

Hive Example

9 of 26

5. Perform interactive analytics on data in analytics-optimized database

Hadoop, Spark, Giraph, Pig …

Data Storage: HDFS, HBase

Data, Streaming, Batch …

Mahout, R

Similar to use case 4, which is batch rather than interactive

10 of 26

Data Access Patterns: Science Examples

11 of 26

5A. Perform interactive analytics on observational scientific data

Grid or Many Task Software, Hadoop, Spark, Giraph, Pig …

Data Storage: HDFS, HBase, File Collection

Streaming Twitter data for Social Networking

Science Analysis Code, Mahout, R

Transport batch of data to primary analysis data system

Record Scientific Data in “field”

Accumulate locally and perform initial computing

Direct Transfer

The following examples are LHC, Remote Sensing, Astronomy, and Bioinformatics

12 of 26

Particle Physics (LHC)

LHC data analysis processes ~30 petabytes of data per year produced at CERN, using ~300,000 cores around the world

The data are reduced in size, replicated, and examined by physicists
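For a sense of scale, the two figures on this slide can be turned into average rates with a few lines of arithmetic. The sketch below assumes a decimal petabyte (10^15 bytes), a 365-day year, and data spread evenly over time and over cores; none of these assumptions come from the slide, so the results are order-of-magnitude averages only.

```python
PETABYTE = 10**15                    # assumption: decimal petabyte
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumption: 365-day year

data_per_year_bytes = 30 * PETABYTE  # ~30 PB per year (from the slide)
cores = 300_000                      # ~300,000 cores (from the slide)

# Average sustained data rate if the yearly volume arrived evenly.
avg_rate_gb_per_s = data_per_year_bytes / SECONDS_PER_YEAR / 1e9

# Average yearly data volume per core if work were spread evenly.
data_per_core_gb = data_per_year_bytes / cores / 1e9

print(f"average rate: {avg_rate_gb_per_s:.2f} GB/s")
print(f"data per core per year: {data_per_core_gb:.0f} GB")
```

Under these assumptions the average comes to a little under 1 GB/s, and about 100 GB per core per year.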

13 of 26

Astronomy – Dark Energy Survey I

Victor M. Blanco Telescope in Chile, where the new wide-angle 520-megapixel camera DECam is installed

https://indico.cern.ch/event/214784/session/5/contribution/410

The data end up as part of the International Virtual Observatory (coordinated by the International Virtual Observatory Alliance, IVOA), a collection of interoperating data archives and software tools that use the internet to form a scientific research environment in which astronomical research programs can be conducted.

14 of 26

Astronomy – Dark Energy Survey II

For DES (Dark Energy Survey) the data are sent from the mountaintop via a microwave link to La Serena, Chile. From there, an optical link forwards them to NCSA (UIUC) as well as NERSC (LBNL) for storage and "reduction". There, galaxies and stars in both the individual and stacked images are identified and catalogued, and finally their properties are measured and stored in a database.

DES Machine room at NCSA

15 of 26

Astronomy – Hubble Space Telescope

http://asd.gsfc.nasa.gov/archive/hubble/a_pdf/news/facts/FS14.pdf

HST processing in Baltimore, MD

16 of 26

CReSIS Remote Sensing: Radar Surveys

Expeditions last 1-2 months and gather up to 100 TB of data. Most of it is saved on removable disks and flown back to the continental US at the end. A sample is analyzed in the field to check the instrument.

17 of 26

Gene Sequencing

Illumina devices distributed across the world in many laboratories take data in the form of “reads” that are aligned into a full sequence

This processing is often local, but the data need to be compared with the world’s other gene sequences, so they are uploaded to a central repository

The Illumina HiSeq X 10 can sequence 18,000 genomes per year at $1000 each. It produces 0.6 terabases per day

18 of 26

Remaining general access patterns

19 of 26

6. Visualize data extracted from horizontally scalable Big Data store

Hadoop, Spark, Giraph, Pig …

Data Storage: HDFS, HBase

Mahout, R

Prepare Interactive Visualization

Orchestration Layer

Specify Analytics

Interactive Visualization

20 of 26

7. Move data from a highly horizontally scalable data store into a traditional Enterprise Data Warehouse

Streaming Data

OLTP Database

Web Services

Transform with Hadoop, Spark, Giraph …

Data Storage: HDFS, HBase, (RDBMS)

Enterprise Data Warehouse

Data Warehouse Query

21 of 26

Moving to EDW Example from Teradata

Moving data from HDFS to Teradata Data Warehouse and Aster Discovery Platform

http://blogs.teradata.com/data-points/announcing-teradata-aster-big-analytics-appliance/

22 of 26

8. Extract, process, and move data from data stores to archives

http://www.dzone.com/articles/hadoop-t-etl

ETL is Extract, Transform, Load

Streaming Data

OLTP Database

Web Services

Transform with Hive, Drill, Hadoop, Spark, Giraph, Pig …

Data Storage: HDFS, HBase, RDBMS

Archive

Transform as needed

23 of 26

9. Combine data from Cloud databases and on premise data stores for analytics, data mining, and/or machine learning

Hadoop, Spark, Giraph, Pig …

Data Storage: HDFS, HBase

Mahout, R

Similar to use cases 4 and 5

On premise Data

Streaming Data
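One simple way to picture the "combine" step is a keyed join between records fetched from a cloud database and records held on premise, producing a single table for downstream analytics. The sketch below is only an illustration under that assumption; the `combine` function, the `customer_id` key, and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List


def combine(
    cloud_rows: Iterable[Dict[str, object]],
    on_premise_rows: Iterable[Dict[str, object]],
    key: str,
) -> List[Dict[str, object]]:
    """Inner-join two record sets on a shared key field."""
    on_premise_by_key = {row[key]: row for row in on_premise_rows}
    combined = []
    for row in cloud_rows:
        match = on_premise_by_key.get(row[key])
        if match is not None:
            # Merge the two records; cloud fields win on a name clash.
            combined.append({**match, **row})
    return combined


if __name__ == "__main__":
    cloud = [{"customer_id": 1, "clicks": 12}, {"customer_id": 3, "clicks": 4}]
    local = [{"customer_id": 1, "region": "EU"}, {"customer_id": 2, "region": "US"}]
    print(combine(cloud, local, "customer_id"))
```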

24 of 26

Example: Integrate Cloud and local data

http://wikibon.org/w/images/2/20/Cloud-BigData.png

25 of 26

10. Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager

Hadoop, Spark, Giraph, Pig …

Data Storage: HDFS, HBase

Analytic-1

Analytic-2

Orchestration Layer (Workflow)

Specify Analytics Pipeline

Analytic-3

(Visualize)

This can be used for science by adding data staging phases as in case 5A
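A workflow that mixes parallel and sequential steps can be sketched with Python's standard concurrent.futures module. The dependency structure here is an assumption made for illustration (Analytic-1 and Analytic-2 run in parallel, and Analytic-3 runs after both finish); the slide does not specify it, and the analytic functions are hypothetical placeholders. A production workflow manager would add scheduling, retries, and data staging on top of this idea.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def analytic_1(data: List[float]) -> float:
    """Placeholder analytic: total of the values."""
    return sum(data)


def analytic_2(data: List[float]) -> float:
    """Placeholder analytic: largest value."""
    return max(data)


def analytic_3(total: float, peak: float) -> float:
    """Placeholder analytic that depends on both earlier results."""
    return peak / total


def run_pipeline(data: List[float]) -> float:
    # Parallel stage: the two independent analytics run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_total = pool.submit(analytic_1, data)
        future_peak = pool.submit(analytic_2, data)
        total = future_total.result()
        peak = future_peak.result()
    # Sequential stage: runs only after both inputs are available.
    return analytic_3(total, peak)


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4]))
```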

26 of 26

Example from Hortonworks

http://hortonworks.com/hadoop/yarn/