Overview of NIST Big Data Public Working Group (NBD-PWG): The 10 Use Case Patterns
E434/534 Big Data Use Cases from NIST Analysis
TYPICAL DATA INTERACTION SCENARIOS
These consist of multiple data systems including classic DB, streaming, archives, Hive, analytics, workflow and different user interfaces (events to visualization)
From Bob Marcus (ET Strategies) http://bigdatawg.nist.gov/_uploadfiles/M0311_v2_2965963213.pdf
We list the 10 patterns and then go through each in more detail. These slides are based on those produced by Bob Marcus at the link above.
10 Generic Data Processing Use Cases
1. Multiple users performing interactive queries and updates on a database with basic availability and eventual consistency
Generate a SQL Query
Process SQL Query (RDBMS Engine, Hive, Hadoop, Drill)
Data Storage: RDBMS, HDFS, HBase
Data, Streaming, Batch …..
Includes access to traditional ACID database
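A minimal sketch of the query-and-update flow in this pattern, assuming Python's built-in sqlite3 module stands in for the SQL engine and storage layer. The table `readings` and its columns are hypothetical. SQLite is an ACID store, so this illustrates only the "traditional ACID database" case, not eventual consistency.

```python
import sqlite3

def run_demo():
    # In-memory SQLite stands in for the data storage layer (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a", 1.0), ("b", 2.5), ("a", 3.0)],
    )
    # Interactive update issued by one user.
    conn.execute("UPDATE readings SET value = value * 2 WHERE sensor = ?", ("b",))
    conn.commit()
    # Interactive query issued by another user.
    rows = conn.execute(
        "SELECT sensor, SUM(value) FROM readings GROUP BY sensor ORDER BY sensor"
    ).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    print(run_demo())
```

In a real deployment the same SQL could be sent to Hive or Drill over data in HDFS or HBase; only the engine behind the query changes.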
2. Perform real time analytics on data source streams and notify users when specified events occur
Storm, Kafka, HBase, ZooKeeper
Streaming Data
Streaming Data
Streaming Data
Posted Data
Identified Events
Filter Identifying Events
Repository
Specify filter
Archive
Post Selected Events
Fetch streamed Data
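The filter step in this diagram can be sketched in plain Python. This is an illustration only: the function `identify_events`, the record fields, and the choice to archive every fetched record are assumptions, and a production system would use tools such as Storm and Kafka instead of in-memory lists.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, object]

def identify_events(
    stream: Iterable[Event],
    event_filter: Callable[[Event], bool],
    archive: List[Event],
) -> Iterator[Event]:
    """Fetch streamed records, archive them, and yield those matching the filter."""
    for record in stream:
        archive.append(record)      # assumption: every fetched record is archived
        if event_filter(record):    # user-specified filter
            yield record            # selected events are posted to users

def demo():
    stream = [
        {"sensor": "s1", "temp": 20},
        {"sensor": "s2", "temp": 95},
        {"sensor": "s1", "temp": 101},
    ]
    archive: List[Event] = []
    posted = list(identify_events(stream, lambda e: e["temp"] > 90, archive))
    return posted, archive

if __name__ == "__main__":
    posted, archive = demo()
    print(posted)
```

Because `identify_events` is a generator, events are posted as records arrive rather than after the whole stream has been read.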
3. Move data from external data sources into a highly horizontally scalable data store, transform it using highly horizontally scalable processing (e.g. MapReduce), and return it to the horizontally scalable data store (ELT)
http://www.dzone.com/articles/hadoop-t-etl
ELT is Extract, Load, Transform
Streaming Data
OLTP Database
Web Services
Transform with Hadoop, Spark, Giraph …
Data Storage: HDFS, HBase
Enterprise Data Warehouse
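The ordering that distinguishes ELT (load first, transform inside the store) can be sketched as follows. This assumes SQLite as a stand-in for the scalable data store. The function `elt`, the tables `raw_events` and `customer_totals`, and the sample cleaning rules are all hypothetical.

```python
import sqlite3

def elt(raw_rows):
    store = sqlite3.connect(":memory:")
    # Extract + Load: raw records enter the store untransformed.
    store.execute("CREATE TABLE raw_events (customer TEXT, amount TEXT)")
    store.executemany("INSERT INTO raw_events VALUES (?, ?)", raw_rows)
    # Transform: runs inside the store and writes its result back to it.
    store.execute(
        "CREATE TABLE customer_totals AS "
        "SELECT LOWER(TRIM(customer)) AS customer, "
        "       SUM(CAST(amount AS REAL)) AS total "
        "FROM raw_events GROUP BY LOWER(TRIM(customer))"
    )
    result = store.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()
    store.close()
    return result

if __name__ == "__main__":
    print(elt([(" Alice", "10.5"), ("alice ", "4.5"), ("Bob", "3")]))
```

The raw table is kept after the transform, so the data can be transformed again later with different rules, which is one common reason to load before transforming.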
4. Perform batch analytics on the data in a highly horizontally scalable data store using highly horizontally scalable processing (e.g. MapReduce) with a user-friendly interface (e.g. SQL-like)
Hadoop, Spark, Giraph, Pig …
Data Storage: HDFS, HBase
Data, Streaming, Batch …..
Hive
Mahout, R
SQL Query General Analytics
HCatalog
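To make the MapReduce style of batch processing concrete, here is a single-process sketch of the map, shuffle, and reduce phases for a word count. The function names are hypothetical, and a real job would run these phases in parallel across a cluster. The result matches what a SQL-like interface would express as a `COUNT(*)` with `GROUP BY word`.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key.
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))

if __name__ == "__main__":
    print(word_count(["big data", "Big compute"]))
```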
Hive Example
5. Perform interactive analytics on data in analytics-optimized database
Hadoop, Spark, Giraph, Pig …
Data Storage: HDFS, HBase
Data, Streaming, Batch …..
Mahout, R
Similar to 4, which is batch
Data Access Patterns: Science Examples
5A. Perform interactive analytics on observational scientific data
Grid or Many Task Software, Hadoop, Spark, Giraph, Pig …
Data Storage: HDFS, HBase, File Collection
Streaming Twitter data for Social Networking
Science Analysis Code, Mahout, R
Transport batch of data to primary analysis data system
Record scientific data in the "field"
Local accumulation and initial computing
Direct Transfer
The following examples are LHC, Remote Sensing, Astronomy and Bioinformatics
Particle Physics (LHC)
LHC data analysis processes ~30 petabytes of data per year produced at CERN, using ~300,000 cores around the world
Data are reduced in size, replicated, and examined by physicists
Astronomy – Dark Energy Survey I
Victor M. Blanco Telescope in Chile, where the new wide-angle 570-megapixel camera DECam is installed
https://indico.cern.ch/event/214784/session/5/contribution/410
The data end up as part of the International Virtual Observatory (IVOA), a collection of interoperating data archives and software tools that use the internet to form a scientific research environment in which astronomical research programs can be conducted.
Astronomy – Dark Energy Survey II
For DES (Dark Energy Survey) the data are sent from the mountaintop via a microwave link to La Serena, Chile. From there, an optical link forwards them to NCSA (UIUC) as well as NERSC (LBNL) for storage and "reduction". There, galaxies and stars in both the individual and stacked images are identified and catalogued, and finally their properties are measured and stored in a database.
DES Machine room at NCSA
Astronomy: Hubble Space Telescope
http://asd.gsfc.nasa.gov/archive/hubble/a_pdf/news/facts/FS14.pdf
HST processing in Baltimore, MD
CReSIS Remote Sensing: Radar Surveys
Expeditions last 1-2 months and gather up to 100 TB of data. Most is saved on removable disks and flown back to the continental US at the end. A sample is analyzed in the field to check the instruments.
Gene Sequencing
Sequencing devices (e.g. Illumina) distributed across the world in many laboratories take data in the form of "reads" that are aligned into a full sequence
This processing is often local, but the data need to be compared with the world's other gene data, so they are uploaded to a central repository
Illumina HiSeq X 10 can sequence 18,000 genomes per year at $1000 each. It produces 0.6 terabases per day
Remaining general access patterns
6. Visualize data extracted from horizontally scalable Big Data store
Hadoop, Spark, Giraph, Pig …
Data Storage: HDFS, HBase
Mahout, R
Prepare Interactive Visualization
Orchestration Layer
Specify Analytics
Interactive Visualization
7. Move data from a highly horizontally scalable data store into a traditional Enterprise Data Warehouse
Streaming Data
OLTP Database
Web Services
Transform with Hadoop, Spark, Giraph …
Data Storage: HDFS, HBase, (RDBMS)
Enterprise Data Warehouse
Data Warehouse Query
Moving to EDW Example from Teradata
Moving data from HDFS to Teradata Data Warehouse and Aster Discovery Platform
http://blogs.teradata.com/data-points/announcing-teradata-aster-big-analytics-appliance/
8. Extract, process, and move data from data stores to archives
http://www.dzone.com/articles/hadoop-t-etl
ETL is Extract, Transform, Load
Streaming Data
OLTP Database
Web Services
Transform with Hive, Drill, Hadoop, Spark, Giraph, Pig …
Data Storage: HDFS, HBase, RDBMS
Archive
Transform as needed
9. Combine data from Cloud databases and on-premise data stores for analytics, data mining, and/or machine learning
Hadoop, Spark, Giraph, Pig …
Data Storage: HDFS, HBase
Mahout, R
Similar to 4 and 5
On-premise Data
Streaming Data
Example: Integrate Cloud and local data
http://wikibon.org/w/images/2/20/Cloud-BigData.png
10. Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager
Hadoop, Spark, Giraph, Pig …
Data Storage: HDFS, HBase
Analytic-1
Analytic-2
Orchestration Layer (Workflow)
Specify Analytics Pipeline
Analytic-3
(Visualize)
This can be used for science by adding data staging phases as in case 5A
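An orchestration layer of this kind can be sketched with Python's standard `graphlib` module (Python 3.9+), which orders tasks so that each runs after the tasks it depends on. The function `run_pipeline`, the task names, and the toy analytics are hypothetical. This sketch runs tasks one at a time, although Analytic-1 and Analytic-2 are independent and could run in parallel.

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks, deps, data):
    """Run each task after its dependencies, passing their results as inputs."""
    results = {}
    order = []
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, ())]
        results[name] = tasks[name](data, *inputs)
        order.append(name)
    return results, order

# Pipeline specification: task name -> names of tasks it depends on.
DEPS = {
    "analytic_1": [],
    "analytic_2": [],
    "analytic_3": ["analytic_1", "analytic_2"],
}

TASKS = {
    "analytic_1": lambda data: sum(data),                 # total
    "analytic_2": lambda data: max(data),                 # peak
    "analytic_3": lambda data, total, peak: peak / total, # share of the peak
}

if __name__ == "__main__":
    results, order = run_pipeline(TASKS, DEPS, [1, 2, 5])
    print(order, results["analytic_3"])
```

Data staging phases, as in case 5A, could be added as extra entries in `DEPS` and `TASKS` that the analytics depend on.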
Example from Hortonworks
http://hortonworks.com/hadoop/yarn/