Data Platform of the Future

November 12, 2021

Rupert Berk, UWash

Satya Kunta, NYU

Ken Taylor, UIUC

Ashish Pandit, UCSD

Agenda

  • Introductions
  • Shifting Technical Capabilities for Data Platforms
    • Towards a Unified Data Platform
  • Next Gen Data Analytics Platforms
    • UCSD
    • NYU
  • Data Governance (@ UIUC)
  • Discussion

Shifting Technical Capabilities for Data Platforms

Event streams support real-time and near-real-time reporting and alerting, as well as incremental ETL scenarios (small batches).
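
As a minimal sketch of the "incremental ETL (small batches)" idea, the snippet below groups an event stream into small batches and applies each one to a keyed target. The names `micro_batches` and `load_increment`, the event shape, and the in-memory `table` are illustrative assumptions, not part of any platform described in these slides.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive batches of at most `size` events from a stream."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_increment(target: dict, batch: List[dict]) -> None:
    """Apply one small batch to a keyed target (a stand-in for a table)."""
    for event in batch:
        target[event["id"]] = event["value"]  # later events overwrite earlier ones


# Hypothetical stream of seven events spread over three keys.
stream = ({"id": i % 3, "value": i} for i in range(7))
table: dict = {}
batch_sizes = []
for batch in micro_batches(stream, size=3):
    load_increment(table, batch)
    batch_sizes.append(len(batch))
```

Each batch is loaded as soon as it fills, so the target never waits for a full nightly extract.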

HTTP APIs speed development, provide fast access, and enable efficient governance.

The rapid growth in AI/ML offerings by cloud service providers promises predictive and even prescriptive analytics, in addition to traditional descriptive or diagnostic analytics.

Data lakes promise faster experimentation and innovation by expanding analysis of raw data by analysts and data scientists.

  • Modeling: Data Warehouse -> Data Lake
  • Persistence: RDBMS -> Object storage, document, graph, in-memory
  • Transformation: ETL -> ELT
  • Increments: Batch -> Events & Streams
  • Coding: Traditional Programming -> Machine Learning
  • Interfaces: Files & SQL -> APIs

Towards a Unified Data Platform

Event Broker

Data Sources

Batch-Driven Apps

Event-Driven Apps

Data Persistence Service

Raw Zone

Curated Zone

Usage Optimized Zone

Stream Processing

Batch Processing

ML

ML Inference

Query API

Stream Query API

User Interfaces

Data Science Workbench

Data Visualization

Dashboards

CDC

Database

File

Event

IoT

UCSD’s Data Analytics Platform

Hierarchy manager

<- Hierarchy slot ID + [attributes]

Hierarchy slot attributes ->

Curated views (CVs)

  • Built off of activity records only
  • No base tables
  • CVs are built on top of viewlets
  • CVs can also be built on top of other CVs
  • Viewlet reuse should be high
  • Reuse should be at the highest level
  • CVs eliminate the need for users to write joins
  • CVs are normally materialized
  • Viewlets can also be materialized
  • CVs handle duplicate activities (idempotency)
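
The bullets above can be sketched roughly as follows: viewlets are small reusable selections over activity records, and a curated view composes them so the consumer never writes the join. Everything here (the record fields, `of_type`, `enrollment_cv`) is a hypothetical illustration, and "materialized" is reduced to storing the computed result in a variable.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Viewlet = Callable[[Iterable[Record]], List[Record]]


def dedupe(records: Iterable[Record]) -> List[Record]:
    """Keep the first record seen for each activity_id (duplicate-safe)."""
    seen = set()
    unique = []
    for record in records:
        if record["activity_id"] not in seen:
            seen.add(record["activity_id"])
            unique.append(record)
    return unique


def of_type(kind: str) -> Viewlet:
    """Build a reusable viewlet that selects one kind of activity."""
    def viewlet(records: Iterable[Record]) -> List[Record]:
        return [r for r in dedupe(records) if r["type"] == kind]
    return viewlet


# Two viewlets, reusable by any number of curated views.
students = of_type("student")
enrollments = of_type("enrollment")


def enrollment_cv(records: List[Record]) -> List[Record]:
    """Curated view: enrollments already joined to student names."""
    names = {s["student_id"]: s["name"] for s in students(records)}
    return [
        {"course": e["course"], "student": names.get(e["student_id"])}
        for e in enrollments(records)
    ]


activities = [
    {"activity_id": "a1", "type": "student", "student_id": 1, "name": "Ada"},
    {"activity_id": "a2", "type": "enrollment", "student_id": 1, "course": "CS101"},
    # Same activity delivered twice; the CV must not double-count it.
    {"activity_id": "a2", "type": "enrollment", "student_id": 1, "course": "CS101"},
]
materialized = enrollment_cv(activities)
```

Because the view is derived only from activity records, it can be rebuilt at any time by rerunning it over the same log.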

Machine learning platform (MLP)

Stream in ->

<- Message out

<- Model development ->

Source systems/devices

  • Emit from point of entry, full incremental

or

  • Simulate incremental from DB
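
One possible way to "simulate incremental from DB" is to compare two successive snapshots of a table and emit the differences as add, update, and delete events. This is a sketch under that assumption; the snapshot shape (primary key mapped to row) and the `diff_snapshots` name are invented for illustration.

```python
from typing import Dict, List

Snapshot = Dict[str, dict]  # primary key -> row


def diff_snapshots(previous: Snapshot, current: Snapshot) -> List[dict]:
    """Derive add/update/delete events from two full snapshots."""
    events = []
    for key in sorted(current.keys() - previous.keys()):
        events.append({"op": "add", "key": key, "row": current[key]})
    for key in sorted(current.keys() & previous.keys()):
        if current[key] != previous[key]:
            events.append({"op": "update", "key": key, "row": current[key]})
    for key in sorted(previous.keys() - current.keys()):
        events.append({"op": "delete", "key": key})
    return events


# Hypothetical snapshots taken on two consecutive runs.
previous = {"1": {"name": "Ada"}, "2": {"name": "Bob"}}
current = {"1": {"name": "Ada L."}, "3": {"name": "Cy"}}
events = diff_snapshots(previous, current)
```

Unlike emitting from the point of entry, this approach only sees the net change between snapshots, so intermediate updates within one interval are lost.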

Base Views

Intermediate Viewlets

Curated Views

Final Curated Views

Curated Views (CVs)

iPaaS

  • Simple, parallel streams
  • Minimal hops, steps, merging
  • Save transformation for CVs
  • Easily restartable
  • Save extra data in a bag

Activity table (pile file)

  • Records have different lengths
  • Records have different fields
  • Records are added in the order they arrive
  • Adds, updates, deletes are different records
  • Records are from idempotent stream and can have duplicates
  • Records have unique identifiers for resolving duplicates
  • An activity table is a replayable log
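
The properties listed above can be illustrated with a small append-only log that is replayed to rebuild current state. This is a sketch, not the Activity Hub implementation; the `ActivityTable` class and the `activity_id`, `op`, and `key` fields are assumptions made for the example.

```python
from typing import Dict, List

CONTROL_FIELDS = ("activity_id", "op", "key")


class ActivityTable:
    """Append-only log; records may differ in fields and may be duplicated."""

    def __init__(self) -> None:
        self._records: List[dict] = []

    def append(self, record: dict) -> None:
        self._records.append(record)  # arrival order is preserved

    def replay(self) -> Dict[str, dict]:
        """Rebuild current state by replaying the log from the start."""
        state: Dict[str, dict] = {}
        seen_ids = set()
        for record in self._records:
            if record["activity_id"] in seen_ids:
                continue  # duplicate delivery, already applied
            seen_ids.add(record["activity_id"])
            key = record["key"]
            if record["op"] == "delete":
                state.pop(key, None)
            else:  # "add" or "update"
                fields = {k: v for k, v in record.items() if k not in CONTROL_FIELDS}
                state.setdefault(key, {}).update(fields)
        return state


log = ActivityTable()
log.append({"activity_id": "e1", "op": "add", "key": "u1", "name": "Ada"})
log.append({"activity_id": "e2", "op": "update", "key": "u1", "email": "ada@example.edu"})
log.append({"activity_id": "e2", "op": "update", "key": "u1", "email": "ada@example.edu"})
log.append({"activity_id": "e3", "op": "add", "key": "u2", "name": "Bob"})
log.append({"activity_id": "e4", "op": "delete", "key": "u2"})
state = log.replay()
```

Replaying the same log always yields the same state, which is what makes duplicate deliveries harmless.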

<- Message out

Stream in ->

Activity Hub architecture

NYU’s Data Analytics Platform

Industry Tipping Points

Infinite Compute & Storage

Machine Learning/AI

FaaS (Function as a Service)/Serverless

Real-Time/Streaming

The 3 V’s of Future Data Platforms

Velocity

    • Faster time to market
    • Quick insights
    • Faster ingestion (Change Data Capture)
    • Event-driven (no batch processing)

Volume

    • Faster processing
    • Infinitely scalable (compute and storage separation)
    • Little to no administration/maintenance
    • Automated backups/recovery & DR

Variety

    • Prescriptive, predictive analytics
    • Real-time, streaming
    • APIs, Data Science, ML & AI
    • Pay-per-use (subscription, pub-sub, monetization)

Conceptual Architecture

High Level Technical Architecture

Data Governance

Data Classification and Account Organizational Units

  • Data classifications
    • Public - Data that can be freely shared with the public
    • Internal - Data that could result in reputational damage for the university, its operations or assets; unpublished research data; intellectual property; data exempted from FOIA requests
    • Sensitive - Data that requires authorization to be accessed; FERPA student records; Employee information; information covered by NDAs; System and network configuration information
    • High risk - Health information (HIPAA), credit cards, bank accounts, Social Security numbers, driver’s license numbers, genetic/biometric information, government classified information
  • Legal considerations
    • AWS Regions are located worldwide under different laws and regulations
  • AWS Organizational Units (OU)
    • Allows accounts to be centrally managed by applying policies uniformly to all accounts within an OU
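
To make the link between classifications, regions, and OU-wide policies concrete, here is a hedged sketch that orders the four classifications and builds a region-restricting policy document for an OU. The `aws:RequestedRegion` condition key is a real AWS global condition key; the mapping from classification to allowed regions is purely hypothetical, since the slides do not state which regions each OU permits.

```python
import json
from enum import IntEnum
from typing import Iterable, Optional


class Classification(IntEnum):
    """The four classifications from the slide, ordered by risk."""
    PUBLIC = 1
    INTERNAL = 2
    SENSITIVE = 3
    HIGH_RISK = 4


# Hypothetical mapping: which regions an OU for this classification may use.
ALLOWED_REGIONS = {
    Classification.SENSITIVE: ["us-east-1", "us-east-2", "us-west-2"],
    Classification.HIGH_RISK: ["us-east-2"],
}


def strictest(labels: Iterable[Classification]) -> Classification:
    """A dataset takes the highest classification of anything it contains."""
    return max(labels)


def region_scp(level: Classification) -> Optional[dict]:
    """Build a deny-outside-allowed-regions policy, or None if unrestricted."""
    regions = ALLOWED_REGIONS.get(level)
    if regions is None:
        return None
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideAllowedRegions",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"StringNotEquals": {"aws:RequestedRegion": regions}},
        }],
    }


level = strictest([Classification.INTERNAL, Classification.HIGH_RISK])
policy = region_scp(level)
policy_json = json.dumps(policy, indent=2)
```

A production policy would normally exempt global services (for example IAM) from the deny statement; that detail is omitted here to keep the sketch short.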

Discussion