1 of 101

Relational Databases to the Modern Data Stack

Saurav Chhatrapati

4/25/22

CS 186

4 of 101

Agenda

Goals

  • Discuss the larger data story that RDBMS fit into
  • Go into some technical details about current cloud data warehouses
  • Draw connections between 186 topics and industry

Non-Goals

  • Discuss tradeoff between relational vs. NoSQL
  • Explain technical innovation behind scaling relational databases (CockroachDB, PlanetScale)

5 of 101

Let’s start at the beginning…

6 of 101

Phase 1*: Relational Databases

8 of 101

What were the initial use cases for relational databases?

  • Banking
  • Ticket Reservations (Flights)
  • Retail Inventory

9 of 101

1970s: The Relational Story Starts

  • 1960s: Data stored in mainframe computers became expensive to manage
  • 1970: Codd introduces the idea of the relational data model
    • IBM was not convinced and continued to promote IMS (Information Management System), a hierarchical database originally developed in 1966 to track the millions of parts and materials used by NASA’s Apollo Program
    • IBM’s reluctance kept System R from becoming a product until 1980
  • 1973: Michael Stonebraker and Eugene Wong began Ingres at Berkeley
    • Its successor project later became Postgres
  • 1976: Honeywell shipped Multics Relational Data Store, the first commercial relational database
  • 1979: Oracle releases their RDBMS product

10 of 101

Relational Database Genealogy

11 of 101

The Data Story - Foundations

  • Data Storage, Data Retrieval → Relational Model & RDBMS

14 of 101

What do the database access patterns look like for these applications?

Applications

  • Banking
  • Ticket Reservations (Flights)
  • Retail Inventory

Workload

  • Frequent writes, updates
  • Reads consisting of simple predicates
    • Frequent index lookups
  • Short transactions
  • Lack of expensive aggregates
  • Performance is critical, as these operations are on the critical path

Online Transaction Processing (OLTP)
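The shape of this workload can be sketched with Python’s built-in sqlite3; the accounts table and amounts are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The primary key gives us the index that simple predicates will hit.
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 300)])

# A short OLTP transaction: transfer 100 from account 1 to account 2.
# Simple predicates (id = ...) are index lookups; no expensive aggregates.
with con:  # commits on success, rolls back on exception
    con.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    con.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")

balances = dict(con.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

The whole transaction touches two indexed rows and commits; there is no scan or aggregate in sight.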

16 of 101

Decision Support Systems (DSS)

  • Area of research starting in the 1970s around how “computer-based” systems can aid organizational decisions
  • Following trends in the 1980s
    • Adoption of RDBMS spreads across industries
    • Increase in data managed by enterprises from MB-GB to TB-PB
    • Ability to analyze and synthesize information becomes a priority
    • Number of individuals (e.g. business analysts) who need to perform analysis grows

Microsoft releases Excel in 1985

17 of 101

Online Analytical Processing (OLAP)

1990s: Codd and others define a new workload category: OLAP

20 of 101

Online Analytical Processing (OLAP)

  • Focus on reading and summarizing large volumes of data to understand trends and patterns
  • Read-only queries

AKA Data Warehouse
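An OLAP query is the opposite shape: scan everything, summarize, write nothing. A minimal sketch (the sales table is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("west", 100), ("west", 250), ("east", 400), ("east", 50)])

# Read-only: scan the whole table and summarize by region.
totals = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```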

21 of 101

Phase 2: Data Warehouses

22 of 101

Data Warehouse

  • Architecture developed in the 1980s to serve business intelligence
  • Data from different sources brought into a central location where it can enable analysis
  • Stores historical data
  • Optimized for bulk loads

23 of 101

Data Warehouse

  • Architecture developed in the 1980s to serve business intelligence
  • Data from different sources brought into a central location where it can enable analysis
  • Stores historical data
  • Optimized for bulk loads
  • Challenge: Systems designed for OLTP now needed to support OLAP.

25 of 101

Massively Parallel Processing (MPP)

Teradata releases DBC/1012 in 1984

Gamma Database project starts in early 1980s at University of Wisconsin

26 of 101

Teradata - DBC/1012

  • Founded in 1979 out of research at Caltech
  • Believed there was a market for databases as large as 1 TB
  • Complemented by innovation in microprocessors at Intel and Motorola → Compute is becoming cheaper
  • Key idea was to use a parallel microprocessor architecture

28 of 101

Teradata - DBC/1012

Shared Nothing
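Shared-nothing partitioning can be sketched as hash routing: each key is owned by exactly one node, so nodes share no disk or memory and scans parallelize. A toy sketch (node count and keys are invented; CRC32 stands in for the system’s real hash function):

```python
import zlib

NUM_NODES = 4

def node_for(key: str) -> int:
    # Deterministic hash routing: each key lives on exactly one node.
    return zlib.crc32(key.encode()) % NUM_NODES

# Partition some rows across the nodes; every row lands on exactly one node.
partitions = {n: [] for n in range(NUM_NODES)}
for key in ["acct-1", "acct-2", "acct-3", "acct-4"]:
    partitions[node_for(key)].append(key)
```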

32 of 101

Teradata - DBC/1012

60 in. x 27 in.

Weighed >600 lbs

33 of 101

Teradata - DBC/1012

60 in. x 27 in.

Weighed >600 lbs

How much did it cost?

  1. $100k
  2. $500k
  3. $1M
  4. $10M
  5. $20M

35 of 101

The Data Story - Analytics

  • Data Storage, Data Retrieval → Relational Model & RDBMS
  • Analytics → MPP & early data warehouses

36 of 101

Phase 3: Big Data

37 of 101

Why Big Data?

  • Early 2000s brings unprecedented amount of data
  • Web companies need to store lots of information and run compute over it
    • Ex: Google needs to maintain their web crawl inverted index
  • Why not use RDBMS?
    • The Business Reasons
      • Too expensive
      • No open-source parallel DBMS
    • The Technical Reasons
      • Scale: 100 PB index, 3.5B searches/day, 2.5M servers (2016/17)
      • Lots of non-relational data: html, images, video, etc.
      • Wanted to try a different programming model than SQL

39 of 101

Google File System (GFS)

  • Google’s proprietary distributed file system
  • Large, append-only files with sequential scan access
  • Focus on delivering:
    • Scalability
    • Reliability
    • Availability

[Diagram: Blocks 1–3 of a file, each replicated three times across Nodes 1–5]

“Component failures are the norm rather than the exception.”
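The replication in the figure can be sketched as placing each block on three of the five nodes; the round-robin policy below is illustrative, not GFS’s actual placement algorithm:

```python
NUM_NODES = 5
REPLICATION = 3

def place(block_id: int) -> list[int]:
    # Put each block on 3 distinct nodes, so losing any two nodes
    # still leaves a copy of every block.
    start = block_id % NUM_NODES
    return [(start + i) % NUM_NODES for i in range(REPLICATION)]

placement = {b: place(b) for b in (1, 2, 3)}
```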

40 of 101

MapReduce

Google publishes a paper in 2004

  1. Read TB to PBs of data
  2. Map: Extract information from each record
  3. Shuffle and Sort
  4. Reduce: Aggregate, summarize, filter, transform
  5. Write results back to file system
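The five steps can be sketched as a single-process word count, the canonical MapReduce example:

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):                    # 2. Map: emit (word, 1) per record
    return [(word, 1) for word in line.split()]

def shuffle(pairs):                  # 3. Shuffle and sort: group by key
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_fn(key, values):          # 4. Reduce: aggregate per key
    return key, sum(values)

lines = ["the quick brown fox", "the fox ate the mouse"]   # 1. Read input
pairs = chain.from_iterable(map_fn(line) for line in lines)
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())  # 5. Output
```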

41 of 101

MapReduce

[Diagram: word count. Input splits — “the quick brown fox”, “the fox ate the mouse”, “how now brown cow” — are read from the global file system; Map tasks emit (word, 1) pairs to local disks; the shuffle routes each word to one Reduce task; the reducers write the totals (ate 1, brown 2, cow 1, fox 2, how 1, mouse 1, now 1, quick 1, the 3) back to the global file system]

42 of 101

Hadoop

  • Open-source implementations of the distributed file system and MapReduce
    • Hadoop Distributed File System (HDFS)
  • Released in 2006 and scaled at Yahoo

44 of 101

Challenges to MapReduce and Hadoop

  • Difficult to write complex queries; everything must be expressed as map and reduce steps
  • Hadoop writes intermediate results to disk, making jobs extremely slow

45 of 101

Spark

  • Started at Berkeley’s AMPLab (later the RISELab)
  • Replaces Hadoop MapReduce, but still runs over HDFS
  • Key innovations
    • API is closer to relational algebra; the later Spark SQL release allows even more expressive queries
    • Caches data in memory when possible
  • Databricks founded to commercialize Spark
    • An important player we will come back to

47 of 101

Pushback to Big Data

Lacks

  • Bulk loader, Indexing, Updates, Transactions

Incompatible with

  • Report writers, BI tools

48 of 101

tl;dr Big Data Did Not Make RDBMS Irrelevant

db-engines.com/en/ranking

49 of 101

The Data Story - Scalable Unstructured Storage

  • Data Storage, Data Retrieval → Relational Model & RDBMS
  • Analytics → MPP & early data warehouses
  • Scalable Unstructured Data Storage → DFS, MapReduce

50 of 101

Phase 4: Cloud Data Warehouses

52 of 101

What’s up with the Cloud?

  • 2002: Amazon introduces online retail and realizes they can use compute and storage resources a lot more efficiently
  • 2006: Amazon launches Amazon Web Services (AWS)
    • Elastic Compute Cloud (EC2) - Compute VMs
    • Simple Storage Service (S3) - Scalable file store
  • 2008: Google launches Google Cloud Platform (GCP)
  • 2010: Microsoft launches Azure
  • 2010: NASA and Rackspace Hosting launch open source OpenStack

53 of 101

Data Lake

  • Not all data can be easily structured
    • E.g. events, logs, IoT signals
  • Data Lake refers to central location of data in its native, raw format
  • Typically built over an object store, such as AWS S3

54 of 101

AWS Changes the Game - Redshift

  • Launched in Feb. 2013 as a cloud data warehouse
  • Initially priced starting at $1000/TB/year
  • Immediately became fastest growing service AWS offered at the time
  • “Amazon Redshift was designed to bring data warehousing to a mass market by making it easy to buy, easy to tune and easy to manage while also being fast and cost-effective.”

55 of 101

AWS Redshift Internals

  • Data Plane: MPP database engine based on ParAccel
    • Storage and compute distributed across several nodes
    • Designed to scale from 100s of GB to PBs
  • Data blocks replicated within the database instance and in S3
  • Query plan is generated at the leader node and sent to all compute nodes participating in the query
  • Control Plane: Essentially manages the cluster
    • Responsible for adding or removing nodes, logging, collecting metrics

57 of 101

Snowflake

  • Founded in July 2012
    • 2 co-founders were data architects at Oracle
  • Focused on separating storage from the compute
    • Computation could be performed in a different node (or even cloud) than where the data was stored
  • Originally ran on AWS, but has expanded to Azure and GCP
  • Are they
    • Customers?
    • Competitors?
    • Partners?

Largest software IPO at $70B

58 of 101

Snowflake Internals

  • Supports semi-structured data formats like JSON and Avro, automatic schema discovery
  • Storage and compute resources can scale independently
    • Example storage layer: AWS S3
  • Invested in local caching and skew resilience rather than developing their own HDFS-like service

59 of 101

Snowflake Internals

  • Virtual Warehouses
    • Cluster of EC2 instances
  • VMs launched on demand for expensive operations, like bulk loading
  • Worker nodes maintain cache of table data on local disk
  • Execution engine optimizations
    • Columnar storage over row-wise storage (not a novel idea)
    • Vectorized execution; avoids materializing intermediate results
  • Concurrency control managed by storage layer
    • Provides Snapshot Isolation (SI)
  • No indices; instead, Snowflake uses min-max based pruning (small materialized aggregates)
    • System maintains data distribution information about each chunk of data
  • “We discovered a very popular model, where organizations would use Hadoop for two things: for storing JSON, and for converting it to a format that can be loaded into an RDBMS.”
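A sketch of min-max pruning, with invented chunk contents: each chunk keeps the min and max of a column, and chunks whose range cannot contain the predicate value are skipped without being read:

```python
# Each chunk records (min, max) for a column; scan only chunks whose
# range could contain the predicate value.
chunks = [
    {"rows": [3, 7, 9],    "min": 3,  "max": 9},
    {"rows": [12, 15, 18], "min": 12, "max": 18},
    {"rows": [21, 25],     "min": 21, "max": 25},
]

def scan_equal(value):
    hits = []
    for chunk in chunks:
        if chunk["min"] <= value <= chunk["max"]:   # otherwise: pruned, never read
            hits.extend(r for r in chunk["rows"] if r == value)
    return hits

def chunks_pruned(value):
    return sum(1 for c in chunks if not (c["min"] <= value <= c["max"]))
```

A point lookup reads one chunk instead of three, which is why Snowflake can skip maintaining conventional indices.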

60 of 101

Databricks

  • Founded in 2013 by researchers at Berkeley, the creators of Apache Spark
  • Financials:
    • Raised $1.6B at $38B valuation in August 2021 Series H
    • Announced $800M ARR at end of 2021
  • First started as a way to manage Spark jobs
  • Core value initially came from collaborative notebooks
    • Let data scientists focus on the data science, and not on the infrastructure
  • Added integrations with existing data lakes
  • Developed a Lakehouse to store structured and unstructured data
    • Competing directly with Snowflake

61 of 101

The Data Story - Scalable Cheap Structured Storage

  • Data Storage, Data Retrieval → Relational Model & RDBMS
  • Analytics → MPP & early data warehouses
  • Scalable Unstructured Data Storage → DFS, MapReduce
  • Scalable Structured Data Storage → Cloud data warehouses

62 of 101

Phase 5: The Modern Data Stack

63 of 101

Goals of the Modern Data Stack

  • Deal with diverse variety of data coming from 10s to 100s of different sources
  • Extract relevant insights from operational data
  • Share insights in an effective manner that turns them into actions
  • Context should be kept within the organization
    • No dangling analysis done in a spreadsheet
  • Make the process repeatable and resilient

64 of 101

Building the MDS: Sources

  • OLTP
  • ERP (Salesforce, Netsuite)
  • Operational Apps (Salesforce, Hubspot)
  • Event Collectors (Segment)
  • Logs
  • 3rd Party APIs (Stripe)
  • File/Object Storage

67 of 101

Building the MDS: Data Warehouse

How does the data actually get from the sources to storage?

  • Process called Extract Transform Load (ETL) or Extract Load Transform (ELT)
    • Big debate between ETL vs. ELT
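A minimal ELT sketch, with sqlite3 standing in for the warehouse and invented table/field names: load the raw, untyped records first, then transform with SQL inside the warehouse (the dbt-style step):

```python
import sqlite3

raw_events = [{"username": "a", "amount": "10"},
              {"username": "b", "amount": "not-a-number"}]

wh = sqlite3.connect(":memory:")
# Extract + Load: land the raw, untyped data as-is.
wh.execute("CREATE TABLE raw_events (username TEXT, amount TEXT)")
wh.executemany("INSERT INTO raw_events VALUES (?, ?)",
               [(e["username"], e["amount"]) for e in raw_events])

# Transform: clean and type inside the warehouse, after loading.
wh.execute("""
    CREATE TABLE events AS
    SELECT username, CAST(amount AS INTEGER) AS amount
    FROM raw_events
    WHERE amount GLOB '[0-9]*'
""")
clean = list(wh.execute("SELECT username, amount FROM events"))
```

In ETL the cleaning step would instead run before the load, outside the warehouse.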

72 of 101

Building the MDS: Business Intelligence

  • Sources: OLTP, ERP (Salesforce, Netsuite), Operational Apps (Salesforce, Hubspot), Event Collectors (Segment), Logs, 3rd Party APIs (Stripe), File/Object Storage
  • Ingestion and Transport: Data Replication (Fivetran, Stitch)
  • Storage: Data Warehouse (Snowflake, Redshift)
  • Transformation: dbt
  • Analysis: DS/ML Tools (Databricks, Sagemaker)
  • BI: Dashboards (Looker, Tableau)

73 of 101

What’s new?

75 of 101

How does the MDS empower data scientists?

It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.

– DJ Patil, Data Jujitsu, O’Reilly Press 2012

77 of 101

Innovation in Storage Has Enabled Innovation In Other Areas

https://blog.getdbt.com/future-of-the-modern-data-stack/

78 of 101

The Entire Stack

  • Workflow Managers become important to orchestrate analytics jobs
  • Reverse ETL is about putting data back into applications like Salesforce, Hubspot
  • Growing areas of governance, discovery, observability, security

79 of 101

The Data Industry: By the Numbers

80 of 101

The Data Industry

81 of 101

Architecture Shifts in Data

https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/

82 of 101

The Data Space

83 of 101

The Data Story - Actionable Insights

  • Data Storage, Data Retrieval → Relational Model & RDBMS
  • Analytics → MPP & early data warehouses
  • Scalable Unstructured Data Storage → DFS, MapReduce
  • Scalable Structured Data Storage → Cloud data warehouses
  • Actionable Insights → Modern Data Stack

84 of 101

The Data Story Continues?


88 of 101

What’s Next?

  • Continued innovation in the data warehouse
  • Some companies choose to adopt a data lakehouse architecture
  • Real-time use cases become a larger part of the MDS
  • SQL remains the core part of the MDS, and becomes more closely coupled with ML

90 of 101

Where do I work?

Managed ML

  • Let data scientists do data science and deliver value for the organization
  • Abstract away the data engineering and ops involved with putting models into production

92 of 101

Where do I work?

  • Vikram Sreekanti, CEO
  • Chenggang Wu, CTO
  • Joe Hellerstein, Chief Scientist
  • Joey Gonzalez, VP Product

We work on:

  • systems infrastructure design & development
  • operating production services
  • developer tools & SDKs
  • UI design & frontend engineering
  • machine learning frameworks and infrastructure

93 of 101

We’re Hiring!

94 of 101

Conclusion

  • Went over motivation behind RDBMS, data warehouses & data lakes on-premise and in the cloud, big data, modern data stack
  • Touched on cloud data warehouse internals
  • Connections to 186
    • Motivation for partitioning, parallel join algorithms, bulk loading
    • Column stores changing assumptions made about record format
    • Understanding applications that databases serve today
  • Trends observed
    • Compute is really really cheap today
    • Easier to pay someone for a service, rather than managing your own infrastructure

96 of 101

Conclusion

[Diagram: the Database Management System — SQL Client → Query Parsing & Optimization → Relational Operators → Files and Index Management → Buffer Management → Disk Space Management → Database, with Concurrency Control and Recovery as cross-cutting components]

98 of 101

Questions?

99 of 101

Sources

  • Joe Hellerstein’s 186 Slides
  • Aditya Parameswaran and Alvin Cheung’s 186 Slides

100 of 101

Additional Reading

101 of 101

Thanks!