1 of 101

Relational Databases to the Modern Data Stack

Saurav Chhatrapati

4/25/22

CS 186

4 of 101

Agenda

Goals

  • Discuss the larger data story that RDBMS fit into
  • Go into some technical details about current cloud data warehouses
  • Draw connections between 186 topics and industry

Non-Goals

  • Discuss tradeoff between relational vs. NoSQL
  • Explain technical innovation behind scaling relational databases (CockroachDB, PlanetScale)

5 of 101

Let’s start at the beginning…

6 of 101

Phase 1*: Relational Databases

8 of 101

What were the initial use cases for relational databases?

  • Banking
  • Ticket Reservations (Flights)
  • Retail Inventory

9 of 101

1970s: The Relational Story Starts

  • 1960s: Data stored in mainframe computers became expensive to manage
  • 1970: Codd introduces the idea of the relational data model
    • IBM was not convinced and continued to promote IMS (Information Management System), a hierarchical database originally developed in 1966 to track the millions of parts and materials used by NASA’s Apollo Program
    • IBM’s reluctance kept System R from becoming a product until 1980
  • 1973: Michael Stonebraker and Eugene Wong began Ingres at Berkeley
    • Its successor project later became Postgres
  • 1976: Honeywell shipped Multics Relational Data Store, the first commercial relational database
  • 1979: Oracle releases their RDBMS product

10 of 101

Relational Database Genealogy

11 of 101

The Data Story - Foundations

  • Data Storage, Data Retrieval → Relational Model & RDBMS

14 of 101

What do the database access patterns look like for these applications?

Applications

  • Banking
  • Ticket Reservations (Flights)
  • Retail Inventory

Workload

  • Frequent writes, updates
  • Reads consisting of simple predicates
    • Frequent index lookups
  • Short transactions
  • Lack of expensive aggregates
  • Performance is critical, as these operations are on the critical path

Online Transaction Processing (OLTP)
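The shape of this workload can be sketched with Python’s built-in sqlite3; the accounts table and amounts are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The primary key gives us the index that simple predicates will hit.
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 300)])

# A short OLTP transaction: transfer 100 from account 1 to account 2.
# Simple predicates (id = ...) are index lookups; no expensive aggregates.
with con:  # commits on success, rolls back on exception
    con.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    con.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")

balances = dict(con.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

The whole transaction touches two indexed rows and commits; there is no scan or aggregate in sight.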

16 of 101

Decision Support Systems (DSS)

  • Area of research starting in the 1970s around how “computer-based” systems can aid organizational decisions
  • Following trends in the 1980s
    • Adoption of RDBMS spreads across industries
    • Increase in data managed by enterprises from MB-GB to TB-PB
    • Ability to analyze and synthesize information becomes a priority
    • Number of individuals (e.g. business analysts) who need to perform analysis grows

Microsoft releases Excel in 1985

17 of 101

Online Analytical Processing (OLAP)

1990s: Codd and others define a new workload category: OLAP

20 of 101

Online Analytical Processing (OLAP)

  • Focus on reading and summarizing large volumes of data to understand trends and patterns
  • Read-only queries

AKA Data Warehouse
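An OLAP query is the opposite shape: scan everything, summarize, write nothing. A minimal sketch (the sales table is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("west", 100), ("west", 250), ("east", 400), ("east", 50)])

# Read-only: scan the whole table and summarize by region.
totals = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```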

21 of 101

Phase 2: Data Warehouses

22 of 101

Data Warehouse

  • Architecture developed in the 1980s to serve business intelligence
  • Data from different sources brought into a central location where it can enable analysis
  • Stores historical data
  • Optimized for bulk loads

23 of 101

Data Warehouse

  • Architecture developed in the 1980s to serve business intelligence
  • Data from different sources brought into a central location where it can enable analysis
  • Stores historical data
  • Optimized for bulk loads
  • Challenge: Systems designed for OLTP now needed to support OLAP.

25 of 101

Massively Parallel Processing (MPP)

Teradata releases DBC/1012 in 1984

Gamma Database project starts in early 1980s at University of Wisconsin

26 of 101

Teradata - DBC/1012

  • Founded in 1979 out of research at Caltech
  • Believed there was a market for databases as large as 1 TB
  • Complemented by innovation in microprocessors at Intel and Motorola → Compute is becoming cheaper
  • Key idea was to use a parallel microprocessor architecture

28 of 101

Teradata - DBC/1012

Shared Nothing
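Shared-nothing partitioning can be sketched as hash routing: each key is owned by exactly one node, so nodes share no disk or memory and scans parallelize. A toy sketch (node count and keys are invented; CRC32 stands in for the system’s real hash function):

```python
import zlib

NUM_NODES = 4

def node_for(key: str) -> int:
    # Deterministic hash routing: each key lives on exactly one node.
    return zlib.crc32(key.encode()) % NUM_NODES

# Partition some rows across the nodes; every row lands on exactly one node.
partitions = {n: [] for n in range(NUM_NODES)}
for key in ["acct-1", "acct-2", "acct-3", "acct-4"]:
    partitions[node_for(key)].append(key)
```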

32 of 101

Teradata - DBC/1012

60 in. x 27 in.

Weighed >600 lbs

33 of 101

Teradata - DBC/1012

60 in. x 27 in.

Weighed >600 lbs

How much did it cost?

  1. $100k
  2. $500k
  3. $1M
  4. $10M
  5. $20M

35 of 101

The Data Story - Analytics

  • Data Storage, Data Retrieval → Relational Model & RDBMS
  • Analytics → MPP & early data warehouses

36 of 101

Phase 3: Big Data

37 of 101

Why Big Data?

  • Early 2000s brings unprecedented amount of data
  • Web companies need to store lots of information and run compute over it
    • Ex: Google needs to maintain their web crawl inverted index
  • Why not use RDBMS?
    • The Business Reasons
      • Too expensive
      • No open-source parallel DBMS
    • The Technical Reasons
      • Scale: 100 PB index, 3.5B searches/day, 2.5M servers (2016/17)
      • Lots of non-relational data: html, images, video, etc.
      • Wanted to try a different programming model than SQL

39 of 101

Google File System (GFS)

  • Google’s proprietary distributed file system
  • Large, append-only files with sequential scan access
  • Focus on delivering:
    • Scalability
    • Reliability
    • Availability

[Diagram: Blocks 1–3 of a file, each replicated three times across Nodes 1–5]

“Component failures are the norm rather than the exception.”
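The replication in the figure can be sketched as placing each block on three of the five nodes; the round-robin policy below is illustrative, not GFS’s actual placement algorithm:

```python
NUM_NODES = 5
REPLICATION = 3

def place(block_id: int) -> list[int]:
    # Put each block on 3 distinct nodes, so losing any two nodes
    # still leaves a copy of every block.
    start = block_id % NUM_NODES
    return [(start + i) % NUM_NODES for i in range(REPLICATION)]

placement = {b: place(b) for b in (1, 2, 3)}
```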

40 of 101

MapReduce

Google publishes a paper in 2004

  1. Read TB to PBs of data
  2. Map: Extract information from each record
  3. Shuffle and Sort
  4. Reduce: Aggregate, summarize, filter, transform
  5. Write results back to file system
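The five steps can be sketched as a single-process word count, the canonical MapReduce example:

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):                    # 2. Map: emit (word, 1) per record
    return [(word, 1) for word in line.split()]

def shuffle(pairs):                  # 3. Shuffle and sort: group by key
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_fn(key, values):          # 4. Reduce: aggregate per key
    return key, sum(values)

lines = ["the quick brown fox", "the fox ate the mouse"]   # 1. Read input
pairs = chain.from_iterable(map_fn(line) for line in lines)
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())  # 5. Output
```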

41 of 101

MapReduce

[Diagram: word count. Input splits — “the quick brown fox”, “the fox ate the mouse”, “how now brown cow” — are read from the global file system; Map tasks emit (word, 1) pairs to local disks; the shuffle routes each word to one Reduce task; the reducers write the totals (ate 1, brown 2, cow 1, fox 2, how 1, mouse 1, now 1, quick 1, the 3) back to the global file system]

42 of 101

Hadoop

  • Open-source implementations of the distributed file system and MapReduce
    • Hadoop Distributed File System (HDFS)
  • Released in 2006 and scaled at Yahoo

44 of 101

Challenges to MapReduce and Hadoop

  • Difficult to write complex queries; everything must be expressed as map and reduce steps
  • Hadoop writes intermediate results to disk, making jobs extremely slow

45 of 101

Spark

  • Started at Berkeley’s AMPLab (later the RISELab)
  • Replaces Hadoop MapReduce, but still runs over HDFS
  • Key innovations
    • API is closer to relational algebra; the later Spark SQL release allows even more expressive queries
    • Caches data in memory when possible
  • Databricks founded to commercialize Spark
    • An important player we will come back to

47 of 101

Pushback to Big Data

Lacks

  • Bulk loader, Indexing, Updates, Transactions

Incompatible with

  • Report writers, BI tools

48 of 101

tl;dr Big Data Did Not Make RDBMS Irrelevant

db-engines.com/en/ranking

49 of 101

The Data Story - Scalable Unstructured Storage

  • Data Storage, Data Retrieval → Relational Model & RDBMS
  • Analytics → MPP & early data warehouses
  • Scalable Unstructured Data Storage → DFS, MapReduce

50 of 101

Phase 4: Cloud Data Warehouses

52 of 101

What’s up with the Cloud?

  • 2002: Amazon introduces online retail and realizes they can use compute and storage resources a lot more efficiently
  • 2006: Amazon launches Amazon Web Services (AWS)
    • Elastic Compute Cloud (EC2) - Compute VMs
    • Simple Storage Service (S3) - Scalable file store
  • 2008: Google launches Google Cloud Platform (GCP)
  • 2010: Microsoft launches Azure
  • 2010: NASA and Rackspace Hosting launch open source OpenStack

53 of 101

Data Lake

  • Not all data can be easily structured
    • E.g. events, logs, IoT signals
  • Data Lake refers to central location of data in its native, raw format
  • Typically built over an object store, such as AWS S3

54 of 101

AWS Changes the Game - Redshift

  • Launched in Feb. 2013 as a cloud data warehouse
  • Initially priced starting at $1000/TB/year
  • Immediately became fastest growing service AWS offered at the time
  • “Amazon Redshift was designed to bring data warehousing to a mass market by making it easy to buy, easy to tune and easy to manage while also being fast and cost-effective.”

55 of 101

AWS Redshift Internals

  • Data Plane: MPP database engine based on ParAccel
    • Storage and compute distributed across several nodes
    • Designed to scale from 100s of GB to PBs
  • Data blocks replicated within the database instance and in S3
  • Query plan is generated at the leader node and sent to all compute nodes participating in the query
  • Control Plane: Essentially manages the cluster
    • Responsible for adding or removing nodes, logging, collecting metrics

57 of 101

Snowflake

  • Founded in July 2012
    • 2 co-founders were data architects at Oracle
  • Focused on separating storage from the compute
    • Computation could be performed in a different node (or even cloud) than where the data was stored
  • Originally ran on AWS, but has expanded to Azure and GCP
  • Are they
    • Customers?
    • Competitors?
    • Partners?

Largest software IPO at $70B

58 of 101

Snowflake Internals

  • Supports semi-structured data formats like JSON and Avro, automatic schema discovery
  • Storage and compute resources can scale independently
    • Example storage layer: AWS S3
  • Invested in local caching and skew resilience rather than developing their own HDFS-like service

59 of 101

Snowflake Internals

  • Virtual Warehouses
    • Cluster of EC2 instances
  • VMs launched on demand for expensive operations, like bulk loading
  • Worker nodes maintain cache of table data on local disk
  • Execution engine optimizations
    • Columnar storage over row-wise storage (not a novel idea)
    • Vectorized execution; avoids materializing intermediate results
  • Concurrency control managed by storage layer
    • Provides Snapshot Isolation (SI)
  • No indices; instead, Snowflake uses min-max based pruning (small materialized aggregates)
    • System maintains data distribution information about each chunk of data
  • “We discovered a very popular model, where organizations would use Hadoop for two things: for storing JSON, and for converting it to a format that can be loaded into an RDBMS.”
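A sketch of min-max pruning, with invented chunk contents: each chunk keeps the min and max of a column, and chunks whose range cannot contain the predicate value are skipped without being read:

```python
# Each chunk records (min, max) for a column; scan only chunks whose
# range could contain the predicate value.
chunks = [
    {"rows": [3, 7, 9],    "min": 3,  "max": 9},
    {"rows": [12, 15, 18], "min": 12, "max": 18},
    {"rows": [21, 25],     "min": 21, "max": 25},
]

def scan_equal(value):
    hits = []
    for chunk in chunks:
        if chunk["min"] <= value <= chunk["max"]:   # otherwise: pruned, never read
            hits.extend(r for r in chunk["rows"] if r == value)
    return hits

def chunks_pruned(value):
    return sum(1 for c in chunks if not (c["min"] <= value <= c["max"]))
```

A point lookup reads one chunk instead of three, which is why Snowflake can skip maintaining conventional indices.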

60 of 101

Databricks

  • Founded in 2013 by researchers at Berkeley, the creators of Apache Spark
  • Financials:
    • Raised $1.6B at $38B valuation in August 2021 Series H
    • Announced $800M ARR at end of 2021
  • First started as a way to manage Spark jobs
  • Core value initially came from collaborative notebooks
    • Let data scientists focus on the data science, and not on the infrastructure
  • Added integrations with existing data lakes
  • Developed a Lakehouse to store structured and unstructured data
    • Competing directly with Snowflake

61 of 101

The Data Story - Scalable Cheap Structured Storage

  • Data Storage, Data Retrieval → Relational Model & RDBMS
  • Analytics → MPP & early data warehouses
  • Scalable Unstructured Data Storage → DFS, MapReduce
  • Scalable Structured Data Storage → Cloud data warehouses

62 of 101

Phase 5: The Modern Data Stack

63 of 101

Goals of the Modern Data Stack

  • Deal with diverse variety of data coming from 10s to 100s of different sources
  • Extract relevant insights from operational data
  • Share insights in an effective manner that turns them into actions
  • Context should be kept within the organization
    • No dangling analysis done in a spreadsheet
  • Make the process repeatable and resilient

64 of 101

Building the MDS: Sources

  • OLTP
  • ERP (Salesforce, Netsuite)
  • Operational Apps (Salesforce, Hubspot)
  • Event Collectors (Segment)
  • Logs
  • 3rd Party APIs (Stripe)
  • File/Object Storage

67 of 101

Building the MDS: Data Warehouse

How does the data actually get from the sources to storage?

  • Process called Extract Transform Load (ETL) or Extract Load Transform (ELT)
    • Big debate between ETL vs. ELT
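A minimal ELT sketch, with sqlite3 standing in for the warehouse and invented table/field names: load the raw, untyped records first, then transform with SQL inside the warehouse (the dbt-style step):

```python
import sqlite3

raw_events = [{"username": "a", "amount": "10"},
              {"username": "b", "amount": "not-a-number"}]

wh = sqlite3.connect(":memory:")
# Extract + Load: land the raw, untyped data as-is.
wh.execute("CREATE TABLE raw_events (username TEXT, amount TEXT)")
wh.executemany("INSERT INTO raw_events VALUES (?, ?)",
               [(e["username"], e["amount"]) for e in raw_events])

# Transform: clean and type inside the warehouse, after loading.
wh.execute("""
    CREATE TABLE events AS
    SELECT username, CAST(amount AS INTEGER) AS amount
    FROM raw_events
    WHERE amount GLOB '[0-9]*'
""")
clean = list(wh.execute("SELECT username, amount FROM events"))
```

In ETL the cleaning step would instead run before the load, outside the warehouse.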

72 of 101

Building the MDS: Business Intelligence

  • Sources: OLTP, ERP (Salesforce, Netsuite), Operational Apps (Salesforce, Hubspot), Event Collectors (Segment), Logs, 3rd Party APIs (Stripe), File/Object Storage
  • Ingestion and Transport: Data Replication (Fivetran, Stitch)
  • Storage: Data Warehouse (Snowflake, Redshift)
  • Transformation: dbt
  • Analysis: DS/ML Tools (Databricks, Sagemaker)
  • BI: Dashboards (Looker, Tableau)

73 of 101

What’s new?

75 of 101

How does the MDS empower data scientists?

It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.

– DJ Patil, Data Jujitsu, O’Reilly Press 2012

77 of 101

Innovation in Storage Has Enabled Innovation In Other Areas

https://blog.getdbt.com/future-of-the-modern-data-stack/

78 of 101

The Entire Stack

  • Workflow Managers become important to orchestrate analytics jobs
  • Reverse ETL is about putting data back into applications like Salesforce, Hubspot
  • Growing areas of governance, discovery, observability, security

79 of 101

The Data Industry: By the Numbers

80 of 101

The Data Industry

81 of 101

Architecture Shifts in Data

https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/

82 of 101

The Data Space

83 of 101

The Data Story - Actionable Insights

  • Data Storage, Data Retrieval → Relational Model & RDBMS
  • Analytics → MPP & early data warehouses
  • Scalable Unstructured Data Storage → DFS, MapReduce
  • Scalable Structured Data Storage → Cloud data warehouses
  • Actionable Insights → Modern Data Stack

84 of 101

The Data Story Continues?


88 of 101

What’s Next?

  • Continued innovation in the data warehouse
  • Some companies choose to adopt a data lakehouse architecture
  • Real-time use cases become a larger part of the MDS
  • SQL remains the core part of the MDS, and becomes more closely coupled with ML

90 of 101

Where do I work?

Managed ML

  • Let data scientists do data science and deliver value for the organization
  • Abstract away the data engineering and ops involved with putting models into production

92 of 101

Where do I work?

  • Vikram Sreekanti, CEO
  • Chenggang Wu, CTO
  • Joe Hellerstein, Chief Scientist
  • Joey Gonzalez, VP Product

We work on:

  • systems infrastructure design & development
  • operating production services
  • developer tools & SDKs
  • UI design & frontend engineering
  • machine learning frameworks and infrastructure

93 of 101

We’re Hiring!

94 of 101

Conclusion

  • Went over motivation behind RDBMS, data warehouses & data lakes on-premise and in the cloud, big data, modern data stack
  • Touched on cloud data warehouse internals
  • Connections to 186
    • Motivation for partitioning, parallel join algorithms, bulk loading
    • Column stores changing assumptions made about record format
    • Understanding applications that databases serve today
  • Trends observed
    • Compute is really really cheap today
    • Easier to pay someone for a service, rather than managing your own infrastructure

96 of 101

Conclusion

[Diagram: the Database Management System — SQL Client → Query Parsing & Optimization → Relational Operators → Files and Index Management → Buffer Management → Disk Space Management → Database, with Concurrency Control and Recovery as cross-cutting components]

98 of 101

Questions?

99 of 101

Sources

  • Joe Hellerstein’s 186 Slides
  • Aditya Parameswaran and Alvin Cheung’s 186 Slides

100 of 101

Additional Reading

101 of 101

Thanks!