Relational Databases to the Modern Data Stack
Saurav Chhatrapati
4/25/22
CS 186
Agenda
Goals
Non-Goals
Let’s start at the beginning…
Phase 1*: Relational Databases
What were the initial use cases for relational databases?
1970s: The Relational Story Starts
Relational Database Genealogy
The Data Story - Foundations
Data Storage
Data Retrieval
Relational Model & RDBMS
What do the database access patterns look like for these applications?
Applications
Workload
Online Transaction Processing (OLTP)
Decision Support Systems (DSS)
Microsoft releases Excel in 1985
Online Analytical Processing (OLAP)
1990s: Codd and others define a new workload category: OLAP
AKA Data Warehouse
Phase 2: Data Warehouses
Data Warehouse
Massively Parallel Processing (MPP)
Teradata releases DBC/1012 in 1984
Gamma Database project starts in early 1980s at University of Wisconsin
Teradata - DBC/1012
Shared Nothing
60 in. x 27 in.
Weighed >600 lbs
How much did it cost?
The Data Story - Analytics
Data Storage
Data Retrieval
Relational Model & RDBMS
Analytics
MPP & Early data warehouses
Phase 3: Big Data
Why Big Data?
Google File System (GFS)
[Figure: five nodes (Node 1 to Node 5) holding three replicas each of Block 1, Block 2, and Block 3]
“Component failures are the norm rather than the exception.”
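A minimal sketch of the replication idea in the figure above: each block is stored on several distinct nodes, so losing a node does not lose the data. The round-robin placement policy and the function names (`place_replicas`, `readable_blocks`) are assumptions made for illustration; the real GFS placement logic also weighs factors such as racks and disk utilization.

```python
import itertools

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only, not the GFS algorithm.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    node_cycle = itertools.cycle(nodes)
    # Consecutive picks from the cycle are distinct while replication <= len(nodes).
    return {block: [next(node_cycle) for _ in range(replication)]
            for block in blocks}

def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {block for block, holders in placement.items()
            if any(node not in failed_nodes for node in holders)}

nodes = ["Node 1", "Node 2", "Node 3", "Node 4", "Node 5"]
blocks = ["Block 1", "Block 2", "Block 3"]
placement = place_replicas(blocks, nodes)

# With three replicas per block, any two simultaneous node failures
# still leave every block readable.
survives_two_failures = all(
    readable_blocks(placement, set(pair)) == set(blocks)
    for pair in itertools.combinations(nodes, 2)
)
```

With three replicas, a block only becomes unreadable when all three of its holders fail at once, which is why treating failures as the norm is workable on commodity hardware.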
MapReduce
Google publishes a paper in 2004
Example: word count
Input splits (Global File System):
  Split 1: the quick brown fox
  Split 2: the fox ate the mouse
  Split 3: how now brown cow
Map output (Local disks):
  Map 1: the, 1 / brown, 1 / fox, 1 / quick, 1
  Map 2: the, 1 / fox, 1 / the, 1 / ate, 1 / mouse, 1
  Map 3: how, 1 / now, 1 / brown, 1 / cow, 1
Reduce output (Global File System):
  Reduce 1: brown, 2 / fox, 2 / how, 1 / now, 1 / the, 3
  Reduce 2: ate, 1 / cow, 1 / mouse, 1 / quick, 1
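The word count flow above can be sketched in a few lines of single-process Python. This is an illustration of the map, shuffle, and reduce steps only, not Google's or Hadoop's implementation: the function names are hypothetical, everything runs in memory rather than across machines and disks, and the CRC32 partitioner is an arbitrary deterministic choice, so words may land on different reducers than in the figure while the final counts stay the same.

```python
import zlib
from collections import defaultdict

def map_words(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def partition(word, num_reducers):
    """Pick a reducer for a key (assumed partitioner: CRC32 modulo reducers)."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Group the emitted counts by word, one bucket per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for word, count in pairs:
            buckets[partition(word, num_reducers)][word].append(count)
    return buckets

def reduce_counts(bucket):
    """Reduce: sum the counts collected for each word in one bucket."""
    return {word: sum(counts) for word, counts in bucket.items()}

def word_count(splits, num_reducers=2):
    """Run map, shuffle, and reduce, then merge the reducers' outputs."""
    map_outputs = [map_words(split) for split in splits]
    totals = {}
    for bucket in shuffle(map_outputs, num_reducers):
        totals.update(reduce_counts(bucket))  # each word lives in one bucket
    return totals

splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
counts = word_count(splits)
```

Because every occurrence of a word is routed to the same reducer, each reducer can sum its keys independently, which is what lets the reduce step run in parallel.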
Hadoop
Challenges to MapReduce and Hadoop
Spark
Pushback to Big Data
Lacks
Incompatible with
tl;dr Big Data Did Not Make RDBMS Irrelevant
db-engines.com/en/ranking
The Data Story - Scalable Unstructured Storage
Data Storage
Data Retrieval
Relational Model & RDBMS
Analytics
MPP & Early data warehouses
Scalable Unstructured Data Storage
DFS, MapReduce
Phase 4: Cloud Data Warehouses
What’s up with the Cloud?
Data Lake
AWS Changes the Game - Redshift
AWS Redshift Internals
Snowflake
Largest software IPO as of 2020, at a roughly $70B first-day valuation
Snowflake Internals
Databricks
The Data Story - Scalable Cheap Structured Storage
Data Storage
Data Retrieval
Relational Model & RDBMS
Analytics
MPP & Early data warehouses
Scalable Unstructured Data Storage
DFS, MapReduce
Scalable Structured Data Storage
Cloud data warehouses
Phase 5: The Modern Data Stack
Goals of the Modern Data Stack
Building the MDS: Sources
Sources:
OLTP
ERP (Salesforce, NetSuite)
Operational Apps (Salesforce, HubSpot)
Event Collectors (Segment)
Logs
3rd Party APIs (Stripe)
File/Object Storage
Building the MDS: Data Warehouse
Storage: Data Warehouse (Snowflake, Redshift)
How does the data actually get from the sources to storage?
Building the MDS: Ingestion and Transport
Ingestion and Transport: Data Replication (Fivetran, Stitch)
Building the MDS: Transformation
Transformation (dbt)
Building the MDS: Analysis
Analysis: DS/ML Tools (Databricks, SageMaker)
Building the MDS: Business Intelligence
BI: Dashboards (Looker, Tableau)
What’s new?
How does the MDS empower data scientists?
“It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data.”
– DJ Patil, Data Jujitsu, O’Reilly Press 2012
Innovation in Storage Has Enabled Innovation In Other Areas
https://blog.getdbt.com/future-of-the-modern-data-stack/
The Entire Stack
The Data Industry: By the Numbers
The Data Industry
Architecture Shifts in Data
https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/
The Data Space
The Data Story - Actionable Insights
Data Storage
Data Retrieval
Relational Model & RDBMS
Analytics
MPP & Early data warehouses
Scalable Unstructured Data Storage
DFS, MapReduce
Scalable Structured Data Storage
Cloud data warehouses
Actionable Insights
Modern Data Stack
The Data Story Continues?
…
What’s Next?
Where do I work?
Managed ML
Vikram Sreekanti, CEO
Chenggang Wu, CTO
Joe Hellerstein, Chief Scientist
Joey Gonzalez, VP Product
We work on:
We’re Hiring!
Conclusion
SQL Client
Database Management System:
Query Parsing & Optimization
Relational Operators
Files and Index Management
Buffer Management
Disk Space Management
Concurrency Control
Recovery
Database
Questions?
Sources
Additional Reading
Thanks!