1 of 63

A Survey of Cloud Database Systems

C. Mohan, Distinguished Visiting Professor (Tsinghua University, Beijing, China); Member, Board of Governors (Digital University Kerala, India); Retired IBM Fellow (Silicon Valley, USA) & Former Shaw Visiting Professor (National University of Singapore); https://bit.ly/CMoDUK

Indian Institute of Science (IISc), Bangalore, India, 5 January 2022

@seemohan

1

C. Mohan, IISc BLR, 2022-01-05


Acks + More Info

  • Much of the material in this presentation was picked up from other sources
  • Complete source references will be added later
  • https://bit.ly/CMnusC - Check out reading list for my National University of Singapore (NUS) seminar course (CS6282 Practical Cloud/Parallel/Distributed Data and Computing Systems) I taught in 1H2021
  • This presentation is still very much a work-in-progress and hence incomplete!
    • Will be revising it significantly as I give it elsewhere
    • Watch out for my social media posts on revised slides + videos
    • Links to my social media accounts at https://bit.ly/CMoDUK


Relational Database Management Systems (RDBMSs)

  • My former IBM Research colleague in San Jose (California), Edgar F. Codd, developed the relational model of data and the theory behind it before the start of my IBM career
  • Other former colleagues then developed System R, which gave rise to the SQL language for data definition and data manipulation in a high-level, declarative manner
  • In parallel, at the University of California, Berkeley, Mike Stonebraker and colleagues developed Ingres and the language Quel, which differed from SQL
  • SQL won the standardization game making IBMers very happy!
  • Very early on, Codd received the Turing Award for his relational data model work
  • Another former IBMer, Jim Gray, also got Turing Award for his transaction and other work
  • Much later, Mike Stonebraker got the same award for his RDBMS and related work
  • RDBMSs and SQL have stood the test of time, in spite of emergence of NoSQL, and big data systems like Hadoop and Spark


Traditional DBMS Architecture


Storage Hierarchy


Different Workloads

  • Online Transaction Processing (OLTP)
  • Online Analytical Processing (OLAP)
  • Modern Application Requirements
  • Design Challenges of HTAP Systems
  • HTAP/OLTAP Systems
    • IBM DB2 BLU and IBM Wildfire/Db2 Event Store
    • MemSQL (SingleStore)
    • Oracle DBIM
    • SAP HANA
    • VoltDB
    • HyPer
    • TiDB


Online Transaction Processing (OLTP)

  • Online Transaction Processing (OLTP) systems have been around for 4 decades: TPF, IMS, CICS, …; much later, RDBMSs: DB2, Oracle, Sybase, Informix, …
  • Focus on short transactions (both reads and writes), typically not ad-hoc
  • Major emphasis on failure tolerance, performance (response time & throughput), security, priority-based scheduling, …
  • Various benchmarks have driven developments – Transaction Processing Performance Council (TPC): B, C, E, … http://tpc.org/information/benchmarks5.asp
  • Prolonged debate on system architectures: Shared Everything (SE), Shared Disks (SD), Shared Nothing (SN) – SD has won (DB2, Oracle)
  • Indexing has been crucial – typically, B+ trees but also hashing in some systems
  • Stored procedures used for high performance
  • Of late, leveraging main memory, SSD; compiling to machine code; …
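The short read-write transactions emphasized above can be illustrated with a minimal sketch: a debit/credit transfer that either commits atomically or rolls back. The accounts table and amounts are hypothetical, using SQLite only as a stand-in for an OLTP engine.

```python
import sqlite3

# Minimal sketch of an OLTP-style unit of work: a short transaction
# touching two rows, with commit on success and rollback on failure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Short read-write transaction: the typical OLTP workload unit."""
    try:
        cur = conn.cursor()
        cur.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        cur.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        # Enforce a constraint; abort the whole transaction on violation.
        (bal,) = cur.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        return False

assert transfer(conn, 1, 2, 30) is True      # commits
assert transfer(conn, 1, 2, 1000) is False   # rolls back atomically
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
# → {1: 70, 2: 80}: the failed transfer left no partial update behind
```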


Online Analytical Processing (OLAP)

  • Complex high-level queries for decision support, data mining, multi-dimensional analytics with data warehouses
  • Consolidation (roll-up), drill-down, and slicing and dicing
  • Ad hoc or repetitive long-running queries needing complex query optimization techniques
  • Indexing choices harder due to ad hoc nature of queries + maintenance cost
  • Precomputed views might be used
  • TPC DI, DS, H, x-HS, … benchmarks http://tpc.org/information/benchmarks5.asp
  • Periodic replication from OLTP systems, reduced data currency: delayed analytics
  • Parallel RDBMSs – DB2, Teradata, Netezza, Vertica, VectorWise
  • MOLAP, ROLAP, Hybrid systems – Hyperion, Essbase, Cognos, Business Objects, Microsoft Analysis Services, …
  • Modern ones are column stores, Hadoop systems; leverage GPUs, FPGAs, …
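The roll-up operation above can be sketched in a few lines: aggregate the same fact table at successively coarser levels, as a warehouse would for `GROUP BY ... ROLLUP`. The sales rows here are hypothetical.

```python
from collections import defaultdict

# Toy fact table: (region, product, amount)
sales = [
    ("APAC", "widgets", 10), ("APAC", "gadgets", 5),
    ("EMEA", "widgets", 7),  ("EMEA", "gadgets", 3),
]

def roll_up(rows, keyfn):
    """Aggregate amounts grouped by the given key: one roll-up level."""
    totals = defaultdict(int)
    for row in rows:
        totals[keyfn(row)] += row[2]
    return dict(totals)

by_region_product = roll_up(sales, lambda r: (r[0], r[1]))  # finest level
by_region = roll_up(sales, lambda r: r[0])                  # rolled up
grand_total = sum(r[2] for r in sales)                      # coarsest level

# Drill-down is the reverse direction: from by_region back to
# by_region_product.
assert by_region == {"APAC": 15, "EMEA": 10}
assert grand_total == 25
```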


Data Warehouse Architecture


Modern Data Architecture


HTAP Benefits


Tradeoffs in the New Age

http://www.w3resource.com/mongodb/nosql.php

“Split-Brain” Syndrome


Modern Application Requirements

  • Real-time operational intelligence processing
  • In rapidly changing databases, analytics on latest states of data
  • Hybrid Transactional/Analytical Processing (HTAP) on the same data with stringent response time and data currency requirements – aka OLTAP
  • Eliminate multiple copies of data
  • High volume of transactions, high ingest rate streaming of data
  • Ads/recommendation serving in mobile environments
  • Sensors at the edge of the network (a la Apache Edgent)
  • Public safety, risk management, fraud detection, real-time inventory/pricing, …
  • High availability in highly distributed environments with high rate of failures of different components
  • Stronger transactional guarantees than what NoSQL systems give


HTAP Systems: Design Challenges

  • Row versus Column formats
  • Logging, buffer management, and row-reconstruction overheads of the columnar format
  • I/O overheads and compression inefficiencies of the row format
  • In-place updates versus versioning: append-only versus true update workloads
  • Index maintenance overheads of versioning data
  • Strict consistency versus conflicting updates: a function of need for multi-data center configurations (globally distributed operations versus only localized operations)
  • Additional complications of public cloud
  • Scaling challenges with addition and deletion of nodes
  • Pains stemming from “brain-dead” but currently very popular file systems like HDFS!
  • Supporting recent as well as historical data
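The row-versus-column tradeoff in the first bullets can be made concrete with a small sketch: the same (hypothetical) data laid out row-wise and column-wise, showing why point access favors rows while scans and aggregates favor columns, and what row reconstruction costs in a column store.

```python
# Row store: each record contiguous (good for OLTP point access).
rows = [(i, f"name{i}", i * 10) for i in range(1000)]

# Column store: each attribute contiguous (good for OLAP scans).
ids     = [r[0] for r in rows]
names   = [r[1] for r in rows]
amounts = [r[2] for r in rows]

# OLTP-style point lookup touches one whole row.
record = rows[42]                        # (42, "name42", 420)

# OLAP-style aggregate touches only one column; a column store reads a
# fraction of the bytes a row store must fetch for the same query.
total = sum(amounts)

# Row-reconstruction overhead in a column store: stitching a record
# back together from the per-column arrays.
reconstructed = (ids[42], names[42], amounts[42])
assert reconstructed == record
```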


Traditional Parallel Database Architectures

Source: DB2 for z/OS: Data Sharing in a Nutshell, IBM Redbook SG24-7322-00, October 2006


Traditional Distributed Databases

  • Have been worked on for 5 decades
  • My first project at IBM Research in 1982 was R*, the distributed version of System R
    • Invented two-phase commit variations like Presumed Abort (PA) and Presumed Commit (PC)
    • Distributed query processing and optimization
    • Snapshots – aka materialized views in a distributed context
  • Replication – synchronous, asynchronous
  • Evolution from homogeneous database nodes to heterogeneous nodes – heterogeneous databases – global and local optimizations
  • Data partitioning – horizontal and vertical
  • Failure models – Byzantine failures included or not
  • Node autonomy considerations
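The Presumed Abort idea mentioned above can be sketched in miniature: the coordinator force-writes a log record only for commits, so after a crash any transaction with no log record is presumed aborted. This is a simulation with hypothetical participant objects, not the full protocol (acks and recovery paths are omitted).

```python
class Participant:
    def __init__(self, vote):
        self.vote = vote          # True = "yes", False = "no"
        self.state = "active"

    def prepare(self):
        """Phase 1: vote; a 'no' voter aborts unilaterally."""
        self.state = "prepared" if self.vote else "aborted"
        return self.vote

    def finish(self, decision):
        self.state = decision

def two_phase_commit(coordinator_log, participants):
    # Phase 1: collect votes from all participants.
    if all(p.prepare() for p in participants):
        coordinator_log.append("commit")   # force-write only on commit
        decision = "committed"
    else:
        decision = "aborted"               # Presumed Abort: no log record
    # Phase 2: propagate the decision to surviving participants.
    for p in participants:
        if p.state != "aborted":
            p.finish(decision)
    return decision

log = []
assert two_phase_commit(log, [Participant(True), Participant(True)]) == "committed"
assert two_phase_commit(log, [Participant(True), Participant(False)]) == "aborted"
assert log == ["commit"]   # the abort left no trace in the coordinator log
```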


Traditional Database Replication

  • Primary log replay at replica – homogeneous systems with full DB replicas, typically done for disaster recovery (DR) backup
  • A log-capture component generates DML statements from what is logged, and an apply component executes those statements at the replica (e.g., IBM Q Replication)
  • Capture DML statements as issued by application and re-execute them at replica (e.g., H-Store/VoltDB)
    • Cannot handle non-determinism
    • Typically, serial execution of transactions
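Why statement re-execution cannot handle non-determinism can be shown with a toy example: the same "statement" produces different values at primary and replica, while shipping the resulting value (as log-based replication does) keeps them identical. The dict-based "databases" are purely illustrative.

```python
import random

primary, replica = {}, {}

# A non-deterministic DML "statement": writes a random value
# (stand-in for NOW(), RAND(), sequence numbers, etc.).
def stmt(db):
    db["x"] = random.random()

# Statement-based replication: re-execute the captured statement at the
# replica. Different random states at the two sites cause divergence.
random.seed(1); stmt(primary)
random.seed(2); stmt(replica)
assert primary["x"] != replica["x"]   # replica has silently diverged

# Value (log-based) replication ships the value the primary produced,
# so the replica converges regardless of non-determinism.
replica["x"] = primary["x"]
assert primary == replica
```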


Distributed SQL Databases Comparison

Source: http://bit.ly/3t3rBDa Yugabyte, 1/2021


Distributed NoSQL Databases Comparison

Source: http://bit.ly/3t3rBDa Yugabyte, 1/2021


Cloud Deployment Models

Source: https://bit.ly/31KGP7j, 3/2017


Cloud Deployment Models

Source: https://bit.ly/3rRGjPz, 2/2021


Cloud Service Models

Source: https://bit.ly/31KGP7j, 3/2017


Cloud Service Models

Source: https://bit.ly/3rRGjPz, 2/2021

  • Infrastructure as a Service (IaaS): Virtualized resources. Examples: Microsoft Azure, Amazon Web Services (AWS), Cisco Metacloud, Google Compute Engine (GCE)
  • Platform as a Service (PaaS): IaaS + OS + Middleware. Examples: AWS Elastic Beanstalk, Apache Stratos, Google App Engine, Microsoft Azure
  • Software as a Service (SaaS): Fully developed software solution. Examples: Microsoft Office 365, Salesforce, Cisco WebEx, Google Apps
  • Function as a Service (FaaS): aka Serverless Computing – deploy application code. Examples: AWS Lambda, Azure Functions


Cloud Data Center Architecture


Cloud Fault Domain Zones Regions


Seattle 2018 DB Meeting (http://bit.ly/DBseat)

Material from the Working Group Report on Cloud Data Services. Chairs: Sailesh Krishnamurthy and Fatma Ozcan

  • Multitenancy
  • Serverless, PaaS, auto-scaling
    • Serverless: Allows one to code without thinking about hardware/network configuration
  • Hybrid cloud
  • Implications of cloud for database engine architectures (DBaaS)
  • Auto-tuning and usability
  • Open source: Lack of a good open source distributed OLTP DBMS
  • Migrating apps from on-prem to cloud easily
  • Data sovereignty
  • What happens when the storage/memory hierarchy becomes deep? (block storage – object storage (S3) – cold storage (Glacier))


What Users Want

  • Performance and predictable costs without complex knobs or complicated system sizing
  • Pay-per-use pricing models as opposed to pricing based on pre-provisioned limits
  • Auto-scaling service capacity based on dynamic needs
  • Supporting multitenant SaaS applications
  • State management for new idioms like Serverless and Functions-as-a-Service
  • Applications that operate across multiple nodes, multiple datacenters (regionalized), multiple regions (globalized)
  • Applications that operate across on-premise and multiple cloud services seamlessly (hybrid cloud)


What Users Want

  • How can we use more ML/DL for auto-tuning cloud data services to eliminate DBAs? Optimal data layout (sharding and partitioning decisions), data caching decisions (where and which data to cache), better query optimization, admission control, etc.
  • Is there a way to determine when it is beneficial to move to the cloud? There is a distinction between cost of services vs cost to operate. Is this about cost of ownership, cost of startup, cost of burstiness/elasticity?
  • How can we run analytics workloads across data centers? How about active-active OLTP across data centers? Can we also enable HTAP?
  • How do disaggregated storage and a rich storage hierarchy (including NVRAM) impact cloud data architecture?


Some Cloud DBMSs

  • Alibaba POLARDB & POLARDB-X
  • Amazon Aurora
  • CockroachDB (CRDB)
  • FoundationDB
  • Google Spanner & F1
  • MariaDB
  • MongoDB
  • SQL DB Hyperscale (aka Microsoft Socrates) & POLARIS
  • YugabyteDB


Traditional vs Cloud DB Architectures

[Diagram] Traditional: compute (SQL, transactions, caching, logging) over local/attached storage. Cloud: the same compute stack over network storage – compute and storage decoupled for scalability, availability, and durability.


State Separation from Compute


Aurora Serverless

  • Starts up on demand, shuts down when not in use
  • Scales up/down automatically
  • Pay per second

[Diagram: application → request routers → warm pool of instances (scalable DB capacity) → database storage]

  • Aurora MySQL launched in Oct 2014; Aurora Postgres launched in Oct 2017
  • Aurora received ACM SIGMOD Systems Award July 2019


Aurora Architecture


Amazon Aurora Design Philosophy


Aurora Compute

  • Single tenant compute instance (Head Node)
  • Handles upper half of database stack
    • Client connection management
    • Query processing
    • Transaction management
    • Locking
    • Buffer cache
    • Access methods
    • Undo management

[Diagram: the customer application in the Customer VPC connects to the Head Node (SQL, transactions, caching, logging)]


Aurora: Offload Redo to Storage

Database Tier

  • Writes redo log records on network
  • No full data block writes for checkpointing, cache eviction, background writes
  • Push log applicator to storage

Storage Tier

  • Multi-tenant, scalable, replicated and self-healing storage fleet
  • Each segment replicated 6 ways across 3 datacenters (AZs)
  • Highly parallel, scale-out redo processing
  • Generate DB blocks on demand (redo)
  • Materialize DB blocks in background (redo)

“The Log is the Database”

[Diagram: the database tier (SQL, transactions, caching) ships redo log records to storage spread across AZ 1, AZ 2, and AZ 3, with backups to Amazon S3]
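The 6-way replication across 3 AZs above rests on quorum arithmetic; a small sketch using Aurora's published quorum sizes (6 copies, write quorum of 4, read quorum of 3):

```python
# Aurora quorum model: V copies, write quorum W, read quorum R.
V, W, R = 6, 4, 3

# Correctness conditions for quorum replication:
assert W + R > V       # every read quorum intersects every write quorum
assert W > V // 2      # two conflicting writes cannot both reach quorum

# Fault tolerance with 2 copies in each of 3 AZs: losing an entire AZ
# (2 copies) plus one more node still leaves R readable copies, so the
# volume stays available for reads.
copies_per_az, azs = 2, 3
surviving = copies_per_az * azs - copies_per_az - 1
assert surviving >= R
```

This is why an AZ failure plus one additional node failure breaks write availability (only 3 < W copies remain) but not read availability, which is what recovery needs to rebuild the lost copies.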


Network I/O in MySQL vs Amazon Aurora


Aurora Reduces Aggregate I/O Burden

30-minute SysBench write-only workload, 100 GB dataset: 6X more log writes, but 9X less network traffic

RDS MySQL Multi-AZ (30K PIOPS) – Total trx: 780K; I/Os per trx: 7.4 (excluding mirroring)
Aurora – Total trx: 27,378K (35X more); I/Os per trx: 0.95 (including 6x amplification; 7.7X less)


Aurora Storage Node


Availability & Software Upgrades


Distributed Commits in Cloud Databases

No atomic flush spanning multiple storage nodes

Commits over distributed storage use 2PC/Paxos to establish global consistency

Network chatter leads to stalls and jitter in the write path



Aurora: Asynchronous Commit Processing

Storage nodes establish compact consistency points that:
  • Increase monotonically
  • Are continuously returned to the database
  • Do not vote on accepting a write
  • Execute idempotent operations on local state

Database nodes handle locking, transactions, deadlocks, constraints, etc.
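A minimal sketch of this non-voting scheme: each storage node applies idempotent writes locally and advances a monotonically increasing consistency point (its highest contiguous LSN), which it continuously reports back to the database node. The class and field names are hypothetical.

```python
class StorageNode:
    def __init__(self):
        self.received = set()
        self.consistency_point = 0   # highest contiguous LSN applied

    def apply(self, lsn):
        """Idempotent write: re-delivery of the same LSN is a no-op.
        The consistency point only ever moves forward."""
        self.received.add(lsn)
        while self.consistency_point + 1 in self.received:
            self.consistency_point += 1
        return self.consistency_point   # reported back to the database

node = StorageNode()
assert node.apply(1) == 1
assert node.apply(3) == 1     # gap at LSN 2: the point cannot advance
assert node.apply(2) == 3     # gap filled: the point jumps forward
assert node.apply(2) == 3     # idempotent re-delivery changes nothing
```

No vote or round-trip is needed per write; the database node simply watches the reported points to learn what is durable.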


Backward Chaining of Redo Log Records

Each redo log record includes backlink LSNs:

  • Block backlink: Previous log record of the modified block
    • Used to materialize blocks on demand and in the background
  • Segment backlink: Previous log record in the segment
    • Used to identify records not received by the storage node
  • Volume backlink: Previous log record in the volume
    • Used to regenerate metadata as a fallback path
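An illustrative sketch (field names are hypothetical) of a redo log record carrying the three backlink LSNs, and of walking the block backlink chain to materialize a block on demand:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RedoLogRecord:
    lsn: int
    block_backlink: Optional[int]    # previous record for the same block
    segment_backlink: Optional[int]  # previous record in the same segment
    volume_backlink: Optional[int]   # previous record in the volume

def records_for_block(log, lsn):
    """Walk the block backlink chain to collect every record needed to
    materialize a block (newest to oldest)."""
    chain = []
    while lsn is not None:
        rec = log[lsn]
        chain.append(rec.lsn)
        lsn = rec.block_backlink
    return chain

log = {
    10: RedoLogRecord(10, None, None, None),
    20: RedoLogRecord(20, None, 10, 10),     # a different block
    30: RedoLogRecord(30, 10, 20, 20),       # same block as LSN 10
}
assert records_for_block(log, 30) == [30, 10]   # skips the unrelated LSN 20
```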


Aurora: Crash Recovery

DB opens all storage nodes and reaches read quorum for all protection groups (PGs)

Locally re-compute PGCLs and VCL

VCL is the highest point where all records have met quorum

Everything past VCL is truncated at crash recovery

No redo or undo processing is required before the database is opened for processing

[Diagram: at crash, the log has records and gaps beyond the Volume Complete LSN (VCL); immediately after crash recovery, the ragged edge of the log not meeting quorum has been truncated]
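The VCL computation described above can be sketched as a simple scan: the VCL is the highest LSN below the first "hole" (a record that never reached quorum), and everything past it is truncated. The per-record replica counts and quorum size are illustrative.

```python
READ_QUORUM = 3   # copies that must hold a record for it to count

def compute_vcl(ack_counts):
    """ack_counts: {lsn: number of replicas holding that record},
    assumed to cover every LSN written. Returns the Volume Complete
    LSN: the highest LSN with no quorum gap at or below it."""
    vcl = 0
    for lsn in sorted(ack_counts):
        if ack_counts[lsn] >= READ_QUORUM:
            vcl = lsn
        else:
            break   # first hole: everything past it is discarded
    return vcl

# LSN 40 reached only 2 replicas, so VCL = 30; LSN 50 is truncated
# even though it individually met quorum.
assert compute_vcl({10: 6, 20: 5, 30: 4, 40: 2, 50: 6}) == 30
```

Because the decision is just this truncation point, no redo or undo pass is needed before opening the database, matching the bullet above.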


Aurora Parallel Query

  • Offload processing to storage
  • Designed for analytical queries with large scans
  • Aurora storage has thousands of CPUs
  • Push down predicates and parallelize query processing
  • Reduce network traffic
  • Avoid buffer pool thrashing

[Diagram: the database node pushes down predicates to storage nodes and aggregates their results]
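Predicate pushdown with storage-side parallelism can be sketched as follows: each (simulated) storage node scans its own shard and applies the predicate locally, so the database node only aggregates the much smaller filtered results. Shard contents and node count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Four "storage nodes", each holding a shard of the table.
shards = [list(range(i, 1000, 4)) for i in range(4)]

def storage_scan(shard, predicate):
    """Runs on a storage node: filter locally, return only matches,
    so the full shard never crosses the network."""
    return [v for v in shard if predicate(v)]

predicate = lambda v: v % 100 == 0   # the pushed-down predicate

with ThreadPoolExecutor() as pool:   # scans proceed in parallel
    partials = pool.map(storage_scan, shards, [predicate] * len(shards))

# Database node: aggregate the partial results.
result = sorted(v for part in partials for v in part)
assert result == [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
```

Only 10 values travel to the database node instead of 1000, which is the network-traffic and buffer-pool benefit the bullets describe.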


Aurora Multi-Master


Microsoft Azure: SQL DB HADR Architecture

  • Single R/W primary
  • Log shipping with 3 read-only secondaries
  • Log, incremental/full DB backups
  • Versioning for snapshot isolation done locally in temporary storage
  • Resilient buffer pool extension (RBPX)


Microsoft Socrates (aka SQL DB Hyperscale)


POLARIS: Distributed SQL in Azure Synapse

Converging warehouses and data lakes in Azure Synapse


Google Spanner

  • Globally distributed semi-relational system with a SQL-based query language
  • Uses Colossus File System (successor to GFS)
  • Transactional, relies on TrueTime and Paxos


Google F1

  • Developed for use with the AdWords platform, replacing MySQL
  • Relies on Google Spanner; mostly stateless F1 servers
  • Distributed SQL queries for OLAP-style queries, using snapshot transactions and 2PC
  • Optimistic concurrency control
  • Transaction consistent secondary indexes
  • Protocol Buffer as column data type
  • Clustered hierarchical tables
  • Increased read, write and commit latencies
  • 5 copy replication
  • Row or sub-row level locking
  • Many issues found with inefficiencies of MySQL ORM layer – replaced with F1 ORM
  • Multiple client consumers of a single query’s results


CockroachDB (CRDB)

  • Supports public and private cloud nodes in a single CRDB cluster
  • Uses standard clock synchronization protocol like NTP
  • Geo-distributed (range) partitioning and replica placement
  • At least 3 replicas of every partition
  • Distributed SQL optimization and execution
  • Shared nothing architecture
  • Serializability only – no snapshot isolation! For read-only transactions, a “follower read” option exists.
  • Compatible with the PostgreSQL wire protocol
  • Store and index Spatial data types with familiar, PostGIS-compatible SQL syntax
  • Uses Raft consensus protocol
  • New in v20.2: By default, CRDB uses Pebble storage engine, with RocksDB as an option. Pebble is intended to be bidirectionally compatible with RocksDB on-disk format, but differs in that it:
    • Is written in Go and implements a subset of RocksDB's large feature set
    • Contains optimizations that benefit CockroachDB
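The range-partitioning and replica-placement bullets above can be illustrated with a toy sketch: keys are split into contiguous ranges, and each range is given at least 3 replicas in distinct localities. Split points, localities, and the round-robin placement policy are all hypothetical simplifications.

```python
import bisect

split_points = ["g", "n", "t"]                       # 4 key ranges
localities = ["us-east", "us-west", "eu-west", "ap-south"]

def range_for(key):
    """Index of the range owning this key (binary search on splits)."""
    return bisect.bisect_right(split_points, key)

def replicas_for(range_id, n=3):
    """Place n replicas of a range across distinct localities."""
    return [localities[(range_id + i) % len(localities)] for i in range(n)]

assert range_for("apple") == 0          # "apple" < "g"
assert range_for("melon") == 1          # "g" < "melon" < "n"
assert len(set(replicas_for(range_for("apple")))) == 3   # 3 distinct sites
```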


Snowflake

  • Cloud-based warehouse system
  • Compute and storage elasticity
  • Multi-tenancy
  • Custom-designed storage for management and exchange of ephemeral data as well as for caching persistent data for write-through
  • Virtual Warehouse (VW) = set of EC2 instances
  • Elasticity on fine-grained timescales via pre-warmed VWs
  • Anonymized 70 million queries available at https://github.com/resource-disaggregation/snowset


FoundationDB


Alibaba Cloud Database Systems


Alibaba POLARDB Architecture

  • Cloud-native LSM-based row store OLTP system
  • Compatible with MySQL & PostgreSQL
  • RDMA interconnect
  • 1 writer & multiple reader nodes
  • 3 copies of data via Parallel-Raft protocol
  • Built on PolarFS
  • Pushes down table scans to storage and uses computational storage drives


Alibaba POLARDB Architecture

  • FPGA-based host-managed computational storage drive design
  • Single FPGA for flash memory control and table scans
  • Host manages address mapping, request scheduling and garbage collection


Alibaba POLARDB Architecture

  • PolarFS changed to do table scans
  • Employs parallelism for scans


POLARDB-X Architecture

  • Distributed shared-nothing OLTP DBMS to overcome RDMA-limited scaling
  • Combines shared-storage and shared-nothing concepts


IoT-Cloud Architecture

Source: P. Pierleoni et al. IEEE Access, 2019
