1 of 63

A Survey of Cloud Database Systems

C. Mohan, Distinguished Visiting Professor (Tsinghua University, Beijing, China); Member, Board of Governors (Digital University Kerala, India); Retired IBM Fellow (Silicon Valley, USA) & Former Shaw Visiting Professor (National University of Singapore); https://bit.ly/CMoDUK

Indian Institute of Science (IISc), Bangalore, India, 5 January 2022

@seemohan

1

C. Mohan, IISc BLR, 2022-01-05


Acks + More Info

  • Much of the material in this presentation was picked up from other sources
  • Complete source references will be added later
  • https://bit.ly/CMnusC - Check out reading list for my National University of Singapore (NUS) seminar course (CS6282 Practical Cloud/Parallel/Distributed Data and Computing Systems) I taught in 1H2021
  • This presentation is still very much a work-in-progress and hence incomplete!
    • Will be revising it significantly as I give it elsewhere
    • Watch out for my social media posts on revised slides + videos
    • Links to my social media accounts at https://bit.ly/CMoDUK


Relational Database Management Systems (RDBMSs)

  • My former IBM Research colleague in San Jose (California), Edgar F. Codd, developed the relational model of data and the theory behind it before the start of my IBM career
  • Other former colleagues then developed System R, which gave rise to the SQL language for data definition and data manipulation in a high-level, declarative manner
  • In parallel, at the University of California, Berkeley, Mike Stonebraker and colleagues developed Ingres and the language Quel, which differed from SQL
  • SQL won the standardization game making IBMers very happy!
  • Very early on, Codd received the Turing Award for his relational data model work
  • Another former IBMer, Jim Gray, also got Turing Award for his transaction and other work
  • Much later, Mike Stonebraker got the same award for his RDBMS and related work
  • RDBMSs and SQL have stood the test of time, in spite of emergence of NoSQL, and big data systems like Hadoop and Spark


Traditional DBMS Architecture


Storage Hierarchy


Different Workloads

  • Online Transaction Processing (OLTP)
  • Online Analytical Processing (OLAP)
  • Modern Application Requirements
  • Design Challenges of HTAP Systems
  • HTAP/OLTAP Systems
    • IBM DB2 BLU and IBM Wildfire/Db2 Event Store
    • MemSQL (SingleStore)
    • Oracle DBIM
    • SAP HANA
    • VoltDB
    • HyPer
    • TiDB


Online Transaction Processing (OLTP)

  • Online Transaction Processing (OLTP) systems have been around for 4 decades: TPF, IMS, CICS, …; much later, RDBMSs: DB2, Oracle, Sybase, Informix, …
  • Focus on short transactions (both reads and writes), typically not ad-hoc
  • Major emphasis on failure tolerance, performance (response time & throughput), security, priority-based scheduling, …
  • Various benchmarks have driven developments – Transaction Processing Performance Council (TPC): B, C, E, … http://tpc.org/information/benchmarks5.asp
  • Prolonged debate on system architectures: Shared Everything (SE), Shared Disks (SD), Shared Nothing (SN) – SD has won (DB2, Oracle)
  • Indexing has been crucial – typically, B+ trees but also hashing in some systems
  • Stored procedures used for high performance
  • Of late, leveraging main memory, SSD; compiling to machine code; …
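The short read-write transactions emphasized above can be illustrated with a minimal sketch: a debit/credit transfer that either commits atomically or rolls back. The accounts table and amounts are hypothetical, using SQLite only as a stand-in for an OLTP engine.

```python
import sqlite3

# Minimal sketch of an OLTP-style unit of work: a short transaction
# touching two rows, with commit on success and rollback on failure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Short read-write transaction: the typical OLTP workload unit."""
    try:
        cur = conn.cursor()
        cur.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        cur.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        # Enforce a constraint; abort the whole transaction on violation.
        (bal,) = cur.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        return False

assert transfer(conn, 1, 2, 30) is True      # commits
assert transfer(conn, 1, 2, 1000) is False   # rolls back atomically
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
# → {1: 70, 2: 80}: the failed transfer left no partial update behind
```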


Online Analytical Processing (OLAP)

  • Complex high-level queries for decision support, data mining, multi-dimensional analytics with data warehouses
  • Consolidation (roll-up), drill-down, and slicing and dicing
  • Ad hoc or repetitive long-running queries needing complex query optimization techniques
  • Indexing choices harder due to ad hoc nature of queries + maintenance cost
  • Precomputed views might be used
  • TPC DI, DS, H, x-HS, … benchmarks http://tpc.org/information/benchmarks5.asp
  • Periodic replication from OLTP systems, reduced data currency: delayed analytics
  • Parallel RDBMSs – DB2, Teradata, Netezza, Vertica, VectorWise
  • MOLAP, ROLAP, Hybrid systems – Hyperion, Essbase, Cognos, Business Objects, Microsoft Analysis Services, …
  • Modern ones are column stores, Hadoop systems; leverage GPUs, FPGAs, …
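The roll-up operation above can be sketched in a few lines: aggregate the same fact table at successively coarser levels, as a warehouse would for `GROUP BY ... ROLLUP`. The sales rows here are hypothetical.

```python
from collections import defaultdict

# Toy fact table: (region, product, amount)
sales = [
    ("APAC", "widgets", 10), ("APAC", "gadgets", 5),
    ("EMEA", "widgets", 7),  ("EMEA", "gadgets", 3),
]

def roll_up(rows, keyfn):
    """Aggregate amounts grouped by the given key: one roll-up level."""
    totals = defaultdict(int)
    for row in rows:
        totals[keyfn(row)] += row[2]
    return dict(totals)

by_region_product = roll_up(sales, lambda r: (r[0], r[1]))  # finest level
by_region = roll_up(sales, lambda r: r[0])                  # rolled up
grand_total = sum(r[2] for r in sales)                      # coarsest level

# Drill-down is the reverse direction: from by_region back to
# by_region_product.
assert by_region == {"APAC": 15, "EMEA": 10}
assert grand_total == 25
```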


Data Warehouse Architecture


Modern Data Architecture


HTAP Benefits


Tradeoffs in the New Age

http://www.w3resource.com/mongodb/nosql.php

“Split-Brain” Syndrome


Modern Application Requirements

  • Real-time operational intelligence processing
  • In rapidly changing databases, analytics on latest states of data
  • Hybrid Transactional/Analytical Processing (HTAP) on the same data with stringent response time and data currency requirements – aka OLTAP
  • Eliminate multiple copies of data
  • High volume of transactions, high ingest rate streaming of data
  • Ads/recommendation serving in mobile environments
  • Sensors at the edge of the network (a la Apache Edgent)
  • Public safety, risk management, fraud detection, real-time inventory/pricing, …
  • High availability in highly distributed environments with high rate of failures of different components
  • Stronger transactional guarantees than what NoSQL systems give


HTAP Systems: Design Challenges

  • Row versus Column formats
  • Logging, buffer management, and row-reconstruction overheads of the columnar format
  • I/O overheads and compression inefficiencies of the row format
  • In-place updates versus versioning: append-only versus true update workloads
  • Index maintenance overheads of versioning data
  • Strict consistency versus conflicting updates: a function of need for multi-data center configurations (globally distributed operations versus only localized operations)
  • Additional complications of public cloud
  • Scaling challenges with addition and deletion of nodes
  • Pains stemming from “brain-dead” but currently very popular file systems like HDFS!
  • Supporting recent as well as historical data
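The row-versus-column tradeoff in the first bullets can be made concrete with a small sketch: the same (hypothetical) data laid out row-wise and column-wise, showing why point access favors rows while scans and aggregates favor columns, and what row reconstruction costs in a column store.

```python
# Row store: each record contiguous (good for OLTP point access).
rows = [(i, f"name{i}", i * 10) for i in range(1000)]

# Column store: each attribute contiguous (good for OLAP scans).
ids     = [r[0] for r in rows]
names   = [r[1] for r in rows]
amounts = [r[2] for r in rows]

# OLTP-style point lookup touches one whole row.
record = rows[42]                        # (42, "name42", 420)

# OLAP-style aggregate touches only one column; a column store reads a
# fraction of the bytes a row store must fetch for the same query.
total = sum(amounts)

# Row-reconstruction overhead in a column store: stitching a record
# back together from the per-column arrays.
reconstructed = (ids[42], names[42], amounts[42])
assert reconstructed == record
```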


Traditional Parallel Database Architectures

Source: DB2 for z/OS: Data Sharing in a Nutshell, IBM Redbook SG24-7322-00, October 2006


Traditional Distributed Databases

  • Have been worked on for 5 decades
  • My first project at IBM Research in 1982 was R*, the distributed version of System R
    • Invented two-phase commit variations like Presumed Abort (PA) and Presumed Commit (PC)
    • Distributed query processing and optimization
    • Snapshots – aka materialized views in a distributed context
  • Replication – synchronous, asynchronous
  • Evolution from homogeneous database nodes to heterogeneous nodes – heterogeneous databases – global and local optimizations
  • Data partitioning – horizontal and vertical
  • Failure models – Byzantine failures included or not
  • Node autonomy considerations
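The Presumed Abort idea mentioned above can be sketched in miniature: the coordinator force-writes a log record only for commits, so after a crash any transaction with no log record is presumed aborted. This is a simulation with hypothetical participant objects, not the full protocol (acks and recovery paths are omitted).

```python
class Participant:
    def __init__(self, vote):
        self.vote = vote          # True = "yes", False = "no"
        self.state = "active"

    def prepare(self):
        """Phase 1: vote; a 'no' voter aborts unilaterally."""
        self.state = "prepared" if self.vote else "aborted"
        return self.vote

    def finish(self, decision):
        self.state = decision

def two_phase_commit(coordinator_log, participants):
    # Phase 1: collect votes from all participants.
    if all(p.prepare() for p in participants):
        coordinator_log.append("commit")   # force-write only on commit
        decision = "committed"
    else:
        decision = "aborted"               # Presumed Abort: no log record
    # Phase 2: propagate the decision to surviving participants.
    for p in participants:
        if p.state != "aborted":
            p.finish(decision)
    return decision

log = []
assert two_phase_commit(log, [Participant(True), Participant(True)]) == "committed"
assert two_phase_commit(log, [Participant(True), Participant(False)]) == "aborted"
assert log == ["commit"]   # the abort left no trace in the coordinator log
```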


Traditional Database Replication

  • Primary log replay at replica – homogeneous systems with full DB replicas, typically done for disaster recovery (DR) backup
  • A log-capture component generates DML statements from what is logged, and an apply component executes those statements at the replica (e.g., IBM Q Replication)
  • Capture DML statements as issued by application and re-execute them at replica (e.g., H-Store/VoltDB)
    • Cannot handle non-determinism
    • Typically, serial execution of transactions
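Why statement re-execution cannot handle non-determinism can be shown with a toy example: the same "statement" produces different values at primary and replica, while shipping the resulting value (as log-based replication does) keeps them identical. The dict-based "databases" are purely illustrative.

```python
import random

primary, replica = {}, {}

# A non-deterministic DML "statement": writes a random value
# (stand-in for NOW(), RAND(), sequence numbers, etc.).
def stmt(db):
    db["x"] = random.random()

# Statement-based replication: re-execute the captured statement at the
# replica. Different random states at the two sites cause divergence.
random.seed(1); stmt(primary)
random.seed(2); stmt(replica)
assert primary["x"] != replica["x"]   # replica has silently diverged

# Value (log-based) replication ships the value the primary produced,
# so the replica converges regardless of non-determinism.
replica["x"] = primary["x"]
assert primary == replica
```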


Distributed SQL Databases Comparison

Source: http://bit.ly/3t3rBDa Yugabyte, 1/2021


Distributed NoSQL Databases Comparison

Source: http://bit.ly/3t3rBDa Yugabyte, 1/2021


Cloud Deployment Models

Source: https://bit.ly/31KGP7j, 3/2017


Cloud Deployment Models

Source: https://bit.ly/3rRGjPz, 2/2021


Cloud Service Models

Source: https://bit.ly/31KGP7j, 3/2017


Cloud Service Models

Source: https://bit.ly/3rRGjPz, 2/2021

  • Infrastructure as a Service (IaaS): Virtualized resources. Examples: Microsoft Azure, Amazon Web Services (AWS), Cisco Metacloud, Google Compute Engine (GCE)
  • Platform as a Service (PaaS): IaaS + OS + Middleware. Examples: AWS Elastic Beanstalk, Apache Stratos, Google App Engine, Microsoft Azure
  • Software as a Service (SaaS): Fully developed software solution. Examples: Microsoft Office 365, Salesforce, Cisco WebEx, Google Apps
  • Function as a Service (FaaS): aka Serverless Computing – deploy application code. Examples: AWS Lambda, Azure Functions


Cloud Data Center Architecture


Cloud Fault Domain Zones Regions


Seattle 2018 DB Meeting (http://bit.ly/DBseat)

Material from the Working Group Report on Cloud Data Services. Chairs: Sailesh Krishnamurthy and Fatma Ozcan

  • Multitenancy
  • Serverless, PaaS, auto-scaling
    • Serverless: Allows one to code without thinking about hardware/network configuration
  • Hybrid cloud
  • Implications of cloud for database engine architectures (DBaaS)
  • Auto-tuning and usability
  • Open source: Lack of a good open source distributed OLTP DBMS
  • Migrating apps from on-prem to cloud easily
  • Data sovereignty
  • What happens when the storage/memory hierarchy becomes deep? (block storage – object storage (S3) – cold storage (Glacier))


What Users Want

  • Performance and predictable costs without complex knobs or complicated system sizing
  • Pay-per-use pricing models as opposed to pricing based on pre-provisioned limits
  • Auto-scaling service capacity based on dynamic needs
  • Supporting multitenant SaaS applications
  • State management for new idioms like Serverless and Functions-as-a-Service
  • Applications that operate across multiple nodes, multiple datacenters (regionalized), multiple regions (globalized)
  • Applications that operate across on-premise and multiple cloud services seamlessly (hybrid cloud)


What Users Want

  • How can we use more ML/DL for auto-tuning cloud data services to eliminate DBAs? Optimal data layout (sharding and partitioning decisions), data caching decisions (where and which data to cache), better query optimization, admission control, etc.
  • Is there a way to determine when it is beneficial to move to the cloud? There is a distinction between cost of services vs cost to operate. Is this about cost of ownership, cost of startup, cost of burstiness/elasticity?
  • How can we run analytics workloads across data centers? How about active-active OLTP across data centers? Can we also enable HTAP?
  • How do disaggregated storage and a rich storage hierarchy (including NVRAM) impact cloud data architecture?


Some Cloud DBMSs

  • Alibaba POLARDB & POLARDB-X
  • Amazon Aurora
  • CockroachDB (CRDB)
  • FoundationDB
  • Google Spanner & F1
  • MariaDB
  • MongoDB
  • SQL DB Hyperscale (aka Microsoft Socrates) & POLARIS
  • YugabyteDB


Traditional vs Cloud DB Architectures

[Diagram] Traditional: compute (SQL, transactions, caching, logging) over local/attached storage. Cloud: the same compute stack over network storage – compute and storage decoupled for scalability, availability, and durability.


State Separation from Compute


Aurora Serverless

  • Starts up on demand, shuts down when not in use
  • Scales up/down automatically
  • Pay per second

[Diagram: application → request routers → warm pool of instances (scalable DB capacity) → database storage]

  • Aurora MySQL launched in Oct 2014; Aurora Postgres launched in Oct 2017
  • Aurora received ACM SIGMOD Systems Award July 2019


Aurora Architecture


Amazon Aurora Design Philosophy


Aurora Compute

  • Single tenant compute instance (Head Node)
  • Handles upper half of database stack
    • Client connection management
    • Query processing
    • Transaction management
    • Locking
    • Buffer cache
    • Access methods
    • Undo management

[Diagram: the customer application in the Customer VPC connects to the Head Node (SQL, transactions, caching, logging)]


Aurora: Offload Redo to Storage

Database Tier

  • Writes redo log records on network
  • No full data block writes for checkpointing, cache eviction, background writes
  • Push log applicator to storage

Storage Tier

  • Multi-tenant, scalable, replicated and self-healing storage fleet
  • Each segment replicated 6 ways across 3 datacenters (AZs)
  • Highly parallel, scale-out redo processing
  • Generate DB blocks on demand (redo)
  • Materialize DB blocks in background (redo)

“The Log is the Database”

[Diagram: the database tier (SQL, transactions, caching) ships redo log records to storage spread across AZ 1, AZ 2, and AZ 3, with backups to Amazon S3]
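The 6-way replication across 3 AZs above rests on quorum arithmetic; a small sketch using Aurora's published quorum sizes (6 copies, write quorum of 4, read quorum of 3):

```python
# Aurora quorum model: V copies, write quorum W, read quorum R.
V, W, R = 6, 4, 3

# Correctness conditions for quorum replication:
assert W + R > V       # every read quorum intersects every write quorum
assert W > V // 2      # two conflicting writes cannot both reach quorum

# Fault tolerance with 2 copies in each of 3 AZs: losing an entire AZ
# (2 copies) plus one more node still leaves R readable copies, so the
# volume stays available for reads.
copies_per_az, azs = 2, 3
surviving = copies_per_az * azs - copies_per_az - 1
assert surviving >= R
```

This is why an AZ failure plus one additional node failure breaks write availability (only 3 < W copies remain) but not read availability, which is what recovery needs to rebuild the lost copies.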


Network I/O in MySQL vs Amazon Aurora


Aurora Reduces Aggregate I/O Burden

30-minute SysBench write-only workload, 100 GB dataset: 6X more log writes, but 9X less network traffic

RDS MySQL Multi-AZ (30K PIOPS) – Total trx: 780K; I/Os per trx: 7.4 (excluding mirroring)
Aurora – Total trx: 27,378K (35X more); I/Os per trx: 0.95 (including 6x amplification; 7.7X less)


Aurora Storage Node


Availability & Software Upgrades


Distributed Commits in Cloud Databases

No atomic flush spanning multiple storage nodes

Commits over distributed storage use 2PC/Paxos to establish global consistency

Network chatter leads to stalls and jitter in the write path



Aurora: Asynchronous Commit Processing

Storage nodes establish compact consistency points that:
  • Increase monotonically
  • Are continuously returned to the database
  • Do not vote on accepting a write
  • Execute idempotent operations on local state

Database nodes handle locking, transactions, deadlocks, constraints, etc.
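A minimal sketch of this non-voting scheme: each storage node applies idempotent writes locally and advances a monotonically increasing consistency point (its highest contiguous LSN), which it continuously reports back to the database node. The class and field names are hypothetical.

```python
class StorageNode:
    def __init__(self):
        self.received = set()
        self.consistency_point = 0   # highest contiguous LSN applied

    def apply(self, lsn):
        """Idempotent write: re-delivery of the same LSN is a no-op.
        The consistency point only ever moves forward."""
        self.received.add(lsn)
        while self.consistency_point + 1 in self.received:
            self.consistency_point += 1
        return self.consistency_point   # reported back to the database

node = StorageNode()
assert node.apply(1) == 1
assert node.apply(3) == 1     # gap at LSN 2: the point cannot advance
assert node.apply(2) == 3     # gap filled: the point jumps forward
assert node.apply(2) == 3     # idempotent re-delivery changes nothing
```

No vote or round-trip is needed per write; the database node simply watches the reported points to learn what is durable.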


Backward Chaining of Redo Log Records

Each redo log record includes backlink LSNs:

  • Block backlink: Previous log record of the modified block
    • Used to materialize blocks on demand and in the background
  • Segment backlink: Previous log record in the segment
    • Used to identify records not received by the storage node
  • Volume backlink: Previous log record in the volume
    • Used to regenerate metadata as a fallback path
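An illustrative sketch (field names are hypothetical) of a redo log record carrying the three backlink LSNs, and of walking the block backlink chain to materialize a block on demand:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RedoLogRecord:
    lsn: int
    block_backlink: Optional[int]    # previous record for the same block
    segment_backlink: Optional[int]  # previous record in the same segment
    volume_backlink: Optional[int]   # previous record in the volume

def records_for_block(log, lsn):
    """Walk the block backlink chain to collect every record needed to
    materialize a block (newest to oldest)."""
    chain = []
    while lsn is not None:
        rec = log[lsn]
        chain.append(rec.lsn)
        lsn = rec.block_backlink
    return chain

log = {
    10: RedoLogRecord(10, None, None, None),
    20: RedoLogRecord(20, None, 10, 10),     # a different block
    30: RedoLogRecord(30, 10, 20, 20),       # same block as LSN 10
}
assert records_for_block(log, 30) == [30, 10]   # skips the unrelated LSN 20
```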


Aurora: Crash Recovery

DB opens all storage nodes and reaches read quorum for all protection groups (PGs)

Locally re-compute PGCLs and VCL

VCL is the highest point where all records have met quorum

Everything past VCL is truncated at crash recovery

No redo or undo processing is required before the database is opened for processing

[Diagram: at crash, the log has records and gaps beyond the Volume Complete LSN (VCL); immediately after crash recovery, the ragged edge of the log not meeting quorum has been truncated]
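The VCL computation described above can be sketched as a simple scan: the VCL is the highest LSN below the first "hole" (a record that never reached quorum), and everything past it is truncated. The per-record replica counts and quorum size are illustrative.

```python
READ_QUORUM = 3   # copies that must hold a record for it to count

def compute_vcl(ack_counts):
    """ack_counts: {lsn: number of replicas holding that record},
    assumed to cover every LSN written. Returns the Volume Complete
    LSN: the highest LSN with no quorum gap at or below it."""
    vcl = 0
    for lsn in sorted(ack_counts):
        if ack_counts[lsn] >= READ_QUORUM:
            vcl = lsn
        else:
            break   # first hole: everything past it is discarded
    return vcl

# LSN 40 reached only 2 replicas, so VCL = 30; LSN 50 is truncated
# even though it individually met quorum.
assert compute_vcl({10: 6, 20: 5, 30: 4, 40: 2, 50: 6}) == 30
```

Because the decision is just this truncation point, no redo or undo pass is needed before opening the database, matching the bullet above.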


Aurora Parallel Query

  • Offload processing to storage
  • Designed for analytical queries with large scans
  • Aurora storage has thousands of CPUs
  • Push down predicates and parallelize query processing
  • Reduce network traffic
  • Avoid buffer pool thrashing

[Diagram: the database node pushes down predicates to storage nodes and aggregates their results]
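Predicate pushdown with storage-side parallelism can be sketched as follows: each (simulated) storage node scans its own shard and applies the predicate locally, so the database node only aggregates the much smaller filtered results. Shard contents and node count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Four "storage nodes", each holding a shard of the table.
shards = [list(range(i, 1000, 4)) for i in range(4)]

def storage_scan(shard, predicate):
    """Runs on a storage node: filter locally, return only matches,
    so the full shard never crosses the network."""
    return [v for v in shard if predicate(v)]

predicate = lambda v: v % 100 == 0   # the pushed-down predicate

with ThreadPoolExecutor() as pool:   # scans proceed in parallel
    partials = pool.map(storage_scan, shards, [predicate] * len(shards))

# Database node: aggregate the partial results.
result = sorted(v for part in partials for v in part)
assert result == [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
```

Only 10 values travel to the database node instead of 1000, which is the network-traffic and buffer-pool benefit the bullets describe.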


Aurora Multi-Master


Microsoft Azure: SQL DB HADR Architecture

  • Single R/W primary
  • Log shipping with 3 read-only secondaries
  • Log, incremental/full DB backups
  • Versioning for snapshot isolation done locally in temporary storage
  • Resilient buffer pool extension (RBPX)


Microsoft Socrates (aka SQL DB Hyperscale)


POLARIS: Distributed SQL in Azure Synapse

Converging warehouses and data lakes in Azure Synapse


Google Spanner

  • Globally distributed semi-relational system with a SQL-based query language
  • Uses Colossus File System (successor to GFS)
  • Transactional, relies on TrueTime and Paxos


Google F1

  • Developed for use with the AdWords platform, replacing MySQL
  • Relies on Google Spanner; mostly stateless F1 servers
  • Distributed SQL queries for OLAP-style queries, using snapshot transactions and 2PC
  • Optimistic concurrency control
  • Transaction consistent secondary indexes
  • Protocol Buffer as column data type
  • Clustered hierarchical tables
  • Increased read, write and commit latencies
  • 5 copy replication
  • Row or sub-row level locking
  • Many issues found with inefficiencies of MySQL ORM layer – replaced with F1 ORM
  • Multiple client consumers of a single query’s results


CockroachDB (CRDB)

  • Supports public and private cloud nodes in a single CRDB cluster
  • Uses standard clock synchronization protocol like NTP
  • Geo-distributed (range) partitioning and replica placement
  • At least 3 replicas of every partition
  • Distributed SQL optimization and execution
  • Shared nothing architecture
  • Serializability only – no snapshot isolation! For read-only transactions, a “follower read” option exists.
  • Compatible with the PostgreSQL wire protocol
  • Store and index Spatial data types with familiar, PostGIS-compatible SQL syntax
  • Uses Raft consensus protocol
  • New in v20.2: By default, CRDB uses Pebble storage engine, with RocksDB as an option. Pebble is intended to be bidirectionally compatible with RocksDB on-disk format, but differs in that it:
    • Is written in Go and implements a subset of RocksDB's large feature set
    • Contains optimizations that benefit CockroachDB
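The range-partitioning and replica-placement bullets above can be illustrated with a toy sketch: keys are split into contiguous ranges, and each range is given at least 3 replicas in distinct localities. Split points, localities, and the round-robin placement policy are all hypothetical simplifications.

```python
import bisect

split_points = ["g", "n", "t"]                       # 4 key ranges
localities = ["us-east", "us-west", "eu-west", "ap-south"]

def range_for(key):
    """Index of the range owning this key (binary search on splits)."""
    return bisect.bisect_right(split_points, key)

def replicas_for(range_id, n=3):
    """Place n replicas of a range across distinct localities."""
    return [localities[(range_id + i) % len(localities)] for i in range(n)]

assert range_for("apple") == 0          # "apple" < "g"
assert range_for("melon") == 1          # "g" < "melon" < "n"
assert len(set(replicas_for(range_for("apple")))) == 3   # 3 distinct sites
```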


Snowflake

  • Cloud-based warehouse system
  • Compute and storage elasticity
  • Multi-tenancy
  • Custom-designed storage for management and exchange of ephemeral data as well as for caching persistent data for write-through
  • Virtual Warehouse (VW) = set of EC2 instances
  • Elasticity on fine-grained timescales via pre-warmed VWs
  • Anonymized 70 million queries available at https://github.com/resource-disaggregation/snowset


FoundationDB


Alibaba Cloud Database Systems


Alibaba POLARDB Architecture

  • Cloud-native LSM-based row store OLTP system
  • Compatible with MySQL & PostgreSQL
  • RDMA interconnect
  • 1 writer & multiple reader nodes
  • 3 copies of data via Parallel-Raft protocol
  • Built on PolarFS
  • Pushes down table scans to storage and uses computational storage drives


Alibaba POLARDB Architecture

  • FPGA-based host-managed computational storage drive design
  • Single FPGA for flash memory control and table scans
  • Host manages address mapping, request scheduling and garbage collection


Alibaba POLARDB Architecture

  • PolarFS changed to do table scans
  • Employs parallelism for scans


POLARDB-X Architecture

  • Distributed shared-nothing OLTP DBMS to overcome RDMA-limited scaling
  • Combines shared-storage and shared-nothing concepts


IoT-Cloud Architecture

Source: P. Pierleoni et al. IEEE Access, 2019
