3 of 13

Quick Intro: Building blocks of a Platform

(Type of Platform: DbaaS)

Automation-first approach: development over operational management.

Focus on self-serve features.

Kubernetes stateful objects

Build tooling to support increasing scale and maintenance overhead.

4 of 13

Quantifying Reliability of a Platform: Availability

Availability Computation
The “multi-nine” challenge: Why uptime isn’t the whole story.
Our scientific, real-time approach to calculating true DbaaS availability.

Resiliency Trade-offs in Multi-Cloud
Designing for failure: DR strategies across multiple cloud environments.
Real-world examples: Balancing opex, performance, and recovery during peak events.

Visibility interval

5m, 24h, 30d

Level of Granularity:

Unified single view across multiple DCs
DC level
Tenant level for multi-tenant platforms
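
To make the "multi-nine" challenge concrete, the sketch below works out how much full downtime a given availability target leaves in each visibility interval (5m, 24h, 30d). The target values and the names `WINDOWS_SECONDS`, `allowed_downtime_seconds` and `budget_table` are illustrative assumptions, not part of the platform.

```python
from typing import Dict

# Visibility intervals from the slide, expressed in seconds.
WINDOWS_SECONDS: Dict[str, int] = {
    "5m": 5 * 60,
    "24h": 24 * 3600,
    "30d": 30 * 24 * 3600,
}


def allowed_downtime_seconds(target_percent: float, window_seconds: int) -> float:
    """Seconds of complete unavailability that still meet the target."""
    return window_seconds * (1 - target_percent / 100)


def budget_table(target_percent: float) -> Dict[str, float]:
    """Allowed downtime per visibility interval, rounded to milliseconds."""
    return {
        name: round(allowed_downtime_seconds(target_percent, secs), 3)
        for name, secs in WINDOWS_SECONDS.items()
    }


if __name__ == "__main__":
    # Hypothetical targets, chosen only to show how fast the budget shrinks.
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {budget_table(target)}")
```

At a 99.99% target the 30d window allows roughly 259 seconds of downtime, which is why a short interval such as 5m is useful for detection while the 30d interval is the one that reflects the commitment.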

5 of 13

Unified Fleet view

ToDo: Add more dashboards and monitoring snapshots

6 of 13

Perception of Availability

Availability (%) = (Successful units of work / Total units of work) * 100

Computation

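Per DC = (1 - sum(increase(failed_queries{context="DC1"}[5m])) / sum(increase(total_queries{context="DC1"}[5m]))) * 100

(Assumes failed_queries and total_queries are monotonically increasing counters; a range selector such as [5m] has to be wrapped in a function like increase() before sum() can be applied.)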
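
The same formula can be applied at other levels of granularity by summing the counts for a group first and dividing afterwards. The sketch below is a minimal illustration under assumptions: the sample layout, the tenant and operation labels, and the numbers in `SAMPLES` are all hypothetical and only stand in for whatever the metrics backend returns for one window.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# One sample per (dc, tenant, operation) for a single window:
# (dc, tenant, operation, total_queries, failed_queries)
Sample = Tuple[str, str, str, int, int]

# Hypothetical data, for illustration only.
SAMPLES = [
    ("DC1", "tenant-a", "read", 1000, 1),
    ("DC1", "tenant-b", "write", 500, 5),
    ("DC2", "tenant-a", "read", 2000, 0),
    ("DC2", "tenant-a", "write", 500, 4),
]


def availability(total: int, failed: int) -> Optional[float]:
    """Availability (%) = successful / total * 100; None if there was no traffic."""
    if total == 0:
        return None
    return (1 - failed / total) * 100


def availability_by(
    samples: Iterable[Sample], key_index: Optional[int]
) -> Dict[str, Optional[float]]:
    """Group by DC (0), tenant (1) or operation (2); None means overall."""
    totals = defaultdict(lambda: [0, 0])  # key -> [total, failed]
    for sample in samples:
        key = "overall" if key_index is None else sample[key_index]
        totals[key][0] += sample[3]
        totals[key][1] += sample[4]
    return {key: availability(t, f) for key, (t, f) in totals.items()}


if __name__ == "__main__":
    print("per DC:    ", availability_by(SAMPLES, 0))
    print("per tenant:", availability_by(SAMPLES, 1))
    print("read/write:", availability_by(SAMPLES, 2))
    print("overall:   ", availability_by(SAMPLES, None))
```

Summing counts before dividing weights each group by its traffic; averaging per-DC percentages instead would let a low-traffic DC skew the unified number.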

ToDo:

Add Overall
Add tenant level
Segregate Read and Write availability

Prerequisite for calculation:

List of metrics anyone would need - ToDo

7 of 13

Resiliency@scale

For a multi-tenant platform operating in a multi-cloud environment, resiliency strategies have to be decided case by case. No one-size-fits-all solution is possible.

Complexity:

Local - complete/partial

Remote - complete/partial

Single region / Multi-region

ToDo: Complete the scenarios identified here

8 of 13

Resiliency@scale (contd.)

Learnings:

Identify the DR mechanisms each tenant can accept, e.g., whether the tenant can reroute traffic to perform an ELB failover.
Ensure RPO and RTO awareness among tenants.
Identify a healthy DC/cluster to act as the source for initiating resiliency measures.
Identify the topologies: active/primary, passive/read-only, active-active (AA), active-passive (AP), linear chain or star replication.
Identify the tradeoffs between replication replay/rewind and backup restore.
Offer automated tools as self-serve, with the expected completion time shown to users.
Call out the impact of each possible approach.
Restore the original topology based on whether the affected DC is the source, an intermediate or the destination in the chain.

9 of 13

Replication Topology

Linear chain topology:

DC1 -> DC2
DC1 <-> DC2
DC3 -> DC1 -> DC2

Star topology:

DC1 -> DC2, DC1 -> DC3

Hybrid topology:

DC2 <-> DC1 -> DC3
DC3 <-> DC1 -> DC2
DC2 <-> DC1 <-> DC3
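
The topologies above can be modelled as directed replication edges, which makes it possible to tell whether a DC is a source, an intermediate or a destination before restoring the original topology. This is a minimal sketch under assumptions: the role labels, the function names and the rule that any healthy direct neighbour is a candidate source are illustrative choices, not the platform's actual logic.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (from_dc, to_dc): data replicates from -> to


def expand(links: Iterable[Tuple[str, str, str]]) -> Set[Edge]:
    """Turn ("DC1", "->", "DC2") or ("DC1", "<->", "DC2") into directed edges."""
    edges: Set[Edge] = set()
    for a, arrow, b in links:
        if arrow not in ("->", "<->"):
            raise ValueError(f"unknown link type: {arrow}")
        edges.add((a, b))
        if arrow == "<->":
            edges.add((b, a))
    return edges


def roles(edges: Set[Edge]) -> Dict[str, str]:
    """Classify each DC by the direction of its replication links."""
    result: Dict[str, str] = {}
    for dc in sorted({n for edge in edges for n in edge}):
        outgoing = {b for a, b in edges if a == dc}
        incoming = {a for a, b in edges if b == dc}
        if outgoing & incoming:
            result[dc] = "peer"  # at least one bidirectional link
        elif not incoming:
            result[dc] = "source"
        elif not outgoing:
            result[dc] = "destination"
        else:
            result[dc] = "intermediate"
    return result


def candidate_sources(edges: Set[Edge], failed_dc: str, healthy: Iterable[str]) -> List[str]:
    """Healthy DCs directly linked to the failed DC (assumed able to reseed it)."""
    neighbours = {b for a, b in edges if a == failed_dc}
    neighbours |= {a for a, b in edges if b == failed_dc}
    return sorted(neighbours & set(healthy))


if __name__ == "__main__":
    chain = expand([("DC3", "->", "DC1"), ("DC1", "->", "DC2")])
    print(roles(chain))
    print(candidate_sources(chain, "DC1", healthy={"DC2", "DC3"}))
```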

ToDo - Add an intuitive graphical view instead of text

10 of 13

Restore strategy	Deciding factors
Customer restores the data from offline jobs	Tenants have a pipeline available to restore the entire data set. Tenants should have levers to restore more important data quickly to resume operations.
Replication [data migration through replication from another healthy DC]	Can only be performed if there is a replica DC without data loss. Supports partial data restoration based on the updated-at timestamp, with a lever to restore only recently updated records. A DC with lower RPO and RTO is preferred. Preferred approach when replication speed must be controlled, since throttling is supported dynamically.
Data restore using backup [data restore through the last successful backup]	Can only be executed if backup is set up on the cluster and a successful backup is ready to restore. Fastest bootstrap of an empty cluster, up to the last backup's RPO. 20% faster than XDR restore. The delta records between the last backup's start time and now will be unavailable.
Backup restore followed by replication for delta data	Fastest way to recover a cluster without any data loss. Preferred over other options in case of time constraints.

DR and Restoration strategies
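
The deciding factors in the table can be read as a simple decision order. The sketch below is one possible encoding, assuming that the combined backup-plus-replication path is tried first, followed by replication, backup only, and finally the tenant's offline jobs. The `ClusterState` fields and the priority order are assumptions made for illustration; a real platform would also weigh RPO, RTO and tenant preferences.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    healthy_replica_dc: bool       # a replica DC exists without data loss
    successful_backup: bool        # backup is set up and a successful backup is ready
    tenant_offline_pipeline: bool  # tenant can reload data from offline jobs


def choose_restore_strategy(state: ClusterState) -> str:
    """Pick a restore strategy from the table, in an assumed priority order."""
    if state.successful_backup and state.healthy_replica_dc:
        # Fast bootstrap from backup, then replication fills the delta.
        return "backup restore followed by replication for delta data"
    if state.healthy_replica_dc:
        return "replication from a healthy DC"
    if state.successful_backup:
        # Records written after the last backup started stay unavailable.
        return "data restore using backup"
    if state.tenant_offline_pipeline:
        return "customer restores the data from offline jobs"
    return "no restore path available"


if __name__ == "__main__":
    state = ClusterState(
        healthy_replica_dc=True,
        successful_backup=False,
        tenant_offline_pipeline=True,
    )
    print(choose_restore_strategy(state))
```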

11 of 13

Infusing Intelligence@scale

Insights & Recommendations: Leveraging LLMs for proactive identification of optimization opportunities for maintainers.
Self-serve troubleshooting: Enabling platform users with LLM-powered troubleshooting and debugging guidance.

ToDo: Add Insights generation components/arch diagram/flow
