1 of 13

The Right 'AIR' Mix:

Fueling High-Performance Platforms

(Work in progress)

2 of 13

A - Availability

I - Intelligence

R - Resilience

3 of 13

Quick Intro: Building blocks of a Platform

(Type of Platform: DbaaS)

  • Automation-first approach. Development over operational management.

  • Focus on self-serve features.

  • Kubernetes stateful objects

  • Build tooling to support increasing scale and maintenance overhead.

4 of 13

Quantifying Reliability of a Platform: Availability

  • Availability Computation
    • The “multi-nine” challenge: Why uptime isn’t the whole story.
    • Our scientific, real-time approach to calculating true DbaaS availability.
  • Resiliency Trade-offs in Multi-Cloud
    • Designing for failure: DR strategies across multiple cloud environments.
    • Real-world examples: Balancing opex, performance, and recovery during peak events.

Visibility interval

  • 5m, 24h, 30d

Level of Granularity:

  • Unified single view across multiple DCs
  • DC level
  • Tenant level for multi-tenant platforms
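
To make the "multi-nine" challenge concrete for the visibility intervals above, here is a minimal sketch of how little room a target leaves in each window. It assumes the target is read as a fraction of wall-clock time in the window; the names `WINDOWS_SECONDS` and `downtime_budget_seconds` are illustrative and not part of the platform.

```python
# Visibility intervals from the slide, expressed in seconds.
WINDOWS_SECONDS = {
    "5m": 5 * 60,
    "24h": 24 * 60 * 60,
    "30d": 30 * 24 * 60 * 60,
}


def downtime_budget_seconds(target_pct: float, window_seconds: int) -> float:
    """Seconds of unavailability a target (e.g. 99.9) allows in one window."""
    return window_seconds * (1 - target_pct / 100)


if __name__ == "__main__":
    for label, seconds in WINDOWS_SECONDS.items():
        budget = downtime_budget_seconds(99.9, seconds)
        print(f"{label}: {budget:.1f}s of unavailability allowed at 99.9%")
```

At 99.9%, the 30d window allows about 2,592 seconds (43.2 minutes), while the 5m window allows only 0.3 seconds, which is why short windows are far more sensitive to brief failures.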

5 of 13

Unified Fleet view

ToDo: Add more dashboards and monitoring snapshots

6 of 13

Perception of Availability

Availability (%) = (Successful units of work / Total units of work) * 100

Computation

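Per DC = (1 - sum(increase(failed_queries{context="DC1"}[5m])) / sum(increase(total_queries{context="DC1"}[5m]))) * 100

(increase() assumes failed_queries and total_queries are counters; a range selector such as [5m] cannot be passed to sum() directly.)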
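
The same per-DC formula can be sketched outside the monitoring system. This is only an illustration under assumptions: the query counts are made-up sample values for one 5m window, and `availability_pct` and `window_counts` are hypothetical names.

```python
from typing import Dict, Optional


def availability_pct(failed: float, total: float) -> Optional[float]:
    """Availability (%) = (1 - failed / total) * 100 for one window."""
    if total <= 0:
        # No traffic in the window: availability is undefined, not 100%.
        return None
    return (1 - failed / total) * 100


# Hypothetical failed/total query counts per DC for a single 5m window.
window_counts: Dict[str, Dict[str, float]] = {
    "DC1": {"failed": 5, "total": 10_000},
    "DC2": {"failed": 0, "total": 8_000},
}

per_dc = {
    dc: availability_pct(counts["failed"], counts["total"])
    for dc, counts in window_counts.items()
}

if __name__ == "__main__":
    for dc, pct in per_dc.items():
        print(dc, "n/a" if pct is None else f"{pct:.3f}%")
```

Returning `None` for a window without traffic is a design choice in this sketch; it keeps idle DCs from being reported as fully available.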

ToDo:

  • Add Overall
  • Add tenant level
  • Segregate Read and Write availability

Prerequisite for calculation:

  • List of metrics anyone would need - ToDo

7 of 13

Resiliency@scale

For a multi-tenant platform operating in a multi-cloud environment, resiliency strategies have to be decided case by case. No one-size-fits-all solution is possible.

Complexity:

Local - complete/partial

Remote - complete/partial

Single region/ Multi-region

ToDo: Complete the scenarios identified here

8 of 13

Resiliency@scale (contd.)

Learnings:

  • Identifying patterns in the DR mechanisms tenants accept. E.g., is traffic routing possible for the tenant to perform an ELB failover?
  • RPO and RTO awareness of tenants.
  • Identify Healthy DC/cluster to act as a source for initiating resiliency measures.
  • Identify the topologies - active/primary, passive/read-only, AA, AP, linear chain or star replication.
  • Identify replication replay/rewind vs. backup restore trade-offs.
  • Automated self-serve tools that show the expected completion time to users.
  • Calling out the impact of each possible option.
  • Restoring the original topologies based on source, intermediate or destination in the chain.

9 of 13

Replication Topology

    • Linear chain topology:
      • DC1 -> DC2
      • DC1 <-> DC2
      • DC3 -> DC1 -> DC2

    • Star topology
      • DC1 -> DC2, DC1 -> DC3

    • Hybrid topology
      • DC2 <-> DC1 -> DC3
      • DC3 <-> DC1 -> DC2
      • DC2 <-> DC1 <-> DC3
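
The arrow notation above maps directly onto directed replication edges. A minimal sketch, assuming `->` means one-way replication and `<->` means replication in both directions; `parse_topology` and `upstream_sources` are illustrative helpers, not platform tooling.

```python
import re
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (source DC, destination DC)


def parse_topology(spec: str) -> Set[Edge]:
    """Turn a spec such as 'DC2 <-> DC1 -> DC3' into directed edges."""
    edges: Set[Edge] = set()
    for chain in spec.split(","):
        # Tokens alternate: node, arrow, node, arrow, node, ...
        tokens = re.findall(r"<->|->|\w+", chain)
        for i in range(0, len(tokens) - 2, 2):
            src, arrow, dst = tokens[i], tokens[i + 1], tokens[i + 2]
            edges.add((src, dst))
            if arrow == "<->":
                edges.add((dst, src))
    return edges


def upstream_sources(edges: Set[Edge], dc: str) -> List[str]:
    """DCs that replicate into `dc`, i.e. candidate sources for rebuilding it."""
    return sorted(src for src, dst in edges if dst == dc)


if __name__ == "__main__":
    hybrid = parse_topology("DC2 <-> DC1 -> DC3")
    print(sorted(hybrid))
    print(upstream_sources(hybrid, "DC3"))
```

Keeping the topology as an edge set makes it straightforward to look up which DCs can act as a source for a failed DC, and to compare the current edges against the original ones when restoring the topology.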

ToDo - Add an intuitive graphical view instead of text

10 of 13

Restore strategy

Deciding factors

Customer restores the data from offline jobs

  • Tenants have a pipeline available to restore the entire data set.
  • Tenants have levers to restore the more important data first, so operations can resume quickly.

Replication

[Data migration through replication from other healthy DC]

  • It can only be performed if there is a replica DC without data loss.
  • Supports partial data restoration based on the updated-at timestamp, with a lever to restore only recently updated records.
  • DC with lower RPO and RTO is preferred.
  • Preferred approach for controlled replication speed, where throttling is supported dynamically.

Data Restore using backup

[Data restore through last successful backup]

  • It can only be executed if backup is set up on the cluster and a successful backup is ready to restore.
  • Fastest bootstrap of an empty cluster up to the last backup's RPO. 20% faster than XDR restore.
  • Delta records written between the last backup's start time and now will be unavailable.

Backup restore followed by replication for delta data

  • Fastest way to recover a cluster without any data loss
  • Preferred over other options in case of time constraints.
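
The deciding factors above can be read as a small decision function. This is a sketch under stated assumptions, not the platform's actual logic: the input flags, the `RestoreStrategy` names, and the fallback order between a backup-only restore and the tenant's offline pipeline are all illustrative.

```python
from enum import Enum
from typing import Optional


class RestoreStrategy(Enum):
    TENANT_OFFLINE_RESTORE = "customer restores the data from offline jobs"
    REPLICATION = "data migration through replication from a healthy DC"
    BACKUP = "data restore through last successful backup"
    BACKUP_THEN_REPLICATION = "backup restore followed by replication for delta data"


def choose_restore_strategy(
    healthy_replica: bool,   # a replica DC without data loss exists
    backup_ready: bool,      # a successful backup is ready to restore
    tenant_pipeline: bool,   # tenant can re-load data from offline jobs
    time_constrained: bool,  # recovery time matters more than controlled speed
) -> Optional[RestoreStrategy]:
    if healthy_replica and backup_ready and time_constrained:
        # Fastest way to recover without data loss.
        return RestoreStrategy.BACKUP_THEN_REPLICATION
    if healthy_replica:
        # No data loss, and replication speed can be throttled.
        return RestoreStrategy.REPLICATION
    if backup_ready:
        # Assumed fallback: records written after the backup started are lost.
        return RestoreStrategy.BACKUP
    if tenant_pipeline:
        return RestoreStrategy.TENANT_OFFLINE_RESTORE
    return None  # no automated option; needs a case-by-case decision


if __name__ == "__main__":
    print(choose_restore_strategy(True, True, False, True))
```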

DR and Restoration strategies

11 of 13

Infusing Intelligence@scale

  • Insights & Recommendations: Leveraging LLMs for proactive identification of optimization opportunities for maintainers.
  • Self-serve troubleshooting: Enabling platform users with LLM-powered troubleshooting and debugging guidance.

ToDo: Add Insights generation components/arch diagram/flow

12 of 13

Automated recommendations generated per Tenant onboarded to platform

13 of 13

Thank you