1 of 24

Automating for Failure

Olubusayo Amowe (Software Engineer)

Samson Olufuwa (DevOps Engineer)

2 of 24

A bit about us

We run a not-for-profit (www.fuerza.africa) that helps people from underrepresented groups start a DevSecOps career.

3 of 24

Key points

  • Why should you prepare for failure?
  • What is disaster recovery and what are its strategies?
  • How can Terraform help with disaster recovery?
  • Demo using Terraform and AWS

4 of 24

Failure is Normal

One of the key pillars of SRE success is accepting that failure is normal.

5 of 24

What is Disaster Recovery?

Disaster recovery is a set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems after a disaster.

6 of 24

Disaster Recovery strategies

  • Backup and Data Recovery
  • Pilot light
  • Warm - Hot Standby
  • Multi-Region

7 of 24

Backup and Data Recovery

This describes the process of creating and storing copies of data that can be used to protect organizations against data loss.
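
As a rough illustration of this strategy in Terraform, the sketch below schedules daily backups with AWS Backup. The vault and plan names, the schedule, the 30-day retention, the `Backup = "true"` tag, and the `backup_role_arn` variable are all assumptions made for this example, not values from the talk.

```hcl
# Hypothetical input: an IAM role that AWS Backup is allowed to assume.
variable "backup_role_arn" {
  description = "ARN of the IAM role used by AWS Backup"
  type        = string
}

# Vault that stores the recovery points.
resource "aws_backup_vault" "dr" {
  name = "dr-vault"
}

# Daily backup at 03:00 UTC, kept for 30 days (assumed values).
resource "aws_backup_plan" "daily" {
  name = "daily-backups"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.dr.name
    schedule          = "cron(0 3 * * ? *)"

    lifecycle {
      delete_after = 30
    }
  }
}

# Back up every supported resource tagged Backup = "true".
resource "aws_backup_selection" "tagged" {
  name         = "tagged-resources"
  iam_role_arn = var.backup_role_arn
  plan_id      = aws_backup_plan.daily.id

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}
```

The backup interval sets the worst-case data loss: with one backup a day, up to a day of data can be lost.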

8 of 24

Pilot Light

Replicate part of your IT infrastructure, covering a limited set of core services, so that your cloud environment can take over in the event of a disaster.
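
One possible way to express a pilot light in Terraform is sketched below: a database replica stays running in the recovery environment, while the application servers exist only in code until a flag is flipped. The variable names (`dr_active`, `app_ami_id`, `primary_db_arn`), instance sizes, and instance count are assumptions for illustration.

```hcl
# Hypothetical inputs; none of these names come from the talk.
variable "dr_active" {
  description = "Set to true during a disaster to start the application tier"
  type        = bool
  default     = false
}

variable "app_ami_id" {
  description = "AMI for the application servers"
  type        = string
}

variable "primary_db_arn" {
  description = "ARN of the primary database to replicate from"
  type        = string
}

# Core service that is always on: a read replica of the primary database.
resource "aws_db_instance" "replica" {
  identifier          = "app-db-replica"
  replicate_source_db = var.primary_db_arn
  instance_class      = "db.t3.micro"
  skip_final_snapshot = true
}

# Application tier: defined in code, but no instances run until dr_active is true.
resource "aws_instance" "app" {
  count         = var.dr_active ? 2 : 0
  ami           = var.app_ami_id
  instance_type = "t3.micro"
}
```

During a disaster, applying the same configuration with `dr_active = true` would start the application servers next to the replica.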

9 of 24

Warm - Hot standby

A scaled-down version of a fully functional environment is always running.
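
A warm standby could be sketched as an Auto Scaling group that normally runs a single instance and is scaled up on failover. The `failover`, `subnet_ids`, and `launch_template_id` variables and the capacity numbers are assumptions for this example.

```hcl
# Hypothetical inputs for the standby environment.
variable "failover" {
  description = "Set to true to scale the standby up to full capacity"
  type        = bool
  default     = false
}

variable "subnet_ids" {
  description = "Subnets in the standby environment"
  type        = list(string)
}

variable "launch_template_id" {
  description = "Launch template describing the application servers"
  type        = string
}

# One instance is always running; capacity grows when failover is true.
resource "aws_autoscaling_group" "standby" {
  name                = "app-standby"
  min_size            = 1
  max_size            = 10
  desired_capacity    = var.failover ? 6 : 1
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }
}
```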

10 of 24

Multi-Region

Your infrastructure runs in multiple regions.
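
In Terraform, running the same stack in two regions is commonly done with provider aliases. The sketch below assumes a local module at `./modules/app` (hypothetical) and picks `us-east-1` and `us-west-2` purely as example regions.

```hcl
# Default provider: the primary region (example value).
provider "aws" {
  region = "us-east-1"
}

# Aliased provider: the second region (example value).
provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# The same hypothetical module is instantiated once per region.
module "app_primary" {
  source = "./modules/app"

  providers = {
    aws = aws
  }
}

module "app_dr" {
  source = "./modules/app"

  providers = {
    aws = aws.dr
  }
}
```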

11 of 24

Some terms related to Disaster Recovery (RTO & RPO)

RTO, or recovery time objective, is the maximum acceptable length of time your application can be offline. This usually depends on the SLAs you offer to your customers. An SLA is a promise made by you, as a service provider, to your consumers about the availability of your service and the ramifications of failing to deliver the agreed-upon level of service.

RPO, or recovery point objective, is the maximum acceptable length of time during which data might be lost. For example, if you back up every 4 hours, you can lose up to 4 hours of data.

Typically, the smaller your RTO and RPO values (that is, the faster your application must recover and the less data it can afford to lose), the more it costs to run.

12 of 24

Issues with using other tools for DR

  • Very expensive
  • Restricted to a particular vendor
  • Sometimes really slow
  • And again very expensive

14 of 24

How can Terraform help with disaster recovery?

  • Infrastructure as Code, which makes DR repeatable
  • Good RTO and RPO at a lower cost
  • Spins up an identical replica of your infrastructure quickly
  • Works with any cloud provider

15 of 24

Steps to enable DR with Terraform

  • Turn your infrastructure into IaC
  • Enable backups for your infrastructure
  • Create your AMIs
  • Split your infrastructure into modules
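
The AMI and module steps above might look roughly like this. The `source_instance_id` variable, the AMI name, the `./modules/app` path, and its `ami_id` input are hypothetical names chosen for the example.

```hcl
# Hypothetical input: a running instance that is already configured.
variable "source_instance_id" {
  description = "ID of the instance to capture as an AMI"
  type        = string
}

# Create an AMI from the configured instance.
resource "aws_ami_from_instance" "app" {
  name               = "app-golden-image"
  source_instance_id = var.source_instance_id
}

# Reuse the same module wherever the application has to be recreated.
# The module and its ami_id input are assumed to exist.
module "app" {
  source = "./modules/app"
  ami_id = aws_ami_from_instance.app.id
}
```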

16 of 24

Best Practices with Terraform DR

  • Your infrastructure should be stateless
  • All variables and system values should be dynamic, not hard-coded
  • Practice fire drills as often as your organization needs
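
To illustrate what dynamic values can mean in practice, the sketch below takes the region as a variable and looks up availability zones at plan time instead of hard-coding them. The variable names and the subnet layout are assumptions.

```hcl
# Hypothetical inputs: nothing region-specific is hard-coded.
variable "region" {
  description = "Region to deploy into; change it to rebuild elsewhere"
  type        = string
}

variable "vpc_id" {
  description = "VPC that holds the subnets"
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block of the VPC, for example 10.0.0.0/16"
  type        = string
}

provider "aws" {
  region = var.region
}

# Look up the zones of whichever region is selected.
data "aws_availability_zones" "available" {
  state = "available"
}

# One subnet in each of the first two zones, with CIDRs derived from the VPC range.
resource "aws_subnet" "app" {
  count             = 2
  vpc_id            = var.vpc_id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}
```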

17 of 24

Considerations when using Terraform

  • Applications that are not managed by Terraform can take time to install.
  • DNS changes can take time to propagate.
  • Terraform will not back up your data; have a good data backup strategy.
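
For the DNS point, one common mitigation is a short TTL on the record that points at the active environment, so a switch is picked up sooner. The sketch below assumes a Route 53 hosted zone; `zone_id`, `active_ip`, and the `app.example.com` name are placeholders.

```hcl
# Hypothetical inputs.
variable "zone_id" {
  description = "Route 53 hosted zone ID"
  type        = string
}

variable "active_ip" {
  description = "Public IP of the environment that should receive traffic"
  type        = string
}

# A 60-second TTL limits how long resolvers cache the old address.
resource "aws_route53_record" "app" {
  zone_id = var.zone_id
  name    = "app.example.com"
  type    = "A"
  ttl     = 60
  records = [var.active_ip]
}
```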

18 of 24

DEMO

19 of 24

Terraform config (main.tf)

data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical
}
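
A data source like this is typically consumed by an instance resource in the same main.tf. The resource below is a sketch of that next step, not the demo's actual code; the resource name, instance type, and tag are assumptions.

```hcl
# Launch an instance from the most recent Ubuntu 20.04 AMI found by
# the data "aws_ami" "ubuntu" block in this file.
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  tags = {
    Name = "dr-demo"
  }
}
```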

20 of 24

Terraform config (versions.tf)
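
A versions.tf file usually pins the Terraform and provider versions so that every run, including a recovery run, uses the same tooling. The sketch below shows the general shape; the version constraints are assumptions, not the ones used in the demo.

```hcl
terraform {
  # Assumed minimum Terraform version.
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0" # assumed provider version constraint
    }
  }
}
```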

21 of 24

AWS CodeBuild config (buildspec.yml)

23 of 24

RECAP

  • Be prepared for disaster
  • Have a plan
  • Do regular tests to validate your plan
  • Terraform makes DR easy
  • Remember to back up your data

24 of 24

Thank you!