Business Continuity & Disaster Recovery Policy

Business Continuity

Business continuity is the ability to maintain operations and services in the face of a disruptive event. This requires the availability of computing, application services, physical network access, and network services, as well as user/client access to this infrastructure. Collavate maintains continuity of operations and services, including systems such as web servers, email, and critical databases, which requires specific technology. This technology and infrastructure can include cloud services, virtualization, clustering/failover/failback, server hardware, networking and network services, remote data center facilities, replication services, and redundant shared storage. Depending on the type of event, continuity of a given application is achieved by failing over its services and user/client access locally within the same data center or to a remote, physically separate data center provided by Google Cloud Platform (GCP).

With Collavate's business continuity plan, failover of a service is measured in seconds. Backup technologies, including those that rely on disk as a backup target, cannot provide this level of service continuity. Backups require a restoration process before they can be used and are typically reserved for disaster recovery purposes.

Our business impact analysis (BIA) template is designed to help companies establish a clear plan of action after a disruption to normal business processes. Through both qualitative and quantitative business operation variables, a BIA collects the information needed to develop a targeted recovery strategy that maintains productivity and business continuity. These variables include the recovery time objective (RTO), recovery point objective (RPO), and maximum tolerable downtime (MTD). By identifying the severity of impact, resource requirements, and recovery priorities, a company can minimize its recovery time. Once these initial components are established, the BIA can assess the financial and operational impacts based on the level of severity experienced by business units, departments, and processes.

Disaster Recovery

Phase I – Data Collection

1. The project should be organized with a timeline, resources, and expected outputs

2. A business impact analysis should be conducted at regular intervals

3. A risk assessment should be conducted regularly

4. Onsite and offsite backup and recovery procedures should be reviewed

5. An alternate site location must be selected and ready for use

Phase II – Plan Development and Testing

1. Development of Disaster Recovery Plan

2. Testing the plan

Phase III – Monitoring and Maintenance

1. Maintenance of the Plan through updates and review

2. Periodic inspection of DRP

3. Documentation of changes

Objective

The statement of the objective should include project details, onsite/offsite data, resources, and business type.

Disaster Recovery Plan Criteria

Documentation of the procedures for declaring an emergency, evacuating the site depending on the nature of the disaster, activating backups, notifying the related officials, DR team, and staff, the procedures to be followed when a disaster breaks out, and alternate location specifications should all be maintained. It is beneficial to be prepared in advance with sample DRPs and disaster recovery examples so that every individual in the organization is better educated on the basics.

DR Team – Roles and Responsibilities

Documentation should include identification and contact details of key personnel in the disaster recovery team, along with their roles and responsibilities.

Contingency Procedures

The routine to be followed when operating in contingency mode should be determined and documented. It should include an inventory of systems and equipment at the location; descriptions of processes, equipment, and software; minimum processing requirements; the location of vital records, with categories; descriptions of data and communication networks; and customer/vendor details. A resource plan should be developed for operating in emergency mode. The essential procedures for restoring normalcy and business continuity must be listed, including the steps for recovering lost data and returning to normal operating mode.

Testing and Maintenance

The dates of testing, disaster recovery scenarios, and plans for each scenario should be documented. Maintenance involves a record of scheduled reviews on a daily, weekly, monthly, quarterly, or yearly basis; reviews of plans, teams, activities, and tasks accomplished; and a complete documentation review and update. The disaster recovery plan developed in this way should be tested for effectiveness. To aid in that function, a test strategy and corresponding test plan should be developed and administered. The results obtained should be recorded and analyzed, and the plan modified as required. Organizations recognize the importance of business continuity plans that keep their business operations running without hindrance. Disaster recovery planning is a crucial component for today's network-based organizations, determining productivity and business continuity.

Collavate adapts Google’s Disaster Recovery Guideline

When it comes to disaster recovery, there’s no silver bullet—that is, no single recovery plan can cover all use cases. This article provides guidance for handling a variety of disaster recovery scenarios using Google’s cloud infrastructure.

Note: This document addresses targeted disaster recovery scenarios. The viability of each suggested approach is subject to your specific compliance requirements. For general advice for developing a disaster recovery plan on Google Cloud Platform, see Designing a Disaster Recovery Plan with Google Cloud Platform.

Terminology

This article uses the following terms:

● The recovery time objective (RTO), which is the maximum acceptable length of time that your application can be offline. This value is usually defined as part of a larger service level agreement (SLA).

● The recovery point objective (RPO), which is the maximum acceptable length of time during which data might be lost due to a major incident. Note that this metric describes the length of time only; it does not address the amount or quality of the data lost.

For a broader discussion of these concepts, as well as general principles for designing a disaster recovery plan, see Designing a Disaster Recovery Plan with Google Cloud Platform.

Scenarios

This section explores common disaster recovery scenarios and provides recovery strategies and example implementations on Google Cloud Platform for each.

Historical data recovery

Historical data most often needs to be archived for compliance reasons, but it is also commonly archived for use in future historical analysis. In both cases, it’s important to archive relevant log and database data in a durable way using an easily accessible and transformable format. Typically, historical data has a medium or large RTO. However, as it is expected to be complete and accurate, historical data tends to have a small RPO.

Archiving log data

Log data is usually used for historical trend analysis and for potential forensic analysis. Generally, this data does not need to be stored for years. However, as noted earlier, it’s important that this data can be easily imported into a format that lends itself to analysis.

Google Cloud Platform provides several options for exporting log data, including:

● Stream to Google Cloud Storage bucket, which periodically writes your logs to Cloud Storage. The files are timestamped, encrypted, and stored in appropriately-named folders, making it simple to locate logs from a given time period.

● Stream to BigQuery dataset, which streams your logs to a BigQuery dataset. BigQuery stores data in an immutable, read-only manner. For details on exporting logs, see Exporting Your Logs.
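
As an illustration of these export options, here is a minimal sketch that uses the google-cloud-logging Python client to create two export sinks, one writing to a Cloud Storage bucket and one streaming to a BigQuery dataset. The sink names, filter, bucket, and dataset are hypothetical placeholders, and the exact client calls may vary with the library version.

    # Sketch: create log export sinks with the google-cloud-logging client.
    # Sink names, the filter, and destinations below are illustrative placeholders.
    from google.cloud import logging

    client = logging.Client()

    # Sink 1: periodically write matching log entries to a Cloud Storage bucket.
    storage_sink = client.sink(
        "archive-to-gcs",
        filter_="severity>=WARNING",
        destination="storage.googleapis.com/my-log-archive-bucket",
    )

    # Sink 2: stream matching log entries to a BigQuery dataset.
    bigquery_sink = client.sink(
        "archive-to-bq",
        filter_="severity>=WARNING",
        destination="bigquery.googleapis.com/projects/my-project/datasets/log_archive",
    )

    for sink in (storage_sink, bigquery_sink):
        if not sink.exists():
            sink.create()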

Archiving database data

Note: This section discusses archiving and retrieving relational database records. However, the methodology described can be applied to non-relational databases, such as NoSQL datastores, as well.

Relational database backups often use a multitiered solution, where the live data is stored on a local storage device and backups are stored on progressively “colder” storage solutions. In this solution, a cron job (or similar) backs up the live data to the second tier at regular intervals, and another job is used to back up data from that tier to another tier at slightly wider intervals. One possible implementation of this strategy on Google Cloud Platform would be to use persistent disk for the live data tier, a standard Cloud Storage bucket for the second tier, and a Cloud Storage Nearline bucket for the final tier.

In this implementation, the tiers would be connected as follows:

1. Configure your application to back up data to the persistent disk attached to the instance.

2. Set up a task, such as a cron job, to move the data to the standard Cloud Storage bucket after a defined period of time.

3. Finally, set up another cron job or use Cloud Storage Transfer Service to move your data from the standard bucket to the Nearline bucket.

Note: You can find Python example code for Cloud Storage Transfer Service in Cloud Platform’s GitHub repository.
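
A minimal sketch of step 2, assuming the google-cloud-storage Python client, is shown below: a script run from cron that copies the latest backup file from the persistent disk into the standard Cloud Storage bucket. The local path and bucket name are hypothetical, and a production job would add error handling and integrity checks.

    # Sketch: cron-driven copy of a local database backup to Cloud Storage.
    # The bucket name and local backup path are illustrative placeholders.
    import datetime

    from google.cloud import storage

    LOCAL_BACKUP = "/mnt/backup/db-latest.dump"   # written by the database's own dump job
    BUCKET_NAME = "my-db-backups-standard"        # second-tier (standard) bucket

    def upload_latest_backup():
        client = storage.Client()
        bucket = client.bucket(BUCKET_NAME)
        # Timestamp the object name so older backups are retained in the bucket.
        object_name = datetime.datetime.utcnow().strftime("db-backup-%Y%m%d-%H%M%S.dump")
        blob = bucket.blob(object_name)
        blob.upload_from_filename(LOCAL_BACKUP)
        print(f"Uploaded {LOCAL_BACKUP} to gs://{BUCKET_NAME}/{object_name}")

    if __name__ == "__main__":
        upload_latest_backup()

Step 3 can then be handled the same way on a wider schedule, or delegated to Cloud Storage Transfer Service as noted above.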

To make this a complete disaster recovery solution, you must also implement some method of restoring your backups to a compatible version of the database.

Three viable approaches are as follows:

● Create a custom image that has the proper version of the database system installed. You can then create a new Compute Engine instance with this image to test the import process. Note that this approach requires regular and rigorous testing.

● Take regular snapshots of your database system. If your database system lives on a Compute Engine persistent disk, you can take snapshots of your system each time you upgrade. If your database system goes down or you need to roll back to a previous version, you can simply create a new persistent disk from your desired snapshot and make that disk the boot disk for a new Compute Engine instance. Note that, to avoid data corruption, this approach requires you to freeze the database system’s disk while taking a snapshot.

● Export the data to a highly portable flat format such as CSV, XML, or JSON, and store it in Cloud Storage Nearline. This approach will provide maximum flexibility, allowing you to import the data into any database system you choose to use. In addition, JSON and CSV can be easily imported into BigQuery, which will make future analysis simple and straightforward. Note: This approach’s viability is subject to your specific compliance requirements.
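
As a sketch of the third approach, the snippet below exports a table to CSV and stores the file in a Nearline bucket using the google-cloud-storage client. The SQLite source database, table, and bucket names are hypothetical stand-ins; the database access code will differ for your actual database system.

    # Sketch: export database rows to CSV and archive the file in a Nearline bucket.
    # The SQLite source database and all names below are illustrative placeholders.
    import csv
    import io
    import sqlite3

    from google.cloud import storage

    NEARLINE_BUCKET = "my-db-archive-nearline"   # bucket created with the Nearline storage class

    def export_table_to_nearline(db_path: str, table: str) -> None:
        # Dump the table to an in-memory CSV buffer.
        conn = sqlite3.connect(db_path)
        cursor = conn.execute(f"SELECT * FROM {table}")
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow([col[0] for col in cursor.description])   # header row
        writer.writerows(cursor.fetchall())
        conn.close()

        # Upload the CSV to the Nearline bucket.
        client = storage.Client()
        blob = client.bucket(NEARLINE_BUCKET).blob(f"{table}.csv")
        blob.upload_from_string(buffer.getvalue(), content_type="text/csv")

    export_table_to_nearline("/mnt/data/app.db", "orders")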

Archiving directly to BigQuery

If your use case permits, you can archive real-time event data directly into BigQuery by using streaming inserts. This approach is particularly useful for performing big data analytics. To prevent accidental overwrites, you should use IAM to manage who has update and delete access to the data written to the tables.
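
A minimal sketch of streaming inserts with the google-cloud-bigquery Python client is shown below. The table ID and event fields are hypothetical, and the destination table and its schema are assumed to already exist.

    # Sketch: stream real-time event rows directly into an existing BigQuery table.
    # The table ID and row fields are illustrative placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.events_dataset.raw_events"   # assumed to exist with a matching schema

    rows = [
        {"event_time": "2016-01-01T12:00:00Z", "user_id": "u-123", "action": "login"},
        {"event_time": "2016-01-01T12:00:05Z", "user_id": "u-456", "action": "purchase"},
    ]

    errors = client.insert_rows_json(table_id, rows)     # streaming insert
    if errors:
        # Surface per-row insert errors so they can be retried or logged.
        print("Streaming insert errors:", errors)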

Data corruption recovery

When database data has been corrupted, your data will need to be recovered easily and made available quickly. A good approach here is to use backups in combination with transactional log files from the corrupted database to roll back to a known-good state.

If you have chosen to use Cloud SQL, Google Cloud Platform’s fully managed MySQL database, you should enable automated backups and binary logging for your Cloud SQL instances. This allows you to perform a point-in-time recovery, which restores your database from a backup and recovers it to a fresh Cloud SQL instance. For more details, see Cloud SQL Backups and Recovery.
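
As a sketch of enabling these settings programmatically, the snippet below patches a Cloud SQL instance's backup configuration through the Cloud SQL Admin API using the Google API Python client. The project and instance names are placeholders; the same settings can also be applied from the Cloud Console or the gcloud tool.

    # Sketch: enable automated backups and binary logging on a Cloud SQL instance
    # through the Cloud SQL Admin API. Project and instance names are placeholders.
    from googleapiclient import discovery

    service = discovery.build("sqladmin", "v1beta4")

    body = {
        "settings": {
            "backupConfiguration": {
                "enabled": True,            # nightly automated backups
                "startTime": "03:00",       # backup window (UTC)
                "binaryLogEnabled": True,   # required for point-in-time recovery
            }
        }
    }

    request = service.instances().patch(
        project="my-project",
        instance="my-cloudsql-instance",
        body=body,
    )
    response = request.execute()
    print("Operation:", response.get("name"))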

If you manage your own relational databases with Compute Engine, the principles remain the same, but you are responsible for managing the database service and implementing an appropriate backup process.

 If you are using an append-only data store like BigQuery, there are a number of mitigating strategies you can adopt:

● Export the data from BigQuery, and create a new table that contains the exported data but excludes the corrupted data.

● Store your data in different tables for specific time periods. This method ensures that you will need to restore only a subset of data to a new table, rather than a whole dataset.

● Store the original data on Cloud Storage. This will allow you to create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table.

● Note: This method provides good availability, with only a small hiccup as you point your applications to the new store. However, unless you have implemented application-level controls to prevent access to the corrupted data, this method can result in inaccurate results during later analysis.

Additionally, if your RTO permits, you can prevent access to the table with the corrupted data by leaving your applications offline until the uncorrupted data has been restored to a new table.
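
As a sketch of the Cloud Storage option above, the snippet below reloads the uncorrupted source data from a bucket into a new BigQuery table with a load job. The bucket URI and table ID are hypothetical, and a production job would pin the schema explicitly rather than rely on autodetection.

    # Sketch: recreate a clean BigQuery table from uncorrupted source data in Cloud Storage.
    # URIs and table IDs are illustrative placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    source_uri = "gs://my-event-archive/raw/2016-01-*.csv"       # original data kept on Cloud Storage
    new_table_id = "my-project.events_dataset.raw_events_clean"  # fresh table for applications to use

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,      # CSV files include a header row
        autodetect=True,          # infer the schema; a production job would pin it explicitly
    )

    load_job = client.load_table_from_uri(source_uri, new_table_id, job_config=job_config)
    load_job.result()             # wait for the load to complete
    print("Loaded rows:", client.get_table(new_table_id).num_rows)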

Application recovery

It’s important to maintain high levels of uptime—if your service is unavailable, you’re losing business. This section will examine ways of failing your application over to another location as quickly as possible.

Note: The solutions in this section focus on applications running entirely on Google Cloud Platform. For advice on handling remote recovery use cases, such as on-premises-to-cloud or cloud-to-cloud, see the Remote recovery section below.

Hot standby server failover

In this solution, you have a continuously online server on standby. This server does not receive traffic while the main application server is functional. If your service is running entirely on Google Compute Engine, you can streamline application failover by using Compute Engine’s HTTP load balancing service. The HTTP load balancer accepts traffic through a single global external IP address, and then distributes it according to forwarding rules you define. Properly configured, this service will automatically fail over to your standby server in the event that a main instance becomes unhealthy.

Important: The HTTP load balancing service can direct traffic only to Compute Engine instance groups; it cannot be used to send traffic to IPs outside your Compute Engine network.

Warm standby server failover

This solution is identical to hot standby server failover, but omits use of Compute Engine’s HTTP load balancing service in favor of manual DNS adjustment. Here, RTO is determined by how quickly you can adjust the DNS record to cut over to the standby server.
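
A sketch of this DNS cutover, assuming the zone is hosted in Cloud DNS and managed with the google-cloud-dns Python client, is shown below. The zone, record, and IP addresses are hypothetical placeholders; keeping a low TTL on the record is what keeps the cutover time, and therefore the RTO, short.

    # Sketch: cut traffic over to the warm standby by replacing the A record in Cloud DNS.
    # Zone name, domain, and IP addresses are illustrative placeholders.
    import time

    from google.cloud import dns

    client = dns.Client(project="my-project")
    zone = client.zone("prod-zone", "example.com.")

    OLD_IP = "203.0.113.10"      # main application server
    STANDBY_IP = "203.0.113.20"  # warm standby server
    TTL = 300                    # keep this low so clients pick up the change quickly

    old_record = zone.resource_record_set("www.example.com.", "A", TTL, [OLD_IP])
    new_record = zone.resource_record_set("www.example.com.", "A", TTL, [STANDBY_IP])

    changes = zone.changes()
    changes.delete_record_set(old_record)
    changes.add_record_set(new_record)
    changes.create()

    while changes.status != "done":   # wait for the change to complete in Cloud DNS
        time.sleep(5)
        changes.reload()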

Cold standby server failover

In this solution, you have an offline application server on standby that is identical to the main application server. In the event that the main application server goes offline, the standby server is instantiated. Once it is online, traffic fails over to it.

In this example, you would run the following:

● A serving instance. This instance is part of an instance group, which is used as a backend service for an HTTP load balancer.

● A minimal instance that performs the following functions:

○ Runs a cron job to snapshot the serving instance at regular intervals

○ Checks the health of the serving instance at regular intervals

This minimal instance is part of a managed instance group, and this group is controlled by a Compute Engine autoscaler. The autoscaler is configured to keep exactly one minimal instance running at all times, utilizing an instance template to create a new instance in the event that the current running instance becomes unavailable.

If the minimal instance detects that the serving instance has been unresponsive for a specified period of time, it instantiates a new instance using the latest snapshot and adds the new instance to the managed instance group. When the new instance comes online, the HTTP load balancer begins directing traffic to it.
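
The sketch below illustrates the health-check half of the minimal instance's job: it polls a health endpoint and, after repeated failures, hands off to a failover routine. The endpoint, thresholds, and the failover_to_snapshot() helper are hypothetical placeholders for the snapshot-restore and instance-creation steps described above.

    # Sketch: the health-check loop run by the minimal instance. On repeated failures it
    # hands off to failover_to_snapshot(), a hypothetical placeholder for the
    # snapshot-restore and instance-creation steps described in the surrounding text.
    import time
    import urllib.error
    import urllib.request

    HEALTH_URL = "http://serving-instance.internal/healthz"   # hypothetical health endpoint
    CHECK_INTERVAL_SECONDS = 30
    FAILURE_THRESHOLD = 4        # roughly two minutes of unresponsiveness before failing over

    def serving_instance_healthy() -> bool:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def failover_to_snapshot() -> None:
        # Placeholder: create a new instance from the latest snapshot and add it to the
        # managed instance group, as described above.
        print("Serving instance unresponsive; starting failover from latest snapshot.")

    consecutive_failures = 0
    while True:
        if serving_instance_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                failover_to_snapshot()
                consecutive_failures = 0
        time.sleep(CHECK_INTERVAL_SECONDS)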

Warm static site failover

In the unlikely event that you are unable to serve your application from Compute Engine instances, you can mitigate service interruption by having a Cloud Storage-based static site on standby. This solution is very economical, and can be particularly effective if your website has few or no dynamic elements—in the event of failure, you can simply change your DNS settings, and you will have something serving immediately.

The following diagram illustrates an example implementation:

Remote recovery

If your production environment is on-premises or on another cloud provider, Google Cloud Platform can be useful as a target for backups and archives. Using Carrier Interconnect, Direct Peering, and/or Compute Engine VPN, you can easily adapt the previously described disaster recovery strategies to your own situation. This section discusses methods for integrating Google Cloud Platform into your remote disaster recovery strategies.

Replicating storage with Google Cloud Platform

If you are replicating from an on-premises storage appliance, you can use Carrier Interconnect or Direct Peering to establish a connection with Google Cloud Platform, then copy your data to the storage solution of your choice. Data can then be restored to your on-premises storage or to a storage location on Google Cloud Platform.

Replicating application data with Google Cloud Platform

In this scenario, production workloads are on-premises and Google Cloud Platform is the disaster recovery failover target. One possible solution is to set up a minimal recovery suite—a cold standby application server and a hot/active database—on Google Cloud Platform, configuring the former to quickly scale up in the event that it needs to run a production workload. In this situation, the database must be kept up-to-date; however, the application servers would only be instantiated when there is a need to switch over to production. Depending on your RTO, the appropriate image starting point would be used to start and configure a working instance.

The diagram below illustrates how a multitiered application can run on-premises while using a minimal recovery suite on Google Cloud Platform:

To reduce costs, you can run the database on the smallest machine type capable of running the database service. When the on-premises application needs to fail over, you can make your database system production-ready as follows:

1. Destroy the minimal instance, making sure to keep the persistent disk containing your database system intact. If your system is on the boot disk, you will need to set the auto-delete state of the disk to false before destroying this instance.

2. Create a new instance, using a machine type that has appropriate resources for handling a production load.

3. Attach the persistent disk containing your database system to the new instance.
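
A sketch of these three steps using the google-cloud-compute Python client is shown below. The project, zone, instance, disk, machine type, and image names are hypothetical, and the exact client methods and fields may vary with the library version.

    # Sketch: promote the minimal database instance to a production-sized one while
    # preserving its persistent disk. All names below are illustrative placeholders.
    from google.cloud import compute_v1

    PROJECT, ZONE = "my-project", "us-central1-b"
    OLD_INSTANCE, NEW_INSTANCE = "db-minimal", "db-production"
    DATA_DISK = "db-data-disk"

    instances = compute_v1.InstancesClient()

    # Step 1: keep the database disk when the minimal instance is destroyed.
    instances.set_disk_auto_delete(
        project=PROJECT, zone=ZONE, instance=OLD_INSTANCE,
        auto_delete=False, device_name=DATA_DISK,
    ).result()
    instances.delete(project=PROJECT, zone=ZONE, instance=OLD_INSTANCE).result()

    # Step 2: create a new instance with production-sized resources.
    new_instance = compute_v1.Instance(
        name=NEW_INSTANCE,
        machine_type=f"zones/{ZONE}/machineTypes/n1-highmem-8",
        disks=[compute_v1.AttachedDisk(
            boot=True, auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/debian-cloud/global/images/family/debian-11"),
        )],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    )
    instances.insert(project=PROJECT, zone=ZONE, instance_resource=new_instance).result()

    # Step 3: attach the preserved database disk to the new instance.
    instances.attach_disk(
        project=PROJECT, zone=ZONE, instance=NEW_INSTANCE,
        attached_disk_resource=compute_v1.AttachedDisk(
            source=f"projects/{PROJECT}/zones/{ZONE}/disks/{DATA_DISK}",
            device_name=DATA_DISK,
        ),
    ).result()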

In the event of a disaster, your monitoring service will be triggered to spin up the web tier and application tier instances in Google Cloud Platform. You can then adjust the Cloud DNS record to point to the web tier or, if you are using the Compute Engine HTTP load balancing service, to the load balancer’s external IP.

The following diagram illustrates the state of the overall production environment after the disaster recovery plan has been executed:

Maintaining machine image consistency

If you choose to implement an on-premises/cloud or cloud/cloud hybrid solution, you will most likely need to find a way to maintain consistency across production environments.

For a discussion of how to create an automated pipeline for continuously building images with Packer and other open source utilities, see Automated Image Builds with Jenkins, Packer, and Kubernetes.

If a fully-configured image is required, consider something like Packer, which can create identical machine images for multiple platforms from a single configuration file. In the case of Packer, you can put the configuration file in version control to keep track of what version is deployed in production.

As another option, you could use configuration management tools such as Chef, Puppet, Ansible, or Saltstack to configure instances with finer granularity, creating base images, minimally-configured images, or fully-configured images as needed.

For a discussion of how to use these tools effectively, see Compute Engine Management with Puppet, Chef, Salt, and Ansible.

You can also manually convert and import existing images, such as Amazon AMIs, VirtualBox images, and RAW disk images, to Compute Engine.