[TM Technical Report] 
Data Backup and Disaster Recovery

[Author: Shawn Lee]

Introduction

Data storage in TEAMMATES

Strengths of the GAE datastore

Disasters that can affect the GAE datastore

Available solutions by GAE

Datastore admin

Using the Datastore admin

Limitations of the datastore admin

Bulkloader

Using the bulkloader

Limitations of bulkloader

Incremental Backup

Implementing an incremental backup

Retrieving recently modified courses via logs

Downloading all data within a course using remote api

Automatically scheduling downloads using scheduled tasks

Uploading backup data

Future work

Resources

References

Introduction

Data backup is important as TEAMMATES contains a large amount of user data. In the event of a disaster, we must be able to recover all lost user data. Data backup is also useful in the event that we decide to migrate the data to a new host.

In this report, we will:

  1. Describe how data is stored in TEAMMATES
  2. Examine the backup solutions provided by the GAE and their limitations
  3. Present the incremental backup approach implemented in TEAMMATES

Data storage in TEAMMATES

TEAMMATES is hosted on the Google App Engine (GAE). Thus, all user data is stored in the GAE datastore. The GAE datastore is a high replication storage system, which means data is stored and replicated on more than one data server.

Strengths of the GAE datastore

The high replication datastore provides a large amount of redundancy and fault tolerance. If a single data server suffers a power outage or an external attack, no data will be lost, as at least one other server contains the same data. Each replica thus serves as a form of backup for the user data.

Hence, the GAE datastore guarantees some form of disaster recovery.

Disasters that can affect the GAE datastore

While the GAE datastore is resilient against problems that target a single data server, it is still vulnerable to problems which affect the datastore globally, such as a software bug that corrupts data across all replicas, accidental deletion of data by an administrator, or an attack by hackers on the application itself.

These problems occur with low probability, but should one occur, it would cause large scale damage to TEAMMATES. Thus, it is important that we take measures to ensure we always have a backup from which to restore any lost data.

Available solutions by GAE

Datastore admin

The GAE provides an interface known as “Datastore admin” [3], which allows administrators to view and perform operations on any data stored within the datastore. This includes an option to perform data backup.

This method is simple, as the backup is performed on all entities automatically. The backup can also be run at regular intervals via a cron job [5], a GAE feature that allows an application to perform tasks on a schedule. This allows us to keep regular backups of all the data in the datastore.
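As an illustration, such a scheduled backup can be configured by pointing a cron entry at the datastore admin's backup handler, following the scheduled-backup recipe in the datastore admin documentation [3]. The sketch below is a minimal cron.xml; the kind list and the Cloud Storage bucket name (teammates-backup) are placeholders rather than TEAMMATES' actual configuration.

<?xml version="1.0" encoding="UTF-8"?>
<cronentries>
  <cron>
    <!-- Placeholder kinds and bucket; list every entity type that should be backed up -->
    <url>/_ah/datastore_admin/backup.create?name=DailyBackup&amp;kind=Account&amp;kind=Course&amp;filesystem=gs&amp;gs_bucket_name=teammates-backup</url>
    <description>Daily datastore admin backup</description>
    <schedule>every 24 hours</schedule>
    <target>ah-builtin-python-bundle</target>
  </cron>
</cronentries>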

Using the Datastore admin

  1. To use the datastore admin, we must access it from the application dashboard. The dashboard can be accessed from https://appengine.google.com/

        The datastore admin can be found under the data section (circled in red).

dashboard.png

  2. Enable the datastore admin if it is not enabled.
  3. Select the entity types to back up.

backup.png

  4. Click on the “Backup Entities” button.
  5. After the backup is completed, the data can be restored by selecting the backup and clicking “Restore”, which is also located on the datastore admin screen.

restore.png

Limitations of the datastore admin

While the datastore admin is easy to use, it also has many limitations:

The datastore admin only offers an internal backup, which means the backup data can only be transferred within the app engine. The backup can only be used to restore data to an application, or to migrate data to another application within the GAE. This is undesirable, as there is no option to export the data, so we are unable to migrate it to any host other than the GAE.

Also, the backup can only be performed on whole entity types. We do not have the option to selectively back up individual entities, so every entity of a selected type is placed in the backup. This is inefficient, as a previous backup might already contain an unchanged entity. A better solution would be to check which entities have been modified since the last backup and back up only those. However, since we are unable to back up individual entities, we incur extra costs downloading all the data.

Lastly, as the backup data is still stored on the GAE and cannot be exported for personal storage, it remains vulnerable to attacks by hackers, who could wipe out all our backup data.

Bulkloader

The bulkloader is a python script provided by the GAE python SDK. It allows us to generate a configuration file for downloading data from the GAE datastore. The bulkloader provides two features:

  1. Downloading data to a local machine
  2. Specifying configurations for formatting the downloaded data, such as the .csv format

The ability to download and store the data on a local machine is very important since, as seen in the previous section, backup data stored on the cloud is still vulnerable to attacks by hackers. Thus, a local machine can serve as a secondary location for a backup.

Also, the configuration file is able to format the downloaded data as a comma-separated values (.csv) file, which makes the data human readable. It also allows us to migrate the data out of the GAE to other platforms, as the .csv format is widely supported.

Using the bulkloader

  1. Download the GAE python SDK from: https://developers.google.com/appengine/downloads#Google_App_Engine_SDK_for_Python

  2. Open a command prompt with administrator privileges in the python SDK directory

  3. Generate the configuration file by running the command:

bulkloader.py --create_config --url <app-url>/remote_api --filename <filename>.yaml

<app-url> is the application URL, in this case https://teammatesv4.appspot.com/

<filename> is the name of the generated configuration file

A sample command looks like:

bulkloader.py --create_config --url https://teammatesv4.appspot.com/remote_api --filename generated_bulkloader.yaml

  4. The generated configuration file will contain options for each entity type found in the application. To specify the .csv format for the data, replace the connector settings as follows (a fuller sample stanza is sketched after these steps):

        connector: csv
        connector_options:
            encoding: utf-8
            columns: from_header

  5. The entity data can then be downloaded via the command:

bulkloader.py --download --url <app-url>/remote_api --config_file <generated filename> --kind <entity type> --filename <output file>.csv

<app-url> is the application URL, in this case https://teammatesv4.appspot.com/

<generated filename> is the name of the generated configuration file

<entity type> refers to the type of entity we wish to download

<output file> is the .csv file we wish to save to

A sample command looks like:

bulkloader.py --download --url https://teammatesv4.appspot.com/remote_api --config_file generated_bulkloader.yaml --kind Account --filename accounts.csv

  6. The data can be uploaded by replacing the --download flag with --upload
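For reference, a complete transformer stanza in the generated configuration file might look like the sketch below. The Account properties shown (email, name) are illustrative assumptions; the generated file will list the actual properties of each entity type.

transformers:
- kind: Account
  connector: csv
  connector_options:
    encoding: utf-8
    columns: from_header
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string
    - property: email
      external_name: email
    - property: name
      external_name: name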

Limitations of bulkloader

Although the bulkloader is able to specify many options for downloading data, it has several limitations:

As the bulkloader runs on a machine outside the GAE, any interruption in the connection will cause the transfer of the backup data to fail. As there is no automatic retry mechanism in place, the backup will be incomplete. In such an event, the backup must be run again, which wastes time and resources.

Also, even though we are able to specify options in the bulkloader, we cannot choose specific entities for download. Therefore, we have to download all the entities, as with the datastore admin.

Incremental Backup

The main problem we have seen in the previous solutions is that we waste resources performing a backup on every entity. As some entities might already have been downloaded in a previous backup, we can ignore an entity if no changes have been made to it since, because we already have the latest backup copy. In other words, we should only back up entities that were modified since the last backup. This is known as an incremental backup.

Hence, to perform an incremental backup, we should ideally:

  1. Create a modified timestamp for each entity
  2. Filter all entities with modified timestamp later than the previous backup
  3. Backup those modified entities
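If such a timestamp existed, step 2 would reduce to a simple datastore query. The following Java sketch illustrates the idea, assuming a hypothetical updatedAt property that TEAMMATES entities do not currently have:

import java.util.Date;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;

public class IncrementalQuerySketch {
    // Returns all Course entities modified after the previous backup.
    // "updatedAt" is a hypothetical property added for illustration only.
    public static Iterable<Entity> coursesModifiedSince(Date lastBackup) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Query query = new Query("Course").setFilter(new Query.FilterPredicate(
                "updatedAt", Query.FilterOperator.GREATER_THAN, lastBackup));
        return datastore.prepare(query).asIterable();
    }
}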

However, there are two problems with this solution:

  1. The solutions provided by GAE cannot filter specific entities based on a timestamp
  2. Adding a new timestamp field to all entities in TEAMMATES requires a large change to the database-level code. We would also have to update all entities currently stored in TEAMMATES with the new timestamp field.

Implementing an incremental backup

Based on the problems highlighted, we combine various methods as an alternative way to perform an incremental backup in TEAMMATES. The current approach consists of the following steps:

  1. Retrieving recently modified courses via logs

To avoid adding a timestamp field, we create unique log messages for the creation, modification, or deletion of every entity in TEAMMATES. We can retrieve these logs via the log api provided by the GAE [1]. A sketch of how such log messages can be scanned is shown at the end of this step.

Based on these logs, we can determine which courses were modified recently. This allows us to back up data incrementally, as only the most recently updated data is considered for the backup.
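The sketch below shows how the marker messages could be scanned with the logs api [1]. The marker format (TEAMMATES_BACKUP::<courseId>) is a made-up example; the actual messages used in TEAMMATES differ.

import java.util.HashSet;
import java.util.Set;

import com.google.appengine.api.log.AppLogLine;
import com.google.appengine.api.log.LogQuery;
import com.google.appengine.api.log.LogServiceFactory;
import com.google.appengine.api.log.RequestLogs;

public class ModifiedCourseFinder {
    // Hypothetical marker logged whenever an entity belonging to a course changes,
    // e.g. log.info(MARKER + courseId) in the create/update/delete code paths.
    private static final String MARKER = "TEAMMATES_BACKUP::";

    // Scans application logs written since the last backup and collects course ids.
    public static Set<String> coursesModifiedSince(long lastBackupTimeMillis) {
        Set<String> courseIds = new HashSet<String>();
        LogQuery query = LogQuery.Builder.withDefaults()
                .startTimeMillis(lastBackupTimeMillis)
                .includeAppLogs(true);
        for (RequestLogs request : LogServiceFactory.getLogService().fetch(query)) {
            for (AppLogLine line : request.getAppLogLines()) {
                String message = line.getLogMessage();
                int index = message.indexOf(MARKER);
                if (index >= 0) {
                    courseIds.add(message.substring(index + MARKER.length()).trim());
                }
            }
        }
        return courseIds;
    }
}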

  2. Downloading all data within a course using remote api

Using the course IDs extracted above, we can then download all entities within each course. These include all accounts of students and instructors, evaluations, feedback sessions and responses, comments, and submissions.

The data must be downloaded to a local machine, so we make use of the remote api provided by the GAE [2] to connect directly from the local machine to the app engine datastore. Individual entities can then be retrieved via their respective logic or database classes.

These entities are saved as .json files: the data for each course is stored in an individual .json file, and all of these files are saved into a folder timestamped with the current time. This allows us to identify the latest data if a restore is required. A sketch of the download step is shown below.
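The following Java sketch illustrates the download step using the remote api [2] and the low-level datastore api. The Student kind and courseID property are assumptions for illustration; TEAMMATES actually retrieves entities through its logic and database classes, and the credential setup of RemoteApiOptions varies with the SDK version.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.tools.remoteapi.RemoteApiInstaller;
import com.google.appengine.tools.remoteapi.RemoteApiOptions;

public class CourseDownloadSketch {
    public static void main(String[] args) throws Exception {
        // Connect the local JVM to the live datastore via the remote api
        RemoteApiOptions options = new RemoteApiOptions()
                .server("teammatesv4.appspot.com", 443)
                .useApplicationDefaultCredential();
        RemoteApiInstaller installer = new RemoteApiInstaller();
        installer.install(options);
        try {
            DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
            // Fetch every Student entity in one modified course (hypothetical kind/property names)
            Query query = new Query("Student").setFilter(new Query.FilterPredicate(
                    "courseID", Query.FilterOperator.EQUAL, args[0]));
            for (Entity student : datastore.prepare(query).asIterable()) {
                System.out.println(student); // the real backup serialises this to a .json file
            }
        } finally {
            installer.uninstall();
        }
    }
}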

  3. Automatically scheduling downloads using scheduled tasks

To automate the backing up of data, we use scheduled tasks provided by Windows [4]. We export the backup source files in the form of a .jar file, which allows us to run the backup as though it were an executable file. We can then specify a scheduled task to execute the .jar file at regular intervals.

The .jar file can be created with the following steps:

  1. In Eclipse, go to File -> Export -> Select runnable JAR file
  2. Under launch configuration, select OfflineBackup
  3. Choose a destination for the .jar file
  4. Under library handling, select “Copy required libraries into a sub-folder next to the generated JAR”
  5. Move the .jar files into the root folder of TEAMMATES

After consideration, we have decided to schedule the backup to run every 24 hours. This interval works best as it allows us to capture a reasonable amount of data. Since the data within a course might change more than once a day, a short interval would cause us to back up the same course many times unnecessarily, while a longer interval would risk losing a lot of data in the case of a disaster. Therefore, daily backups work best. A sample command for registering such a task is shown below.
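For example, the task could be registered from an elevated command prompt as follows; the task name, start time, and .jar path are placeholders:

schtasks /Create /SC DAILY /ST 02:00 /TN "TeammatesOfflineBackup" /TR "java -jar C:\teammates\OfflineBackup.jar"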

  4. Uploading backup data

To upload the backup data back to the app engine datastore, we use a similar approach to downloading. Through the remote api, we connect to the app engine datastore and parse the .json backup files back into entities. The respective logic or database classes can then be used to persist these entities to the app engine datastore. A sketch of this restore step is shown below.
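A minimal sketch of the restore direction, using the Gson library and the low-level datastore put() for illustration; the record shape is a hypothetical stand-in for the actual backup format, and it assumes the remote api has been installed as in the download sketch.

import java.nio.file.Files;
import java.nio.file.Paths;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.gson.Gson;

public class CourseUploadSketch {
    // Hypothetical shape of one record in a backup .json file
    static class StudentRecord {
        String email;
        String name;
        String courseID;
    }

    public static void restoreStudents(String jsonFile) throws Exception {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        String json = new String(Files.readAllBytes(Paths.get(jsonFile)), "UTF-8");
        for (StudentRecord record : new Gson().fromJson(json, StudentRecord[].class)) {
            Entity entity = new Entity("Student");
            entity.setProperty("email", record.email);
            entity.setProperty("name", record.name);
            entity.setProperty("courseID", record.courseID);
            datastore.put(entity); // persists (or overwrites) the entity in the datastore
        }
    }
}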

All of the approaches discussed above have already been implemented in TEAMMATES. Scripts have also been written to facilitate adding the scheduled tasks and running the backup .jar files, which allows the automated incremental backup to be set up easily on a new machine.

Future work

Resources

http://blog.notdot.net/2010/04/Using-the-new-bulkloader

http://learntogoogleit.com/post/59602939088/logging-api-example

References

[1] Logs Java API Overview

https://cloud.google.com/appengine/docs/java/logs/

[2] Remote API for Java

https://cloud.google.com/appengine/docs/java/tools/remoteapi

[3] Backup/Restore, Copy, and Delete Data

https://developers.google.com/appengine/docs/adminconsole/datastoreadmin#Backup_And_Restore

[4] Windows Scheduled Tasks

http://technet.microsoft.com/en-us/library/cc772785(v=ws.10).aspx

[5] Scheduled Tasks with Cron Job for Java

https://cloud.google.com/appengine/docs/java/config/cron

--- end of report ---