[TM Technical Report]
Data Backup and Disaster Recovery
[Authors: Shawn Lee]
Strengths of the GAE datastore
Disasters that can affect the GAE datastore
Limitations of the datastore admin
Implementing an incremental backup
Retrieving recently modified courses via logs
Downloading all data within a course using remote api
Automatically scheduling downloads using scheduled tasks
Data backup is important as TEAMMATES contains a large amount of user data. In the event of a disaster, we must be able to recover all lost user data. Data backup is also useful in the event that we decide to migrate the data to a new host.
In this report, we will:
TEAMMATES is hosted on the Google App Engine (GAE). Thus, all user data is stored in the GAE datastore. The GAE datastore is a high replication storage system, which means data is stored and replicated on more than one data server.
The high replication datastore provides a large amount of redundancy and fault tolerance. If a single datastore suffers a power outage or an external attack, no data will be lost as there will be at least one other datastore containing the same data. Hence, each replica datastore serves as a form of backup for the user data.
The GAE datastore therefore guarantees some degree of disaster recovery by itself.
While the GAE datastore is resilient against problems that target a single datastore, it is still vulnerable to problems that affect the datastore globally. These problems include:
These problems occur with low probability, but if one does occur, it will cause large-scale damage to TEAMMATES. Thus, it is important that we take measures to ensure we always have a backup from which to restore any lost data.
The GAE provides an interface known as “Datastore admin”, which allows administrators to view and perform operations on any data stored within the datastore [3]. This includes an option to perform data backup.
This method is simple as the backup is performed on all entities automatically. The backup can also be scheduled to run at regular intervals via a cron job, a GAE feature that allows an application to perform tasks on a schedule [5]. This allows us to keep regular backups of all the data in the datastore.
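As a sketch, a cron.xml entry for such a scheduled backup could look like the following. The backup name, the kinds listed, and the queue name are illustrative, and the exact handler URL should be checked against the GAE documentation:

```xml
<?xml version="1.0" encoding="utf-8"?>
<cronentries>
  <cron>
    <!-- The datastore admin exposes a backup handler; kind names here are examples -->
    <url>/_ah/datastore_admin/backup.create?name=DailyBackup&amp;kind=Account&amp;kind=Course&amp;queue=backup-queue</url>
    <description>Daily datastore backup</description>
    <schedule>every 24 hours</schedule>
  </cron>
</cronentries>
```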
The datastore admin can be found under the data section (circled in red).
While the datastore admin is easy to use, it also has many limitations:
The datastore admin only offers an internal backup, which means the backup data can only be transferred within the app engine. Such a backup can only be used to restore data to an application, or to migrate data to another application within the GAE. This is undesirable as there is no option to export the data, so we would be unable to migrate it to any host other than the GAE.
Also, the backup can only be performed on entire kinds of entities. We do not have the option to selectively back up entities, so every entity will be placed in the backup. This is inefficient as a previous backup might already contain an unchanged entity. A better solution would be to check which entities have been modified since the last backup, and selectively back up only those. However, since we are unable to back up individual entities, we incur extra cost by downloading all the data every time.
Lastly, as the backup data is still stored on the GAE and cannot be exported for storage elsewhere, it remains vulnerable to attackers who could wipe out all our backup data.
The bulkloader is a Python script provided by the GAE Python SDK. It allows us to generate a configuration file to download data from the GAE datastore. The bulkloader provides two features:
The ability to download and store the data on a local machine is important because, as seen in the previous section, backup data stored on the cloud is still vulnerable to attacks by hackers. A local machine can thus serve as a secondary location for a backup.
Also, the configuration file is able to format the downloaded data as a comma-separated values (.csv) file, which makes the data human-readable. It also allows us to migrate the data out of the GAE to other platforms, as the .csv format is widely supported.
bulkloader.py --create_config --url <app-url>/remote_api --filename <filename>.yaml
<app-url> is the application URL, in this case is https://teammatesv4.appspot.com/
<filename> is the name of the generated configuration file
A sample command will look like:
bulkloader.py --create_config --url https://teammatesv4.appspot.com/remote_api --filename generated_bulkloader.yaml
connector: csv
connector_options:
encoding: utf-8
columns: from_header
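For context, the generated configuration file contains one transformer entry per entity kind, with the connector options above nested inside it. A trimmed entry might look like the following; the Account kind and the email property are illustrative, and the real generated file lists every property of the kind:

```yaml
transformers:
- kind: Account
  connector: csv
  connector_options:
    encoding: utf-8
    columns: from_header
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string
    - property: email
      external_name: email
```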
bulkloader.py --download --url <app-url>/remote_api --config_file <generated filename> --kind <entity type> --filename <output file>.csv
<app-url> is the application URL, in this case is https://teammatesv4.appspot.com/
<generated filename> is the name of the generated configuration file
<entity type> refers to the type of entity which we wish to download
<output file> is the csv file we wish to save to
A sample command will look like:
bulkloader.py --download --url https://teammatesv4.appspot.com/remote_api --config_file generated_bulkloader.yaml --kind Account --filename accounts.csv
Although the bulkloader is able to specify many options for downloading data, it has several limitations:
As the bulkloader is run remotely from the GAE, any interruption in the connection will cause the transfer of backup data to fail. As there is no automatic retry mechanism in place, the backup will be left incomplete. In such an event, the backup must be run again, which wastes time and resources.
Also, even though the bulkloader accepts many options, we cannot choose specific entities for download. We therefore have to download all entities of a kind, the same limitation noted for the datastore admin.
The main problem seen in the previous solutions is that we waste resources performing a backup of every entity. If an entity was already downloaded in a previous backup and has not changed since, we can skip it, as we already hold the latest copy. In other words, we should only back up entities that were modified since the last backup. This is known as an incremental backup.
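The selection principle can be sketched in Java as follows. This is a minimal sketch assuming a last-modified timestamp were available for each entity; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch of incremental selection: back up only entities modified
// since the last backup. Timestamps are epoch milliseconds.
public class IncrementalFilter {

    public static List<String> entitiesToBackup(Map<String, Long> lastModified, long lastBackupTime) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastModified.entrySet()) {
            if (e.getValue() > lastBackupTime) {
                result.add(e.getKey()); // modified after the last backup: include it
            }
        }
        return result;
    }
}
```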
Hence, to perform an incremental backup, we should ideally:
However, there are 2 problems with this solution:
Based on the problems highlighted, we will combine various methods as an alternative to perform an incremental backup in TEAMMATES. The following flowchart shows the current approach towards incremental backup:
To avoid adding a timestamp field, we will create unique log messages for the creation, modification, or deletion of all entities in TEAMMATES. We can retrieve the logs via the log api provided by the GAE [1]. The following figure shows the message logged after modification of an entity.
Based on these logs, we can determine which courses were modified recently. This allows us to back up incrementally, as only the most recently updated data will be considered for the backup.
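As an illustration, extracting course ids from retrieved log messages might look like the following sketch. The `Modified Course::<id>` message format and the class name are assumptions for illustration, not the exact strings TEAMMATES logs:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract the set of recently modified course ids from retrieved
// log lines. The message format is an assumption; the pattern must match
// whatever unique log messages TEAMMATES actually writes.
public class ModifiedCourseExtractor {

    private static final Pattern COURSE_LOG =
            Pattern.compile("(?:Created|Modified|Deleted) Course::([\\w.-]+)");

    public static Set<String> extractCourseIds(Iterable<String> logMessages) {
        Set<String> courseIds = new LinkedHashSet<>(); // first-seen order, no duplicates
        for (String message : logMessages) {
            Matcher m = COURSE_LOG.matcher(message);
            while (m.find()) {
                courseIds.add(m.group(1));
            }
        }
        return courseIds;
    }
}
```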
Using the above courses that were extracted, we can then download all entities within the course. These include all accounts of students and instructors, evaluations, feedback sessions and responses, comments and submissions.
The data must be downloaded to a machine outside the GAE, so we will make use of the remote api provided by the GAE [2] to connect directly from the local machine to the app engine datastore. Individual entities can then be retrieved via their respective logic or database classes.
These entities will be saved into .json files. Each course's data will be stored in an individual .json file, and all of these .json files will be saved into a folder timestamped with the current time. This allows us to identify the latest data if a restore is required.
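A minimal sketch of this saving step is shown below. The timestamp pattern, folder layout, and class name are illustrative; the JSON content itself would come from serializing the downloaded entities:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: save each course's data into its own .json file inside a folder
// named with the current timestamp, so the newest backup is easy to identify.
public class BackupWriter {

    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss");

    // Creates <backupRoot>/<timestamp>/ and returns its path.
    public static Path createBackupFolder(Path backupRoot, LocalDateTime now) throws IOException {
        Path folder = backupRoot.resolve(now.format(STAMP));
        Files.createDirectories(folder);
        return folder;
    }

    // Writes one course's serialized data to <folder>/<courseId>.json.
    public static Path writeCourseFile(Path folder, String courseId, String json) throws IOException {
        Path file = folder.resolve(courseId + ".json");
        Files.writeString(file, json);
        return file;
    }
}
```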
To automate the backup, we will use scheduled tasks provided by Windows [4]. We can export the backup source files as a .jar file, which allows us to run the backup as though it were an executable. We can then set up a scheduled task to execute the .jar file at regular intervals.
The .jar file can be created with the following steps:
After consideration, we have decided to schedule the backup to run every 24 hours. This interval works best as it captures a reasonable amount of data. Since data within a course might change more than once a day, a shorter interval would cause us to back up the same course many times unnecessarily. However, a longer interval would cause more data to be lost in the event of a disaster. Therefore, daily backups work best.
To upload the backup data back to the app engine datastore, we will use a similar approach to the one used for downloading. Through the remote api, we can connect to the app engine datastore and parse the .json backup files back into entities. The respective logic or database classes can then be used to persist these entities to the app engine datastore.
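Locating the latest backup before a restore can be sketched as follows. This relies on the timestamped folder names from the saving step sorting chronologically; the persist step itself (via the logic or database classes over the remote api) is omitted, and the class name is illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

// Sketch of the restore side: locate the most recent timestamped backup
// folder. Parsing each course .json inside it and persisting the entities
// would follow.
public class BackupRestorer {

    // yyyy-MM-dd-HH-mm-ss folder names sort lexicographically by time,
    // so the "largest" name is the latest backup.
    public static Optional<Path> latestBackupFolder(Path backupRoot) throws IOException {
        try (Stream<Path> folders = Files.list(backupRoot)) {
            return folders.filter(Files::isDirectory)
                          .max(Comparator.comparing(p -> p.getFileName().toString()));
        }
    }
}
```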
All the approaches discussed above have already been implemented in TEAMMATES. Scripts have also been written to facilitate adding the scheduled tasks and running the backup .jar files, which allows the automated incremental backup to be set up easily on a new machine.
Future work
References
[1] Logs Java API Overview
https://cloud.google.com/appengine/docs/java/logs/
[2] Remote API for Java
https://cloud.google.com/appengine/docs/java/tools/remoteapi
[3] Backup/Restore, Copy, and Delete Data
https://developers.google.com/appengine/docs/adminconsole/datastoreadmin#Backup_And_Restore
[4] Windows Scheduled Tasks
http://technet.microsoft.com/en-us/library/cc772785(v=ws.10).aspx
[5] Scheduled Tasks with Cron Job for Java
https://cloud.google.com/appengine/docs/java/config/cron
[6] Using the new bulkloader
http://blog.notdot.net/2010/04/Using-the-new-bulkloader
[7] Logging API example
http://learntogoogleit.com/post/59602939088/logging-api-example
--- end of report ---