[TM Technical Report] 
Improving Scalability and Performance

 [Authors: James Dam Tuan Long]

Introduction

Bottleneck: Inefficient reads/writes

Bottleneck: Overhead of managing entity relationships

Bottleneck: Iterating all entities

Bottleneck: Too many writes in a single request

Future work

References

Introduction

This report describes a preliminary study that identified potential scalability bottlenecks in TEAMMATES and some attempts to remove them.

Bottleneck: Inefficient reads/writes

Currently, objects are read using GQL queries. Here is an example:

 

 

The above function queries the datastore for the coordinator with a particular googleID. The query result is a list, but the function returns only the first element of that list.

 

There is nothing wrong with that function in a normal application using SQL. However, the GAE datastore is built on Bigtable, a proprietary NoSQL backend technology, which offers a much more efficient way to read an object: direct lookup [1]. Direct lookup uses the key of the object (in the above case, the googleID of the coordinator) to identify it. Direct lookup can be 4-5 times faster than a query and consume half the resources. For frequently used entities, memcache can boost performance further (up to 20 times faster than direct lookup) and save a lot of resources [TODO: cite source].
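The read order described above (cache first, then a direct lookup by key) can be sketched in plain Java. This is only an illustration under stated assumptions: the two maps stand in for memcache and the datastore, and the names CoordinatorReader, getCoordinator and datastoreReads are hypothetical, not part of TEAMMATES or the GAE SDK.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in only: plain maps play the roles of memcache and the
// datastore. None of these names come from TEAMMATES or the GAE SDK.
class CoordinatorReader {
    private final Map<String, String> datastore = new HashMap<>(); // googleID -> coordinator name
    private final Map<String, String> cache = new HashMap<>();     // plays the role of memcache
    int datastoreReads = 0; // counts how often the slower store is hit

    void save(String googleId, String name) {
        datastore.put(googleId, name);
        cache.remove(googleId); // drop any stale cached copy
    }

    // Read-through lookup: try the cache, then fall back to a lookup by key.
    String getCoordinator(String googleId) {
        String cached = cache.get(googleId);
        if (cached != null) {
            return cached;
        }
        datastoreReads++;
        String stored = datastore.get(googleId); // keyed lookup, no query or scan
        if (stored != null) {
            cache.put(googleId, stored);
        }
        return stored;
    }
}
```

In this sketch, repeated reads of the same googleID reach the slower store only once; every later read is served from the cache until the entity is saved again.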

 

Writes to the datastore can also be improved. By default, GAE builds an index for every field except text fields and blob fields. However, only fields that are used in queries need to be indexed.

 

Proposed solution 1: 

Rewrite the current datastore code to use direct lookup and memcache.

Pros: Increases performance and saves space.

Cons: Requires rewriting the current datastore code and adding extra code.

Decision: Put on hold, as the change is drastic and the need for performance improvement is not pressing at the moment.

 

Proposed solution 2: 

Disable unused built-in indexes.

Pros: Reduces write operations and the space needed to store indexes, and increases write speed.

Cons: Removing the wrong indexes could result in serious problems.

Decision: Solution adopted. Some unused indexes were identified and removed (see Issue 346). Removing unused indexes does reduce storage space: for a data set of 50,000 users, the space used for indexing dropped from 547 MB to 335 MB. Datastore read and write usage fell by 10-20%.

Bottleneck: Overhead of managing entity relationships

Currently, objects are updated separately and relationships are maintained at the code level (not the database level), using string references as entity IDs. For example, the Student entity has a String field called courseId that identifies a Course entity, as opposed to a field of type Course. Keeping the ID of one object as a field in another is a common way to maintain a relationship between two objects, especially for 1-to-many relationships. However, it is not very efficient for “1-to-a-few” and “parent-child” relationships (such as instructor-course, student-course or course-evaluation), where a parent object is unlikely to own many child objects and the ownership is not supposed to change.

Proposed solution 1:

Do not use Bigtable (datastore). Google now offers Google Cloud SQL, a SQL database that can be used in GAE. For an application that is not very data-intensive, such as TEAMMATES, some web developers suggest using SQL because of its simplicity.

Pros: Simplifies database design while still using Google infrastructure.

Cons: Same scalability problem as a traditional web application, because of the nature of SQL.

Decision: Changing to a new database is a drastic measure that is not needed at this point.

Proposed solution 2: 

GAE provides better ways to handle this type of relationship:

  1. Entity groups: Entity groups are a good technique for protecting the atomicity of data when used with transactions. A datastore designed with entity groups also performs better, because a query can be restricted to a subset of the datastore (an ancestor query). Accessing the ancestors of an entity is also easy and fast, and needs no extra field in the datastore.
  2. List fields: List fields are another powerful tool. For example, if we store the Students of a Course as a list field of that Course, we always know how many Students the Course has and can access them without an extra query.

Pros: Saves resources, increases speed, and manages object relationships better.

Cons: Requires deep knowledge of the GAE datastore and a lot of extra code.

Decision: To be considered in the future.

Proposed solution 3:

Use the Objectify framework.

Objectify is a very popular framework for GAE. According to its documentation,

“Objectify is a Java data access API specifically designed for the Google App Engine datastore. It occupies a "middle ground"; easier to use and more transparent than JDO or JPA, but significantly more convenient than the Low-Level API. Objectify is designed to make novices immediately productive yet also expose the full power of the GAE datastore.”

Pros: Objectify exposes all native datastore features, including batch operations, queries, transactions, asynchronous operations, and partial indexes. Its syntax is intuitive and easy to understand.

Cons: Objectify is not easy to master and adds to the learning curve for developers.

Decision: Put on hold.

Experiments: After redesigning part of the app (classes converted: Account, Instructor) to use Objectify (version 4.0a), we compared its performance against the original app. The results were not encouraging. In some cases, Objectify was found to be even slower than the original.

Some measurements (averaged over 100 requests) are given below:

 

Operation             Using Objectify (ms)    Not using Objectify (ms)
Delete coordinator    874                     603
Delete course         924                     622
Create coordinator    801                     634
Create course         803                     608
Get coordinator       799                     620
Get course            969                     565

Bottleneck: Iterating all entities

Some functions in the current TEAMMATES are not very scalable. One such kind of function queries all entities of a kind from the datastore and then processes the result in memory. This is never a good solution, even for a non-web application, since all machines have limited memory (the instance type TEAMMATES currently uses has only 128 MB). Imagine that we have 100,000 students in the system: one call to such a function will crash the system because of insufficient memory.

Even if the system does not crash, writing such functions is neither a good idea nor good practice, because of their memory consumption and slow speed.

Proposed solution 1:

Use query cursors. A query cursor breaks the query result into smaller chunks that can be processed one at a time. Query cursors and task queues can be combined to run long processing tasks in the background, overcoming the 60-second limit for a request.

Pros: Intuitive; similar to cursors in SQL.

Cons: Requires multiple queries, which means more resources and waiting time.
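The chunked processing that a query cursor enables can be sketched in plain Java. This is a minimal illustration under stated assumptions: an in-memory list stands in for the datastore, an integer offset stands in for the opaque cursor, and the names CursorPaging, fetchPage and processAll are hypothetical rather than GAE APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only: an in-memory list plays the role of the
// datastore, and an integer offset plays the role of an opaque query cursor.
class CursorPaging {
    // One page of results plus the cursor for the next page (-1 when done).
    static class Page {
        final List<String> items;
        final int nextCursor;

        Page(List<String> items, int nextCursor) {
            this.items = items;
            this.nextCursor = nextCursor;
        }
    }

    // Fetch at most `limit` items starting at `cursor` (limit must be > 0).
    static Page fetchPage(List<String> all, int cursor, int limit) {
        int end = Math.min(cursor + limit, all.size());
        List<String> items = new ArrayList<>(all.subList(cursor, end));
        return new Page(items, end < all.size() ? end : -1);
    }

    // Process every item while holding only one page in memory at a time.
    // Returns the number of pages (that is, queries) that were needed.
    static int processAll(List<String> all, int limit) {
        int pages = 0;
        int cursor = 0;
        while (cursor != -1) {
            Page page = fetchPage(all, cursor, limit);
            for (String item : page.items) {
                item.length(); // placeholder for real per-entity work
            }
            pages++;
            cursor = page.nextCursor;
        }
        return pages;
    }
}
```

The loop makes the trade-off in the pros and cons visible: memory use is bounded by the page size, but the number of queries grows with the number of pages. In a background job, each iteration could instead be handed to a separate task along with the cursor value.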

Proposed solution 2: 

Design queries to reduce in-memory processing.

Decision: Revisit the queries in the current system and try to optimize them. Currently, there are methods such as getAllStudents that are not scalable but are used only by admin features.

Bottleneck: Too many writes in a single request

Every request to GAE must complete within 60 seconds. After 60 seconds, GAE kills an unfinished request by throwing a DeadlineExceededException. Even if GAE does not kill the request, a long-running request makes for an unpleasant user experience. For example, in the current implementation, creating an evaluation for a large class is extremely expensive because it also creates all the Submission entities required for the evaluation, costing up to thousands of read and write operations. The number of submissions in an evaluation depends on the size of the course. For a course of hundreds of students with a team size of about 3-5 students, an evaluation can have more than 1000 submissions. As a result, creating an evaluation risks exceeding the 60-second limit. Note: A quick test with V4.35 indicates that creating an evaluation for a class of 300 students takes around 30 seconds.
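A rough estimate shows how the submission count grows with course size. The sketch below assumes, for illustration only, that every student submits one entry for each member of their own team (including themselves) and that all teams are the same size; the class and method names are hypothetical.

```java
// Back-of-the-envelope estimate only. Assumes every student in a team submits
// one entry for each team member, including themselves, and that all teams
// have the same size. The class and method names are hypothetical.
class SubmissionEstimate {
    static int submissionsFor(int students, int teamSize) {
        int teams = students / teamSize;     // assumes students divide evenly into teams
        return teams * teamSize * teamSize;  // teamSize x teamSize entries per team
    }
}
```

Under these assumptions, a class of 300 students in teams of 4 has 75 teams of 16 entries each, or 1200 submissions, which is consistent with the "more than 1000" figure above.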

Proposed solution 1:

Persist objects in batches (i.e., persistAll method) instead of one at a time.

Decision: This is the current approach.
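The effect of batching can be sketched in plain Java. This is an illustration under stated assumptions: a list stands in for the datastore, each call to persistAll counts as one round trip, and the names BatchWriter and persistInBatches are hypothetical rather than the actual TEAMMATES code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in only: a list plays the role of the datastore and
// each persistAll call counts as one round trip to it.
class BatchWriter {
    int writeCalls = 0; // number of round trips to the simulated store
    final List<String> stored = new ArrayList<>();

    // Simulates one batched write call that saves several entities at once.
    void persistAll(List<String> batch) {
        writeCalls++;
        stored.addAll(batch);
    }

    // Splits the entities into batches of at most batchSize (must be > 0)
    // and writes each batch with a single call.
    void persistInBatches(List<String> entities, int batchSize) {
        for (int start = 0; start < entities.size(); start += batchSize) {
            int end = Math.min(start + batchSize, entities.size());
            persistAll(entities.subList(start, end));
        }
    }
}
```

With a batch size of 500, saving 1200 entities takes 3 write calls in this sketch instead of 1200 separate ones. The batch size here is arbitrary and chosen only for the example.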

Proposed solution 2:

Use entity groups and transactions.

Pros: Transactions ensure atomicity of data.

Cons: Requires redesigning the datastore. While a transaction is in progress, other requests are blocked from modifying the whole entity group.

Decision: To be considered in the future.

Proposed solution 3: 

Use asynchronous writes to the datastore.

Pros: Saves the waiting time of write operations. Can save multiple entities at the same time.

Cons: An advanced and newly introduced technique. Can cause more problems than it solves if not used correctly.

Decision: To be considered in the future.
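The general shape of asynchronous writes can be sketched with the standard java.util.concurrent API. This is only an illustration under stated assumptions: a concurrent queue stands in for the datastore, CompletableFuture stands in for whatever asynchronous handle the datastore API returns, and the names AsyncWrites, saveAsync and saveAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative stand-in only: a thread-safe queue plays the role of the
// datastore. None of these names come from TEAMMATES or the GAE SDK.
class AsyncWrites {
    final ConcurrentLinkedQueue<String> stored = new ConcurrentLinkedQueue<>();

    // Starts a write in the background and returns immediately.
    CompletableFuture<Void> saveAsync(String entity) {
        return CompletableFuture.runAsync(() -> stored.add(entity));
    }

    // Starts all writes first, then waits once for all of them to finish.
    void saveAll(List<String> entities) {
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String entity : entities) {
            pending.add(saveAsync(entity));
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }
}
```

The writes overlap instead of running one after another, but the caller still has to wait for every pending write before relying on the stored data. Skipping that wait is one way this technique can go wrong.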

Future work

References

[1] Google I/O 2012 talk: Optimizing Your Google App Engine App (video, slides)

---end of report---