Dimagi devops needs

Overview

Dimagi is looking for a platform engineer (http://www.dimagi.com/about/careers/platform-engineer/) or consulting firm to help expand our flagship services to test, stress, and scale our operations worldwide. Dimagi’s user base is rapidly increasing in both volumes of users and data as well as use cases that push the capabilities of our solutions. We are looking to evaluate our current infrastructure and practices, and then design upgrades to a variety of our environments.

Dimagi devops needs

This is a high-level overview of Dimagi's devops requirements. The list is roughly prioritized by how important each problem is for Dimagi.

  1. Server setup and design - cleaning up our existing salt scripts, or setting up a new set of scripts that enable us to easily manage all of our servers and easily create and configure new servers. Improvement/rearchitecture of server roles.
  2. Deployment - helping sort through our deployment process, and redesigning and streamlining anything we are doing poorly
  3. Continuous integration and delivery - improve and redesign our existing continuous integration setup and release management processes
  4. Metrics, monitoring, and logging - better visibility into our logs and understanding detail-level metrics about how our system is being used

Timeline

We are looking to engage/hire immediately. We operate production geos in the US and are standing up a new production geo in India in July. Ideally, we would block the new India geo on the improved server setup and design, and use it as the initial test bed before upgrading the US geo.

Server setup and design

Current state of servers

(For the purposes of this document we will ignore any legacy servers Dimagi maintains and focus on the CommCare HQ instances.)

There are three instances of our cloud product (CommCare HQ) that we maintain. We host multiple instances because of data sovereignty policies or regulations from our partners or governments that require health data to be stored within a country's borders.

These instances are summarized below:

Country | Hosting Provider | Setup | OS
US / Production | Rackspace | Multiple managed VMs (see below) | Ubuntu 12.04
India (Current) | Reliance | Single physical machine | RHEL (version?)
India (New Production) | IBM | Multiple bare metal machines | Ubuntu 12.04
Zambia | World Vision Country Office | Single physical machine | Ubuntu 12.04
Staging | Rackspace | Multiple managed VMs, similar to prod but with fewer machines | Ubuntu 12.04
Preview | Rackspace | Shares machines with staging | Ubuntu 12.04

Our production setup in the US (and soon India) is the most interesting, since it contains many different machines and roles. These are summarized in the following graphic.

The following table describes the functional role of each machine in more detail.

Machine | Apps / Roles | Count
Proxy | Apache (load balancer) | 1
Web worker | Python, django, gunicorn | 3
Task Server | Python, django, celery | 1
Form engine | Java, Jython | 1
DB Machine | Postgres, Redis, RabbitMQ, memcached, change listeners (python, django) | 1
Elastic DB Machine | ElasticSearch | 2
Cloudant node | Cloudant | 1

In the non-production environments we run everything as a singleton on a single machine.

Current server setup and management

We have a set of salt-based tools for provisioning new servers by role. These can be found in our github repository (requires access) or as an attachment to this document. They work well for the steady-state operations we perform regularly (see below), but are not yet fully reliable for provisioning new servers.
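
To illustrate the kind of role-based provisioning we want to make reliable, a one-click setup might drive salt from Python along these lines. This is a minimal sketch, not our actual tooling: the "roles" grain and the role names are hypothetical.

    # Minimal sketch of role-based provisioning via Salt's Python client.
    # Assumes minions advertise a hypothetical "roles" grain; this is an
    # illustration of the desired shape, not Dimagi's existing scripts.
    import salt.client

    def apply_role(role):
        client = salt.client.LocalClient()
        # Apply the configured states to every minion tagged with `role`.
        return client.cmd('roles:%s' % role, 'state.highstate',
                          expr_form='grain')

    if __name__ == '__main__':
        results = apply_role('webworker')
        for minion in sorted(results):
            print('%s: highstate applied' % minion)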

Steady-state operations

The following are frequently used steady-state operations across servers that must be able to work across clouds.

Potential future state

In our future state we know we will have two multiple-server cloud environments - one in the US and one in India (this is in the near term). Over the medium term we may have up to 5 more regional environments which may be multi-server or single-server.

Future server setup and management

Here is where we get to our desired view of the world. This can be summarized as a few sets of rules.

  1. Better separation of machines and roles
  2. Everything is automated and configurable
  3. It is trivial to add capacity or machines at any point in the stack

Right now our scripts are really good at the things we do frequently - setting up web workers, configuring users, and updating settings. However, we want a complete separation of roles and machines. The ideal set of roles might look something like this:

Role | Apps
Proxy | Apache, load balancer
Web worker | Python, django, gunicorn
Task server | Python, django, celery
Form engine | Java, Jython
Postgres server | Postgres
Elastic server | ElasticSearch
Redis server | Redis
Caching server | Memcached
Rabbit server | RabbitMQ
Change listener server | Python, django
Cloudant / couch node | Cloudant / CouchDB

Note that this is not far off from our current state (it mainly involves splitting up the DB machine), but the hard part is being able to specify an arbitrary number of roles, map them onto physical servers in an arbitrary way, and have single-click installation and setup.
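
To make that goal concrete, the kind of mapping we want to be able to express might look like the following. This is a hypothetical sketch only; the hostnames and environment names are placeholders, not an existing configuration file.

    # Hypothetical role-to-host mapping: the same roles spread across many
    # machines in a large geo and collapsed onto one box in a small one.
    ENVIRONMENTS = {
        'us-production': {
            'proxy': ['proxy0'],
            'webworker': ['web0', 'web1', 'web2'],
            'taskserver': ['celery0'],
            'formengine': ['forms0'],
            'postgres': ['pg0'],
            'elastic': ['es0', 'es1'],
            'redis': ['redis0'],
            'cache': ['memcached0'],
            'rabbit': ['rabbit0'],
            'changelistener': ['changes0'],
            'couch': ['couch0'],
        },
        'small-regional-geo': {
            # Every role pinned to a single machine.
            role: ['singleserver0']
            for role in ('proxy', 'webworker', 'taskserver', 'formengine',
                         'postgres', 'elastic', 'redis', 'cache', 'rabbit',
                         'changelistener', 'couch')
        },
    }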

We are also very interested in adding redundancy where we currently don’t have any - particularly at the proxy and Postgres layers which are still single points of failure.

Deployment

We use fabric to deploy, and everything is going pretty smoothly with our current fabfile. However, there are a few things we could streamline.
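
For context, the overall shape of a fabric-driven deploy is roughly the following. This is a minimal sketch, not our actual fabfile; the hosts, paths, user, and task bodies are placeholders.

    # Minimal sketch of a fabric 1.x fabfile with an environment selector
    # and a deploy task; hosts, paths, and commands are placeholders, not
    # Dimagi's real configuration.
    from fabric.api import env, task, cd, sudo

    @task
    def production():
        # Select the production hosts and code root.
        env.hosts = ['web0.example.com', 'web1.example.com', 'web2.example.com']
        env.code_root = '/home/webuser/commcare-hq'

    @task
    def deploy():
        with cd(env.code_root):
            sudo('git pull', user='webuser')
            sudo('python manage.py migrate --noinput', user='webuser')
            sudo('supervisorctl restart all')

Invoked as "fab production deploy", this mirrors the commands described in the walkthrough below.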

From the deployer’s perspective

From the deployer’s perspective, the process goes something like this.

  1. Enter my personal dev environment
       1. cd to the commcare-hq directory (the code root)
       2. activate the commcarehq virtualenv
  2. Run the preindex command: fab production preindex_views
       1. Runs for about 5-10 minutes
  3. Wait for an email to come in saying that preindexing is complete. This is usually immediate, but if a couchdb view has changed, it can take up to 10 hours for cloudant to catch the new view up before we deploy.
  4. Run the deploy command: fab production deploy
       1. If there is a postgres migration, this can hang for 20 minutes or so in a way that affects the availability of some of our services, but it does not generally cause across-the-board downtime. This is a fairly new issue, as we've only recently begun to push more of our data into postgres.

In the best case, when there aren’t any migrations or view updates, the whole thing only takes about 15 minutes.

From the technical side

The main things that happen during a preindex:

The main things that happen during deploy:

(this section is not yet completed)