Overview
Dimagi is looking for a platform engineer (http://www.dimagi.com/about/careers/platform-engineer/) or consulting firm to help test, stress, and scale our flagship services as we expand our operations worldwide. Dimagi’s user base is growing rapidly, both in volumes of users and data and in use cases that push the capabilities of our solutions. We are looking to evaluate our current infrastructure and practices, and then design upgrades to a variety of our environments.
Dimagi DevOps needs
This is a high-level overview of Dimagi’s DevOps requirements. The list is roughly prioritized by how large a problem each item is for Dimagi.
Timeline
We are looking to engage/hire immediately. We operate production geos in the US and are standing up a new production geo in India in July. Ideally, we would block the new India geo on the improved server setup and design, and use it as the initial test bed before upgrading the US geo.
(For the purposes of this document we will ignore any legacy servers being maintained by Dimagi and focus on the CommCare HQ instances being maintained)
We maintain three production instances of our cloud product (CommCare HQ), plus non-production environments. We host multiple instances because data sovereignty policies or regulations from our partners and governments require that health data be stored within a country’s borders.
These instances are summarized below:
Country | Hosting Provider | Setup | OS |
US / Production | Rackspace | Multiple managed VMs (see below) | Ubuntu 12.04 |
India (Current) | Reliance | Single physical machine | RHEL (version?) |
India (New Production) | IBM | Multiple bare metal | Ubuntu 12.04 |
Zambia | World Vision Country Office | Single physical machine | Ubuntu 12.04 |
Staging | Rackspace | Multiple managed VMs, similar to prod but with fewer machines | Ubuntu 12.04 |
Preview | Rackspace | Shares machines with staging | Ubuntu 12.04 |
Our production setup in the US (and soon India) is the most interesting, since it contains many different machines and roles. These are summarized in the following graphic.
The following table describes the functional role of each machine in more detail.
Machine | Apps / Roles | Count |
Proxy | Apache (load balancer) | 1 |
Web worker | Python, django, gunicorn | 3 |
Task Server | Python, django, celery | 1 |
Form engine | Java, Jython | 1 |
DB Machine | Postgres, Redis, RabbitMQ, memcached, change listeners (python, django) | 1 |
Elastic DB Machine | ElasticSearch | 2 |
Cloudant node | Cloudant | 1 |
In the non-production environments we run a single instance of every service on one machine.
We have a set of Salt-based tools for provisioning new servers by role. These can be found in our GitHub repository (requires access) or as an attachment to this document. They work well for the steady-state operations we perform regularly (see below), but are not yet complete for provisioning new servers.
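For context, the general shape of this tooling is sketched below. This is a minimal, hypothetical example of applying a role’s Salt states to a minion from the master using Salt’s Python client; it is not our actual scripts, and the state and host names are made up.

    # Hypothetical sketch only -- not Dimagi's actual provisioning scripts.
    # Applies the Salt states for a given role (e.g. a webworker.sls state file)
    # to one or more minions, run from the salt master.
    import salt.client

    def provision(target, role):
        client = salt.client.LocalClient()
        # state.sls applies the named state file to every minion matching `target`
        results = client.cmd(target, 'state.sls', [role], timeout=600)
        for minion, states in results.items():
            if not isinstance(states, dict):
                print('%s: state compilation failed: %s' % (minion, states))
                continue
            failed = [s for s in states.values() if not s.get('result')]
            print('%s: %d states applied, %d failed' % (minion, len(states), len(failed)))

    if __name__ == '__main__':
        provision('hq-web3.example.internal', 'webworker')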
The following are frequently used steady-state operations, run across servers, that must work across clouds.
In the near term, our future state will include two multi-server cloud environments: one in the US and one in India. Over the medium term we may add up to five more regional environments, which may be multi-server or single-server.
Here is where we get to our desired view of the world. This can be summarized as a few sets of rules.
Right now our scripts are really good at the things we do frequently: setting up web workers, configuring users, and updating settings. However, we want a complete separation of roles from machines. The ideal set of roles might look something like this:
Role | Apps |
Proxy | Apache, load balancer |
Web worker | Python, django, gunicorn |
Task server | Python, django, celery |
Form engine | Java, Jython |
Postgres server | Postgres |
Elastic server | ElasticSearch |
Redis server | Redis |
Caching server | Memcached |
Rabbit server | RabbitMQ |
Change listener server | Python, django |
Cloudant / couch node | Cloudant / CouchDB |
Note that this is not far off from our current state (it mainly involves splitting up the DB machine), but the hard part is being able to specify an arbitrary number of roles, map them to physical servers in an arbitrary way, and have single-click installation and setup.
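As an illustration only, the kind of role-to-machine mapping we have in mind could be as simple as a single data structure that both the provisioning and the deploy tooling read. The role and host names below are hypothetical.

    # Hypothetical role-to-host mapping -- names are illustrative, not real hosts.
    ROLES = {
        'proxy':          ['hq-proxy0'],
        'webworker':      ['hq-web0', 'hq-web1', 'hq-web2'],
        'taskserver':     ['hq-celery0'],
        'formengine':     ['hq-forms0'],
        'postgres':       ['hq-pg0'],
        'elastic':        ['hq-es0', 'hq-es1'],
        'redis':          ['hq-redis0'],
        'memcached':      ['hq-cache0'],
        'rabbitmq':       ['hq-rabbit0'],
        'changelistener': ['hq-changes0'],
        'couch':          ['hq-couch0'],
    }

    # A single-machine environment (e.g. Zambia) maps every role to the same host.
    SINGLE_MACHINE_ROLES = {role: ['commcarehq-local'] for role in ROLES}

Single-click installation would then amount to applying each role’s states to whatever hosts are listed for it, whether that is one machine or many.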
We are also very interested in adding redundancy where we currently have none, particularly at the proxy and Postgres layers, which are still single points of failure.
We use Fabric to deploy, and our current fabfile works fairly smoothly. However, there are a few things we can streamline.
From the deployer’s perspective, the process goes something like this.
In the best case, when there aren’t any migrations or view updates, the whole thing only takes about 15 minutes.
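To make the current flow concrete, a heavily simplified sketch in the spirit of our fabfile (not the real one; the paths, hosts, and service names are hypothetical) might look like this:

    # Simplified, hypothetical fabfile sketch -- not Dimagi's actual deploy code.
    from fabric.api import cd, env, execute, roles, run, sudo, task

    env.roledefs = {
        'webworker': ['hq-web0', 'hq-web1', 'hq-web2'],
        'taskserver': ['hq-celery0'],
    }

    @roles('webworker', 'taskserver')
    def update_code():
        # Pull the new code and update dependencies on every app server
        with cd('/home/cchq/commcare-hq'):
            run('git pull && git submodule update --init --recursive')
            run('pip install -r requirements.txt')

    @roles('webworker', 'taskserver')
    def restart_services():
        # Restart gunicorn / celery under supervisor (service names illustrative)
        sudo('supervisorctl restart all')

    @task
    def deploy():
        execute(update_code)
        execute(restart_services)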
The main things that happen during a preindex:
The main things that happen during deploy:
(this section is not yet completed)