Dimagi devops needs

Overview

Dimagi is looking for a platform engineer (http://www.dimagi.com/about/careers/platform-engineer/) or consulting firm to help expand our flagship services to test, stress, and scale our operations worldwide. Dimagi’s user base is rapidly increasing in both volumes of users and data as well as use cases that push the capabilities of our solutions. We are looking to evaluate our current infrastructure and practices, and then design upgrades to a variety of our environments.

Dimagi devops needs

This is a high-level overview of Dimagi's devops requirements. The list is roughly prioritized by how important each problem is for Dimagi.

  1. Server setup and design - cleaning up our existing salt scripts, or setting up a new set of scripts that enable us to easily manage all of our servers and easily create and configure new servers. Improvement/rearchitecture of server roles.
  2. Deployment - helping sort through our deployment process, and redesigning and streamlining anything we are doing poorly
  3. Continuous integration and delivery - improve and redesign our existing continuous integration setup and release management processes
  4. Metrics, monitoring, and logging - better visibility into our logs and understanding detail-level metrics about how our system is being used

Timeline

We are looking to engage/hire immediately. We operate production geos in the US and are standing up a new production geo in India in July. Ideally, we would block the new India geo on the improved server setup and design, and use it as the initial test bed before upgrading the US geo.

Server setup and design

Current state of servers

(For the purposes of this document we will ignore any legacy servers Dimagi maintains and focus on the CommCare HQ instances.)

There are three instances of our cloud product (CommCare HQ) that we maintain. We host multiple instances because of data sovereignty policies or regulations from our partners or governments that require health data to be stored within a country's borders.

These instances are summarized below:

Country | Hosting Provider | Setup | OS
US / Production | Rackspace | Multiple managed VMs (see below) | Ubuntu 12.04
India (Current) | Reliance | Single physical machine | RHEL (version?)
India (New Production) | IBM | Multiple bare metal machines | Ubuntu 12.04
Zambia | World Vision Country Office | Single physical machine | Ubuntu 12.04
Staging | Rackspace | Multiple managed VMs, similar to prod but with fewer machines | Ubuntu 12.04
Preview | Rackspace | Shares machines with staging | Ubuntu 12.04

Our production setup in the US (and soon India) is the most interesting, since it contains many different machines and roles. These are summarized in the following graphic.

The following table describes the functional role of each machine in more detail.

Machine | Apps / Roles | Count
Proxy | Apache (load balancer) | 1
Web worker | Python, django, gunicorn | 3
Task Server | Python, django, celery | 1
Form engine | Java, Jython | 1
DB Machine | Postgres, Redis, RabbitMQ, memcached, change listeners (python, django) | 1
Elastic DB Machine | ElasticSearch | 2
Cloudant node | Cloudant | 1

In the non-production environments we run everything as a singleton on a single machine.

Current server setup and management

We have a set of salt-based tools for provisioning new servers by role. These can be found in our github repository (requires access) or as an attachment to this document. They work well for the steady-state operations we perform regularly (see below), but are not yet fully reliable for provisioning new servers.
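
To illustrate the kind of role-based provisioning we want to make reliable, a one-click setup might drive salt from Python along these lines. This is a minimal sketch, not our actual tooling: the "roles" grain and the role names are hypothetical.

    # Minimal sketch of role-based provisioning via Salt's Python client.
    # Assumes minions advertise a hypothetical "roles" grain; this is an
    # illustration of the desired shape, not Dimagi's existing scripts.
    import salt.client

    def apply_role(role):
        client = salt.client.LocalClient()
        # Apply the configured states to every minion tagged with `role`.
        return client.cmd('roles:%s' % role, 'state.highstate',
                          expr_form='grain')

    if __name__ == '__main__':
        results = apply_role('webworker')
        for minion in sorted(results):
            print('%s: highstate applied' % minion)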

Steady-state operations

The following are frequently used steady-state operations across servers that must be able to work across clouds.

Potential future state

In our future state we know we will have two multiple-server cloud environments - one in the US and one in India (this is in the near term). Over the medium term we may have up to 5 more regional environments which may be multi-server or single-server.

Future server setup and management

Here is where we get to our desired view of the world. This can be summarized as a few sets of rules.

  1. Better separation of machines and roles
  2. Everything is automated and configurable
  3. It is trivial to add capacity or machines at any point in the stack

Right now our scripts are really good at the things we do frequently - setting up web workers, configuring users, and updating settings. However, we want a complete separation of roles and machines. The ideal set of roles might look something like this:

Role | Apps
Proxy | Apache, load balancer
Web worker | Python, django, gunicorn
Task server | Python, django, celery
Form engine | Java, Jython
Postgres server | Postgres
Elastic server | ElasticSearch
Redis server | Redis
Caching server | Memcached
Rabbit server | RabbitMQ
Change listener server | Python, django
Cloudant / couch node | Cloudant / CouchDB

Note that this is not far off from our current state (it mainly involves splitting up the DB machine), but the hard part is being able to specify an arbitrary number of roles, map them onto physical servers in an arbitrary way, and have single-click installation and setup.
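
To make that goal concrete, the kind of mapping we want to be able to express might look like the following. This is a hypothetical sketch only; the hostnames and environment names are placeholders, not an existing configuration file.

    # Hypothetical role-to-host mapping: the same roles spread across many
    # machines in a large geo and collapsed onto one box in a small one.
    ENVIRONMENTS = {
        'us-production': {
            'proxy': ['proxy0'],
            'webworker': ['web0', 'web1', 'web2'],
            'taskserver': ['celery0'],
            'formengine': ['forms0'],
            'postgres': ['pg0'],
            'elastic': ['es0', 'es1'],
            'redis': ['redis0'],
            'cache': ['memcached0'],
            'rabbit': ['rabbit0'],
            'changelistener': ['changes0'],
            'couch': ['couch0'],
        },
        'small-regional-geo': {
            # Every role pinned to a single machine.
            role: ['singleserver0']
            for role in ('proxy', 'webworker', 'taskserver', 'formengine',
                         'postgres', 'elastic', 'redis', 'cache', 'rabbit',
                         'changelistener', 'couch')
        },
    }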

We are also very interested in adding redundancy where we currently don’t have any - particularly at the proxy and Postgres layers which are still single points of failure.

Deployment

We use fabric to deploy, and everything is going pretty smoothly with our current fabfile. However, there are a few things we could streamline.
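
For context, the overall shape of a fabric-driven deploy is roughly the following. This is a minimal sketch, not our actual fabfile; the hosts, paths, user, and task bodies are placeholders.

    # Minimal sketch of a fabric 1.x fabfile with an environment selector
    # and a deploy task; hosts, paths, and commands are placeholders, not
    # Dimagi's real configuration.
    from fabric.api import env, task, cd, sudo

    @task
    def production():
        # Select the production hosts and code root.
        env.hosts = ['web0.example.com', 'web1.example.com', 'web2.example.com']
        env.code_root = '/home/webuser/commcare-hq'

    @task
    def deploy():
        with cd(env.code_root):
            sudo('git pull', user='webuser')
            sudo('python manage.py migrate --noinput', user='webuser')
            sudo('supervisorctl restart all')

Invoked as "fab production deploy", this mirrors the commands described in the walkthrough below.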

From the deployer’s perspective

From the deployer’s perspective, the process goes something like this.

  1. Enter my personal dev environment
       1. cd to the commcare-hq directory (the code root)
       2. activate the commcarehq virtualenv
  2. Run the preindex command: fab production preindex_views
       1. Runs for about 5-10 minutes
  3. Wait for an email to come in saying that preindexing is complete. This is usually immediate, but if a couchdb view has changed, it can take up to 10 hours for cloudant to catch the new view up before we deploy.
  4. Run the deploy command: fab production deploy
       1. If there is a postgres migration, this can hang for 20 minutes or so in a way that affects the availability of some of our services, but it does not generally cause across-the-board downtime. This is a fairly new issue, as we've only recently begun to push more of our data into postgres.

In the best case, when there aren’t any migrations or view updates, the whole thing only takes about 15 minutes.

From the technical side

The main things that happen during a preindex:

The main things that happen during deploy:

(this section is not yet completed)