Menzies by Jason Kivlighn and Scott Shawcroft
Mentored by Andy Schwerin
Menzies is a distributed datastore meant to replace the current backend for OpenStreetMap.org. OpenStreetMap is an attempt to bring collaborative wiki-like editing to mapping. The data is used for a number of purposes including map tile creation and directions. These consumers leverage two different methods of accessing the data. First, bulk data dumps and diffs provide read-only means of getting the data. Second, a RESTful API provides for live data manipulation.
Our Goals
Our goal with Menzies is to provide a drop in replacement for the existing OpenStreetMap (OSM) backend. Unlike our solution, their current setup involves one single computer running Ruby on Rails with a MySQL backend. The database server exposes the data in two ways: a REST API and bulk XML exports. We aim to provide an alternative implementation of the REST API. Our design does not remove the possibility of bulk XML exports. Instead of implementing it we prioritized other features, particularly bounding box queries. In the diagram below, we are redoing the "Database" and "Protocol Interface" portions. As a result, we can see that with our implementation we are able to run their map editors, such as Merkaartor, by simply directing it to our server.

So, our goal was to implement their REST API in such a way that it is a drop in replacement. We did prioritize portions of the API and left other sections unfinished so that we could focus on the scalability aspects. These unfinished parts include user authentication and changesets. They are completely doable but less relevant to scalability so we left them for later. We chose to implement create, get and bounding box REST calls and focused less on versioning, editing and deletion. The full API of which we implemented parts of is the OpenStreetMap 0.6 API. It is documented on their
wiki.
Since we decided to implement their API, we had to embrace their model for storing map data. It centers around three fundamental objects: nodes, ways and relations. Nodes are single points, ways are a sequence of nodes and relations tie together nodes, ways and other relations. For example, a road would be a way consisting of a number of nodes. This relationship between them leads to exponential relationships between the different types. More specific information about the feature type is stored via tags. These tags are simply key-value pairs which could potentially lead to very different ways of tagging features. However, OSM
documents a large set of standard tags to be used. Editors also use these standard tags, reducing the need for custom tags and preventing multiple different tags for the same feature type.
Use case
In order to understand our implementation design we first established a common use case for the API. As we can see from the above diagram, the API is largely used by editors and also to create tiles on demand. Both of these tasks are dependent on users modifying the map data. Using an editor, an OSM user modifies the map through the API and then utilizes Tiles@Home to generate new tiles for the area edited. This breaks down to at least one initial bounding box import, individual create, edit, delete calls and then another bounding box query to gather the data for tile generation.
Furthermore, we must take into consideration the limitations of a non-profit organization's scale. We did not design this system for a Google or Amazon scale. Currently, the OSM server is a single machine running MySQL. For the move to the 0.6 API the OSM folks are spending $15K on a
new single machine. Although we're not scaling to hundreds of machines, we've design this system to scale from one to three to ten machines. In our test setup we used four machines. Three machines only did node serving while the other machine ran the frontend, relation and way processes.
Implementation
Technologies
Menzies is written entirely in Python.
Communication between internal servers use
Thrift, a library that provides an RPC implementation and efficient object serialization. We also take advantage of Thrift's serialization to store objects in the database.
The data store consists of a set of Berkeley DB databases. There is a separate database for each of the ways, relations, and nodes. There is another set of databases that store reverse indexes as needed.
Server Components 
We implemented the OSM data store as four distinct pieces: the frontend server, a way server, a relation server, and a set of node servers. We decided on this task partitioning because the number and growth rate of nodes far outweighs that of the ways and relations. The figure at the right shows this relationship over time. Pink is nodes, Green is ways and Red is relations.
The frontend server accepts RESTful queries to the data store. The underlying HTTP server is built into Python. Communicating via Thrift, the frontend delegates to the way, relation, and node servers.
The way server stores all the ways and maintains a reverse index from node to ways that contain that node. The relation server stores all the relations and maintains reverse indices from node to relations that contain that node, way to relations that contain that way, and relation to relations that contain that relation. Each node server stores a subset of the nodes and a spatial index that returns all nodes in a given region.
Node Partitioning

The largest gains from distributing the data amongst a number of servers stem from the division of nodes across servers. As a result, we needed to design a mechanism to determine which node servers to talk to based on a given query. There are three node access mechanism to consider: bounding box, get by id, and create node. From the beginning we knew we wanted to partition the nodes geographically. Therefore, we would query all node servers to find a particular id. To do a bounding box we would query all node servers that intersect the given bounds.

When creating or editing a node, we know exactly which node server to communicate with.
Therefore, we needed a metric to divide the nodes. We could have split them randomly and thus gotten an even distribution across all servers. However, this would provide no geographic gains. We could split them such that nodes are evenly spaced across the servers or we could split it such that the incoming queries are evenly distributed across the server. Since the incoming queries are user edit driven, we figured that users have a bias towards editing near their homes and at particular times of the day. As a result, we settled on a latitude-based partitioning scheme, splitting as seen in the above map. This scheme has the benefit that some users will be editing while other users in the same partition are asleep. Looking at sample data from OSM hour diffs we have found that this does apply more often than naught.
Bounding Box Queries
The most computationally costly aspect of the OSM data store API is the bounding box call. This call returns all the nodes, ways, and relations in a given bounds. Implementation of this call involves looking up all nodes in the given bounds. Then for each of these nodes, we lookup all the ways and relations that contain each of these nodes. Then for each of these ways and relations, we lookup all the relations that contains these ways or relations.
The key to making such queries fast was using a spatial index to find which nodes were within the given bounds. We opted to build an R-tree to handle this task. An R-tree is, literally, a tree of rectangles. Data in internal nodes consists of a rectangle and a pointer to the next node in the tree for that rectangle. Data in the leaves consists of a rectangle and the node ids that fall within that rectangle.
To get good performance from the R-tree, we needed to ensure we build a balanced tree. Rather than building the tree by inserting one node at a time, we used a bulk loading method called Sort Tile Recursive (STR). STR works by laying out all the nodes in a 2-D plane and splits them into an n by n grid. Levels are built up by recursively building grids until you get a grid of 1 by 1. Building this index over nodes for the entire world (over 1 billion nodes), creates a tree of only 6 levels and queries take a fraction of a second.
Below is a visualization of three depths of the R-tree around Africa when searching for nodes within part of Ethiopia. This particular spatial index spans the entire world south of the United States and finding a point in Ethiopia required searching only 11 rectangles. Black rectangles are all the rectangles in the spatial index and those traversed are in red.
Performance Analysis
The current OSM database server is an Athlon 64 X2 5200+ (2.6GHz dual core with 1Mb L2 cache per core) with 8 GB of RAM. Our system is running on four boxes each with a quad core Intel(R) Xeon(TM) CPU 2.40GHz processor and 4 GB of RAM.
Below is a sample of average latencies of bounding box queries on our system and the existing system.
| OSM
| Menzies
|
University of Washington
| 1 min 37 sec
| 12 sec
|
London
| 4 min 4 sec | 24 sec
|
Brazil
| 12 sec
| 3 sec |
It is important to note that we queried the live OSM data store, which at any given time is under an unknown load. Our system, on the other hand, was under no load. Problems in setting up their existing system on one of our own machines prevented comparison under equal server load. Nevertheless, further testing of our system under heavy load still gave favorable results.
Disk caching also played a key role in our system's performance. For example, London initially took nearly 3 times longer than average to load. To lessen the effects of the cache for testing, we issued hundreds of bounding boxes queries across the globe to randomize the cache. Afterwards, all our test queries were right at average latencies. This may suggest that once the system has been running for awhile, it will hold and keep the most pertinent information (such as database indexes) in the cache at all times. One could also perform further manual tuning of Berkeley DB's cache size.
Scalability Challenges
Concurrency
One of the first challenges we identified was an issue with concurrency. While we had one single query working at a time, multiple queries at a time caused errors. We investigated Thrift as the source of errors and
found a few. However, none of these fixed the problem we were experiencing. We finally decided to look into our database concurrency. We had written the code assuming concurrency was taken care of because it was a stated feature of Berkeley DB. After some investigation we realized that we were indeed having concurrency issues because we were not correctly using the database. The fix was to start the database within a concurrency aware environment.
At this point, Jason profiled the frontend during a single bounding box query and found that it cascaded into tens of thousands of single node and way queries. More specifically, after gathering the nodes in a bounding box we then find all ways that reference those nodes. At first, we repeatedly called getWaysFromNode to get all the ways for all the nodes, one node at a time. This resulted in one RPC call for each node in the bounding box. On the profile graph it was clear we were spending a significant amount of time doing these calls. To combat this cost we tried doing one single call to get all the ways from all the nodes as once. This change immediately showed promising results. First, we reduced much of the overhead from so many RPC calls. Second, by loading the ways in bulk we avoided many duplicate fetches from the database (e.g. for a road, we did not re-fetch the way for each point on the entire road). Below, we can see this speedup in the dark green boxes for the getNode(s) call time.


A third issue we ran into once we got concurrency working without error, was extremely poor throughput. Queries in parallel took minutes longer than if done one after the other in series. We had heard of the Global Interpreter Lock (GIL) as an issue in Python threading and figured that we had ran into it. To avoid this constraint we switch from ThreadingTCPServers to ForkingTCPServers such that each request would have its own interpreter. We did not confirm that the GIL was indeed the problem but found an obviously positive impact on throughput. As a result, we've kept that change. In the above diagram, we can see the time reduced in the light green boxes -- time that was likely spent waiting on the GIL rather than doing any meaningful processing.
Impacts of an Interpreted Language
The last issue we encountered dealt with the noticeable impact of running a service with an interpreted language. After dealing with the above concurrency issues, we noticed a disproportionate amount of time in the frontend was spent doing XML processing. We expected most time to be spent waiting for RPC to finish. This was a problem for scalability because if the frontend server spends too much time on the CPU, it will become a bottleneck and no further amount of distributing the data store will help the system.
However, once we saw where the frontend was choking on the CPU, we swapped out the pure Python DOM implementation for one backed by a native C library (libxml2) and saw tremendous improvement. As seen in the graph below, latencies were nearly cut in half for the same queries as those in the graphs seen earlier.
Additionally, after further profiling, we found another hot-spot relating to serializing objects to and from strings. There was a Python module that could do the serialization in C and we immediately saw 10%+ speed gains.
Once all the tight loops were being processed by native rather than interpreted code, the effects of an interpreted language were negligible. Particularly for the frontend server, time is now well spent waiting on the network and accepting new requests, rather than getting bogged down on the CPU.
Reliability
Reliability was not a goal of our work. Since we are not scaling to hundreds of machines, we assume that maintenance can be performed manually on a case by case basis. Mechanisms such as redundancy could be added to Menzies to ease manual maintenance. However, we feel this is still much better than their current setup which relies on the maintenance of one super machine. Menzies provides some rudimentary fault tolerance with respect to a single machine because one machine's performance does not depend on another's. For example, if the node machine with the United States crashes and requires a restart, queries outside the United States will continue to work. With a single machine, a crash would take down the entire system.
What we didn't do - and how we could do it
Diffs
OSM currently provides diffs of all changes to the database at regular intervals (minutely, hourly, daily, etc). While we have not implemented this functionality, it would still be possible within the framework. Since all edits go through the frontend server, this server could keep a log of any edits and a timestamp. Regular diffs could then be created based on this log.
User management
OSM keeps a database of user accounts, requiring registration and authentication before modifying the database. Menzies currently allows anybody to modify the database, but it would be trivial for the frontend to manage a database of users and authenticate before any modifications are made.
Changesets
The largest feature they've added to their API 0.6 is the concept of a changeset. They are aimed at lumping changes into larger sets to ease the wiki nature of OSM. This involves create and close calls before and after the normal changes. We stumped out the create call in order to work with existing editors but do not track them on the backend. We could support changesets by adding an additional table and its corresponding logic. However, we decided that it was not core to the challenge of implementing the API.
Conclusion
In the end, we did many of the things we originally set out to do. In particular, we are able to point Merkaartor to our service and use it to edit the OSM global data. We've leveraged Python and Thrift to get a number of computers participating in a single query. Furthermore, our setup features two geographic optimizations, the partition of nodes across servers and the spatial index on each of those servers. We began to test our system against the existing implementation and found that our latency is better and postulate that our throughput would also be better. Unfortunately, these conclusions cannot be proven due to the difficulty in setting up the existing implementation on our own machine. Ultimately, we've created an alternative OSM implementation that leverages the scalability of a distributed data store and the geographic nature of the dataset.