1 of 32

Distributed Marketplace

draft - comments encouraged!

(pretty outdated as of dec 2014)

2 of 32

Current single datacenter architecture

[Diagram: Internet → Zeus → Web (×3), backed by Memcache (×3), Rabbit → Celery (×3), Redis, MySQL (×3), Elastic Search, Admin/Cron, and the Filesystem]

3 of 32

Goals in order of priority

  1. Redundancy (disaster planning)
  2. Performance (geolocation)

Specifically a non-goal:

  1. Geographic Segmentation of data (data privacy laws)

4 of 32

MySQL

5 of 32

App-level Routing

A synchronized index stores the location of all our data.

[Diagram: each DC holds an Index & Slaves plus several Master & Slaves shards; App Servers consult the index before fetching data:

  App Server → Index: “Where is User 26?”
  Index → App Server: “User 26 is in DC2, Shard 7.”
  App Server → DC2, Shard 7: “Hi DC2, Shard 7. Please give me user 26.”
  DC2, Shard 7 → App Server: “Here is user 26. Please cache it well.”]

6 of 32

App-level Routing

Pros

  • Opens the door for geographic segmentation
  • “Free” sharding since it’s built into the lookup
  • Can scale (and rebalance) forever

Cons

  • An extra round trip to get the location data from the index
  • Requires extensive application modification to remove all direct SQL and categorize all current data by type (see later slides)

Conclusion: This is an expensive change (developer time) but leaves us a lot of options for scaling in the future.
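The lookup flow above can be sketched in a few lines of Python. All class and key names here are hypothetical, and plain in-memory dicts stand in for the index, the shards, and memcache; the point is the two round trips on a cache miss and the cache fill afterward.

```python
class ShardIndex:
    """Synchronized index of where every object lives (hypothetical)."""
    def __init__(self):
        self._locations = {}  # (obj_type, obj_id) -> (dc, shard)

    def register(self, obj_type, obj_id, dc, shard):
        self._locations[(obj_type, obj_id)] = (dc, shard)

    def locate(self, obj_type, obj_id):
        return self._locations[(obj_type, obj_id)]


class Router:
    """App-level router: index lookup, then a direct shard fetch."""
    def __init__(self, index, shards, cache):
        self.index = index
        self.shards = shards   # (dc, shard) -> fetch function
        self.cache = cache     # local dict standing in for memcache

    def get(self, obj_type, obj_id):
        key = (obj_type, obj_id)
        if key in self.cache:                    # cache hit: no round trips
            return self.cache[key]
        dc, shard = self.index.locate(*key)      # round trip 1: the index
        obj = self.shards[(dc, shard)](obj_id)   # round trip 2: the data
        self.cache[key] = obj                    # “please cache it well”
        return obj
```

A warm cache skips both round trips, which is why the extra index hop is tolerable in practice.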

7 of 32

Complete Abstraction of Data Objects

For effective caching and invalidation we need to cache objects, not queries. This is an extensive change to our code, replacing every database query with an equivalent object lookup. At a minimum this means putting another abstraction layer on top of the ORM; in other places it may mean replacing it completely (particularly where an object’s data is compiled from multiple sources).
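A minimal sketch of what that object layer might look like, with a process-local dict standing in for memcache. The function names are illustrative, not our actual API; the key idea is that every read funnels through one lookup keyed by object, so invalidation can target a single object rather than guessing which cached queries it touched.

```python
CACHE = {}  # stands in for memcache

def cache_key(model, pk):
    return "%s:%s" % (model, pk)

def get_object(model, pk, loader):
    """Fetch one object by key, falling back to `loader` (the ORM, or
    code that compiles the object from multiple sources) on a miss."""
    key = cache_key(model, pk)
    if key not in CACHE:
        CACHE[key] = loader(pk)
    return CACHE[key]

def invalidate(model, pk):
    """Drop exactly one object; no query-level guessing required."""
    CACHE.pop(cache_key(model, pk), None)
```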

8 of 32

Replace APIPinningMiddleware with Remote Flags

We have a naive method of avoiding stale data in APIPinningMiddleware. We should copy Facebook’s improved version of this, where the databases themselves flush the cache. This would require us to build a McSqueal clone (the non-open software they use to send signals from their databases to memcache).
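A rough sketch of that McSqueal-style flow, with a plain in-process queue standing in for the real signaling channel and the function names made up for illustration. The important property is that the database tier, not the application, is what emits the invalidation after a commit:

```python
import queue

invalidation_bus = queue.Queue()  # stands in for the cross-DC channel

def on_db_commit(table, pk):
    """Called from the database side after a row changes."""
    invalidation_bus.put("%s:%s" % (table, pk))

def drain_invalidations(local_cache):
    """Run in each DC: flush every key named on the bus."""
    while not invalidation_bus.empty():
        local_cache.pop(invalidation_bus.get(), None)
```

Because the flush is driven by the commit rather than by request pinning, the app no longer has to track which user recently wrote.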

9 of 32

Legacy API Support

Older versions of our APIs are currently maintained in the code base with if statements or, in an extreme case, an entirely different method. This doesn’t scale for developers to work on or for QA to test.

Instead of this fragile versioning strategy we should be tagging our API versions in git and releasing them separately. When we finish an API version we tag it 1.1 and ask the ops team to deploy it. Then when we deploy 1.2 it is a completely separate system and doesn’t change 1.1 at all, despite potentially massive code changes. This is easier for QA, easier for developers, and if not easier, at least straightforward for ops.

Note that this will require the Complete Abstraction of Objects in the previous slide so all the API versions don’t need to track db changes.

10 of 32

Elasticsearch

11 of 32

  • “The best practice is to not put anything in ES you couldn't stand to lose.” --Erik Rose citing this nightmare
  • ElasticSearch can be installed in each DC to operate (from ES’s point of view) independently.
    • We can layer a lightweight messaging service (Heka or pub/sub) on top to send invalidation messages to each DC. Be sure to enable compression.
    • If we build a McSqueal clone, we may be able to leverage it for invalidation as well.

12 of 32

Memcache

13 of 32

Memcache

  • Memcache is simple and can also be set up independently in each datacenter.
    • Reads will just miss the cache and the app will write to memcache when it reads the object
    • Writes will go to memcache as usual, but also to our inter-datacenter messaging service (see links in ES section)
    • Cache invalidation will be processed via the messaging service
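The three bullets above could look roughly like this, with a shared list standing in for the inter-datacenter messaging service; all names are illustrative. Reads fall through to a loader on a miss, writes go to the local cache and publish an invalidation, and each DC applies messages from the other DCs:

```python
class DCCache:
    """Per-datacenter memcache sketch with cross-DC invalidation."""
    def __init__(self, bus):
        self.local = {}
        self.bus = bus            # stands in for Heka / pub-sub

    def read(self, key, loader):
        if key not in self.local:            # miss: load and fill
            self.local[key] = loader(key)
        return self.local[key]

    def write(self, key, value):
        self.local[key] = value              # local write as usual
        self.bus.append(("invalidate", key)) # tell the other DCs

    def on_message(self, msg):
        kind, key = msg
        if kind == "invalidate":
            self.local.pop(key, None)        # next read re-fetches
```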

14 of 32

Filesystem

15 of 32

  • Bug 859340 - [mkt] Move Netapp NFS shared storage to S3
  • We’ve been moving towards S3 for several months and are ready for it to be tested.
  • Likely little work left here.

16 of 32

Monolith

17 of 32

Monolith

This is a service that simply holds all our statistics (backed by MySQL). Our frontends all cache this in ElasticSearch, so it doesn’t need to be super performant or super resilient.

Worst case: We could get by with a single master cluster in a single DC and just mirror this out everywhere else read-only. If we have a DC go down, we can manually bring this back up elsewhere. Stats are only calculated every 24 hours and we can backfill from the log files for any missing days.

Better case: What if we wanted stats faster than every 24 hours? The load balancers tie off log files every hour, I think. What if we processed them every hour? That doesn’t seem to change the worst-case plan; it would still work fine.

@robhudson - thoughts on above?

Open Question: How do multiple data centers affect how we collect metrics? Currently we parse log files the next day. < Do we still parse log files, I thought it all came from Google Analytics or the database these days.

18 of 32

Redis

19 of 32

Redis

We use redis to keep track of lists of apps because it can append() without sending the whole list over the network. We originally tried this with memcache and saturated the network.

  • The last time we tried to get rid of Redis was bug 776156 (still open). Maybe with the newer versions of ElasticSearch we could abuse that in new ways to replace this? @robhudson or @erikrose could answer this?

One thing I think I’ve learned: no one runs Redis with slaves. All those stories of people weeping with delight at how incredible and unstoppably scalable Redis is are, I’m pretty sure, about straight Redis, not a master/slave configuration. Perhaps we should do the same.

Any solutions we come up with here could leverage the messaging service to send invalidation lists between DCs, so worst case, we continue to use Redis as it is today.
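A toy illustration of why append-style writes matter here. The memcache approach must round-trip the whole list on every append (get, append, set), while the Redis RPUSH approach sends only the new element, which is what saved the network last time. The byte counts below are rough proxies (repr length), not real wire sizes:

```python
def memcache_style_append(store, key, item):
    """get -> append -> set: traffic grows with list length."""
    lst = list(store.get(key, []))
    lst.append(item)
    store[key] = lst
    return len(repr(lst))          # rough proxy for bytes sent back

def redis_style_append(store, key, item):
    """Server-side RPUSH: traffic is just the new element."""
    store.setdefault(key, []).append(item)
    return len(repr(item))         # rough proxy for bytes sent
```

Both end up with the same list; the difference is O(n) traffic per append versus O(1).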

20 of 32

Rabbit / Celery

21 of 32

Rabbit / Celery

tbd

22 of 32

Payments

23 of 32

Payments

Data about who has purchased what is stored in zamboni and solitude, with the latter being the definitive source.

When a new receipt needs to be issued - zamboni asks trunion to sign it.

  • trunion speaks to a key-signing service to get that key. I spoke to Jason, and how that could be done across multiple data centers is a real concern.
  • when a payment is completed, notifications are sent out from solitude to multiple locations via an API, we could send it to multiple servers or just to one and let zamboni
  • solitude has a MySQL backend, so it could be subject to the same MySQL replication as zamboni

I’d need to know what consistency guarantees zamboni’s MySQL has to see whether solitude’s need to be “stronger” than that.

24 of 32

Notes

  • Our sessions are currently client side which makes things a little easier
  • Ops is interested in leveraging AWS as our next DCs (see extra slide for locations)
  • If we use Amazon RDS we can also use Multi-AZ deployments, which get us enhanced durability within a region (essentially two data centers in one region)
  • The one major new service we’d add is extensive use of a messaging service to communicate when to invalidate objects between DCs

25 of 32

Extra Slides

26 of 32

Google’s analysis of data storage engines across multiple data centers (2009). Today they offer both M/S and Paxos (Paxos is the default).

27 of 32

Current Datacenter Options

Mozilla

  • US West Coast

AWS (details)

  • US West Coast
  • US East Coast
  • Europe
  • APAC
  • Australia
  • China*

28 of 32

Service Breakdown

[Chart: services plotted on two axes, SLA (from “downtime is ok” to “uptime is critical”) and Traffic (from “just a little” to “lots”): Search, Payments, Catalog, Ratings, App Submission, Abuse Reports, App Review, My Apps, Feedback, Metrics, FxA, Installation, Admin Tools]

29 of 32

Backups

Database: 5 year retention. details

File System: NFS backed up incrementally daily and fully each month to tape. Also 4 hour and 24 hour snapshots. Once we move to AWS this will change.

30 of 32

Geographic Segmentation

31 of 32

Geographic Segmentation

This is an open conversation I’m having with legal and our stakeholders. It’s still unclear how much this would benefit us specifically for privacy (it would certainly help for performance, but that’s already a goal accounted for in the rest of this document) and where we would draw the line. Some scenarios:

  • App Author lives in Europe and their email address is stored in the EU. They get notified about each new rating on their app. A user in California downloads their app (which comes from the Phoenix data center) and then rates it. The Phoenix data center asks for the email address from the EU and then sends the email.
  • When the Phoenix DC retrieves the email address above, it is going to be cached in the US for a while (hours - days).
  • How does user administration work if we need to adjust settings or help users with their accounts?
  • How do app reviews work if the reviewers are in other countries?
  • How do we determine what DC to put someone in? How does that deal with the person who is on vacation across the world and then moves back home?
  • Our logs are aggregated into one DC for archiving - if we can’t do that there will be substantial changes to ops’ processes as well.

32 of 32

App-level Routing Bonus Slide (with sharding)

[Diagram: DC 1, DC 2, … DC N: each contains Global Data plus Regional Data Shards]

We would need to divide application data into Global or Regional. Global data examples would be Apps, Categories, Prices. Regional data examples would be Users, Logs.
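The Global/Regional split could be expressed as a simple routing rule. The type tables below just restate the examples on this slide; the function name and DC labels are illustrative:

```python
GLOBAL_TYPES = {"app", "category", "price"}   # replicated to every DC
REGIONAL_TYPES = {"user", "log"}              # live only in a home DC

def route(obj_type, home_dc, local_dc):
    """Return the DC to query for an object of this type."""
    if obj_type in GLOBAL_TYPES:
        return local_dc   # replicated everywhere; always read locally
    if obj_type in REGIONAL_TYPES:
        return home_dc    # single home; may be a remote round trip
    raise ValueError("uncategorized data type: %s" % obj_type)
```

The ValueError is the real cost of this design: every piece of application data has to be categorized before it can be routed at all.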