1 of 34

Migrating Spotify's Runtime to Kubernetes

Spotify’s Infrastructure Transition

James Wen - jameswen@spotify.com
@rochesterinnyc
O’Reilly Velocity Cloud Computing Day 2018
October 1st, 2018

2 of 34

About Me

  • Site Reliability Engineer

  • Once Provisioning + DNS

  • Runtime + Kubernetes Migration
  • Service Mesh/Istio Adoption

3 of 34

Agenda

  • Intro Spotify Ops/Runtime
  • Migration Journey + How
  • Custom Extensions to Kubernetes
  • Future Problems
  • Learnings/Takeaways

4 of 34

5 of 34

About Us

  • 180M+ active users

  • 83M+ subscribers

  • 35M+ songs

  • 2B+ playlists

  • 1B+ plays per day

6 of 34

Complete Team Autonomy

Operations at Spotify

Ops in Teams

Infrastructure & Operations Org

Centralized Ops

Golden Path

7 of 34

Current Runtime

8 of 34

New Runtime

9 of 34

New Runtime*

10 of 34

Kubernetes: How We’re Migrating

11 of 34

Experiment: One Service on Cluster

12 of 34

Experiment: One Service on Cluster

  • Cluster Creation
  • DNS Setup
  • Networking + Routability Setup
  • Integration with existing service discovery
  • Integration with existing metrics system
  • Logging

13 of 34

Experiment: Three Services on Shared Cluster

14 of 34

Experiment: Three Services on Cluster

  • Permissioning by namespace

  • Resource Quotas by namespace

  • Developer Documentation
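As a rough illustration of per-namespace permissioning and quotas on a shared cluster, the sketch below builds two standard Kubernetes manifests as plain Python dicts: a `ResourceQuota` and a `RoleBinding` to the built-in `edit` ClusterRole. The namespace name, group name, and quota values are hypothetical and are not taken from the talk.

```python
import json


def namespace_quota(namespace, cpu, memory, pods):
    """Build a ResourceQuota manifest capping total requests in one namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "pods": str(pods),
            }
        },
    }


def namespace_edit_binding(namespace, group):
    """Grant a team group the built-in `edit` ClusterRole, scoped to one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{namespace}-edit", "namespace": namespace},
        "subjects": [
            {"kind": "Group", "name": group, "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {
            "kind": "ClusterRole",
            "name": "edit",
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


if __name__ == "__main__":
    # Hypothetical team namespace and group, for illustration only.
    print(json.dumps(namespace_quota("example-team", "20", "64Gi", 50), indent=2))
    print(json.dumps(namespace_edit_binding("example-team", "example-team-devs"), indent=2))
```

The printed JSON can be applied with `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.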

15 of 34

Alpha Phase: Volunteer Services on Shared Clusters

16 of 34

Alpha: Volunteer Services on Shared Clusters

  • Test Clusters
  • Scripted Cluster creation
  • Integration with existing secrets system
  • Initial deployment tooling and CI Integration
  • Lots of learnings from varied services

17 of 34

Alpha Phase: Two High-Traffic Services on Shared Clusters

fanout

metadata-proxy

18 of 34

Alpha: Two High-Traffic Services on Clusters

  • Really understand network setup

  • Use/experimentation with autoscaling

  • Provide confidence/reference example
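For context on the autoscaling experiments, the Horizontal Pod Autoscaler's documented scaling rule is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, with a tolerance band (0.1 by default) inside which no scaling happens. The sketch below reproduces only that arithmetic. The function name and sample numbers are illustrative, and the talk does not state which autoscaler settings were used.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Apply the HPA scaling rule to a single metric.

    Returns the current replica count unchanged when the metric ratio is
    within the tolerance band around 1.0.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 pods averaging 200m CPU against a 100m target: scale out.
    print(desired_replicas(4, 200, 100))
    # 4 pods at 105m against a 100m target: within tolerance, no change.
    print(desired_replicas(4, 105, 100))
```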

19 of 34

Beta Phase: Self-Service Migration

20 of 34

Beta Phase: Self-Service Migration

  • Ops (reliability, oncall, alerts, disaster recovery, backups)

  • Self-service of permissioning/quotas

  • Sustainable Deployment (Spinnaker)

  • Decision on manifests storage/VCS

21 of 34

GA Phase: “Golden Path”

22 of 34

GA Phase: “Golden Path”

  • Vertical Autoscaling (Right Sizing)
  • Custom Metrics Autoscaling
  • “Everything Else” (ex. ITGC)
  • Migration Road Team

23 of 34

Custom Extensions

24 of 34

Admission Controllers

  • Inject or enforce behavior on any/all k8s resources

  • Infrastructure Injection
    • ffwd-java-shim container injection
    • Env var injection
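Env var injection of this kind is typically done with a mutating admission webhook, which answers each `AdmissionReview` request with a base64-encoded JSON Patch. The sketch below shows only the response-building step, using the `admission.k8s.io/v1` response shape. It is an assumption about the general technique, not Spotify's implementation, and the variable name `EXAMPLE_METRICS_HOST` is hypothetical.

```python
import base64
import json


def inject_env_response(review, name, value):
    """Build an AdmissionReview response that adds one env var to every container."""
    request = review["request"]
    pod = request["object"]
    entry = {"name": name, "value": value}
    patch = []
    for i, container in enumerate(pod["spec"]["containers"]):
        env = container.get("env")
        if env is None:
            # No env list yet: create it with the new entry.
            patch.append(
                {"op": "add", "path": f"/spec/containers/{i}/env", "value": [entry]}
            )
        elif all(e["name"] != name for e in env):
            # Env list exists and lacks the variable: append to it.
            patch.append(
                {"op": "add", "path": f"/spec/containers/{i}/env/-", "value": entry}
            )
    response = {"uid": request["uid"], "allowed": True}
    if patch:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


if __name__ == "__main__":
    review = {
        "request": {
            "uid": "abc-123",
            "object": {"spec": {"containers": [{"name": "app"}]}},
        }
    }
    # EXAMPLE_METRICS_HOST is a hypothetical variable name.
    print(json.dumps(inject_env_response(review, "EXAMPLE_METRICS_HOST", "localhost"), indent=2))
```

Injecting a sidecar container would follow the same pattern, with an `add` operation on `/spec/containers/-` instead.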

25 of 34

Custom Resources

  • CeloSecrets

26 of 34

Resource Labels & Metadata

  • Service Discovery Integration leveraging labels

  • “Give me all Service resources with this label + for each of these: register all of their ports under the metadata name”
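The quoted rule can be sketched as a pure function over Service resources that have already been fetched from the API as dicts. The label key and value used here (`example.com/discovery`, `enabled`) and the output shape are hypothetical. The talk does not describe how the existing discovery system receives registrations.

```python
def registrations(services, label_key, label_value):
    """Map each matching Service's metadata name to its list of (port name, port)."""
    result = {}
    for svc in services:
        labels = svc["metadata"].get("labels", {})
        if labels.get(label_key) != label_value:
            continue  # Service has not opted in to discovery.
        name = svc["metadata"]["name"]
        ports = svc["spec"].get("ports", [])
        result[name] = [(p.get("name", ""), p["port"]) for p in ports]
    return result


if __name__ == "__main__":
    services = [
        {
            "metadata": {"name": "playlist", "labels": {"example.com/discovery": "enabled"}},
            "spec": {"ports": [{"name": "http", "port": 8080}, {"name": "grpc", "port": 9090}]},
        },
        {
            "metadata": {"name": "internal-only"},
            "spec": {"ports": [{"name": "http", "port": 8080}]},
        },
    ]
    print(registrations(services, "example.com/discovery", "enabled"))
```

Against a live cluster, the filtering would normally be pushed to the API server with a label selector rather than done client-side.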

27 of 34

Spinnaker for Deployment

  • Support complex deployment strategies to k8s
  • Abstract behind existing tugboat system

28 of 34

Metrics Integration

  • Metrics
    • Additional k8s-specific tags on existing metrics
    • container resource usage (metrics-api)
    • kubernetes state metrics (kube-state-metrics)

29 of 34

Event Exporting for ITGC

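Diagram: event export pipeline

  • Kubernetes API Server → Log Backend (Stackdriver): API events
  • Log Backend (Stackdriver) → k8s API events Pub/Sub Topic: Logs (Export Sink)
  • k8s API events Pub/Sub Topic → Cloud Function: Pub/Sub messages
  • Cloud Function → ITGC Pub/Sub Topic: Pub/Sub messages
  • ITGC Pub/Sub Topic → ITGC System: Pub/Sub messages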
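The Cloud Function step in this pipeline could look roughly like the sketch below. It decodes a Pub/Sub-delivered log entry and reshapes it into a smaller record. It assumes the exported entries are audit-log style `LogEntry` JSON with a `protoPayload` section, which the slides do not confirm. The output record fields are hypothetical, and publishing to the ITGC topic is deliberately left out.

```python
import base64
import json


def to_itgc_record(event):
    """Decode a Pub/Sub event carrying an exported log entry and pick out key fields."""
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    payload = entry.get("protoPayload", {})
    return {
        # Output field names are illustrative, not a real ITGC schema.
        "timestamp": entry.get("timestamp"),
        "actor": payload.get("authenticationInfo", {}).get("principalEmail"),
        "action": payload.get("methodName"),
        "resource": payload.get("resourceName"),
    }


def handle(event, context=None):
    """Entry point in the shape of a Pub/Sub-triggered background function.

    Returns the bytes that would be published downstream; the publish call
    itself is omitted from this sketch.
    """
    return json.dumps(to_itgc_record(event)).encode("utf-8")


if __name__ == "__main__":
    sample_entry = {
        "timestamp": "2018-10-01T12:00:00Z",
        "protoPayload": {
            "methodName": "io.k8s.core.v1.pods.delete",
            "resourceName": "core/v1/namespaces/example-team/pods/example-pod",
            "authenticationInfo": {"principalEmail": "dev@example.com"},
        },
    }
    event = {"data": base64.b64encode(json.dumps(sample_entry).encode("utf-8"))}
    print(handle(event))
```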

30 of 34

Future Problems

31 of 34

Cluster Management

  • Automated scripts → Declarative

  • Terraform or GCP Cloud Deployment Manager
    • Current Concern: Release/Update cycle for new/beta GKE features

32 of 34

DI + ML Cohabitation

  • Current scope = stateless backend services + simple workloads (ex. CronJobs)
  • Future - Support:
    • Data Jobs
    • Machine Learning Workloads
    • GPU Workloads

33 of 34

Migration Learnings/Takeaways

  • Be mindful and deliberate in dictating terminology
  • Migrate in stages/series of goals that increase in scope
  • Talk to other companies about how they do infrastructure/solve infra problems

34 of 34

James Wen - jameswen@spotify.com
@rochesterinnyc

O’Reilly Velocity Cloud Computing Day 2018
October 1st, 2018

Thanks for listening!

Join the Band!

spotifyjobs.com