1 of 26

Kubeflow at Spotify

How Kubeflow Pipelines fits into our Machine Learning ecosystem

3 of 26

Josh Baer

ML Platform Product Lead

Twitter: @j6aer

4 of 26

Music Streaming Service

Launched in 2008

230M Active Users

50M Tracks

79 Countries

5 of 26

6 of 26

Machine Learning Journey - Overview

7 of 26

How long does it take to build an ML prototype?

Most teams spend 1-3 “sprints” getting an initial prototype out

How many product teams will wait this long to get initial learnings?

8 of 26

How long does it take to go from prototype -> production-grade solution?

Over 30% of ML practitioners spend more than a quarter turning an idea into production software

9 of 26

Machine Learning Journey - Overview

01 Problem Definition

02 Model Prototyping: Develop, Train, Evaluate, Tweak

03 ML Productionization

04 Measurement, Experimentation and Tweaking

10 of 26

Problem Definition -> Prototype -> Productionize -> Measure

Step durations: 4w, 2w, 1w, 2w, 1w, 2w, 2w

Many iterations per phase

11 of 26

Problem Definition -> Prototype -> Productionize -> Measure

Step durations: 4w, 2w, 1w, 2w, 1w, 2w, 2w

14 weeks to go from a defined problem to a production solution!
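
The 14-week figure is the sum of the seven durations shown on the timeline. A quick check, which only transcribes the slide's labels (how the steps map onto the four phases is not stated on the slide):

```python
# Durations in weeks, in the order they appear on the slide.
step_weeks = [4, 2, 1, 2, 1, 2, 2]

total_weeks = sum(step_weeks)
print(f"{total_weeks} weeks")  # 14 weeks
```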

12 of 26

Other Challenges

Difficult to Collaborate

Keeping track of projects, artifacts and lineage was difficult.

No common way of building workflows

Teams using N different frameworks in different ways. No shared learnings.

Slow feedback loops

Data analysis was separate from model training and model analysis. Each step was custom.

13 of 26

Kubeflow Pipelines

  • Started discussing it with Google in early 2018
  • Aligned our infra tooling with their direction
  • Product launched in late 2018
  • Promising early results!

14 of 26

TensorFlow Extended (TFX)

  • Evaluated in mid 2018
  • Decided to replace our Scala-based ML tooling with TFX

15 of 26

Kubeflow + TFX at Spotify

  • Launched a team to make Kubeflow Pipelines work for Spotify
  • Thin internal layer to help development speed and integrate with Spotify ecosystem

16 of 26

“Spotify” Kubeflow Setup

Test Cluster

Internal development cluster to test upgrades, run integration tests

Development Cluster

For running ad-hoc jobs, developing new workflows

Production Cluster

For regularly scheduled workloads

Higher availability SLA

17 of 26

Other Spotify Kubeflow Features

Caching

Quicker resumption of failed tasks

Central Metadata

Keep track of what’s being built and run Spotify-wide

Command Line Tooling

Allows for scheduling and execution of jobs via Luigi (Spotify orchestration)

Shared-VPC Integration

Connect with other Spotify services

Common TFX Components

Easily run TFX-based pipelines
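
How the caching feature is implemented is not described in these slides. As a rough illustration of the general idea only, the sketch below keys each task's result on a hash of its name and inputs, so a rerun after a failure can skip tasks that already completed. Every name here (`cache_key`, `run_cached`, the in-memory `_cache`) is hypothetical and is not part of Kubeflow Pipelines or Spotify's tooling.

```python
import hashlib
import json

_cache = {}  # hypothetical in-memory store; a real system would persist results


def cache_key(task_name, inputs):
    """Derive a stable key from the task name and its JSON-serializable inputs."""
    payload = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(task_name, fn, inputs):
    """Run fn(**inputs) unless an identical call has already succeeded."""
    key = cache_key(task_name, inputs)
    if key in _cache:
        return _cache[key], True   # cache hit: nothing is recomputed
    result = fn(**inputs)          # an exception here leaves the cache untouched
    _cache[key] = result
    return result, False


calls = []  # records every real execution of the task


def train(epochs):
    calls.append(epochs)
    return f"model-after-{epochs}-epochs"


first, first_hit = run_cached("train", train, {"epochs": 3})
second, second_hit = run_cached("train", train, {"epochs": 3})
```

The second call returns the stored result without running `train` again, which is the behaviour that makes resuming a partly failed pipeline quicker.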

18 of 26

Over 15,000 Kubeflow Pipeline Runs!

19 of 26

Machine Learning Journey - Updated

Problem Definition -> Prototype -> Productionize -> Measure

Step durations: 4w, 1w, 2d, 1d, 1d, 2d, 1d

Shorter iteration cycles => faster time to production => better ML in our products

Kubeflow Pipelines
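
The slide gives no total for the updated journey. A small sketch for comparing it with the earlier 14-week timeline, assuming a 5-day working week (an assumption made here, not something the slides state) and using only the duration labels from the two timelines:

```python
WORKDAYS_PER_WEEK = 5  # assumption: the slides do not say how weeks map to days


def to_days(label):
    """Convert a label such as '4w' or '2d' into working days."""
    value, unit = int(label[:-1]), label[-1]
    return value * WORKDAYS_PER_WEEK if unit == "w" else value


before = ["4w", "2w", "1w", "2w", "1w", "2w", "2w"]  # original journey
after = ["4w", "1w", "2d", "1d", "1d", "2d", "1d"]   # updated journey

before_days = sum(to_days(label) for label in before)
after_days = sum(to_days(label) for label in after)
print(before_days, after_days)
```

Under that assumption the steps measured in weeks on the original timeline shrink to days on the updated one, while the first 4-week step is unchanged.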

20 of 26

Recent Progress

  • During hack week, over 1000 runs of pipeline experiments
  • Developers are loving the integration of data validation, training and model analysis
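
To make the idea of one workflow covering data validation, training and model analysis concrete, here is a minimal plain-Python sketch. It is not the Kubeflow Pipelines or TFX API: the step functions, the `ctx` dictionary and the toy "model" are all hypothetical stand-ins, and the ordering uses the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def validate_data(ctx):
    # Keep only rows whose label is a valid binary value.
    ctx["valid_rows"] = [r for r in ctx["rows"] if r["label"] in (0, 1)]


def train(ctx):
    # Toy "model": always predict the majority label.
    labels = [r["label"] for r in ctx["valid_rows"]]
    ctx["model"] = {"majority": round(sum(labels) / len(labels))}


def analyze_model(ctx):
    # Fraction of validated rows the toy model gets right.
    hits = sum(r["label"] == ctx["model"]["majority"] for r in ctx["valid_rows"])
    ctx["accuracy"] = hits / len(ctx["valid_rows"])


steps = {"validate_data": validate_data, "train": train, "analyze_model": analyze_model}
# Each step maps to the set of steps that must finish before it.
depends_on = {"validate_data": set(), "train": {"validate_data"}, "analyze_model": {"train"}}

ctx = {"rows": [{"label": 1}, {"label": 1}, {"label": 0}, {"label": 7}]}
order = list(TopologicalSorter(depends_on).static_order())
for name in order:
    steps[name](ctx)
```

Because all three steps live in one dependency graph, a change to the validation rule automatically flows through training and analysis on the next run.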

21 of 26

Our Kubeflow Timeline

Aug 2018

Kubeflow Pipelines launches

Jan 2019

First teams trying out Kubeflow. Start focusing infra efforts

Aug 2019

“Spotify” Kubeflow Pipeline Platform launched in alpha.

We’re here

Jan 2020

Launch beta - open it up to the entire Spotify community

22 of 26

Our Vision for Kubeflow Pipelines

D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, and J.-F. Crespo. Hidden technical debt in machine learning systems. In Neural Information Processing Systems (NIPS), 2015.

23 of 26

Our Vision for Kubeflow Pipelines

Kubeflow

Components

24 of 26

Our Vision for Kubeflow Pipelines

Kubeflow

25 of 26

For building this community...

26 of 26

Want to hear more?

Check out Keshi and Ryan’s talk at KubeCon: “Building and Managing a Centralized Kubeflow Platform at Spotify”