1 of 37

Scaling Scala

@spotify

Spotify

Scala love

13.02.2021

2 of 37

about:jto

  • 3 years Spotifier (🇸🇪 Stockholm)
  • Data Engineer (Data and Insights)

3 of 37

about:regadas

  • 5 years Spotifier (🇺🇸 NYC)
  • Data Engineer (Data and Insights)

4 of 37

about:spotify

  • Audio streaming subscription service
  • 345M+ Monthly Active Users
  • 155M+ Subscribers
  • 70M+ Songs
  • 4B+ Playlists
  • 93 Markets

https://newsroom.spotify.com/company-info/

5 of 37

Products

  • Discover Weekly
  • Your year Wrapped
  • Fan insights
  • Anonymisation
  • Fraud detection
  • ...

6 of 37

data processing @Spotify

7 of 37

data processing @Spotify

2015 - 2017

    • 68M MAU -> 132M MAU
    • Hadoop + Hive + Crunch + Scalding + Luigi + Spark + Storm + ...
    • ~2,500-node Hadoop cluster (biggest in Europe)
    • > 100 TB of data every day, > 100 PB capacity, > 100 TB memory
    • 20K jobs/day from 2K unique workflows across 100 different teams
    • “Data-center is running out of physical space”

2018

    • 157M MAU
    • Monday June 4th - Hadoop cluster EOL
    • Completely migrated from on-premises to cloud

2021

    • 345M MAU
    • 11,000+ unique Scala data pipelines
    • 60,000+ executions per day
    • ~780B events per day
    • Batch and streaming

8 of 37

data processing @Spotify

2021

    • ~900 users on our #data-support channel
    • ~500 users on our #scala channel

9 of 37

Data Engineers

  • Java / Scala expertise
  • Data quality
  • Reliability
  • Performance / cost

Data Scientists

  • Python / R and SQL
  • Ease of use
  • Exploration
  • Experimentation

Backend Engineers

  • Java expertise
  • Reliability
  • Stability
  • Ease of use

10 of 37

data processing @Spotify

11 of 37

Scio

12 of 37

about Scio

  • A Scala API for data processing
  • Based on Apache Beam
  • Unified batch and streaming
  • Open source (Apache v2.0)
  • Portable: runs on Dataflow, Spark, Flink, ...

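    import com.spotify.scio._

    val sc = ScioContext()

    sc.textFile(sourcePath)                         // read the input as lines
      .flatMap(_.split("\\s+").filter(_.nonEmpty))  // split into non-empty words
      .countByValue                                 // (word, count) pairs
      .saveAsTextFile(target)

    sc.run()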

https://github.com/spotify/scio

13 of 37

Why Scala?

  • Immutable by default
    Mutability makes distributed computing awkward
  • Strong type system
    Guides the user, prevents a lot of human errors
  • JVM ecosystem
    Lots of libraries, good tooling, battle-tested
  • Performant*
    Data processing is expensive

14 of 37

Expressiveness

15 of 37

Leveraging Scalac

  • Case class derivation from BQ schema (ease of use, reliability)
  • BQ schema derivation from case class (ease of use, reliability, data quality)
  • Conversion from BQ row to case class (ease of use, reliability)
  • Coder derivation (performance)
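
The slide does not show how the derivation is implemented. As a rough illustration of the "schema from case class" idea only, here is a minimal sketch that assumes a Scala 3 compiler and uses the standard library's Mirror mechanism. ColumnType, Schema and Track are hypothetical names made up for this example; none of them is Scio's API.

```scala
import scala.compiletime.{constValue, erasedValue, summonInline}
import scala.deriving.Mirror

// Hypothetical stand-in for a BigQuery column type (not Scio's API).
trait ColumnType[A] {
  def name: String
}

object ColumnType {
  given ColumnType[String] = new ColumnType[String] { def name = "STRING" }
  given ColumnType[Long]   = new ColumnType[Long] { def name = "INTEGER" }
  given ColumnType[Double] = new ColumnType[Double] { def name = "FLOAT" }
}

object Schema {
  // Field names, read from the tuple of singleton label types at compile time.
  inline def labels[T <: Tuple]: List[String] =
    inline erasedValue[T] match {
      case _: EmptyTuple => Nil
      case _: (h *: t)   => constValue[h].toString :: labels[t]
    }

  // Column type of each field; compilation fails if a field type has no ColumnType.
  inline def columnTypes[T <: Tuple]: List[String] =
    inline erasedValue[T] match {
      case _: EmptyTuple => Nil
      case _: (h *: t)   => summonInline[ColumnType[h]].name :: columnTypes[t]
    }

  // (field name, column type) pairs for any case class A.
  inline def of[A](using m: Mirror.ProductOf[A]): List[(String, String)] =
    labels[m.MirroredElemLabels].zip(columnTypes[m.MirroredElemTypes])
}

// Hypothetical record type used only for this example.
final case class Track(title: String, plays: Long, score: Double)

val trackSchema = Schema.of[Track]
// List((title,STRING), (plays,INTEGER), (score,FLOAT))
```

Because the schema is computed by the compiler, adding a field with an unsupported type is a compile error rather than a runtime failure, which is the reliability and data quality argument made above.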

16 of 37

Scala at Spotify

  • Scala for Data engineering
  • ~900 users on our #data-support channel
  • ~500 users on our #scala channel
  • Scala Center advisory board member

17 of 37

Open source libraries
https://github.com/spotify/

  • Scio: a Scala API for Apache Beam and Google Cloud Dataflow
  • Featran: a Scala feature transformation library for data science and machine learning
  • Elitzur: a data validation toolkit
  • Tfreader: TensorFlow TFRecord reader CLI tool
  • Magnolify: a collection of Magnolia add-on modules
  • ratatool: a tool for data sampling, data generation, and data diffing
  • noether: Scala aggregators used for ML model metrics monitoring
  • ...

18 of 37

After 6 years...

19 of 37

Scala delivers, but...

20 of 37

Hiring is hard

  • Not specific to Scala
  • Few engineers have data-engineering experience
  • Few engineers have professional Scala experience
  • No one will have experience with our exact stack
    (Scala + Scio + GCP + Luigi + Styx...)

21 of 37

Hiring is hard

  • Data-engineering training
    No Scala training specifically
  • Golden Path for data engineering
    Tutorial: create real data pipelines. Test, deploy and monitor them. Publish datasets.
  • Tech Universities
    Hands-on technical courses, from introductory to advanced topics, 5 days each

22 of 37

Data-scientists :(

  • Unfamiliar
  • *Really* different from Python
    Optional brace syntax might help a little?
  • Slower feedback loop
  • Many “confusing” features
    Implicits, type system, macros...

23 of 37

It also has technical issues...

24 of 37

Serialization issues

  • Not specific to Scala…
  • … but Scala makes it worse
    (immutability = many many closures)
  • Leaks many implementation details
  • Nothing we can do about it
  • We just provide technical support when a problem arises

25 of 37

Binary & source compat (internally)

  • We currently support 2.12 and 2.13 internally
  • All of our Scala libraries need to be cross-published
  • 96.2% of our jobs are still running on 2.12
    • 2.1% are running on 2.13
    • 1.7% are still running on 2.11
  • Migrating takes (our users) time
    No incentive to do it
  • Scala 3 is coming…
    Drop 2.12 support? Support 2.12, 2.13 and 3.x?
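
For readers unfamiliar with cross-publishing, a minimal build.sbt sketch of what supporting 2.12 and 2.13 means for one library. The project name and exact Scala patch versions are assumptions chosen for illustration, not the internal build.

```scala
// build.sbt (sketch): names and version numbers are illustrative assumptions
lazy val exampleLib = (project in file("example-lib"))
  .settings(
    name := "example-lib",                           // hypothetical library name
    scalaVersion := "2.12.13",                       // default version for day-to-day work
    crossScalaVersions := Seq("2.12.13", "2.13.4")   // every version the library is published for
  )
```

Running `sbt +test +publish` then compiles, tests and publishes once per listed version. Code that differs between versions can live in `src/main/scala-2.12` and `src/main/scala-2.13`. Each extra entry multiplies build time and maintenance, which is the cost the bullets above describe.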

26 of 37

Binary & source compat (externally)

  • Libraries have different life cycles
    Fast, slow, defunct
  • Breaking changes lead to resistance
    Scala version updates
  • Unmaintained libraries and sbt plugins
  • Forces EVERYONE to re-work / re-publish at the same time
    Hopefully TASTy will fix this

27 of 37

Dependency conflicts

  • Not specific to Scala…
    ❯ cs resolve org.apache.beam:beam-sdks-java-io-google-cloud-platform:2.27.0
    Resolution error: Conflicting dependencies:
    io.grpc:grpc-core:1.32.1 or 1.32.2 or 1.33.1 or 1.34.0 or [1.32.2] or [1.34.0] wanted by [...]
  • SBT / Coursier resolution ≠ Maven
  • Always caused by the same libs (grpc, netty, guava, ...)
  • versionReconciliation may help?
  • Submitted a proposal: 020-sbt-transitive-dependencies-conflicts
    https://github.com/scalacenter/sbt-missinglink
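
One common workaround for a conflict like the grpc-core one above is to pin a single version in the build. This is a minimal sketch using sbt's `dependencyOverrides` setting; the version picked here is only an assumption for illustration, and the slides do not say which approach is used internally.

```scala
// build.sbt (sketch): force one version of a transitively conflicting library.
// "1.34.0" is an illustrative choice, not a recommendation.
dependencyOverrides ++= Seq(
  "io.grpc" % "grpc-core" % "1.34.0"
)
```

An override only picks a winner; it does not prove that every library wanting an older version still works with it. `sbt evicted` lists which versions were dropped during resolution.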

28 of 37

We can address those...

29 of 37

Monorepo for libraries

  • Common DE Scala / Java internal libraries
  • Atomic publish
  • Each lib is owned by one team
  • One team (ours) owns the build
  • Makes cross-publishing bearable
  • Ensures consistency and compatibility
    (library compatibility, transitive dependencies, versioning scheme, code style, etc.)

30 of 37

SBT plugin

  • Base configuration for data engineering @Spotify using Scio
    (including sbt plugins)
  • Solves most dependency conflicts
  • Strong separation between common and project specific options
  • Still lets you customize every setting
  • Lives in the Monorepo
  • Easy upgrade
    Just upgrade the plugin. Also upgrades Docker image, Java and Python versions, etc.
  • Consistency across projects
  • Scala Steward ❤️
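
The internal plugin is not shown in the slides. As a rough idea of how a "base configuration" plugin can work, here is a minimal sketch built on sbt's AutoPlugin API. The plugin name and every default in it are hypothetical.

```scala
// project/DataEngDefaultsPlugin.scala (hypothetical name; a sketch, not Spotify's plugin)
import sbt._
import sbt.Keys._

object DataEngDefaultsPlugin extends AutoPlugin {
  // Apply automatically to every project in the build.
  override def trigger = allRequirements

  // Common defaults; a project can still override any of them in its own build.sbt.
  override def projectSettings: Seq[Setting[_]] = Seq(
    scalaVersion := "2.12.13",                                  // illustrative version
    scalacOptions ++= Seq("-deprecation", "-feature"),
    dependencyOverrides += "io.grpc" % "grpc-core" % "1.34.0"   // illustrative pin
  )
}
```

Because the shared defaults ship as one versioned artifact, upgrading a pipeline becomes a single version bump in `project/plugins.sbt`, which is the kind of change a tool like Scala Steward can propose automatically.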

31 of 37

Scala 3: yea or nay?

32 of 37

Exciting new features...

  • New types
    Union, intersection, match and lambda types; dependent and polymorphic function types
  • Typeclass derivation
  • Better macros
  • TASTy
  • Multiversal Equality
  • Opaque type aliases
  • Runtime Multi-Stage Programming
  • ...

33 of 37

Unclear cost / benefit

  • Changed keywords?
    given / using / summon / extension vs implicits
  • Wildcard imports with *?
  • Optional braces?
  • Renaming imports?
    (=> vs as)
  • ...
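
For reference, the syntax changes listed above look roughly like this side by side. The sketch requires a Scala 3 compiler; the Scala 2 spelling is given in comments. Show and SyntaxDemo are hypothetical names used only for this example.

```scala
// Requires a Scala 3 compiler; the Scala 2 spelling is shown in comments.
import scala.collection.mutable as mut // Scala 2: import scala.collection.{mutable => mut}
import scala.util.*                    // Scala 2: import scala.util._

trait Show[A]:                         // optional braces: indentation delimits the body
  def show(a: A): String

object Show:
  // Scala 2: implicit val intShow: Show[Int] = ...
  given intShow: Show[Int] = new Show[Int] {
    def show(a: Int): String = s"Int($a)"
  }

object SyntaxDemo:
  // Scala 2: def render[A](a: A)(implicit s: Show[A]): String
  def render[A](a: A)(using s: Show[A]): String = s.show(a)

  // Scala 2: implicit class IntOps(n: Int) { def doubled: Int = n * 2 }
  extension (n: Int) def doubled: Int = n * 2

  // Scala 2: implicitly[Show[Int]]
  def viaSummon: String = summon[Show[Int]].show(1)

  def demo(): String =
    val buf = mut.ListBuffer(render(21.doubled))
    buf += Try(viaSummon).getOrElse("n/a")
    buf.mkString(", ")
```

Nothing here does anything the Scala 2 version could not do. The differences are spelling only, which is why the cost / benefit of these particular changes is hard to argue to a team with thousands of existing pipelines.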

34 of 37

Adoption will be slow

  • Cross-building is a lot of work
    Supporting 2.12, 2.13, 3.0, 3.x is just too much
  • Need dependencies to be cross-published too
  • withDottyCompat and transitive deps?
  • Creates fragmentation internally
  • Some pipelines will probably never be upgraded
    Production code running unsupported versions 🤢
  • “Why should we migrate?”
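
As an illustration of the withDottyCompat question: a Scala 3 build can consume a library that is only published for 2.13. The sketch below assumes sbt 1.5 or later, where `CrossVersion.for3Use2_13` plays the role `withDottyCompat` had in the sbt-dotty plugin. The library coordinates are hypothetical.

```scala
// build.sbt (sketch, assumes sbt 1.5 or later)
// "com.example" %% "some-lib" is a hypothetical library published only for Scala 2.13.
libraryDependencies +=
  ("com.example" %% "some-lib" % "1.0.0").cross(CrossVersion.for3Use2_13)
```

This helps for direct dependencies, but the transitive ones remain the open question: if another dependency pulls in the `_3` build of a library that `some-lib` brings in as `_2.13`, both end up on the classpath and conflict.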

35 of 37

We’re trying to help

  • Collaboration with the Scala Center to test Scio + Scala 3
  • Bug reports
  • Feedback on migration guide
  • Try to find alternatives to dropped features
    Macro annotations...

36 of 37

github.com/spotify/scio

spotify.github.io/scio

@skaalf / @regadas

spotifyjobs.com
