Distributed Systems and AI
A Growing Number of Use Cases
The Machine Learning Ecosystem
A machine learning application spans many workloads: Training, Data Processing, Streaming, RL, Model Serving, and Hyperparameter Search.
Today, each workload is served by its own specialized distributed system:
Training: Horovod, Distributed TF, Parameter Server
Data Processing: MapReduce, Hadoop, Spark
Streaming: Flink, many others
RL: Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL
Model Serving: Clipper, TensorFlow Serving
Hyperparameter Search: Vizier, many internal systems at companies
Why distributed systems?
Aspects of a distributed system
Why is this a problem?
What is Ray?
Machine Learning Ecosystem
With Ray: a single Distributed System (Ray), with Libraries for Training, Data Processing, Streaming, RL, Model Serving, and Hyperparameter Search on top.
This requires a very general underlying distributed system.
Generality comes from tasks (functions) and actors (classes).
Use Case: Online Machine Learning
Ray API
Functions -> Tasks

import numpy as np
import ray

@ray.remote
def read_array(file):
    # read array "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote([5, 5])
id2 = read_array.remote([5, 5])
id3 = add.remote(id1, id2)
ray.get(id3)

[Figure: the resulting task graph. The two read_array tasks produce id1 and id2; add consumes both and produces id3.]
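A minimal runnable sketch of the same program (assumptions: the elided file read is stood in for with np.zeros, matching the "zeros" labels in the slide's task graph, so the [5, 5] argument acts as an array shape; ray.init() starts a local single-node Ray instance):

import numpy as np
import ray

ray.init()

@ray.remote
def read_array(shape):
    # stand-in for reading an array from a file
    return np.zeros(shape)

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote([5, 5])  # returns an object ID immediately (non-blocking)
id2 = read_array.remote([5, 5])  # the two tasks can run in parallel
id3 = add.remote(id1, id2)       # object IDs can be passed to other tasks as arguments
print(ray.get(id3))              # blocks until the result is ready: a 5x5 array of zeros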
Ray API
Classes -> Actors

@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()
ray.get([id4, id5])  # returns [1, 2]: methods on one actor run serially, in call order
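Separate actors are independent processes, so distinct actors run in parallel even though each actor's own methods are serialized. A small sketch (assuming the cluster has enough GPUs to satisfy the num_gpus=1 annotation above; remove that annotation to run on CPUs alone):

counters = [Counter.remote() for _ in range(4)]  # four independent actor processes
ids = [c.inc.remote() for c in counters]         # the four calls execute in parallel
print(ray.get(ids))                              # [1, 1, 1, 1]; each actor has its own state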
Parallelism Example: Tree Reduction
Add a bunch of values (2 at a time).
[Figure: two schedules for summing the values 1 through 8 with pairwise add tasks. In the first, each add consumes the previous add's result, forming a serial chain (id1 through id7 in sequence). In the second, the adds form a balanced tree: four independent adds, then two, then one.]
What does the difference look like in code? See the sketch below.
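A sketch of the two loops, continuing the session above (ray initialized, add defined); this reconstructs the answer the deck builds toward and is an assumption, not the deck's own slide. Note that remote tasks accept plain values and object IDs interchangeably:

values = [1, 2, 3, 4, 5, 6, 7, 8]
# Serial chain: the new result goes to the front of the list, so every add
# depends on the previous one. All 7 adds execute one after another.
while len(values) > 1:
    values = [add.remote(values[0], values[1])] + values[2:]
print(ray.get(values[0]))  # 36

values = [1, 2, 3, 4, 5, 6, 7, 8]
# Balanced tree: the new result goes to the back of the list, so the adds
# within each level are independent. Still 7 adds, but the dependency depth
# is log2(8) = 3, and each level runs in parallel.
while len(values) > 1:
    values = values[2:] + [add.remote(values[0], values[1])]
print(ray.get(values[0]))  # 36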
Actors: Parameter Server Example

@ray.remote
class ParameterServer(object):
    def __init__(self):
        self.params = np.zeros(10)

    def get_params(self):
        return self.params

    def update_params(self, grad):
        self.params -= grad

@ray.remote(num_gpus=1)
def worker(ps):
    while True:
        params = ray.get(ps.get_params.remote())
        grad = ...  # Use TensorFlow to compute a gradient from params
        ps.update_params.remote(grad)

Many workers share the one ParameterServer actor: each repeatedly pulls the current parameters, computes a gradient, and pushes an update, as in the driver sketch below.
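A sketch of the driver that wires this together (the count of three workers matches the slide's diagram and is otherwise arbitrary):

ps = ParameterServer.remote()  # one parameter server actor
for _ in range(3):
    worker.remote(ps)          # each worker task holds a handle to the same actor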
Ray Architecture
[Figure: a cluster of three nodes, one of which also runs the Driver. Each node runs Worker processes, an Object Store, and a Raylet containing a Scheduler and an Object Manager. A replicated Global Control Store holds the cluster's metadata and feeds Debugging Tools, Profiling Tools, and a Web UI.]
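The per-node object store is also exposed directly in the API; a small sketch (the array contents are arbitrary):

x_id = ray.put(np.ones(100))  # place an object in the local shared-memory object store
x = ray.get(x_id)             # read it back; for numpy arrays on the same node this is zero-copy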
How does this work under the hood?
Recall the task graph from the API example: two read_array tasks produce id1 and id2, and add consumes them to produce id3.
[Figure: the graph executing on the architecture. The driver submits the tasks; the schedulers assign the two read_array tasks to workers, possibly on different nodes, and each worker puts its result (obj1, obj2) into its node's object store. The add task is scheduled on a node that already holds one input; the object managers transfer the missing input (obj2) between object stores; the worker runs add and stores obj3; and ray.get(id3) delivers obj3 back to the driver. Throughout, the Global Control Store tracks where each object and task lives.]
Conclusion
Before: one specialized distributed system per workload (Training, Data Processing, Streaming, RL, Model Serving, Hyperparameter Search).
With Ray: one Distributed System (Ray), with Libraries covering the same workloads.