1 of 99

Distributed Systems and AI

Ray: Programming at Any Scale

https://github.com/ray-project/ray

Robert Nishihara

2 of 99

A Growing Number of Use Cases


3 of 99

The Machine Learning Ecosystem

[Diagram, built up over several slides: the machine learning ecosystem comprises Training, RL, Model Serving, Hyperparameter Search, Data Processing, and Streaming.]

20 of 99

The Machine Learning Ecosystem

[Diagram: each workload in the ecosystem is backed by its own special-purpose distributed system.]

  • Training: Horovod, Distributed TF, Parameter Server
  • Model Serving: Clipper, TensorFlow Serving
  • Streaming: Flink, many others
  • RL: Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL
  • Data Processing: MapReduce, Hadoop, Spark
  • Hyperparameter Search: Vizier, many internal systems at companies

22 of 99

The Machine Learning Ecosystem

Distributed System

Distributed System

Distributed System

Distributed System

Training

Data Processing

Streaming

RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF, Parameter Server

Clipper, TensorFlow Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Why distributed systems?

23 of 99

The Machine Learning Ecosystem

Distributed System

Distributed System

Distributed System

Distributed System

Training

Data Processing

Streaming

RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF, Parameter Server

Clipper, TensorFlow Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Why distributed systems?

  • More work (computation) than one machine can do in a reasonable amount of time.
  • More data than can fit in one machine.

24 of 99

The Machine Learning Ecosystem

Machine Learning Ecosystem

Distributed System

Distributed System

Distributed System

Distributed System

Training

Data Processing

Streaming

RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF, Parameter Server

Clipper, TensorFlow Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Aspects of a distributed system

Why distributed systems?

  • More work (computation) than one machine can do in a reasonable amount of time.
  • More data than can fit in one machine.

25 of 99

The Machine Learning Ecosystem

Machine Learning Ecosystem

Distributed System

Distributed System

Distributed System

Distributed System

Training

Data Processing

Streaming

RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF, Parameter Server

Clipper, TensorFlow Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Aspects of a distributed system

  • Units of work “tasks” executed in parallel
  • Scheduling (which tasks run on which machines and when)
  • Data transfer
  • Failure handling
  • Resource management (CPUs, GPUs, memory)

Why distributed systems?

  • More work (computation) than one machine can do in a reasonable amount of time.
  • More data than can fit in one machine.

26 of 99

The Machine Learning Ecosystem

Machine Learning Ecosystem

Distributed System

Distributed System

Distributed System

Distributed System

Training

Data Processing

Streaming

RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF, Parameter Server

Clipper, TensorFlow Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Aspects of a distributed system

  • Units of work “tasks” executed in parallel
  • Scheduling (which tasks run on which machines and when)
  • Data transfer
  • Failure handling
  • Resource management (CPUs, GPUs, memory)

Why distributed systems?

  • More work (computation) than one machine can do in a reasonable amount of time.
  • More data than can fit in one machine.

Why is this a problem?

27 of 99

What is Ray?

Machine Learning Ecosystem

Training

Data Processing

Streaming

RL

Model Serving

Hyperparameter Search

Aspects of a distributed system

  • Units of work “tasks” executed in parallel
  • Scheduling (which tasks run on which machines and when)
  • Data transfer
  • Failure handling
  • Resource management (CPUs, GPUs, memory)

Why distributed systems?

  • More work (computation) than one machine can do in a reasonable amount of time.
  • More data than can fit in one machine.

28 of 99

What is Ray?

Machine Learning Ecosystem

Distributed System (Ray)

Training

Data Processing

Streaming

RL

Model Serving

Hyperparameter Search

Aspects of a distributed system

  • Units of work “tasks” executed in parallel
  • Scheduling (which tasks run on which machines and when)
  • Data transfer
  • Failure handling
  • Resource management (CPUs, GPUs, memory)

Why distributed systems?

  • More work (computation) than one machine can do in a reasonable amount of time.
  • More data than can fit in one machine.

29 of 99

What is Ray?

Machine Learning Ecosystem

Distributed System (Ray)

Libraries

Training

Data Processing

Streaming

RL

Model Serving

Hyperparameter Search

Aspects of a distributed system

  • Units of work “tasks” executed in parallel
  • Scheduling (which tasks run on which machines and when)
  • Data transfer
  • Failure handling
  • Resource management (CPUs, GPUs, memory)

Why distributed systems?

  • More work (computation) than one machine can do in a reasonable amount of time.
  • More data than can fit in one machine.

30 of 99

Distributed System

Distributed System

Distributed System

Distributed System

Training

Data Processing

Streaming

RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF, Parameter Server

Clipper, TensorFlow Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

31 of 99

Distributed System

Distributed System

Distributed System

Distributed System

Training

Data Processing

Streaming

RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF, Parameter Server

Clipper, TensorFlow Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

Distributed System (Ray)

Libraries

Training

Data Processing

Streaming

RL

Model Serving

Hyperparameter Search

32 of 99

Distributed System

Distributed System

Distributed System

Distributed System

Training

Data Processing

Streaming

RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF, Parameter Server

Clipper, TensorFlow Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

This requires a very general underlying distributed system.

Distributed System (Ray)

Libraries

Training

Data Processing

Streaming

RL

Model Serving

Hyperparameter Search

33 of 99

Distributed System (Ray)

Libraries

Training

Data Processing

Streaming

RL

Model Serving

Hyperparameter Search

Distributed System

Distributed System

Distributed System

Distributed System

Training

Data Processing

Streaming

RL

Distributed System

Model Serving

Distributed System

Hyperparameter Search

Horovod, Distributed TF, Parameter Server

Clipper, TensorFlow Serving

Flink, many others

Baselines, RLlab, ELF, Coach, TensorForce, ChainerRL

MapReduce, Hadoop, Spark

Vizier, many internal systems at companies

This requires a very general underlying distributed system.

Generality comes from tasks (functions) and actors (classes).

34 of 99

Use Case: Online Machine Learning

  • 3 minutes from feature/label arrival to model output (streaming + model training)
  • 5 minutes from feature/label arrival to model deployment (streaming + training + serving)
  • 5% CTR improvement compared to the offline model; 1% CTR improvement compared to the Blink-based solution

35 of 99

Ray API

Functions -> Tasks

@ray.remote
def read_array(file):
    # read array "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote([5, 5])
id2 = read_array.remote([5, 5])
id3 = add.remote(id1, id2)
ray.get(id3)

[Diagram: the resulting task graph. Two read_array tasks produce id1 and id2; add consumes them and produces id3.]

Classes -> Actors

@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()
ray.get([id4, id5])
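The fragments above omit setup. A minimal self-contained sketch of the tasks example (the imports, the ray.init() call, and the stand-in np.zeros body for read_array are assumptions added here so it runs end to end):

import numpy as np
import ray

ray.init()  # start (or connect to) Ray before submitting any tasks

@ray.remote
def read_array(file):
    # Stand-in body: treat the "file" argument as the shape of a zeros array.
    return np.zeros(file)

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote([5, 5])   # returns an object ID immediately; the task runs asynchronously
id2 = read_array.remote([5, 5])
id3 = add.remote(id1, id2)        # object IDs can be passed straight into other tasks
print(ray.get(id3))               # block until the result is ready and fetch it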

44 of 99

Parallelism Example: Tree Reduction

Add a bunch of values (2 at a time).

[Diagram, built up over several slides: eight values, 1 through 8, reduced with seven add tasks submitted one at a time, each consuming the result of the previous add, so the task graph is a chain of depth seven.]

[Diagram, built up over several slides: the same eight values reduced as a balanced tree: four adds run in parallel, then two, then one, so the task graph has depth three.]

What does the difference look like in code?
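One possible way to write the two schedules, assuming the add task from the Ray API section (the ray.init() call and the np.ones inputs are illustrative assumptions):

import numpy as np
import ray

ray.init()

@ray.remote
def add(a, b):
    return np.add(a, b)

# Serial: put each partial sum at the FRONT of the list, so every add
# waits on the result of the previous one. Eight values -> a chain of depth 7.
values = [np.ones(10) for _ in range(8)]
while len(values) > 1:
    values = [add.remote(values[0], values[1])] + values[2:]
print(ray.get(values[0]))

# Tree: put each partial sum at the BACK of the list, so fresh pairs are
# combined first and the adds form a balanced tree of depth 3.
values = [np.ones(10) for _ in range(8)]
while len(values) > 1:
    values = values[2:] + [add.remote(values[0], values[1])]
print(ray.get(values[0]))

The only difference is where each new object ID goes in the list, but it changes the depth of the task graph from seven to three, which is exactly the difference between the two diagrams above.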

72 of 99

Actors: Parameter Server Example

@ray.remote
class ParameterServer(object):
    def __init__(self):
        self.params = np.zeros(10)

    def get_params(self):
        return self.params

    def update_params(self, grad):
        self.params -= grad

@ray.remote(num_gpus=1)
def worker(ps):
    while True:
        params = ray.get(ps.get_params.remote())
        grad = ...  # Use TensorFlow
        ps.update_params.remote(grad)

[Diagram, built up over several slides: many worker tasks pull parameters from and push gradients to the parameter server actor; the same pattern extends to multiple parameter server shards.]
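A sketch of how a driver might wire this together (the worker count of 3 is an illustrative assumption; each worker task claims one GPU because of num_gpus=1):

ps = ParameterServer.remote()                     # create the parameter server actor
workers = [worker.remote(ps) for _ in range(3)]   # launch three workers; each loops forever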

76 of 99

Ray Architecture

[Diagram, built up over several slides: a cluster of machines (Node 1, Node 2, Node 3). Each node runs worker processes (the driver runs alongside the workers on one node), a shared-memory Object Store, and a Raylet containing a Scheduler and an Object Manager. A Global Control Store holds cluster metadata and backs the debugging tools, profiling tools, and web UI.]
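As an aside on how a driver attaches to this architecture: ray.init() with no arguments starts all of these components on one machine, while connecting to an existing cluster takes the head node's address (the keyword argument shown matches Ray's API around the time of this talk and has since been renamed):

import ray

ray.init()  # single machine: start the object store, raylet, and GCS locally

# Cluster: connect to a running head node instead (address is illustrative).
# ray.init(redis_address="<head-node-ip>:6379")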

82 of 99

How does this work under the hood?

@ray.remote
def read_array(file):
    # read array "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote([5, 5])
id2 = read_array.remote([5, 5])
id3 = add.remote(id1, id2)
ray.get(id3)

[Diagram: the task graph for this program. Two read_array tasks produce id1 and id2; add consumes them and produces id3.]

85 of 99

The Ray Architecture

[Diagram, stepped through over several slides: the driver submits the task graph (read_array, read_array, add, producing id1, id2, id3). The schedulers assign the two read_array tasks to workers on different nodes, and their results (obj1, obj2) are written to the local object stores. To run add on one node, the object manager copies obj2 over from the other node's object store; add then executes and writes its result, obj3. ray.get(id3) resolves id3 to obj3 and returns it to the driver.]

98 of 99

Conclusion

  • Ray is an open source project for distributed computing.
  • It replaces many special-purpose distributed systems with one general-purpose distributed system.
  • It supports the full ML lifecycle: data collection, training, simulation, and serving.
