1 of 15

PA1 Discussion

An Introduction to Ray and Modin

2 of 15

Revisiting the Why

  • “I just want to be a data scientist, why should I learn all of this?”
  • Your job will most likely involve dealing with BIG data (think: TBs)
  • This course: learning to access and process big data
  • PA0: Understanding how to parallelize computations
  • Two takeaways:
    1. Work with Modin, but just write Pandas code.
      • Understand how to write efficient code.
    2. Work with Ray: understand how Ray can parallelize workloads

3 of 15

Why learn Ray?

4 of 15

Why learn Ray?

5 of 15

Why learn Ray?

6 of 15

What is Ray?

  • Open-source unified compute framework built to scale AI/Python workloads.
  • Provides data scientists an abstraction over distributed systems components; Scheduling, Scaling, Fault Tolerance etc.
  • Key Principles -

7 of 15

Ray Libraries

8 of 15

Ray Cluster

  • Worker nodes connected to a common Ray node. ray.init() -> Connect to the cluster.
  • Fixed-size or can be autoscaled.

9 of 15

Ray Core

  • The “scalable” part; build distributed systems that can run locally or on clusters.
  • Three main abstractions -
    • Tasks- Stateless Python functions.
    • Actors - Stateful Python classes.
    • Objects - Immutable objects that can be accessed anywhere on the cluster.

10 of 15

Ray Core- Tasks

  • Python functions that are executed asynchronously on separate workers within the Ray cluster.
  • Python functions that can gain from parallelization -> Convert to tasks.

@ray.remote

def square(x):

return x * x

# Launch four parallel square tasks.

futures = [square.remote(i) for i in range(4)]

# .remote() returns an ObjectRef

# Retrieve results.

print(ray.get(futures))

11 of 15

Ray Core - Actors

  • Tasks : Functions :: Actors : Classes
  • Stateful worker, it’s methods can access and change it’s state.

# Define the Counter actor.

@ray.remote

class Counter:

def __init__(self):

self.i = 0

def get(self):

return self.i

def incr(self, value):

self.i += value

# Create a Counter actor.

c = Counter.remote()

for _ in range(10):

c.incr.remote(1)

# Retrieve final actor state.

print(ray.get(c.get.remote()))

12 of 15

Ray Core - Objects & Object Store

  • Objects - Immutable and can be accessed from anywhere in the cluster, using ObjectRef.
  • ObjectRef - Pointers to refer to remote objects.
  • Object Store - Store remote objects. Ensures objects shared amongst workers.

13 of 15

Ray AI Runtime (AIR)

14 of 15

Modin- Multi-Core Pandas

  • Pandas drop replacement for multicore settings - speed up workflows by scaling up Pandas to use all available cores.
  • Works on different compute engines e.g Ray, Dask, Unidist.
  • Covers ~ 90% of the entire Pandas API

15 of 15

References

  1. https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb