1 of 16

Fraida Fund

Teaching and Learning ML Ops/Systems on Chameleon

NYU

Supported by: NSF OAC-2230079

2 of 16

Learning Intro ML

3 of 16

Real-world ML Systems

4 of 16

Executive Summary

The vision

Machine Learning Systems Engineering and Operations: zoom out from the model-centric view to build and operate large-scale ML systems.

The *core* topics

ML systems, cloud computing, DevOps/MLOps, data systems, large-scale training, operationalizing model training, serving, evaluation and monitoring, safeguarding, commercial clouds

The hands-on elements

Weekly lab assignments on Chameleon, followed by an open-ended project implemented in groups of 3-4.

Image source: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden technical debt in machine learning systems. NeurIPS ’15.

10 of 16

Tech Stack for Lab Assignments

Cloud

DevOps

Data

Model training

Model serving

Evaluation and monitoring

Compose

11 of 16

109,837 instance hours on Chameleon for lab assignments in Spring 2025

12 of 16

Estimated cost on commercial cloud: $110-$125/student

The "most expensive" student would cost > $600

13 of 16

Open-ended project usage

76,855 hours compute (5,466 hours GPU)

9 TB block storage volumes

1.5 TB object storage

14 of 16

Things that *didn't* work right out of the box…

15 of 16

  • 2nd iteration: Spring 2026
  • 110 students
  • Labs are more robust
  • Project structure is improved
  • New topics: LLMOps, RAG, agents

16 of 16

Course materials available at: