Fraida Fund
Teaching and Learning ML Ops/Systems on Chameleon
NYU
Supported by: NSF OAC-2230079
Learning Intro ML
Real-world ML Systems
Executive Summary
Machine Learning Systems Engineering and Operations: zoom out from the model-centric view to build and operate large-scale ML systems.
The vision
Weekly lab assignments on Chameleon, followed by an open-ended project implemented in groups of 3-4.
ML systems, cloud computing, DevOps/MLOps, data systems, large scale training, operationalizing model training, serving, evaluation & monitoring, safeguarding, commercial clouds
The *core* topics
The hands-on elements
Image source: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden technical debt in machine learning systems. NeurIPS ’15.
Cloud
DevOps
Data
Model training
Model serving
Evaluation and monitoring
Compose
Tech Stack for Lab Assignments
109,837 instance hours on Chameleon for lab assignments in Spring 2025
Estimated cost on commercial cloud: $110-$125/student
"Most expensive" student would cost > $600
Open-ended project usage
76,855 hours compute (5,466 hours GPU)
9 TB block storage volumes
1.5 TB object storage
Things that *didn't* work right out of the box…
Course materials available at: