1 of 53

Alnair Group Meeting Notes

Starting 2022 May

2 of 53

Blank on purpose

3 of 53

1/25/2023

  • 130 release feature
    • Effectiveness improvement
      • Optimizer saving
      • Data shuffling
    • Efficiency improvement
      • DALI pipeline (see the data-loading sketch after this list)
      • Scheduler policy
  • Ongoing tasks
    • Async function setting, Alluxio testing, fast start (serverless optimized container)
    • RDMA kubernetes enabling
  • 430 planning
  • Today’s order
    • Ning, Yaohui, Nikunj, Steven, Huide
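
As a reference for the DALI pipeline item above, here is a minimal sketch of a GPU-accelerated image-loading pipeline feeding PyTorch. It assumes a recent DALI release is installed; the data directory, image size, and batch size are placeholders, not values from the project.

```python
# Minimal DALI pipeline sketch (assumes nvidia-dali is installed; paths/sizes are placeholders).
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def train_pipeline(data_dir):
    # Read and shuffle JPEG files, decode on the GPU ("mixed"), resize, normalize.
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT, output_layout="CHW")
    return images, labels.gpu()

pipe = train_pipeline("/data/train", batch_size=64, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator([pipe], ["data", "label"], reader_name="Reader")

for batch in loader:                       # one dict per pipeline
    images, labels = batch[0]["data"], batch[0]["label"]
    break                                  # training step would go here
```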

4 of 53

1/11

  • 2023 PBC (2022 review due this Friday), strategy, foundation model …
    • PBC: 1) project perspective 3) personal career
  • Formal project management (3 releases a year, 430, 730, 1030)
    • Research/investigation topic also needs goal and deliverables
    • Insights report in Dec
    • Efficient updates to Sponsor
  • 130 release
    • 1st version Fission based serverless ML framework
      • Epoch based resource reallocation/optimization
      • Function container with GPU access
      • Redis for data and model exchange (Alluxio as an option; DAG dependency); see the Redis sketch after this list
      • Stretch: decouple the data preprocessing function, and save and distribute the preprocessed data
    • Collect performance/accuracy metrics; compare with PyTorch DDP and the KubeML reference
  • Individual updates
    • Ning → Huide → Yaohui → Ke → Steven → Nikunj
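
A minimal sketch of the Redis-based model/data exchange mentioned above, assuming redis-py is installed and a Redis service is reachable; the host name and key names are placeholders, not the project's actual configuration.

```python
# Hypothetical sketch: exchange PyTorch model state between serverless functions via Redis.
import io
import redis
import torch

r = redis.Redis(host="redis.alnair.svc", port=6379)   # placeholder service address

def push_state(key: str, model: torch.nn.Module) -> None:
    """Serialize the model's state_dict to bytes and store it under `key`."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    r.set(key, buf.getvalue())

def pull_state(key: str, model: torch.nn.Module) -> None:
    """Load the latest state_dict from Redis into `model`, if one exists."""
    raw = r.get(key)
    if raw is not None:
        model.load_state_dict(torch.load(io.BytesIO(raw)))

# A training function would pull the current weights, run an epoch, then push them back:
#   pull_state("model:latest", model); train_one_epoch(model); push_state("model:latest", model)
```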

5 of 53

Questions

RDMA solution performance comparison

  • SR-IOV vs. macvlan

Kubernetes CNI is a big project; don't try to rewrite it

  • Investigate the changes/configuration to enable RDMA CNI

End-to-end GPU deep learning training, CUDA package rewrite

  • Reference existing DL example
  • Task breakdown, and timeline planning
  • Our position on DALI; any comparison plan?

6 of 53

12/20

  • Individual updates
    • Steven → Ke → Ning → Yaohui → Nikunj

7 of 53

12/14

  • Holiday celebration, Seattle (Thursday), Santa Clara(Friday)
  • 3 days a week in office next year
  • Aritra passed Ph.D. prelim exam “Applications of Predictive Analytics on Cloud Platform”
  • Sponsor Topic alignment
    • AI platform engineering
    • Special container for AI and Big Data
    • Data lake/warehouse (safely execute user-defined functions on the warehouse, WebAssembly)
    • Data exchange/governance, confidential computing
  • 130 release
    • Fission based Serverless Training
      1. Break data preprocessing and training into two functions (CV image and test case) (Yaohui/Steven)
      2. Break training by epoch, adjust function counts, optimize aggregation (Ning/Aritra); see the per-epoch sketch after this list
      3. Function container supports RDMA (note: only one container per NIC) (Huide)
      4. Support pod fast start with pod checkpoint or CRIU (Nikunj)
      5. Provide an in-memory distributed storage service for models/data/images (Redis/Alluxio/…?) (Ke/Nikunj/Ning)
  • Individual updates
    • Nikunj → Steven → Ke → Huide → Ning → (Aritra)
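
For item 2 above (breaking training by epoch), a minimal sketch of a per-epoch function that resumes from a shared checkpoint, trains one epoch, and writes the checkpoint back. The load_ckpt/save_ckpt callables are placeholders for whichever backend (Redis, Alluxio, ...) ends up being used.

```python
# Hypothetical per-epoch training function; the storage backend is abstracted away.
import torch

def run_one_epoch(model, optimizer, data_loader, load_ckpt, save_ckpt, epoch: int):
    state = load_ckpt(epoch - 1)                     # resume from the previous epoch, if any
    if state is not None:
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])

    model.train()
    for inputs, targets in data_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()

    # Persist state so the next function invocation (possibly with a different
    # number of workers) can continue from here.
    save_ckpt(epoch, {"model": model.state_dict(), "optimizer": optimizer.state_dict()})
```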

8 of 53

12/7

  • AWS reinvent
    • SageMaker ML architecture (Steven); large model, distributed training (Ning); EMR (Zhaobo)
    • Asynchronous, serverless (full data stack), customer cost reduction
  • Serverless AI platform
    • Generic services: task orchestration, storage service for workers' data exchange, data caching
    • AI-specific functions, e.g., data preprocessing, gradient/model aggregation (see the aggregation sketch after this list)
    • Advanced infra: secure container runtime, RDMA, NVMe
  • Weekly update (first person <= 20 mins, others <= 10 mins)
    • Ning → Nikunj → Steven → Ke → Huide → Yaohui
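
A minimal sketch of the gradient/model aggregation function mentioned above: it averages the state dicts produced by N worker functions. How the worker states are fetched (Redis, Alluxio, ...) is left out; the pull_state_dict helper name is a placeholder.

```python
# Hypothetical aggregation step: elementwise average of worker state_dicts.
import torch

def average_states(worker_states):
    """worker_states: list of state_dicts with identical keys and shapes."""
    avg = {}
    for key in worker_states[0]:
        stacked = torch.stack([s[key].float() for s in worker_states])
        avg[key] = stacked.mean(dim=0)
    return avg

# aggregated = average_states([pull_state_dict(f"worker:{i}") for i in range(n_workers)])
```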

9 of 53

11/30

  • Cloud lab other relevant projects
    • Quark container runtime, with RDMA service (CNI)
    • Fornax-serverless platform
  • Weekly update
    • Yaohui → Huide → Ke → Ning → Nikunj → Steven

10 of 53

11/23

  • AWS re:Invent (leadership sessions, breakout sessions)
  • Deliverables on next release 130
  • Next year's goals
  • Weekly update
    • Steven → Aritra → Ning → Nikunj → Huide → Ke

11 of 53

Argo workflow

  1. Install argo-workflow (argo-server, workflow-controller)
  2. Install argo cli
  3. Write workflow template
    1. Steps (sequential, parallel)
    2. Graph (dependency)
  4. Submit workflow “argo submit workflow.yaml”

  • Orchestrate container/function execution order, “with output sharing”
  • Test scheduling decisions (job execution orders) and compare average job completion time; see the sketch below
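
A toy sketch of that comparison: given per-job runtimes on a single device, compute the average job completion time for two candidate execution orders. The runtimes are made up for illustration.

```python
# Toy example: average job completion time (JCT) under different execution orders.
def average_jct(durations):
    """Jobs run back-to-back; a job's completion time is the sum of runtimes up to and including it."""
    finish, total = 0.0, 0.0
    for d in durations:
        finish += d
        total += finish
    return total / len(durations)

jobs = [10, 2, 30, 5]                      # made-up runtimes (minutes)
print(average_jct(jobs))                   # submitted order
print(average_jct(sorted(jobs)))           # shortest-job-first order gives a lower average JCT
```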

12 of 53

Output parameter

13 of 53

11/16

Update order: 1st person < 30 mins, others < 10 mins. Keep the meeting within 1.5 hrs.

Huide → Steven → Yaohui → Ning → Nikunj

Aritra will go second; he joins bi-weekly.

14 of 53

10/26

  • V0.5.0 release updates
  • Alignment meeting, next release planning
    • Nerf rendering
    • EI (next Monday) Serverless platform components (investigation)
  • Progress update
    • Steven
    • Aritra
    • Huide
    • Ning
    • Nik
    • Yaohui

15 of 53

10/19

  • A100 arrived
  • Nov. 11 team building with Bellevue crew
  • EI team sponsor alignment (next week)
  • Send our progress on Nerf to Media Lab
  • Release updates (code review and check-in)
    • Alluxio caching
    • Nerface avatar training and rendering
    • RDMA image, and DDP test
    • Profiler
      • stabilize “alnair_gpu_util”, “alnair_gpu_mem_util” metrics
      • Event-driven pod metadata and utilization data collection and storage
      • Standalone profiler intercept lib

16 of 53

10/13/2022

Alluxio review

  • In-memory FS; add/delete Alluxio workers (expand and shrink)
  • Installation in k8s
  • Default replication/HA
  • What does the Alnair Alluxio operator do?
  • Copy user's data from the source location to Alluxio's memory space
  • Delete the running pod
  • Mount the Alluxio space to your pod instead of mounting from the source location

Imperfect

17 of 53

18 of 53

Installation

  • Create PV
  • Install Alluxio master and workers

19 of 53

10/12/2022

  • Office days: Monday, Thursday
  • 10.145.83.36, 10.145.83.37 GPU upgrade from K80 to P100
  • Progress Update
    • Nikunj:

20 of 53

10/5/2022

  • Upgrade 6 K80 -> P100 from Reza next Monday
  • Server reboot/disconnect issues
  • Progress updates/issues, target release date 10/21
    • Zhaobo: serverless computing survey, profiler, Alnair pod
    • Ning: create NeRF intro; needs a discussion of NeRF acceleration
    • Nikunj: complete Alluxio
      • Q from Dr. Xiong: why do we need to build a CRD and operator for Alluxio as a platform provider?
      • New updates with affinity are coming soon
      • Need to carefully quantify the improvement from Alluxio
    • Huide: still working on distributed training over RDMA
      • Issue: segmentation fault, possibly caused by openib; trying OpenMPI 5.0 RC now
      • Maybe talk with Xu Hao; he has done some work with OpenMPI
    • Yaohui: wrap up the code and prepare an avatar using my face video
    • Steven: NeMo LLM investigation; need to get approval from NVIDIA to use the service
      • Will download the toolkit and play with it
    • Reza: build a k8s cluster; maybe start with some warm-up projects?

21 of 53

9/28/2022

  • New member from Device, Reza
  • Dr. Xiong’s Sponsor Updates
  • Progress updates
    • Aritra
  • Updates (with 930 goal)
    • Zhaobo
    • Yaohui
    • Ning
    • Huide
    • Steven
    • Nik

  • Serverless Insights

22 of 53

23 of 53

Feature development

  • Alnair Profiler (refactor)
    • Collect pod metadata and max CPU, memory, GPU, and GPU-memory utilization; save to MongoDB (see the sketch after this list)
    • Integrate new metrics into Alnair-exporter; report data loading time
  • Alnair Pod and controller
    • Mount CUDA intercept lib
    • Mount RDMA lib (future)
  • Clustering and scheduling service
    • Integrate existing cluster and job pairing functions into Alnair platform
    • Explore Argo to orchestrate 24 jobs for scheduler efficiency test

  • Brainstorming on acceleration: data caching / distributed training
  • Design Alnair Job and controller (for distributed training -> auto distributed)
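
A minimal sketch of the MongoDB write path mentioned above, using pymongo; the connection string, database/collection names, and field values are placeholders rather than the profiler's actual schema.

```python
# Hypothetical profiler write path: store a pod's peak utilization record in MongoDB.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.alnair.svc:27017")   # placeholder address
profiles = client["alnair"]["pod_profiles"]

profiles.insert_one({
    "pod": "resnet50-train-0",              # placeholder pod metadata
    "node": "gpu-node-1",
    "max_cpu_util": 0.82,
    "max_mem_bytes": 12_884_901_888,
    "max_gpu_util": 0.95,
    "max_gpu_mem_util": 0.71,
    "collected_at": datetime.now(timezone.utc),
})
```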

24 of 53

Profiler

  • (?) How to measure the time spent on data loading
    • From process start until the GPU becomes busy (utilization > 0%); see the sketch below

  • (?) How to measure the time spent on the network (different models have different network demands)
    • Count the time I/O utilization > 0% (but I/O usage may overlap with compute usage)

https://dl.acm.org/doi/pdf/10.1145/3448016.3459240
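
One possible implementation of the data-loading heuristic above: scan the sampled GPU-utilization trace for the first non-zero sample and subtract the process start time. The trace below is made up.

```python
# Sketch of the "process start until GPU busy" heuristic for data-loading time.
def data_loading_seconds(process_start, samples):
    """samples: list of (timestamp, gpu_util_percent) pairs, sorted by timestamp."""
    for ts, util in samples:
        if util > 0:
            return ts - process_start
    return None                            # the GPU never became busy in the observed window

trace = [(100.0, 0), (101.0, 0), (102.0, 0), (103.0, 37), (104.0, 92)]   # made-up samples
print(data_loading_seconds(99.0, trace))   # -> 4.0 seconds before the GPU got busy
```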

25 of 53

Alluxio Cluster

  • Challenge:
    • Hot switch: when the cache is hydrated, switch from the remote store to the Alluxio cache
      • Currently we just wait
    • Limited cache space management: how to efficiently serve a large number of jobs

26 of 53

9/21/2022

  • Model split to speed up AI inference
  • GTC updates
  • Website updates
  • Nerf
  • Profiler development
  • Alluxio development

27 of 53

9/14/2022

  • AI hardware summit
    • Gradient notebooks: powering next-generation apps from ML to 3D graphics
    • SambaNova: foundation model platform (GPT), dataflow-as-a-service with pretrained model
    • Cerebras: wafer level cluster, and simplify distribution engineering (data, model, no. of cores)
    • Mythic: in-memory compute (analogue chip), accelerate edge AI
    • Meta: AI cluster, disaggregated compute and storage, super fast network fabric
  • 930 release continue from last week
    • RDMA, Nerf PoC
    • Usable Alluxio and Profiler

28 of 53

9/8/2022

  • New member: Ning Wang (edge intelligence)
  • UCLA workshop, possible collaboration (their strength?, https://www.breezeml.ai/)
    • data/remote memory,
    • elastic training (deepspeed, pipeline parallel)
  • 930 Release features
    • RDMA PoC (demo speed gain)
    • NeRF face PoC (resource/latency requirements to check feasibility for a potential app)
    • Alluxio-enabled Alnair pods (with or without cache affinity; baseline is NFS or local SSD)
    • Standalone profiler intercept lib; report H2D
    • Profiler: pod-based utilization data store (metadata includes whether vGPU is specified)
  • Rotating updates

29 of 53

8/31/2022

  • Back to office (2 days a week)
    • Monday, Weds
    • Next Weds (9/7) group meeting -> Thurs afternoon
  • Weds workshops: Metaverse, NG infra for cloud computing and ML
  • Alnair.sh (area assignment)
    • Nik (data orchestration): data format, preprocess/pipeline, dataset, dataloader, data prefetch and caching
    • Huide (network): CNI, RDMA SmartNIC, backend communication
    • Steven (CUDA): GPU intercept/sharing, GDS, profiling
    • Yaohui (AI algorithms): 3D rendering and self-driving, intelligent system/scheduler
  • 930 key features: data (?) and profiler
  • Updates
    • Nerface reproduce; exporter/profiler feature confirmation
    • RDMA config/test; ibping <GUID> (not IP) over InfiniBand; throughput depends on thread setup

30 of 53

Alnair Platform

31 of 53

32 of 53

8/24/2022

  • Problems to be solved in 930 (to be discussed)
    • Profiler (store the max GPU utilization, memory utilization, I/O, … metadata for each job)
    • Metrics exporter (report data loading, preprocessing, forward/backward pass time)
    • Distributed job controller (PyTorch DDP, Torch Elastic; compare with Horovod performance); see the DDP sketch after this list
    • Data store (in-memory FS, prefetch data to limited memory)
    • Nerf/transformer algorithm performance deep dive
    • RDMA for distributed training poc
  • Rotating updates
      • Huide
      • Steven
      • Nik
      • Yaohui (Brown bag)
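
For the distributed job controller item above, a minimal PyTorch DDP worker sketch; the model and data are placeholders. The controller would start one such process per GPU, e.g. with torchrun or Torch Elastic, which set the RANK/WORLD_SIZE/LOCAL_RANK environment variables assumed here.

```python
# Minimal PyTorch DDP worker sketch (placeholder model and synthetic data).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # NCCL for GPU training
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 10).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):                                # placeholder training loop
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        optimizer.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()   # DDP all-reduces gradients
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched, for example, with "torchrun --nproc_per_node=<gpus per node> worker.py" on each node.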

33 of 53

8/17/2022

  • Weekly update
    • Get a swin-transformer object detection (object segmentation) distributed training case ready
      • Model, dataset, framework (Torch Elastic, Horovod)
    • Data pipeline orchestration (example Pachyderm)
    • A100, NIC purchase
  • Alnair Position
    • AI training with optimized scenarios (AV, 3D rendering)
    • Features: data pipeline, GPU sharing, cross-layer monitoring, RDMA communication, PyTorch improvement plugin
  • Rotating updates
    • Yaohui
    • Steven
    • Nik
    • Huide
    • Aritra
    • Xu
    • Cyan

34 of 53

8-12-2022

  • Alnair (cloud-native AI platform) position
    • Prototype/MVP, not complete solution, focus on innovation, root/key technologies, enable new cloud apps
    • Currently built on Kubernetes, managing NVIDIA GPUs and running AI training jobs (will add inference jobs)
    • Identified and working on two challenges: GPU sharing and data caching
    • Infrastructure improvement (ongoing): RDMA network, GPU Direct Storage (for container)
    • ML framework improvement: operator, transformer implementation etc.
    • Use cases: self-driving, immersive meetings (virtual humans, 3D rendering)
    • Short term goal (1 year): smoothly serve training and inference jobs with intelligent resource allocation, data orchestration, cross-layer monitoring, and cutting-edge network and storage technologies (small/medium jobs <= 32 cards)
    • Long term: new virtualization, new data/programming model, new serverless model, new applications
  • Weekly updates
    • Data Orchestration, AV (Nik)
    • Network, server (Huide): https://github.com/pachyderm/pachyderm
    • Storage, NVMe SSD (Steven)
    • Facial expression control/mapping (Cyan): https://github.com/NumesSanguis/FACSvatar
    • GPU sharing scheduling algorithm (Aritra); scheduling criteria: packing savings vs. individual slowdown

35 of 53

36 of 53

7-13-2022

  • 630 Release (7/15->7/20)
    • Storage caching (cache cluster based on Redis, alnair-pod and its controller); further verify Hazelcast, since a single Redis thread becomes a bottleneck (Dahai Liu <dliu@futurewei.com>)
    • Profiler (alnair-exporter, custom collector with new metrics); add memused and kernel-launch counters from the intercept lib
    • Open data (characterizing AI workloads, feature analysis)
    • GPU sharing (fix retrieving cgroup.procs info, CUDA intercept support >= 11.3); the intercept lib now supports 11.3 and later
  • Weekly Updates
    • Container runtime/RDMA, GPU Direct (SSD)
      • RoCE vs. InfiniBand
      • Can we run any ML examples on Qingming's existing RDMA clusters? Any code changes?
      • Just buy 2 SmartNIC CX-4s for the Bellevue GPU servers; reuse the Dell switch
    • GPU sharing
    • Profiler
    • 3D rendering engine, real-time rendering
    • Storage caching

37 of 53

7-6-2022

  • 630 Release (7/15)
    • Storage caching (cache cluster based on Redis, alnair-pod and its controller)
    • Profiler (alnair-exporter, custom collector with new metrics)
    • Open data (characterizing AI workloads, feature analysis)
    • GPU sharing (fix retrieving cgroup.procs info, CUDA intercept support >= 11.3)
  • Weekly Updates
    • Container runtime/RDMA, GPU sharing
      • The ioctl system call is not ready in Quark (Plan A stops here)
        • Normal VMs have GPU passthrough, but it is not ready in QVM
      • Plan B: enable RDMA and GPU support on runc ⇒ Alnair container runtime (?)
    • Profiler
    • 3D rendering
    • Storage caching

38 of 53

6-29-2022

  • OSS Info sharing
    • Alluxio, WASM, unstructured data, Jina, Horovod updates, “security”, Open DataSet
  • 630 Release (7/15)
    • Storage caching (cache cluster based on Redis, alnair-pod and its controller)
    • Profiler (alnair-exporter, new metrics)
    • Open data (characterizing AI workloads, feature analysis)
    • GPU sharing (fix retrieving cgroup.procs info, CUDA intercept support >= 11.3)
  • Network upgrades (CPU<->GPU 10 Gbps, GPU<->GPU 25 Gbps)
    • 4 Mellanox CX-6 NICs, 4 SSDs with GPUDirect support, 1 RDMA switch
  • Weekly Updates
    • Storage caching
    • GPU sharing, Quark GPU support first and then verify RDMA network help AI workloads
    • 3D rendering
    • Profiler

39 of 53

Alnair-exporter

  • Exporter + Collector
    • When the Prometheus server scrapes Alnair-exporter, the collectors retrieve the metrics
    • Implement the Prometheus collector interface for custom metrics (see the sketch after this list)
  • Metrics Source
    • Existing API servers (e.g., k8s get pods, NVML get device)
    • Files (e.g., /proc/stat)
    • Custom servers (?)
  • Metrics example
    • nvml-collector
      • Alnair-GPU-UTIL: 35% {GPU-UUID, PID, Node-Name}
      • Alnair-GMEM-USED: 8 {GPU-UUID, PID, Node-Name}
    • CUDA-collector
      • Alnair-GPU-Burst: 1 {GPU-UUID, Pod-ID, Node-Name}
      • Alnair-CUDA-Kernel: timestamp {Function-name, GPUUID, Pod-NAME}
  • DaemonSet on each worker node
  • (?) Additional server to hand over metrics from the intercept lib
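
A minimal sketch of the custom-collector approach described above, using the prometheus_client library; the metric names mirror the examples on this slide, while the label values and numbers are stubbed rather than read from NVML.

```python
# Sketch of a custom Prometheus collector for Alnair GPU metrics (values are stubbed).
import time
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

class NvmlCollector:
    def collect(self):
        gpu_util = GaugeMetricFamily(
            "alnair_gpu_util", "Per-process GPU utilization (percent)",
            labels=["gpu_uuid", "pid", "node_name"])
        gpu_util.add_metric(["GPU-1234", "4242", "gpu-node-1"], 35.0)   # stub values
        yield gpu_util

        gmem_used = GaugeMetricFamily(
            "alnair_gmem_used", "Per-process GPU memory used (GiB)",
            labels=["gpu_uuid", "pid", "node_name"])
        gmem_used.add_metric(["GPU-1234", "4242", "gpu-node-1"], 8.0)   # stub values
        yield gmem_used

if __name__ == "__main__":
    REGISTRY.register(NvmlCollector())
    start_http_server(9500)          # Prometheus scrapes this endpoint on each pull
    while True:
        time.sleep(60)
```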

40 of 53

Architecture

(Architecture diagram) Worker node (multiple GPUs) running the Alnair-Exporter (nvml-collector, cuda-collector, python-collector), user pods with intercept libs, and an Alnair-metrics-server.

  • Option 1: intercept lib writes metrics to a file, /var/lib/alnair/workspace/ABCD/metrics
  • Option 2: intercept lib dials a custom metrics server

41 of 53

3D Rendering Stack

  • 3D software
    • Blender/Maya/Autodesk 3ds Max
  • Render jobs
    • Real-time, offline (select benchmarks)
    • Format: obj/fbx/usd, blendfile (single object and all-in-one software based file)
  • Render engine & algorithm
    • Cycles/Eevee/appleseed
    • Algorithms: rasterization, ray-tracing, neural radiance field (Nerf)
  • Infrastructure
    • CPU vs GPU
    • render farm (distributed, network rendering)
    • Management/orchestration tool

42 of 53

6-22-2022

  • 630 Release
    • Storage caching (cache cluster, CPU workloads prove data loading speedup)
    • Profiler (custom exporter, new metrics)
    • open data
    • GPU sharing (fix)
  • Weekly Update
    • Storage Caching
      • Nik (FS), Fluid, Alluxio
      • Zhuangwei (KV), client-manager design
    • Profiler
      • Steven, Jonathan
    • GPU container runtime and Sharing
      • Hao
      • Aritra
    • AI application
      • Yaohui
      • Cyan, 3D rendering

43 of 53

6-15-2022

  • Weekly Update
    • Storage Caching
      • CPU network upgraded to 10Gbps
      • Nik (FS), Fluid
      • Zhuangwei (KV), client-manager design
        • Redis loading improvement (1.67sec → 0.45 sec)
    • GPU container runtime and Sharing
      • Hao: file size 0 causes segmentation fault
      • Aritra
    • AI application
      • Yaohui, Giraffe
      • Cyan, 3D rendering
    • Profiler
      • Metrics needed: process-level GPU utilization (verify GPU sharing compute limits work)

44 of 53

6-8-2022

  • Meetings
    • Seattle Center internal meeting, OSS
  • Weekly Update
    • AI application
      • Yaohui, Giraffe
      • Cyan, 3D rendering
    • GPU container runtime and Sharing
      • Hao: configure Quark as the low-level Docker runtime with the NVIDIA Docker runtime on top
      • Aritra
    • Profiler
      • Metrics needed: process-level GPU utilization (verify GPU sharing compute limits work)
    • Storage Caching
      • Network bottleneck identified: 1 Gbps
      • Zhuangwei (KV), client-manager design
      • Nik (FS), Fluid

45 of 53

Giraffe GPU utilization

The GPU has some idle time; utilization is not consistently high as in training

46 of 53

AI Demo Nerf + Giraffe

Job completion time comparison with various Packing Config

47 of 53

6-1-2022

  • Weekly Update
    • Storage Caching
      • Nik (FS)
        • Question: which network protocol to use to transfer data among workers and from worker to user
      • Zhuangwei (KV)
        • Kubernetes/containerization adds overhead compared to bare metal
        • Data format matters: bytes are faster than strings (see the sketch after this list)
        • Redis is faster than Hazelcast
    • GPU container runtime and Sharing
      • Hao
      • Aritra
    • Profiler
      • Steven demo added metrics on Prometheus
    • AI application
      • Yaohui
      • Cyan
  • Additional Topics
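
A small sketch of the data-format point above: round-trip a numpy array through Redis as raw bytes versus as a comma-separated string and time both. It assumes a reachable Redis instance; absolute numbers will vary, but the bytes path avoids the expensive string encode/parse.

```python
# Micro-benchmark sketch: raw-bytes vs. string round-trips through Redis.
import time
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)          # placeholder instance
arr = np.random.rand(100_000).astype(np.float32)

def round_trip_bytes():
    r.set("sample:bytes", arr.tobytes())
    return np.frombuffer(r.get("sample:bytes"), dtype=np.float32)

def round_trip_string():
    r.set("sample:str", ",".join(map(str, arr.tolist())))
    return np.array(r.get("sample:str").decode().split(","), dtype=np.float32)

for name, fn in [("bytes", round_trip_bytes), ("string", round_trip_string)]:
    start = time.perf_counter()
    fn()
    print(f"{name}: {time.perf_counter() - start:.4f} s")
```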

48 of 53

5-25-2022

  • Welcome intern Cyan (Taiqing)
  • Weekly Update
    • Profiler (Steven)
      • Prometheus exporter implementation
    • AI application (Yaohui) (Nerf and giraffe)
      • Touching image combination/reuse with Giraffe model.
    • Quark runtime (Hao)
      • Array is immutable. Hard-coded prestart hook mounts the NVIDIA library, but the file size is zero
      • Two ways: Quark with a prestart hook, or nvidia-runtime with Quark
    • Storage Caching (cluster setup, loading time comparison)
      • Alluxio (Nikunj)
      • Hazelcast (Zhuangwei)
    • GPU Sharing (Aritra)
      • Add 4 different categories of training jobs; use cluster centers to pair and test
    • vGPU server containerization (Zhaobo)
      • Use the docker top command to obtain a container's PIDs, instead of mounting /sys/fs/cgroup (see the sketch after this list)
  • Additional Topics/Issues
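
A sketch of the docker top approach noted above: shell out to the docker top command and parse the PID column. The container name is a placeholder; this requires access to the Docker CLI/daemon from wherever it runs.

```python
# Sketch: list the PIDs inside a container via `docker top` instead of reading cgroup files.
import subprocess

def container_pids(container: str) -> list:
    out = subprocess.run(["docker", "top", container],
                         capture_output=True, text=True, check=True).stdout.splitlines()
    header, rows = out[0].split(), out[1:]
    pid_col = header.index("PID")                     # PID column position in the ps header
    return [int(row.split()[pid_col]) for row in rows if row.strip()]

# print(container_pids("vgpu-server"))                # placeholder container name
```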

49 of 53

05-18-2022

  • Welcome Intern Aritra and Zhuangwei!
  • Weekly Update
    • AI application (Yaohui Ding)
    • GPU sharing and Quark runtime (Hao Xu)
      • GPU sharing remaining work: 1) multi-job tests, 2) PyTorch 1.8 intercept, 3) process ID pass-through, 4) socket communication timeout, 5) workload placement
      • Issues: NVIDIA prehook code is in, but → segmentation fault
    • Profiler (Steven Wang)
    • Storage Caching (Nikunj Parekh)
      • Measured network speed, disk read speed, memory read speed
      • Intercept cuMemcpyH2D with timestamp for future data loading speed measurement
      • Alluxio progress
  • Additional Topics/Issues

50 of 53

05-10-2022

  • Weekly Update
    • GPU sharing, GPU container runtime
      • Unix socket timeout corner case
      • Quark runtime (hello world example)
      • Nvidia docker runtime (prestart hook on nvidia library): expose GPU to container
    • Profiling
      • Exporter server coding and debugging (Steven)
      • Two metrics: kernel burst speed, memcpy timestamp
    • 3D/Rendering Application
      • OpenRoom dataset
    • Storage cache
  • Additional Issues/Topics
    • Needs: special hardware, turn on (VM) BIOS, NVIDIA package reinstall

51 of 53

Profiler Design

(Design diagram) GPU node: user pod with CUDA intercept layer. CPU node: collector, Prometheus, exporter, analyzer.

52 of 53

Customized Cache layer to improve I/O efficiency for DLT

I/O characteristics of DLT

  • Shareability
    • Intra-job: epoch reuse
    • Inter-job: 1) hyper-parameter tuning, 2) AutoML, 3) head-heavy, popular datasets reused among teams
  • Random access
    • PyTorch access example: permute the indices (see the sketch after this list)
    • Cache thrashing if the whole dataset cannot fit in cache
  • Substitutability
    • A random sample of inputs can be substituted by what is available in the cache
  • Predictability
    • A DLT job is predictable across mini-batches (known time per mini-batch); rank each job's sensitivity to I/O performance to determine cache placement and eviction policy, so that higher-priority jobs benefit the most from caching
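
A toy sketch of the two ideas above: the permuted (random-access) order a PyTorch sampler effectively produces, and one simple way to exploit substitutability by serving cached samples first and deferring misses, without repeating any sample in an epoch. This is only an illustration, not the policy from the cited paper.

```python
# Toy sketch: permuted access order plus a cache-friendly reordering.
import random

def permuted_indices(num_samples: int) -> list:
    order = list(range(num_samples))
    random.shuffle(order)                  # what a random sampler effectively does each epoch
    return order

def cache_friendly_order(order: list, cached: set) -> list:
    hits = [i for i in order if i in cached]          # serve cached samples first
    misses = [i for i in order if i not in cached]    # misses can be fetched in the background
    return hits + misses

order = permuted_indices(10)
print(cache_friendly_order(order, cached={0, 2, 4, 6, 8}))
```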

53 of 53