1 of 53

Alnair Group Meeting Notes

Starting 2022 May

2 of 53

Blank on purpose

3 of 53

1/25/2023

  • 130 release feature
    • Effectiveness improvement
      • Optimizer saving
      • Data shuffling
    • Efficiency improvement
      • DALI pipeline (see the data-loading sketch after this list)
      • Scheduler policy
  • Ongoing tasks
    • Async function setting, Alluxio testing, fast start (serverless optimized container)
    • RDMA kubernetes enabling
  • 430 planning
  • Today’s order
    • Ning, Yaohui, Nikunj, Steven, Huide
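
As a reference for the DALI pipeline item above, here is a minimal sketch of a GPU-accelerated image-loading pipeline feeding PyTorch. It assumes a recent DALI release is installed; the data directory, image size, and batch size are placeholders, not values from the project.

```python
# Minimal DALI pipeline sketch (assumes nvidia-dali is installed; paths/sizes are placeholders).
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def train_pipeline(data_dir):
    # Read and shuffle JPEG files, decode on the GPU ("mixed"), resize, normalize.
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT, output_layout="CHW")
    return images, labels.gpu()

pipe = train_pipeline("/data/train", batch_size=64, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator([pipe], ["data", "label"], reader_name="Reader")

for batch in loader:                       # one dict per pipeline
    images, labels = batch[0]["data"], batch[0]["label"]
    break                                  # training step would go here
```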

4 of 53

1/11

  • 2023 PBC (2022 review due this Friday), strategy, foundation model …
    • PBC: 1) project perspective 3) personal career
  • Formal project management (3 releases a year, 430, 730, 1030)
    • Research/investigation topic also needs goal and deliverables
    • Insights report in Dec
    • Efficient updates to Sponsor
  • 130 release
    • 1st version Fission based serverless ML framework
      • Epoch based resource reallocation/optimization
      • Function container with GPU access
      • Redis for data and model exchange (Alluxio as an option; DAG dependency); see the Redis sketch after this list
      • Stretch: decouple the data preprocessing function, and save and distribute the preprocessed data
    • Collect performance/accuracy metrics; compare with PyTorch DDP and the KubeML reference
  • Individual updates
    • Ning → Huide → Yaohui → Ke → Steven → Nikunj
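
A minimal sketch of the Redis-based model/data exchange mentioned above, assuming redis-py is installed and a Redis service is reachable; the host name and key names are placeholders, not the project's actual configuration.

```python
# Hypothetical sketch: exchange PyTorch model state between serverless functions via Redis.
import io
import redis
import torch

r = redis.Redis(host="redis.alnair.svc", port=6379)   # placeholder service address

def push_state(key: str, model: torch.nn.Module) -> None:
    """Serialize the model's state_dict to bytes and store it under `key`."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    r.set(key, buf.getvalue())

def pull_state(key: str, model: torch.nn.Module) -> None:
    """Load the latest state_dict from Redis into `model`, if one exists."""
    raw = r.get(key)
    if raw is not None:
        model.load_state_dict(torch.load(io.BytesIO(raw)))

# A training function would pull the current weights, run an epoch, then push them back:
#   pull_state("model:latest", model); train_one_epoch(model); push_state("model:latest", model)
```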

5 of 53

Questions

RDMA solution performance comparison

  • SR-IOV vs. macvlan

Kubernetes CNI is a big project; don't try to rewrite it

  • Investigate the changes/configuration to enable RDMA CNI

End-to-end GPU deep learning training, CUDA package rewrite

  • Reference existing DL example
  • Task breakdown, and timeline planning
  • Our position on DALI; any comparison plan?

6 of 53

12/20

  • Individual updates
    • Steven → Ke → Ning → Yaohui → Nikunj

7 of 53

12/14

  • Holiday celebration, Seattle (Thursday), Santa Clara(Friday)
  • 3 days a week in office next year
  • Aritra passed Ph.D. prelim exam “Applications of Predictive Analytics on Cloud Platform”
  • Sponsor Topic alignment
    • AI platform engineering
    • Special container for AI and Big Data
    • Data lake/warehouse (safely execute user-defined functions on the warehouse, WebAssembly)
    • Data exchange/governance, confidential computing
  • 130 release
    • Fission based Serverless Training
      1. Break data preprocessing and training into two functions (CV image and test case) (Yaohui/Steven)
      2. Break training by epoch, adjust function counts, optimize aggregation (Ning/Aritra); see the per-epoch sketch after this list
      3. Function container supports RDMA (note: only one container per NIC) (Huide)
      4. Support pod fast start with pod checkpoint or CRIU (Nikunj)
      5. Provide an in-memory distributed storage service for models/data/images (Redis/Alluxio/…?) (Ke/Nikunj/Ning)
  • Individual updates
    • Nikunj → Steven → Ke → Huide → Ning → (Aritra)
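
For item 2 above (breaking training by epoch), a minimal sketch of a per-epoch function that resumes from a shared checkpoint, trains one epoch, and writes the checkpoint back. The load_ckpt/save_ckpt callables are placeholders for whichever backend (Redis, Alluxio, ...) ends up being used.

```python
# Hypothetical per-epoch training function; the storage backend is abstracted away.
import torch

def run_one_epoch(model, optimizer, data_loader, load_ckpt, save_ckpt, epoch: int):
    state = load_ckpt(epoch - 1)                     # resume from the previous epoch, if any
    if state is not None:
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])

    model.train()
    for inputs, targets in data_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()

    # Persist state so the next function invocation (possibly with a different
    # number of workers) can continue from here.
    save_ckpt(epoch, {"model": model.state_dict(), "optimizer": optimizer.state_dict()})
```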

8 of 53

12/7

  • AWS reinvent
    • SageMaker ML architecture (Steven); large model, distributed training (Ning); EMR (Zhaobo)
    • Asynchronous, serverless (full data stack), customer cost reduction
  • Serverless AI platform
    • Generic services: task orchestration, storage service for workers' data exchange, data caching
    • AI-specific functions, e.g., data preprocessing, gradient/model aggregation (see the aggregation sketch after this list)
    • Advanced infra: secure container runtime, RDMA, NVMe
  • Weekly update (first person <= 20 mins, others <= 10 mins)
    • Ning → Nikunj → Steven → Ke → Huide → Yaohui
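
A minimal sketch of the gradient/model aggregation function mentioned above: it averages the state dicts produced by N worker functions. How the worker states are fetched (Redis, Alluxio, ...) is left out; the pull_state_dict helper name is a placeholder.

```python
# Hypothetical aggregation step: elementwise average of worker state_dicts.
import torch

def average_states(worker_states):
    """worker_states: list of state_dicts with identical keys and shapes."""
    avg = {}
    for key in worker_states[0]:
        stacked = torch.stack([s[key].float() for s in worker_states])
        avg[key] = stacked.mean(dim=0)
    return avg

# aggregated = average_states([pull_state_dict(f"worker:{i}") for i in range(n_workers)])
```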

9 of 53

11/30

  • Cloud lab other relevant projects
    • Quark container runtime, with RDMA service (CNI)
    • Fornax-serverless platform
  • Weekly update
    • Yaohui → Huide → Ke → Ning → Nikunj → Steven

10 of 53

11/23

  • AWS re:Invent (leadership sessions, breakout sessions)
  • Deliverables on next release 130
  • Next year's goals
  • Weekly update
    • Steven → Aritra → Ning → Nikunj → Huide → Ke

11 of 53

Argo workflow

  1. Install argo-workflow (argo-server, workflow-controller)
  2. Install argo cli
  3. Write workflow template
    1. Steps (sequential, parallel)
    2. Graph (dependency)
  4. Submit workflow “argo submit workflow.yaml”

  • Orchestrate container/function execution order, “with output sharing”
  • Test scheduling decisions (job execution orders) and compare average job completion time; see the sketch below
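
A toy sketch of that comparison: given per-job runtimes on a single device, compute the average job completion time for two candidate execution orders. The runtimes are made up for illustration.

```python
# Toy example: average job completion time (JCT) under different execution orders.
def average_jct(durations):
    """Jobs run back-to-back; a job's completion time is the sum of runtimes up to and including it."""
    finish, total = 0.0, 0.0
    for d in durations:
        finish += d
        total += finish
    return total / len(durations)

jobs = [10, 2, 30, 5]                      # made-up runtimes (minutes)
print(average_jct(jobs))                   # submitted order
print(average_jct(sorted(jobs)))           # shortest-job-first order gives a lower average JCT
```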

12 of 53

Output parameter

13 of 53

11/16

Update order: 1st person < 30 mins, others < 10 mins. Keep the meeting within 1.5 hrs.

Huide → Steven → Yaohui → Ning → Nikunj

Aritra will go second; he joins bi-weekly.

14 of 53

10/26

  • V0.5.0 release updates
  • Alignment meeting, next release planning
    • Nerf rendering
    • EI (next Monday) Serverless platform components (investigation)
  • Progress update
    • Steven
    • Aritra
    • Huide
    • Ning
    • Nik
    • Yaohui

15 of 53

10/19

  • A100 arrived
  • Nov. 11 team building with Bellevue crew
  • EI team sponsor alignment (next week)
  • Send our progress on Nerf to Media Lab
  • Release updates (code review and check-in)
    • Alluxio caching
    • Nerface avatar training and rendering
    • RDMA image, and DDP test
    • Profiler
      • stabilize “alnair_gpu_util”, “alnair_gpu_mem_util” metrics
      • Event-driven pod metadata and utilization data collection and storage
      • Standalone profiler intercept lib

16 of 53

10/13/2022

Alluxio review

  • In-memory FS; add/delete Alluxio workers (expand and shrink)
  • Installation in k8s
  • Default replication/HA
  • What does the Alnair Alluxio operator do?
  • Copy user's data from the source location to Alluxio's memory space
  • Delete the running pod
  • Mount the Alluxio space to your pod instead of mounting from the source location

Imperfect

17 of 53

18 of 53

Installation

  • Create PV
  • Install Alluxio master and workers

19 of 53

10/12/2022

  • Office days: Monday, Thursday
  • 10.145.83.36, 10.145.83.37 GPU upgrade from K80 to P100
  • Progress Update
    • Nikunj:

20 of 53

10/5/2022

  • Upgrade 6 K80 -> P100 from Reza next Monday
  • Server reboot/disconnect issues
  • Progress updates/issues, target release date 10/21
    • Zhaobo: serverless computing survey, profiler, Alnair pod
    • Ning: create NeRF intro; needs a discussion of NeRF acceleration
    • Nikunj: complete Alluxio
      • Q from Dr. Xiong: why do we need to build a CRD and operator for Alluxio as a platform provider?
      • New updates with affinity are coming soon
      • Need to carefully quantify the improvement from Alluxio
    • Huide: still working on distributed training over RDMA
      • Issue: segmentation fault, possibly caused by openib; trying OpenMPI 5.0 RC now
      • Maybe talk with Xu Hao; he has done some work with OpenMPI
    • Yaohui: wrap up the code and prepare an avatar using my face video
    • Steven: NeMo LLM investigation; need to get approval from NVIDIA to use the service
      • Will download the toolkit and play with it
    • Reza: build a k8s cluster; maybe start with some warm-up projects?

21 of 53

9/28/2022

  • New member from Device, Reza
  • Dr. Xiong’s Sponsor Updates
  • Progress updates
    • Aritra
  • Updates (with 930 goal)
    • Zhaobo
    • Yaohui
    • Ning
    • Huide
    • Steven
    • Nik

  • Serverless Insights

22 of 53

23 of 53

Feature development

  • Alnair Profiler (refactor)
    • Collect pod metadata and max CPU, memory, GPU, and GPU-memory utilization; save to MongoDB (see the sketch after this list)
    • Integrate new metrics into Alnair-exporter; report data loading time
  • Alnair Pod and controller
    • Mount CUDA intercept lib
    • Mount RDMA lib (future)
  • Clustering and scheduling service
    • Integrate existing cluster and job pairing functions into Alnair platform
    • Explore Argo to orchestrate 24 jobs for scheduler efficiency test

  • Brainstorming on acceleration: data caching / distributed training
  • Design Alnair Job and controller (for distributed training -> auto distributed)
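
A minimal sketch of the MongoDB write path mentioned above, using pymongo; the connection string, database/collection names, and field values are placeholders rather than the profiler's actual schema.

```python
# Hypothetical profiler write path: store a pod's peak utilization record in MongoDB.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.alnair.svc:27017")   # placeholder address
profiles = client["alnair"]["pod_profiles"]

profiles.insert_one({
    "pod": "resnet50-train-0",              # placeholder pod metadata
    "node": "gpu-node-1",
    "max_cpu_util": 0.82,
    "max_mem_bytes": 12_884_901_888,
    "max_gpu_util": 0.95,
    "max_gpu_mem_util": 0.71,
    "collected_at": datetime.now(timezone.utc),
})
```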

24 of 53

Profiler

  • (?) How to measure the time spent on data loading
    • From process start until the GPU becomes busy (utilization > 0%); see the sketch below

  • (?) How to measure the time spent on the network (different models have different network demands)
    • Count the time I/O utilization > 0% (but I/O usage may overlap with compute usage)

https://dl.acm.org/doi/pdf/10.1145/3448016.3459240
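
One possible implementation of the data-loading heuristic above: scan the sampled GPU-utilization trace for the first non-zero sample and subtract the process start time. The trace below is made up.

```python
# Sketch of the "process start until GPU busy" heuristic for data-loading time.
def data_loading_seconds(process_start, samples):
    """samples: list of (timestamp, gpu_util_percent) pairs, sorted by timestamp."""
    for ts, util in samples:
        if util > 0:
            return ts - process_start
    return None                            # the GPU never became busy in the observed window

trace = [(100.0, 0), (101.0, 0), (102.0, 0), (103.0, 37), (104.0, 92)]   # made-up samples
print(data_loading_seconds(99.0, trace))   # -> 4.0 seconds before the GPU got busy
```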

25 of 53

Alluxio Cluster

  • Challenge:
    • Hot switch: when the cache is hydrated, switch from the remote store to the Alluxio cache
      • Currently we just wait
    • Limited cache space management: how to efficiently serve a large number of jobs

26 of 53

9/21/2022

  • Model split to speed up AI inference
  • GTC updates
  • Website updates
  • Nerf
  • Profiler development
  • Alluxio development

27 of 53

9/14/2022

  • AI hardware summit
    • Gradient notebooks: powering next-generation apps from ML to 3D graphics
    • SambaNova: foundation model platform (GPT), dataflow-as-a-service with pretrained model
    • Cerebras: wafer level cluster, and simplify distribution engineering (data, model, no. of cores)
    • Mythic: in-memory compute (analogue chip), accelerate edge AI
    • Meta: AI cluster, disaggregated compute and storage, super fast network fabric
  • 930 release continue from last week
    • RDMA, Nerf PoC
    • Usable Alluxio and Profiler

28 of 53

9/8/2022

  • New member: Ning Wang (edge intelligence)
  • UCLA workshop, possible collaboration (their strength?, https://www.breezeml.ai/)
    • data/remote memory,
    • elastic training (deepspeed, pipeline parallel)
  • 930 Release features
    • RDMA PoC (demo speed gain)
    • NeRF face PoC (resource/latency requirements to check feasibility for a potential app)
    • Alluxio-enabled Alnair pods (with or without cache affinity; baseline is NFS or local SSD)
    • Standalone profiler intercept lib; report H2D
    • Profiler: pod-based utilization data store (metadata includes whether vGPU is specified)
  • Rotating updates

29 of 53

8/31/2022

  • Back to office (2 days a week)
    • Monday, Weds
    • Next Weds (9/7) group meeting -> Thurs afternoon
  • Weds workshops: Metaverse, NG infra for cloud computing and ML
  • Alnair.sh (area assignment)
    • Nik (data orchestration): data format, preprocess/pipeline, dataset, dataloader, data prefetch and caching
    • Huide (network): CNI, RDMA SmartNIC, backend communication
    • Steven (CUDA): GPU intercept/sharing, GDS, profiling
    • Yaohui (AI algorithms): 3D rendering and self-driving, intelligent system/scheduler
  • 930 key features: data (?) and profiler
  • Updates
    • Nerface reproduce; exporter/profiler feature confirmation
    • RDMA config/test; ibping <GUID> (not IP) over InfiniBand; throughput depends on thread setup

30 of 53

Alnair Platform

31 of 53

32 of 53

8/24/2022

  • Problems to be solved in 930 (to be discussed)
    • Profiler (store the max GPU utilization, memory utilization, I/O, … metadata for each job)
    • Metrics exporter (report data loading, preprocessing, forward/backward pass time)
    • Distributed job controller (PyTorch DDP, Torch Elastic; compare with Horovod performance); see the DDP sketch after this list
    • Data store (in-memory FS, prefetch data to limited memory)
    • Nerf/transformer algorithm performance deep dive
    • RDMA for distributed training poc
  • Rotating updates
      • Huide
      • Steven
      • Nik
      • Yaohui (Brown bag)
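
For the distributed job controller item above, a minimal PyTorch DDP worker sketch; the model and data are placeholders. The controller would start one such process per GPU, e.g. with torchrun or Torch Elastic, which set the RANK/WORLD_SIZE/LOCAL_RANK environment variables assumed here.

```python
# Minimal PyTorch DDP worker sketch (placeholder model and synthetic data).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # NCCL for GPU training
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 10).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):                                # placeholder training loop
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        optimizer.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()   # DDP all-reduces gradients
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched, for example, with "torchrun --nproc_per_node=<gpus per node> worker.py" on each node.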

33 of 53

8/17/2022

  • Weekly update
    • Get a swin-transformer object detection (object segmentation) distributed training case ready
      • Model, dataset, framework (Torch Elastic, Horovod)
    • Data pipeline orchestration (example Pachyderm)
    • A100, NIC purchase
  • Alnair Position
    • AI training with optimized scenarios (AV, 3D rendering)
    • Features: data pipeline, GPU sharing, cross-layer monitoring, RDMA communication, PyTorch improvement plugin
  • Rotating updates
    • Yaohui
    • Steven
    • Nik
    • Huide
    • Aritra
    • Xu
    • Cyan

34 of 53

8-12-2022

  • Alnair (cloud-native AI platform) position
    • Prototype/MVP, not complete solution, focus on innovation, root/key technologies, enable new cloud apps
    • Currently built on Kubernetes, managing NVIDIA GPUs and running AI training jobs (will add inference jobs)
    • Identified and working on two challenges: GPU sharing and data caching
    • Infrastructure improvement (ongoing): RDMA network, GPU Direct Storage (for container)
    • ML framework improvement: operator, transformer implementation etc.
    • Use cases: self-driving, immersive meetings (virtual humans, 3D rendering)
    • Short term goal (1 year): smoothly serve training and inference jobs with intelligent resource allocation, data orchestration, cross-layer monitoring, and cutting-edge network and storage technologies (small/medium jobs <= 32 cards)
    • Long term: new virtualization, new data/programming model, new serverless model, new applications
  • Weekly updates
    • Data Orchestration, AV (Nik)
    • Network, server (Huide): https://github.com/pachyderm/pachyderm
    • Storage, NVMe SSD (Steven)
    • Facial expression control/mapping (Cyan): https://github.com/NumesSanguis/FACSvatar
    • GPU sharing scheduling algorithm (Aritra); scheduling criteria: packing savings vs. individual slowdown

35 of 53

36 of 53

7-13-2022

  • 630 Release (7/15->7/20)
    • Storage caching (cache cluster based on Redis, alnair-pod and its controller); further verify Hazelcast, since a single Redis thread becomes a bottleneck (Dahai Liu <dliu@futurewei.com>)
    • Profiler (alnair-exporter, custom collector with new metrics); add memused and kernel-launch counters from the intercept lib
    • Open data (characterizing AI workloads, feature analysis)
    • GPU sharing (fix retrieving cgroup.procs info, CUDA intercept support >= 11.3); the intercept lib now supports 11.3 and later
  • Weekly Updates
    • Container runtime/RDMA, GPU Direct (SSD)
      • RoCE vs. InfiniBand
      • Can we run any ML examples on Qingming's existing RDMA clusters? Any code changes?
      • Just buy 2 SmartNIC CX-4s for the Bellevue GPU servers; reuse the Dell switch
    • GPU sharing
    • Profiler
    • 3D rendering engine, real-time rendering
    • Storage caching

37 of 53

7-6-2022

  • 630 Release (7/15)
    • Storage caching (cache cluster based on Redis, alnair-pod and its controller)
    • Profiler (alnair-exporter, custom collector with new metrics)
    • Open data (characterizing AI workloads, feature analysis)
    • GPU sharing (fix retrieving cgroup.procs info, CUDA intercept support >= 11.3)
  • Weekly Updates
    • Container runtime/RDMA, GPU sharing
      • The ioctl system call is not ready in Quark (Plan A stops here)
        • Normal VMs have GPU passthrough, but it is not ready in QVM
      • Plan B: enable RDMA and GPU support on runc ⇒ Alnair container runtime (?)
    • Profiler
    • 3D rendering
    • Storage caching

38 of 53

6-29-2022

  • OSS Info sharing
    • Alluxio, WASM, unstructured data, Jina, Horovod updates, “security”, Open DataSet
  • 630 Release (7/15)
    • Storage caching (cache cluster based on Redis, alnair-pod and its controller)
    • Profiler (alnair-exporter, new metrics)
    • Open data (characterizing AI workloads, feature analysis)
    • GPU sharing (fix retrieving cgroup.procs info, CUDA intercept support >= 11.3)
  • Network upgrades (CPU<->GPU 10 Gbps, GPU<->GPU 25 Gbps)
    • 4 Mellanox CX-6 NICs, 4 SSDs with GPUDirect support, 1 RDMA switch
  • Weekly Updates
    • Storage caching
    • GPU sharing, Quark GPU support first and then verify RDMA network help AI workloads
    • 3D rendering
    • Profiler

39 of 53

Alnair-exporter

  • Exporter + Collector
    • When the Prometheus server scrapes Alnair-exporter, the collectors retrieve the metrics
    • Implement the Prometheus collector interface for custom metrics (see the sketch after this list)
  • Metrics Source
    • Existing API servers (e.g., k8s get pods, NVML get device)
    • Files (e.g., /proc/stat)
    • Custom servers (?)
  • Metrics example
    • nvml-collector
      • Alnair-GPU-UTIL: 35% {GPU-UUID, PID, Node-Name}
      • Alnair-GMEM-USED: 8 {GPU-UUID, PID, Node-Name}
    • CUDA-collector
      • Alnair-GPU-Burst: 1 {GPU-UUID, Pod-ID, Node-Name}
      • Alnair-CUDA-Kernel: timestamp {Function-name, GPUUID, Pod-NAME}
  • DaemonSet on each worker node
  • (?) Additional server to hand over metrics from the intercept lib
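
A minimal sketch of the custom-collector approach described above, using the prometheus_client library; the metric names mirror the examples on this slide, while the label values and numbers are stubbed rather than read from NVML.

```python
# Sketch of a custom Prometheus collector for Alnair GPU metrics (values are stubbed).
import time
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

class NvmlCollector:
    def collect(self):
        gpu_util = GaugeMetricFamily(
            "alnair_gpu_util", "Per-process GPU utilization (percent)",
            labels=["gpu_uuid", "pid", "node_name"])
        gpu_util.add_metric(["GPU-1234", "4242", "gpu-node-1"], 35.0)   # stub values
        yield gpu_util

        gmem_used = GaugeMetricFamily(
            "alnair_gmem_used", "Per-process GPU memory used (GiB)",
            labels=["gpu_uuid", "pid", "node_name"])
        gmem_used.add_metric(["GPU-1234", "4242", "gpu-node-1"], 8.0)   # stub values
        yield gmem_used

if __name__ == "__main__":
    REGISTRY.register(NvmlCollector())
    start_http_server(9500)          # Prometheus scrapes this endpoint on each pull
    while True:
        time.sleep(60)
```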

40 of 53

Architecture

(Architecture diagram) Worker node (multiple GPUs) running the Alnair-Exporter (nvml-collector, cuda-collector, python-collector), user pods with intercept libs, and an Alnair-metrics-server.

  • Option 1: intercept lib writes metrics to a file, /var/lib/alnair/workspace/ABCD/metrics
  • Option 2: intercept lib dials a custom metrics server

41 of 53

3D Rendering Stack

  • 3D software
    • Blender/Maya/Autodesk 3ds Max
  • Render jobs
    • Real-time, offline (select benchmarks)
    • Format: obj/fbx/usd, blendfile (single object and all-in-one software based file)
  • Render engine & algorithm
    • Cycles/Eevee/appleseed
    • Algorithms: rasterization, ray-tracing, neural radiance field (Nerf)
  • Infrastructure
    • CPU vs GPU
    • render farm (distributed, network rendering)
    • Management/orchestration tool

42 of 53

6-22-2022

  • 630 Release
    • Storage caching (cache cluster, CPU workloads prove data loading speedup)
    • Profiler (custom exporter, new metrics)
    • open data
    • GPU sharing (fix)
  • Weekly Update
    • Storage Caching
      • Nik (FS), Fluid, Alluxio
      • Zhuangwei (KV), client-manager design
    • Profiler
      • Steven, Jonathan
    • GPU container runtime and Sharing
      • Hao
      • Aritra
    • AI application
      • Yaohui
      • Cyan, 3D rendering

43 of 53

6-15-2022

  • Weekly Update
    • Storage Caching
      • CPU network upgraded to 10Gbps
      • Nik (FS), Fluid
      • Zhuangwei (KV), client-manager design
        • Redis loading improvement (1.67sec → 0.45 sec)
    • GPU container runtime and Sharing
      • Hao: file size 0 causes segmentation fault
      • Aritra
    • AI application
      • Yaohui, Giraffe
      • Cyan, 3D rendering
    • Profiler
      • Metrics needed: process-level GPU utilization (verify GPU sharing compute limits work)

44 of 53

6-8-2022

  • Meetings
    • Seattle Center internal meeting, OSS
  • Weekly Update
    • AI application
      • Yaohui, Giraffe
      • Cyan, 3D rendering
    • GPU container runtime and Sharing
      • Hao: configure Quark as the low-level Docker runtime with the NVIDIA Docker runtime on top
      • Aritra
    • Profiler
      • Metrics needed: process-level GPU utilization (verify GPU sharing compute limits work)
    • Storage Caching
      • Network bottleneck identified: 1 Gbps
      • Zhuangwei (KV), client-manager design
      • Nik (FS), Fluid

45 of 53

Giraffe GPU utilization

The GPU has some idle time; utilization is not consistently high as in training

46 of 53

AI Demo Nerf + Giraffe

Job completion time comparison with various Packing Config

47 of 53

6-1-2022

  • Weekly Update
    • Storage Caching
      • Nik (FS)
        • Question: which network protocol to use to transfer data among workers and from worker to user
      • Zhuangwei (KV)
        • Kubernetes/containerization adds overhead compared to bare metal
        • Data format matters: bytes are faster than strings (see the sketch after this list)
        • Redis is faster than Hazelcast
    • GPU container runtime and Sharing
      • Hao
      • Aritra
    • Profiler
      • Steven demo added metrics on Prometheus
    • AI application
      • Yaohui
      • Cyan
  • Additional Topics
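
A small sketch of the data-format point above: round-trip a numpy array through Redis as raw bytes versus as a comma-separated string and time both. It assumes a reachable Redis instance; absolute numbers will vary, but the bytes path avoids the expensive string encode/parse.

```python
# Micro-benchmark sketch: raw-bytes vs. string round-trips through Redis.
import time
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)          # placeholder instance
arr = np.random.rand(100_000).astype(np.float32)

def round_trip_bytes():
    r.set("sample:bytes", arr.tobytes())
    return np.frombuffer(r.get("sample:bytes"), dtype=np.float32)

def round_trip_string():
    r.set("sample:str", ",".join(map(str, arr.tolist())))
    return np.array(r.get("sample:str").decode().split(","), dtype=np.float32)

for name, fn in [("bytes", round_trip_bytes), ("string", round_trip_string)]:
    start = time.perf_counter()
    fn()
    print(f"{name}: {time.perf_counter() - start:.4f} s")
```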

48 of 53

5-25-2022

  • Welcome intern Cyan (Taiqing)
  • Weekly Update
    • Profiler (Steven)
      • Prometheus exporter implementation
    • AI application (Yaohui) (Nerf and giraffe)
      • Touching image combination/reuse with Giraffe model.
    • Quark runtime (Hao)
      • Array is immutable. Hard-coded prestart hook mounts the NVIDIA library, but the file size is zero
      • Two ways: Quark with a prestart hook, or nvidia-runtime with Quark
    • Storage Caching (cluster setup, loading time comparison)
      • Alluxio (Nikunj)
      • Hazelcast (Zhuangwei)
    • GPU Sharing (Aritra)
      • Add 4 different categories of training jobs; use cluster centers to pair and test
    • vGPU server containerization (Zhaobo)
      • Use the docker top command to obtain a container's PIDs, instead of mounting /sys/fs/cgroup (see the sketch after this list)
  • Additional Topics/Issues
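
A sketch of the docker top approach noted above: shell out to the docker top command and parse the PID column. The container name is a placeholder; this requires access to the Docker CLI/daemon from wherever it runs.

```python
# Sketch: list the PIDs inside a container via `docker top` instead of reading cgroup files.
import subprocess

def container_pids(container: str) -> list:
    out = subprocess.run(["docker", "top", container],
                         capture_output=True, text=True, check=True).stdout.splitlines()
    header, rows = out[0].split(), out[1:]
    pid_col = header.index("PID")                     # PID column position in the ps header
    return [int(row.split()[pid_col]) for row in rows if row.strip()]

# print(container_pids("vgpu-server"))                # placeholder container name
```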

49 of 53

05-18-2022

  • Welcome Intern Aritra and Zhuangwei!
  • Weekly Update
    • AI application (Yaohui Ding)
    • GPU sharing and Quark runtime (Hao Xu)
      • GPU sharing remaining work: 1) multi-job tests, 2) PyTorch 1.8 intercept, 3) process ID pass-through, 4) socket communication timeout, 5) workload placement
      • Issues: NVIDIA prehook code is in, but → segmentation fault
    • Profiler (Steven Wang)
    • Storage Caching (Nikunj Parekh)
      • Measured network speed, disk read speed, memory read speed
      • Intercept cuMemcpyH2D with timestamp for future data loading speed measurement
      • Alluxio progress
  • Additional Topics/Issues

50 of 53

05-10-2022

  • Weekly Update
    • GPU sharing, GPU container runtime
      • Unix socket timeout corner case
      • Quark runtime (hello world example)
      • Nvidia docker runtime (prestart hook on nvidia library): expose GPU to container
    • Profiling
      • Exporter server coding and debugging (Steven)
      • Two metrics: kernel burst speed, memcpy timestamp
    • 3D/Rendering Application
      • OpenRoom dataset
    • Storage cache
  • Additional Issues/Topics
    • Needs: special hardware, turn on (VM) BIOS, NVIDIA package reinstall

51 of 53

Profiler Design

(Design diagram) GPU node: user pod with CUDA intercept layer. CPU node: collector, Prometheus, exporter, analyzer.

52 of 53

Customized Cache layer to improve I/O efficiency for DLT

I/O characteristics of DLT

  • Shareability
    • Intra-job: epoch reuse
    • Inter-job: 1) hyper-parameter tuning, 2) AutoML, 3) head-heavy, popular datasets reused among teams
  • Random access
    • PyTorch access example: permute the indices (see the sketch after this list)
    • Cache thrashing if the whole dataset cannot fit in cache
  • Substitutability
    • A random sample of inputs can be substituted by what is available in the cache
  • Predictability
    • A DLT job is predictable across mini-batches (known time per mini-batch); rank each job's sensitivity to I/O performance to determine cache placement and eviction policy, so that higher-priority jobs benefit the most from caching
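
A toy sketch of the two ideas above: the permuted (random-access) order a PyTorch sampler effectively produces, and one simple way to exploit substitutability by serving cached samples first and deferring misses, without repeating any sample in an epoch. This is only an illustration, not the policy from the cited paper.

```python
# Toy sketch: permuted access order plus a cache-friendly reordering.
import random

def permuted_indices(num_samples: int) -> list:
    order = list(range(num_samples))
    random.shuffle(order)                  # what a random sampler effectively does each epoch
    return order

def cache_friendly_order(order: list, cached: set) -> list:
    hits = [i for i in order if i in cached]          # serve cached samples first
    misses = [i for i in order if i not in cached]    # misses can be fetched in the background
    return hits + misses

order = permuted_indices(10)
print(cache_friendly_order(order, cached={0, 2, 4, 6, 8}))
```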

53 of 53