Alnair Group Meeting Notes
Starting May 2022
Blank on purpose
1/25/2023
1/11
Questions
RDMA solution performance comparison
Kubernetes CNI is a big project; don't try to rewrite it
End-to-end GPU deep learning training; CUDA package rewrite
12/20
12/14
12/7
11/30
11/23
Argo workflow
Output parameter
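A minimal sketch of an Argo Workflows output parameter, the mechanism discussed above: the step writes its result to a file, and Argo reads that file into the parameter. Template and file names here are illustrative assumptions, not from the notes.

```yaml
# Hypothetical template: the container writes to a file, and Argo exposes
# the file's contents as an output parameter for downstream steps.
- name: produce
  container:
    image: alpine:3.18
    command: [sh, -c]
    args: ["echo hello > /tmp/result.txt"]
  outputs:
    parameters:
      - name: result
        valueFrom:
          path: /tmp/result.txt
```

A later step can then consume the value as `{{steps.produce.outputs.parameters.result}}`.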
11/16
Update order: first person <30 min, others <10 min. Keep the meeting within 1.5 hrs
Huide → Steven → Yaohui → Ning → Nikunj
Aritra will go second; he joins bi-weekly
10/26
10/19
10/13/2022
Alluxio review
Imperfect
Installation
Create PV
Install Alluxio master and worker
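The "Create PV" step above can be sketched as a plain Kubernetes PersistentVolume for the Alluxio master to claim (e.g., for its journal). The name, size, and hostPath below are illustrative assumptions, not values from the notes.

```yaml
# Hypothetical PersistentVolume created before installing the Alluxio master;
# the master's journal PVC binds to a PV like this one.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: alluxio-journal-0
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /tmp/alluxio-journal-0
```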
10/12/2022
10/5/2022
9/28/2022
Feature development
Profiler
https://dl.acm.org/doi/pdf/10.1145/3448016.3459240
Alluxio Cluster
9/21/2022
9/14/2022
9/8/2022
8/31/2022
•Back to office (2 days a week)
•Monday, Weds
•Next Weds (9/7) group meeting → Thurs afternoon
•Weds workshops: Metaverse, next-gen infra for cloud computing and ML
•Alnair.sh (area assignment)
•Nik (data orchestration): data format, preprocess/pipeline, dataset, dataloader, data prefetch and caching
•Huide (network): CNI, RDMA smart NIC, backend communication
•Steven (CUDA): GPU intercept/sharing, GDS, profiling
•Yaohui (AI algorithms): 3D rendering and self-driving, intelligent system/scheduler
•9/30 key features: data(?) and profiler
•Updates
•Nerface reproduced; exporter/profiler features confirmed
•RDMA config/test: ibping <GUID> (not IP) over InfiniBand. Throughput correlates with thread setup
Alnair Platform
8/24/2022
8/17/2022
8/12/2022
7/13/2022
7/6/2022
6/29/2022
Alnair-exporter
Architecture
Worker node (multiple GPUs)
Alnair-Exporter
Exporter
nvml-collector
cuda-collector
User Pod with intercept lib
User Pod with intercept lib 1
intercept lib 2
Alnair-metrics-server
Option 1: write to file /var/lib/alnair/workspace/ABCD/metrics
Option 2: dial to a custom metrics server
python-collector
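Option 1 above can be sketched as follows: the intercept library appends metric samples to a well-known file under the pod's workspace, and the exporter reads them back for scraping. The field names and helper functions are illustrative assumptions, not the actual Alnair implementation.

```python
# Sketch of Option 1: file-based metrics hand-off between the intercept
# library (writer) and the Alnair exporter (reader). The JSON-lines format
# and function names are hypothetical.
import json
import time
from pathlib import Path


def write_metric(workspace: Path, name: str, value: float) -> None:
    """Intercept-library side: append one metric sample as a JSON line."""
    workspace.mkdir(parents=True, exist_ok=True)
    sample = {"ts": time.time(), "name": name, "value": value}
    with open(workspace / "metrics", "a") as f:
        f.write(json.dumps(sample) + "\n")


def read_metrics(workspace: Path) -> list:
    """Exporter side: read all samples back for scraping."""
    path = workspace / "metrics"
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text().splitlines()]
```

In the deployed layout the workspace would be a per-pod directory such as `/var/lib/alnair/workspace/<pod-id>/`; Option 2 would replace the file append with a dial-out to the custom metrics server.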
3D Rendering Stack
6/22/2022
6/15/2022
6/8/2022
Giraffe GPU utilization
GPU has some idle periods; utilization is not consistently high as during training
AI Demo Nerf + Giraffe
Job completion time comparison with various Packing Config
6/1/2022
5/25/2022
5/18/2022
5/10/2022
Profiler Design
GPU node
User pod
Cuda Intercept layer
CPU node
Collector
Prometheus
Exporter
Analyzer
Customized Cache layer to improve I/O efficiency for DLT
I/O Characteristics of DLT
•Shareability
•Intra-job: epoch reuse
•Inter-job: 1) hyper-parameter tuning, 2) AutoML, 3) head-heavy: popular datasets reused among teams
•Random access
•PyTorch access example: permute the indices
•Cache thrashing if the whole dataset cannot fit in cache
•Substitutability
•A random sample of inputs can be substituted by what is available in the cache
•Predictability
•A DLT job is predictable across mini-batches (known time per mini-batch); rank each job's sensitivity to I/O performance to determine cache placement and eviction policy, so that higher-priority jobs benefit the most from caching
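The random-access pattern noted above can be sketched with a pure-stdlib stand-in for PyTorch's `RandomSampler` (what a `DataLoader` with `shuffle=True` does): each epoch the sampler permutes all indices, so every sample is touched exactly once per epoch but in a different order. Function names here are illustrative.

```python
# Sketch of the per-epoch index permutation behind PyTorch-style shuffled
# data loading (stand-in for torch.utils.data.RandomSampler).
import random


def epoch_indices(dataset_len: int, seed: int) -> list:
    """Return a random permutation of [0, dataset_len) for one epoch."""
    indices = list(range(dataset_len))
    random.Random(seed).shuffle(indices)
    return indices
```

Because every index appears exactly once per epoch, an LRU cache smaller than the dataset tends to evict entries just before they are reused (the cache-thrash case above), which is what substitutability works around by serving whatever sample is already resident.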