2018 HPC-AI Competition Sharing - Team NTU

Ziji Shi, 12 Mar 2019

zshi005@e.ntu.edu.sg

Outline

  • Brief History of Team NTU
  • APAC HPC-AI Competition
  • Preparation and Schedule
  • Optimization Techniques
  • Result of Team NTU

Brief history of Team NTU

  • Team NTU was formed in 2014
    • Started as school club under School of Computer Science and Engineering
    • Participated in ASC from 2014 to 2016, and in ISC and SC since 2016
  • Current team advisor:
    • Assoc. Prof. Francis Lee
    • Dr. Ta Nguyen Binh Duong

Awards won by Team NTU

  • Silver Award at ASC14
  • Highest LINPACK Award (11.92 TFLOPS) at ASC15
  • Application Innovation Award at ASC16
  • Deep Learning Excellence Award at ISC17
  • Overall Champion and Highest LINPACK Award at SC17
  • 1st Runner-up at ISC18
  • 1st Runner-up and Highest LINPACK Award at SC18 (world record at 56.77 TFLOPS)
  • Merit Prize at APAC HPC-AI Competition 2018

Why HPC-AI Competition?

  • Domain knowledge: weather forecasting, artificial intelligence, high-performance computing
  • CS knowledge: compilers, operating systems, computer architecture, parallel computing, etc.
  • Soft skills: communication, presentation, strategic planning, etc.

HPC-AI Competition - Rules

  • Participants set a baseline on the NSCC cluster or on their own cluster
  • Two tasks are given:
    • Optimizing distributed training with RDMA on TensorFlow or Caffe2
      • Inception V3
      • ResNet 152
      • VGG16
      • Dataset: ImageNet 2012
    • Optimizing the Weather Research & Forecasting (WRF) model
  • Final results are submitted through a presentation and a report

HPC-AI Competition - Scoring

  • Optimizing distributed training with RDMA on TensorFlow or Caffe2
    • Judged on training throughput and convergence time
    • Inception V3: 50% on convergence time, 50% on images per second
    • ResNet 152: 100% on images per second
    • VGG16: 100% on images per second

  • Optimizing the Weather Research & Forecasting (WRF) model
    • Judged on simulation speed

Preparation

  • Divide-and-conquer approach
    • Every two members worked on the same sub-problem
  • Used it as a learning opportunity
    • Not everyone had the same level of knowledge
    • Learnt through online courses from Mellanox Academy, including RDMA training
  • Reused past experience
    • Kept an internal wiki as a knowledge base
  • Surveyed the latest research on distributed ML
    • Reviewed many distributed ML methods
  • Consulted a domain expert
    • Assist. Prof. Lin Guosheng provided insights into the latest trends

Cluster Setup

  • School cluster
    • GPU node x1
      • 2x Intel Xeon E5-2699 V4
      • 256GB DDR4 memory
      • 10x PCI-E V100 (16GB)
    • CPU node x1
      • 2x Intel Xeon Gold 6148 CPU @ 2.40GHz
      • 256GB DDR4 memory
  • NSCC
    • For practice runs
  • NSCC (DGX-1)
    • For ImageNet training

Schedule

  • We held weekly meetings to discuss, synchronize, and plan
  • Majority of time was spent on self-learning RDMA and TensorFlow
  • Most optimization was done after ISC18

Optimization Techniques

  • Configuration optimization
    • Build optimization:
      • CUDA
      • cuDNN
    • Mixed-precision training

  • Hyperparameter tuning
    • Optimizer selection
    • Tuning of the remaining training hyperparameters
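Mixed-precision training runs the expensive compute in FP16 while keeping a full-precision master copy of the weights, and multiplies the loss by a scale factor before the backward pass so that small gradients do not underflow in FP16's narrow range. A minimal stdlib sketch of the idea (the toy loss, learning rate, and `loss_scale` value are illustrative assumptions, not the team's actual configuration; Python floats stand in for the FP32 master weights):

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        return math.inf if x > 0 else -math.inf

def mixed_precision_step(master_w, grad_fn, lr=0.1, loss_scale=1024.0):
    """One update: FP16 gradients on a scaled loss, full-precision master weights."""
    w16 = [to_fp16(w) for w in master_w]            # cast weights down for compute
    grad16 = grad_fn(w16, loss_scale)               # "backward pass" in half precision
    grad = [g / loss_scale for g in grad16]         # unscale in full precision
    if not all(math.isfinite(g) for g in grad):     # skip the step on FP16 overflow
        return master_w
    return [w - lr * g for w, g in zip(master_w, grad)]

# Toy loss L(w) = 0.5 * ||w||^2, so the gradient of the scaled loss is w * loss_scale.
def toy_grad(w16, loss_scale):
    return [to_fp16(w * loss_scale) for w in w16]

w = [1.0, -2.0]
for _ in range(100):
    w = mixed_precision_step(w, toy_grad)
print(max(abs(x) for x in w))  # close to 0: converged despite FP16 gradients
```

Without the loss scale, a gradient smaller than about 6e-8 would round to zero in FP16 and the weight would stop updating; scaling shifts those values into FP16's representable range and the division afterwards restores their true magnitude.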

Optimization Techniques

  • Communication pattern optimization
    • All-reduce implementation
      • Ring-based all-reduce + NCCL + Horovod
    • Hierarchical copy
    • Gradient repacking
      • More efficient cross-device communication
    • FP16 compressor and decompressor
      • Adds support for communication in half precision
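Ring all-reduce splits each gradient tensor into P chunks and passes them around a ring of P workers: P-1 reduce-scatter steps followed by P-1 all-gather steps, so each worker transmits only 2(P-1)/P of the tensor regardless of cluster size. A pure-Python sequential simulation of the parallel algorithm (the worker count and gradient values are made up for illustration; in practice NCCL/Horovod run this over the GPU interconnect):

```python
def ring_all_reduce(tensors):
    """Simulate ring all-reduce: every worker ends with the element-wise sum."""
    p = len(tensors)                             # number of workers in the ring
    n = len(tensors[0])                          # elements per gradient tensor
    chunks = [list(t) for t in tensors]          # each worker's local buffer
    bounds = [c * n // p for c in range(p + 1)]  # chunk c = indices [bounds[c], bounds[c+1])

    # Phase 1: reduce-scatter. At step s, worker r sends chunk (r - s) mod p
    # to its right neighbour, which adds it in. After p-1 steps, worker r
    # holds the complete sum for chunk (r + 1) mod p.
    for s in range(p - 1):
        for r in range(p):
            c = (r - s) % p
            dst = (r + 1) % p
            for i in range(bounds[c], bounds[c + 1]):
                chunks[dst][i] += chunks[r][i]

    # Phase 2: all-gather. Each worker forwards its completed chunk around
    # the ring; receivers overwrite instead of summing.
    for s in range(p - 1):
        for r in range(p):
            c = (r + 1 - s) % p
            dst = (r + 1) % p
            for i in range(bounds[c], bounds[c + 1]):
                chunks[dst][i] = chunks[r][i]

    return chunks

# Three "workers", each holding a local gradient vector.
grads = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 20.0, 30.0, 40.0],
         [100.0, 200.0, 300.0, 400.0]]
result = ring_all_reduce(grads)
print(result[0])  # [111.0, 222.0, 333.0, 444.0] on every worker
```

An FP16 compressor in this setting simply casts each chunk to half precision before it is sent and back afterwards, halving the bytes on the wire at a small precision cost.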

Result

  • ResNet 152: 2852 img/sec
  • Inception V3: 3200 img/sec, 74% accuracy (after 2 hrs 20 mins)
  • VGG16: 3400 img/sec, 74% accuracy (after 4 hrs)
  • WRF: total running time 2.5 hrs

Thank you!

You can know more about us at

  • Website: https://ntuhpc.org
  • GitHub: https://github.com/ntuhpc
  • Twitter: @realntuhpc
  • LinkedIn: NTU HPC Club
