NUS School of Computing:
Student High Performance Computing (HPC) Team
Overview
If you’re reading this, you’re likely interested in the Student HPC Team @ NUS SoC (otherwise known as Kent Ridge HPC), and possibly in how to get involved.
We’re glad you’re interested! Please read further to see if this is really for you. There are some reasons why this may or may not be interesting to you, regardless of how cool the name is :)
Goals
The goals of Kent Ridge HPC are to:
- Learn industry-relevant skills for high-performance computing, specifically cluster hardware design and software optimization for benchmarking, and
- Put these skills into practice by participating in student cluster competitions.
Often, these competitions are conducted on-site, co-located with an academic HPC conference, and we’d need to send a team of students for such a competition. Other times, the competitions may be held virtually. Either way, we’re building teams of students to participate in these competitions.
What Student Cluster Competitions are About
In (very) short: The goal of a student cluster competition is to have undergraduate students build and tune a high-performance computing cluster to achieve the fastest performance on specific software benchmarks while maintaining energy efficiency (e.g., with a power budget). For a more detailed description, you can see the page describing the most popular Student Cluster Competition (SCC @ SuperComputing) here.
TLDR: design and actually build a cluster of machines (within a 42U rack), then optimize the hardware, toolchains, OS, networking, and software to beat other teams on specific benchmark scores.
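To make the "fastest performance within a power budget" idea concrete, here is a minimal sketch of how one might choose between tuning options. The `Run` class, the `best_under_cap` helper, the run labels, and every number below are hypothetical and made up for illustration; they are not real measurements or an official competition scoring rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark run under a given tuning (all values hypothetical)."""
    label: str
    gflops: float  # measured benchmark score
    watts: float   # peak power draw observed during the run


def best_under_cap(runs, cap_watts):
    """Return the highest-scoring run whose power draw stays within the cap."""
    eligible = [r for r in runs if r.watts <= cap_watts]
    if not eligible:
        return None  # nothing fits the budget
    return max(eligible, key=lambda r: r.gflops)


# Made-up numbers for illustration only, not real measurements.
runs = [
    Run("default clocks", gflops=9000.0, watts=3400.0),
    Run("GPU power limit lowered", gflops=8400.0, watts=2900.0),
    Run("one node powered off", gflops=7000.0, watts=2500.0),
]

choice = best_under_cap(runs, cap_watts=3000.0)
print(choice.label, round(choice.gflops / choice.watts, 2), "GFLOPS/W")
```

In this toy data the fastest run exceeds the cap, so the selected run is a slightly slower one that fits the budget. That is the kind of trade-off teams spend much of their time on.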
What Student Cluster Competitions are NOT About
It’s important to note that, from experience, these competitions do not focus on / prioritize specific aspects of high-performance / parallel computing that you might be interested in. For instance:
- Most of the work is in optimizing the OS / network / system setup / compilers / toolchain / flags / benchmark settings / distribution of processes among nodes / etc.
- Only occasionally are we asked to optimize the code we are running: in those cases, we are usually given a specific function or module of a benchmark program (HPL, NAMD, etc.) and tasked with parallelizing it.
Therefore, if your interests are limited to one specific area (writing fast code, optimizing existing code, runtime optimizations, hardware, networking, etc.), you may not find those skills used as readily in competitions as you might expect.
Useful Skillsets + What You Might Learn
With that said, we can enumerate some useful skillsets that we rely on for these competitions. Note that we don’t need you to have these skills beforehand, though knowing some of them will speed up the process. However, being interested in at least some of these skillsets is a must.
- Linux Familiarity: Much of the work revolves around setting up and optimizing Linux-based HPC clusters. It’s useful to be familiar with Linux commands, bash shell scripting, general management of Linux systems, etc. Often, we use common HPC OS images such as Rocky Linux.
- Cluster Configuration and Management: At the heart of these competitions is understanding how to deploy and manage multiple nodes in a cluster. Skills in cluster setup tools such as Ansible, job scheduling systems such as Slurm, and package and toolchain installation are very useful.
- Systems Programming: Experience with C and C++ is key; you may be asked to make extensive use of frameworks such as OpenMP, OpenACC, and C++17 STL parallel algorithms to optimize existing code. You will also need a solid understanding of the executable build process, from compilation and linking to loading, including the use of HPC SDKs such as oneAPI and NVHPC, build tools like CMake, and package managers such as Spack.
- Performance Tuning: Knowledge of how to optimize system performance at the OS, network, and hardware level is very important. This involves configuring compilers, selecting the right flags, tuning network interfaces, changing kernel parameters, and even balancing power consumption against performance. Overall, it’s very useful to understand systems as a whole, so that the right parts can be tuned.
- Benchmarking: You will, over time, develop a deep understanding of the benchmarks we use, such as HPL, HPCG, NAMD, ICON, MLPerf (and too many more to name). Such deep knowledge will allow you to identify optimization opportunities.
- Hardware Knowledge: From CPUs and GPUs to interconnects and memory subsystems, you should become familiar with how different hardware components contribute to overall performance.
- Hardware Selection: To build a cluster, familiarity with how rack-mounted servers are put together and how to select CPUs, GPUs, memory, storage, power supplies, switches, network interfaces, etc, from vendors, will prove useful for those involved in the hardware parts of this competition.
- Networking: Since HPC clusters are distributed systems, knowledge of networking protocols, MPI, network topologies, and performance optimization in communication-heavy environments is useful.
- OS Knowledge: The OS can add significant overhead to compute workloads. Tuning its filesystems, networking configuration, kernel parameters, memory and cache subsystems, kernel compilation settings, etc., can help reduce that overhead.
- Energy Efficiency: Balancing performance with a strict power budget is one of the key challenges in student cluster competitions. Knowing how to make trade-offs between speed and power consumption, and implementing those trade-offs effectively, can be a game-changer.
- Teamwork and Collaboration: These competitions are team-based under strict time limits. We really care that our teammates are not just technically strong, but kind to each other as well. Technically, teammates should be able to play to their strengths, while communicating effectively under these time-pressure environments.
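As a small taste of the Benchmarking and Hardware Knowledge items above, the sketch below applies two widely used back-of-the-envelope calculations for HPL: sizing the problem so the N x N double-precision matrix fills a chosen fraction of memory, and computing a CPU-only theoretical peak to compare measured results against. The function names, the 80% memory fraction, the block size of 192, and the example cluster (4 nodes, 256 GiB and 64 cores each, 2.0 GHz, 16 FLOPs per cycle) are illustrative assumptions, not settings the team necessarily uses. GPU runs are sized against GPU memory instead.

```python
import math


def hpl_problem_size(nodes, mem_gib_per_node, mem_fraction=0.8, block_size=192):
    """Estimate HPL's N so the N x N matrix of doubles fills `mem_fraction`
    of total memory, rounded down to a multiple of the block size (NB)."""
    total_bytes = nodes * mem_gib_per_node * 1024**3
    n = math.isqrt(int(mem_fraction * total_bytes / 8))  # 8 bytes per double
    return n - n % block_size


def peak_gflops(nodes, cores_per_node, ghz, flops_per_cycle):
    """Theoretical peak (Rpeak) in GFLOPS for a CPU-only cluster."""
    return nodes * cores_per_node * ghz * flops_per_cycle


# Hypothetical cluster, for illustration only.
n = hpl_problem_size(nodes=4, mem_gib_per_node=256)
rpeak = peak_gflops(nodes=4, cores_per_node=64, ghz=2.0, flops_per_cycle=16)
print("suggested N:", n)
print("Rpeak (GFLOPS):", rpeak)
```

Dividing a measured HPL score (Rmax) by `rpeak` gives the efficiency figure that teams usually track while tuning. The starting N is then adjusted empirically, since the OS and other processes also need memory.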
Again, we want to emphasize that you don’t have to know all of this (and most actual participants will just specialize in a few things!). This is mostly informational so that you know what’s useful in competition settings.
Commitment Requirements
Now that you know the overall structure of the competition and skillsets, we’d like to outline what being a part of Kent Ridge HPC will entail. Please note that this is a serious commitment, since competitions are hard deadlines and preparing for them takes a lot of effort.
- Weekly Meetings: At a minimum, the team meets once per week to learn new skillsets, listen to presentations, practice optimization and benchmarking, etc. The actual content of each meeting varies based on the closest milestone (competition, prep, etc.).
- Continuous Learning: Meeting once a week is not sufficient to keep up with trends and expand your skillsets. We expect members to learn things on their own, relevant to their interests and specialities.
- (Potentially) Overseas travel for competitions: If you are part of a team that is selected for a student cluster competition, you may have to travel overseas, likely during Recess/Reading week or during a holiday period. Once a team has been accepted by the competition organizers, last-minute replacements are usually not allowed.
Trying out for the team
If you’re here and still interested in joining us, that’s great. We’d like to see that you’re a self-starter who can learn and explore things on your own, specific to the competition.
Please follow the instructions in: Kent Ridge HPC Onboarding - Option 1: Software Benchmarking.