UCSD’s DSMLP instructional GPU cluster, a service of ITS/Educational Technology Services (formerly ACMS), provides students in all disciplines/divisions access to 80+ modern GPUs running on 10 physical hardware nodes located at SDSC. Funding for the cluster was provided by ITS, JSOE, and CogSci departments.
DSMLP jobs are executed in the form of Docker “containers”: essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes. The Kubernetes container management/orchestration system routes users’ containers onto compute nodes, monitors performance, and applies resource limits/quotas as appropriate.
Please be considerate and terminate idle containers: while containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s 80 GPU cards are assigned to users on an exclusive basis. When attached to a container they become unusable by others even if completely idle.
To report problems with DSMLP, or to request assistance, please contact the ITS Service Desk, via email to firstname.lastname@example.org, or via phone/walk-in at the AP&M service desk. Your instructor or TA will be your best resource for course-specific questions.
Access to the “ieng6” front-end/submission node
Launching a Container
Bash shell / Command Line
Monitoring Resource Usage within Jupyter/Python Notebooks
Container Run Time Limits
Container Termination Messages
Data Storage / Datasets
Copying Data Into the Cluster: from an ieng6 home directory
Copying Data Into the Cluster: SFTP from your computer
Copying Data Into the Cluster: rsync
Customization of the Container Environment
Adjusting CPU/GPU/RAM limits
Alternate Docker Images
Launch Script Command-line Options
Custom Python Packages (Anaconda/PIP)
Background Execution / Long-Running Jobs
Common CUDA Run-Time Error Messages
(59) device-side assert
(2) out of memory
(30) unknown error
Monitoring Cluster Status
Example of a PyTorch Session
To start a Pod (container), first log in via SSH to the ITS/ETS “dsmlp-login.ucsd.edu” Linux server. (You may also use “ieng6.ucsd.edu” if you have been given an account there.) These systems act as front-end/submission nodes for our cluster; computation is handled elsewhere.
ITS/ETS will provide instructors with login information for Instructor, TA, and student-test accounts for the courses.
Students should log in to the front-end nodes using either their UCSD email username (e.g. ‘jsmith’), or in some cases, a “course specific” account, e.g. “cs253wXX” for CSE253, Winter 2018. Consult the ITS/ETS Account Lookup Tool for instructions on activating course specific accounts. UCSD Extension/Concurrent Enrollment students: see Extension for a course account token, then complete the ITS/ETS Concurrent Enrollment Computer Account form.
Students logging in to 'ieng6' with their UCSD username (e.g. 'jsmith') must use the 'prep' command to activate their course environment and gain access to the GPU tools. Select the relevant option from the menu (e.g. cs253w, cs291w).
('prep' is implicit on 'dsmlp-login', or when using a course-specific account on ieng6.)
Assistance with sign-on to the front-end nodes may be obtained from the ITS Service Desk, via email to email@example.com, or via phone/walk-in at the AP&M service desk. Your instructor or TA will be your best resource for course-specific questions.
After signing-on to the front-end node, you may start a Pod/container using either of the following commands:
Python 3.6, PyTorch 1.0.1, TensorFlow 1.12.0 (WI19, replaces ets-pytorch)
Docker container image and CPU/GPU/RAM settings are all configurable; see the “Customization” and "Launch Script Command-line Options" sections below.
We encourage you to use non-GPU (CPU-only) containers until your code is fully tested and a simple training run is successful. (The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU.)
Once started, containers can provide Bash (shell/command-line), as well as Jupyter/Python Notebook environments.
The predefined launch scripts initiate an interactive Bash shell similar to ‘ssh’; containers terminate when this interactive shell exits. Our ‘pytorch’ image includes the GNU Screen utility, which may be used to manage multiple terminal sessions in a window-like manner.
The default container configuration creates an interactive web-based Jupyter/Python Notebook which may be accessed via a TCP proxy URL output by the launch script. Note that access to the TCP proxy URL requires a UCSD IP address: either on-campus wired/wireless, or VPN. See http://blink.ucsd.edu/go/vpn for instructions on the campus VPN.
Users of the stock containers will find CPU/Memory/GPU utilization noted at the top of the Jupyter notebook screen.
By default, containers are limited to 6 hours execution time to minimize impact of abandoned/runaway jobs. This limit may be increased, up to 12 hours, by modifying the "K8S_TIMEOUT_SECONDS" configuration variable. Please contact your TA or instructor if you require more than 12 hours.
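The limit is expressed in seconds, so a run time given in hours has to be converted before it is written into a private launch script. The following is a minimal sketch, assuming the variable is set by a plain shell assignment; the helper name `timeout_seconds_for_hours` is hypothetical and not part of the DSMLP tooling.

```shell
# Hypothetical helper: convert hours to a K8S_TIMEOUT_SECONDS value,
# capped at the 12-hour maximum that users may set themselves.
timeout_seconds_for_hours() {
    hours=$1
    max_hours=12
    if [ "$hours" -gt "$max_hours" ]; then
        echo "requested ${hours}h exceeds ${max_hours}h; ask your TA or instructor" >&2
        hours=$max_hours
    fi
    echo $((hours * 3600))
}

K8S_TIMEOUT_SECONDS=$(timeout_seconds_for_hours 12)
echo "K8S_TIMEOUT_SECONDS=$K8S_TIMEOUT_SECONDS"
```

The default of 6 hours corresponds to 21600 seconds, and the 12-hour maximum to 43200 seconds.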
Containers may occasionally exit with one of the following error messages:
Container memory (CPU RAM) limit was reached.
Container time limit (default 6 hours) exceeded - see above.
Unspecified error. Contact ITS/ETS for assistance.
Two types of persistent file storage are available within containers: a private home directory ($HOME) for each user, as well as a shared directory /datasets used to distribute common data (e.g. CIFAR-10, Tiny ImageNet).
ImageNet Fall 2011
ImageNet 32x32 2010
Synthetic Word Dataset
Contact ITS to request installation of additional datasets.
Standard utilities such as 'git', 'scp', 'sftp', and 'curl' are included in the 'pytorch' container image and may be used to retrieve code or data from on- or off-campus servers.
Files also may be copied into the cluster from the outside using the following procedures.
Note that file transfer is only offered through 'dsmlp-login.ucsd.edu', even if you normally launch jobs from 'ieng6'.
Updated Process, October 2018
Data may be copied to/from the cluster using the "SCP" or "SFTP" file transfer protocol from a Mac or Linux terminal window, or on Windows using a freely downloadable utility. We recommend this option for most users.
Example using the Mac/Linux 'sftp' command line program:
slithy:Downloads agt$ sftp <username>@dsmlp-login.ucsd.edu
pod agt-4049 up and running; starting sftp
Connected to ieng6.ucsd.edu
sftp> put 2017-11-29-raspbian-stretch-lite.img
Uploading 2017-11-29-raspbian-stretch-lite.img to /datasets/home/08/108/agt/2017-11-29-raspbian-stretch-lite.img
2017-11-29-raspbian-stretch-lite.img 100% 1772MB 76.6MB/s 00:23
sftp complete; deleting pod agt-4049
On Windows, we recommend the WinSCP utility.
Updated Process, October 2018
'rsync' also may be used from a Mac or Linux terminal window to synchronize data sets:
slithy:ME198 agt$ rsync -avr tub_1_17-11-18 <username>@dsmlp-login.ucsd.edu:
pod agt-9924 up and running; starting rsync
building file list ... done
rsync complete; deleting pod agt-9924
sent 557671 bytes received 20 bytes 53113.43 bytes/sec
total size is 41144035 speedup is 73.78
Each launch script specifies the default Docker image to use, the required number of CPU cores, GPU cards, and GB RAM assigned to its containers. An example of such a launch configuration is as follows:
K8S_NUM_GPU=1 # max of 1 (contact ETS to raise limit)
K8S_NUM_CPU=4 # max of 8 ("")
K8S_GB_MEM=32 # max of 64 ("")
# Controls whether an interactive Bash shell is started
# Sets up proxy URL for Jupyter notebook inside
Instructors and TAs may directly modify the coursewide scripts located in ../public/bin.
Otherwise, users may copy an existing launch script into their home directory, then modify that private copy:
$ cp -p `which launch-pytorch.sh` $HOME/my-launch-pytorch.sh
$ nano $HOME/my-launch-pytorch.sh
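The private copy can also be changed without opening an editor. The sketch below assumes the settings appear as `NAME=value` assignments at the start of a line, as in the sample configuration above; the helper `set_launch_var` is hypothetical, and the demonstration runs against a temporary file rather than a real launch script.

```shell
# Hypothetical helper: set NAME=VALUE in a launch script, keeping any
# trailing comment. Intended for simple numeric values only.
set_launch_var() {
    script=$1; var=$2; value=$3
    tmp="$script.tmp.$$"
    # Write through 'cat' so the script keeps its original permissions.
    sed "s/^${var}=[^[:space:]]*/${var}=${value}/" "$script" > "$tmp" &&
        cat "$tmp" > "$script" &&
        rm -f "$tmp"
}

# Demonstration on a temporary copy of the sample settings.
demo=$(mktemp)
cat > "$demo" <<'EOF'
K8S_NUM_GPU=1 # max of 1 (contact ETS to raise limit)
K8S_NUM_CPU=4 # max of 8 ("")
K8S_GB_MEM=32 # max of 64 ("")
EOF

set_launch_var "$demo" K8S_NUM_CPU 2
set_launch_var "$demo" K8S_GB_MEM 16
cat "$demo"
```

On the cluster, the first argument would be the private copy, for example `set_launch_var $HOME/my-launch-pytorch.sh K8S_GB_MEM 16`.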
The maximum limits (8 CPU cores, 64 GB RAM, 1 GPU) apply to the combined total of all your running containers: you may run eight 1-core containers, one 8-core container, or anything in between. Contact ETS to request increases to these default limits.
Increases to GPU allocations require consent of TA, instructor or advisor.
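Because the limits are summed over every container you have running, it can help to total a planned set of requests before launching. This is a sketch only: the helper `fits_quota` and its `CPU:GB:GPU` argument format are hypothetical, and the thresholds are the default limits quoted above.

```shell
# Hypothetical helper: sum "CPU:GB:GPU" requests and compare the totals
# with the default per-user limits (8 CPU cores, 64 GB RAM, 1 GPU).
fits_quota() {
    cpu=0; mem=0; gpu=0
    for req in "$@"; do
        c=${req%%:*}; rest=${req#*:}
        m=${rest%%:*}; g=${rest#*:}
        cpu=$((cpu + c)); mem=$((mem + m)); gpu=$((gpu + g))
    done
    echo "total: ${cpu} CPU, ${mem} GB, ${gpu} GPU"
    [ "$cpu" -le 8 ] && [ "$mem" -le 64 ] && [ "$gpu" -le 1 ]
}

# One GPU container plus one CPU-only container: exactly at the limits.
fits_quota 4:32:1 4:32:0 && echo "fits"

# Two GPU containers: exceeds the 1-GPU limit.
fits_quota 4:32:1 2:8:1 || echo "over quota"
```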
Besides GPU/CPU/RAM settings, you may specify an alternate Docker image: our servers will pull container images from Docker Hub or elsewhere if requested. ITS/ETS is happy to assist you with creation or modification of Docker images as needed, or you may do so on your own.
Defaults set within launch scripts' environment variables may be overridden using the following command-line options:
Adjust # CPU cores
Adjust # GPU cards
Adjust # GB RAM (-m)
Docker image name
Docker image ENTRYPOINT/CMD
Request specific cluster node (1-10)
Request specific GPU (-v; one of gtx1080ti, k5200, titan)
Request background pod (-b)
[cs190f @ieng6-201]:~:56$ launch-py3torch-gpu.sh -m 64 -v k5200
Users may install personal Python packages within their containers using the standard Anaconda package management system; please see Anaconda's Getting Started guide for a 30-minute introduction. Furthermore, instructors and TAs may construct shared course-wide Anaconda environments for their students; contact ETS for assistance doing so.
Example of installation using 'pip':
agt@agt-10859:~$ pip install --user imutils
Building wheels for collected packages: imutils
Running setup.py bdist_wheel for imutils ... done
Stored in directory: /tmp/xdg-cache/pip/wheels/ec/e4/a7/17684e97bbe215b7047bb9b80c9eb7d6ac461ef0a91b67ae71
Successfully built imutils
Installing collected packages: imutils
Successfully installed imutils-0.4.5
To support longer training runs, we permit background execution of student containers, up to 12 hours execution time, via the "-b" command line option.
Use the ‘kubesh <pod-name>’ command to connect or reconnect to a background container, and ‘kubectl delete pod <pod-name>’ to terminate.
Please be considerate and terminate any unused background jobs: GPU cards are assigned to containers on an exclusive basis, and when attached to a container are unusable by others even if idle.
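The life cycle of a background job can be sketched as the sequence below. It is written as a dry run that only prints each command, so it is safe to try anywhere; the `run` wrapper and the `DRY_RUN` variable are hypothetical, the pod name `cs190f-4953` is borrowed from the sample session, and `kubectl get pods` is the standard Kubernetes command for listing pods (it is an assumption that your account is permitted to use it here).

```shell
DRY_RUN=1   # assumption for this sketch: print commands instead of running them

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "$DRY_RUN" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run launch-pytorch-gpu.sh -b          # start a background pod; note the pod name it reports
run kubectl get pods                  # list pods that are still running
run kubesh cs190f-4953                # connect or reconnect to the pod
run kubectl delete pod cs190f-4953    # terminate it and release its GPU
```

Clearing `DRY_RUN` on a front-end node would execute the same commands for real.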
cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCTensorCopy.c:18
Indicates a run-time error in the CUDA code executing on the GPU, commonly due to out-of-bounds array access. Consider running in CPU-only mode (remove .cuda() call) to obtain more specific debugging messages.
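Two standard CUDA environment variables offer an alternative to editing the code, and are shown here as a hedged sketch. `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the error is reported at the call that caused it. `CUDA_VISIBLE_DEVICES=""` hides all GPUs, which yields a CPU run only if the script selects its device with a check such as `torch.cuda.is_available()`; a script that calls `.cuda()` unconditionally will fail instead. The wrapper names are hypothetical, and the demonstration uses a stand-in command rather than a training script.

```shell
# Run a command with synchronous CUDA kernel launches.
with_blocking_cuda() {
    CUDA_LAUNCH_BLOCKING=1 "$@"
}

# Run a command with every GPU hidden from CUDA.
without_gpu() {
    CUDA_VISIBLE_DEVICES="" "$@"
}

# Stand-in commands that print what the wrapped job would see.
with_blocking_cuda sh -c 'echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"'
without_gpu sh -c 'echo "CUDA_VISIBLE_DEVICES=[$CUDA_VISIBLE_DEVICES]"'
```

Inside a container, usage would look like `with_blocking_cuda python main-gpu.py`.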
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503968623488/work/torch/lib/THC/generic/THCStorage.cu:66
GPU memory has been exhausted. Try reducing your batch size, or confine your job to 11GB GTX1080Ti cards rather than 6GB Titan or 8GB K5200 (see Launch Script Command-line Options).
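Reducing the batch size can be automated with a small retry loop. The sketch below is illustrative only: `retry_halving` is a hypothetical helper that appends the batch size as the last argument of the command, `fake_train` is a stand-in for a real training script, and the loop retries on any non-zero exit status, not specifically on out-of-memory errors.

```shell
# Hypothetical helper: rerun a command with a halved batch size after each failure.
# Usage: retry_halving START_BATCH command [args...]
retry_halving() {
    batch=$1; shift
    while [ "$batch" -ge 1 ]; do
        if "$@" "$batch"; then
            echo "succeeded with batch size $batch"
            return 0
        fi
        echo "failed with batch size $batch; halving" >&2
        batch=$((batch / 2))
    done
    return 1
}

# Stand-in for a training script: pretends GPU memory suffices up to batch size 32.
fake_train() {
    [ "$1" -le 32 ]
}

retry_halving 256 fake_train
```

To restrict a job to the 11GB cards instead, the `-v gtx1080ti` launch option described in Launch Script Command-line Options applies.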
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/THCGeneral.c:70
This indicates a hardware error on the assigned GPU, and usually requires a reboot of the cluster node to correct. As a temporary workaround, you may explicitly direct your job to another node; see Launch Script Command-line Options. Please report these errors to ITS/ETS support - firstname.lastname@example.org
The ‘cluster-status’ command provides insight into the number of jobs currently running and GPU/CPU/RAM allocated.
ITS/ETS plans to deploy more sophisticated monitoring tools over the coming months.
Our current configuration doesn’t permit easy access to TensorBoard via port 6006, but the following shell commands will install a TensorBoard interface accessible within the Jupyter environment:
pip install -U --user jupyter-tensorboard
jupyter nbextension enable jupyter_tensorboard/tree --user
You’ll need to exit your Pod/container and restart for the change to take effect.
Usage instructions for ‘jupyter_tensorboard’ are available at:
Cluster architecture diagram
2xXeon Gold 6130
Nodes are connected via an Arista 7150 10Gb ethernet switch.
Additional nodes can be added into the cluster at peak times.
slithy:~ agt$ ssh email@example.com
Last login: Thu Oct 12 12:29:30 2017 from slithy.ucsd.edu
============================ NOTICE =================================
Authorized use of this system is limited to password-authenticated
usernames which are issued to individuals and are for the sole use of
the person to whom they are issued.
Privacy notice: be aware that computer files, electronic mail and
accounts are not private in an absolute sense. For a statement of
"ETS (formerly ACMS) Acceptable Use Policies" please see our webpage
Disk quotas for user cs190f (uid 59457):
Filesystem blocks quota limit grace files quota limit grace
11928 5204800 5204800 272 9000 9000
Check Account Lookup Tool at http://acms.ucsd.edu
Thu Oct 12, 2017 12:34pm - Prepping cs190f
[cs190f @ieng6-201]:~:56$ launch-pytorch-gpu.sh
Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 1 GPU units. (Edit /home/linux/ieng6/cs190f/public/bin/launch-pytorch.sh to change this configuration.)
pod "cs190f -4953" created
Thu Oct 12 12:34:41 PDT 2017 starting up - pod status: Pending ;
Thu Oct 12 12:34:47 PDT 2017 pod is running with IP: 10.128.7.99
tensorflow/tensorflow:latest-gpu is now active.
Please connect to: http://ieng6-201.ucsd.edu:4957/?token=669d678bdb00c89df6ab178285a0e8443e676298a02ad66e2438c9851cb544ce
Connected to cs190f-4953; type 'exit' to terminate processes and close Jupyter notebooks.
cs190f@cs190f-4953:~$ git clone https://github.com/yunjey/pytorch-tutorial.git
Cloning into 'pytorch-tutorial'...
remote: Counting objects: 658, done.
remote: Total 658 (delta 0), reused 0 (delta 0), pack-reused 658
Receiving objects: 100% (658/658), 12.74 MiB | 24.70 MiB/s, done.
Resolving deltas: 100% (350/350), done.
Checking connectivity... done.
cs190f@cs190f-4953:~$ cd pytorch-tutorial/
cs190f@cs190f-4953:~/pytorch-tutorial$ cd tutorials/02-intermediate/bidirectional_recurrent_neural_network/
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ python main-gpu.py
Epoch [1/2], Step [100/600], Loss: 0.7028
Epoch [1/2], Step [200/600], Loss: 0.2479
Epoch [1/2], Step [300/600], Loss: 0.2467
Epoch [1/2], Step [400/600], Loss: 0.2652
Epoch [1/2], Step [500/600], Loss: 0.1919
Epoch [1/2], Step [600/600], Loss: 0.0822
Epoch [2/2], Step [100/600], Loss: 0.0980
Epoch [2/2], Step [200/600], Loss: 0.1034
Epoch [2/2], Step [300/600], Loss: 0.0927
Epoch [2/2], Step [400/600], Loss: 0.0869
Epoch [2/2], Step [500/600], Loss: 0.0139
Epoch [2/2], Step [600/600], Loss: 0.0299
Test Accuracy of the model on the 10000 test images: 97 %
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ cd $HOME
cs190f@cs190f-4953:~$ nvidia-smi
Thu Oct 12 13:30:59 2017
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 27C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |