User Guide: Data Science/Machine Learning Platform (DSMLP)

A service of ITS/Educational Technology Services

http://go.ucsd.edu/2CZladZ

Introduction

UCSD’s DSMLP instructional GPU cluster, a service of ITS/Educational Technology Services (formerly ACMS), provides students in all disciplines and divisions with access to 80+ modern GPUs running on 10 physical hardware nodes located at SDSC.  Funding for the cluster was provided by ITS, JSOE, and the Cognitive Science department.

DSMLP jobs are executed in the form of Docker “containers”: essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes.  The Kubernetes container management/orchestration system routes users’ containers onto compute nodes, monitors performance, and applies resource limits/quotas as appropriate.

Please be considerate and terminate idle containers: while containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s 80 GPU cards are assigned to users on an exclusive basis.  When attached to a container, a GPU becomes unusable by others even if it is completely idle.

To report problems with DSMLP, or to request assistance, please contact the ITS Service Desk, via email to servicedesk@ucsd.edu, or via phone/walk-in at the AP&M service desk.   Your instructor or TA will be your best resource for course-specific questions.  

Access to the “ieng6” front-end/submission node

Launching a Container

Bash shell / Command Line

Jupyter/Python Notebooks

Monitoring Resource Usage within Jupyter/Python Notebooks

Container Run Time Limits

Container Termination Messages

Data Storage / Datasets

Standard Datasets

File Transfer

Copying Data Into the Cluster: SCP/SFTP from your computer

Copying Data Into the Cluster: rsync

Customization of the Container Environment

Adjusting CPU/GPU/RAM limits

Alternate Docker Images

Launch Script Command-line Options

Custom Python Packages (Anaconda/PIP)

Background Execution / Long-Running Jobs

Common CUDA Run-Time Error Messages

(59) device-side assert

(2) out of memory

(30) unknown error

Monitoring Cluster Status

Installing TensorBoard

Hardware Specifications

Example of a PyTorch Session

Access to the “ieng6” front-end/submission node

To start a Pod (container), first log in via SSH to the ITS/ETS "dsmlp-login.ucsd.edu" Linux server.  (You may also use "ieng6.ucsd.edu" if you have been given an account there.)  These systems act as front-end/submission nodes for our cluster; computation is handled elsewhere.

ITS/ETS will provide instructors with login information for Instructor, TA, and student-test accounts for their courses.

Students should log in to the front-end nodes using either their UCSD email username (e.g. 'jsmith'), or in some cases a "course-specific" account, e.g. "cs253wXX" for CSE253, Winter 2018.  Consult the ITS/ETS Account Lookup Tool for instructions on activating course-specific accounts.  UCSD Extension/Concurrent Enrollment students: see Extension for a course account token, then complete the ITS/ETS Concurrent Enrollment Computer Account form.

Students logging in to 'ieng6' with their UCSD username (e.g. 'jsmith') must use the 'prep' command to activate their course environment and gain access to the GPU tools.  Select the relevant option from the menu (e.g. cs253w, cs291w).

('prep' is implicit on 'dsmlp-login', or when using a course-specific account on ieng6.)

Assistance with sign-on to the front-end nodes may be obtained from the ITS Service Desk, via email to servicedesk@ucsd.edu, or via phone/walk-in at the AP&M service desk.  Your instructor or TA will be your best resource for course-specific questions.  

Launching a Container

After signing on to the front-end node, you may start a Pod/container using either of the following commands:

Launch Script           Description                                    #GPU  #CPU  RAM (GB)
launch-scipy-ml.sh      Python 3.6, PyTorch 1.0.1, TensorFlow 1.12.0     0     2      8
                        (WI19; replaces ets-pytorch)
launch-scipy-ml-gpu.sh                                                   1     4     16

Docker container image and CPU/GPU/RAM settings are all configurable; see the “Customization” and "Launch Script Command-line Options" sections below.

We encourage you to use non-GPU (CPU-only) containers until your code is fully tested and a simple training run is successful.  (The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU.)
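In PyTorch, for example, this switch can be written device-agnostically, so the same script runs unchanged in a CPU-only container and later in a GPU container (a minimal sketch; the try/except fallback is only there so the snippet also runs in environments without PyTorch installed):

```python
# Select the GPU when one is available, otherwise fall back to the CPU.
try:
    import torch
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
except ImportError:  # only expected outside the course containers
    device = "cpu"

print(device)
```

Tensors and models are then moved with `.to(device)` rather than hard-coded `.cuda()` calls, so nothing needs editing when you graduate from launch-scipy-ml.sh to launch-scipy-ml-gpu.sh.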

Once started, containers can provide Bash (shell/command-line), as well as Jupyter/Python Notebook environments.

Bash shell / Command Line

The predefined launch scripts initiate an interactive Bash shell similar to ‘ssh’; containers terminate when this interactive shell exits.   Our ‘pytorch’ image includes the GNU Screen utility, which may be used to manage multiple terminal sessions in a window-like manner.  

Jupyter/Python Notebooks

The default container configuration creates an interactive web-based Jupyter/Python Notebook which may be accessed via a TCP proxy URL output by the launch script.   Note that access to the TCP proxy URL requires a UCSD IP address: either on-campus wired/wireless, or VPN.  See http://blink.ucsd.edu/go/vpn for instructions on the campus VPN.

Monitoring Resource Usage within Jupyter/Python Notebooks

Users of the stock containers will find CPU/Memory/GPU utilization noted at the top of the Jupyter notebook screen.

Container Run Time Limits

By default, containers are limited to 6 hours execution time to minimize impact of abandoned/runaway jobs.  This limit may be increased, up to 12 hours, by modifying the "K8S_TIMEOUT_SECONDS" configuration variable.   Please contact your TA or instructor if you require more than 12 hours.
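For instance, after copying a launch script into your home directory (see "Customization of the Container Environment" below), the limit can be raised by editing that variable in your private copy (43200 seconds = 12 hours, the stated maximum):

```shell
# In your private copy of the launch script:
K8S_TIMEOUT_SECONDS=43200   # 12 hours; contact your TA/instructor for more
```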

Container Termination Messages

Containers may occasionally exit with one of the following error messages:

OOMKilled         Container memory (CPU RAM) limit was reached.

DeadlineExceeded  Container time limit (default 6 hours) exceeded - see above.

Error             Unspecified error.  Contact ITS/ETS for assistance.

Data Storage / Datasets

Two types of persistent file storage are available within containers: a private home directory ($HOME) for each user, as well as a shared directory /datasets used to distribute common data (e.g. CIFAR-10, Tiny ImageNet).    

Standard Datasets

Name                 Path                     Size   #Files  Notes
MNIST                /datasets/MNIST          53M    4
ImageNet Fall 2011   /datasets/imagenet       1300G  14M
ImageNet 32x32 2010  /datasets/imagenet-ds    1800M  2.6M    ILSVRC2012, downsampled 32x32/64x64
Tiny-ImageNet        /datasets/Tiny-ImageNet  353M   120k
CIFAR-10             /datasets/CIFAR-10       178M   9
Caltech256           /datasets/Caltech256     1300M  30k
ShapeNet             /datasets/ShapeNet       204G   981k    ShapeNetCore v1/v2
MJSynth              /datasets/MJSynth        36G    8.9M    Synthetic Word Dataset

Contact ITS to request installation of additional datasets.
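As an illustration of reading from these paths, the MNIST files are distributed in the standard IDX binary format; a minimal stdlib-only reader might look like the following (a sketch assuming uncompressed IDX files, which the 53M total size suggests; wrap with gzip.open if the copies under /datasets/MNIST turn out to be gzipped):

```python
import struct

def read_idx(path):
    """Read an IDX-format file (the format of the original MNIST distribution).

    Returns (dims, data): the dimension tuple from the header, and the raw bytes.
    """
    with open(path, "rb") as f:
        magic = struct.unpack(">I", f.read(4))[0]
        ndims = magic & 0xFF                       # low byte = number of dimensions
        dims = struct.unpack(">" + "I" * ndims, f.read(4 * ndims))
        data = f.read()
    return dims, data
```

For the MNIST training-image file, the header encodes the dimensions (60000, 28, 28), with one byte per pixel following.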

File Transfer

Standard utilities such as 'git', 'scp', 'sftp', and 'curl' are included in the 'pytorch' container image and may be used to retrieve code or data from on- or off-campus servers.    

Files also may be copied into the cluster from the outside using the following procedures.

Note that file transfer is only offered through 'dsmlp-login.ucsd.edu', even if you normally launch jobs from 'ieng6'.

Copying Data Into the Cluster: SCP/SFTP from your computer

Updated Process, October 2018

Data may be copied to/from the cluster using the "SCP" or "SFTP" file transfer protocol from a Mac or Linux terminal window, or on Windows using a freely downloadable utility.  We recommend this option for most users.

Example using the Mac/Linux 'sftp' command line program:

slithy:Downloads agt$ sftp <username>@dsmlp-login.ucsd.edu

pod agt-4049 up and running; starting sftp

Connected to ieng6.ucsd.edu

sftp> put 2017-11-29-raspbian-stretch-lite.img

Uploading 2017-11-29-raspbian-stretch-lite.img to /datasets/home/08/108/agt/2017-11-29-raspbian-stretch-lite.img

2017-11-29-raspbian-stretch-lite.img             100% 1772MB  76.6MB/s   00:23    

sftp> quit

sftp complete; deleting pod agt-4049

slithy:Downloads agt$

On Windows, we recommend the WinSCP utility.

Copying Data Into the Cluster: rsync

Updated Process, October 2018

'rsync' also may be used from a Mac or Linux terminal window to synchronize data sets:

slithy:ME198 agt$ rsync -avr tub_1_17-11-18 <username>@dsmlp-login.ucsd.edu:

pod agt-9924 up and running; starting rsync

building file list ... done

rsync complete; deleting pod agt-9924

sent 557671 bytes  received 20 bytes  53113.43 bytes/sec

total size is 41144035  speedup is 73.78

slithy:ME198 agt$

Customization of the Container Environment

Each launch script specifies the default Docker image to use, the required number of CPU cores, GPU cards, and GB RAM assigned to its containers.  An example of such a launch configuration is as follows:

K8S_DOCKER_IMAGE="ucsdets/instructional:cse190fa17-latest"

K8S_ENTRYPOINT="/run_jupyter.sh"

K8S_NUM_GPU=1  # max of 1 (contact ETS to raise limit)

K8S_NUM_CPU=4  # max of 8 ("")

K8S_GB_MEM=32  # max of 64 ("")

# Controls whether an interactive Bash shell is started

SPAWN_INTERACTIVE_SHELL=YES

# Sets up proxy URL for Jupyter notebook inside

PROXY_ENABLED=YES

PROXY_PORT=8888

Instructors and TAs may directly modify the coursewide scripts located in ../public/bin.  Other users may copy an existing launch script into their home directory, then modify that private copy:

$ cp -p `which launch-pytorch.sh` $HOME/my-launch-pytorch.sh

$ nano $HOME/my-launch-pytorch.sh    

$ $HOME/my-launch-pytorch.sh

Adjusting CPU/GPU/RAM limits

The maximum limits (8 CPU, 64GB RAM, 1 GPU) apply across all of your running containers: you may run eight 1-core containers, one 8-core container, or anything in between.  Contact ETS to request increases to these default limits.

Increases to GPU allocations require the consent of a TA, instructor, or advisor.

Alternate Docker Images

Besides GPU/CPU/RAM settings, you may specify an alternate Docker image: our servers will pull container images from Docker Hub or elsewhere if requested.  ITS/ETS is happy to assist you with creation or modification of Docker images as needed, or you may do so on your own.
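As a sketch, a custom image might extend one of the existing course images (the base tag below is taken from the example configuration in the previous section; the added package is purely illustrative):

```dockerfile
# Hypothetical Dockerfile extending a course image
FROM ucsdets/instructional:cse190fa17-latest

# Illustrative extra dependency baked into the image
RUN pip install --no-cache-dir imutils
```

Once built and pushed to a registry such as Docker Hub, the image can be selected with the launch scripts' -i option (see "Launch Script Command-line Options" below).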

Launch Script Command-line Options

Defaults set within launch scripts' environment variables may be overridden using the following command-line options:

Option    Description                                          Example
-c N      Adjust # CPU cores                                   -c 8
-g N      Adjust # GPU cards                                   -g 2
-m N      Adjust # GB RAM                                      -m 64
-i IMG    Docker image name                                    -i nvidia/cuda:latest
-e ENTRY  Docker image ENTRYPOINT/CMD                          -e /run_jupyter.sh
-n N      Request specific cluster node (1-10)                 -n 7
-v GPU    Request specific GPU (gtx1080ti, k5200, titan)       -v k5200
-b        Request background pod                               (see below)

Example:

[cs190f @ieng6-201]:~:56$  launch-py3torch-gpu.sh -m 64 -v k5200

Custom Python Packages (Anaconda/PIP)

Users may install personal Python packages within their containers using the standard Anaconda  package management system; please see Anaconda's Getting Started guide for a 30-minute introduction.  Furthermore, instructors and TAs may construct shared course-wide Anaconda environments for their students; contact ETS for assistance doing so.

Example of installation using 'pip':

agt@agt-10859:~$ pip install --user imutils

Collecting imutils

  Downloading imutils-0.4.5.tar.gz

Building wheels for collected packages: imutils

  Running setup.py bdist_wheel for imutils ... done

  Stored in directory: /tmp/xdg-cache/pip/wheels/ec/e4/a7/17684e97bbe215b7047bb9b80c9eb7d6ac461ef0a91b67ae71

Successfully built imutils

Installing collected packages: imutils

Successfully installed imutils-0.4.5
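Packages installed with pip's --user flag land in the per-user site-packages directory under $HOME, which is why they persist across container restarts.  A quick way to see that location from within a container:

```python
import site

# The per-user site-packages directory lives under $HOME,
# so packages installed there survive container restarts.
print(site.getusersitepackages())
```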

Background Execution / Long-Running Jobs

To support longer training runs, we permit background execution of student containers, up to 12 hours execution time, via the "-b" command line option.  

Use the ‘kubesh <pod-name>’ command to connect or reconnect to a background container, and ‘kubectl delete pod <pod-name>’ to terminate.

Please be considerate and terminate any unused background jobs:  GPU cards are assigned to containers on an exclusive basis, and when attached to a container are unusable by others even if idle.


Common CUDA Run-Time Error Messages

(59) device-side assert 

cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCTensorCopy.c:18

Indicates a run-time error in the CUDA code executing on the GPU, commonly due to out-of-bounds array access.  Consider running in CPU-only mode (remove the .cuda() call) to obtain more specific debugging messages.

(2) out of memory

 RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503968623488/work/torch/lib/THC/generic/THCStorage.cu:66

GPU memory has been exhausted.  Try reducing your batch size, or confine your job to the 11GB GTX 1080Ti cards rather than the 6GB Titan or 8GB K5200 (see Launch Script Command-line Options).
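A rough back-of-the-envelope helper for sizing batches (illustrative only: it counts input-tensor bytes, while activations, gradients, and optimizer state usually dominate real GPU memory usage):

```python
def input_batch_gb(batch, channels, height, width, bytes_per_elem=4):
    """Approximate size in GB of one float32 input batch (inputs only)."""
    return batch * channels * height * width * bytes_per_elem / 1e9

# A 256-image batch of 3x224x224 float32 inputs is only ~0.15 GB before
# activations -- the model's intermediate tensors are what fill an 11GB card.
print(round(input_batch_gb(256, 3, 224, 224), 3))  # 0.154
```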

(30) unknown error

RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/THCGeneral.c:70

This indicates a hardware error on the assigned GPU, and usually requires a reboot of the cluster node to correct.  As a temporary workaround, you may explicitly direct your job to another node; see Launch Script Command-line Options.  Please report these errors to ITS/ETS support (servicedesk@ucsd.edu).

Monitoring Cluster Status

The ‘cluster-status’ command provides insight into the number of jobs currently running and GPU/CPU/RAM allocated.

ITS/ETS plans to deploy more sophisticated monitoring tools over the coming months.

Installing TensorBoard

Our current configuration doesn’t permit easy access to Tensorboard via port 6006, but the following shell commands will install a TensorBoard interface accessible within the Jupyter environment:

 

pip install -U --user jupyter-tensorboard

jupyter nbextension enable jupyter_tensorboard/tree --user

You’ll need to exit your Pod/container and restart for the change to take effect.

Usage instructions for ‘jupyter_tensorboard’ are available at:

 

https://github.com/lspvic/jupyter_tensorboard#usage

Hardware Specifications


Cluster architecture diagram

Node         CPU Model          #Cores ea.  RAM ea.  #GPU  GPU Model         Family  CUDA Cores  GPU RAM  GFLOPS
Nodes 1-4    2x E5-2630 v4      20          384GB    8     GTX 1080Ti        Pascal  3584 ea.    11GB     10600
Nodes 5-8    2x E5-2630 v4      20          256GB    8     GTX 1080Ti        Pascal  3584 ea.    11GB     10600
Node 9       2x E5-2650 v2      16          128GB    8     GTX Titan (2014)  Kepler  2688 ea.    6GB      4500
Node 10      2x E5-2670 v3      24          320GB    7     GTX 1070Ti        Pascal  2432 ea.    8GB      7800
Nodes 11-12  2x Xeon Gold 6130  32          384GB    8     GTX 1080Ti        Pascal  3584 ea.    11GB     10600
Nodes 13-15  2x E5-2650 v1      16          320GB    n/a   n/a               n/a     n/a         n/a      n/a
Nodes 16-18  2x AMD 6128        24          256GB    n/a   n/a               n/a     n/a         n/a      n/a

Nodes are connected via an Arista 7150 10Gb Ethernet switch.

Additional nodes can be added into the cluster at peak times.

Example of a PyTorch Session

 

slithy:~ agt$

slithy:~ agt$ ssh cs190f@ieng6.ucsd.edu

Password:

Last login: Thu Oct 12 12:29:30 2017 from slithy.ucsd.edu

============================ NOTICE =================================

Authorized use of this system is limited to password-authenticated

usernames which are issued to individuals and are for the sole use of

the person to whom they are issued.

 

Privacy notice: be aware that computer files, electronic mail and

accounts are not private in an absolute sense.  For a statement of

"ETS (formerly ACMS) Acceptable Use Policies" please see our webpage

at http://acms.ucsd.edu/info/aup.html.

=====================================================================

 

 

Disk quotas for user cs190f (uid 59457):

     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace

acsnfs4.ucsd.edu:/vol/home/linux/ieng6

                      11928  5204800 5204800                 272        9000        9000      

=============================================================

Check Account Lookup Tool at http://acms.ucsd.edu

=============================================================

 

[…]

 

Thu Oct 12, 2017 12:34pm - Prepping cs190f

[cs190f @ieng6-201]:~:56$ launch-pytorch-gpu.sh

Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 1 GPU units.  (Edit /home/linux/ieng6/cs190f/public/bin/launch-pytorch.sh to change this configuration.)

pod "cs190f-4953" created

Thu Oct 12 12:34:41 PDT 2017 starting up - pod status: Pending ;

Thu Oct 12 12:34:47 PDT 2017 pod is running with IP: 10.128.7.99

tensorflow/tensorflow:latest-gpu is now active.

 

Please connect to: http://ieng6-201.ucsd.edu:4957/?token=669d678bdb00c89df6ab178285a0e8443e676298a02ad66e2438c9851cb544ce

 

Connected to cs190f-4953; type 'exit' to terminate processes and close Jupyter notebooks.

cs190f@cs190f-4953:~$ ls

TensorFlow-Examples

cs190f@cs190f-4953:~$

cs190f@cs190f-4953:~$ git clone https://github.com/yunjey/pytorch-tutorial.git

Cloning into 'pytorch-tutorial'...

remote: Counting objects: 658, done.

remote: Total 658 (delta 0), reused 0 (delta 0), pack-reused 658

Receiving objects: 100% (658/658), 12.74 MiB | 24.70 MiB/s, done.

Resolving deltas: 100% (350/350), done.

Checking connectivity... done.

cs190f@cs190f-4953:~$ cd pytorch-tutorial/

cs190f@cs190f-4953:~/pytorch-tutorial$ cd tutorials/02-intermediate/bidirectional_recurrent_neural_network/

cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ python main-gpu.py

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Processing...

Done!

Epoch [1/2], Step [100/600], Loss: 0.7028

Epoch [1/2], Step [200/600], Loss: 0.2479

Epoch [1/2], Step [300/600], Loss: 0.2467

Epoch [1/2], Step [400/600], Loss: 0.2652

Epoch [1/2], Step [500/600], Loss: 0.1919

Epoch [1/2], Step [600/600], Loss: 0.0822

Epoch [2/2], Step [100/600], Loss: 0.0980

Epoch [2/2], Step [200/600], Loss: 0.1034

Epoch [2/2], Step [300/600], Loss: 0.0927

Epoch [2/2], Step [400/600], Loss: 0.0869

Epoch [2/2], Step [500/600], Loss: 0.0139

Epoch [2/2], Step [600/600], Loss: 0.0299

Test Accuracy of the model on the 10000 test images: 97 %

cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ cd $HOME

cs190f@cs190f-4953:~$ nvidia-smi    

Thu Oct 12 13:30:59 2017      

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 23%   27C    P0    56W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

cs190f@cs190f-4953:~$ exit