1 of 21

SOFA Basic

2 of 21

Quick Start (1/2): Prepare, Install, and Run

  • For download:
    • SOFA: git clone https://github.com/cyliustack/sofa (community edition)
  • cd sofa
  • ./tools/prepare.sh
  • ./tools/empower-tcpdump.sh $(whoami)
    • Log out and log back in for the change to take effect; then cd sofa again
  • ./install.sh /opt/sofa
  • source /opt/sofa/tools/activate.sh
  • [optional] ./tools/enable_strace_perf_pcm.py
  • sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100"
  • sofa report --verbose

3 of 21

Quick Start (2/2): Visualization

Visualizations of the heterogeneous performance data are stored in the ./sofalog/ directory. You can either:

  • Run: sofa report --with_gui
  • Or compress the whole sofalog directory, download it to local storage, and open sofalog/index.html in a browser
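
The second option can be scripted. The following is a minimal sketch, assuming the log directory is named sofalog and contains index.html as described above; the helper name archive_logdir is hypothetical and not part of SOFA.

```python
import shutil
from pathlib import Path


def archive_logdir(logdir: str = "sofalog") -> str:
    """Pack a SOFA log directory into <logdir>.tar.gz and return the archive path.

    archive_logdir is a hypothetical helper, not a SOFA command.
    """
    log_path = Path(logdir).resolve()
    if not (log_path / "index.html").is_file():
        raise FileNotFoundError(f"{log_path} does not contain index.html")
    # root_dir/base_dir keep the top-level directory name inside the archive,
    # so unpacking it recreates sofalog/index.html on the local machine.
    return shutil.make_archive(
        base_name=str(log_path),
        format="gztar",
        root_dir=str(log_path.parent),
        base_dir=log_path.name,
    )


if __name__ == "__main__":
    print(archive_logdir("sofalog"))
```

After copying the archive to the local machine and unpacking it, open sofalog/index.html in a browser.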

4 of 21

X-axis = Unix timestamps (seconds); Y-axis = metrics with different units (log10 scale)

  • CPU: CPU time (seconds)
  • NET: payload of each packet (bytes)
  • VMSTAT_CS / VMSTAT_BI / VMSTAT_BO: counts per second
  • STRACE: duration (seconds)
  • MPSTAT_USR: seconds per 10 ms
  • GPU kernel, CUDA_COPY_H2D (host-to-device), CUDA_COPY_D2H (device-to-host): duration (seconds)

5 of 21

Heterogeneous Traces Visualization in SOFA

[Figure annotations: GPU H2D memcpy, GPU D2H memcpy, GPU DNN backward propagation, GPU DNN forward propagation, CPU utilization, network bandwidth]

6 of 21

SOFA & Deep Learning

7 of 21

Case Study: Storage

Commands:

sudo sysctl -w vm.drop_caches=3

sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"

8 of 21

Case Study: Storage (cont.)

Commands:

sudo sysctl -w vm.drop_caches=3

sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"

sofa report --with_gui

10 of 21

Case Study: Storage (cont.)

ls -lah /dev/mapper/
...
lrwxrwxrwx. 1 root root 7 Jan 30 14:16 cl-home -> ../dm-2
lrwxrwxrwx. 1 root root 7 Jan 30 14:16 cl-root -> ../dm-0
lrwxrwxrwx. 1 root root 7 Jan 30 14:16 cl-swap -> ../dm-1

COMMAND:

sofa record "dd if=/dev/zero of=dummy.out bs=1M count=1000"

sofa report --with_gui

diskstat monitoring runs at 10 Hz; the plotted values are sectors read and written.
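
Sector counts are easier to interpret once converted to bytes per second. The sketch below assumes the counters come from Linux /proc/diskstats (where a sector is always 512 bytes) and that consecutive samples are 1/10 s apart; the two sample lines are made-up illustration values, not SOFA output.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts in 512-byte sectors


def parse_diskstats_line(line: str) -> dict:
    """Extract the device name and sector counters from one /proc/diskstats line."""
    fields = line.split()
    return {
        "device": fields[2],
        "sectors_read": int(fields[5]),
        "sectors_written": int(fields[9]),
    }


def write_throughput(before: str, after: str, sample_hz: float = 10.0) -> float:
    """Bytes written per second between two consecutive samples of one device."""
    a, b = parse_diskstats_line(before), parse_diskstats_line(after)
    if a["device"] != b["device"]:
        raise ValueError("samples come from different devices")
    delta_sectors = b["sectors_written"] - a["sectors_written"]
    return delta_sectors * SECTOR_BYTES * sample_hz


# Hypothetical samples for dm-0 taken 0.1 s apart (values are made up).
sample_t0 = "253 0 dm-0 1200 0 96000 450 3000 0 512000 9000 0 800 9450"
sample_t1 = "253 0 dm-0 1200 0 96000 450 3100 0 716800 9300 0 830 9750"
print(write_throughput(sample_t0, sample_t1), "bytes/s")
```

The device name (dm-0 here) is the one that the /dev/mapper listing resolves to for the filesystem being written.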

11 of 21

Case Study: Storage (cont.)

MPSTAT Profiling:

CPU Utilization (%):

core  USR  SYS  IDL  IOW  IRQ
   0    0    0   97    0    0
   1    0    9   75   14    0
   2    0    0   99    0    0
   3    1    3   88    6    0
   4    1    7   90    0    0
   5    0   54   32   12    0
   6    0    6   46   46    0
   7    0    0   95    3    0

CPU Time (s):

core   USR   SYS   IDL   IOW   IRQ
   0  0.03  0.03  3.11  0.02  0.00
   1  0.00  0.32  2.41  0.46  0.00
   2  0.01  0.02  3.17  0.00  0.00
   3  0.06  0.10  2.84  0.22  0.00
   4  0.04  0.25  2.91  0.00  0.00
   5  0.00  1.72  1.02  0.40  0.00
   6  0.03  0.20  1.48  1.48  0.00
   7  0.00  0.03  3.06  0.11  0.00

Active CPU Time (s): 5.510

Active CPU ratio (%): 22

Definition: Active CPU ratio = total non-idle time / (elapsed time × number of CPU cores)
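
The definition can be checked against the CPU Time table. This is a sketch under two assumptions: "non-idle" means every column except IDL (so I/O wait counts as active), and elapsed time × cores equals the sum of all columns over all cores. With the rounded table values it gives about 5.53 s of active time, close to the reported 5.510 s, and the same 22% ratio.

```python
# Per-core CPU time in seconds, copied from the "CPU Time (s)" table.
# Column order: USR, SYS, IDL, IOW, IRQ
cpu_time = [
    (0.03, 0.03, 3.11, 0.02, 0.00),
    (0.00, 0.32, 2.41, 0.46, 0.00),
    (0.01, 0.02, 3.17, 0.00, 0.00),
    (0.06, 0.10, 2.84, 0.22, 0.00),
    (0.04, 0.25, 2.91, 0.00, 0.00),
    (0.00, 1.72, 1.02, 0.40, 0.00),
    (0.03, 0.20, 1.48, 1.48, 0.00),
    (0.00, 0.03, 3.06, 0.11, 0.00),
]


def active_cpu_ratio(rows):
    """Return (active seconds, active ratio) over all cores.

    Assumption: active time is everything except IDL, and the denominator
    (elapsed time x cores) is approximated by the sum of all columns.
    """
    active = sum(usr + sys_ + iow + irq for usr, sys_, _idl, iow, irq in rows)
    total = sum(sum(row) for row in rows)
    return active, active / total


active, ratio = active_cpu_ratio(cpu_time)
print(f"Active CPU time: {active:.2f} s, active CPU ratio: {ratio:.0%}")
```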

12 of 21

Case Study: Storage (cont.)

Exercise 1

  • Clean up files cached in memory
    • sudo sysctl -w vm.drop_caches=3
  • Write zero bytes to a file on the local SSD: 500 blocks of 10 MB each
    • sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
    • sofa report --with_gui
    • Note how the overhead changes from time = 1550681154.2 onward (see page 37)

Exercise 2

  • Mount ramdisk onto /mnt/tmpfs
    • mkdir /mnt/tmpfs
    • mount -t tmpfs -o size=10g none /mnt/tmpfs
  • Write zero bytes to a file on the ramdisk: 500 blocks of 10 MB each
    • sofa record "dd if=/dev/zero of=/mnt/tmpfs/dummy.out bs=10M count=500"
    • sofa report --with_gui

Exercise 3

  • Which block size gives the best I/O throughput (bytes/s) on your computer, and why? Can you explain the result with SOFA or other profiling tools?
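
One way to start on Exercise 3 is a small block-size sweep. The sketch below is not a replacement for dd or SOFA: it writes a fixed total amount of zero bytes with different block sizes and reports bytes/s. The function name measure_write_throughput and the chosen sizes are assumptions for illustration. Unlike a plain dd run, it calls fsync so that flushing to the device is included in the timing.

```python
import os
import tempfile
import time


def measure_write_throughput(path: str, block_size: int, count: int) -> float:
    """Write `count` zero-filled blocks of `block_size` bytes; return bytes/s."""
    view = memoryview(bytes(block_size))
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(count):
            written = 0
            # os.write may write fewer bytes than requested, so loop.
            while written < block_size:
                written += os.write(fd, view[written:])
        os.fsync(fd)  # include the time needed to flush to the device
    finally:
        os.close(fd)
    elapsed = time.perf_counter() - start
    return block_size * count / elapsed


if __name__ == "__main__":
    total = 64 * 1024 * 1024  # keep the total fixed so only the block size varies
    with tempfile.TemporaryDirectory() as tmp:
        for block_size in (4 * 1024, 64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
            rate = measure_write_throughput(
                os.path.join(tmp, "dummy.out"), block_size, total // block_size
            )
            print(f"bs={block_size:>9} B  {rate / 1e6:10.1f} MB/s")
```

Point the path at the SSD or at /mnt/tmpfs to compare the two targets, then record the same run with sofa record to see where the time goes.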

13 of 21

Case Study: CUDA Memory Copy

What causes the network traces (i.e., tcpdump traces)?

Command:

sofa record ~/NVIDIA_CUDA-9.1_Samples/1_Utilities/bandwidthTest/bandwidthTest

14 of 21

SOFA - Advanced Usage

15 of 21

SOFA Advanced Usage

usage: sofa [-h] [--logdir /path/to/logdir/] [--verbose] [--pid PID]
            [--profile_all_cpus] [--enable_strace] [--enable_tcpdump]
            [--enable_py_stacks]
            [--perf_events "cycles,instructions,cache-misses"]
            [--blkdev BLKTRACE_DEVICE] [--netstat_interface NETSTAT_INTERFACE]
            [--nvprof_inside] [--skip_preprocess]
            [--gpu_filters "keyword1:color1,keyword2:color2"]
            [--cpu_filters "keyword1:color1,keyword2:color2"]
            [--cluster_ip "192.168.0.1,192.168.0.2"] [--cpu_top_k N]
            [--num_iterations N] [--num_swarms N] [--cpu_time_offset_ms N]
            [--strace_min_time F] [--plot_ratio N] [--viz_port N]
            [--enable_aisi] [--enable_encode_decode] [--aisi_via_strace]
            [--display_swarms] [--enable_swarms] [--base_logdir BASE_LOGDIR]
            [--match_logdir MATCH_LOGDIR] [--hsg_multifeatures]
            [--enable_vmstat] [--network_filters "ip1,ip2,ip3"]
            [--cuda_api_tracing] [--potato_server "ip:port"]
            [--absolute_timestamp] [--profile_region begin_time,end_time]
            [--spotlight_gpu] [--with_gui] [--nvsmi_time_zone 8]
            <SOFA_COMMAND> [<PROFILED_COMMAND>]
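
When SOFA runs are scripted, it helps to assemble the argument list programmatically. The following sketch only builds an argv list from the usage text above; it does not run SOFA. The helper name sofa_argv is hypothetical, and it assumes options may follow the profiled command, as they do in this deck's own examples.

```python
import shlex


def sofa_argv(sofa_command, profiled_command=None, **options):
    """Build an argv list of the form: sofa <SOFA_COMMAND> [<PROFILED_COMMAND>] [options].

    Hypothetical helper. Keyword names are used as flag names verbatim,
    e.g. logdir="log1" becomes "--logdir log1" and enable_strace=True
    becomes "--enable_strace".
    """
    argv = ["sofa", sofa_command]
    if profiled_command is not None:
        argv.append(profiled_command)  # kept as ONE argument, like the quoted string
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            argv.append(flag)
        elif value is not False and value is not None:
            argv.extend([flag, str(value)])
    return argv


argv = sofa_argv(
    "record",
    "dd if=/dev/zero of=dummy.out bs=10M count=500",
    logdir="log1",
    enable_strace=True,
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) on a machine where SOFA is installed and activated.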

16 of 21

SOFA Advanced Usage (cont.)

More performance metrics:

  • perf events:
    • sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500" --perf_events="cycles,instructions,cache-misses,branch-misses"
  • CUDA API tracing:
    • sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --cuda_api_tracing
  • System calls (strace):
    • sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --enable_strace
  • Network packets (tcpdump):
    • sofa record "sleep 5" --enable_tcpdump

Background recording for a daemon or a multi-command bash script:

  • sofa record "sleep 20" --profile_all_cpus
  • Then execute the target command while the recording is running

17 of 21

SOFA Advanced Usage (cont.)

Verbose mode shows more information, such as report-generation progress and detailed reports (e.g., total system call time):

sofa report --verbose

Automatically identify iterative swarms and report a per-iteration performance summary:

sofa report --enable_aisi --num_iterations 20

Display the top-10 hotspot swarms, highlighted in different colors:

sofa report --verbose --display_swarms

Reduce the number of points shown in the visualization:

sofa report --plot_ratio 10

Absolute or relative (default) timestamps:

sofa report

sofa report --absolute_timestamp

18 of 21

SOFA Advanced Usage (cont.)

Apply filters to highlight traces of interest:

sofa report --cpu_filters='tensorflow:orange' --gpu_filters='fw:blue,bw:red,nccl:purple'
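
The usage text gives the filter format as "keyword1:color1,keyword2:color2". The sketch below shows how such a string maps to keyword/color pairs; parse_filters is a hypothetical helper for checking a filter string before passing it to SOFA, not SOFA's own parser.

```python
def parse_filters(spec: str) -> dict:
    """Split "keyword1:color1,keyword2:color2" into {keyword: color}.

    Hypothetical helper that mirrors the format in the usage text.
    """
    filters = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        keyword, sep, color = item.partition(":")
        if not sep or not keyword or not color:
            raise ValueError(f"expected keyword:color, got {item!r}")
        filters[keyword] = color
    return filters


print(parse_filters("fw:blue,bw:red,nccl:purple"))
```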

Compare the traces of two runs swarm by swarm to find which swarms are affected by hardware, software, or system changes:

sofa record "dd if=/dev/zero of=dummy.out bs=100M count=10" --logdir log1

sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100" --logdir log2

sofa diff --base_logdir log1 --match_logdir log2
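
The two recorded runs are comparable because they write the same amount of data with different block sizes. The sketch below verifies this, assuming GNU dd size suffixes (K, M, G are powers of 1024; kB, MB, GB are powers of 1000); the helper names are hypothetical.

```python
import re

# GNU dd multipliers: K/M/G are binary, kB/MB/GB are decimal.
_MULTIPLIERS = {
    "": 1,
    "K": 1024, "M": 1024**2, "G": 1024**3,
    "kB": 1000, "MB": 1000**2, "GB": 1000**3,
}


def dd_size(text: str) -> int:
    """Convert a dd size operand such as "10M" to a number of bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", text)
    if match is None or match.group(2) not in _MULTIPLIERS:
        raise ValueError(f"unsupported dd size: {text!r}")
    return int(match.group(1)) * _MULTIPLIERS[match.group(2)]


def dd_total_bytes(command: str) -> int:
    """Total bytes a simple `dd ... bs=N count=M` command line transfers."""
    operands = dict(tok.split("=", 1) for tok in command.split() if "=" in tok)
    return dd_size(operands.get("bs", "512")) * dd_size(operands["count"])


run1 = "dd if=/dev/zero of=dummy.out bs=100M count=10"
run2 = "dd if=/dev/zero of=dummy.out bs=10M count=100"
print(dd_total_bytes(run1), dd_total_bytes(run2))
```

Since both totals are equal, differences that sofa diff reports between log1 and log2 can be attributed to the block size rather than to the amount of data written.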

19 of 21

Absolute or Relative (default) Timestamp

Command:

  • sofa record ~/samples/1_Utilities/bandwidthTest/bandwidthTest
  • sofa report OR sofa report --absolute_timestamp
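
The difference between the two modes can be sketched in a few lines. The example assumes that relative timestamps are measured from the first recorded event, which the deck does not state explicitly. The first timestamp is the one quoted in Exercise 1; the other two are made up for illustration.

```python
from datetime import datetime, timezone


def to_relative(timestamps, t0=None):
    """Shift absolute Unix timestamps so that t0 (default: first sample) is 0."""
    t0 = timestamps[0] if t0 is None else t0
    # Round to milliseconds to hide floating-point noise on large values.
    return [round(t - t0, 3) for t in timestamps]


def to_utc(timestamp):
    """Convert an absolute Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# 1550681154.2 is quoted in Exercise 1; the later two values are hypothetical.
absolute = [1550681154.2, 1550681155.2, 1550681156.7]
print(to_utc(absolute[0]).strftime("%Y-%m-%d %H:%M:%S UTC"))
print(to_relative(absolute))
```

Absolute timestamps make it possible to line up SOFA traces with external logs, while relative ones are easier to read within a single run.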

20 of 21

blktrace full example

sudo sofa record "sleep 10" --blkdev=/dev/sda1

OR

sudo sofa record "dd if=/dev/zero of=dummy.out bs=1K count=2000000" --blkdev /dev/sda1

sudo sofa report --blkdev=/dev/sda1 --with_gui

21 of 21

Appendix