Processes: Group A: 1 GPU, batch size 128 | Group B: 2 GPUs, batch size 256

Commands:
- Group A: python main-measure-time.py -a resnet18 -j 4 --gpu 1 --epochs 1 -b 128 --profile-batches 10 ~/data/test-accuracy/imagenette2 &> output128.txt
- Group B: python main-measure-time.py -a resnet18 -j 4 --epochs 1 -b 256 --profile-batches 10 ~/data/test-accuracy/imagenette2 &> output256.txt

Avg compute time per batch (s): Group A: 0.7514985561 | Group B: 0.7704561472 | Diff: 0.01895759106

I first divided the compute time into 3 parts: model, criterion, and the rest, so I can see which one differs the most.
| Per-batch time (s) | Group A (1 GPU) | Group B (2 GPUs) | Diff |
|---|---|---|---|
| model | 0.2536421537 | 0.2659458399 | 0.01230368614 |
| criterion | 0.00032394 | 0.000420237 | 0.000096297 |
| the rest | 0.4971353531 | 0.5041868925 | 0.007051539421 |

Sum of diffs: 0.01945152283

Notes: take the average, but do not include the data from the first two batches. All numbers are per batch.

Quick conclusion: it seems most of the diff comes from model, then the rest, then criterion.
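For reference, a minimal sketch of how this three-way split can be timed inside the training loop. The variable names (model, criterion, optimizer, images, target) follow main.py, but the timing code itself is my own illustration; torch.cuda.synchronize() is needed because CUDA kernels run asynchronously.

```python
import time
import torch

def timed_batch(model, criterion, optimizer, images, target):
    """Split one batch's compute time into model / criterion / the rest."""
    torch.cuda.synchronize()
    t0 = time.time()
    output = model(images)               # "model" part
    torch.cuda.synchronize()
    t1 = time.time()
    loss = criterion(output, target)     # "criterion" part
    torch.cuda.synchronize()
    t2 = time.time()
    optimizer.zero_grad()                # "the rest": backward + optimizer step
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()
    t3 = time.time()
    return t1 - t0, t2 - t1, t3 - t2
```

Averaging the three returned numbers over the profiled batches (skipping the first two) gives the table above.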
Now I should look into forward and backward, especially backward, which is where the two GPUs sync up.
First, check the functions inside model:
resnet.py -> _forward_impl: 1 GPU: 0.2006229401 | 2 GPUs: 0.206294024 | Diff: 0.005671083927
Could the difference between the two forward times be caused by the transfer time? Ans: not in the two forward calls here, but in model()'s scatter, where the images are put onto the GPUs.
First, check whether the 1-GPU transfer time is larger than the 2-GPU transfer time; if it is, we can be sure that with 2 GPUs the images are put onto the GPUs later.
Transfer time (time to put images on the GPU): 1 GPU: 0.0064349704 | 2 GPUs: 0.0002668964 | Diff: -0.0061680741
Note that the diff in transfer time is really close to (though a bit larger than) the diff in _forward_impl, so the transfer of the images onto the GPU definitely happens inside the model() call.
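A hedged sketch of how the 1-GPU transfer time can be measured; the images.cuda(...) line mirrors the single-GPU path in main.py, and the synchronize calls are needed because the host-to-device copy can be asynchronous.

```python
import time
import torch

def timed_transfer(images, gpu):
    """Time the explicit host-to-device copy done on the 1-GPU path."""
    torch.cuda.synchronize()
    t0 = time.time()
    images = images.cuda(gpu, non_blocking=True)   # the copy being measured
    torch.cuda.synchronize()                       # wait for the async copy to finish
    return images, time.time() - t0
```

With 2 GPUs main.py skips this explicit copy, which is why the 2-GPU transfer time above is near zero: the copy happens later, inside scatter.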
Quick note: in main.py, torch.nn.DataParallel(model).cuda() is only called once when using 2 GPUs, before the compute-time measurement, so it cannot be part of the reason why the compute time is longer.
Now we need to find where the images are put onto the GPU inside model; check the call stack first.
I got this by adding one extra input to get an error message and see who called model(images):
model(images) -> module.py's _call_impl
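An alternative to forcing an error: print the Python call stack from a forward pre-hook, which shows the same chain (main.py -> module.py's _call_impl -> data_parallel.py's forward). Sketch only; torchvision's resnet18 stands in for the model built in main.py.

```python
import traceback
import torch
import torchvision

def show_callers(module, inputs):
    # Runs inside module.py's _call_impl, right before forward; print who got us here.
    traceback.print_stack()

model = torch.nn.DataParallel(torchvision.models.resnet18()).cuda()  # stand-in for main.py's model
model.register_forward_pre_hook(show_callers)
model(torch.randn(8, 3, 224, 224))   # one small batch is enough to see the call chain
```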
Went into file module.py (07/21/2023). Questions:
1. Where is the transfer point for the images with 2 GPUs? Ans: it happens in /home/cc/pytorch-meng/torch/csrc/cuda/comm.cpp:

```cpp
chunks[i] = chunks[i].to(
    {DeviceType::CUDA, device_index},
    /*non_blocking=*/true,
    /*copy=*/false,
    /*memory_format=*/at::MemoryFormat::Preserve);
```

2. Where does the forward overhead for 2 GPUs come from?
3. Where does the 0.05 sec inside "model" happen, if it is not in _forward_impl?

On 1: 100% sure that the images get onto the GPU after inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids).
Measure the time taken by scatter, for both 1 GPU and 2 GPUs:
Scatter time: 1 GPU: 0 | 2 GPUs: 0.0064803362 | Diff: 0.0064803362
We notice that the diff in scatter plus the diff in forward adds up to very nearly the diff in model time: 0.0121514 | actual diff in model: 0.0123036861 | difference: 0.0001522660
Why are the times to put the images on the GPU similar? Shouldn't the scatter time be smaller? One channel? Ans: there is just a for loop, not two parallel threads, so the times are similar, not shorter.
Found it in _functions.py, in Scatter's forward.
Inside /home/cc/pytorch-meng/torch/nn/parallel/comm.py there is a call to torch._C._scatter(tensor, devices, chunk_sizes, dim, streams), which puts the tensor onto CUDA.
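To see the "just a for loop" behaviour from user code, torch.cuda.comm.scatter (the Python wrapper that ends up in torch._C._scatter) can be timed directly. This is a sketch, not the exact instrumentation used for the numbers above.

```python
import time
import torch
import torch.cuda.comm as comm

x = torch.randn(256, 3, 224, 224).pin_memory()   # one full CPU batch

def sync_all():
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)

def timed_scatter(devices):
    sync_all()
    t0 = time.time()
    chunks = comm.scatter(x, devices)    # copies chunk i to devices[i], one device after another
    sync_all()
    return time.time() - t0, [tuple(c.shape) for c in chunks]

print(timed_scatter([0]))      # whole batch onto a single GPU
print(timed_scatter([0, 1]))   # two halves; the total copy volume (and time) stays about the same
```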
What actions are done to the images before they reach the point where they are put onto the GPU? Ans: recorded right below.
Actions done to the images when using 2 GPUs (see the scatter sketch after this list):
1. In main.py, the images are passed as input to model(): output = model(images)
2. model then calls the forward function in /home/cc/pytorch-meng/torch/nn/modules/module.py: result = self.forward(*input, **kwargs); note that input = images.
3. That in turn calls the forward function in /home/cc/pytorch-meng/torch/nn/parallel/data_parallel.py, def forward(self, *inputs, **kwargs):, and *inputs = images.
4. The images then go into scatter: inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids); len(inputs) goes from 1 to 2 after scatter, and both chunks are on the GPUs afterward.
5. This gets into /home/cc/pytorch-meng/torch/nn/parallel/scatter_gather.py: it calls scatter_kwargs(inputs, kwargs, target_gpus, dim=0), then scatter(inputs, target_gpus, dim=0).
6. The images are unpacked so that len(obj) = 256 (the batch dimension) and the data can be distributed onto the GPUs. It calls return Scatter.apply(target_gpus, None, dim, obj); here obj = images and len(obj) = 256.
7. It then gets to /home/cc/pytorch-meng/torch/nn/parallel/_functions.py: outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams).
8. It then gets to /home/cc/pytorch-meng/torch/nn/parallel/comm.py, which calls tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams)); the images are distributed there, and the rest is C++ code.
9. The split images are now on the GPUs; forward in resnet.py is then called.
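A quick sketch that checks the end result of steps 4-9 with the same helper DataParallel uses (torch.nn.parallel re-exports scatter from scatter_gather.py); the batch size of 256 matches Group B.

```python
import torch
from torch.nn.parallel import scatter

images = torch.randn(256, 3, 224, 224)            # obj = images, len(obj) = 256 (step 6)
inputs = scatter((images,), target_gpus=[0, 1], dim=0)

print(len(inputs))                                # 2: one tuple of arguments per GPU (step 4)
for (chunk,) in inputs:
    print(chunk.shape, chunk.device)              # torch.Size([128, 3, 224, 224]) on cuda:0 / cuda:1
```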
41 | ||||||||||||||||||||||||||
Quick check on dim in /home/cc/pytorch-meng/torch/nn/parallel/comm.py when using 2 GPUs:

| Now check on: | 1 GPU | 2 GPUs | Diff of _forward_impl |
|---|---|---|---|
| resnet.py -> _forward_impl | 0.20062294 | 0.20629402 | 0.00567108 (double-checked: 0.00591906) |

Two ways to emulate: 1. emulate the time; 2. actually emulate it when the data gets onto the GPU.
The diffs below are recorded over the whole epoch:

| Diff in A | Diff in B | Diff in C |
|---|---|---|
| 0.0036772 | -0.0001204 | 0.0003126 |

The diffs below are recorded over 10 batches:

| Diff in A | Diff in B | Diff in C |
|---|---|---|
| 0.0032056 | 0.0025234 | 0.0001092 |

We notice that the differences in sections B and C are indeed trivial over the whole epoch, so we will dig into section A.
More specifically, the second forward call takes about 0.008 s longer to run, and the first forward about 0.001 s longer.
After some consideration, I thought a size difference between the tensors might be the key.
The size of the tensor is torch.Size([128, 3, 224, 224]) in both cases.
Not really, then: the sizes of the tensors are the same.
Stop here; try working on "the rest" section first. Recall that the time diff in the rest section is 0.0070515.
Forward does not contribute to the time diff in model.
Now that we know scatter takes 0.006, find the other 0.009 of the time diff in model.
Replicate: calls the function in replicate.py in the parallel directory: /home/cc/pytorch-meng/torch/nn/parallel/replicate.py
Quick summary: self.module is the ResNet architecture, and there is no simple way to copy the whole architecture, so the function produces two identical modules called replicas.
The replicate function consists mainly of enumerations and detailed checks over each layer; the enumeration takes O(M*N*K) time depending on the layers and basic blocks.
These enumerations are the fundamental factor that determines how fast the function runs.
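A sketch of timing replicate on its own; torchvision's resnet18 stands in for the model built in main.py, and the enumeration over layers is what this call spends its time on.

```python
import time
import torch
import torchvision
from torch.nn.parallel import replicate

model = torchvision.models.resnet18().cuda(0)

torch.cuda.synchronize(0)
t0 = time.time()
replicas = replicate(model, [0, 1])       # enumerate every layer and build one copy per GPU
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
print(len(replicas), 'replicas in %.6f s' % (time.time() - t0))
```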
parallel_apply: takes the two inputs and the two module replicas and spawns threads to call them. I have found exactly where it calls forward, so if there is any overhead here, it is caused by the parallelization.
gather: gathers tensors from different GPUs onto a specified device (-1 means the CPU).
gather is very similar to scatter because it eventually leads back to the ._C._gather function, which does the work in C++.
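Putting the four pieces together with the public helpers in torch.nn.parallel reproduces what DataParallel.forward does with 2 GPUs, and shows what gather receives and returns. Sketch only, with torchvision's resnet18 standing in for the real model.

```python
import torch
import torchvision
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

model = torchvision.models.resnet18().cuda(0)
images = torch.randn(256, 3, 224, 224)

inputs   = scatter((images,), [0, 1])          # two input tuples, one per GPU
replicas = replicate(model, [0, 1])            # two identical copies of the module
outputs  = parallel_apply(replicas, inputs)    # one thread per replica runs forward
result   = gather(outputs, 0)                  # concatenate the outputs back on GPU 0

print([tuple(o.shape) for o in outputs])       # [(128, 1000), (128, 1000)] on cuda:0 and cuda:1
print(result.shape, result.device)             # torch.Size([256, 1000]) cuda:0
```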
About gather: what did it gather? The outputs, which feed automatic differentiation and gradient computation for backpropagation. Condensed, the gathered input is a list of two tensors:
Outputs: [tensor([[-7.8075e-01, -5.9706e-02, ...]], device='cuda:0', grad_fn=<AddmmBackward>), tensor([[-1.3464, 0.0287, ...]], device='cuda:1', grad_fn=<AddmmBackward>)]
self.output_device is 0, so it gathers the information onto the first device.
Format and shape of the outputs? [0]: torch.Size([128, 1000]), [1]: torch.Size([128, 1000]), i.e. two separate outputs.
What happens after calling gather? It goes through /home/cc/pytorch-meng/torch/nn/parallel/comm.py and eventually ends up in /home/cc/pytorch-meng/torch/csrc/cuda/comm.cpp.
Week 9:
criterion:
- criterion = nn.CrossEntropyLoss().cuda(args.gpu)
- CrossEntropyLoss (torch/nn/modules/loss.py)
- loss = criterion(output, target)
- CrossEntropyLoss.forward(output, target)
- cross_entropy (torch/nn/functional.py) -> nll_loss()
With 1 GPU: the shape of output before criterion is torch.Size([128, 1000]); the shape of target is torch.Size([128]).
With 2 GPUs: the shape of output before criterion is torch.Size([256, 1000]); the shape of target is torch.Size([256]).
The resulting loss is a single floating-point number.
When using 2 GPUs, output and target are both on GPU 0.
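A small sketch of the criterion step with the 2-GPU shapes noted above; the data is random and only meant to show the shapes and devices involved.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss().cuda(0)

output = torch.randn(256, 1000, device='cuda:0', requires_grad=True)  # gathered model output
target = torch.randint(0, 1000, (256,), device='cuda:0')              # labels, also on GPU 0

loss = criterion(output, target)      # cross_entropy -> nll_loss under the hood
print(loss.shape, loss.device)        # torch.Size([]) cuda:0 -- a single scalar
```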
Measuring the rest section:
Found that the time difference in the rest section is not quite constant; need to verify it twice.
Mark out the batches that are too big or otherwise odd.
From the data collected in the rest section, we can conclude that the main time difference happens in the backward() function.
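One way to look inside backward(), where the main difference appears to be, is the autograd profiler. This is a sketch with torchvision's resnet18 under DataParallel as a stand-in for the real run; the profiler API may differ slightly across torch versions.

```python
import torch
import torchvision

model = torch.nn.DataParallel(torchvision.models.resnet18()).cuda()
criterion = torch.nn.CrossEntropyLoss().cuda(0)

images = torch.randn(256, 3, 224, 224)
target = torch.randint(0, 1000, (256,)).cuda(0)

loss = criterion(model(images), target)
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    loss.backward()                    # the call the time difference points to
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```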
Notice that when using 1 GPU, .backward() calls:
  File "/home/cc/gpufs/emulator/inter-gpu-study/imagenette/main-measure-time.py", line 412, in train
    loss.backward()
  File "/home/cc/anaconda3/lib/python3.9/site-packages/torch/tensor.py", line 227, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
It does not go into the apply function at line 90 of /home/cc/anaconda3/lib/python3.9/site-packages/torch/autograd/function.py.

We see that it goes into the torch installed in the Python site-packages instead of going into the local pytorch-meng PyTorch.
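A one-line check of which installation the trace is walking through; torch.__file__ points at the package that was actually imported.

```python
import torch

# Which torch is imported: the anaconda site-packages build or the local pytorch-meng tree?
print(torch.__file__)
print(torch.__version__)
```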
However, when we have 2 GPUs, backward() does share some of the functions used by scatter:
  File "/home/cc/gpufs/emulator/inter-gpu-study/imagenette/main-measure-time.py", line 413, in train
    loss.backward()
  File "/home/cc/anaconda3/lib/python3.9/site-packages/torch/tensor.py", line 227, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/cc/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward