Processes: Group A: 1 GPU, batch size 128 | Group B: 2 GPUs, batch size 256

Commands:
- Group A: python main-measure-time.py -a resnet18 -j 4 --gpu 1 --epochs 1 -b 128 --profile-batches 10 ~/data/test-accuracy/imagenette2 &> output128.txt
- Group B: python main-measure-time.py -a resnet18 -j 4 --epochs 1 -b 256 --profile-batches 10 ~/data/test-accuracy/imagenette2 &> output256.txt

Avg compute time per batch (s): Group A: 0.7514985561 | Group B: 0.7704561472 | Diff: 0.01895759106

I first divided the compute time into 3 parts: model, criterion, and the rest, so I can see which one differs the most.
| Per-batch time (s) | Group A (1 GPU) | Group B (2 GPUs) | Diff |
|---|---|---|---|
| model | 0.2536421537 | 0.2659458399 | 0.01230368614 |
| criterion | 0.00032394 | 0.000420237 | 0.000096297 |
| the rest | 0.4971353531 | 0.5041868925 | 0.007051539421 |

Sum of diffs: 0.01945152283

Notes: take the average, but do not include the data from the first two batches. All numbers are per batch.

Quick conclusion: it seems most of the diff comes from model, then the rest, then criterion.
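For reference, a minimal sketch of how this three-way split can be timed inside the training loop. The variable names (model, criterion, optimizer, images, target) follow main.py, but the timing code itself is my own illustration; torch.cuda.synchronize() is needed because CUDA kernels run asynchronously.

```python
import time
import torch

def timed_batch(model, criterion, optimizer, images, target):
    """Split one batch's compute time into model / criterion / the rest."""
    torch.cuda.synchronize()
    t0 = time.time()
    output = model(images)               # "model" part
    torch.cuda.synchronize()
    t1 = time.time()
    loss = criterion(output, target)     # "criterion" part
    torch.cuda.synchronize()
    t2 = time.time()
    optimizer.zero_grad()                # "the rest": backward + optimizer step
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()
    t3 = time.time()
    return t1 - t0, t2 - t1, t3 - t2
```

Averaging the three returned numbers over the profiled batches (skipping the first two) gives the table above.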
Now I should look into forward and backward, especially backward, which is where the two GPUs sync up.
First, check the functions inside model:
resnet.py -> _forward_impl: 1 GPU: 0.2006229401 | 2 GPUs: 0.206294024 | Diff: 0.005671083927
Could the difference between the two forward times be caused by the transfer time? Ans: not in the two forward calls here, but in model()'s scatter, where the images are put onto the GPUs.
First, check whether the 1-GPU transfer time is larger than the 2-GPU transfer time; if it is, we can be sure that with 2 GPUs the images are put onto the GPUs later.
Transfer time (time to put images on the GPU): 1 GPU: 0.0064349704 | 2 GPUs: 0.0002668964 | Diff: -0.0061680741
Note that the diff in transfer time is really close to (though a bit larger than) the diff in _forward_impl, so the transfer of the images onto the GPU definitely happens inside the model() call.
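A hedged sketch of how the 1-GPU transfer time can be measured; the images.cuda(...) line mirrors the single-GPU path in main.py, and the synchronize calls are needed because the host-to-device copy can be asynchronous.

```python
import time
import torch

def timed_transfer(images, gpu):
    """Time the explicit host-to-device copy done on the 1-GPU path."""
    torch.cuda.synchronize()
    t0 = time.time()
    images = images.cuda(gpu, non_blocking=True)   # the copy being measured
    torch.cuda.synchronize()                       # wait for the async copy to finish
    return images, time.time() - t0
```

With 2 GPUs main.py skips this explicit copy, which is why the 2-GPU transfer time above is near zero: the copy happens later, inside scatter.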
Quick note: in main.py, torch.nn.DataParallel(model).cuda() is only called once when using 2 GPUs, before the compute-time measurement, so it cannot be part of the reason why the compute time is longer.
Now we need to find where the images are put onto the GPU inside model; check the call stack first.
I got this by adding one extra input to get an error message and see who called model(images):
model(images) -> module.py's _call_impl
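An alternative to forcing an error: print the Python call stack from a forward pre-hook, which shows the same chain (main.py -> module.py's _call_impl -> data_parallel.py's forward). Sketch only; torchvision's resnet18 stands in for the model built in main.py.

```python
import traceback
import torch
import torchvision

def show_callers(module, inputs):
    # Runs inside module.py's _call_impl, right before forward; print who got us here.
    traceback.print_stack()

model = torch.nn.DataParallel(torchvision.models.resnet18()).cuda()  # stand-in for main.py's model
model.register_forward_pre_hook(show_callers)
model(torch.randn(8, 3, 224, 224))   # one small batch is enough to see the call chain
```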
Went into file module.py (07/21/2023). Questions:
1. Where is the transfer point for the images with 2 GPUs? Ans: it happens in /home/cc/pytorch-meng/torch/csrc/cuda/comm.cpp:

```cpp
chunks[i] = chunks[i].to(
    {DeviceType::CUDA, device_index},
    /*non_blocking=*/true,
    /*copy=*/false,
    /*memory_format=*/at::MemoryFormat::Preserve);
```

2. Where does the forward overhead for 2 GPUs come from?
3. Where does the 0.05 sec inside "model" happen, if it is not in _forward_impl?

On 1: 100% sure that the images get onto the GPU after inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids).
Measure the time taken by scatter, for both 1 GPU and 2 GPUs:
Scatter time: 1 GPU: 0 | 2 GPUs: 0.0064803362 | Diff: 0.0064803362
We notice that the diff in scatter plus the diff in forward adds up to very nearly the diff in model time: 0.0121514 | actual diff in model: 0.0123036861 | difference: 0.0001522660
Why are the times to put the images on the GPU similar? Shouldn't the scatter time be smaller? One channel? Ans: there is just a for loop, not two parallel threads, so the times are similar, not shorter.
Found it in _functions.py, in Scatter's forward.
Inside /home/cc/pytorch-meng/torch/nn/parallel/comm.py there is a call to torch._C._scatter(tensor, devices, chunk_sizes, dim, streams), which puts the tensor onto CUDA.
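To see the "just a for loop" behaviour from user code, torch.cuda.comm.scatter (the Python wrapper that ends up in torch._C._scatter) can be timed directly. This is a sketch, not the exact instrumentation used for the numbers above.

```python
import time
import torch
import torch.cuda.comm as comm

x = torch.randn(256, 3, 224, 224).pin_memory()   # one full CPU batch

def sync_all():
    for d in range(torch.cuda.device_count()):
        torch.cuda.synchronize(d)

def timed_scatter(devices):
    sync_all()
    t0 = time.time()
    chunks = comm.scatter(x, devices)    # copies chunk i to devices[i], one device after another
    sync_all()
    return time.time() - t0, [tuple(c.shape) for c in chunks]

print(timed_scatter([0]))      # whole batch onto a single GPU
print(timed_scatter([0, 1]))   # two halves; the total copy volume (and time) stays about the same
```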
What actions are done to the images before they reach the point where they are put onto the GPU? Ans: recorded right below.
Actions done to the images when using 2 GPUs (see the scatter sketch after this list):
1. In main.py, the images are passed as input to model(): output = model(images)
2. model then calls the forward function in /home/cc/pytorch-meng/torch/nn/modules/module.py: result = self.forward(*input, **kwargs); note that input = images.
3. That in turn calls the forward function in /home/cc/pytorch-meng/torch/nn/parallel/data_parallel.py, def forward(self, *inputs, **kwargs):, and *inputs = images.
4. The images then go into scatter: inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids); len(inputs) goes from 1 to 2 after scatter, and both chunks are on the GPUs afterward.
5. This gets into /home/cc/pytorch-meng/torch/nn/parallel/scatter_gather.py: it calls scatter_kwargs(inputs, kwargs, target_gpus, dim=0), then scatter(inputs, target_gpus, dim=0).
6. The images are unpacked so that len(obj) = 256 (the batch dimension) and the data can be distributed onto the GPUs. It calls return Scatter.apply(target_gpus, None, dim, obj); here obj = images and len(obj) = 256.
7. It then gets to /home/cc/pytorch-meng/torch/nn/parallel/_functions.py: outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams).
8. It then gets to /home/cc/pytorch-meng/torch/nn/parallel/comm.py, which calls tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams)); the images are distributed there, and the rest is C++ code.
9. The split images are now on the GPUs; forward in resnet.py is then called.
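A quick sketch that checks the end result of steps 4-9 with the same helper DataParallel uses (torch.nn.parallel re-exports scatter from scatter_gather.py); the batch size of 256 matches Group B.

```python
import torch
from torch.nn.parallel import scatter

images = torch.randn(256, 3, 224, 224)            # obj = images, len(obj) = 256 (step 6)
inputs = scatter((images,), target_gpus=[0, 1], dim=0)

print(len(inputs))                                # 2: one tuple of arguments per GPU (step 4)
for (chunk,) in inputs:
    print(chunk.shape, chunk.device)              # torch.Size([128, 3, 224, 224]) on cuda:0 / cuda:1
```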
41 | ||||||||||||||||||||||||||
Quick check on dim in /home/cc/pytorch-meng/torch/nn/parallel/comm.py when using 2 GPUs:

| Now check on: | 1 GPU | 2 GPUs | Diff of _forward_impl |
|---|---|---|---|
| resnet.py -> _forward_impl | 0.20062294 | 0.20629402 | 0.00567108 (double-checked: 0.00591906) |

Two ways to emulate: 1. emulate the time; 2. actually emulate it when the data gets onto the GPU.
The diffs below are recorded over the whole epoch:

| Diff in A | Diff in B | Diff in C |
|---|---|---|
| 0.0036772 | -0.0001204 | 0.0003126 |

The diffs below are recorded over 10 batches:

| Diff in A | Diff in B | Diff in C |
|---|---|---|
| 0.0032056 | 0.0025234 | 0.0001092 |

We notice that the differences in sections B and C are indeed trivial over the whole epoch, so we will dig into section A.
More specifically, the second forward call takes about 0.008 s longer to run, and the first forward about 0.001 s longer.
After some consideration, I thought a size difference between the tensors might be the key.
The size of the tensor is torch.Size([128, 3, 224, 224]) in both cases.
Not really, then: the sizes of the tensors are the same.
Stop here; try working on "the rest" section first. Recall that the time diff in the rest section is 0.0070515.
Forward does not contribute to the time diff in model.
Now that we know scatter takes 0.006, find the other 0.009 of the time diff in model.
Replicate: calls the function in replicate.py in the parallel directory: /home/cc/pytorch-meng/torch/nn/parallel/replicate.py
Quick summary: self.module is the ResNet architecture, and there is no simple way to copy the whole architecture, so the function produces two identical modules called replicas.
The replicate function consists mainly of enumerations and detailed checks over each layer; the enumeration takes O(M*N*K) time depending on the layers and basic blocks.
These enumerations are the fundamental factor that determines how fast the function runs.
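A sketch of timing replicate on its own; torchvision's resnet18 stands in for the model built in main.py, and the enumeration over layers is what this call spends its time on.

```python
import time
import torch
import torchvision
from torch.nn.parallel import replicate

model = torchvision.models.resnet18().cuda(0)

torch.cuda.synchronize(0)
t0 = time.time()
replicas = replicate(model, [0, 1])       # enumerate every layer and build one copy per GPU
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
print(len(replicas), 'replicas in %.6f s' % (time.time() - t0))
```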
parallel_apply: takes the two inputs and the two module replicas and spawns threads to call them. I have found exactly where it calls forward, so if there is any overhead here, it is caused by the parallelization.
gather: gathers tensors from different GPUs onto a specified device (-1 means the CPU).
gather is very similar to scatter because it eventually leads back to the ._C._gather function, which does the work in C++.
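Putting the four pieces together with the public helpers in torch.nn.parallel reproduces what DataParallel.forward does with 2 GPUs, and shows what gather receives and returns. Sketch only, with torchvision's resnet18 standing in for the real model.

```python
import torch
import torchvision
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

model = torchvision.models.resnet18().cuda(0)
images = torch.randn(256, 3, 224, 224)

inputs   = scatter((images,), [0, 1])          # two input tuples, one per GPU
replicas = replicate(model, [0, 1])            # two identical copies of the module
outputs  = parallel_apply(replicas, inputs)    # one thread per replica runs forward
result   = gather(outputs, 0)                  # concatenate the outputs back on GPU 0

print([tuple(o.shape) for o in outputs])       # [(128, 1000), (128, 1000)] on cuda:0 and cuda:1
print(result.shape, result.device)             # torch.Size([256, 1000]) cuda:0
```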
About gather: what did it gather? The outputs, which feed automatic differentiation and gradient computation for backpropagation. Condensed, the gathered input is a list of two tensors:
Outputs: [tensor([[-7.8075e-01, -5.9706e-02, ...]], device='cuda:0', grad_fn=<AddmmBackward>), tensor([[-1.3464, 0.0287, ...]], device='cuda:1', grad_fn=<AddmmBackward>)]
self.output_device is 0, so it gathers the information onto the first device.
Format and shape of the outputs? [0]: torch.Size([128, 1000]), [1]: torch.Size([128, 1000]), i.e. two separate outputs.
What happens after calling gather? It goes through /home/cc/pytorch-meng/torch/nn/parallel/comm.py and eventually ends up in /home/cc/pytorch-meng/torch/csrc/cuda/comm.cpp.
Week 9:
criterion:
- criterion = nn.CrossEntropyLoss().cuda(args.gpu)
- CrossEntropyLoss (torch/nn/modules/loss.py)
- loss = criterion(output, target)
- CrossEntropyLoss.forward(output, target)
- cross_entropy (torch/nn/functional.py) -> nll_loss()
With 1 GPU: the shape of output before criterion is torch.Size([128, 1000]); the shape of target is torch.Size([128]).
With 2 GPUs: the shape of output before criterion is torch.Size([256, 1000]); the shape of target is torch.Size([256]).
The resulting loss is a single floating-point number.
When using 2 GPUs, output and target are both on GPU 0.
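A small sketch of the criterion step with the 2-GPU shapes noted above; the data is random and only meant to show the shapes and devices involved.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss().cuda(0)

output = torch.randn(256, 1000, device='cuda:0', requires_grad=True)  # gathered model output
target = torch.randint(0, 1000, (256,), device='cuda:0')              # labels, also on GPU 0

loss = criterion(output, target)      # cross_entropy -> nll_loss under the hood
print(loss.shape, loss.device)        # torch.Size([]) cuda:0 -- a single scalar
```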
Measuring the rest section:
Found that the time difference in the rest section is not quite constant; need to verify it twice.
Mark out the batches that are too big or otherwise odd.
From the data collected in the rest section, we can conclude that the main time difference happens in the backward() function.
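One way to look inside backward(), where the main difference appears to be, is the autograd profiler. This is a sketch with torchvision's resnet18 under DataParallel as a stand-in for the real run; the profiler API may differ slightly across torch versions.

```python
import torch
import torchvision

model = torch.nn.DataParallel(torchvision.models.resnet18()).cuda()
criterion = torch.nn.CrossEntropyLoss().cuda(0)

images = torch.randn(256, 3, 224, 224)
target = torch.randint(0, 1000, (256,)).cuda(0)

loss = criterion(model(images), target)
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    loss.backward()                    # the call the time difference points to
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```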
Notice that when using 1 GPU, .backward() calls:
  File "/home/cc/gpufs/emulator/inter-gpu-study/imagenette/main-measure-time.py", line 412, in train
    loss.backward()
  File "/home/cc/anaconda3/lib/python3.9/site-packages/torch/tensor.py", line 227, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
It does not go into the apply function at line 90 of /home/cc/anaconda3/lib/python3.9/site-packages/torch/autograd/function.py.

We see that it goes into the torch installed in the Python site-packages instead of going into the local pytorch-meng PyTorch.
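A one-line check of which installation the trace is walking through; torch.__file__ points at the package that was actually imported.

```python
import torch

# Which torch is imported: the anaconda site-packages build or the local pytorch-meng tree?
print(torch.__file__)
print(torch.__version__)
```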
However, when we have 2 GPUs, backward() does share some of the functions used by scatter:
  File "/home/cc/gpufs/emulator/inter-gpu-study/imagenette/main-measure-time.py", line 413, in train
    loss.backward()
  File "/home/cc/anaconda3/lib/python3.9/site-packages/torch/tensor.py", line 227, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/cc/anaconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward