| Reviewer comment | Author 1 | Author N |
|---|---|---|
| 1. There exist several related papers discussing the use of human attention maps in image captioning and visual question answering, e.g., (1) Liu et al., Attention correctness in neural image captioning; (2) Qiao et al., Exploring human-like attention supervision in visual question answering. Please illustrate the differences from these papers. | The papers mentioned provide attention supervision over the attention layer. Our central argument for this and the next point will be that Grad-CAM is more faithful to the model than attention. To show this, I am planning to run occlusion studies in the proposal space and compare the resulting importances with the attention weights and the Grad-CAM proposal importance weights (see the occlusion sketch below the table). Also, with attention supervision only the layers before the attention layer can be updated, whereas with HINT all layer weights can be updated. Include line references from the paper. Attention supervision doesn't work. | We should first state very clearly what you say in the first sentence of your response, and expand on it a bit if needed to make the point clearly. Does the paper differentiate our work from the works R2 cites, or other such works? If so, we should say explicitly in the response "As discussed in LXYZ-ABC...". You can then make the point about which layers can be updated, and additionally make the "central argument" point. But the direct response should be clear and not confused with the description of a new experiment. |
| 2. It seems that the ground-truth attention map is used for the VQA task. For the captioning task, although no ground-truth attention map is used, segmentation maps are used. Compared with other methods, strong information about the image is therefore incorporated, which should result in performance improvements. | Human attention and segmentation maps are used only during training and not during testing. While we agree that this is extra information used during training, we show why other approaches fail to utilize this information to achieve performance improvements at test time. Only a fraction of images in VQA have human attention. Also, if it were possible to get such a boost with just human attention, people would start collecting it at scale. Finally, HATs are important for knowing whether models are making the right decisions for the right reasons. | "we show why other approaches fail to utilize this information to achieve performance improvements at test time." You'll have to point to a specific experiment / lines / table in the paper and reproduce the crucial numbers here to support this claim. Then you can say this is only at training time, not at test time. (I think the reviewer already knows this, so starting with it is not a strong opening.) |
| 3. For the alignment between human attention and network importance, it seems that there only exists one human attention map. However, for the network importance, specifically in the VQA task, when the question is different we should pay attention to different local regions. How are the corresponding alignments made? The human attention is stable, while the network importance varies dynamically. | This is incorrect. HATs are question dependent, i.e., there exist different maps for different questions. | Yup. Say there are different human attention maps for different questions, so the human attention map is also question dependent. |
| | | |
| The state of the art and beyond in the field is moving away from such human-guided approaches. Localization is already being done in a wholly unsupervised fashion, using embeddings for example. Also, the proposed approach is not scalable. Verification is done through human studies, which is fine but again not scalable. | Highly opinionated, with no citations. I don't agree that the state of the art "and beyond" (whatever that means) is moving away from human-guided approaches. Disagree that localization is done in a wholly unsupervised way: approaches for semi-supervised localization exist, but they are still significantly worse than fully supervised approaches (xx% difference on ILSVRC localization). Also disagree with the comment that verification is done only through human studies. In Section 5 we quantitatively evaluate task performance, and in Section 6 we quantitatively evaluate grounding, both of which show the effectiveness of HINT without requiring human workers. Our human studies are needed to show that our HINTed models are more trustworthy to humans than base models, which matters not just for generalization but is also necessary as more algorithmic decisions are made in society. | Also maybe make a point about how, without human guidance, models can have good accuracies but still be heavily biased (give examples from VQA maybe)? |
| It would be helpful to have an ablation-like study in which you increase or decrease the level of HINTs and see what happens, to get deeper insight into what you are doing. | I can set this up. We can vary the amount of HATs used and examine how performance varies (see the ablation sketch below the table). | Yup |
| | | |
| The method to set the ground-truth importance scores seems hacky, especially for image captioning. As I can imagine, there may be multiple objects of the same category, and the HINT supervision will highlight all of them when generating the word. For example, assume there are 3 people in a park and only 1 person is throwing a frisbee, and the ground-truth caption is 'A man is throwing a frisbee.' It is not appropriate to highlight all 3 people. | I completely agree; this problem does exist due to the way we use annotations for captioning. Mention that this is a first step, and that such cases, although infrequent, would make the model look at more than the correct regions. In future work we plan to address such scenarios, essentially by modifying the loss so that the model is heavily penalized if it places mass on incorrect regions, and penalized much less if it misses some regions present in the segmentation (see the loss sketch below the table). This would use the same amount of supervision while addressing the scenarios pointed out by R3. | We can also say that this allowed us to use existing annotations that were collected for a different task, which is nice. |
| The author clearly states the importance of aligning the important regions; however, the reason why aligning the gradient-based explanation is better is not made clear or analyzed in detail. | The experiment above, showing that simple attention is not entirely faithful to the model while the gradient-based explanation is more faithful, will help answer this comment (again, see the occlusion sketch below the table). I also think it's important to state that with attention supervision the later-layer parameters cannot be updated, but with HINT they can be, since network importance is a function of all the weights of the network. | If you think that experiment helps here more than in the earlier comment, maybe mention it here and not there so the earlier response is cleaner? Not sure... your call. Or maybe it is better to club this, the earlier comment, and the next one into one response (while being clear in the rebuttal that it is in response to all three). |
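
**Occlusion sketch** (referenced in the responses above): a minimal, self-contained illustration of the planned faithfulness check, not the paper's code. The model, its attention weights, and its Grad-CAM importances are toy placeholders (`score_fn`, `attn`, `gradcam` are assumptions); the real experiment would plug in the VQA model's proposal features and answer score.

```python
"""Occlusion sketch: for each proposal, zero out its feature, measure the
drop in the answer score, and rank-correlate the drops with (a) attention
weights and (b) Grad-CAM proposal importances."""
import numpy as np

rng = np.random.default_rng(0)

def spearman(a, b):
    # Spearman rho = Pearson correlation of the ranks (no-ties case).
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Toy stand-ins: 36 proposal features and a linear "answer head".
num_props, dim = 36, 8
feats = rng.normal(size=(num_props, dim))
w = rng.normal(size=dim)

def score_fn(f):
    # Placeholder for P(answer | image, question): mean-pooled linear score.
    return float(w @ f.mean(axis=0))

base = score_fn(feats)
drops = np.empty(num_props)
for i in range(num_props):
    occluded = feats.copy()
    occluded[i] = 0.0  # occlude proposal i
    drops[i] = base - score_fn(occluded)

# The real experiment would use the model's actual attention weights and
# Grad-CAM importances; these are random/noisy placeholders.
attn = rng.random(num_props)
gradcam = drops + 0.1 * rng.normal(size=num_props)

print("attention vs occlusion rho =", spearman(attn, drops))
print("Grad-CAM  vs occlusion rho =", spearman(gradcam, drops))
```

A higher rank correlation with the occlusion-induced drops would support the "Grad-CAM is more faithful than attention" argument quantitatively.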
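**Ablation sketch** for varying the amount of HAT supervision, as suggested by the reviewer. `train_hint_model` and `evaluate` are hypothetical stand-ins (dummy bodies) for the actual training and evaluation pipeline; only the subsampling logic is the point.

```python
"""Ablation sketch: vary the fraction of training examples that keep their
human attention map (HAT) and track downstream accuracy."""
import random

def train_hint_model(examples):
    # Hypothetical stand-in: the real call would run HINT training,
    # applying the HINT loss only to examples whose "hat" is present.
    return examples  # dummy "model"

def evaluate(model):
    # Hypothetical stand-in: the real call would return VQA accuracy.
    return 0.0

def run_ablation(examples, fractions=(0.0, 0.25, 0.5, 0.75, 1.0), seed=0):
    rng = random.Random(seed)
    results = {}
    for frac in fractions:
        keep = set(rng.sample(range(len(examples)), int(frac * len(examples))))
        # Drop the HAT (but keep the QA pair) for examples outside `keep`.
        subset = [ex if i in keep else dict(ex, hat=None)
                  for i, ex in enumerate(examples)]
        results[frac] = evaluate(train_hint_model(subset))
    return results

# Toy usage with dummy examples.
examples = [{"q": "what color?", "a": "red", "hat": f"hat_{i}"} for i in range(8)]
print(run_ablation(examples))
```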
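**Loss sketch** for the asymmetric penalty floated in the captioning response. This is an assumption about future work, not the paper's loss: with the same segmentation supervision, importance mass falling outside the annotated regions is penalized heavily (`w_out`), while annotated regions the model under-covers are penalized lightly (`w_miss`).

```python
"""Loss sketch (assumption, not the paper's loss): penalize importance mass
outside the segmentation heavily, and under-covered annotated regions lightly."""
import torch

def asymmetric_hint_loss(importance, seg_mask, w_out=1.0, w_miss=0.1):
    # importance: (num_proposals,) raw model importance scores.
    # seg_mask:   (num_proposals,) 1.0 if the proposal overlaps the segmentation.
    p = torch.softmax(importance, dim=0)
    # Heavy penalty: any importance mass on proposals outside the segmentation.
    outside = (p * (1 - seg_mask)).sum()
    # Light penalty: annotated proposals receiving less than a uniform share
    # (so missing one of three annotated people costs relatively little).
    target = seg_mask / seg_mask.sum().clamp(min=1)
    missed = torch.relu(target - p)[seg_mask.bool()].sum()
    return w_out * outside + w_miss * missed

# Toy usage: 36 proposals, a random segmentation mask.
imp = torch.randn(36, requires_grad=True)
mask = (torch.rand(36) > 0.8).float()
asymmetric_hint_loss(imp, mask).backward()
print(imp.grad.shape)
```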