Predictions in word representation
After each prediction, append the predicted token to the previous input. The first input contains only the <start token>, padded with empty slots ("nothing").

1st input: nothing, nothing, nothing, nothing, <start token>
Input image: screenshot.jpg
1st prediction: <HTML>

2nd input: nothing, nothing, nothing, <start token>, <HTML>
Input image: screenshot.jpg
2nd prediction: Hello World!

3rd input: nothing, nothing, <start token>, <HTML>, Hello World!
Input image: screenshot.jpg
3rd prediction: </HTML>

4th input: nothing, <start token>, <HTML>, Hello World!, </HTML>
Input image: screenshot.jpg
4th prediction: <end token>

Generation ends when the <end token> is predicted. It can also be stopped after a fixed maximum number of predictions (X).
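
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: `generate`, `toy_model`, `CONTEXT_LEN`, and `max_predictions` are hypothetical names, and `toy_model` is a hard-coded stand-in that replays the example sequence instead of a trained network.

```python
from typing import Callable, List

CONTEXT_LEN = 5                      # window size used in the example (assumption)
PAD = "nothing"                      # placeholder for an empty slot
START, END = "<start token>", "<end token>"


def generate(predict_next: Callable[[List[str], str], str],
             image: str,
             max_predictions: int = 10) -> List[str]:
    """Feed the window to the model, append its prediction, and repeat."""
    window = [PAD] * (CONTEXT_LEN - 1) + [START]
    output: List[str] = []
    for _ in range(max_predictions):         # stop after X predictions at most
        token = predict_next(window, image)
        if token == END:                      # stop when the end token appears
            break
        output.append(token)
        window = window[1:] + [token]         # drop oldest slot, append prediction
    return output


def toy_model(window: List[str], image: str) -> str:
    """Hypothetical stand-in for a trained model: replays a fixed script."""
    script = {START: "<HTML>", "<HTML>": "Hello World!",
              "Hello World!": "</HTML>", "</HTML>": END}
    return script[window[-1]]


print(generate(toy_model, "screenshot.jpg"))
# ['<HTML>', 'Hello World!', '</HTML>']
```

Passing a smaller `max_predictions` cuts the output short, which corresponds to ending generation after X predictions rather than on the end token.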
Predictions in digit representation

1st input: [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
Input images: [pixel values]
1st prediction: [0, 1, 0, 0, 0]

2nd input: [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
Input images: [pixel values]
2nd prediction: [0, 0, 1, 0, 0]

3rd input: [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
Input images: [pixel values]
3rd prediction: [0, 0, 0, 1, 0]

4th input: [[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
Input images: [pixel values]
4th prediction: [0, 0, 0, 0, 1]
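
The digit rows above look like one-hot encodings of the word tokens, with an all-zero row standing for "nothing". The sketch below shows how the two representations could map onto each other. The vocabulary order in `VOCAB` is an assumption inferred by lining up the word and digit examples, and `one_hot`, `encode`, and `decode` are hypothetical helper names.

```python
from typing import List

# Assumed vocabulary order, inferred from the word/digit examples.
VOCAB = ["<start token>", "<HTML>", "Hello World!", "</HTML>", "<end token>"]
PAD = "nothing"  # empty slot, encoded as an all-zero row


def one_hot(token: str) -> List[int]:
    """Return a vector with a single 1 at the token's vocabulary index."""
    vec = [0] * len(VOCAB)
    if token != PAD:
        vec[VOCAB.index(token)] = 1
    return vec


def encode(window: List[str]) -> List[List[int]]:
    """Encode a whole input window, one row per slot."""
    return [one_hot(token) for token in window]


def decode(vec: List[int]) -> str:
    """Map a prediction vector back to the token with the largest value."""
    return VOCAB[vec.index(max(vec))]


second_input = ["nothing", "nothing", "nothing", "<start token>", "<HTML>"]
print(encode(second_input))
# [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
print(decode([0, 0, 1, 0, 0]))
# Hello World!
```

Taking the position of the largest value in `decode` also works when a model returns probabilities instead of exact zeros and ones.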