Model: KnightsAnalytics/all-MiniLM-L6-v2, a fine-tuned SBert model
Computation graph of the model in SVG: KnightsAnaliticsSBertModel.svg
XLA == OpenXLA XLA/PJRT using github.com/gomlx/gomlx (a Go wrapper)
ORT == ONNX Runtime (by Microsoft) using github.com/yalue/onnxruntime_go (a Go wrapper)
Benchmarked with sentences sampled from the FineWeb dataset, truncated/padded to 128 tokens
Benchmark code: knights_sbert_test.go
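The truncation/padding to a fixed number of tokens mentioned above can be sketched in Go as follows. This is only an illustration, not the code in knights_sbert_test.go: the helper name `padOrTruncate`, the pad id 0 and the sample token ids are assumptions.

```go
package main

import "fmt"

// padOrTruncate returns a copy of ids with exactly `length` entries.
// Longer inputs are cut; shorter ones are right-padded with padID.
// The returned mask holds 1 for real tokens and 0 for padding.
// (Hypothetical helper; padID = 0 is an assumption about the tokenizer.)
func padOrTruncate(ids []int64, length int, padID int64) (out, mask []int64) {
	out = make([]int64, length)
	mask = make([]int64, length)
	n := len(ids)
	if n > length {
		n = length
	}
	copy(out, ids[:n])
	for i := 0; i < n; i++ {
		mask[i] = 1
	}
	for i := n; i < length; i++ {
		out[i] = padID
	}
	return out, mask
}

func main() {
	// Illustrative token ids only; the benchmark uses length 128.
	ids, mask := padOrTruncate([]int64{101, 7592, 2088, 102}, 8, 0)
	fmt.Println(ids)
	fmt.Println(mask)
}
```

With a fixed length, every batch has the same input shape, so timings are comparable across batch sizes.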
Engine | Accelerator | Model Slice ("Full" or output node) | BatchSize | Parallelism | Time (Mean) | Time (mean) / example | Time (Median) | Time (5%-tile) | Time (99%-tile)
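As a rough guide to how the time columns relate to each other, the sketch below derives them from a list of per-batch durations. It is an assumption about the computation, not the benchmark's actual code: the names `Summary`, `summarize`, `percentile` and `fromMillis` are hypothetical, and the nearest-rank percentile method is one choice among several.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// Summary mirrors the time columns of the table (hypothetical type).
type Summary struct {
	Mean, MeanPerExample, Median, P5, P99 time.Duration
}

// percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of a non-empty slice sorted in ascending order.
func percentile(sorted []time.Duration, p float64) time.Duration {
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// summarize computes the statistics for a non-empty set of per-batch timings.
func summarize(samples []time.Duration, batchSize int) Summary {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	var total time.Duration
	for _, d := range sorted {
		total += d
	}
	mean := total / time.Duration(len(sorted))
	return Summary{
		Mean:           mean,
		MeanPerExample: mean / time.Duration(batchSize),
		Median:         percentile(sorted, 50),
		P5:             percentile(sorted, 5),
		P99:            percentile(sorted, 99),
	}
}

// fromMillis converts millisecond values into durations.
func fromMillis(ms ...float64) []time.Duration {
	out := make([]time.Duration, len(ms))
	for i, m := range ms {
		out[i] = time.Duration(m * float64(time.Millisecond))
	}
	return out
}

func main() {
	// Made-up timings for a batch of 16, for illustration only.
	s := summarize(fromMillis(64.1, 64.2, 64.3, 65.3, 64.0), 16)
	fmt.Printf("%+v\n", s)
}
```

"Time (mean) / example" is simply the mean batch time divided by BatchSize, which is why it equals the mean for BatchSize 1.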
Tested on 2024/12/14 11:00; times in milliseconds
XLA | CPU (Intel 12900K) | Full | 1 | None | 15.0ms | 15.0ms
XLA | CPU (Intel 12900K) | Full | 16 | None | 125.0ms | 7.8ms
XLA | CPU (Intel 12900K) | Full | 64 | None | 539.0ms | 8.4ms

XLA on GPU seems to compile to something different each time, with huge differences between runs (batchSize=1 varied from 1.22 to 7.7 ms)
XLA | GPU (2080Ti) | Full | 1 | None | 1.2ms | 1.2ms
XLA | GPU (2080Ti) | Full | 16 | None | 6.9ms | 0.4ms
XLA | GPU (2080Ti) | Full | 64 | None | 22.7ms | 0.4ms
XLA | GPU (2080Ti) | Full | 1 | None | 1.3ms | 1.3ms
XLA | GPU (2080Ti) | Full | 16 | None | 8.8ms | 0.5ms
XLA | GPU (2080Ti) | Full | 64 | None | 27.0ms | 0.4ms

ORT | CPU (Intel 12900K) | Full | 1 | None | 4.9ms | 4.9ms
ORT | CPU (Intel 12900K) | Full | 16 | None | 66.4ms | 4.2ms
ORT | CPU (Intel 12900K) | Full | 64 | None | 350.0ms | 5.5ms

PyTorch | CPU (Intel 12900K) | Full | 1 | None | 6.9ms | 6.9ms

ORT | GPU (2080Ti) | Full | 1 | None | 1.4ms | 1.4ms
ORT | GPU (2080Ti) | Full | 16 | None | 7.6ms | 0.5ms
ORT | GPU (2080Ti) | Full | 64 | None | 28.3ms | 0.4ms

Tested on 2024/12/14 18:00; times in microseconds
"embeddingGather" is the output node "/embeddings/Add_output_0": it covers only the gathering of the embeddings for the token ids and token type ids.
XLA | CPU (Intel 12900K) | embeddingGather | 1 | None | 13µs | 13.0µs
XLA | CPU (Intel 12900K) | embeddingGather | 16 | None | 37µs | 2.3µs
XLA | CPU (Intel 12900K) | embeddingGather | 64 | None | 222µs | 3.5µs
XLA | GPU (2080Ti) | embeddingGather | 1 | None | 83µs | 83.0µs
XLA | GPU (2080Ti) | embeddingGather | 16 | None | 79µs | 4.9µs
XLA | GPU (2080Ti) | embeddingGather | 64 | None | 163µs | 2.5µs
ORT | CPU (Intel 12900K) | embeddingGather | 1 | None | 11µs | 11.0µs
ORT | CPU (Intel 12900K) | embeddingGather | 16 | None | 80µs | 5.0µs
ORT | CPU (Intel 12900K) | embeddingGather | 64 | None | 877µs | 13.7µs
ORT | GPU (2080Ti) | embeddingGather | 1 | None | 49µs | 49.0µs
ORT | GPU (2080Ti) | embeddingGather | 16 | None | 399µs | 24.9µs
ORT | GPU (2080Ti) | embeddingGather | 64 | None | 1398µs | 21.8µs

Tested on 2024/12/15 7:00: CPU frequencies fixed at 3 GHz (min and max frequencies both set to 3 GHz), which makes it significantly slower
embeddingsLayerNorm=/embeddings/LayerNorm/Add_1_output_0
embeddingGather=/embeddings/Add_output_0
XLA | CPU (Intel 12900K) | embeddingGather | 1 | None | 21.4µs | 21.4µs
XLA | CPU (Intel 12900K) | embeddingGather | 16 | None | 54.6µs | 3.4µs
XLA | CPU (Intel 12900K) | embeddingGather | 64 | None | 245µs | 3.8µs
ORT | CPU (Intel 12900K) | embeddingGather | 1 | None | 18.4µs | 18.4µs
ORT | CPU (Intel 12900K) | embeddingGather | 16 | None | 109µs | 6.8µs
ORT | CPU (Intel 12900K) | embeddingGather | 64 | None | 879µs | 13.7µs

XLA | CPU (Intel 12900K) | embeddingsLayerNorm | 1 | None | 56.2µs | 56.2µs
XLA | CPU (Intel 12900K) | embeddingsLayerNorm | 16 | None | 735µs | 45.9µs
XLA | CPU (Intel 12900K) | embeddingsLayerNorm | 64 | None | 4471µs | 69.9µs
ORT | CPU (Intel 12900K) | embeddingsLayerNorm | 1 | None | 56.6µs | 56.6µs
ORT | CPU (Intel 12900K) | embeddingsLayerNorm | 16 | None | 0µs | 0.0µs
ORT | CPU (Intel 12900K) | embeddingsLayerNorm | 64 | None | 2375µs | 37.1µs

Tested on 2024/12/16 7:00: using only P-cores (see `taskset 0xFFFF`) and avoiding E-cores; XLA goes much faster
Timings now also include the time to transfer the result back to the host.
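The `0xFFFF` argument to `taskset` is a bitmask with one bit per logical CPU, so it selects logical CPUs 0 through 15. The small Go sketch below builds such a mask from a CPU range; `affinityMask` is a hypothetical helper, and which logical CPU numbers map to P-cores depends on the machine and should be checked before reusing the value.

```go
package main

import "fmt"

// affinityMask returns a bitmask with one bit set for every logical CPU
// in the inclusive range [first, last] (hypothetical helper, CPUs 0..63).
func affinityMask(first, last int) uint64 {
	var mask uint64
	for cpu := first; cpu <= last; cpu++ {
		mask |= 1 << uint(cpu)
	}
	return mask
}

func main() {
	// Logical CPUs 0..15 give the mask used in the note above.
	// Prints: taskset 0xffff <benchmark command>
	fmt.Printf("taskset %#x <benchmark command>\n", affinityMask(0, 15))
}
```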
XLA | CPU (Intel 12900K) | Full | 1 | None | 8.0ms | 8.0ms | 8.0ms | 7.8ms | 8.5ms
XLA | CPU (Intel 12900K) | Full | 16 | None | 163.8ms | 10.2ms | 162.9ms | 158.8ms | 171.7ms
XLA | CPU (Intel 12900K) | Full | 64 | None | 760.0ms | 11.9ms | 762.0ms | 352.0ms | 352.0ms

ORT | CPU (Intel 12900K) | Full | 1 | None | 4.6ms | 4.6ms | 4.6ms | 4.6ms | 5.0ms
ORT | CPU (Intel 12900K) | Full | 16 | None | 64.3ms | 4.0ms | 64.2ms | 64.1ms | 65.3ms
ORT | CPU (Intel 12900K) | Full | 64 | None | 324.0ms | 5.1ms | 311.0ms | 307.0ms | 352.0ms

Tested on 2024/12/16 12:00: using Rob's example sentences, each with 7 <= n <= 13 tokens
XLA uses one program per number of tokens
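The "one program per number of tokens" behaviour can be pictured as a cache of compiled programs keyed by sequence length. The Go sketch below is only a model of that idea under stated assumptions: `program`, `programCache` and the `compile` callback are hypothetical stand-ins and are not the gomlx or XLA API.

```go
package main

import (
	"fmt"
	"sync"
)

// program stands in for an executable specialized to one input shape
// (hypothetical type, not an XLA or gomlx type).
type program struct{ numTokens int }

// programCache lazily builds and memoizes one program per sequence length.
type programCache struct {
	mu       sync.Mutex
	byLen    map[int]*program
	compile  func(numTokens int) *program
	compiles int // number of times compile was invoked
}

func newProgramCache(compile func(int) *program) *programCache {
	return &programCache{byLen: make(map[int]*program), compile: compile}
}

// get returns the cached program for numTokens, compiling it on first use.
func (c *programCache) get(numTokens int) *program {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.byLen[numTokens]; ok {
		return p
	}
	p := c.compile(numTokens)
	c.compiles++
	c.byLen[numTokens] = p
	return p
}

func main() {
	cache := newProgramCache(func(n int) *program { return &program{numTokens: n} })
	for _, n := range []int{7, 13, 7, 9, 13} {
		_ = cache.get(n)
	}
	// Three distinct lengths were requested, so this prints 3.
	fmt.Println("programs compiled:", cache.compiles)
}
```

Under this model, variable-length inputs pay a compilation cost the first time each length is seen, while padding everything to one length needs a single program.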
ORT | CPU (Intel 12900K) | Full (Rob's sentences) | 1 | None | 916µs | 916.0µs | 880µs | 814.3µs | 1300µs
XLA | CPU (Intel 12900K) | Full (Rob's sentences) | 1 | None | 1800µs | 1800.0µs | 1700µs | 1500µs | 3100µs
All sentences padded to a fixed length of 13 tokens
ORT | CPU (Intel 12900K) | Full (Rob's sentences) | 1 | None | 969µs | 969.0µs | 933.2µs | 913.1µs | 1200µs
ORT | CPU (Intel 12900K) | Full (Rob's sentences) | 32 | None | 13500µs | 421.9µs | 12.7ms | 12.3ms | 15.3ms
XLA | CPU (Intel 12900K) | Full (Rob's sentences) | 1 | None | 1900µs | 1900.0µs | 1800µs | 1800µs | 2400µs
XLA | CPU (Intel 12900K) | Full (Rob's sentences) | 32 | None | 110.9ms | 3.5ms | 110.5ms | 107.7ms | 120.2ms

XLA | GPU (2080Ti) | Full (Rob's sentences) | 1 | None | 1100µs | 1100.0µs | 1100µs | 647µs | 1700µs
XLA | GPU (2080Ti) | Full (Rob's sentences) | 8 | None | 2000µs | 250.0µs | 1.8ms | 1.8ms | 2.7ms
XLA | GPU (2080Ti) | Full (Rob's sentences) | 32 | None | 4600µs | 143.8µs | 4.5ms | 4.2ms | 5.7ms

Tested on 2024/12/16 16:00: other slices
positionEmbeddingsAdded=/embeddings/Add_1_output_0
XLA | CPU (Intel 12900K) | embeddingGather | 1 | None | 28.9µs | 28.9µs | 28.3µs | 26.3µs | 39.1µs
XLA | CPU (Intel 12900K) | embeddingGather | 16 | None | 163.1µs | 10.2µs | 159.9µs | 155.8µs | 213.0µs
XLA | CPU (Intel 12900K) | embeddingGather | 64 | None | 741µs | 11.6µs | 705µs | 666.3µs | 1400µs

ORT | CPU (Intel 12900K) | embeddingGather | 1 | None | 12.9µs | 12.9µs | 12.8µs | 12.3µs | 15.4µs
ORT | CPU (Intel 12900K) | embeddingGather | 16 | None | 44.3µs | 2.8µs | 43.1µs | 42.0µs | 55.8µs
ORT | CPU (Intel 12900K) | embeddingGather | 64 | None | 1000µs | 15.6µs | 983.9µs | 951µs | 1300µs

XLA | CPU (Intel 12900K) | positionEmbeddingsAdded | 1 | None | 29.4µs | 29.4µs | 28.9µs | 26.9µs | 35.9µs
XLA | CPU (Intel 12900K) | positionEmbeddingsAdded | 16 | None | 178.3µs | 11.1µs | 177µs | 169.9µs | 208.2µs
XLA | CPU (Intel 12900K) | positionEmbeddingsAdded | 64 | None | 768.9µs | 12.0µs | 753.1µs | 718.8µs | 1200µs

ORT | CPU (Intel 12900K) | positionEmbeddingsAdded | 1 | None | 23.1µs | 23.1µs | 23.1µs | 22.1µs | 26.4µs
ORT | CPU (Intel 12900K) | positionEmbeddingsAdded | 16 | None | 78.2µs | 4.9µs | 72.9µs | 71.6µs | 99µs
ORT | CPU (Intel 12900K) | positionEmbeddingsAdded | 64 | None | 2100µs | 32.8µs | 2100µs | 2000µs | 2500µs

Tested on 2024/12/16 16:00: other slices
normalizedEmbeddings=/embeddings/LayerNorm/Add_1_output_0
XLA | CPU (Intel 12900K) | normalizedEmbeddings | 1 | None | 52.3µs | 52.3µs
XLA | CPU (Intel 12900K) | normalizedEmbeddings | 16 | None | 722µs | 45.1µs
XLA | CPU (Intel 12900K) | normalizedEmbeddings | 64 | None | 5600µs | 87.5µs

ORT | CPU (Intel 12900K) | normalizedEmbeddings | 1 | None | 33.1µs | 33.1µs
ORT | CPU (Intel 12900K) | normalizedEmbeddings | 16 | None | 165µs | 10.3µs