| A | B | C | D | E | F | G | H | I | J | |
|---|---|---|---|---|---|---|---|---|---|---|
1 | Model: KnightsAnalytics/all-MiniLM-L6-v2, a fine-tuned SBert model. Computation graph of the model in SVG: KnightsAnaliticsSBertModel.svg. XLA == OpenXLA XLA/PJRT, used through github.com/gomlx/gomlx (a Go wrapper). ORT == ONNX Runtime (by Microsoft), used through github.com/yalue/onnxruntime_go (a Go wrapper). Benchmarked with sentences sampled from the FineWeb dataset, truncated/padded to 128 tokens. Benchmark code: knights_sbert_test.go | |||||||||
2 | Engine | Accelerator | Model Slice ("Full" or output node) | BatchSize | Parallelism | Time (Mean) | Time (mean) / example | Time (Median) | Time (5%-tile) | Time (99%-tile) |
3 | Tested on 2024/12/14 11:00. Times in milliseconds. | |||||||||
4 | XLA | CPU (Intel 12K900) | Full | 1 | None | 15.0ms | 15.0ms | |||
5 | XLA | CPU (Intel 12K900) | Full | 16 | None | 125.0ms | 7.8ms | |||
6 | XLA | CPU (Intel 12K900) | Full | 64 | None | 539.0ms | 8.4ms | |||
7 | | | ||||||||
8 | XLA on GPU seems to compile to a different program on each run, with large differences in timing (for batchSize=1 the time varied from 1.22 ms to 7.7 ms) | |||||||||
9 | XLA | GPU (2080Ti) | Full | 1 | None | 1.2ms | 1.2ms | |||
10 | XLA | GPU (2080Ti) | Full | 16 | None | 6.9ms | 0.4ms | |||
11 | XLA | GPU (2080Ti) | Full | 64 | None | 22.7ms | 0.4ms | |||
12 | XLA | GPU (2080Ti) | Full | 1 | None | 1.3ms | 1.3ms | |||
13 | XLA | GPU (2080Ti) | Full | 16 | None | 8.8ms | 0.5ms | |||
14 | XLA | GPU (2080Ti) | Full | 64 | None | 27.0ms | 0.4ms | |||
15 | | | ||||||||
16 | ORT | CPU (Intel 12K900) | Full | 1 | None | 4.9ms | 4.9ms | |||
17 | ORT | CPU (Intel 12K900) | Full | 16 | None | 66.4ms | 4.2ms | |||
18 | ORT | CPU (Intel 12K900) | Full | 64 | None | 350.0ms | 5.5ms | |||
19 | | | ||||||||
20 | PyTorch | CPU (Intel 12K900) | Full | 1 | None | 6.9ms | 6.9ms | |||
21 | | | ||||||||
22 | ORT | GPU (2080Ti) | Full | 1 | None | 1.4ms | 1.4ms | |||
23 | ORT | GPU (2080Ti) | Full | 16 | None | 7.6ms | 0.5ms | |||
24 | ORT | GPU (2080Ti) | Full | 64 | None | 28.3ms | 0.4ms | |||
25 | | | ||||||||
26 | Tested on 2024/12/14 18:00. "embeddingGather": node "/embeddings/Add_output_0", only the gathering of the embeddings for the token ids and token type ids. Times in microseconds! | |||||||||
27 | XLA | CPU (Intel 12K900) | embeddingGather | 1 | None | 13µs | 13.0µs | |||
28 | XLA | CPU (Intel 12K900) | embeddingGather | 16 | None | 37µs | 2.3µs | |||
29 | XLA | CPU (Intel 12K900) | embeddingGather | 64 | None | 222µs | 3.5µs | |||
30 | XLA | GPU (2080Ti) | embeddingGather | 1 | None | 83µs | 83.0µs | |||
31 | XLA | GPU (2080Ti) | embeddingGather | 16 | None | 79µs | 4.9µs | |||
32 | XLA | GPU (2080Ti) | embeddingGather | 64 | None | 163µs | 2.5µs | |||
33 | ORT | CPU (Intel 12K900) | embeddingGather | 1 | None | 11µs | 11.0µs | |||
34 | ORT | CPU (Intel 12K900) | embeddingGather | 16 | None | 80µs | 5.0µs | |||
35 | ORT | CPU (Intel 12K900) | embeddingGather | 64 | None | 877µs | 13.7µs | |||
36 | ORT | GPU (2080Ti) | embeddingGather | 1 | None | 49µs | 49.0µs | |||
37 | ORT | GPU (2080Ti) | embeddingGather | 16 | None | 399µs | 24.9µs | |||
38 | ORT | GPU (2080Ti) | embeddingGather | 64 | None | 1398µs | 21.8µs | |||
39 | | | ||||||||
40 | Tested on 2024/12/15 7:00: CPU frequencies fixed at 3 GHz (by setting min/max frequencies to 3 GHz), which makes it significantly slower. embeddingsLayerNorm=/embeddings/LayerNorm/Add_1_output_0; embeddingGather=/embeddings/Add_output_0 | |||||||||
41 | XLA | CPU (Intel 12K900) | embeddingGather | 1 | None | 21.4µs | 21.4µs | |||
42 | XLA | CPU (Intel 12K900) | embeddingGather | 16 | None | 54.6µs | 3.4µs | |||
43 | XLA | CPU (Intel 12K900) | embeddingGather | 64 | None | 245µs | 3.8µs | |||
44 | ORT | CPU (Intel 12K900) | embeddingGather | 1 | None | 18.4µs | 18.4µs | |||
45 | ORT | CPU (Intel 12K900) | embeddingGather | 16 | None | 109µs | 6.8µs | |||
46 | ORT | CPU (Intel 12K900) | embeddingGather | 64 | None | 879µs | 13.7µs | |||
47 | | | ||||||||
48 | XLA | CPU (Intel 12K900) | embeddingsLayerNorm | 1 | None | 56.2µs | 56.2µs | |||
49 | XLA | CPU (Intel 12K900) | embeddingsLayerNorm | 16 | None | 735µs | 45.9µs | |||
50 | XLA | CPU (Intel 12K900) | embeddingsLayerNorm | 64 | None | 4471µs | 69.9µs | |||
51 | ORT | CPU (Intel 12K900) | embeddingsLayerNorm | 1 | None | 56.6µs | 56.6µs | |||
52 | ORT | CPU (Intel 12K900) | embeddingsLayerNorm | 16 | None | 0µs | 0.0µs | |||
53 | ORT | CPU (Intel 12K900) | embeddingsLayerNorm | 64 | None | 2375µs | 37.1µs | |||
54 | ||||||||||
55 | Tested on 2024/12/16 7:00: Using only P-cores (see `taskset 0xFFFF`) and avoiding E-cores; XLA goes much faster. Also added the time to transfer the result back to the host. | |||||||||
56 | XLA | CPU (Intel 12K900) | Full | 1 | None | 8.0ms | 8.0ms | 8.0ms | 7.8ms | 8.5ms |
57 | XLA | CPU (Intel 12K900) | Full | 16 | None | 163.8ms | 10.2ms | 162.9ms | 158.8ms | 171.7ms |
58 | XLA | CPU (Intel 12K900) | Full | 64 | None | 760.0ms | 11.9ms | 762.0ms | 352.0ms | 352.0ms |
59 | ||||||||||
60 | ORT | CPU (Intel 12K900) | Full | 1 | None | 4.6ms | 4.6ms | 4.6ms | 4.6ms | 5.0ms |
61 | ORT | CPU (Intel 12K900) | Full | 16 | None | 64.3ms | 4.0ms | 64.2ms | 64.1ms | 65.3ms |
62 | ORT | CPU (Intel 12K900) | Full | 64 | None | 324.0ms | 5.1ms | 311.0ms | 307.0ms | 352.0ms |
63 | ||||||||||
64 | Tested on 2024/12/16 12:00: Using Rob's example sentences, with 7 <= n <= 13 tokens each. XLA uses one program per number of tokens. | |||||||||
65 | ORT | CPU (Intel 12K900) | Full(Rob's sentences) | 1 | None | 916µs | 916.0µs | 880µs | 814.3µs | 1300µs |
66 | XLA | CPU (Intel 12K900) | Full(Rob's sentences) | 1 | None | 1800µs | 1800.0µs | 1700µs | 1500µs | 3100µs |
67 | All sentences padded to a fixed length of 13 tokens | |||||||||
68 | ORT | CPU (Intel 12K900) | Full(Rob's sentences) | 1 | None | 969µs | 969.0µs | 933.2µs | 913.1µs | 1200µs |
69 | ORT | CPU (Intel 12K900) | Full(Rob's sentences) | 32 | None | 13.5ms | 421.9µs | 12.7ms | 12.3ms | 15.3ms |
70 | XLA | CPU (Intel 12K900) | Full(Rob's sentences) | 1 | None | 1900µs | 1900.0µs | 1800µs | 1800µs | 2400µs |
71 | XLA | CPU (Intel 12K900) | Full(Rob's sentences) | 32 | None | 110.9ms | 3.5ms | 110.5ms | 107.7ms | 120.2ms |
72 | ||||||||||
73 | XLA | GPU (2080Ti) | Full(Rob's sentences) | 1 | None | 1100µs | 1100.0µs | 1100µs | 647µs | 1700µs |
74 | XLA | GPU (2080Ti) | Full(Rob's sentences) | 8 | None | 2.0ms | 250.0µs | 1.8ms | 1.8ms | 2.7ms |
75 | XLA | GPU (2080Ti) | Full(Rob's sentences) | 32 | None | 4.6ms | 143.8µs | 4.5ms | 4.2ms | 5.7ms |
76 | ||||||||||
77 | Tested on 2024/12/16 16:00: Other slices. positionEmbeddingsAdded=/embeddings/Add_1_output_0 | |||||||||
78 | XLA | CPU (Intel 12K900) | embeddingGather | 1 | None | 28.9µs | 28.9µs | 28.3µs | 26.3µs | 39.1µs |
79 | XLA | CPU (Intel 12K900) | embeddingGather | 16 | None | 163.1µs | 10.2µs | 159.9µs | 155.8µs | 213.0µs |
80 | XLA | CPU (Intel 12K900) | embeddingGather | 64 | None | 741µs | 11.6µs | 705µs | 666.3µs | 1400µs |
81 | ||||||||||
82 | ORT | CPU (Intel 12K900) | embeddingGather | 1 | None | 12.9µs | 12.9µs | 12.8µs | 12.3µs | 15.4µs |
83 | ORT | CPU (Intel 12K900) | embeddingGather | 16 | None | 44.3µs | 2.8µs | 43.1µs | 42.0µs | 55.8µs |
84 | ORT | CPU (Intel 12K900) | embeddingGather | 64 | None | 1000µs | 15.6µs | 983.9µs | 951µs | 1300µs |
85 | ||||||||||
86 | XLA | CPU (Intel 12K900) | positionEmbeddingsAdded | 1 | None | 29.4µs | 29.4µs | 28.9µs | 26.9µs | 35.9µs |
87 | XLA | CPU (Intel 12K900) | positionEmbeddingsAdded | 16 | None | 178.3µs | 11.1µs | 177µs | 169.9µs | 208.2µs |
88 | XLA | CPU (Intel 12K900) | positionEmbeddingsAdded | 64 | None | 768.9µs | 12.0µs | 753.1µs | 718.8µs | 1200µs |
89 | ||||||||||
90 | ORT | CPU (Intel 12K900) | positionEmbeddingsAdded | 1 | None | 23.1µs | 23.1µs | 23.1µs | 22.1µs | 26.4µs |
91 | ORT | CPU (Intel 12K900) | positionEmbeddingsAdded | 16 | None | 78.2µs | 4.9µs | 72.9µs | 71.6µs | 99µs |
92 | ORT | CPU (Intel 12K900) | positionEmbeddingsAdded | 64 | None | 2100µs | 32.8µs | 2100µs | 2000µs | 2500µs |
93 | ||||||||||
94 | Tested on 2024/12/16 16:00: Other slices. normalizedEmbeddings=/embeddings/LayerNorm/Add_1_output_0 | |||||||||
95 | XLA | CPU (Intel 12K900) | normalizedEmbeddings | 1 | None | 52.3µs | 52.3µs | |||
96 | XLA | CPU (Intel 12K900) | normalizedEmbeddings | 16 | None | 722µs | 45.1µs | |||
97 | XLA | CPU (Intel 12K900) | normalizedEmbeddings | 64 | None | 5600µs | 87.5µs | |||
98 | ||||||||||
99 | ORT | CPU (Intel 12K900) | normalizedEmbeddings | 1 | None | 33.1µs | 33.1µs | |||
100 | ORT | CPU (Intel 12K900) | normalizedEmbeddings | 16 | None | 165µs | 10.3µs | |||