Counting bytes fast
by Nathan Kurz
-------------------------------
Hi Yann ---
I tried to post a long comment to your blog, but the form didn't want to take it. It says I'm over the 4KB length limit, but my attempts to shorten it didn't go through either. So I'll send it to you this way, and just post a short summary on the blog.
Nice article, and interesting to ponder. I think your conclusions are
right, and your solution good, but that the problem isn't actually
due to any write commit delay. Such delays used to be an issue, but with
current store-to-load forwarding, they aren't a factor any more.
Instead, as 'gpd' suggests, the issue is with speculative loads. The
processor tries to get ahead by reading ahead in the input stream and
pre-loading the appropriate counter. It even speculatively executes
the addition and issues the write. But before the write is retired,
the counter preload is invalidated by a previous store. Usually no
reload of the counter from cache is necessary, since the pending
store's value is forwarded directly to the load's destination
register, but the work done for the increment and store is tossed out
(not retired) and the micro-ops are re-executed with the new value.
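For concreteness, here's the shape of the loop we're talking about (my paraphrase of trivialCount, not necessarily your exact code):

    #include <stddef.h>

    /* Each increment is a load, an add, and a store to count[src[i]];
       when the same byte value repeats, each iteration's load depends
       on the previous iteration's still-in-flight store. */
    void trivialCount(unsigned count[256], const unsigned char *src, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            count[src[i]]++;
    }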
I think I can show this fairly directly using the CPU's performance counters. First, I modified your benchmark program to execute the loop a constant number of times rather than for a constant length of time. Your approach is really nifty, but it complicates the counter comparisons:
- while(BMK_GetMilliSpan(milliTime) < TIMELOOP)
+ for (int i = 0; i < 10000; i++)
Then I ran the trivial counter under 'likwid' for both 20% and 90%
probabilities. Likwid is a performance monitoring program similar to
'perf', available at https://code.google.com/p/likwid/. Some of the
exact counter names are part of my local customization, though, so
you might have to do things a little differently to replicate this.
nate@haswell:~/src/yann$ likwid -g UOPS_EXECUTED_PORT_PORT_0156:PMC0,UOPS_EXECUTED_PORT_PORT_23:PMC1,UOPS_EXECUTED_PORT_PORT_4:PMC2,MEM_UOP_RETIRED_STORES:PMC3 -C2 fullbench -b100 -i2 -P20
-------------------------------------------------------------
CPU type: Intel Core Haswell processor
CPU clock: 3.39 GHz
-------------------------------------------------------------
Generating 64 KB with P=20.00%
trivialCount : 1715.6 MB/s (0)
+------------------------------+-------------+
| Event | core 2 |
+------------------------------+-------------+
| UOPS_EXECUTED_PORT_PORT_0156 | 2.36448e+09 |
| UOPS_EXECUTED_PORT_PORT_23 | 2.48144e+09 |
| UOPS_EXECUTED_PORT_PORT_4 | 1.77152e+09 |
| MEM_UOP_RETIRED_STORES | 1.3129e+09 |
| INSTR_RETIRED_ANY | 6.56242e+09 |
| CPU_CLK_UNHALTED_CORE | 2.59642e+09 |
| CPU_CLK_UNHALTED_REF | 2.59641e+09 |
+------------------------------+-------------+
nate@haswell:~/src/yann$ likwid -g UOPS_EXECUTED_PORT_PORT_0156:PMC0,UOPS_EXECUTED_PORT_PORT_23:PMC1,UOPS_EXECUTED_PORT_PORT_4:PMC2,MEM_UOP_RETIRED_STORES:PMC3 -C2 fullbench -b100 -i2 -P90
Generating 64 KB with P=90.00%
trivialCount : 628.3 MB/s (0)
+------------------------------+-------------+
| Event | core 2 |
+------------------------------+-------------+
| UOPS_EXECUTED_PORT_PORT_0156 | 6.1712e+09 |
| UOPS_EXECUTED_PORT_PORT_23 | 3.06585e+09 |
| UOPS_EXECUTED_PORT_PORT_4 | 5.60409e+09 |
| MEM_UOP_RETIRED_STORES | 1.31305e+09 |
| INSTR_RETIRED_ANY | 6.56342e+09 |
| CPU_CLK_UNHALTED_CORE | 7.08232e+09 |
| CPU_CLK_UNHALTED_REF | 7.08232e+09 |
+------------------------------+-------------+
Ports 0, 1, 5, and 6 are the execution ports that in this case handle
the addition and the loop logic. Ports 2 and 3 do address generation
for the loads, and Port 4 handles the stores. What's shown is that
although the total number of instructions retired is the same, the
90% case takes almost 3x as long to execute. Interestingly, in the
90% case there are also about 4x as many stores executed as retired,
whereas for the 20% case the ratio is only about 1.35x. The scalar
executions also increase by about 2.6x, while the number of load
executions barely changes. The number of stores retired (that is,
actually committed), though, stays constant. While there might also
be other things happening, the increase in the number of scalar and
store executions would account for almost all of the increase in
runtime.
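(Taking differences from the tables: the 90% run executes roughly 3.8e9 extra Port 0156 uops and 3.8e9 extra Port 4 uops, against roughly 4.5e9 extra core cycles.)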
By comparison, here are the numbers for your 4-table case:
nate@haswell:~/src/yann$ likwid -g UOPS_EXECUTED_PORT_PORT_0156:PMC0,UOPS_EXECUTED_PORT_PORT_23:PMC1,UOPS_EXECUTED_PORT_PORT_4:PMC2,MEM_UOP_RETIRED_STORES:PMC3 -C2 fullbench -b1 -i2 -P20
Generating 64 KB with P=20.00%
FSE_count(255) : 1867.1 MB/s (13142)
+------------------------------+-------------+
| Event | core 2 |
+------------------------------+-------------+
| UOPS_EXECUTED_PORT_PORT_0156 | 1.62161e+09 |
| UOPS_EXECUTED_PORT_PORT_23 | 2.36236e+09 |
| UOPS_EXECUTED_PORT_PORT_4 | 1.34547e+09 |
| MEM_UOP_RETIRED_STORES | 1.32183e+09 |
| INSTR_RETIRED_ANY | 3.69438e+09 |
| CPU_CLK_UNHALTED_CORE | 2.38884e+09 |
| CPU_CLK_UNHALTED_REF | 2.38884e+09 |
+------------------------------+-------------+
nate@haswell:~/src/yann$ likwid -g UOPS_EXECUTED_PORT_PORT_0156:PMC0,UOPS_EXECUTED_PORT_PORT_23:PMC1,UOPS_EXECUTED_PORT_PORT_4:PMC2,MEM_UOP_RETIRED_STORES:PMC3 -C2 fullbench -b1 -i2 -P90
Generating 64 KB with P=90.00%
FSE_count(255) : 1820.4 MB/s (58868)
+------------------------------+-------------+
| Event | core 2 |
+------------------------------+-------------+
| UOPS_EXECUTED_PORT_PORT_0156 | 1.92168e+09 |
| UOPS_EXECUTED_PORT_PORT_23 | 2.39311e+09 |
| UOPS_EXECUTED_PORT_PORT_4 | 1.74347e+09 |
| MEM_UOP_RETIRED_STORES | 1.322e+09 |
| INSTR_RETIRED_ANY | 3.69836e+09 |
| CPU_CLK_UNHALTED_CORE | 2.447e+09 |
| CPU_CLK_UNHALTED_REF | 2.447e+09 |
+------------------------------+-------------+
Two things seem interesting here. The first is that even the 4-table
case has a significant number of store and increment retries due to
invalidations --- UOPS_EXECUTED_PORT_PORT_4 and
UOPS_EXECUTED_PORT_PORT_0156 are about 20-30% higher for the 90% case.
But the increase in runtime (CPU_CLK_UNHALTED) is only about 2%.
Apparently the CPU is usually able to fit the additional micro-ops
into the same number of cycles. This makes sense, since the loop
takes about 2 cycles per byte, during which the processor should be
able to execute 2 stores plus 4-8 scalar operations.
The range for the scalar ops is because I don't understand whether the
invalidated ops need to be re-issued, which is subject to a
4-op-per-cycle limit, or just re-executed, which is limited only by
the available execution ports. But it does seem possible in theory
that, if the loop were sufficiently unrolled to reduce the overhead of
the loop logic and enough tables were used to completely avoid
invalidations, we could get down to 1 cycle per byte.
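By the way, if I'm reconstructing it correctly, the heart of your 4-table version is something like this (my own sketch, not your exact code):

    #include <stddef.h>
    #include <stdint.h>

    /* Four independent count arrays: a run of identical bytes now
       updates four different memory cells instead of hammering one,
       which breaks the store-to-load dependency chain. */
    void count_4tables(uint32_t result[256], const uint8_t *src, size_t len)
    {
        uint32_t c0[256] = {0}, c1[256] = {0}, c2[256] = {0}, c3[256] = {0};
        size_t i;
        for (i = 0; i + 4 <= len; i += 4) {
            c0[src[i + 0]]++;
            c1[src[i + 1]]++;
            c2[src[i + 2]]++;
            c3[src[i + 3]]++;
        }
        for (; i < len; i++) c0[src[i]]++;   /* remainder */
        for (int s = 0; s < 256; s++)
            result[s] = c0[s] + c1[s] + c2[s] + c3[s];
    }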
--nate
============================================================
I'm having some success with speedups on my 3.4 GHz Haswell system,
and have gone from 1.8 GB/s to 2.2 GB/s when all the stars align. I
still don't understand all the factors affecting performance, but I
think I'm getting closer.
1) I get a small speedup at -P90 when going from 4 to 8 tables.
2) I get a tiny boost from changing the loop logic to increment a
negative offset from lastSrc, since this allows the loop to use the
flags set by the add operation (instead of an additional cmp) to
decide whether to terminate (see the sketch after this list).
3) I'm frequently having better luck compiling with Intel's ICC than
with GCC. I'm not recommending that you switch, but I'm hoping to
learn from the produced assembly why it might be doing better.
4) I got a significant speedup from changing the alignment of the
tables. I think the issue is that if the tables are in sequential
256B pieces, the memory disambiguation can have false positives even
though the increment targets a different table. Changing the
allocated size of the arrays to 264 or 300 entries has a positive
effect for the -P90 case; changing it to 512 had a strong negative
effect. I don't know how this mechanism works, but I speculate that
it either uses the low bits of the address or allows a maximum number
of entries per cache line, and somehow the non-power-of-two size
avoids collisions (also reflected in the sketch below).
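Here's roughly what I mean by 2) and 4) together (a sketch; the names are mine):

    #include <stddef.h>
    #include <stdint.h>

    /* Padding each table to 264 entries keeps consecutive tables from
       aliasing in the disambiguation logic; indexing up from a negative
       offset lets the loop-ending test reuse the flags set by the add,
       so no separate cmp is needed. */
    static uint32_t count0[264], count1[264], count2[264], count3[264];

    void count_neg(const uint8_t *src, size_t len)   /* assumes len % 4 == 0 */
    {
        const uint8_t *lastSrc = src + len;
        intptr_t i = -(intptr_t)len;
        do {
            count0[lastSrc[i + 0]]++;
            count1[lastSrc[i + 1]]++;
            count2[lastSrc[i + 2]]++;
            count3[lastSrc[i + 3]]++;
            i += 4;                    /* the add sets ZF... */
        } while (i != 0);              /* ...which the jnz consumes */
    }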
> For the fun and comparison, I tried a (false) variant which would only write to destination cell (no ++, just = constant), and it reached 2.7 GB/s, which incidentally corresponds to the speed of my CPU (2.7 GHz).
Note that Intel's "Turbo Boost" feature can cause the actual speed to
differ from the rated speed. On Linux, almost all system utilities
report the nominal (label) speed rather than the boosted speed; i7z
and turbostat are two exceptions. "Speedstep" can cause variable
readings as well, although usually slower than expected.
To turn both of them off:
# for num in 0 1 2 3 4 5 6 7; do echo performance > /sys/devices/system/cpu/cpu${num}/cpufreq/scaling_governor; done
# echo 0 > /sys/devices/system/cpu/cpufreq/boost
It's easier to do this as root than with sudo, since the shell
performs the redirection with normal user privileges. I presume there
are Windows equivalents, but I'm not familiar with them.
> So it seems there really is a cost for read, add & write which can't be completely hidden within the pipeline by OoOex.
I'm not sure yet how fast it can go. Intel's IACA tool can be useful
for understanding limiting factors. Here's what it shows for one of
my ICC compiled 2000 MB/s loops:
nate@haswell:~/src/yann$ /opt/intel/iaca/bin/iaca -mark 0 -64 -arch HSW -analysis THROUGHPUT iaca.o
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - iaca.o
Binary Format - 64Bit
Architecture - HSW
Analysis Type - Throughput
*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************
Throughput Analysis Report
--------------------------
Block Throughput: 12.20 Cycles       Throughput Bottleneck: PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0  -  DV  |  1   |  2  -  D   |  3  -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 2.5    0.0 | 2.5  | 12.0   8.0 | 12.0   8.0 | 8.0  | 2.5  | 2.5  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                 Ports pressure in cycles                  |    |
|  Uops  | 0 - DV | 1   | 2 - D     | 3 - D     | 4   | 5   | 6   | 7 |    |
---------------------------------------------------------------------------
|   1    | 0.7    |     |           |           |     |     | 0.3 |   |    | test r14, r14
|   0F   |        |     |           |           |     |     |     |   |    | jz 0x7c
|   0*   |        |     |           |           |     |     |     |   |    | nop dword ptr [rax], eax
|   0*   |        |     |           |           |     |     |     |   |    | nop dword ptr [rax], eax
|   1    |        |     | 1.0   1.0 |           |     |     |     |   | CP | movzx eax, byte ptr [r14+r15*1]
|   1    |        |     |           | 1.0   1.0 |     |     |     |   | CP | movzx edx, byte ptr [r14+r15*1+0x1]
|   1    |        |     | 1.0   1.0 |           |     |     |     |   | CP | movzx ecx, byte ptr [r14+r15*1+0x2]
|   1    |        |     |           | 1.0   1.0 |     |     |     |   | CP | movzx esi, byte ptr [r14+r15*1+0x3]
|   1    |        |     | 1.0   1.0 |           |     |     |     |   | CP | movzx edi, byte ptr [r14+r15*1+0x4]
|   1    |        |     |           | 1.0   1.0 |     |     |     |   | CP | movzx r8d, byte ptr [r14+r15*1+0x5]
|   1    |        |     | 1.0   1.0 |           |     |     |     |   | CP | movzx r9d, byte ptr [r14+r15*1+0x6]
|   1    |        |     |           | 1.0   1.0 |     |     |     |   | CP | movzx r10d, byte ptr [r14+r15*1+0x7]
|   4    |        | 0.3 | 1.0   1.0 | 1.0       | 1.0 | 0.2 | 0.4 |   | CP | inc dword ptr [rax*4]
|   4    |        | 0.3 | 1.0       | 1.0   1.0 | 1.0 | 0.3 | 0.3 |   | CP | inc dword ptr [rdx*4]
|   4    |        | 0.4 | 1.0   1.0 | 1.0       | 1.0 | 0.3 | 0.3 |   | CP | inc dword ptr [rcx*4]
|   4    | 0.1    | 0.2 | 1.0       | 1.0   1.0 | 1.0 | 0.4 | 0.3 |   | CP | inc dword ptr [rsi*4]
|   4    | 0.2    | 0.3 | 1.0   1.0 | 1.0       | 1.0 | 0.3 | 0.2 |   | CP | inc dword ptr [rdi*4]
|   4    | 0.2    | 0.3 | 1.0       | 1.0   1.0 | 1.0 | 0.2 | 0.2 |   | CP | inc dword ptr [r8*4]
|   4    | 0.2    | 0.2 | 1.0   1.0 | 1.0       | 1.0 | 0.2 | 0.4 |   | CP | inc dword ptr [r9*4]
|   4    | 0.2    | 0.3 | 1.0       | 1.0   1.0 | 1.0 | 0.4 | 0.1 |   | CP | inc dword ptr [r10*4]
|   1    | 0.9    |     |           |           |     |     | 0.1 |   |    | add r14, 0x8
|   0F   |        |     |           |           |     |     |     |   |    | jnz 0xffffffffffffff92
Total Num Of Uops: 42
(apologies if it comes through tab damaged -- cut and paste to a fixed
width font may help)
The important part is that it thinks the bottleneck is the AGUs
(address generation units). In theory, Haswell is supposed to be able
to sustain two loads and one store per cycle, since unlike Sandy
Bridge it has a dedicated store AGU (Port 7). But it doesn't appear
that Port 7 is being used, and as a result the Port 2 and 3 load AGUs
are being overwhelmed. I don't know why the Port 7 AGU isn't being
utilized.
But this would imply that if we can load 64 bits at a time and
subdivide them into 8-bit units, we might come out faster. This is
how I moved from 2000 MB/s to 2200 MB/s, but only with ICC and some
inline assembly. At this point the limiting factor becomes the
front-end decoder, which can only issue 4 uops per cycle (even when
coming from the uop cache, as we are). The combination of two loads,
an inc, and a store would fit in a single cycle, but two loads, an
inc, a store, a shift, and a mov is too much.
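In C, the idea looks something like this (the real version needed inline assembly to come out right; this is just the shape):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One 8-byte load per 8 counters: shifts peel off the bytes, so
       Ports 2/3 see one load instead of eight -- at the cost of the
       extra shift/mov uops discussed above. */
    void count_8x(uint32_t count[8][256], const uint8_t *src, size_t len)
    {
        for (size_t i = 0; i + 8 <= len; i += 8) {
            uint64_t w;
            memcpy(&w, src + i, 8);        /* single 64-bit load */
            for (int j = 0; j < 8; j++) {
                count[j][w & 0xFF]++;      /* byte j goes to its own table */
                w >>= 8;
            }
        }
        /* remainder handling omitted */
    }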
I'd hoped that making use of the legacy "high byte" and "low byte"
partial registers would get past this, but even when I convinced the
compiler to produce the code I wanted, it wasn't faster. I think
this is because there are some issues accessing the high and low for
the same register in succession, but I'm not sure.
I'll keep poking around a bit.
--nate
========================================================
OK, making some more progress. The lack of operations scheduled on
Port 7 is indeed the limiting factor (put another way, the bottleneck
is Ports 2 and 3 because Port 7 is not being used). The issue is
described here:
Port 7 is very simple-minded: it can only handle simple offsets, not
[offset + base + index] type addressing. But with some awkward syntax
it's possible to force the simple form to be used. This gets me up to
2400 MB/s with ICC and 2250 MB/s with GCC. I'm using 16 tables to get
there, although it's possible there is a way to use fewer (I haven't
yet found it).
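The "awkward syntax" amounts to getting the compiler to finish the address arithmetic before the increment. A hypothetical illustration (whether it helps depends entirely on what the compiler actually emits):

    /* If the element's full address lands in a single register, the
       store-address uop is the simple [reg] form Port 7 can handle;
       [table + reg*4] needs an index and can only go to Ports 2/3. */
    uint32_t *cell = &count0[src[i]];   /* LEA does the math on an ALU port */
    *cell += 1;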
These are going to be Haswell-only increases, since Sandy Bridge and
Ivy Bridge have no Port 7 to work with. IACA projects that it could
be even faster, about 3200 MB/s, but that assumes ideal port
scheduling, which isn't actually happening. The imperfect scheduling
isn't quite enough to explain the discrepancy between theoretical and
observed speed, but it might explain most of it.
(If you are not familiar with IACA, someone just recently put up a
nice intro here:
http://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-i)
--nate
Current 2400 MB/s ICC inner loop viewed with 'perf record'/'perf
report' looks like this:
1.53 │4d0:┌─→movzbl (%rax,%rdx,1),%ecx
3.84 │ │ movzbl 0x1(%rax,%rdx,1),%esi
3.23 │ │ movzbl 0x2(%rax,%rdx,1),%edi
1.48 │ │ incl 0x614440(%rcx)
11.63 │ │ incl 0x614860(%rsi)
6.87 │ │ movzbl 0x3(%rax,%rdx,1),%r8d
1.17 │ │ movzbl 0x4(%rax,%rdx,1),%r9d
0.43 │ │ movzbl 0x5(%rax,%rdx,1),%r10d
0.69 │ │ movzbl 0x6(%rax,%rdx,1),%r11d
2.16 │ │ movzbl 0x7(%rax,%rdx,1),%ecx
0.93 │ │ movzbl 0x8(%rax,%rdx,1),%esi
0.55 │ │ incl 0x614c80(%rdi)
4.58 │ │ incl 0x6150a0(%r8)
3.90 │ │ incl 0x6154c0(%r9)
2.98 │ │ incl 0x6158e0(%r10)
4.53 │ │ incl 0x615d00(%r11)
4.67 │ │ incl 0x616120(%rcx)
3.72 │ │ incl 0x616540(%rsi)
3.75 │ │ movzbl 0x9(%rax,%rdx,1),%edi
1.32 │ │ movzbl 0xa(%rax,%rdx,1),%r8d
0.90 │ │ movzbl 0xb(%rax,%rdx,1),%r9d
0.68 │ │ movzbl 0xc(%rax,%rdx,1),%r10d
1.28 │ │ movzbl 0xd(%rax,%rdx,1),%r11d
1.31 │ │ movzbl 0xe(%rax,%rdx,1),%ecx
1.12 │ │ movzbl 0xf(%rax,%rdx,1),%esi
0.68 │ │ incl 0x616960(%rdi)
4.68 │ │ incl 0x616d80(%r8)
4.00 │ │ incl 0x6171a0(%r9)
3.98 │ │ incl 0x6175c0(%r10)
3.80 │ │ incl 0x6179e0(%r11)
4.19 │ │ incl 0x617e00(%rcx)
4.16 │ │ incl 0x618220(%rsi)
3.48 │ │ add $0x10,%rax
│ └──jne 4d0
=====================================================================
I thought of another approach while falling asleep last night, and it
seems to be working. Instead of trying to add workload to Port 7
(which would be Haswell-only), reduce the workload on Ports 2 and 3
so that Sandy Bridge benefits as well. So I'm loading 16B at a time
into an XMM vector and then extracting the bytes: 2 cycles of latency
plus two uops per extract, but it removes 15 loads per 16 bytes.
I've got it running at 2400 MB/s on both Haswell and Sandy Bridge,
compiled with both ICC and GCC.
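In C, the shape is something like this (a sketch assuming SSE4.1 for the byte extracts; as noted below, the real version still needs some inline assembly):

    #include <smmintrin.h>   /* SSE4.1: _mm_extract_epi8 */
    #include <stddef.h>
    #include <stdint.h>

    /* One 16-byte vector load replaces 16 scalar loads; each lane is
       then extracted and counted in its own table. */
    void count_vec(uint32_t count[16][256], const uint8_t *src, size_t len)
    {
        for (size_t i = 0; i + 16 <= len; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
            count[ 0][_mm_extract_epi8(v,  0)]++;
            count[ 1][_mm_extract_epi8(v,  1)]++;
            count[ 2][_mm_extract_epi8(v,  2)]++;
            count[ 3][_mm_extract_epi8(v,  3)]++;
            count[ 4][_mm_extract_epi8(v,  4)]++;
            count[ 5][_mm_extract_epi8(v,  5)]++;
            count[ 6][_mm_extract_epi8(v,  6)]++;
            count[ 7][_mm_extract_epi8(v,  7)]++;
            count[ 8][_mm_extract_epi8(v,  8)]++;
            count[ 9][_mm_extract_epi8(v,  9)]++;
            count[10][_mm_extract_epi8(v, 10)]++;
            count[11][_mm_extract_epi8(v, 11)]++;
            count[12][_mm_extract_epi8(v, 12)]++;
            count[13][_mm_extract_epi8(v, 13)]++;
            count[14][_mm_extract_epi8(v, 14)]++;
            count[15][_mm_extract_epi8(v, 15)]++;
        }
    }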
It still requires some inline assembly to force the compiler to
produce what I want --- I'll see if there is a way to remove this. It
also requires 16 tables --- I'll see if I can get down to 8. And I'm
not sure why I'm at 2400 instead of 3000 --- there might still be a
way to get there.
I'll wrap up and send you what I have by the end of the day.
--nate
========================================================
OK, I think I'm reaching the end of my knowledge. I have two
approaches: one based on trying to force the use of Haswell's Port 7
to avoid the AGU bottleneck, and one based on making less use of the
AGUs by reading a 16B vector and splitting it into bytes.
In theory, the Port 7 approach should be really fast on Haswell:
around 3000 MB/s. In practice, it's running at about 2100 MB/s on
Sandy Bridge and only 2200 MB/s on Haswell. The problem seems to be
that it is very difficult to order the instructions so that the loads
don't occasionally "steal" Ports 2 and 3, causing congestion.
The other approach is a bit faster (2450 MB/s on Haswell, 2200 MB/s on
Sandy Bridge), but I don't know that there is as much room for
improvement: it's limited by the sheer number of uops. A loop can
only sustain a throughput of 4 uops per cycle, and we have about 100
uops per 16 bytes. I think the only way to make it faster is to
figure out a way to use fewer uops. There certainly might be one ---
my hope is that if you post the code and blog about it, someone with
more insight will turn up who can get it to full speed.
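(The arithmetic checks out: 100 uops at 4 per cycle is 25 cycles per 16 bytes, about 1.6 cycles per byte, which at 3.4 GHz works out to roughly 2200 MB/s --- right around what we're measuring.)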
I turned your fullbench.c into a standalone program (renamed
countbench.c) so that it might be easier for others to play with. My
edits were rough --- it's possible that I broke things and didn't
notice. I was only testing for speed and didn't check for
correctness; I'm not even collecting the totals at the end of the
loop, and I don't know if the "remainder" handling works. I've only
tried compiling on Linux, although I've done so on both Haswell and
Sandy Bridge.
It's also missing your faster routine from the comparison, as I didn't
get around to putting it back in after removing the other files. The
numbers for the routines have moved around (see the source).
Importantly, I changed the loop to do a constant number of iterations
rather than running for a constant time. Changing it back would be
fine, but the constant-iteration approach made comparisons of
instruction counts easier.
--nate
# Intel Haswell i7-4770 CPU @ 3.40GHz (Turbo off)
nate@haswell:~/src/yann$ countbench -P90
*** FSE speed analyzer 64-bits, by Yann Collet (Oct 1 2014) ***
Generating 64 KB with P=90.00%
trivialCount : 628.3 MB/s (0)
trivialCount_b : 628.3 MB/s (0)
count_8x : 2229.1 MB/s (0)
count_8x_b : 2191.8 MB/s (0)
count_vec : 2445.4 MB/s (0)
nate@haswell:~/src/yann$ fullbench -b1 -P90
*** FSE speed analyzer 64-bits, by Yann Collet (Sep 30 2014) ***
Generating 64 KB with P=90.00%
FSE_count(255) : 1825.5 MB/s (58868)
# Intel Xeon E5-1620 @ 3.60GHz (Turbo off)
nate@sandybridge:~$ countbench -P90
*** FSE speed analyzer 64-bits, by Yann Collet (Oct 1 2014) ***
Generating 64 KB with P=90.00%
trivialCount : 637.5 MB/s (0)
trivialCount_b : 636.9 MB/s (0)
count_8x : 2048.0 MB/s (0)
count_8x_b : 2080.5 MB/s (0)
count_vec : 2162.9 MB/s (0)
nate@sandybridge:~$ fullbench -b1 -P90
*** FSE speed analyzer 64-bits, by Yann Collet (Sep 30 2014) ***
Generating 64 KB with P=90.00%
FSE_count(255) : 1702.2 MB/s (58868)