Counting bytes fast

by Nathan Kurz

-------------------------------

Hi Yann ---

I tried to post a long comment to your blog, but the form didn't want to take it. It says I'm over the 4KB length limits, but my attempts to shorten didn't go through either. So I'll send this to you this way, and just post a short summary on the blog.

Nice article, and interesting to ponder. I think your conclusions are

right, and your solution good, but the that the problem isn't actually

due to any write commit delay. These used to be an issue, but with

current store-to-load forwarding, they aren't a factor any more.

Instead, as 'gpd' suggests, the issue is with speculative loads. The

processor tries to get ahead by reading ahead in the input stream and

pre-loading the appropriate counter. It even speculatively executes

the addition and issues the write. But before the write is retired,

the counter preload is invalidated by a previous store. Usually no

reload of the counter is necessary, since the updated value is

forwarded directly to the appropriate register from the load, but the

work done for the increment and store is tossed out (not retired) and

the micro-ops are re-executed with the new value.

I think I can show this fairly directly using the CPU's performance counters. First, I modified your benchmark program to execute the loop a constant number of times rather than for a constant length of time. Your approach is really nifty, but it complicates the counter comparisons:

- while(BMK_GetMilliSpan(milliTime) < TIMELOOP)

+ for (int i = 0; i < 10000; i++)

Then I ran the trivial counter under 'likwid' for both 20% and 90%

probabilities. Likwid is a performance monitoring program similar to

'perf', and is available at https://code.google.com/p/likwid/. Some

of the exact counter names are part of the local customization,

though, so you might have to do things a little differently to

replicate.

nate@haswell:~/src/yann$ likwid -g UOPS_EXECUTED_PORT_PORT_0156:PMC0,UOPS_EXECUTED_PORT_PORT_23:PMC1,UOPS_EXECUTED_PORT_PORT_4:PMC2,MEM_UOP_RETIRED_STORES:PMC3 -C2 fullbench -b100 -i2 -P20

-------------------------------------------------------------

CPU type: Intel Core Haswell processor

CPU clock: 3.39 GHz

-------------------------------------------------------------

Generating 64 KB with P=20.00%

trivialCount : 1715.6 MB/s (0)

+------------------------------+-------------+

| Event | core 2 |

+------------------------------+-------------+

| UOPS_EXECUTED_PORT_PORT_0156 | 2.36448e+09 |

| UOPS_EXECUTED_PORT_PORT_23 | 2.48144e+09 |

| UOPS_EXECUTED_PORT_PORT_4 | 1.77152e+09 |

| MEM_UOP_RETIRED_STORES | 1.3129e+09 |

| INSTR_RETIRED_ANY | 6.56242e+09 |

| CPU_CLK_UNHALTED_CORE | 2.59642e+09 |

| CPU_CLK_UNHALTED_REF | 2.59641e+09 |

+------------------------------+-------------+

nate@haswell:~/src/yann$ likwid -g UOPS_EXECUTED_PORT_PORT_0156:PMC0,UOPS_EXECUTED_PORT_PORT_23:PMC1,UOPS_EXECUTED_PORT_PORT_4:PMC2,MEM_UOP_RETIRED_STORES:PMC3 -C2 fullbench -b100 -i2 -P90

Generating 64 KB with P=90.00%

trivialCount : 628.3 MB/s (0)

+------------------------------+-------------+

| Event | core 2 |

+------------------------------+-------------+

| UOPS_EXECUTED_PORT_PORT_0156 | 6.1712e+09 |

| UOPS_EXECUTED_PORT_PORT_23 | 3.06585e+09 |

| UOPS_EXECUTED_PORT_PORT_4 | 5.60409e+09 |

| MEM_UOP_RETIRED_STORES | 1.31305e+09 |

| INSTR_RETIRED_ANY | 6.56342e+09 |

| CPU_CLK_UNHALTED_CORE | 7.08232e+09 |

| CPU_CLK_UNHALTED_REF | 7.08232e+09 |

+------------------------------+-------------+

Ports 0, 1, 5, and 6 are the execution ports that in this case handle

the addition and the loop logic. Ports 2 and 3 are address

generation for the loads. Port 4 is used for the stores. What's

shown is that although the total number of instructions retired is the

same, the 90% case takes 3x as long to execute. Interestingly, in the

90% case there are also about 3 times as many stores executed as

retired, whereas for the 20% case it's about 1.2x. The scalar

executions are also about 3x, while the number of load executions

doesn't change much. The number of stores retired (actually

committed, not retried), though, stays constant. While there might

also be other things happening, the increase in the number of scalar

and store executions would account for almost all the increase in

runtime.

By comparison, here are the numbers for your 4 table case:

nate@haswell:~/src/yann$ likwid -g UOPS_EXECUTED_PORT_PORT_0156:PMC0,UOPS_EXECUTED_PORT_PORT_23:PMC1,UOPS_EXECUTED_PORT_PORT_4:PMC2,MEM_UOP_RETIRED_STORES:PMC3 -C2 fullbench -b1 -i2 -P20

Generating 64 KB with P=20.00%

FSE_count(255) : 1867.1 MB/s (13142)

+------------------------------+-------------+

| Event | core 2 |

+------------------------------+-------------+

| UOPS_EXECUTED_PORT_PORT_0156 | 1.62161e+09 |

| UOPS_EXECUTED_PORT_PORT_23 | 2.36236e+09 |

| UOPS_EXECUTED_PORT_PORT_4 | 1.34547e+09 |

| MEM_UOP_RETIRED_STORES | 1.32183e+09 |

| INSTR_RETIRED_ANY | 3.69438e+09 |

| CPU_CLK_UNHALTED_CORE | 2.38884e+09 |

| CPU_CLK_UNHALTED_REF | 2.38884e+09 |

+------------------------------+-------------+

nate@haswell:~/src/yann$ likwid -g UOPS_EXECUTED_PORT_PORT_0156:PMC0,UOPS_EXECUTED_PORT_PORT_23:PMC1,UOPS_EXECUTED_PORT_PORT_4:PMC2,MEM_UOP_RETIRED_STORES:PMC3 -C2 fullbench -b1 -i2 -P90

Generating 64 KB with P=90.00%

FSE_count(255) : 1820.4 MB/s (58868)

+------------------------------+-------------+

| Event | core 2 |

+------------------------------+-------------+

| UOPS_EXECUTED_PORT_PORT_0156 | 1.92168e+09 |

| UOPS_EXECUTED_PORT_PORT_23 | 2.39311e+09 |

| UOPS_EXECUTED_PORT_PORT_4 | 1.74347e+09 |

| MEM_UOP_RETIRED_STORES | 1.322e+09 |

| INSTR_RETIRED_ANY | 3.69836e+09 |

| CPU_CLK_UNHALTED_CORE | 2.447e+09 |

| CPU_CLK_UNHALTED_REF | 2.447e+09 |

+------------------------------+-------------+

Two things seem interesting here. The first is that even the 4 table

case has a significant number of retries of the stores and increments

due to invalidations --- UOPS_EXECUTED_PORT_PORT_4 and

UOPS_EXECUTED_PORT_PORT0156 are about 20% higher for the 90% case.

But the increase in the runtime (CPU_CLK_UNHALTED) is only about 2%.

Apparently most of the time the CPU is able to fit the additional

microops into the same number of cycles. This makes sense, since it

takes about 2 cycles per byte, during which time processor should be

able to execute 2 stores plus 4-8 scalar operations.

The range for the scalar ops is because I don't understand whether the

invalidated ops needs to be re-issued, which is subject a 4 op per

cycle limit, or just re-executed, which is limited only by the

available execution ports. But it does seem in theory possible that

if the loop was sufficiently unrolled to reduce the overhead of the

loop logic and enough tables were used to completely avoid

invalidations, that a speedup to 1 cycle per byte should be possible.

--nate

============================================================

I'm having some success with speedup on 3.4 GHz Haswell system, and

have gone from 1.8 GB/s to 2.2 GB/s if all the stars align.    I'm

still not understanding all the factors affecting performance, but I

think I'm getting closer.

1) I get a small speedup at -P90 when going from 4 to 8 tables.

2) I get a tiny boost changing the loop logic to increment a negative

offset from lastSrc, since this allows the loop to use the flags set

by the add operation (instead of an additional cmp) to see if the loop

should terminate.

3) I'm frequently having better luck compiling with Intel's ICC than

with GCC.  I don't recommend you do so, but I'm hoping to learn from

the assembly produced why it might be doing better.

4) I got a significant speedup changing the alignment of the tables.

I think the issue is that if the tables are in sequential 256B pieces,

the memory disambiguation can have false positives even though the

input is to a different table.  Changing the allocated size of the

arrays to (264) or (300) has a positive effect for the -P90 case.

Changing the size to 512 had a strong negative effect.  I don't know

how this mechanism works, but speculate that it is either using the

low bits, or has a maximum number of entries per cache line, and

somehow the non-power-of-two size avoids collisions.

> For the fun and comparison, I tried a (false) variant which would only write to destination cell (no ++, just = constant), and it reached 2.7 GB/s, which incidentally corresponds to the speed of my CPU (2.7 GHz).

Note that Intel's "Turbo Boost" feature can cause the actual speed to

be different than the rated speed.  On Linux, almost all system

utilities report the nominal (label) speed rather than the boosted

speed.   i7z and turbostat are two exceptions.  "Speedstep" can cause

variable readings as well, although usually slower than expected.

To turn both of them off:

# for num in 0 1 2 3 4 5 6 7; do echo

performance >  /sys/devices/system/cpu/cpu${num}/cpufreq/scaling_governor;

done

# echo 0 > /sys/devices/system/cpu/cpufreq/boost

It's easier to do this as root than with sudo, as redirection

operators revert to normal user privilege.  I presume there are

Windows equivalents, but am unfamiliar.

> So it seems there really is a cost for read, add & write which can't be completely hidden within the pipeline by OoOex.

ps are re-executed with the new value.

I'm not sure yet how fast it can go.   Intel's IACA tool can be useful

for understanding limiting factors.  Here's what it shows for one of

my ICC compiled 2000 MB/s loops:

nate@haswell:~/src/yann$ /opt/intel/iaca/bin/iaca  -mark 0 -64 -arch

HSW -analysis THROUGHPUT iaca.o

Intel(R) Architecture Code Analyzer Version - 2.1

Analyzed File - iaca.o

Binary Format - 64Bit

Architecture  - HSW

Analysis Type - Throughput

*******************************************************************

Intel(R) Architecture Code Analyzer Mark Number 1

*******************************************************************

Throughput Analysis Report

--------------------------

Block Throughput: 12.20 Cycles       Throughput Bottleneck: PORT2_AGU, PORT3_AGU

Port Binding In Cycles Per Iteration:

---------------------------------------------------------------------------------------

|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5

 |  6   |  7   |

---------------------------------------------------------------------------------------

| Cycles | 2.5    0.0  | 2.5  | 12.0   8.0  | 12.0   8.0  | 8.0  | 2.5

 | 2.5  | 0.0  |

---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV

- Divider pipe (on port 0)

D - Data fetch pipe (on ports 2 and 3), CP - on a critical path

F - Macro Fusion with the previous instruction occurred

* - instruction micro-ops not bound to a port

^ - Micro Fusion happened

# - ESP Tracking sync uop was issued

@ - SSE instruction followed an AVX256 instruction, dozens of cycles

penalty is expected

! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles

    |    |

|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |

 7  |    |

---------------------------------------------------------------------------------

|   1    | 0.7       |     |           |           |     |     | 0.3 |

    |    | test r14, r14

|   0F   |           |     |           |           |     |     |     |

    |    | jz 0x7c

|   0*   |           |     |           |           |     |     |     |

    |    | nop dword ptr [rax], eax

|   0*   |           |     |           |           |     |     |     |

    |    | nop dword ptr [rax], eax

|   1    |           |     | 1.0   1.0 |           |     |     |     |

    | CP | movzx eax, byte ptr [r14+r15*1]

|   1    |           |     |           | 1.0   1.0 |     |     |     |

    | CP | movzx edx, byte ptr [r14+r15*1+0x1]

|   1    |           |     | 1.0   1.0 |           |     |     |     |

    | CP | movzx ecx, byte ptr [r14+r15*1+0x2]

|   1    |           |     |           | 1.0   1.0 |     |     |     |

    | CP | movzx esi, byte ptr [r14+r15*1+0x3]

|   1    |           |     | 1.0   1.0 |           |     |     |     |

    | CP | movzx edi, byte ptr [r14+r15*1+0x4]

|   1    |           |     |           | 1.0   1.0 |     |     |     |

    | CP | movzx r8d, byte ptr [r14+r15*1+0x5]

|   1    |           |     | 1.0   1.0 |           |     |     |     |

    | CP | movzx r9d, byte ptr [r14+r15*1+0x6]

|   1    |           |     |           | 1.0   1.0 |     |     |     |

    | CP | movzx r10d, byte ptr [r14+r15*1+0x7]

|   4    |           | 0.3 | 1.0   1.0 | 1.0       | 1.0 | 0.2 | 0.4 |

    | CP | inc dword ptr [rax*4]

|   4    |           | 0.3 | 1.0       | 1.0   1.0 | 1.0 | 0.3 | 0.3 |

    | CP | inc dword ptr [rdx*4]

|   4    |           | 0.4 | 1.0   1.0 | 1.0       | 1.0 | 0.3 | 0.3 |

    | CP | inc dword ptr [rcx*4]

|   4    | 0.1       | 0.2 | 1.0       | 1.0   1.0 | 1.0 | 0.4 | 0.3 |

    | CP | inc dword ptr [rsi*4]

|   4    | 0.2       | 0.3 | 1.0   1.0 | 1.0       | 1.0 | 0.3 | 0.2 |

    | CP | inc dword ptr [rdi*4]

|   4    | 0.2       | 0.3 | 1.0       | 1.0   1.0 | 1.0 | 0.2 | 0.2 |

    | CP | inc dword ptr [r8*4]

|   4    | 0.2       | 0.2 | 1.0   1.0 | 1.0       | 1.0 | 0.2 | 0.4 |

    | CP | inc dword ptr [r9*4]

|   4    | 0.2       | 0.3 | 1.0       | 1.0   1.0 | 1.0 | 0.4 | 0.1 |

    | CP | inc dword ptr [r10*4]

|   1    | 0.9       |     |           |           |     |     | 0.1 |

    |    | add r14, 0x8

|   0F   |           |     |           |           |     |     |     |

    |    | jnz 0xffffffffffffff92

Total Num Of Uops: 42

(apologies if it comes through tab damaged -- cut and paste to a fixed

width font may help)

The important part is that it thinks the bottleneck with the AGU's

(address generation units).  In theory, Haswell is supposed to be able

to sustain 2 loads and one store per cycle, since unlike Sandy Bridge

it has a dedicated Store AGU (Port 7).  But it doesn't appear that

it's being used, and as a result, the Port 2 and 3 Load AGU's are

being overwhelmed.  I don't know why the Port 7 AGU isn't being

utilized.

But it would imply that if we can load 64-bits at a time and subdivide

into 8-bit units, we might might come out faster.  This is how I moved

from 2000 MB/s to 2200 MB/s, but only with ICC and some inline

assembly.  At this point, the limiting factor becomes the front end

decoder, which is only able to issue 4 uops per cycle (even if coming

from uop cache as we are).   The combination of two loads, an inc, and

a store would fit in a single cycle, but two loads, an inc, a store, a

shift and a mov is too much.

I'd hoped that making use of the legacy "high byte" and "low byte"

partial registers would get past this, but even when I convinced the

compiler to produce the code I wanted, it wasn't faster.   I think

this is because there are some issues accessing the high and low for

the same register in succession, but I'm not sure.

I'll keep poking around a bit.

--nate

========================================================

OK, making some more progress.  The lack of operations scheduled on

Port 7 is indeed the limiting factor (conversely, the bottleneck is

Ports 2 and 3 because Port 7 is not being used).   The issue is

described here:

http://stackoverflow.com/questions/25899395/obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62

Port 7 is very simple minded, and can only do simple offsets, and not

[offset + base + index] type addressing.  But with awkward syntax,

it's possible to use force this type of addressing to be used.  This

gets me up to 2400 MB/s with ICC, 2250 MB/s with GCC.    I'm using 16

tables to get here, although it's possible there is a way use fewer

(although I haven't yet found it).

These are going to be Haswell only increases, since Sandy Bridge/Ivy

Bridge has no Port 7 to work with.  IACA says that it could be even

faster, though, projecting about 3200 MB/s.  But that's assuming ideal

port scheduling, which isn't actually happening.  It's not quite

enough to explain the speed discrepancy between theoretical and

observed, but it might explain most of it.

(If you are not familiar with IACA, someone just recently put up a

nice intro here:

http://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-i)

--nate

Current 2400 MB/s ICC inner loop viewed with 'perf record'/'perf

report' looks like this:

  1.53 │4d0:┌─→movzbl (%rax,%rdx,1),%ecx

  3.84 │    │  movzbl 0x1(%rax,%rdx,1),%esi

  3.23 │    │  movzbl 0x2(%rax,%rdx,1),%edi

  1.48 │    │  incl   0x614440(%rcx)

11.63 │    │  incl   0x614860(%rsi)

  6.87 │    │  movzbl 0x3(%rax,%rdx,1),%r8d

  1.17 │    │  movzbl 0x4(%rax,%rdx,1),%r9d

  0.43 │    │  movzbl 0x5(%rax,%rdx,1),%r10d

  0.69 │    │  movzbl 0x6(%rax,%rdx,1),%r11d

  2.16 │    │  movzbl 0x7(%rax,%rdx,1),%ecx

  0.93 │    │  movzbl 0x8(%rax,%rdx,1),%esi

  0.55 │    │  incl   0x614c80(%rdi)

  4.58 │    │  incl   0x6150a0(%r8)

  3.90 │    │  incl   0x6154c0(%r9)

  2.98 │    │  incl   0x6158e0(%r10)

  4.53 │    │  incl   0x615d00(%r11)

  4.67 │    │  incl   0x616120(%rcx)

  3.72 │    │  incl   0x616540(%rsi)

  3.75 │    │  movzbl 0x9(%rax,%rdx,1),%edi

  1.32 │    │  movzbl 0xa(%rax,%rdx,1),%r8d

  0.90 │    │  movzbl 0xb(%rax,%rdx,1),%r9d

  0.68 │    │  movzbl 0xc(%rax,%rdx,1),%r10d

  1.28 │    │  movzbl 0xd(%rax,%rdx,1),%r11d

  1.31 │    │  movzbl 0xe(%rax,%rdx,1),%ecx

  1.12 │    │  movzbl 0xf(%rax,%rdx,1),%esi

  0.68 │    │  incl   0x616960(%rdi)

  4.68 │    │  incl   0x616d80(%r8)

  4.00 │    │  incl   0x6171a0(%r9)

  3.98 │    │  incl   0x6175c0(%r10)

  3.80 │    │  incl   0x6179e0(%r11)

  4.19 │    │  incl   0x617e00(%rcx)

  4.16 │    │  incl   0x618220(%rsi)

  3.48 │    │  add    $0x10,%rax

          │    └──jne    4d0

=====================================================================

I thought of another approach while falling asleep last night, and it

seems to be working.  Instead of trying to add workload to Port 7

(which would be Haswell only), reduce the workload on Ports 2 and 3 to

benefit Sandy Bridge as well.   So I'm loading 16B at a time into an

XMM vector, and then extracting them.  2 cycle latency plus to uops to

extract, but removes 15 loads per 16 bytes.  I've got it running at

2400 MB/s on both Haswell and Sandy Bridge, compiled with both ICC and

GCC.

It still requires some inline assembly to force the compiler to

produce what I want --- I'll see if there is a way to remove this.

Also requires 16 tables --- I'll see if I can get down to 8.  I'm also

not sure why I'm at 2400 instead of 3000 --- might still be a way to

get there.

I'll wrap up and send you what I have by the end of the day.

--nate

========================================================

OK, I think I'm reaching the end of my knowledge.  I have two

approaches:  one based on trying to force the use of Haswell's Port 7

to avoid the AGU bottleneck, and one based on trying to make less use

of the AGU by reading a 16B vector and splitting it into bytes.

In theory, the Port 7 approach should be really fast on Haswell:

around 3000 MB/s.  In practice, it's running at about 2100 MB/s on

Sandy Bridge, and only 2200 MB/s on Haswell. The problem seems to be

that it is very difficult to put the instructions in an order such

that the loads don't occasionally "steal" the use of Ports 2 and 3,

causing congestion.

The other approach is a bit faster (2450 MB/s on Haswell, 2200 MB/s on

Sandy Bridge), but I don't know that there is as much room for

improvement.  It's being limited by the shear number of uops.  A loop

can only sustain a throughput of 4 uops per cycle, and we have about

100 uops per 16 bytes.   I think the only way to make it faster is to

figure out a way to use fewer uops.  There certainly might be a way

--- my hope is that if you post the code and blog about it, someone

with more insight will turn up who can get it up to full speed.

I turned your fullbench.c into a standalone (renamed countbench.c), so

that it might be easier for others to play with.  My edits were rough

--- it's possible that I broke things and did not notice.  I was only

testing for speed, and didn't check for correctness.  I'm not even

collecting the totals at the end of the loop, and I don't know if the

"remainder" handling is working.  I've only tried compiling on Linux,

although I've done so on both Haswell and Sandy Bridge.

It's also missing your faster routine from the comparison, as I didn't

get around to putting it back in after removing the other files.   The

numbers for the routines are moved around (see the source).

Importantly, I changed the loop to do a constant number of iterations

rather than running for a constant time.   Changing back would be

fine, but the constant iteration approach made comparisons of

instruction counts easier.

--nate

# Intel Haswell i7-4770 CPU @ 3.40GHz (Turbo off)

nate@haswell:~/src/yann$ countbench -P90

*** FSE speed analyzer  64-bits, by Yann Collet (Oct  1 2014) ***

Generating 64 KB with P=90.00%

trivialCount             :    628.3 MB/s   (0)

trivialCount_b           :    628.3 MB/s   (0)

count_8x                 :   2229.1 MB/s   (0)

count_8x_b               :   2191.8 MB/s   (0)

count_vec                :   2445.4 MB/s   (0)

nate@haswell:~/src/yann$ fullbench -b1 -P90

*** FSE speed analyzer  64-bits, by Yann Collet (Sep 30 2014) ***

Generating 64 KB with P=90.00%

FSE_count(255)           :   1825.5 MB/s   (58868)

# Intel Xeon E5-1620 @ 3.60GHz (Turbo off)

nate@sandybridge:~$ countbench -P90

*** FSE speed analyzer  64-bits, by Yann Collet (Oct  1 2014) ***

Generating 64 KB with P=90.00%

trivialCount             :    637.5 MB/s   (0)

trivialCount_b           :    636.9 MB/s   (0)

count_8x                 :   2048.0 MB/s   (0)

count_8x_b               :   2080.5 MB/s   (0)

count_vec                :   2162.9 MB/s   (0)

nate@sandybridge:~$ fullbench -b1 -P90

*** FSE speed analyzer  64-bits, by Yann Collet (Sep 30 2014) ***

Generating 64 KB with P=90.00%

FSE_count(255)           :   1702.2 MB/s   (58868)