SSE Accelerated Histogram Calculation

At long last, the SSE tutorial series actually gets around to writing some code! Today’s post will explain how to write a function that finds a discrete histogram of an input signal. What we are going to do is:



If you don’t understand what the central limit theorem is, or what a histogram is, don’t worry, the next section will explain those things. If you do, feel free to skip the next section.

Mathematical Preliminaries

The central limit theorem states that if you have a number of random variables, each of which are independently and identically distributed, their sum will tend to be normally distributed. Let’s analyze that statement in detail – there are two key conditions: the random variables should be independently and identically distributed, and the result states that if these restrictions hold, then the sum of these random variables will itself tend to be normally distributed. The restriction on being identically distributed means that you can’t sum one random variable that is uniformly distributed and another that is, say, chi-square distributed and expect the central limit theorem to hold. The restriction on independence means that you can’t have one random variable “affecting” other random variables. The result states that no matter what the individual distributions are, the sum will tend towards being normally distributed as you keep increasing the number of variables that you are summing.


If you think about it, this is a tall claim. How can this work for any and every distribution?! It turns out this is true, and we don’t need to go into the details of the rigorous proof. Instead let’s try to visualize this for a simple example. Think about a random variable obtained by summing the result of two dice throws. It should be clear that the individual dice throws themselves are uniformly distributed and independent. But note that the sum of two throws is more likely to be 7 than any other number. This is because there are more ways to sum to 7 (1+6, 2+5, 3+4, 4+3, 5+2, 6+1) than there are to sum to, for example, 12 (only 6+6).  Intuitively, this means that if you make two random throws and add them up, you are more likely to end up in the middle than the edges because there are more ways to get to the middle.


Now that we’ve got the central limit theorem out of the way, what is a histogram? Assume that I have a large array filled up with numbers which lie in a certain range. Now, let us split up this range into n bins – each of these bins correspond some non-overlapping interval in the input range. For example, if my input range is 0-100, I could split it up into 10 bins: 0-10, 10-20, 20-30, 30-40 … 90-1001. The histogram of the array is just a count of how many numbers in the input array fall in each bin. It should be clear that the histogram is an approximation of the probability distribution function of the elements of the array.

Calculating the Histogram

For all that fuss, calculating the histogram is a really simple operation. We initialize the result array with zeros, then iterate over each element in the array, find which bin the element falls into and increment the count for that bin.

Let N be the number of bins, Amax and Amin be the maximum and minimum values of elements in the array respectively. The width of each bin Wb = (Amax – Amin) / N. If each bin is given zero-based index, Bi (i.e. first bin (Amin ≤ x < Wb) has index 0, the next bin (Wb ≤ x < 2Wb) has index 1 and the last bin ((N-1)Wb ≤ x < Amax) has index N-1, then the index Bi for an input value Ai can be calculated as Bi = floor((Ai – Amin)/Wb). Convince yourself that this equation is correct, because it is the key to the discussion that follows.


A straightforward translation of the equations in the previous paragraph leads to this function for calculating the histogram of an array.


void findHistogramx87(float* data, int count, float min, float max, int nBins, int* histogram)  {

      zeroArray(histogram, nBins);

 

      float binSize = (max - min) / nBins;

      for(int i = 0; i < count; i++)  {

            int bin = int((data[i] - min) / binSize);

            histogram[bin] += 1;

      }

}


Hopefully, the code is self-explanatory. Note that we are relying on the C/C++-standards which specify that when a double value is converted to an integer it is truncated rather than rounded and so the floor function that appears in the equation for Bi does not appear explicitly in the code.


A key point here is that the calculation of the histogram is a data-parallel operation. In other words, each iteration of the for-loop is independent of the other iterations. This is almost always an indication that the code can benefit from using the SSE instruction set.

Vectorizing the Histogram Calculation

The idea here is fairly simple. Since the 128-bit SSE registers allow us to process 4 floats at a time, we just rewrite the inner loop to work on 4 elements in parallel. There is one small problem though – we still need to handle arrays of arbitrary sizes, including those that are not multiples of 4. Our strategy is simple, we will use SSE to process as much of the array as possible by sticking to multiples of 4, and for the elements that are leftover, we will just use plain x87 floating point code.


That said, here is what this boils down to:


void findHistogramSSE(float* data, int count, float min, float max, int nBins, int* histogram)  {

      zeroArray(histogram, nBins);

      float binSize = (max - min) / nBins;

 

      int oddAmount = count & 3; // note that this is actually count % 4.

      for(int i = 0; i < oddAmount; i++)  {

            int bin = int(((*data++) - min) / binSize);

            histogram[bin] += 1;

      }

 

      __asm   {

            // xmm0 = {x, y, z, binSize}

            movss       xmm0, binSize;

            // xmm0 = {z, z, binSize, binSize}

            movsldup    xmm0, xmm0;

            // xmm0 = {binSize, binSize, binSize}

            movlhps           xmm0, xmm0;

 

            // next 3 instructions make

            // xmm1 = {min, min, min, min}

            movss       xmm1, min;

            movsldup    xmm1, xmm1;

            movlhps           xmm1, xmm1;

 

            // setup loop counter as

            // array size / 4.

            mov               ecx, count;

            shr               ecx, 2;

            // esi points to input data

            // ebx points to output histogram

            mov               esi, data;

            mov               ebx, histogram;

 

            hLoop:

                  // load four elements from input array in xmm2

                  movups            xmm2, [esi];

                  // subtract min from each element of xmm2

                  subps       xmm2, xmm1;

                  // divide each element of xmm2 by binSize

                  divps       xmm2, xmm0;

                  // and convert each element of xmm2 into integers

                  cvttps2dq   xmm3, xmm2;

                 

                  // get least significant dword of xmm3;

                  movd        eax, xmm3;

                  // and increment its histogram value.

                  add               dword ptr [ebx+eax*4], 1;

 

                  // right shift 128-bit register by 4 bytes

                  // so we can get at the next dword.

                  psrldq            xmm3, 4;

                  movd        eax, xmm3;

                  // and use the new least significant dword

                  // to increment in the output histogram

                  add               dword ptr [ebx+eax*4], 1;

 

                  // again

                  psrldq            xmm3, 4;

                  movd        eax, xmm3;

                  add               dword ptr [ebx+eax*4], 1;

 

                  // and again.

                  psrldq            xmm3, 4;

                  movd        eax, xmm3;

                  add               dword ptr [ebx+eax*4], 1;

 

                  // increment input pointer.

                  add               esi, 16;

                  // decrement loop counter.

                  sub               ecx, 1;

                  jnz               hLoop;

      }

}

 

That is a lot of code to take in at once. Let’s do a blow-by-blow analysis of the action! :-)


The first few lines are fairly straightforward. The zeroArray function initializes the output array with zeros. Then, we calculate the width of each bin and deal with the “left-over” elements for the case when the input array length is not a multiple of 4.

      zeroArray(histogram, nBins);

      float binSize = (max - min) / nBins;

 

      int oddAmount = count & 3; // note that this is actually count % 4.

      for(int i = 0; i < oddAmount; i++)  {

            int bin = int(((*data++) - min) / binSize);

            histogram[bin] += 1;

      }

 

Now, we have to do some setup work before we actually process elements in the loop.


            // xmm0 = {x, y, z, binSize}

            movss       xmm0, binSize;

            // xmm0 = {z, z, binSize, binSize}

            movsldup    xmm0, xmm0;

            // xmm0 = {binSize, binSize, binSize}

            movlhps           xmm0, xmm0;

 

            // next 3 instructions make

            // xmm1 = {min, min, min, min}

            movss       xmm1, min;

            movsldup    xmm1, xmm1;

            movlhps           xmm1, xmm1;

 

            // setup loop counter as

            // array size / 4.

            mov               ecx, count;

            shr               ecx, 2;

            // esi points to input data

            // ebx points to output histogram

            mov               esi, data;

            mov               ebx, histogram;

 

The first instruction movss [move scalar single precision value] copies the value of binSize from the stack into the lowest 32-bit part of the xmm0 register. The other parts of the register are left unchanged. Our goal here is to make all elements of xmm0 contain the bin size, so the next two instructions enact a little circus to achieve that. The movsldup instruction is best explained with an example: assume that xmm0 = {x, y, z, w}, then movsldup xmm1, xmm0 will result in xmm1 containing the values {z, z, w, w}. Basically, the instruction is duplicating the two 32-bit values in the lower half of the source register into the upper and lower parts of the destination register. The movlhps instruction copies the lower half of the source register into the upper half of the destination register. Combined, we’ve now copied the bin size into all four parts of the xmm0 register.


Note that all the instructions we’ve encountered so far have an ‘s’ in them somewhere. The letter ‘s’ stands for single precision. If you were working with double-precision floating point, you’d find the letter ‘d’ in the instruction. For example, just like the movss instruction, there is movsd instruction which copies a scalar double precision value to/from an XMM register. The next thing to notice is the word scalar. Some SSE instructions operate on a single floating point value, and some operate on tuples of packed data. The letter ‘s’ in the movss instruction says that it is moving scalar values [a single floating point value rather than a tuple]. If you wanted to move, 4 floating point values “at once” into the register, you would use the movups or the movaps instructions. The letter ‘p’ in these instructions indicates that we are working with packed data. The question you should be asking now is why are there two versions of the packed move? I mentioned in a previous post that SSE provides aligned move instructions that operate on addresses which are aligned on 16-byte boundaries. If your data is aligned, data transfer when using the aligned move instructions will be faster. (If you pass an address that is not 16-bit aligned to instructions that require data-alignment, your program will crash) Since not everybody has the luxury of aligning their data, so there is an equivalent unaligned move. And this instruction is slightly slower than the aligned move, but will work for all types of addresses. So the letters ‘a’ and ‘u’ in the movaps and movups instructions stand for aligned and unaligned respectively.


Here is the loop that actually does the histogram calculation.


            hLoop:

                  // load four elements from input array in xmm2

                  movups            xmm2, [esi];

                  // subtract min from each element of xmm2

                  subps       xmm2, xmm1;

                  // divide each element of xmm2 by binSize

                  divps       xmm2, xmm0;

                  // and convert each element of xmm2 into integers

                  cvttps2dq   xmm3, xmm2;

                 

                  // get least significant dword of xmm3;

                  movd        eax, xmm3;

                  // and increment its histogram value.

                  add               dword ptr [ebx+eax*4], 1;

 

                  // right shift 128-bit register by 4 bytes

                  // so we can get at the next dword.

                  psrldq            xmm3, 4;

                  movd        eax, xmm3;

                  // and use the new least significant dword

                  // to increment in the output histogram

                  add               dword ptr [ebx+eax*4], 1;

 

                  // again

                  psrldq            xmm3, 4;

                  movd        eax, xmm3;

                  add               dword ptr [ebx+eax*4], 1;

 

                  // and again.

                  psrldq            xmm3, 4;

                  movd        eax, xmm3;

                  add               dword ptr [ebx+eax*4], 1;

 

                  // increment input pointer.

                  add               esi, 16;

                  // decrement loop counter.

                  sub               ecx, 1;

                  jnz               hLoop;

 

Because, I’m a lazy guy who doesn’t want to align my data, I am just using the unaligned packed move to load 4 single precision floating point numbers into the xmm2 register. Then I am subtracting each element of the xmm1 register from the corresponding element in the xmm2 register; the result goes back to  xmm2. The alert reader will notice subps means subtract packed single precision. Then I divide each element of xmm2 register by the corresponding element in the xmm0 register and convert the result to an integer using the cvttps2dq instruction. Read the cvtps2dq as “convert with truncate packed single precision to double quadword”. There is a similar instruction, cvtps2dq rounds values instead of truncating them.


Let’s set aside all the jargon, and take stock what we’ve done so far. We have 4 floating point numbers x0, x1, x2 and x3 from the input array. We’ve subtracted min from each xi [because we’d loaded xmm1 with all min’s]. Then we’ve divided the result of each subtraction by binSize [xmm0 had binSize, remember?] and converted each result to an integer. The key point to note here is that we’ve dealt with 4 floating point numbers with each instruction. To do the same thing using the x87 instructions we would have had to write a for-loop with four iterations to perform the same calculation, but because SSE gives us these instructions that work on 4 numbers at once, (in theory), our code can become 4X as fast.


If you’ve got so far, the rest of the code should be really easy to figure out. Now we have 4 integers which are actually array indices sitting in the xmm3 register, all we do next is pull out each array index and increment the corresponding element in the output array to update the histogram.

Conclusion

I’ve uploaded a VS2005 project that contains the source code for this application here. Check out the source code, I’ve tried to make it as readable as possible and there are a few goodies - like a routine that determines if your processor has SSE3 support.


On this Centrino Duo T7200, the SSE accelerated version is slightly more than 4X faster than the “regular” version. I tried this program on a Core 2 Quad Q6600 and also got similar results. I expect the speedup to be somewhat lesser on a Pentium 4, because the NetBurst microarchitecture has inferior support for SSE as compared to the Core microarchitecture.


Hopefully, this whole series was useful to you. Please let me know if you questions/comments or if you want me to blog about some specific problem.


Contact

My name is Pramod Subramanyan. I'm a software engineer working at National Instruments. If you have questions about the content, please do feel free to contact me.

Email: pramod (dot) subramanyan (at) gmail (dot) com.
Website: http://pramod.sub.googlepages.com/
Blog: http://fundae.wordpress.com/