1 of 17

Lab 7

61C Summer 2023

2 of 17

SIMD

  • We will be focusing on SIMD today (single instruction, multiple data): one instruction operates on several data elements at once

3 of 17

SIMD Intrinsics

  • An intrinsic function is a function whose implementation is handled directly by the compiler, which typically replaces the call with one or a few machine instructions
  • We will be using SIMD intrinsic functions to speed up our programs

4 of 17

SIMD Intrinsics

  • Our SISD (single instruction, single data stream) instructions operate on one 32-bit value at a time
    • Ex. Adding two arrays of size 4 element-wise takes 4 add instructions: {1,2,3,4} + {1,2,3,4} = {2,4,6,8}
  • SIMD instructions use wide (128-bit) registers to store and operate on several values at the same time
    • Now we can pack 4 ints into 1 register, so adding 4 ints to 4 ints takes only 1 add instruction

5 of 17

Intel Intrinsic Functions

  • Guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
  • __m128i _mm_setzero_si128() - returns a 128-bit zero vector
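  • __m128i _mm_loadu_si128(__m128i *p) - loads and returns the 128-bit vector stored at address p (p does not need to be 16-byte aligned)
  • __m128i _mm_add_epi32(__m128i a, __m128i b) - returns vector (a_0 + b_0, a_1 + b_1, a_2 + b_2, a_3 + b_3), where a_i and b_i are the four 32-bit ints packed in a and b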

6 of 17

Intel Intrinsics

  • void _mm_storeu_si128(__m128i *p, __m128i a) - stores 128-bit vector a to memory at address p
  • __m128i _mm_cmpgt_epi32(__m128i a, __m128i b) - returns the vector (a_i > b_i ? 0xffffffff : 0x0 for i from 0 to 3) (useful as a mask for _mm_and_si128)
  • __m128i _mm_and_si128(__m128i a, __m128i b) - returns vector (a_0 & b_0, a_1 & b_1, a_2 & b_2, a_3 & b_3), where & represents the bitwise and operator
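To illustrate the mask idea, here is a minimal sketch that keeps only the elements greater than a threshold and zeroes the rest. The helper name keep_greater4 is hypothetical, and _mm_set1_epi32 (which copies one int into all four lanes) is a real SSE2 intrinsic that is not listed on these slides.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: out[i] = (in[i] > threshold) ? in[i] : 0 for i in 0..3. */
void keep_greater4(const int *in, int threshold, int *out) {
    __m128i v = _mm_loadu_si128((const __m128i *) in);
    /* _mm_set1_epi32 is not on the slides: it fills all 4 lanes with threshold. */
    __m128i t = _mm_set1_epi32(threshold);
    /* Each lane of mask is 0xffffffff where v_i > t_i, otherwise 0x0. */
    __m128i mask = _mm_cmpgt_epi32(v, t);
    /* ANDing with the mask keeps the selected lanes and zeroes the others. */
    __m128i kept = _mm_and_si128(v, mask);
    _mm_storeu_si128((__m128i *) out, kept);
}
```

Calling keep_greater4 on {1, 9, 5, 2} with threshold 4 should leave {0, 9, 5, 0} in out.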

7 of 17

Intel Intrinsics Examples

  • We have two arrays of 4 ints, arr1 and arr2; add them element-wise and store the result in arr1
    • __m128i vec1 = _mm_loadu_si128((__m128i*) arr1);
    • __m128i vec2 = _mm_loadu_si128((__m128i*) arr2);
    • __m128i result = _mm_add_epi32(vec1, vec2);
    • _mm_storeu_si128((__m128i*) arr1, result);

We have to load arrays from memory into vector registers

We update arr1
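The four statements above can be wrapped into a small function, sketched here under the assumption that both arrays hold at least 4 ints. The name add4 is hypothetical and only used for illustration.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics (__m128i, _mm_add_epi32, ...) */

/* Hypothetical helper: arr1[i] += arr2[i] for i in 0..3, using one SIMD add. */
void add4(int *arr1, const int *arr2) {
    /* Load both arrays from memory into vector registers. */
    __m128i vec1 = _mm_loadu_si128((const __m128i *) arr1);
    __m128i vec2 = _mm_loadu_si128((const __m128i *) arr2);
    /* One instruction adds all four pairs of ints. */
    __m128i result = _mm_add_epi32(vec1, vec2);
    /* Write the sums back to memory, updating arr1. */
    _mm_storeu_si128((__m128i *) arr1, result);
}
```

With arr1 = {1,2,3,4} and arr2 = {1,2,3,4}, arr1 should hold {2,4,6,8} after the call, matching the earlier SISD example.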

8 of 17

Another Intrinsics Example

  • Suppose we want to write SIMD code that adds up all the elements of an array arr

9 of 17

Another Intrinsics Example

We first create sum_vec (4 ints wide, all set to 0) to store our running sums

sum_vec: {0, 0, 0, 0}

10 of 17

Another Intrinsics Example

Next we load 4 ints (elements 0-3 of arr) into a temporary vector tmp

sum_vec: {0, 0, 0, 0}

tmp: {1, 3, 4, 1}

11 of 17

Another Intrinsics Example

We add sum_vec and tmp together and keep the result in sum_vec

sum_vec: {1, 3, 4, 1}

tmp: {1, 3, 4, 1}

12 of 17

Another Intrinsics Example

We load the next 4 elements of arr (elements 4-7) into tmp

sum_vec: {1, 3, 4, 1}

tmp: {9, 5, 2, 6}

13 of 17

Another Intrinsics Example

Once again we add sum_vec and tmp and keep the result in sum_vec

sum_vec: {10, 8, 6, 7}

tmp: {9, 5, 2, 6}

14 of 17

Another Intrinsics Example

Finally, we store sum_vec into a temporary array and then add up all 4 elements of that array: 10 + 8 + 6 + 7 = 31

sum_vec: {10, 8, 6, 7}

tmp: {9, 5, 2, 6}
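The walkthrough above can be sketched in C as follows. This is a minimal version that assumes the array length is a multiple of 4 and that arr is exactly the 8 values shown in the walkthrough; the function name sum_simd is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: sums n ints using 128-bit SIMD adds.
 * Assumption: n is a multiple of 4, so there is no tail to handle. */
int sum_simd(const int *arr, int n) {
    /* sum_vec starts as {0, 0, 0, 0}. */
    __m128i sum_vec = _mm_setzero_si128();
    for (int i = 0; i < n; i += 4) {
        /* Load the next 4 ints into tmp, then add them lane by lane. */
        __m128i tmp = _mm_loadu_si128((const __m128i *) (arr + i));
        sum_vec = _mm_add_epi32(sum_vec, tmp);
    }
    /* Store the 4 partial sums to a temporary array and add them up. */
    int lanes[4];
    _mm_storeu_si128((__m128i *) lanes, sum_vec);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

For arr = {1, 3, 4, 1, 9, 5, 2, 6}, sum_vec ends as {10, 8, 6, 7} and the function returns 31.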

15 of 17

Loop unrolling

  • Unrolling a loop (doing more operations per iteration) reduces loop overhead such as counter updates and branch checks, which usually gives a modest speedup
  • Unrolling a loop that uses SIMD intrinsics can improve performance further
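As one possible illustration of combining the two ideas, the sketch below sums an array with SIMD adds and unrolls the loop so that each iteration handles 8 ints. The function name sum_simd_unrolled and the use of two accumulators are choices made for this example, and it assumes the length is a multiple of 8.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: SIMD sum with the loop unrolled by 2 vectors.
 * Assumption: n is a multiple of 8 (two 4-int vectors per iteration). */
int sum_simd_unrolled(const int *arr, int n) {
    __m128i sum0 = _mm_setzero_si128();
    __m128i sum1 = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        /* Two independent vector adds per loop iteration. */
        sum0 = _mm_add_epi32(sum0, _mm_loadu_si128((const __m128i *) (arr + i)));
        sum1 = _mm_add_epi32(sum1, _mm_loadu_si128((const __m128i *) (arr + i + 4)));
    }
    /* Combine the two accumulators, then add up the 4 lanes. */
    __m128i sum_vec = _mm_add_epi32(sum0, sum1);
    int lanes[4];
    _mm_storeu_si128((__m128i *) lanes, sum_vec);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Whether this is actually faster than the non-unrolled SIMD loop depends on the compiler and the machine, so it is worth timing both.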

16 of 17

Loop Unrolling Example

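// Original loop: one element per iteration
int N = 100;
int arr[N];
for (int i = 0; i < N; i += 1) {
    arr[i] = i;
}

// Unrolled by a factor of 4: four elements per iteration (works because N is a multiple of 4)
int N = 100;
int arr[N];
for (int i = 0; i < N; i += 4) {
    arr[i] = i;
    arr[i + 1] = i + 1;
    arr[i + 2] = i + 2;
    arr[i + 3] = i + 3;
}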

17 of 17

Loop Unrolling Example with tail case

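// Original loop: one element per iteration
int N = 103;
int arr[N];
for (int i = 0; i < N; i += 1) {
    arr[i] = i;
}

// Unrolled by a factor of 4, with a tail loop because N is not a multiple of 4
int N = 103;
int arr[N];
// Main loop: N / 4 * 4 rounds N down to a multiple of 4 (here 100)
for (int i = 0; i < N / 4 * 4; i += 4) {
    arr[i] = i;
    arr[i + 1] = i + 1;
    arr[i + 2] = i + 2;
    arr[i + 3] = i + 3;
}
// Tail case: handle the remaining elements (here 100, 101, 102) one at a time
for (int i = N / 4 * 4; i < N; i += 1) {
    arr[i] = i;
}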
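The same tail-case pattern applies when the main loop uses SIMD intrinsics, since each vector operation needs a full group of 4 ints. The sketch below sums an array of any length; the function name sum_simd_tail is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: SIMD sum that works for any n >= 0.
 * The vector loop covers the first n / 4 * 4 elements; a scalar
 * tail loop handles the 0 to 3 elements that are left over. */
int sum_simd_tail(const int *arr, int n) {
    __m128i sum_vec = _mm_setzero_si128();
    int limit = n / 4 * 4;  /* n rounded down to a multiple of 4 */
    for (int i = 0; i < limit; i += 4) {
        __m128i tmp = _mm_loadu_si128((const __m128i *) (arr + i));
        sum_vec = _mm_add_epi32(sum_vec, tmp);
    }
    /* Add up the 4 partial sums held in sum_vec. */
    int lanes[4];
    _mm_storeu_si128((__m128i *) lanes, sum_vec);
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    /* Tail case: add the remaining elements one at a time. */
    for (int i = limit; i < n; i += 1) {
        total += arr[i];
    }
    return total;
}
```

With N = 103 and arr[i] = i as in the example above, the vector loop covers elements 0 to 99 and the tail loop adds elements 100, 101, and 102.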