Overview

The chapter assumes prior knowledge of the C language and some understanding of assembly programming.
Optimizing code requires time and may reduce source code readability.
It is generally worthwhile to optimize functions that are frequently executed and crucial for performance.
Performance profiling tools, available in most ARM simulators, can help identify frequently executed functions.
Nonobvious optimizations should be documented using source code comments to aid maintainability.
A C compiler must translate each C function into assembly code that works for all possible inputs.
In practice, however, many input combinations are not possible or never occur.
An example will be presented to illustrate the challenges faced by the compiler.
The example focuses on the memclr function, which clears N bytes of memory at a given address.
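
A minimal sketch of such a function is shown here. The exact signature (a char pointer plus an int count) is an assumption made for illustration.

```c
/* Clears N bytes of memory starting at address data.
 * The signature is assumed for illustration. */
void memclr(char *data, int N)
{
    for (; N > 0; N--)
    {
        *data = 0;   /* store one zero byte */
        data++;      /* advance to the next byte */
    }
}
```

The compiler cannot assume that N is nonzero on entry, that data is word aligned, or that N is a multiple of four, so it has to generate byte-at-a-time code that is correct in every one of those cases.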

Basic C Data Types

C compiler datatype mappings:

C Data Type    Implementation
char           unsigned 8-bit byte
short          signed 16-bit halfword
int            signed 32-bit word
long           signed 32-bit word
long long      signed 64-bit double word

Local Variable Types

The following code checksums a data packet containing 64 words. It shows why you should avoid using char for local variables.

    int checksum_v1(int *data)
    {
        char i;
        int sum = 0;

        for (i = 0; i < 64; i++)
        {
            sum += data[i];
        }
        return sum;
    }
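
A char counter looks as though it saves register space, but ARM registers are 32 bits wide, so the compiler has to reduce i to the range 0 to 255 after every increment. The following sketch shows the same routine with a 32-bit counter; the name checksum_v2 is used here only for illustration.

```c
/* Same checksum, with a 32-bit loop counter. */
int checksum_v2(int *data)
{
    unsigned int i;   /* fills a whole register: no masking to 8 bits */
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
```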

The checksum_v4 code fixes all the problems we have discussed in this section. It uses int type local variables to avoid unnecessary casts. It increments the pointer data instead of using an index offset data[i].

    short checksum_v4(short *data)
    {
        unsigned int i;
        int sum = 0;

        for (i = 0; i < 64; i++)
        {
            sum += *(data++);
        }
        return (short)sum;
    }

The *(data++) operation translates to a single ARM instruction that loads the data and increments the data pointer. Of course you could write sum += *data; data++; or even *data++ instead if you prefer. In the compiled output, three instructions have been removed from the inner loop, saving three cycles per iteration compared to checksum_v3.

Function Argument Types

Converting local variables from types char or short to type int increases performance and reduces code size. The same holds for function arguments. Consider the following simple function, which adds two 16-bit values, halving the second, and returns a 16-bit sum:

    short add_v1(short a, short b)
    {
        return a + (b >> 1);
    }

Whatever the merits of different narrow and wide calling protocols, you can see that char or short type function arguments and return values introduce extra casts. These increase code size and decrease performance. It is more efficient to use the int type for function arguments and return values, even if you are only passing an 8-bit value.
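
As a rough sketch of that advice, the function can take and return int and leave any narrowing to the one place where a 16-bit value is actually stored. Both add_v2 and store_sum are hypothetical names introduced for this illustration.

```c
/* Hypothetical int-typed variant of add_v1: no casts on entry or exit. */
int add_v2(int a, int b)
{
    return a + (b >> 1);
}

/* Hypothetical caller: narrow to 16 bits once, when storing to memory. */
void store_sum(short *out, int a, int b)
{
    *out = (short)add_v2(a, b);
}
```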

Signed versus Unsigned Types

If your code uses addition, subtraction, and multiplication, then there is no performance difference between signed and unsigned operations. However, there is a difference when it comes to division. Consider the following short example that averages two integers:

    int average_v1(int a, int b)
    {
        return (a + b) / 2;
    }

It is more efficient to use unsigned types for divisions. The compiler converts unsigned power-of-two divisions directly to right shifts. For general divisions, the divide routine in the C library is faster for unsigned types.
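
The reason is rounding. Signed division in C rounds toward zero, so (-3)/2 is -1, whereas a plain arithmetic shift right rounds toward minus infinity; the compiler therefore has to add a correction for negative values. The sketch below is an unsigned variant, under the assumption that the inputs are never negative and that a + b does not overflow; the name average_v2 is used for illustration.

```c
/* Unsigned average. Assumes a + b does not overflow 32 bits. */
unsigned int average_v2(unsigned int a, unsigned int b)
{
    return (a + b) / 2;   /* can be compiled as a single logical shift right */
}
```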

The Efficient Use of C Types

■ For local variables held in registers, don’t use a char or short type unless 8-bit or 16-bit modular arithmetic is necessary. Use the signed or unsigned int types instead. Unsigned types are faster when you use divisions.

■ For array entries and global variables held in main memory, use the type with the smallest size possible to hold the required data. This saves memory footprint. The ARMv4 architecture is efficient at loading and storing all data widths provided you traverse arrays by incrementing the array pointer. Avoid using offsets from the base of the array with short type arrays, as LDRH does not support this.

■ Use explicit casts when reading array entries or global variables into local variables, or writing local variables out to array entries. The casts make it clear that for fast operation you are taking a narrow width type stored in memory and expanding it to a wider type in the registers. Switch on implicit narrowing cast warnings in the compiler to detect implicit casts.

■ Avoid implicit or explicit narrowing casts in expressions because they usually cost extra cycles. Casts on loads or stores are usually free because the load or store instruction performs the cast for you.

■ Avoid char and short types for function arguments or return values. Instead use the int type even if the range of the parameter is smaller. This prevents the compiler performing unnecessary casts.

C Looping Structures

Loops with a fixed number of iterations

What is the most efficient way to write a for loop on the ARM? Let’s return to our checksum example and look at the looping structure.

Here is the last version of the 64-word packet checksum routine. This shows how the compiler treats a loop with incrementing count i++.

    int checksum_v5(int *data)
    {
        unsigned int i;
        int sum = 0;

        for (i = 0; i < 64; i++)
        {
            sum += *(data++);
        }
        return sum;
    }

This is not efficient. The incrementing loop needs three instructions of overhead: an add to increment i, a compare to check whether i is less than 64, and a conditional branch. On the ARM, a loop should only use two instructions:

■ A subtract to decrement the loop counter, which also sets the condition code flags on the result

■ A conditional branch instruction

The key point is that the loop counter should count down to zero rather than counting up to some arbitrary limit. Then the comparison with zero is free since the result is stored in the condition flags. Since we are no longer using i as an array index, there is no problem in counting down rather than up.

This example shows the improvement if we switch to a decrementing loop rather than an incrementing loop.

    int checksum_v6(int *data)
    {
        unsigned int i;
        int sum = 0;

        for (i = 64; i != 0; i--)
        {
            sum += *(data++);
        }
        return sum;
    }

Loops using a variable number of iterations

Now suppose we want our checksum routine to handle packets of arbitrary size. We pass in a variable N giving the number of words in the data packet. Using the lessons from the last section, we count down until N = 0 and don’t require an extra loop counter i. The checksum_v7 example shows how the compiler handles a for loop with a variable number of iterations N.

    int checksum_v7(int *data, unsigned int N)
    {
        int sum = 0;

        for (; N != 0; N--)
        {
            sum += *(data++);
        }
        return sum;
    }

To implement a function efficiently, you need to

■ minimize the number of spilled variables

■ ensure that the most important and frequently accessed variables are stored in registers

In theory, the C compiler can assign 14 variables to registers without spillage. In practice, some compilers use a fixed register such as r12 for intermediate scratch working and do not assign variables to this register. Also, complex expressions require intermediate working registers to evaluate. Therefore, to ensure good assignment to registers, you should try to limit the internal loop of functions to using at most 12 local variables.

Summary: Efficient Register Allocation

■ Try to limit the number of local variables in the internal loop of functions to 12. The compiler should be able to allocate these to ARM registers.

■ You can guide the compiler as to which variables are important by ensuring these variables are used within the innermost loop.

FUNCTION CALLS

The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return values in ARM registers. The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and Thumb interworking as well.

The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3. Subsequent integer arguments are placed on the full descending stack, ascending in memory as in Figure 5.1. Function return integer values are passed in r0.

This description covers only integer or pointer arguments. Two-word arguments such as long long or double are passed in a pair of consecutive argument registers and returned in r0, r1. The compiler may pass structures in registers or by reference according to command line compiler options.

The next example illustrates the benefits of using a structure pointer. First we show a typical routine to insert N bytes from array data into a queue. We implement the queue using a cyclic buffer with start address Q_start (inclusive) and end address Q_end (exclusive).
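
A routine of this kind might look like the sketch below. The function name, the Q_ptr parameter, and the argument order are assumptions made for illustration, and the routine assumes N is at least 1.

```c
/* Inserts N bytes (N >= 1) from data into a cyclic buffer and returns
 * the updated queue pointer. Names and argument order are assumed. */
char *queue_bytes_v1(
    char *Q_start,     /* queue buffer start address (inclusive) */
    char *Q_end,       /* queue buffer end address (exclusive) */
    char *Q_ptr,       /* current insert position */
    char *data,        /* bytes to insert */
    unsigned int N)    /* number of bytes to insert */
{
    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = Q_start;   /* wrap around to the start of the buffer */
        }
    } while (--N);
    return Q_ptr;
}
```

With five arguments, the last one no longer fits in r0 to r3 and has to be passed on the stack.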

The following code creates a Queue structure and passes this to the function to reduce the number of function arguments.
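
A minimal sketch of this approach is given here. The field names, the function name queue_bytes_v2, and the requirement that N is at least 1 are assumptions made for illustration.

```c
/* Cyclic buffer state gathered into one structure (field names assumed). */
typedef struct {
    char *Q_start;   /* queue buffer start address (inclusive) */
    char *Q_end;     /* queue buffer end address (exclusive) */
    char *Q_ptr;     /* current insert position */
} Queue;

/* Inserts N bytes (N >= 1) from data into the queue. */
void queue_bytes_v2(Queue *queue, char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;   /* work on local copies in registers */
    char *Q_end = queue->Q_end;

    do
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = queue->Q_start;   /* wrap around */
        }
    } while (--N);
    queue->Q_ptr = Q_ptr;   /* write the position back once */
}
```

The function now takes three arguments, so all of them can be passed in registers.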

There are other ways of reducing function call overhead if your function is very small and corrupts few registers (uses few local variables). Put the C function in the same C file as the functions that will call it. The C compiler then knows the code generated for the callee function and can make optimizations in the caller function:

■ The caller function need not preserve registers that it can see the callee doesn’t corrupt. Therefore the caller function need not save all the ATPCS corruptible registers.

■ If the callee function is very small, then the compiler can inline the code in the caller function. This removes the function call overhead completely.

The function uint_to_hex converts a 32-bit unsigned integer into an array of eight hexadecimal digits. It uses a helper function nybble_to_hex, which converts a digit d in the range 0 to 15 to a hexadecimal digit.
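
The pair of functions could be written along the following lines. This is a sketch that assumes unsigned int is 32 bits wide (as it is on ARM); the parameter names and the use of static to keep the helper visible only within the same file are illustrative choices.

```c
/* Converts a digit d in the range 0 to 15 to an ASCII hexadecimal digit. */
static unsigned int nybble_to_hex(unsigned int d)
{
    if (d < 10)
    {
        return d + '0';
    }
    return d - 10 + 'A';
}

/* Writes the eight hexadecimal digits of in to out, most significant first.
 * Assumes unsigned int is 32 bits. No terminating NUL is written. */
void uint_to_hex(char *out, unsigned int in)
{
    unsigned int i;

    for (i = 8; i != 0; i--)
    {
        in = (in << 4) | (in >> 28);   /* rotate left by 4 bits */
        *(out++) = (char)nybble_to_hex(in & 15);
    }
}
```

Because the helper is small and defined in the same file, the compiler is free to inline it into the loop.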

POINTER ALIASING

Two pointers are said to alias when they point to the same address. If you write to one pointer, it will affect the value you read from the other pointer. In a function, the compiler often doesn’t know which pointers can alias and which pointers can’t. The compiler must be very pessimistic and assume that any write to a pointer may affect the value read from any other pointer, which can significantly reduce code efficiency.
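
A small sketch makes the problem concrete. The function and parameter names are hypothetical.

```c
/* Adds *step to two timers. The compiler must load *step twice,
 * because the write to *timer1 may have changed it. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}
```

If a caller passes the same address for timer1 and step, the second addition sees the updated value, so the compiler cannot keep *step in a register across the first store.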

Avoiding Pointer Aliasing

■ Do not rely on the compiler to eliminate common subexpressions involving memory accesses. Instead create new local variables to hold the expression. This ensures the expression is evaluated only once.

■ Avoid taking the address of local variables. The variable may be inefficient to access from then on.
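
The first rule can be sketched as follows: the memory value is read once into a local variable, and that local is then used for every later access. The structure and function names are hypothetical.

```c
typedef struct { int step; } State;             /* hypothetical types */
typedef struct { int timer1; int timer2; } Timers;

void timers_v3(State *state, Timers *timers)
{
    int step = state->step;   /* single load, held in a register */

    timers->timer1 += step;
    timers->timer2 += step;
}
```

The address of step is never taken, so the compiler knows that no pointer write can change it.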

PORTABILITY ISSUES

Here is a summary of the issues you may encounter when porting C code to the ARM.

■ The char type. On the ARM, char is unsigned rather than signed as for many other processors. A common problem concerns loops that use a char loop counter i and the continuation condition i ≥ 0; these become infinite loops. In this situation, armcc produces a warning of unsigned comparison with zero. You should either use a compiler option to make char signed or change loop counters to type int.

■ The int type. Some older architectures use a 16-bit int, which may cause problems when moving to ARM’s 32-bit int type, although this is rare nowadays. Note that expressions are promoted to an int type before evaluation. Therefore if i = -0x1000, the expression i == 0xF000 is true on a 16-bit machine but false on a 32-bit machine.

■ Unaligned data pointers. Some processors support the loading of short and int typed values from unaligned addresses. A C program may manipulate pointers directly so that they become unaligned, for example, by casting a char * to an int *. ARM architectures up to ARMv5TE do not support unaligned pointers. To detect them, run the program on an ARM with an alignment checking trap. For example, you can configure the ARM720T to data abort on an unaligned access.

■ Endian assumptions. C code may make assumptions about the endianness of a memory system, for example, by casting a char * to an int *. If you configure the ARM for the same endianness the code is expecting, then there is no issue. Otherwise, you must remove endian-dependent code sequences and replace them by endian-independent ones. See Section 5.9 for more details.

■ Function prototyping. The armcc compiler passes arguments narrow, that is, reduced to the range of the argument type. If functions are not prototyped correctly, then the function may return the wrong answer. Other compilers that pass arguments wide may give the correct answer even if the function prototype is incorrect. Always use ANSI prototypes.

■ Use of bit-fields. The layout of bits within a bit-field is implementation and endian dependent. If C code assumes that bits are laid out in a certain order, then the code is not portable.

■ Use of enumerations. Although enum is portable, different compilers allocate different numbers of bytes to an enum. The gcc compiler will always allocate four bytes to an enum type. The armcc compiler will only allocate one byte if the enum takes only eight-bit values. Therefore you can’t cross-link code and libraries between different compilers if you use enums in an API structure.

■ Inline assembly. Using inline assembly in C code reduces portability between architectures. You should separate any inline assembly into small inlined functions that can easily be replaced. It is also useful to supply reference, plain C implementations of these functions that can be used on other architectures, where this is possible.

■ The volatile keyword. Use the volatile keyword on the type definitions of ARM memory-mapped peripheral locations. This keyword prevents the compiler from optimizing away the memory access. It also ensures that the compiler generates a data access of the correct type. For example, if you define a memory location as a volatile short type, then the compiler will access it using 16-bit load and store instructions LDRSH and STRH.
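
As one illustration of the unaligned pointer and endian points above, a 32-bit value can be assembled from individual bytes instead of casting a char * to an int *. The sketch below assumes the data is stored in little-endian order; read_u32_le is a hypothetical helper name.

```c
/* Reads a 32-bit value stored in little-endian byte order.
 * Uses only byte loads, so it works at any alignment and gives the
 * same result whichever endianness the processor is configured for. */
unsigned int read_u32_le(const unsigned char *p)
{
    return (unsigned int)p[0]
         | ((unsigned int)p[1] << 8)
         | ((unsigned int)p[2] << 16)
         | ((unsigned int)p[3] << 24);
}
```

The cost is four byte loads instead of one word load, so a routine like this is best kept for data whose alignment or byte order is not under your control.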
