MODULE 3
C Compilers and Optimization
BY: GURURAJ S CSE DEPT
Overview
Basic C Data Types
C compiler datatype mappings.
C Data Type    Implementation
char           unsigned 8-bit byte
short          signed 16-bit halfword
int            signed 32-bit word
long           signed 32-bit word
long long      signed 64-bit double word
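These mappings describe ARM compilers; other targets can differ (char is often signed elsewhere, and long is 64 bits on many 64-bit systems). A minimal sketch for checking what a given compiler does, using only the standard <limits.h> header. The function names are hypothetical and not part of any library:

```c
#include <limits.h>

/* Returns 1 if plain char is unsigned on this compiler (as on ARM), else 0. */
int char_is_unsigned(void)
{
    return CHAR_MIN == 0;
}

/* Storage width in bits of each basic type on this compiler. */
unsigned int bits_in_short(void)     { return (unsigned int)(sizeof(short) * CHAR_BIT); }
unsigned int bits_in_int(void)       { return (unsigned int)(sizeof(int) * CHAR_BIT); }
unsigned int bits_in_long(void)      { return (unsigned int)(sizeof(long) * CHAR_BIT); }
unsigned int bits_in_long_long(void) { return (unsigned int)(sizeof(long long) * CHAR_BIT); }
```

On an ARM compiler that follows the table above, one would expect char_is_unsigned() to return 1 and the widths to be 16, 32, 32, and 64.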
Local Variable Types
The following code checksums a data packet containing 64 words. It shows why you should avoid using char for local variables.
int checksum_v1(int *data)
{
    char i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
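In checksum_v1 the counter i is a char, so the compiler has to keep i within the 8-bit range after every increment, which costs extra work inside the loop. A minimal sketch of the first fix, keeping the array index but declaring the counter as unsigned int. The name checksum_int_counter is hypothetical and is not one of the numbered versions in these notes:

```c
/* Same checksum as checksum_v1, but with a 32-bit loop counter, so no
   8-bit range reduction is needed after i++. */
int checksum_int_counter(const int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
```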
The checksum_v4 code fixes all the problems we have discussed in this section. It uses int type local variables to avoid unnecessary casts. It increments the pointer data instead of using an index offset data[i].
short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += *(data++);
    }
    return (short)sum;
}
The *(data++) operation translates to a single ARM instruction that loads the data and increments the data pointer. Of course you could write sum += *data; data++; or even *data++ instead if you prefer. In the compiled output, three instructions have been removed from the inside loop, saving three cycles per loop compared to checksum_v3.
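The checksum_v3 listing is not reproduced in these notes. For comparison, here is a sketch of what a version with a short accumulator and array indexing could look like. The name checksum_short_sum and the details are assumptions, not the original listing. The point to notice is that the narrowing cast sits inside the loop, whereas checksum_v4 narrows only once, at the return:

```c
/* Sketch of a version with a 16-bit accumulator and an array index.
   The addition is carried out as int, so the result has to be narrowed
   back to short on every iteration. */
short checksum_short_sum(const short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum = (short)(sum + data[i]);
    }
    return sum;
}
```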
Function Argument Types
Converting local variables from types char or short to type int increases performance and reduces code size. The same holds for function arguments. Consider the following simple function, which adds two 16-bit values, halving the second, and returns a 16-bit sum:
short add_v1(short a, short b)
{
    return a + (b >> 1);
}
Whatever the merits of different narrow and wide calling protocols, you can see that char or short type function arguments and return values introduce extra casts. These increase code size and decrease performance. It is more efficient to use the int type for function arguments and return values, even if you are only passing an 8-bit value.
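A minimal sketch of the same function written with int arguments and an int return value. The name add_wide is hypothetical. It assumes callers only pass values that fit in 16 bits and narrow the result themselves if they need to:

```c
/* Wide version of add_v1: int in, int out, so the compiler does not need
   to narrow the arguments on entry or the result on exit. */
int add_wide(int a, int b)
{
    return a + (b >> 1);
}
```

For example, add_wide(10, 4) evaluates 10 + 2.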
Signed versus Unsigned Types
If your code uses addition, subtraction, and multiplication, then there is no performance difference between signed and unsigned operations. However, there is a difference when it comes to division. Consider the following short example that averages two integers:
int average_v1(int a, int b)
{
    return (a + b) / 2;
}
It is more efficient to use unsigned types for divisions. The compiler converts unsigned power of two divisions directly to right shifts. For general divisions, the divide routine in the C library is faster for unsigned types.
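The extra cost in the signed case comes from rounding: C division truncates toward zero, whereas an arithmetic right shift rounds toward minus infinity, so negative values need a correction before the shift. The sketch below shows both cases. It assumes a 32-bit int and an arithmetic right shift for signed values; both hold for ARM compilers but are implementation-defined in C. The function names are hypothetical:

```c
/* Unsigned average: the division by 2 can become a single logical shift. */
unsigned int average_unsigned(unsigned int a, unsigned int b)
{
    return (a + b) / 2;
}

/* What a signed x / 2 has to do when it is implemented with shifts:
   add the sign bit (1 for negative x, 0 otherwise), then shift right. */
int signed_div2_by_shift(int x)
{
    int sign_bit = (int)((unsigned int)x >> 31);   /* assumes 32-bit int */
    return (x + sign_bit) >> 1;                    /* assumes arithmetic shift */
}
```

Under those assumptions, signed_div2_by_shift(x) gives the same result as x / 2, including for negative odd values such as -3.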
The Efficient Use of C Types
C Looping Structures
Loops with a fixed number of iterations
What is the most efficient way to write a for loop on the ARM? Let’s return to our checksum example and look at the looping structure.
Here is the last version of the 64-word packet checksum routine. This shows how the compiler treats a loop with incrementing count i++.
int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += *(data++);
    }
    return sum;
}
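This is not efficient: the incrementing loop needs three instructions to implement the loop structure (an add to increment i, a compare against 64, and a conditional branch). On the ARM, a loop should only use two instructions:
■ A subtract to decrement the loop counter, which also sets the condition code flags on the result
■ A conditional branch instruction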
The key point is that the loop counter should count down to zero rather than counting up to some arbitrary limit. Then the comparison with zero is free since the result is stored in the condition flags. Since we are no longer using i as an array index, there is no problem in counting down rather than up.
This example shows the improvement if we switch to a decrementing loop rather than an incrementing loop.
int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--)
    {
        sum += *(data++);
    }
    return sum;
}
Loops using a variable number of iterations
Now suppose we want our checksum routine to handle packets of arbitrary size. We pass in a variable N giving the number of words in the data packet. Using the lessons from the last section, we count down until N = 0 and don’t require an extra loop counter i. The checksum_v7 example shows how the compiler handles a for loop with a variable number of iterations N.
int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--)
    {
        sum += *(data++);
    }
    return sum;
}
REGISTER ALLOCATION
To implement a function efficiently, you need to
■ minimize the number of spilled variables
■ ensure that the most important and frequently accessed variables are stored in registers
In theory, the C compiler can assign 14 variables to registers without spillage. In practice, some compilers use a fixed register such as r12 for intermediate scratch working and do not assign variables to this register. Also, complex expressions require intermediate working registers to evaluate. Therefore, to ensure good assignment to registers, you should try to limit the internal loop of functions to using at most 12 local variables.
Summary: Efficient Register Allocation
■ Try to limit the number of local variables in the internal loop of functions to 12. The compiler should be able to allocate these to ARM registers.
■ You can guide the compiler as to which variables are important by ensuring these variables are used within the innermost loop.
FUNCTION CALLS
The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return values in ARM registers. The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and Thumb interworking as well.
The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3. Subsequent integer arguments are placed on the full descending stack, ascending in memory as in Figure 5.1. Function return integer values are passed in r0.
This description covers only integer or pointer arguments. Two-word arguments such as long long or double are passed in a pair of consecutive argument registers and returned in r0, r1. The compiler may pass structures in registers or by reference according to command line compiler options.
The next example illustrates the benefits of using a structure pointer. First we show a typical routine to insert N bytes from array data into a queue. We implement the queue using a cyclic buffer with start address Q_start (inclusive) and end address Q_end (exclusive).
The following code creates a Queue structure and passes this to the function to reduce the number of function arguments.
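The listings for this example are not reproduced in these notes. The following is a sketch of what the two versions might look like, assuming the queue stores bytes and tracks its current insert position in a pointer Q_ptr. The names queue_bytes_args and queue_bytes_struct, and the layout of the Queue structure, are assumptions. The first version takes five arguments, so the fifth does not fit in r0 to r3 and goes on the stack; the second takes three, all of which fit in registers.

```c
/* Version with separate arguments: five in total, one more than fits in r0-r3. */
char *queue_bytes_args(char *Q_start, char *Q_end, char *Q_ptr,
                       const char *data, unsigned int N)
{
    for (; N != 0; N--)
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = Q_start;        /* wrap around the cyclic buffer */
        }
    }
    return Q_ptr;                   /* new insert position */
}

/* Queue state gathered into one structure (field names are assumptions). */
typedef struct
{
    char *Q_start;  /* first byte of the buffer (inclusive) */
    char *Q_end;    /* one past the last byte (exclusive) */
    char *Q_ptr;    /* next position to write */
} Queue;

/* Version with a structure pointer: three arguments, all passed in registers. */
void queue_bytes_struct(Queue *queue, const char *data, unsigned int N)
{
    char *Q_ptr = queue->Q_ptr;     /* local copies avoid repeated loads */
    char *Q_end = queue->Q_end;

    for (; N != 0; N--)
    {
        *(Q_ptr++) = *(data++);
        if (Q_ptr == Q_end)
        {
            Q_ptr = queue->Q_start; /* wrap around the cyclic buffer */
        }
    }
    queue->Q_ptr = Q_ptr;           /* write the position back once */
}
```

Both versions behave the same way; the difference is only in how the queue state reaches the function.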
There are other ways of reducing function call overhead if your function is very small and corrupts few registers (uses few local variables). Put the C function in the same C file as the functions that will call it. The C compiler then knows the code generated for the callee function and can make optimizations in the caller function.
The function uint_to_hex converts a 32-bit unsigned integer into an array of eight hexadecimal digits. It uses a helper function nybble_to_hex, which converts a digit d in the range 0 to 15 to a hexadecimal digit.
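The listing for these two functions is not included in these notes. Below is a sketch consistent with the description, assuming uppercase digits, most significant digit first, and no terminating NUL character (all three are assumptions). Declaring the helper static in the same file is one way to make the callee visible to the compiler at the call site:

```c
#include <stdint.h>

/* Convert a value d in the range 0..15 to an ASCII hexadecimal digit. */
static unsigned int nybble_to_hex(unsigned int d)
{
    if (d < 10)
    {
        return d + '0';
    }
    return d - 10 + 'A';
}

/* Write the eight hexadecimal digits of in to out[0..7],
   most significant digit first. No terminator is written. */
void uint_to_hex(char *out, uint32_t in)
{
    unsigned int i;

    for (i = 8; i != 0; i--)
    {
        in = (in << 4) | (in >> 28);    /* rotate left by one hex digit */
        *(out++) = (char)nybble_to_hex(in & 15);
    }
}
```

The rotate brings the next most significant digit into the low four bits on each pass, so the loop can count down while still emitting digits from the top.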
POINTER ALIASING
Avoiding Pointer Aliasing
■ Do not rely on the compiler to eliminate common subexpressions involving memory accesses. Instead create new local variables to hold the expression. This ensures the expression is evaluated only once.
■ Avoid taking the address of local variables. The variable may be inefficient to access from then on.
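As an illustration of the first point, consider a routine that adds the same step to two timers through pointers. The compiler cannot know whether step points to the same memory as timer1, so in the first version it has to read *step again after the first store. This is a sketch; the function names are hypothetical:

```c
/* *step is read twice: the store to *timer1 might have changed it. */
void timers_reload(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* *step is read once into a local variable, which can stay in a register. */
void timers_local(int *timer1, int *timer2, int *step)
{
    int s = *step;

    *timer1 += s;
    *timer2 += s;
}
```

The two versions give the same result whenever the pointers refer to distinct objects. They differ only when step aliases timer1, which is exactly the case the compiler has to allow for in the first version.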
PORTABILITY ISSUES
Here is a summary of the issues you may encounter when porting C code to the ARM.
■ The char type. On the ARM, char is unsigned rather than signed as for many other processors. A common problem concerns loops that use a char loop counter i and the continuation condition i ≥ 0; these become infinite loops. In this situation, armcc produces a warning of unsigned comparison with zero. You should either use a compiler option to make char signed or change loop counters to type int.
■ The int type. Some older architectures use a 16-bit int, which may cause problems when moving to ARM’s 32-bit int type, although this is rare nowadays. Note that expressions are promoted to an int type before evaluation. Therefore if i = -0x1000, the expression i == 0xF000 is true on a 16-bit machine but false on a 32-bit machine.
■ Unaligned data pointers. Some processors support the loading of short and int typed values from unaligned addresses. A C program may manipulate pointers directly so that they become unaligned, for example, by casting a char * to an int *. ARM architectures up to ARMv5TE do not support unaligned pointers. To detect them, run the program on an ARM with an alignment checking trap. For example, you can configure the ARM720T to data abort on an unaligned access.
■ Endian assumptions. C code may make assumptions about the endianness of a memory system, for example, by casting a char * to an int *. If you configure the ARM for the same endianness the code is expecting, then there is no issue. Otherwise, you must remove endian-dependent code sequences and replace them by endian-independent ones. See Section 5.9 for more details.
■ Function prototyping. The armcc compiler passes arguments narrow, that is, reduced to the range of the argument type. If functions are not prototyped correctly, then the function may return the wrong answer. Other compilers that pass arguments wide may give the correct answer even if the function prototype is incorrect. Always use ANSI prototypes.
■ Use of bit-fields. The layout of bits within a bit-field is implementation and endian dependent. If C code assumes that bits are laid out in a certain order, then the code is not portable.
■ Use of enumerations. Although enum is portable, different compilers allocate different numbers of bytes to an enum. The gcc compiler will always allocate four bytes to an enum type. The armcc compiler will only allocate one byte if the enum takes only eight-bit values. Therefore you can’t cross-link code and libraries between different compilers if you use enums in an API structure.
■ Inline assembly. Using inline assembly in C code reduces portability between architectures. You should separate any inline assembly into small inlined functions that can easily be replaced. It is also useful to supply reference, plain C implementations of these functions that can be used on other architectures, where this is possible.
■ The volatile keyword. Use the volatile keyword on the type definitions of ARM memory-mapped peripheral locations. This keyword prevents the compiler from optimizing away the memory access. It also ensures that the compiler generates a data access of the correct type. For example, if you define a memory location as a volatile short type, then the compiler will access it using 16-bit load and store instructions LDRSH and STRH.
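Two of the points above lend themselves to a short sketch. The first function counts down with an int counter, so the test i >= 0 can become false whether or not char is signed. The second reads a 32-bit little-endian value one byte at a time, which works from any address and on either endianness, unlike casting a char * to an int *. The function names are hypothetical, and the choice of little-endian byte order for the stored value is an assumption made for the example.

```c
#include <stdint.h>

/* Sum the bytes of buf from index n-1 down to 0. With an unsigned char
   counter the condition i >= 0 would always be true; an int counter
   lets the loop terminate. */
unsigned int sum_bytes_reverse(const unsigned char *buf, int n)
{
    unsigned int sum = 0;
    int i;

    for (i = n - 1; i >= 0; i--)
    {
        sum += buf[i];
    }
    return sum;
}

/* Read a 32-bit value stored in little-endian byte order at p.
   Only byte accesses are used, so p need not be word aligned. */
uint32_t load_le32(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```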
MODULE 3 ENDS