8
Code Optimization
239
•
Optimization techniques for code efficiency
•
Intrinsic C functions
•
Parallel instructions
•
Word-wide data access
•
Software pipelining
In this chapter we illustrate several schemes that can be used to optimize and
drastically reduce the execution time of your code. These techniques include the
use of instructions in parallel, word-wide data, intrinsic functions, and software
pipelining.
8.1 INTRODUCTION
Begin at a workstation level; for example, use C code on a PC. While code written
in assembly (ASM) is processor-specific, C code can readily be ported from one plat-
form to another. However, optimized ASM code runs faster than C and requires
less memory space.
Before optimizing, make sure that the code is functional and yields correct
results. After optimizing, the code can be so reorganized and resequenced that the
optimization process makes it difficult to follow. One needs to realize that if a C-
coded algorithm is functional and its execution speed is satisfactory, there is no need
to optimize further.
After testing the functionality of your C code, transport it to the C6x platform.
A floating-point implementation can be modeled first, then converted to a fixed-
point implementation if desired. If the performance of the code is not adequate, use
DSP Applications Using C and the TMS320C6x DSK. Rulph Chassaing
further optimizations and generates an ASM file.
The options:
1. –o0 optimizes the use of registers.
2. –o1 performs a local optimization in addition to optimizations performed by
the previous option: –o0.
3. –o2 performs a global optimization in addition to the optimizations per-
formed by the previous options: –o0 and –o1.
240
Code Optimization
4. –o3 performs a file optimization in addition to the optimizations performed
by the three previous options: –o0, –o1, and –o2.
The options –o2 and –o3 attempt to do software optimization.
8.2.2 Intrinsic C Functions
There are a number of available C intrinsic functions that can be used to increase
the efficiency of code (see also Example 3.1):
1. int_mpy() has the equivalent ASM instruction MPY, which multiplies the
16 LSBs of a number by the 16 LSBs of another number.
2. int_mpyh() has the equivalent ASM instruction MPYH, which multiplies the
16 MSBs of a number by the 16 MSBs of another number.
3. int_mpylh() has the equivalent ASM instruction MPYLH, which multiplies
the 16 LSBs of a number by the 16 MSBs of another number.
4. int_mpyhl() has the equivalent instruction MPYHL, which multiplies the
16 MSBs of a number by the 16 LSBs of another number.
5. void_nassert(int) generates no code. It tells the compiler that the
expression declared with the assert function is true. This conveys information
to the compiler about alignment of pointers and arrays and of valid opti-
mization schemes, such as word-wide optimization.
6. uint_lo(double) and uint_hi(double) obtain the low and high 32 bits
of a double word, respectively (available on C67x or C64x).
8.3 PROCEDURE FOR CODE OPTIMIZATION
.
.
.
}
242
Code Optimization
//twosum.c Sum of Products with separate accumulation of even/odd terms
//with word-wide data for fixed-point implementation
int dotp (short a[ ], short b [ ])
{
int suml, sumh, sum, i;
suml = 0;
sumh = 0;
sum = 0;
for (i = 0; i < 200; i +=2)
{
suml += a[i] * b[i]; //sum of products of even terms
sumh += a[i + 1] * b[i + 1]; //sum of products of odd terms
}
sum = suml + sumh; //final sum of odd and even terms
return (sum);
}
FIGURE 8.1. C code for sum of products using word-wide data access for separate accu-
mulation of even and odd sum of products terms (twosum.c).
Example 8.2: Separate Sum of Products with C Intrinsic Functions
Using C Code (dotpintrinsic)
Figure 8.2 shows the C code dotpintrinsic.c to illustrate the separate sum of
products using two C intrinsic functions, _mpy and _mpyh, which have the
equivalent ASM instructions MPY and MPYH, respectively. Whereas the even and odd
sum of products are calculated within the loop, the final summation is taken outside
FIGURE 8.3. Separate sum of products using linear ASM code for fixed-point implemen-
tation (twosumlasmfix.sa).
Example 8.4: Sum of Products with Double-Word Load for Floating-Point
Implementation Using Linear ASM Code (twosumlasmfloat)
Figure 8.4 shows the linear ASM code twosumlasmfloat.sa to obtain two sepa-
rate sums of products for a floating-point implementation using linear ASM code.
The double-word load instruction LDDW loads a 64-bit data value and stores it in
a pair of registers. Each single-precision multiply instruction MPYSP performs a
32 ¥ 32 multiplication. The sums of products of the lower and upper 32 bits are
performed to yield a sum of both even and odd terms as 32 bits.
Example 8.5: Dot Product with No Parallel Instructions for Fixed-Point
Implementation Using ASM Code (dotpnp)
Figure 8.5 shows the ASM code dotpnp.asm for the dot product with no instruc-
tions in parallel for a fixed-point implementation. A fixed-point implementation can
244
Code Optimization
;twosumlasmfloat.sa Sum of products. Separate accum of even/odd terms
;Using double-word load LDDW for floating-point implementation
loop: LDDW *aptr++, ai1:ai0 ;64-bit word ai0 and ai1
LDDW *bptr++, bi1:bi0 ;64-bit word bi0 and bi1
MPYSP ai0, bi0, prodl ;lower 32-bit product
MPYSP ai1, bi1, prodh ;hiagher 32-bit product
ADDSP prodl, suml, suml ;accum 32-bit even terms
ADDSP prodh, sumh, sumh ;accum 32-bit odd terms
SUB count, 1, count ;decrement count
[count] B loop ;branch to loop
FIGURE 8.4. Separate sum of products with LDDW using linear ASM code for floating-point
implementation (twosumlasmfloat.sa).
;dotpnp.asm ASM Code with no-parallel instructions for fixed-point
MVK .S1 200, A1 ;count into A1
after the ADD instruction. Using parallel instructions, the instructions within the loop
now consume eight cycles per iteration, to yield 8 ¥ 200 = 1600 cycles.
Example 8.7: Two Sums of Products with Word-Wide (32-bit) Data for
Fixed-Point Implementation Using ASM Code (twosumfix)
Figure 8.7 shows the ASM code twosumfix.asm, which calculates two separate
sums of products using word-wide access of data for a fixed-point implementation.
The loop count is initialized to 100 (not 200) since two sums of products are obtained
Programming Examples Using Code Optimization Techniques
245
;dotpp.asm ASM Code with parallel instructions for fixed-point
MVK .S1 200, A1 ;count into A1
|| ZERO .L1 A7 ;init A7 for accum
LOOP LDH .D1 *A4++,A2 ;A2=16-bit data pointed by A4
|| LDH .D2 *B4++,B2 ;B2=16-bit data pointed by B4
SUB .S1 A1,1,A1 ;decrement count
[A1] B .S1 LOOP ;branch to LOOP (after ADD)
NOP 2 ;delay slots for LDH and B
MPY .M1x A2,B2,A6 ;product in A6
NOP ;1 delay slot for MPY
ADD .L1 A6,A7,A7 ;accum in A7,then branch
;branch occurs here
FIGURE 8.6. ASM code with parallel instructions for fixed-point implementation
(dotpp.asm).
per iteration. The instruction LDW loads a word or 32-bit data. The multiply instruc-
tion MPY finds the product of the lower 16 ¥ 16 data, and MPYH finds the product of
the upper 16 ¥ 16 data. The two ADD instructions accumulate separately the even
and odd sums of products. Note that an additional ADD instruction is needed outside
the loop to accumulate A7 and B7. The instructions within the loop consume eight
cycles, now using 100 iterations (not 200), to yield 8 ¥ 100 = 800 cycles.
Example 8.8: Dot Product with No Parallel Instructions for Floating-Point
;branch occurs here
FIGURE 8.7. ASM code for two sums of products with 32-bit data for fixed-point imple-
mentation (twosumfix.asm).