posted on 2014-02-21, 12:35authored byMounira Bachir, Sid-Ahmed-Ali Touati, Frederic Brault, David Gregg, Albert Cohen
We address the problem of generating compact code from software pipelined loops. Although software pipelining
is a powerful technique to extract fine-grain parallelism, it generates lifetime intervals spanning multiple loop
iterations. These intervals require periodic register allocation (also called variable expansion), which in turn yields a
code generation challenge. We are looking for the minimal unrolling factor enabling the periodic register allocation
of software pipelined kernels. This challenge is generally addressed through one of: (1) hardware support in the form
of rotating register files, which solve the unrolling problem but are expensive in hardware; (2) register renaming by
inserting register moves, which increase the number of operations in the loop, and may damage the schedule of the
software pipeline and reduce throughput; (3) post-pass loop unrolling that does not compromise throughput but often
leads to impractical code growth. The latter approach relies on the proof that MAXLIVE registers (maximal number
of values simultaneously alive) are sufficient for periodic register allocation [10, 13]. However, the best existing
heuristic for controlling this code growth—modulo variable expansion [16] —may not apply the correct amount of
loop unrolling to guarantee that MAXLIVE registers are enough, which may result in register spills [10].
This paper presents our research results on the open problem of minimal loop unrolling, allowing a software-only
code generation that does not trade the optimality of the initiation interval (II) for the compactness of the generated
code. Our novel idea is to use the remaining free registers after periodic register allocation to relax the constraints on
register reuse.
The problem of minimal loop unrolling arises either before or after software pipelining, either with a single or
with multiple register types (classes). We provide a formal problem definition for each scenario, and we propose and
study a dedicated algorithm for each problem.
Our solutions are implemented within an industrial-strength compiler for a VLIW embedded processor from
STMicroelectronics, and validated on multiple benchmarks suites.
History
Publication
International Journal of Parallel Programming;41(1), pp. 1-58
Publisher
Springer
Note
peer-reviewed
Other Funding information
ANR MOPUCE project, DIGITEO foundation
Rights
The original publication is available at www.springerlink.com