4.7 Article

Efficient utilization of SIMD extensions

Journal

PROCEEDINGS OF THE IEEE
Volume 93, Issue 2, Pages 409-425

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/JPROC.2004.840491

Keywords

autontatic vectorization; digital signal processing (DSP); fast Fourier transform (FFT); short vector single instruction; multiple data (SIMD); symbolic vectorization

Ask authors/readers for more resources

This paper targets automatic perfomance tuning of numerical kernels in the presence of multilayered memory hierarchies and single-instruction, multiple-data (SIMD) parallelisin. The studied SIMD instruction set extensions include Intel's SSE family, AMD's 3DNow!, Motorola's AltiVec, and IBM's BlueGene/L SIMD instructions. FFTW ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software pack-ages generate and optimize ANSI C code and feed it into the target machine's general-purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and, thus, inhibits satisfactory performance on processors featuring short vector extensions. this paper describes special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes; 1) symbolic vectorization of digital signal processing transforms; 2) straight-line code vectorization for numerical kernels; and 3) compiler back ends for straight-line code with vector,instructions. Methods from all three areas were combined with FFTW. SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speedups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available