4.6 Article

The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale

Journal

SIAM REVIEW
Volume 60, Issue 4, Pages 808-865

Publisher

SIAM PUBLICATIONS
DOI: 10.1137/17M1117732

Keywords

singular value decomposition; SVD; bidiagonal matrix; QR iteration; divide and conquer; bisection; MRRR; Jacobi method; Kogbetliantz method; Hestenes method

Funding

  1. National Science Foundation [1339822]
  2. Exascale Computing Project [17-SC-20-SC]
  3. NVIDIA
  4. Intel
  5. Office of Advanced Cyberinfrastructure (OAC)
  6. Direct For Computer & Info Scie & Enginr [1339822] Funding Source: National Science Foundation

Ask authors/readers for more resources

The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in the EIS-PACK library, and was later more efficiently implemented in the LINPACK library, targeting contemporary vector machines. To address cache-based memory hierarchies, the SVD algorithm was reformulated to use Level 3 BLAS in the LAPACK library. To address new architectures, ScaLAPACK was introduced to take advantage of distributed computing, and MAGMA was developed for accelerators such as GPUs. Algorithmically, the divide and conquer and MRRR algorithms were developed to reduce the number of operations. Still, these methods remained memory bound, so two-stage algorithms were developed to reduce memory operations and increase the computational intensity, with efficient implementations in PLASMA, DPLASMA, and MAGMA. Jacobi methods started with the two-sided method of Kogbetliantz and the one-sided method of Hestenes. They have likewise had many developments, including parallel and block versions and preconditioning to improve convergence. In this paper, we investigate the impact of these changes by testing various historical and current implementations on a common, modern multicore machine and a distributed computing platform. We show that algorithmic and implementation improvements have increased the speed of the SVD by several orders of magnitude, while using up to 40 times less energy.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.6
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available