4.5 Article

Parallel QR Factorization of Block Low-rank Matrices

Journal

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/3538647

Keywords

Block low-rank matrix; QR factorization; householder reflections; task-based execution

Funding

  1. JSPS KAKENHI [JP20K20624, JP21H03447]
  2. Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures
  3. High Performance Computing Infrastructure in Japan [jh210024-NAHI]

Ask authors/readers for more resources

The study introduces two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices, achieving arithmetic complexities of O(mn) and O(mn(1.5)) respectively. The algorithms are compared in terms of parallel task-based execution and performance, showing significant speedups compared to dense QR methods. The study also demonstrates robustness to ill conditioning and better orthogonal factors compared to existing methods.
We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of O(mn), while the tiled BLR-QR exhibits O(mn(1.5)) complexity. However, the tiled BLR-QR has finer task granularity that allows parallel task-based execution on shared memory systems. We compare the block-column-wise BLR-QR using fork-join parallelism with tiled BLR-QR using task-based parallelism. We also compare these two implementations of Householder BLR-QR with a block-column-wise Modified Gram-Schmidt (MGS) BLR-QR using fork-join parallelism and a state-of-the-art vendor-optimized dense Householder QR in Intel MKL. For a matrix of size 131k x 65k, all BLR methods are more than an order of magnitude faster than the dense QR in MKL. Our methods are also robust to ill conditioning and produce better orthogonal factors than the existing MGS-based method. On a CPU with 64 cores, our parallel tiled Householder and block-column-wise Householder algorithms show a speedup of 50 and 37 times, respectively.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.5
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available