3.8 Proceedings Paper

FAFNIR: Accelerating Sparse Gathering by Using Efficient Near-Memory Intelligent Reduction

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/HPCA51647.2021.00080

Keywords

-

Funding

  1. Booz Allen Hamilton Inc.
  2. Laboratory for Physical Sciences (LPS)


The study introduces Fafnir, an efficient solution for sparse data gathering that minimizes data movement with an intelligent reduction tree in memory and maximizes parallel memory accesses through near-data processing. Because Fafnir does not rely on spatial locality, it offers higher efficiency and faster performance than existing NDP proposals.
Memory-bound sparse gathering, caused by irregular random memory accesses, has become an obstacle in several on-demand applications such as embedding lookup in recommendation systems. To reduce the amount of data movement, and thereby better utilize memory bandwidth, previous studies have proposed near-data processing (NDP) solutions. The issue with prior work, however, is that it either minimizes data movement effectively at the cost of limited memory parallelism, or improves memory parallelism (up to a degree) but cannot successfully decrease data movement, because it relies on spatial locality (an optimistic assumption) to utilize NDP. More importantly, neither approach proposes a solution for gathering data from random memory addresses; they merely offload operations to NDP. We propose an effective solution for sparse gathering: an efficient near-memory intelligent reduction (Fafnir) tree, whose leaves are all the ranks in a memory system and whose nodes gradually apply reduction operations as data is gathered from any rank. Because the tree spans the whole memory system, Fafnir does not rely on spatial locality; it therefore minimizes data movement by performing entire operations at NDP and fully benefits from parallel memory accesses through parallel processing at NDP. Further, Fafnir offers other advantages such as requiring fewer connections (because of the tree topology), eliminating redundant memory accesses without costly and less effective caching mechanisms, and being applicable to other domains of sparse problems such as scientific computing and graph analytics. To evaluate Fafnir, we implement it on an XCVU9P Xilinx FPGA and in a 7 nm ASAP7 ASIC process. Fafnir looks up embedding tables up to 21.3x more quickly than the state-of-the-art NDP proposal. Furthermore, the generic architecture of Fafnir allows running classic sparse problems on the same 1.2 mm² hardware up to 4.6x more quickly than the state of the art.
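
To make the reduction-tree idea concrete, the following is a minimal software sketch of a Fafnir-style gather-and-reduce, assuming an embedding table sharded row-wise across memory ranks; every name (NUM_RANKS, leaf_partial_sum, tree_reduce, etc.) is hypothetical and the sketch only models the dataflow, not the paper's hardware design.

import numpy as np

NUM_RANKS = 8         # leaves of the reduction tree (memory ranks)
TABLE_ROWS = 1 << 12  # embedding-table rows spread across ranks
DIM = 64              # embedding vector width

# Embedding table sharded row-wise: row r is assumed to live in rank r % NUM_RANKS.
rng = np.random.default_rng(0)
table = rng.standard_normal((TABLE_ROWS, DIM)).astype(np.float32)

def rank_of(row: int) -> int:
    return row % NUM_RANKS

def leaf_partial_sum(rank: int, indices: list[int]) -> np.ndarray:
    """Near-rank reduction: sum only the rows that reside in this rank."""
    local = [i for i in indices if rank_of(i) == rank]
    if not local:
        return np.zeros(DIM, dtype=np.float32)
    return table[local].sum(axis=0)

def tree_reduce(partials: list[np.ndarray]) -> np.ndarray:
    """Combine partial sums pairwise, level by level, as a tree of adders would."""
    level = partials
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def gather_reduce(indices: list[int]) -> np.ndarray:
    partials = [leaf_partial_sum(r, indices) for r in range(NUM_RANKS)]
    return tree_reduce(partials)

# One embedding-bag lookup: the sum of a sparse set of table rows.
lookup = [3, 17, 42, 1023, 2048]
result = gather_reduce(lookup)
assert np.allclose(result, table[lookup].sum(axis=0))

In this model, each rank contributes only a partial sum of its local rows and the partials merge on the way up the tree, so no individual gathered vector has to cross the memory bus before it is reduced; that is the intuition behind why the approach avoids depending on spatial locality.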

