Article

ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric Systems

Journal

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING
Volume 11, Issue 2, Pages 388-403

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TETC.2022.3226132

Keywords

Near-data processing; inter-segment data movement; application partitioning

Partitioning applications between near-data processing (NDP) and host CPU cores causes inter-segment data movement overhead. ALP, a programmer-transparent technique, mitigates this overhead by proactively and accurately transferring the required data between segments, based on the observation that the instructions generating inter-segment data stay the same across executions of a program. Evaluation on a wide range of workloads demonstrates significant speedup over CPU-only and NDP-only executions.
Partitioning applications between near-data processing (NDP) and host CPU cores causes inter-segment data movement overhead, which arises when data generated by one segment (e.g., instructions, functions) is used in other consecutive segments. Prior works take two approaches to this problem. The first approach maps segments to NDP or host cores based on the properties of each segment, neglecting the inter-segment data movement overhead. The second approach partitions applications based on the overall memory bandwidth savings, and does not offload a segment to its best-fitting core if doing so incurs high inter-segment data movement. We show that 1) mapping each segment to its best-fitting core can ideally provide substantial benefits, and 2) inter-segment data movement reduces this benefit significantly. We introduce ALP, a new programmer-transparent technique to alleviate the inter-segment data movement overhead between host and memory in NDP systems. ALP proactively and accurately transfers the required data between the segments based on the key observation that the instructions that generate the inter-segment data stay the same across different executions of a program. ALP uses a compiler pass to identify these instructions and uses specialized hardware to transfer their produced data at runtime. We evaluate ALP across a wide range of workloads and demonstrate 54.3% and 45.4% average speedup over CPU-only and NDP-only executions, respectively.
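
The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not ALP and is not taken from the paper: the `Segment` fields, the helper functions, and all timing numbers are hypothetical and chosen only to show how per-segment best-fit mapping can look attractive in the ideal case yet lose its benefit once inter-segment data movement is charged.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # All values are hypothetical, in arbitrary time units.
    name: str
    cpu_time: float      # execution time on the host CPU cores
    ndp_time: float      # execution time on the NDP cores
    out_transfer: float  # cost of moving this segment's output across the host/NDP boundary


def best_fit(segments):
    """Map each segment to whichever side runs it faster, ignoring data movement."""
    return ["cpu" if s.cpu_time <= s.ndp_time else "ndp" for s in segments]


def ideal_time(segments):
    """Best-fit execution time if inter-segment data movement were free."""
    return sum(min(s.cpu_time, s.ndp_time) for s in segments)


def total_time(segments, placement):
    """Execution time plus a transfer cost whenever consecutive segments run on different sides."""
    total = 0.0
    for i, (seg, side) in enumerate(zip(segments, placement)):
        total += seg.cpu_time if side == "cpu" else seg.ndp_time
        if i + 1 < len(segments) and placement[i + 1] != side:
            total += seg.out_transfer
    return total


# Hypothetical three-segment application: B is memory-bound, A and C are compute-bound.
segments = [
    Segment("A", cpu_time=4.0, ndp_time=10.0, out_transfer=5.0),
    Segment("B", cpu_time=9.0, ndp_time=3.0, out_transfer=5.0),
    Segment("C", cpu_time=2.0, ndp_time=6.0, out_transfer=5.0),
]

placement = best_fit(segments)
print("CPU-only:            ", total_time(segments, ["cpu"] * len(segments)))
print("NDP-only:            ", total_time(segments, ["ndp"] * len(segments)))
print("Best-fit (ideal):    ", ideal_time(segments))
print("Best-fit (with moves):", total_time(segments, placement))
```

With these made-up numbers, best-fit mapping is the fastest option when data movement is free, but once the two boundary crossings are charged it is no better than NDP-only and slower than CPU-only. That gap is the overhead the abstract says ALP aims to alleviate.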
