期刊
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
卷 97, 期 -, 页码 78-95出版社
ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.jpdc.2016.06.005
关键词
Residue number systems; RNS; Reconstruction coefficient; Partial reconstruction; Reduced precision; Fast sign detection
We identify a canonical parameter in the Chinese Remainder Theorem (CRT) and call it the Reconstruction Coefficient, (denoted by R-c); and introduce the notions of Partial and Full Reconstruction. If the R-c can be determined efficiently, then arithmetic operations that are (relatively) harder to realize in RNS; including Sign Detection, Base change/extension and Scaling or division by a constant can also be implemented efficiently. This paper therefore focuses on and presents two distinct methods to efficiently evaluate the R-c at long wordlengths. A straightforward application of these methods leads to ultra-fast sign-detection. An independent contribution of this paper is to illustrate non-trivial trade-offs between run-time computation vs. pre-computation and look-up. We show a simple method to select the moduli which leads to both the (i) number of RNS channels n; as well as (ii) the largest channel modulus m(n), satisfying {0(n), O(m(n))} (sic) N equivalent to the full-precision bit-length. The net result is that for many canonical operations; exhaustive look-up covering all possible input values is feasible even at long cryptographic bit-lengths N. Under fairly general and realistic assumptions about the capabilities of current hardware, the memory needed for exhaustive look-up tables is shown to be bounded by a low degree polynomial of n. Moreover, both methods to compute R-c can achieve a delay of O(lg n) in a RN system with n channels. To the best of our knowledge, no other method published to date has shown a path to achieve that lower bound on the execution delay. Further, small values of channel moduli make it ideal to implement each individual RNS channel on a simple core in a many-core processor or as a distributed node, and our algorithms require a limited number of inter-channel communications, averaging O(n). Results from a multi-core GPU implementation corroborate the theory. (C) 2016 Elsevier Inc. All rights reserved.
作者
我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。
推荐
暂无数据