4.5 Article

New distributed algorithms for fast sign detection in residue number systems (RNS)

期刊

出版社

ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.jpdc.2016.06.005

关键词

Residue number systems; RNS; Reconstruction coefficient; Partial reconstruction; Reduced precision; Fast sign detection

向作者/读者索取更多资源

We identify a canonical parameter in the Chinese Remainder Theorem (CRT) and call it the Reconstruction Coefficient, (denoted by R-c); and introduce the notions of Partial and Full Reconstruction. If the R-c can be determined efficiently, then arithmetic operations that are (relatively) harder to realize in RNS; including Sign Detection, Base change/extension and Scaling or division by a constant can also be implemented efficiently. This paper therefore focuses on and presents two distinct methods to efficiently evaluate the R-c at long wordlengths. A straightforward application of these methods leads to ultra-fast sign-detection. An independent contribution of this paper is to illustrate non-trivial trade-offs between run-time computation vs. pre-computation and look-up. We show a simple method to select the moduli which leads to both the (i) number of RNS channels n; as well as (ii) the largest channel modulus m(n), satisfying {0(n), O(m(n))} (sic) N equivalent to the full-precision bit-length. The net result is that for many canonical operations; exhaustive look-up covering all possible input values is feasible even at long cryptographic bit-lengths N. Under fairly general and realistic assumptions about the capabilities of current hardware, the memory needed for exhaustive look-up tables is shown to be bounded by a low degree polynomial of n. Moreover, both methods to compute R-c can achieve a delay of O(lg n) in a RN system with n channels. To the best of our knowledge, no other method published to date has shown a path to achieve that lower bound on the execution delay. Further, small values of channel moduli make it ideal to implement each individual RNS channel on a simple core in a many-core processor or as a distributed node, and our algorithms require a limited number of inter-channel communications, averaging O(n). Results from a multi-core GPU implementation corroborate the theory. (C) 2016 Elsevier Inc. All rights reserved.

作者

我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。

评论

主要评分

4.5
评分不足

次要评分

新颖性
-
重要性
-
科学严谨性
-
评价这篇论文

推荐

暂无数据
暂无数据