Article

FloatX: A C++ Library for Customized Floating-Point Arithmetic

Journal

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/3368086

Keywords

ACM proceedings; LATEX; text tagging

Funding

  1. CICYT project of the MINECO [TIN2014-53495-R, TIN2017-82972-R]
  2. CICYT project of the FEDER [TIN2014-53495-R, TIN2017-82972-R]
  3. EU [732631]

We present FloatX (Float eXtended), a C++ framework for investigating the effect of customized floating-point formats in numerical applications. FloatX formats are based on binary IEEE 754, with smaller significand and exponent bit counts specified by the user. Among other properties, FloatX facilitates an incremental transformation of the code, relies on hardware-supported floating-point types as a back end to preserve efficiency, and incurs no storage overhead. The article discusses in detail the design principles, programming interface, and datatype casting rules behind FloatX. Furthermore, it demonstrates FloatX's usage and benefits via several case studies: well-known dense numerical linear algebra libraries, such as BLAS and LAPACK; the Ginkgo library for sparse linear systems; and two neural network applications related to image processing and text recognition.
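To make the idea of a reduced format on a hardware back end concrete, the following is a minimal sketch, not FloatX's actual interface or implementation. It assumes a hypothetical helper, round_to_format, that takes a double and rounds it to the nearest value representable with a user-chosen number of exponent and explicit significand bits, following IEEE 754 conventions (round to nearest, ties to even; gradual underflow; overflow to infinity). The function name, its parameters, and the choice of double as the back-end type are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper for illustration only (NOT part of the FloatX API).
// Rounds x to the nearest value of an IEEE-754-style binary format with
// exp_bits exponent bits and sig_bits explicit significand bits.
// Assumes 2 <= exp_bits <= 11, 1 <= sig_bits <= 52, and the default
// floating-point rounding mode (round to nearest, ties to even).
double round_to_format(double x, int exp_bits, int sig_bits) {
    if (!std::isfinite(x) || x == 0.0) {
        return x;  // zeros, infinities, and NaNs pass through unchanged
    }
    const int bias = (1 << (exp_bits - 1)) - 1;
    const int emax = bias;      // largest exponent of a normal number
    const int emin = 1 - bias;  // smallest exponent of a normal number

    int e = 0;
    std::frexp(x, &e);  // |x| = m * 2^e with 0.5 <= m < 1
    e -= 1;             // now |x| = 1.f * 2^e
    if (e < emin) {
        e = emin;       // subnormal range: spacing stays fixed
    }

    // Spacing between neighbouring representable values near x.
    const double quantum = std::ldexp(1.0, e - sig_bits);

    // Scaling by a power of two is exact, so the only rounding happens
    // inside nearbyint, which honours the current rounding mode.
    const double rounded = std::nearbyint(x / quantum) * quantum;

    // Largest finite value of the target format: (2 - 2^-sig_bits) * 2^emax.
    const double max_finite =
        std::ldexp(2.0 - std::ldexp(1.0, -sig_bits), emax);
    if (std::fabs(rounded) > max_finite) {
        return std::copysign(std::numeric_limits<double>::infinity(), x);
    }
    return rounded;
}
```

With 5 exponent bits and 10 significand bits (the IEEE 754 half-precision layout), round_to_format(0.1, 5, 10) yields 0.0999755859375, and round_to_format(70000.0, 5, 10) overflows to infinity. A library built on this principle can compute each operation in the hardware type and then round the result to the narrower format, which is one way to read the abstract's remark about relying on hardware-supported types for efficiency.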
