wiki:FlonumsCowan

Version 17 (modified by cowan, 12 months ago)

--

Abstract

Flonums are a subset of the inexact real numbers provided by a Scheme implementation. In most Schemes, the flonums and the inexact reals are the same. It is required that if two flonums are equal in the sense of =, they are also equal in the sense of eqv?. That is, if 12.0f0 is a 32-bit inexact number, and 12.0 is a 64-bit inexact number, they cannot both be flonums. In this situation, it is recommended that the 64-bit numbers be flonums.

Rationale

Flonum arithmetic is already supported by many systems, mainly to remove type-dispatching overhead. Standardizing flonum arithmetic increases the portability of code that uses it. Standardizing the range of flonums would make flonum operations inefficient on some systems, which would defeat their purpose. Therefore, this SRFI specifies some of the semantics of flonums, but makes the range implementation-dependent.

The sources of the procedures in this SRFI are R7RS-small, SRFI 141, the R6RS flonum library, and the C99/POSIX library <math.h>, which should be available directly or indirectly to Scheme implementers. (The C90 version of <math.h> lacks asinh, acosh, atanh, erf, and tgamma.)

Specification

Flonum operations must be at least as accurate as their generic counterparts applied to flonum arguments. It is an error, except as otherwise noted, for an argument not to be a flonum. In some cases, operations should be more accurate than their naive generic expansions because they have a smaller total roundoff error. If the generic result is a non-real number, the result is +nan.0 if the implementation supports that number, or an arbitrary flonum if not.

This SRFI uses x, y, z as parameter names for flonum arguments, and ix, iy, iz as parameter names for integer-valued flonum arguments, i.e., flonums for which the integer? predicate returns true.

Constants

The following C99 constants are provided as Scheme variables.

fl-e

Value of the mathematical constant e. (C99 M_E)

fl-log2-e

Value of (fllog fl-e 2.0). (C99 M_LOG2E)

fl-log10-e

Value of (fllog fl-e 10.0). (C99 M_LOG10E)

fl-ln-2

Value of (fllog 2.0). (C99 M_LN2)

fl-ln-10

Value of (fllog 10.0). (C99 M_LN10)

fl-pi

Value of the mathematical constant π. (C99 M_PI)

fl-pi/2

Value of (fl/ fl-pi 2.0). (C99 M_PI_2)

fl-pi/4

Value of (fl/ fl-pi 4.0). (C99 M_PI_4)

fl-1/pi

Value of (fl/ 1.0 fl-pi). (C99 M_1_PI)

fl-2/pi

Value of (fl/ 2.0 fl-pi). (C99 M_2_PI)

fl-2/sqrt-pi

Value of (fl/ 2.0 (flsqrt fl-pi)). (C99 M_2_SQRTPI)

fl-sqrt-2

Value of (flsqrt 2.0). (C99 M_SQRT2)

fl-1/sqrt-2

Value of (fl/ 1.0 (flsqrt 2.0)). (C99 M_SQRT1_2)

fl-maximum

Value of the largest finite flonum.

fl-fast-multiply-add

Equal to #t if (fl+* x y z) is known to be faster than (fl+ (fl* x y) z), or #f otherwise. (C99 FP_FAST_FMA)

fl-integer-exponent-zero

Value of (flinteger-binary-log 0.0). (C99 FP_ILOGB0)

fl-integer-exponent-nan

Value of (flinteger-binary-log +nan.0). (C99 FP_ILOGBNAN)

Constructors

(flonum number)

Returns the closest flonum equivalent to number in the sense of = and <.

(fladjacent x y)

Returns a flonum adjacent to x in the direction of y. Specifically: if x < y, returns the smallest flonum larger than x; if x > y, returns the largest flonum smaller than x; if x = y, returns x. (C99 nextafter)

(flcopysign x y)

Returns a flonum whose magnitude is the magnitude of x and whose sign is the sign of y. (C99 copysign)

Predicates

(flonum? obj)

Returns #t if obj is a flonum and #f otherwise.

(fl= x y z ...)

(fl< x y z ...)

(fl> x y z ...)

(fl<= x y z ...)

(fl>= x y z ...)

These procedures return #t if their arguments are (respectively): equal, monotonically increasing, monotonically decreasing, monotonically nondecreasing, or monotonically nonincreasing; they return #f otherwise. These predicates must be transitive.

(flinteger? x)

(flzero? x)

(flpositive? x)

(flnegative? x)

(flodd? ix)

(fleven? ix)

(flfinite? x)

(flinfinite? x)

(flnan? x)

These numerical predicates test a flonum for a particular property, returning #t or #f. The flinteger? procedure tests whether the flonum is an integer, flzero? tests whether it is fl= to zero, flpositive? tests whether it is greater than zero, flnegative? tests whether it is less than zero, flodd? tests whether it is odd, fleven? tests whether it is even, flfinite? tests whether it is not an infinity and not a NaN, flinfinite? tests whether it is an infinity, and flnan? tests whether it is a NaN.

Note that (flnegative? -0.0) must return #f; otherwise it would lose the correspondence with (fl< -0.0 0.0), which is #f according to IEEE 754.
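The distinction between an ordering test and a sign test can be sketched in C as follows. All three function names are hypothetical; sign_of mirrors the copysign-based definition that this proposal gives for flsgn, and the sketch assumes an implementation with IEEE 754 signed zeros.

```c
#include <math.h>
#include <stdbool.h>

/* Ordering test: under IEEE 754, negative zero is not less than zero. */
bool is_negative(double x) {
    return x < 0.0;
}

/* Sign test: reads the sign bit, so it distinguishes -0.0 from 0.0. */
bool has_negative_sign(double x) {
    return signbit(x) != 0;
}

/* 1.0 carrying the sign of x, via C99 copysign. */
double sign_of(double x) {
    return copysign(1.0, x);
}
```

Here is_negative(-0.0) is false even though has_negative_sign(-0.0) is true and sign_of(-0.0) is -1.0, which is why a predicate defined by ordering and a procedure defined by sign can disagree at negative zero.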

Arithmetic

(fl+ x ...)

(fl* x ...)

Return the flonum sum or product of their flonum arguments. In general, they should return the flonum that best approximates the mathematical sum or product. (For implementations that represent flonums using IEEE binary floating point, the meaning of "best" is defined by the IEEE standards.)

(fl+* x y z)

Returns (fl+ (fl* x y) z), possibly faster. If the constant fl-fast-multiply-add is #t, it will definitely be faster. (C99 fma)

(fl- x y ...)

(fl/ x y ...)

With two or more arguments, these procedures return the difference or quotient of their arguments, associating to the left. With one argument, however, they return the additive or multiplicative inverse of their argument. In general, they should return the flonum that best approximates the mathematical difference or quotient. For undefined quotients, fl/ behaves as specified by the IEEE standards.

For implementations that represent flonums using IEEE binary floating point, the meaning of "best" is reasonably well-defined by the IEEE standards.

(flmax x ...)

(flmin x ...)

Return the maximum/minimum argument. If there are no arguments, flmax returns -inf.0 and flmin returns +inf.0 if the implementation provides these numbers; otherwise they return the negation of fl-maximum and fl-maximum, respectively.

(flabs x)

Returns the absolute value of x.

(flabsdiff x y)

Returns (flabs (fl- x y)). (Compare C99 fdim, which returns the positive difference: x - y if x > y, and +0 otherwise.)

(flsgn x)

Returns (flcopysign 1.0 x).

(flnumerator x)

(fldenominator x)

Returns the numerator/denominator of x as a flonum; the result is computed as if x were represented as a fraction in lowest terms. The denominator is always positive. The denominator of 0.0 is defined to be 1.0.

(flfloor x)

(flceiling x)

(flround x)

(fltruncate x)

These procedures return integral flonums for flonum arguments that are not infinities or NaNs. For such arguments, flfloor returns the largest integral flonum not larger than x. The flceiling procedure returns the smallest integral flonum not smaller than x. The fltruncate procedure returns the integral flonum closest to x whose absolute value is not larger than the absolute value of x. The flround procedure returns the closest integral flonum to x, rounding to even when x represents a number halfway between two integers.

Although infinities and NaNs are not integer objects, these procedures return an infinity when given an infinity as an argument, and a NaN when given a NaN.

Exponents and logarithms

Trigonometric functions

(flsin x) (C99 sin)

(flcos x) (C99 cos)

(fltan x) (C99 tan)

(flasin x) (C99 asin)

(flacos x) (C99 acos)

(flatan x [y]) (C99 atan and atan2)

(flsinh x) (C99 sinh)

(flcosh x) (C99 cosh)

(fltanh x) (C99 tanh)

(flasinh x) (C99 asinh)

(flacosh x) (C99 acosh)

(flatanh x) (C99 atanh)

These are the usual trigonometric and hyperbolic functions and their inverses. Note that if the result is not a real number, +nan.0 is returned if available, or if not, an arbitrary flonum. The flatan function, when passed two arguments, returns (flatan (/ y x)) without requiring the use of complex numbers. Implementations that use IEEE binary floating-point arithmetic should follow the relevant standards for these procedures.

flremquo  double remquo(double, double, int *)  returns two values, rounded remainder and low-order n bits of the quotient (n is implementation-defined)

Integer division

The following procedures are the flonum counterparts of procedures from SRFI 141:

flfloor/ flfloor-quotient flfloor-remainder
flceiling/ flceiling-quotient flceiling-remainder
fltruncate/ fltruncate-quotient fltruncate-remainder
flround/ flround-quotient flround-remainder
fleuclidean/ fleuclidean-quotient fleuclidean-remainder
flbalanced/ flbalanced-quotient flbalanced-remainder

They have the same arguments and semantics as their generic counterparts, except that it is an error if the arguments are not flonums.

Special functions

Scheme name                     C signature               Comments
flcomplementary-error-function  double erfc(double)       -
flerror-function                double erf(double)        -
flfirst-bessel                  double jn(int n, double)  Bessel function of the first kind, order n
flgamma                         double tgamma(double)     -
fllog-gamma                     double lgamma(double)     returns two values, log(|gamma(x)|) and sgn(gamma(x))
flsecond-bessel                 double yn(int n, double)  Bessel function of the second kind, order n

Junk

(flsqrt fl)

Returns the principal square root of fl. For −0.0, flsqrt should return −0.0; for other negative arguments, the result may be a NaN or some unspecified flonum.

(flexpt fl1 fl2)

Either fl1 should be non-negative, or, if fl1 is negative, fl2 should be an integer object. The flexpt procedure returns fl1 raised to the power fl2. If fl1 is negative and fl2 is not an integer object, the result may be a NaN, or may be some unspecified flonum. If fl1 is zero, then the result is zero.

Decoding flonums

decode-float float => significand, exponent, sign

scale-float float integer => scaled-float

float-radix float => float-radix

float-sign float-1 &optional float-2 => signed-float

float-digits float => digits1

float-precision float => digits2

integer-decode-float float => significand, exponent, integer-sign

Description:

decode-float computes three values that characterize float. The first value is of the same type as float and represents the significand. The second value represents the exponent to which the radix (notated in this description by b) must be raised to obtain the value that, when multiplied with the first result, produces the absolute value of float. If float is zero, any integer value may be returned, provided that the identity shown for scale-float holds. The third value is of the same type as float and is 1.0 if float is greater than or equal to zero or -1.0 otherwise.

decode-float divides float by an integral power of b so as to bring its value between 1/b (inclusive) and 1 (exclusive), and returns the quotient as the first value. If float is zero, however, the result equals the absolute value of float (that is, if there is a negative zero, its significand is considered to be a positive zero).

scale-float returns (* float (expt (float b float) integer)), where b is the radix of the floating-point representation. float is not necessarily between 1/b and 1.

float-radix returns the radix of float.

float-sign returns a number z such that z and float-1 have the same sign and also such that z and float-2 have the same absolute value. If float-2 is not supplied, its value is (float 1 float-1). If an implementation has distinct representations for negative zero and positive zero, then (float-sign -0.0) => -1.0.

float-digits returns the number of radix b digits used in the representation of float (including any implicit digits, such as a hidden bit).

float-precision returns the number of significant radix b digits present in float; if float is a float zero, then the result is an integer zero.

For normalized floats, the results of float-digits and float-precision are the same, but the precision is less than the number of representation digits for a denormalized or zero number.

integer-decode-float computes three values that characterize float - the significand scaled so as to be an integer, and the same last two values that are returned by decode-float. If float is zero, integer-decode-float returns zero as the first value. The second value bears the same relationship to the first value as for decode-float:

 (multiple-value-bind (signif expon sign)
                      (integer-decode-float f)
   (scale-float (float signif f) expon)) ==  (abs f)
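For radix-2 flonums, the C99 functions frexp and ldexp behave much like decode-float and scale-float as described above: frexp yields a significand in the range 1/2 (inclusive) to 1 (exclusive) together with a binary exponent. The following C sketch is offered only as a possible mapping; the three function names are hypothetical.

```c
#include <float.h>
#include <math.h>

/* Analogue of decode-float's first value: the significand of |x|,
   which lies in [0.5, 1) for nonzero finite x. */
double significand_of(double x) {
    int exponent;
    return frexp(fabs(x), &exponent);
}

/* Analogue of decode-float's second value: the binary exponent. */
int exponent_of(double x) {
    int exponent;
    frexp(x, &exponent);
    return exponent;
}

/* Analogue of scale-float: x times 2 raised to the power n. */
double scale(double x, int n) {
    return ldexp(x, n);
}
```

For 8.0 the significand is 0.5 and the exponent is 4, and scale(0.5, 4) recovers 8.0, mirroring the identity shown for integer-decode-float. The macros FLT_RADIX and DBL_MANT_DIG from <float.h> play the roles of float-radix and float-digits.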

NaN functions

The following functions are applicable not only to flonum NaNs but to all inexact real NaNs, which is why their names do not begin with fl.

(make-nan payload)

Returns a NaN, using the exact integer payload in an implementation-defined way to generate the payload bits. In particular, the sign bit of the NaN is set from the sign of payload. If the implementation does not support NaNs, it is an error.

(nan-payload nan)

Returns the payload of nan.

(nan-signaling? nan)

Returns #t if nan is a signaling NaN, and #f otherwise. This function is required because different floating-point processors implement the signaling bit in different ways: on most processors, the most significant bit of the payload is clear if the NaN is signaling, but on the PA-RISC and MIPS processors it is set.

(nan= x y)

Returns #t if x and y are both NaNs with the same payload, and #f otherwise.
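As a rough sketch of how these functions might be realized on IEEE binary64 hardware, the C code below packs and unpacks the fraction bits of a double. It assumes the common convention in which the most significant fraction bit is set for a quiet NaN, treats the remaining 51 bits as the payload, and ignores the sign bit; all names and the bit layout choices are assumptions of this sketch, not requirements of the text.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define EXPONENT_BITS UINT64_C(0x7FF0000000000000) /* all-ones exponent */
#define QUIET_BIT     UINT64_C(0x0008000000000000) /* top fraction bit */
#define PAYLOAD_MASK  UINT64_C(0x0007FFFFFFFFFFFF) /* low 51 bits */

/* Builds a quiet NaN whose low 51 fraction bits hold payload. */
double make_quiet_nan(uint64_t payload) {
    uint64_t bits = EXPONENT_BITS | QUIET_BIT | (payload & PAYLOAD_MASK);
    double result;
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Extracts the low 51 fraction bits of a double. */
uint64_t payload_of(double value) {
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits & PAYLOAD_MASK;
}

/* True when both arguments are NaNs carrying the same payload. */
int same_nan(double x, double y) {
    return isnan(x) && isnan(y) && payload_of(x) == payload_of(y);
}
```

Note that two NaNs built from the same payload still compare unequal under ==, which is why a separate comparison such as nan= is needed.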