C/C++

Arbitrary Precision Floating-point Arithmetic

By Frederick C. Motteler, August 01, 1993

Here's a general-purpose C library for extended precision and IEEE-754 compatibility. The K&R/ANSI C/C++ compatible package supports single, double, double-extended, and longer, IEEE-754-like formats and is portable across operating systems including UNIX and MS-DOS.

SEP93: Arbitrary Precision Floating-point Arithmetic

Fred is a senior engineer at Zetron and has worked on a variety of portable native applications and embedded applications. He can be contacted at either [email protected] or Zetron Inc., 12335 134th Ct. N.E., Redmond, WA 98052-2433.

Floating-point arithmetic is usually something most programmers take for granted because virtually all compilers support single- and double-precision for applications that require it. Many compilers even support double-extended precision.

However, there are exceptions: Your application might need a few more bits of precision than what your compiler or hardware provides, or you may have just spent the last three days (and nights) trying to optimize your algorithm, attempting to squeeze it into doubles without success. Perhaps you have a ton of data in IEEE-754 format and your compiler doesn't know what an IEEE number is. Alternately, your application may need to run on a processor that's either so old or so new that the only compiler you can get for it doesn't support floating point. Possibly you're looking for something that will run on a DEC Alpha and on a CP/M machine and produce the same result.

Whatever the reason, there are times when extended precision, IEEE-754 compatibility, and/or portability are important constraints. The C library presented here was developed with IEEE-754 compatibility and portability as its primary goals. Its first application was as part of a portable cross compiler and I assumed that both host and target environments did not have IEEE compatible single- and double-precision floating-point support.

Later applications required extended precision and the ability to convert to and from IEEE-754 double- and single-precision formats. The result is a general-purpose library that supports single, double, double-extended, and longer IEEE-754-like formats. The library has been ported to and tested on a variety of systems including CP/M, PCs running MS-DOS (Zortech, Microsoft, Lattice, Mark Williams compilers), PCs running Coherent, Sun 3s (Sun's C compiler and the GNU compiler), Sparcstations (Sun's C compiler), and the IBM RS/6000 under AIX. The package is K&R, ANSI C, and C++ compatible. A table-driven tester included with the library checks if it has compiled properly. The tester also gives you an idea of what the package is capable of doing.

Sound to good to be true? The only known drawback is that slugs and snails probably run faster. (All right, you DEC Alpha users, make that a really fast slug!) This is the result of numerous tradeoffs required to achieve portability, extensibility, and IEEE compatibility. Also, if execution speed is important, the integer portion of the package can be significantly optimized. Most of the time required is actually spent doing integer operations (see my article "Statistical Performance Analysis," DDJ, December 1991). Operations such as shifting multibyte integers are not efficient when coded in C.

The package requires dynamic memory allocation, malloc(), and free(). Included is support for verifying that all memory malloced during an operation has been properly freed when the operation is complete. Also required are unsigned char (8 bits or more), unsigned int (16 bits or more), and unsigned long (32 bits or more) data types that are compatible with the ANSI C specification.

The package is quite big, so rather than examine the nitty gritty details of how to do arithmetic and write portable software, I'll highlight interesting features and aspects of it. For implementation details, source code comments are the best guide. Listing One, page 88, (fmlib.h) gives function prototypes for all user-interface functions. The complete package is provided electronically; see "Availability," page 3.

Dynamic Creation of a Numerical Representation

All numbers are represented as unsigned character (byte) arrays. The length of a specific array depends on the length of the numerical representation that it contains. For example, a standard IEEE double requires eight bytes.

The advantage of byte arrays is the ease with which new representations may be created and manipulated. For example, a representation with 100 mantissa bits and 11 exponent bits needs a total of 112 bits, or 14 bytes. Creating a value with this representation is a two-step process. First, 14 bytes of memory must be allocated for each specific instance of the representation. The memory may be either statically allocated (at compile time) or dynamically allocated (at run time) on the stack, or by using malloc() or calloc().

Once the memory is allocated, one of the functions fltoflt(), intoflt(), or strtoflt() perform the initial conversion operations. Example 1(a), for instance, converts the double (pointed to by doubleptr) to the 14-byte representation (pointed to by destptr). With similar ease, Example 1(b) converts the long pointed to by longptr to the 14-byte representation. And for data represented in a string format, Example 1(c) converts a string (up to 128 bytes of it) to the 14-byte representation. A relatively free format string representation with no internal white space is expected by the strtoflt() function. Examples of valid strings are 3.501, 5, --.34e-12, and +123.456E+19.

The only caveat to these operations is that the byte ordering of native integer and float representations may not match the byte order and byte packing assumed in this library. The representations used here are all big-endian format. For integers, this means that the most significant byte is in low memory. For floats, the exponent is in low memory. These representations are compatible with Motorola 68xxx formats, but backwards for Intel 80x8x formats (for more information, refer to Motorola's MC68881 Floating-Point Coprocessor User's Manual and Intel's Microsystem Components Handbook). For PC users, this means that the native values must be byte-order flipped before being passed to a conversion function.

Arithmetic With a Numerical Representation

Once a number has been converted to the desired representation, the number may be part of an arithmetic operation. The functions, faddm(), fsubm(), fmultm(), and fdivm() perform the basic +, --, *, and / operations.

For example, if dividend_ptr and divisor_ptr point to two values in our 14 byte representation, then Example 2 divides the value pointed to by dividend_ptr by the value pointed to by divisor_ptr. The resulting quotient is returned in the 14 bytes pointed to by dividend_ptr. This overwrites the original dividend value.

All interface functions in the package return an unsigned byte condition code. Functions that return a floating-point value use the following floating-point condition code format:

msb| 0 | 0 | 0 | 0 | N | Z | I | n |lsb

where N=sign (1 if result negative), Z=zero (1 if result 0), I=infinite (1 if result infinite), and n=Not A Number (1 if result is Not A Number). After each operation, the value of the returned condition code must be examined to determine the integrity of the result. For example, a divide-by-zero operation returns a condition code with the I (infinite) bit set.

Getting Results

Conversion of results back to a native representation is just as important as the initial conversion to a custom format. The fltoflt(), fltoint(), and fltostr() functions perform the inverse conversion operations. Example 3(a) converts the 14-byte representation pointed to by srcptr to a double in the 8 bytes pointed to by doubleptr. The fltoflt() function is its own inverse. As a conversion function, it is very powerful since it can convert between any of the possible floating-point representations supported by the package. Correspondingly, Example 3(b) converts the 14-byte representation to a long in the 4 bytes pointed to by longptr. For data represented in a string format, Example 3(c) converts the 14-byte representation to a string (up to 128 bytes long) pointed to by stringptr. The output format is similar to the printf() "e" format. All significant figures are included. The strings produced by fltostr() are compatible with strings expected by strtoflt().

Functions that return an integer value, such as fltoint(), use the integer condition code format:

msb| 0 | 0 | 0 | 0 | Z | V | S | C |lsb

where Z=zero (1 if result 0), V=overflow (1 if overflow result), S=sign (1 if result negative), and C=carry. As mentioned earlier, the value of the returned condition code must be examined to determine the integrity of the result. For example, an overflow during conversion of a large float value to an integer returns a condition code with the V (overflow) bit set.

An Example: The Square Root of 1/2 to 100 Places

Listing Two, page 88, (root2ext.c) was used to calculate the square root of 1/2 presented in Figure 1. The command line root2ext .5 1e-105 372 11 1 guarantees accuracy to about 105 places. The first 100 digits are in Figure 1.

A simple power series expansion is used. Calculation of successive terms involves a simple recursion relation that requires six floating operations per iteration. When calculating the square root of 1/2, each successive term results in about one bit of additional accuracy. The series converges for values between 0 and 2. Most rapid convergence is for values near 1.

All numerical input values are accepted as strings. For the sake of convenience, even constant float values are initially represented as strings and converted to the desired representation. The root2ext program was run on a Sun 3/60 (big-endian byte order) and a PC (little-endian byte order). When compiling for a PC, the symbol IBM_PC should be defined on the compile-command line to ensure proper conversion of PC native integers to float values.

The result is presented as a string with all significant digits represented. The root2ext program was also used to determine approximate flops (floating-point operations per second) values for various representations. Table 1 summarizes the results. As expected, the time required for each floating-point operation increases rapidly with increasing precision.

Float-to-String Conversion

One of the more useful aspects of this package is that it includes functions to convert binary representations to and from string representations (see Listing Three, page 88). This is one operation that most programmers leave to printf() and scanf()--probably for good reason. Efficient conversion of binary floating-point values is much more difficult than binary integer values.

Conversion from a binary float representation to a string representation first requires that the binary exponent be converted into a decimal exponent. This is determined by using:

decimal exponent
     = binary exponent * log10(2)
     = binary exponent * 0.30103000

It's relatively easy to convert the binary exponent to a float, multiply it by 0.30103000, and convert the result back to an integer. Use of a single-precision float restricts the accuracy of the conversion to 24 bits. As a result, this limits the maximum exponent-field width to 24 bits.

Once the approximate decimal exponent is known, the actual floating-point value of ten raised to the decimal exponent must be calculated. A common approach to this is to have a table of floating-point powers of ten. For the standard-IEEE single-precision float representation, a table with 12 entries (10¹, 10^-1, and so on, for powers of 10 +/-2, +/-4, +/-8, +/-16, and +/-32) is required. For the standard double representation, a table with six more entries (for powers of 10 +/-64, +/-128, +/-256) is required. To calculate a specific power of 10, say 10²⁷, the values for 10¹⁶, 10⁸, 10², and 10¹ are multiplied together (27=16+ 8+2+1). This approach is a reasonable compromise between accuracy, table size, and computation time.

The disadvantage of this approach is that the table values must be known in advance. This is fine for supporting a limited number of representations such as IEEE float and double. For arbitrary representations that are defined at run time, the table values must effectively be calculated on the fly. Starting with a conversion of the integer 10 to the desired representation, subsequent binary powers of 10 in the representation are calculated by squaring the previous value. For the example above (1027), this requires that 10¹, 10², 10⁴, 10⁸, 10¹⁶ be calculated. In the library, these operations are combined together in the intopten() function.

The next step is to determine decimal mantissa value. The decimal mantissa must be between 0.1 and 1. To good approximation, the decimal mantissa is given by dividing the original float value by 10 raised to the decimal exponent:

mantissa value = original float/10<SUP>E</SUP>

where E equals decimal exponent.

Due to truncation errors, the resulting mantissa value may not be between 0.1 and 1. If it is less than 0.1, then the mantissa value is multiplied by 10 and the exponent value is decremented. If it is greater than or equal to 1, then the mantissa value is divided by 10 and the exponent value is incremented. A mantissa value equal to 0.1 is a special case and is explicitly handled.

At this point, the mantissa value is just a floating point value between 0.1 and 1. It must be converted to decimal digits. The number of decimal digits is determined by the bit length of the mantissa:

decimal mantissa digits
     = binary mantissa digits * log10(2)
     = binary mantissa digits * 0.30103000

The binary mantissa value is then multiplied by 10^D, where D equals the number of decimal mantissa digits. When converted back to an integer, this creates a multi-byte integer value that has all of the significant figures of the original binary mantissa. The integer value is then converted to decimal by repeated integer division by 10. The remainder of each division gives successively more significant decimal digits.

The resulting decimal mantissa digits and decimal exponent are then concatenated into a string representation.

String-to-Float Conversion

Conversion of a string representation back to a binary representation is handled in somewhat of a reverse order from float-to-string conversion. The mantissa portion is converted into a float value by successively adding in each digit after multiplying the previous digits' value by 10. While the mantissa is being converted, the number of digits behind the decimal point (if any) is counted.

Next, the number of digits behind the decimal point (D) is subtracted from the decimal exponent (E). The value 10(^E-D) then is calculated using intopten(). This value is multiplied by the mantissa value to obtain the converted binary float value. Special precautions are required to handle cases where partial results may underflow or overflow, but the final result is within range of the representation.

Verification

An arithmetic package is useful only if it reliably produces accurate results over the entire domain (all possible input values) and range (all possible output values) of all functions. This places stringent, but well-defined analytical requirements on the arithmetic package.

A comprehensive table-driven tester is included with the package and is provided electronically; see "Availability," page 3. The tables give string representations of input values and expected output values. This has the advantage of testing the string conversion functions as well as the specific arithmetic operations. In addition to providing comprehensive regression testing, new test cases are easy to set up and verify. In all cases, the returned condition codes are checked.

The current version tests a variety of input and output conditions. These include overflow, underflow, almost overflow, barely overflow, almost underflow, barely underflow, divide-by-zero, multiply-by-zero, infinite input value, and so on.

Portability

As mentioned earlier, the library has been ported to, and tested on, a variety of systems and is K&R, ANSI C, and C++ compatible. The table-driven tester checks if it has compiled properly. For ANSI and C++ environments, the symbol PROTOTYPES must be defined on the compile command line. For environments without <stdlib.h>, the symbol MWC must be defined on the compile-command line. To verify that all dynamically-allocated memory is properly released, the symbol MEMTEST must be defined on the compile command line.

When building the package, I recommend that imlib.c be compiled with the TEST symbol defined. This produces a stand-alone tester of the integer portion of the package. The integer portion of the package must work reliably for the rest of the package to work.

What About C++?

The next step for this package is to encapsulate it into a C++ class. The package already supports static and dynamic creation and manipulation of floating-point data representations. The ability to overload operators in C++ allows arbitrary precision representations to be used with the same ease as native representations. This would greatly facilitate porting application software to take advantage of arbitrary-precision floating point.

Example 1:

(a)
    unsigned char *destptr;
    double *doubleptr;
    destptr = malloc(14);
    fltoflt(doubleptr, 52, 11, destptr, 100, 11);
(b)
    long *longptr;
    intoflt(longptr, 4, destptr, 100, 11);
(c)
    char *stringptr;
    strtoflt(stringptr, 128, destptr, 100, 11);

Example 2:

    unsigned char *dividend_ptr, *divisor_ptr;
    fdivm(dividend_ptr, divisor_ptr, 11, 100);

Example 3:

(a)
    unsigned char *srcptr;
    double *doubleptr;
    fltoflt(srcptr, 100, 11, doubleptr, 52, 11);
(b)
    long *longptr;
    fltoint(srcptr, 100, 11, longptr, 4);
(c)
    char *stringptr;
    fltostr(srcptr, 100, 11, stringptr, 128)

Figure 1:

 
sqrt(1/2) = 
             0.70710 67811 86547 52440 08443
               62104 84903 92848 35937 68847
               40365 88339 86899 53662 39231
               05351 94251 93767 16382 07864

Table 1: Results of the root2ext program. Each iteration requires six floating-point operations. (Flops = 6 * Iterations * Trials/Time.)

===========================================================================
                                               Sun 3/60       40MHz/386DX
    Mantissa   Exponent  Iterations  Trials  Time    Flops   Time    Flops
      bits       bits                        (sec)           (sec)
===========================================================================
  "float" 23      8         17        60      54     110       9      680
  "double" 52    11         42        15      69      55      14      270
      104        11         90         4     108      20      22       98
      212        11        188         1     189       6      40       28
      372        11        319         1     965       2     195       10
===========================================================================



_ARBITRARY PRECISION FLOATING-POINT ARITHMETIC_
by Frederick Motteler

[LISTING ONE]

/* Extended IEEE Compatible Floating Point Arithmetic Library
** Version 1.1 Copyright 1990, 1992 by Fred Motteler, All Rights Reserved */

/* Bit masks for floating point condition code values */
#define FFNAN 1
#define FFINF 2
#define FFZERO 4
#define FFNEG 8

/* Floating point functions */
#ifdef PROTOTYPES
unsigned char fmultm(unsigned char *prodPB, unsigned char *termPB,
                     int expbitN, int fracbitN);
unsigned char fdivm(unsigned char *dividPB, unsigned char *termPB,
                    int expbitN, int fracbitN);
int fcmpm(unsigned char *flt1PB, unsigned char *flt2PB, int expbitN,
          int fracbitN);
unsigned char fsubm(unsigned char *diffPB, unsigned char *termPB,
                    int expbitN, int fracbitN);
unsigned char faddm(unsigned char *sumPB, unsigned char *termPB,
                    int expbitN, int fracbitN);
#else
unsigned char fmultm();         /* Floating point multiply */
unsigned char fdivm();          /* Floating point dividend */
unsigned char faddm();          /* Floating point addition */
unsigned char fsubm();          /* Floating point subtraction */
int fcmpm();                    /* Floating point comparison */
#endif
/* Numeric conversion functions */
#ifdef PROTOTYPES
unsigned char intoflt(unsigned char *intvalBP, int intlenN,
                      unsigned char *fltvalBP, int fracbitN, int expbitN);
unsigned char fltoint(unsigned char *fltvalBP, int fracbitN, int expbitN,
                      unsigned char *intvalBP, int intlenN);
unsigned char fltoflt(unsigned char *fltinBP, int mantinN, int expinN,
                      unsigned char *fltoutBP, int mantoutN, int expoutN);
#else
unsigned char intoflt();        /* Integer to float */
unsigned char fltoint();        /* Float to integer */
unsigned char fltoflt();        /* Float to float */
#endif

/* Numberic / string conversion functions */
#ifdef PROTOTYPES
unsigned char intostr(unsigned char *intvalBP, int intlenN,
                      char *strBP, int slenN, int radixN);

unsigned char fltostr(unsigned char *fltvalBP, int fracbitN, int expbitN,
                      char *strBP, int slenN);
unsigned char strtoint(char *strBP, int slenN, unsigned char *intvalBP,
                       int intlenN, int radixN);
unsigned char strtoflt(char *strBP, int slenN, unsigned char *fltvalBP,
                       int fracbitN, int expbitN);
unsigned char intopten(unsigned char *fltexpBP, int expbyteN,
                       unsigned char *ptenfltBP, int fracbitN, int expbitN);
#else
unsigned char intostr();        /* Integer to ASCII string */
unsigned char fltostr();        /* Float to decimal ASCII string */
unsigned char strtoint();       /* ASCII string to integer */
unsigned char strtoflt();       /* ASCII string to float */
unsigned char intopten();       /* Generate float integral powers of 10 */
#endif

[LISTING TWO]

<a name="0269_0015">

/*  This is a simple program to find the square root of a number
**  using a power series expansion.
**  sqrt(1 - x) = 1 - (1/2) * x - (1/8) * x ^ 2 - ... + f[n](0)/n! * x ^ n
**  Recursion relation:
**  f[n](0)/n! = ((2n - 3) / (2 * n)) * f[n-1](0)   n > 1
*/
#include <stdio.h>
#include "fmlib.h"

#define LENGTH 48               /* Up to 44 byte mantissa, 4 byte exponent */
#define TRIALS 10

void
usage()
{
    printf("Usage: root2ext value accuracy mant exp trials\n");
    printf("Where:          value = value to find square root of\n");
    printf("                accuracy = desired accuracy of result\n");
    printf("                mant = number of mantissa bits (23 to 384)\n");
    printf("                exp = number of exponent bits (8 to 31)\n");
    printf("                trials = number of repeat calcuations\n");
    exit(-1);
}
void
main(argc, argv)
int argc;
char *argv[];
{
    unsigned char y[LENGTH];    /* Value to find square root of, this
                                 * should be between 0 and 2. */
    unsigned char x[LENGTH];    /* Value used in series expansion, this
                                 * should between -1 and 1 */
    unsigned char term[LENGTH]; /* Value of specific term */
    unsigned char sum[LENGTH];  /* Sum of terms */
    unsigned char limit[LENGTH]; /* Accuracy limit */
    unsigned char two_n[LENGTH]; /* Temporary variable for (2 * n) */
    unsigned char three[LENGTH]; /* Value of 3 */
    unsigned char one[LENGTH];  /* Value of 1 */
    unsigned char temp[LENGTH]; /* Temporary variable */
    unsigned char cc;           /* Condition codes */
    char string_array[128];     /* String conversion area */
    int mant_bits;              /* Number of bits in the mantissa */
    int exp_bits;               /* Number of bits in the exponent */
    int trials;                 /* Number of trials */
    int n;                      /* Current term number */
    int twon;
    int i;                      /* Copy index */
    int j;                      /* Trial index */

    /* Check input arguments for acceptable values */
    if (argc != 6)
        usage();
    if (sscanf(argv[3], "%d", &mant_bits) != 1)
        usage();
    if (sscanf(argv[4], "%d", &exp_bits) != 1)
        usage();
    if (sscanf(argv[5], "%d", &trials) != 1)
        usage();
    if ((mant_bits < 23) || (mant_bits > 384) ||
        (exp_bits < 8) || (exp_bits > 31))
        usage();
    for (j = 0; j < trials; j++)
    {
        /* Convert value to find square root of and accuracy */
        strtoflt(argv[1], 128, y, mant_bits, exp_bits);
        strtoflt(argv[2], 128, limit, mant_bits, exp_bits);
        /* Calculate constants, constant 3 for recursion relation */
        strtoflt("3", 2, three, mant_bits, exp_bits);
        /* Sum = 1.0, initial sum */
        strtoflt("1.0", 12, sum, mant_bits, exp_bits);
        /* x = 1 - y, Calculate x value to use in series */
        for (i = 0; i < LENGTH; i++)
            x[i] = sum[i];
        fsubm(x, y, exp_bits, mant_bits);
        /* Calculate initial term = 0.5 * x, n = 1 term */
        strtoflt("0.5", 4, term, mant_bits, exp_bits);
        cc = fmultm(term, x, exp_bits, mant_bits);
        n = 2;                  /* Next term to calculate */
        /* Loop until term is less than limiting value.  Note that for
         * an alternating series, only the positive values are tested. */
        while ((fcmpm(term, limit, mant_bits, exp_bits) > 0) ||
              ((cc & FFNEG) != 0))
        {
            /* sum -= term */
            fsubm(sum, term, exp_bits, mant_bits);
            /* term(n) = term(n-1) * (x * ((2 * n) - 3) / (2 * n)) */
            twon = 2 * n;
            intoflt(&twon, 4, two_n, mant_bits, exp_bits);
            for (i = 0; i < LENGTH; i++)
                temp[i] = two_n[i];
            fsubm(temp, three, exp_bits, mant_bits);
            fdivm(temp, two_n, exp_bits, mant_bits);
            fmultm(temp, x, exp_bits, mant_bits);
            cc = fmultm(term, temp, exp_bits, mant_bits);
            n++;
        }
    }
    printf("Done...\n");
    /* Print out the result */
    fltostr(sum, mant_bits, exp_bits, string_array, 128);
    printf("Iteration %d is: %s\n", n, string_array);
    return;
}



<a name="0269_0016"><a name="0269_0017">

[LISTING THREE]

<a name="0269_0017">

/* Extended IEEE Compatible Floating Point Arithmetic Library
** Version 1.1 Copyright 1990, by Fred Motteler, All Rights Reserved */
#include <stdio.h>
#ifndef MWC
#include <stdlib.h>
#endif
#include "imlib.h"
#include "ffmlib.h"
#include "fmlib.h"

#define SINGLEXP 8
#define SINGLEFRAC 23
#define SINGLETOT 4

static unsigned char log10of2AB[4] = {0x3e, 0x9a, 0x20, 0x9b};

#ifdef TEST
#define DOUBLEXP 11
#define DOUBLEFRAC 52
#define DOUBLETOT 8
#define EXTENDEXP 15
#define EXTENDFRAC 63
#define EXTENDTOT 10
unsigned char intval1AB[4] = {0x0, 0x0, 0x12, 0x34};
unsigned char intval2AB[4] = {0x12, 0x34, 0x56, 0x78};
unsigned char fltval1AB[4] = {0x3e, 0x9a, 0x20, 0x9b};
unsigned char dblval1AB[8] = {0x52, 0x34, 0x56, 0x78,
                              0x9a, 0xbc, 0xde, 0xf0};

/* Function:    unsigned char fltostr(unsigned char *fltvalBP, int fracbitN,
**                                    int expbitN, char *strBP, int slenN)
** Converts floating point value pointed to fltvalBP to a decimal ASCII string
** representation of the value pointed to by strBP. fracbitN length of the
** floating point mantissa in bits. expbitN is length of floating
** point exponent in bits. strBP is the length of string buffer in bytes. */
unsigned char
#ifdef PROTOTYPES
fltostr(unsigned char*fltvalBP,int fracbitN,int expbitN, char*strBP,int slenN)
#else
fltostr(fltvalBP, fracbitN, expbitN, strBP, slenN)
unsigned char *fltvalBP;
int fracbitN;
int expbitN;
char *strBP;
int slenN;
#endif
{
    int expbyteN, fracbyteN;
    unsigned char *expbiasBP;
    unsigned char *exponeBP;
    unsigned char *fltfracBP, *fltexpBP;
    unsigned char *tempfltBP, *tempexpBP;
    unsigned char *ptenfltBP, *onetenthBP, *oneBP;
    unsigned char ptenccB;
    unsigned char fltsignB;
    unsigned char condcodeB;
    int i, totalenN;
    unsigned char minusoneB, zeroB;
    unsigned int mantlenN;
    unsigned char *mantintBP;
    unsigned char *mantlenBP;

    minusoneB = 0xff;
    zeroB = 0;
    /* Initialize the condition code byte to zero */
    condcodeB = 0;
    /* Determine the total byte length of the floating point number */
    totalenN = fftotlen(expbitN, fracbitN);
    /* Determine number of bytes required to hold mantissa and exponent. */
    expbyteN = ffexplen(expbitN);
    fracbyteN = ffraclen(fracbitN);
    fltfracBP = (unsigned char *) FCALLOC(fracbyteN, 1, "FLTOSTR1");
    fltexpBP = (unsigned char *) FMALLOC(expbyteN, "FLTOSTR2");
    expbiasBP = (unsigned char *) FCALLOC(expbyteN, 1, "FLTOSTR3");
    exponeBP = (unsigned char *) FCALLOC(expbyteN, 1, "FLTOSTR4");
    /* Isolate the mantissas, exponents, and signs. */
    ffextall(fltvalBP, totalenN, fracbitN, fracbyteN, expbitN, expbyteN,
             fltfracBP, fltexpBP, &fltsignB);
    /* Write sign bit and decimal point to the output string. */
    if (fltsignB == 0)
    {
        *strBP++ = '+';
    }
    else
    {
        *strBP++ = '-';
        condcodeB = FFNEG;
        ffbitclr(fltvalBP, totalenN, (fracbitN + expbitN));
    }
    *strBP++ = '.';
    *(exponeBP + (expbyteN - 1)) = 1;
    /* Check the type of floating point number that we have: zero, infinity,
     * or Not-A-Number.  First check for zero value. */
    if (ffchkzero(fltexpBP, expbyteN) == 0)
    {
        /* The exponent value is zero.  Check if the mantissa is also zero.
         * First clear the implied most significant mantissa bit. */
        ffbitclr(fltfracBP, fracbyteN, fracbitN);
        if (ffchkzero(fltfracBP, fracbyteN) == 0)
        {
            /* The mantissa value is also zero, we have a zero value. */
            strcpy((char *) strBP, "0e+0");
            FFREE(fltfracBP);
            FFREE(fltexpBP);
            FFREE(expbiasBP);
            FFREE(exponeBP);
            return((unsigned char) (condcodeB | FFZERO));
        }
        /* Set the implied most significant mantissa bit to 1 */
        ffbitset(fltfracBP, fracbyteN, fracbitN);
    }
    /* Having the exponent bias is useful for the next step. */
    ffgenbias(expbiasBP, expbyteN, expbitN);
    /* Check if exponent value is set to maximum possible value. This is done
     * by making a copy of exponent bias, shifting it left once, then setting
     * LSB to one. The result is compared  with exponent value. */
    tempexpBP = (unsigned char *) FMALLOC(expbyteN, "FLTOSTR5");
    for (i = 0; i < expbyteN; i++)
        *(tempexpBP + i) = *(expbiasBP + i);
    ushftlm(tempexpBP, expbyteN);
    *(tempexpBP + expbyteN - 1) |= 1;
    if (ucmpm(tempexpBP, fltexpBP, expbyteN) == 0)
    {
        /* The exponent value is set to its maximum value.
         * First clear the implied most significant mantissa bit. */
        ffbitclr(fltfracBP, fracbyteN, fracbitN);
        if (ffchkzero(fltfracBP, fracbyteN) == 0)
        {
            /* The mantissa value is zero, we have an Infinite value. */
            strcpy((char *) (--strBP), "Infinity");
            condcodeB |= FFINF;
        }
        else
        {
            /* The mantissa value is non-zero, we have a Not-A-Number. */
            strcpy((char *) (--strBP), "Not-A-Number");
            condcodeB |= FFNAN;
        }
        FFREE(fltfracBP);
        FFREE(fltexpBP);
        FFREE(expbiasBP);
        FFREE(exponeBP);
        FFREE(tempexpBP);
        return(condcodeB);
    }
    /* Ok floating point value. */
    FFREE(tempexpBP);
    /* Back to working on exponent... Subtract exponent bias, note if result
     * is negative, sign is properly extended. Important since it allows
     * exponent overflow and underflow to be detected much more easily. */
    isubm(fltexpBP, expbiasBP, expbyteN);
    /* Convert power of 2 exponent into an approximate power of 10 exponenet.
     * This is done by converting exponent value into a float, then
     * multiplying it by log10(2). Result is converted back to a integer. */
    tempfltBP = (unsigned char *) FMALLOC(SINGLETOT, "FLTOSTR6");
    intoflt(fltexpBP, expbyteN, tempfltBP, SINGLEFRAC, SINGLEXP);
    fmultm(tempfltBP, log10of2AB, SINGLEXP, SINGLEFRAC);
    fltoint(tempfltBP, SINGLEFRAC, SINGLEXP, fltexpBP, expbyteN);
    /* Add one to the power of 10 exponent */
    iaddm(fltexpBP, exponeBP, expbyteN);
    /* Convert the exponent power of ten into a float value */
    ptenfltBP = (unsigned char *) FMALLOC(totalenN, "FLTOSTR7");
    tempexpBP = (unsigned char *) FMALLOC(expbyteN, "FLTOSTR8");
    for (i = 0; i < expbyteN; i++)
        *(tempexpBP + i) = *(fltexpBP + i);
    ptenccB = intopten(tempexpBP, expbyteN, ptenfltBP, fracbitN, expbitN);
    /* Check if either an overflow or underflow occurred. */
    if (((ptenccB & FFINF) == FFINF) || ((ptenccB & FFZERO) == FFZERO))
    {
        if ((ptenccB & FFINF) == FFINF)
        {
            /* The power of ten is a bit too big... */
            isubm(fltexpBP, exponeBP, expbyteN);
        }
        else
        {
            /* The power of ten is a bit too small... */
            iaddm(fltexpBP, exponeBP, expbyteN);
        }
        /* Re initialize temporary copy of exponent value, and recalculate
         * the modified power of ten. */
        for (i = 0; i < expbyteN; i++)
            *(tempexpBP + i) = *(fltexpBP + i);
        ptenccB = intopten(tempexpBP, expbyteN, ptenfltBP, fracbitN, expbitN);
    }
    FFREE(tempexpBP);
    /* Divide the original float by the power of 10. */
    fdivm(fltvalBP, ptenfltBP, expbitN, fracbitN);
    /* Check if the result is less than 1/10.  First generate
     * 1/10 in the desired floating point format. */
    onetenthBP = (unsigned char *) FMALLOC(totalenN, "FLTOSTR9");
    intopten(&minusoneB, 1, onetenthBP, fracbitN, expbitN);
    if ((i = fcmpm(fltvalBP, onetenthBP, expbitN, fracbitN)) < 0)
    {
        /* If so, divide the result by 1/10 (same as multiplying it by 10)
        ** and subtract 1 from the power of ten. */
        fdivm(fltvalBP, onetenthBP, expbitN, fracbitN);
        /* Subtract one from the power of 10 exponent */
        isubm(fltexpBP, exponeBP, expbyteN);
    }
    else if (i == 0)
    {
        /* The result is equal to 1/10...  This is a special boundary
         * case that needs to be handled by brute force... */
        strcpy((char *) strBP, "1e+1");
        FFREE(fltfracBP);
        FFREE(fltexpBP);
        FFREE(expbiasBP);
        FFREE(exponeBP);
        FFREE(tempfltBP);
        FFREE(ptenfltBP);
        FFREE(onetenthBP);
        return(condcodeB);
    }
    /* Check if the result is greater than 1.  First generate 1 in the
     * desired floating point format. */
    oneBP = (unsigned char *) FMALLOC(totalenN, "FLTOSTR10");
    intopten(&zeroB, 1, oneBP, fracbitN, expbitN);
    if (fcmpm(fltvalBP, oneBP, expbitN, fracbitN) > 0)
    {
        /* If so, multiply the result by 1/10 and add 1 to the power
         * of ten. */
        fmultm(fltvalBP, onetenthBP, expbitN, fracbitN);
        /* Add one to the power of 10 exponent */
        iaddm(fltexpBP, exponeBP, expbyteN);
    }
    /* fltvalBP points to a floating point number between 1/10 and 1. */
    /* Convert binary length of the mantissa into decimal digit length.
     * This gives the number of significant digits and determines the power
     * of then to multiply the mantissa by to convert it to an integer. */
    mantlenN = (fracbitN + 1);
    /* Convert mantissa bit length from an int to a format compatible with
     * two byte integers. */
    mantlenBP = (unsigned char *) FMALLOC(2, "FLTOSTR11");
    *(mantlenBP+1) = (unsigned char) mantlenN;
    *mantlenBP = (unsigned char) (mantlenN >> 8);
    /* Now do conversion from binary length to decimal digit length. */
    intoflt(mantlenBP, 2, tempfltBP, SINGLEFRAC, SINGLEXP);
    fmultm(tempfltBP, log10of2AB, SINGLEXP, SINGLEFRAC);
    fltoint(tempfltBP, SINGLEFRAC, SINGLEXP, mantlenBP, 2);
    FFREE(tempfltBP);
    /* Convert result back to int */
    mantlenN = (((unsigned int) (*mantlenBP)) << 8) +
        (unsigned int) (*(mantlenBP+1));
    mantlenN++;
    /* And then back to a two byte integer */
    *(mantlenBP+1) = (unsigned char) mantlenN;
    *mantlenBP = (unsigned char) (mantlenN >> 8);
    /* Calculate appropriate power of 10 to multiply mantissa by, then multiply
     * it. Convert result to an int, then to an ASCII string. */
    tempfltBP = (unsigned char *) FMALLOC(totalenN, "FLTOSTR12");
    intopten(mantlenBP, 2, tempfltBP, fracbitN, expbitN);
    fmultm(fltvalBP, tempfltBP, expbitN, fracbitN);
    mantlenN = mantlenN >> 1;
    mantintBP = (unsigned char *) FMALLOC(mantlenN, "FLTOSTR13");
    fltoint(fltvalBP, fracbitN, expbitN, mantintBP, mantlenN);
    condcodeB = intostr(mantintBP, mantlenN, strBP, (slenN - 2), 10);
    /* Free the temporary buffer for the two byte integer mantissa length */
    FFREE(mantlenBP);
    /* Add on exponent part */
    strBP += strlen((char *) strBP);
    slenN -= (2 + strlen((char *) strBP));
    *strBP++ = 'e';
    /* Determine the exponent sign. */
    if (((*(fltexpBP + expbyteN - 1)) & 0x80) != 0x80)
        *strBP++ = '+';
    condcodeB = intostr(fltexpBP, expbyteN, strBP, slenN, 10);
    FFREE(fltfracBP);
    FFREE(fltexpBP);
    FFREE(expbiasBP);
    FFREE(exponeBP);
    FFREE(tempfltBP);
    FFREE(ptenfltBP);
    FFREE(onetenthBP);
    FFREE(oneBP);
    FFREE(mantintBP);
    return(condcodeB);
}
/* Function:    unsigned char *intopten(unsigned char *fltexpBP, int expbyteN,
**                        unsigned char *ptenfltBP, int fracbitN, int expbitN)
** This function calulates 10 to the integer value pointed to by fltexpBP.
** fltexpBP points to an integer value that is expbyteN bytes long. The float
** result is written to memory buffer pointed to by ptenfltBP. fracbitN and
** expbitN give number of bits in floating point's mantissa and exponent. */
unsigned char
#ifdef PROTOTYPES
intopten(unsigned char *fltexpBP, int expbyteN, unsigned char *ptenfltBP,
         int fracbitN, int expbitN)
#else
intopten(fltexpBP, expbyteN, ptenfltBP, fracbitN, expbitN)
unsigned char *fltexpBP;
int expbyteN;
unsigned char *ptenfltBP;
int fracbitN;
int expbitN;
#endif
{
    unsigned char condcodeB;
    unsigned char *ptenBP;
    unsigned char tenB;
    unsigned char oneB;
    int totalenN;
    /* Initialize one and ten values */
    oneB = 1;
    tenB = 10;
    /* Initialize the condition code byte to zero */
    condcodeB = 0;
    /* Determine the total byte length of the floating point number */
    totalenN = fftotlen(expbitN, fracbitN);
    /* Allocate space for the power of ten multiplier/divisor and initialize
     * it to 10. */
    ptenBP = (unsigned char *) FMALLOC(totalenN, "INTOPTEN");
    intoflt(&tenB, 1, ptenBP, fracbitN, expbitN);
    /* Initialize the result to 1. */
    intoflt(&oneB, 1, ptenfltBP, fracbitN, expbitN);
    /* Check the sign bit of the exponent */
    if (((*fltexpBP) & 0x80) == 0)
    {
        /* Exponent is positive.  Multiply result value by binary power of 10
         * for each corresponding bit set in the exponent. Loop until the
         * exponent value is zero. */
        while (ucheckm(fltexpBP, expbyteN) != 0)
        {
            /* Check if the least significant bit is one. */
            if (((*(fltexpBP + expbyteN - 1)) & 1) != 0)
            {
                /* If so, multiply in the power of ten. */
                condcodeB = fmultm(ptenfltBP, ptenBP, expbitN, fracbitN);
                /* Check if an overflow has occurred */
                if ((condcodeB & FFINF) == FFINF)
                {
                    FFREE(ptenBP);
                    return(condcodeB);
                }
            }
            ushftrm(fltexpBP, expbyteN);
            /* Generate next binary power of 10 by squaring previous result. */
            fmultm(ptenBP, ptenBP, expbitN, fracbitN);
        }
    }
    else
    {
        /* Exponent is negative. Change sign of exponent value, then divide
         * result value by binary power of 10 for each corresponding bit set
         * in exponent. Loop until the exponent value is zero. */
        inegm(fltexpBP, expbyteN);
        while (ucheckm(fltexpBP, expbyteN) != 0)
        {
            /* Check if the least significant bit is one. */
            if (((*(fltexpBP + expbyteN - 1)) & 1) != 0)
            {
                /* If so, then divide the previous result by power of ten. */
                fdivm(ptenfltBP, ptenBP, expbitN, fracbitN);
                /* Check if an underflow has occurred */
                if ((condcodeB & FFZERO) == FFZERO)
                {
                    FFREE(ptenBP);
                    return(condcodeB);
                }
            }
            ushftrm(fltexpBP, expbyteN);
            /* Generate the next binary power of 10 by squaring the previous
             * result. */
            fmultm(ptenBP, ptenBP, expbitN, fracbitN);
        }
    }
    /* Clean up and return. */
    FFREE(ptenBP);
    return(condcodeB);
}

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

C/C++

Arbitrary Precision Floating-point Arithmetic

Dynamic Creation of a Numerical Representation

Arithmetic With a Numerical Representation

Getting Results

An Example: The Square Root of 1/2 to 100 Places

Float-to-String Conversion

String-to-Float Conversion

Verification

Portability

What About C++?

Example 1:

Example 2:

Example 3:

Figure 1:

Table 1: Results of the root2ext program. Each iteration requires six floating-point operations. (Flops = 6 * Iterations * Trials/Time.)

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

C/C++

Arbitrary Precision Floating-point Arithmetic

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content