[Solved] How do I truncate the significand of a floating point number to an arbitrary precision in Java? [duplicate]

Question

Suppose x is the number you wish to reduce the precision of and bits is the number of significant bits you wish to retain.

When bits is sufficiently large and the order of magnitude of x is sufficiently close to 0, then x * (1L << (bits - Math.getExponent(x))) will scale x so that the bits that need to be removed will appear in the fractional component (after the radix point) while the bits that will be retained will appear in the integer component (before the radix point). You can then round this to remove the fractional component and then divide the rounded number by (1L << (bits - Math.getExponent(x))) to restore the order of magnitude of x, i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.round(x * (1L << exponent)) / (1L << exponent);
}

However, (1L << exponent) will break down when Math.getExponent(x) > bits || Math.getExponent(x) < bits - 62. The solution is to use Math.pow(2, exponent) (or the fast pow2(exponent) implementation from this answer) to calculate a fractional, or a very large, power of 2, i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.round(x * Math.pow(2, exponent)) * Math.pow(2, -exponent);
}

However, Math.pow(2, exponent) will break down as exponent approaches -1074 or +1023. The solution is to use Math.scalb(x, exponent) so that the power of 2 doesn’t have to be explicitly calculated, i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.scalb(Math.round(Math.scalb(x, exponent)), -exponent);
}

However, Math.round(y) returns a long so it does not preserve Infinity, NaN, and cases where Math.abs(x) > Long.MAX_VALUE / Math.pow(2, exponent). Furthermore, Math.round(y) always rounds ties to positive infinity (e.g. Math.round(0.5) == 1 && Math.round(1.5) == 2). The solution is to use Math.rint(y) to receive a double and preserve the unbiased IEEE 754 round-to-nearest, ties-to-even rule (e.g. Math.rint(0.5) == 0.0 && Math.rint(1.5) == 2.0), i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
}

Finally, here is a unit test confirming our expectations:

public static String decompose(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    int SIGNIFICAND_WIDTH = 53;
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    return s.substring(0, 0 + SIGN_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH + EXP_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH + SIGNIFICAND_WIDTH - 1);
}

public static void test() {
    // Use a fixed seed so the generated numbers are reproducible.
    java.util.Random r = new java.util.Random(0);

    // Generate a floating point number that makes use of its full 52 bits of significand precision.
    double a = r.nextDouble() * 100;
    System.out.println(decompose(a) + " " + a);
    Assert.assertFalse(decompose(a).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));

    // Cast the double to a float to produce a "ground truth" of precision loss to compare against.
    double b = (float) a;
    System.out.println(decompose(b) + " " + b);
    Assert.assertTrue(decompose(b).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));
    // 32-bit float has a 23 bit significand, so c's bit pattern should be identical to b's bit pattern.
    double c = reducePrecision(a, 23);
    System.out.println(decompose(c) + " " + c);
    Assert.assertTrue(b == c);

    // 23rd-most significant bit in c is 1, so rounding it to the 22nd-most significant bit requires breaking a tie.
    // Since 22nd-most significant bit in c is 0, d will be rounded down so that its 22nd-most significant bit remains 0.
    double d = reducePrecision(c, 22);
    System.out.println(decompose(d) + " " + d);
    Assert.assertTrue(decompose(d).split(" ")[2].substring(22).equals(String.format("%0" + (52 - 22) + "d", 0)));
    Assert.assertTrue(decompose(c).split(" ")[2].charAt(22) == '1' && decompose(c).split(" ")[2].charAt(21) == '0');
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(21) == '0');
    // 21st-most significant bit in d is 1, so rounding it to the 20th-most significant bit requires breaking a tie.
    // Since 20th-most significant bit in d is 1, e will be rounded up so that its 20th-most significant bit becomes 0.
    double e = reducePrecision(c, 20);
    System.out.println(decompose(e) + " " + e);
    Assert.assertTrue(decompose(e).split(" ")[2].substring(20).equals(String.format("%0" + (52 - 20) + "d", 0)));
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(20) == '1' && decompose(d).split(" ")[2].charAt(19) == '1');
    Assert.assertTrue(decompose(e).split(" ")[2].charAt(19) == '0');

    // Reduce the precision of a number close to the largest normal number.
    double f = reducePrecision(a * 0x1p+1017, 23);
    System.out.println(decompose(f) + " " + f);
    // Reduce the precision of a number close to the smallest normal number.
    double g = reducePrecision(a * 0x1p-1028, 23);
    System.out.println(decompose(g) + " " + g);
    // Reduce the precision of a number close to the smallest subnormal number.
    double h = reducePrecision(a * 0x1p-1051, 23);
    System.out.println(decompose(h) + " " + h);
}

And its output:

0 10000000101 0010010001100011000110011111011100100100111000111011 73.0967787376657
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110000000000000000000000000000000 73.09677124023438
0 10000000101 0010010001100011001000000000000000000000000000000000 73.0968017578125
0 11111111110 0010010001100011000110100000000000000000000000000000 1.0266060746443803E308
0 00000000001 0010010001100011000110100000000000000000000000000000 2.541339559435826E-308
0 00000000000 0000000000000000000000100000000000000000000000000000 2.652494739E-315

Accepted Answer

Suppose x is the number you wish to reduce the precision of and bits is the number of significant bits you wish to retain.

When bits is sufficiently large and the order of magnitude of x is sufficiently close to 0, then x * (1L << (bits - Math.getExponent(x))) will scale x so that the bits that need to be removed will appear in the fractional component (after the radix point) while the bits that will be retained will appear in the integer component (before the radix point). You can then round this to remove the fractional component and then divide the rounded number by (1L << (bits - Math.getExponent(x))) to restore the order of magnitude of x, i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.round(x * (1L << exponent)) / (1L << exponent);
}

However, (1L << exponent) will break down when Math.getExponent(x) > bits || Math.getExponent(x) < bits - 62. The solution is to use Math.pow(2, exponent) (or the fast pow2(exponent) implementation from this answer) to calculate a fractional, or a very large, power of 2, i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.round(x * Math.pow(2, exponent)) * Math.pow(2, -exponent);
}

However, Math.pow(2, exponent) will break down as exponent approaches -1074 or +1023. The solution is to use Math.scalb(x, exponent) so that the power of 2 doesn’t have to be explicitly calculated, i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.scalb(Math.round(Math.scalb(x, exponent)), -exponent);
}

However, Math.round(y) returns a long so it does not preserve Infinity, NaN, and cases where Math.abs(x) > Long.MAX_VALUE / Math.pow(2, exponent). Furthermore, Math.round(y) always rounds ties to positive infinity (e.g. Math.round(0.5) == 1 && Math.round(1.5) == 2). The solution is to use Math.rint(y) to receive a double and preserve the unbiased IEEE 754 round-to-nearest, ties-to-even rule (e.g. Math.rint(0.5) == 0.0 && Math.rint(1.5) == 2.0), i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
}

Finally, here is a unit test confirming our expectations:

public static String decompose(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    int SIGNIFICAND_WIDTH = 53;
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    return s.substring(0, 0 + SIGN_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH + EXP_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH + SIGNIFICAND_WIDTH - 1);
}

public static void test() {
    // Use a fixed seed so the generated numbers are reproducible.
    java.util.Random r = new java.util.Random(0);

    // Generate a floating point number that makes use of its full 52 bits of significand precision.
    double a = r.nextDouble() * 100;
    System.out.println(decompose(a) + " " + a);
    Assert.assertFalse(decompose(a).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));

    // Cast the double to a float to produce a "ground truth" of precision loss to compare against.
    double b = (float) a;
    System.out.println(decompose(b) + " " + b);
    Assert.assertTrue(decompose(b).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));
    // 32-bit float has a 23 bit significand, so c's bit pattern should be identical to b's bit pattern.
    double c = reducePrecision(a, 23);
    System.out.println(decompose(c) + " " + c);
    Assert.assertTrue(b == c);

    // 23rd-most significant bit in c is 1, so rounding it to the 22nd-most significant bit requires breaking a tie.
    // Since 22nd-most significant bit in c is 0, d will be rounded down so that its 22nd-most significant bit remains 0.
    double d = reducePrecision(c, 22);
    System.out.println(decompose(d) + " " + d);
    Assert.assertTrue(decompose(d).split(" ")[2].substring(22).equals(String.format("%0" + (52 - 22) + "d", 0)));
    Assert.assertTrue(decompose(c).split(" ")[2].charAt(22) == '1' && decompose(c).split(" ")[2].charAt(21) == '0');
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(21) == '0');
    // 21st-most significant bit in d is 1, so rounding it to the 20th-most significant bit requires breaking a tie.
    // Since 20th-most significant bit in d is 1, e will be rounded up so that its 20th-most significant bit becomes 0.
    double e = reducePrecision(c, 20);
    System.out.println(decompose(e) + " " + e);
    Assert.assertTrue(decompose(e).split(" ")[2].substring(20).equals(String.format("%0" + (52 - 20) + "d", 0)));
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(20) == '1' && decompose(d).split(" ")[2].charAt(19) == '1');
    Assert.assertTrue(decompose(e).split(" ")[2].charAt(19) == '0');

    // Reduce the precision of a number close to the largest normal number.
    double f = reducePrecision(a * 0x1p+1017, 23);
    System.out.println(decompose(f) + " " + f);
    // Reduce the precision of a number close to the smallest normal number.
    double g = reducePrecision(a * 0x1p-1028, 23);
    System.out.println(decompose(g) + " " + g);
    // Reduce the precision of a number close to the smallest subnormal number.
    double h = reducePrecision(a * 0x1p-1051, 23);
    System.out.println(decompose(h) + " " + h);
}

And its output:

0 10000000101 0010010001100011000110011111011100100100111000111011 73.0967787376657
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110000000000000000000000000000000 73.09677124023438
0 10000000101 0010010001100011001000000000000000000000000000000000 73.0968017578125
0 11111111110 0010010001100011000110100000000000000000000000000000 1.0266060746443803E308
0 00000000001 0010010001100011000110100000000000000000000000000000 2.541339559435826E-308
0 00000000000 0000000000000000000000100000000000000000000000000000 2.652494739E-315