ieee-754 Archives

[Solved] a is a double, printf("%d", a); works differently in IA32 and IA32-64 [closed]

January 22, 2023 by Kirat

%d is for printing an int. Historically the d stood for “decimal”, in contrast with o for octal and x for hexadecimal. To print a double, use %e, %f or %g. Using the wrong format specifier causes undefined behaviour, which means anything may happen, including unexpected results. … Read more
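As a minimal sketch of the point above, the two helpers below pair each conversion with an argument of the matching type. The names format_double and format_as_int are hypothetical and are not taken from the original question.

```c
#include <stdio.h>

/* Hypothetical helper: format a double with a conversion that expects a double.
 * %f, %e and %g all take a double; %d takes an int and must not be used here. */
int format_double(char *buf, size_t size, double a)
{
    return snprintf(buf, size, "%f", a);
}

/* Hypothetical helper: if an integer rendering is really wanted, convert the
 * value explicitly first so that the argument type matches %d. */
int format_as_int(char *buf, size_t size, double a)
{
    return snprintf(buf, size, "%d", (int)a);  /* the cast truncates toward zero */
}
```

Calling format_double(buf, sizeof buf, 3.5) writes 3.500000 into buf, while format_as_int with 3.9 writes 3. Passing the double straight to %d would compile (usually with a warning) but its behaviour is undefined.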

[Solved] float to embedded c float over UART [closed]

November 16, 2022 by Kirat

You cannot convert an integer bit pattern to a float by means of a cast. A cast converts the integer's value to the nearest floating-point number; it does not reinterpret the bits as a float. If it worked the way you think, this: const float x = 512; would set x to 7.175E-43, which would not be … Read more
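One common way to reinterpret the bits rather than convert the value is to copy the bytes with memcpy. The sketch below assumes that float is a 32-bit IEEE-754 single and that the four bytes arrive least significant byte first; both are assumptions, and the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret 32 raw bits as a float.
 * Assumes sizeof(float) == 4 and IEEE-754 single precision. */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);  /* copies the bytes, no numeric conversion */
    return f;
}

/* Hypothetical helper: rebuild the 32-bit word from four received bytes.
 * Assumes the sender transmits the least significant byte first. */
uint32_t word_from_bytes(const uint8_t b[4])
{
    return (uint32_t)b[0]
         | (uint32_t)b[1] << 8
         | (uint32_t)b[2] << 16
         | (uint32_t)b[3] << 24;
}
```

With these, float_from_bits(0x3F800000u) yields 1.0f, whereas the cast (float)0x3F800000u yields the value 1065353216.0f.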

[Solved] How do I truncate the significand of a floating point number to an arbitrary precision in Java? [duplicate]

October 7, 2022 by Kirat

Suppose x is the number you wish to reduce the precision of and bits is the number of significant bits you wish to retain. When bits is sufficiently large and the order of magnitude of x is sufficiently close to 0, then x * (1L << (bits - Math.getExponent(x))) will scale x so that the … Read more
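For comparison, here is a sketch in C of a different technique from the scaling approach the excerpt begins to describe: it clears the low bits of the stored fraction directly. It assumes double is an IEEE-754 binary64 value, and truncate_significand is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep the top `bits` significant bits of x and clear
 * the rest, which truncates toward zero. Assumes IEEE-754 binary64.
 * For subnormal inputs the count starts at the fixed position of the
 * implicit bit, not at the first set bit. */
double truncate_significand(double x, int bits)
{
    uint64_t u;
    int drop;

    if (x != x || bits < 1 || bits > 53)
        return x;                        /* leave NaN and bad requests alone */

    memcpy(&u, &x, sizeof u);
    drop = 53 - bits;                    /* 52 stored bits plus the implicit 1 */
    u &= ~((UINT64_C(1) << drop) - 1);   /* zero the low `drop` fraction bits */
    memcpy(&x, &u, sizeof x);
    return x;
}
```

For example, truncate_significand(5.75, 3) keeps the binary digits 101 and returns 5.0, and truncate_significand(0.1, 4) returns 0.09375.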

[Solved] Problematic understanding of IEEE 754 [closed]

September 28, 2022 by Kirat

What is precision? It refers to how closely a binary floating-point representation can approximate a real value. Real values have infinite precision and infinite range. Digital values have finite range and precision. In practice, a single-precision IEEE-754 value can represent real values to a precision of about 6 significant decimal figures, while double-precision is good for … Read more
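The finite precision of single precision can be seen with a small round-trip check. This is only a sketch: it assumes float is an IEEE-754 single, and the helper names are hypothetical.

```c
#include <float.h>

/* Number of decimal digits a float is guaranteed to preserve,
 * as reported by the implementation in <float.h>. */
int float_decimal_digits(void)
{
    return FLT_DIG;
}

/* Hypothetical helper: returns 1 if the integer n survives a conversion
 * to float and back unchanged, 0 otherwise. The volatile qualifier forces
 * the value to be stored as a real float, not kept in a wider register. */
int survives_float(long n)
{
    volatile float f = (float)n;
    return (long)f == n;
}
```

On an IEEE-754 implementation, float_decimal_digits() returns 6. The 8-digit value 16777216 (2 to the power 24) survives the round trip, but 16777217 does not, because a float has only 24 significant bits.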

[Solved] Blatant floating point error in C++ program

September 23, 2022 by Kirat

An 80-bit long double (not sure about its size in MSVS) can store around 18 significant decimal digits without loss of precision. 1300010000000000000144.5700788999 has 32 significant decimal digits and cannot be stored exactly as a long double. Read Number of Digits Required For Round-Trip Conversions for more details.
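A rough way to check whether a decimal string fits is to parse it and print it back with the same number of decimals. The sketch below is written in C rather than C++, and round_trips is a hypothetical helper. The result for the 32-digit value depends on the platform's long double format.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse `text` as a long double, print it back with
 * `decimals` digits after the point, and report whether the text is
 * reproduced exactly. `text` must be in plain fixed notation. */
int round_trips(const char *text, int decimals)
{
    char out[128];
    long double v = strtold(text, NULL);

    snprintf(out, sizeof out, "%.*Lf", decimals, v);
    return strcmp(out, text) == 0;
}
```

A short value such as "144.5700788999" (13 significant digits) round-trips. With an 80-bit long double, "1300010000000000000144.5700788999" does not, because at that magnitude adjacent representable values are 128 apart. LDBL_DIG from <float.h> reports how many decimal digits the implementation guarantees to preserve.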