[Solved] Assembly code fsqrt and fmul instructions

Question

It looks like you are trying to do something similar to this:

#include <stdio.h>

double hullSpeed(double lgth)
{
    double result;

    __asm__(
            "fldl %1\n\t" //st(0)=>st(1), st(0)=lgth . FLDL means load double float
            "fsqrt\n\t"   //st(0) = square root st(0)
            "fmulp\n\t"   //Multiplies st(0) and st(1) (1.34). Result in st(0)
            : "=&t" (result) : "m" (lgth), "0" (1.34));

    return result;
}

int main()
{
    printf ("%f\n", hullSpeed(64.0));
}

The template I used can be simplified, but it will suffice for demonstration purposes. We use the "=&t" constraint because the result is returned at the top of the floating point stack in st(0), and the ampersand marks it as early clobber (we’ll be using the top of the floating point stack to pass in 1.34). We pass lgth as a memory reference via the constraint "m" (lgth), and the "0" (1.34) constraint says we will pass in 1.34 in the same register as operand 0, which in this case is the top of the floating point stack. An extended asm statement can also end with a clobber list, which names registers (or memory) that the assembler code overwrites but that don’t appear as an input or output operand; this first example does not use one.
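To see the "0" matching constraint in isolation, here is a minimal sketch that only takes a square root. The function name x87Sqrt is my own invention for illustration, and the code assumes GCC (or a compatible compiler) targeting x86:

```c
/* Sketch only: x87Sqrt is a hypothetical helper, not part of the original code. */
double x87Sqrt(double value)
{
    double result;

    __asm__(
            "fsqrt\n\t"        // st(0) = square root of st(0)
            : "=t" (result)    // operand 0: the result is read from st(0)
            : "0" (value));    // input shares the register of operand 0, so st(0)

    return result;
}
```

Since fsqrt neither pushes nor pops the floating point stack, this version should not need an early clobber or a clobber list. Calling x87Sqrt(64.0) should give 8.0.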

Inline assembler is a very difficult way to learn assembly language. The machine constraints specific to x86 can be found here under x86 family. Information on the constraint modifiers can be found here, and information on GCC extended assembler templates can be found here.

I’m only giving you a starting point, as GCC’s inline assembler usage can be rather complex and any answer may be too broad for a Stack Overflow answer. The fact that you are using inline assembler with x87 floating point makes it that much more complex.

Once you have a handle on constraints and modifiers, another approach that lets the compiler generate better code is:

__asm__(
        "fsqrt\n\t"   // st(0) = square root st(0)
        "fmulp\n\t"   // Multiplies st(0) and st(1) (1.34). Result in st(0)
        : "=t"(result) : "0"(lgth), "u" (1.34) : "st(1)");

Hint: the "u" constraint places a value in x87 floating point register st(1). Together, the constraints place lgth in st(0) and 1.34 in st(1) before our code runs, so we no longer load anything ourselves, which reduces the work done inside the assembler code itself. Because fmulp pops the stack, st(1) is no longer valid once the inline assembly completes, so we list it as a clobber.
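The same pattern works when the multiplier is a variable rather than the constant 1.34. The following is a rough sketch under that assumption; scaledSqrt and its factor parameter are hypothetical names, and the comments trace what each instruction does to the floating point stack:

```c
/* Sketch only: scaledSqrt and factor are hypothetical, for illustration. */
double scaledSqrt(double value, double factor)
{
    double result;

    __asm__(
            /* On entry the compiler has set st(0) = value, st(1) = factor */
            "fsqrt\n\t"   // st(0) = sqrt(value), st(1) = factor
            "fmulp\n\t"   // st(1) = st(1) * st(0), then pop: st(0) = product
            : "=t" (result)               // result is read from st(0)
            : "0" (value), "u" (factor)   // value in st(0), factor in st(1)
            : "st(1)");                   // fmulp popped the "u" input

    return result;
}
```

With these constraints, scaledSqrt(64.0, 1.34) should compute the same value as hullSpeed(64.0).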

If you are developing 64-bit applications I highly recommend using SSE/SSE2 at a minimum for basic floating point calculations. The code above should work on 32-bit and 64-bit. In 64-bit code the x87 floating point instructions are generally not as efficient as SSE/SSE2, but they will work.
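For comparison, here is one way the same calculation might look with SSE2 intrinsics instead of x87 inline assembly. This is only a sketch: hullSpeedSse2 is a name I made up, and it assumes a compiler that provides <emmintrin.h> (on 32-bit targets GCC typically also needs -msse2):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Sketch only: hullSpeedSse2 is a hypothetical SSE2 counterpart to hullSpeed. */
double hullSpeedSse2(double lgth)
{
    __m128d value  = _mm_set_sd(lgth);            /* low lane = lgth */
    __m128d root   = _mm_sqrt_sd(value, value);   /* low lane = sqrt(lgth) */
    __m128d scaled = _mm_mul_sd(root, _mm_set_sd(1.34)); /* low lane *= 1.34 */

    return _mm_cvtsd_f64(scaled);                 /* extract the low lane */
}
```

In practice, writing sqrt(lgth) * 1.34 in plain C and letting the compiler choose the instructions is usually simpler still; the intrinsics just make the SSE2 instructions explicit.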

Rounding with Inline Assembly and x87

If you are attempting to round based on one of the 4 rounding modes on the x87 you can utilize code like this:

#include <stdint.h>
#include <stdio.h>

#define RND_CTL_BIT_SHIFT   10

typedef enum {
    ROUND_NEAREST_EVEN =    0 << RND_CTL_BIT_SHIFT,
    ROUND_MINUS_INF =       1 << RND_CTL_BIT_SHIFT,
    ROUND_PLUS_INF =        2 << RND_CTL_BIT_SHIFT,
    ROUND_TOWARD_ZERO =     3 << RND_CTL_BIT_SHIFT
} RoundingMode;

double roundd (double n, RoundingMode mode)
{
    uint16_t cw;        /* Storage for the current x87 control register */
    uint16_t newcw;     /* Storage for the new value of the control register */
    uint16_t dummyreg;  /* Temporary dummy register used in the template */

    __asm__ __volatile__ (
            "fstcw %w[cw]          \n\t" /* Read current x87 control register into cw*/
            "fwait                 \n\t" /* Do an fwait after an fstcw instruction */
            "mov %w[cw],%w[treg]   \n\t" /* treg = value in cw variable*/
            "and $0xf3ff,%w[treg]  \n\t" /* Set rounding mode bits 10 and 11 of control
                                            register to zero*/
            "or %w[rmode],%w[treg] \n\t" /* Set the rounding mode bits */
            "mov %w[treg],%w[newcw]\n\t" /* newcw = value for new control reg value*/
            "fldcw %w[newcw]       \n\t" /* Set control register to newcw */
            "frndint               \n\t" /* st(0) = round(st(0)) */
            "fldcw %w[cw]          \n\t" /* restore control reg to orig value in cw*/
            : [cw]"=m"(cw),
              [newcw]"=m"(newcw),
              [treg]"=&r"(dummyreg),  /* Register constraint with dummy variable
                                         allows compiler to choose available register */
              [n]"+t"(n)              /* +t constraint passes `n` through 
                                         top of FPU stack (st0) for both input&output*/
            : [rmode]"rmi"((uint16_t)mode)); /* "g" constraint same as "rmi" */

    return n;
}

double hullSpeed(double lgth)
{
    double result;

    __asm__(
            "fsqrt\n\t"   // st(0) = square root st(0)
            "fmulp\n\t"   // Multiplies st(0) and st(1) (1.34). Result in st(0)
            : "=t"(result) : "0"(lgth), "u" (1.34) : "st(1)");
    
    return result;
}

int main()
{
    double dbHullSpeed = hullSpeed(64.0);
    printf ("%f, %f\n", dbHullSpeed, roundd(dbHullSpeed, ROUND_NEAREST_EVEN));
    printf ("%f, %f\n", dbHullSpeed, roundd(dbHullSpeed, ROUND_MINUS_INF));
    printf ("%f, %f\n", dbHullSpeed, roundd(dbHullSpeed, ROUND_PLUS_INF));
    printf ("%f, %f\n", dbHullSpeed, roundd(dbHullSpeed, ROUND_TOWARD_ZERO));
    return 0;
}

As you pointed out in the comments, there was equivalent code in this Stack Overflow answer, but it used multiple __asm__ statements and you were curious how a single __asm__ statement could be coded.
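To make the contrast concrete, a multi-statement version might look roughly like the sketch below, with the bit manipulation done in C between the __asm__ statements. This is my own illustration, not the code from the linked answer; the name rounddMulti and the rcBits parameter (a rounding mode value already shifted into bits 10 and 11) are hypothetical:

```c
#include <stdint.h>

/* Sketch only: rounddMulti is hypothetical and is not the linked answer's code.
   rcBits is a rounding mode already shifted into bits 10-11, e.g. 3 << 10. */
double rounddMulti(double n, uint16_t rcBits)
{
    uint16_t cw;      /* original x87 control word */
    uint16_t newcw;   /* control word with the requested rounding mode */

    __asm__ __volatile__ ("fnstcw %0" : "=m" (cw));     /* read control word */
    newcw = (uint16_t)((cw & 0xf3ff) | rcBits);         /* replace bits 10-11 */
    __asm__ __volatile__ ("fldcw %0" : : "m" (newcw));  /* set new rounding mode */
    __asm__ __volatile__ ("frndint" : "+t" (n));        /* st(0) = round(st(0)) */
    __asm__ __volatile__ ("fldcw %0" : : "m" (cw));     /* restore control word */

    return n;
}
```

The single-statement form keeps the whole sequence together, so the compiler has no opportunity to place other floating point code between the control word change and its restoration.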

The rounding modes (0,1,2,3) can be found in the Intel Architecture Document:

Rounding Mode | RC Field | Description
Round to nearest (even) | 00B | Rounded result is the closest to the infinitely precise result. If two values are equally close, the result is the even value (that is, the one with the least-significant bit of zero). Default.
Round down (toward −∞) | 01B | Rounded result is closest to but no greater than the infinitely precise result.
Round up (toward +∞) | 10B | Rounded result is closest to but no less than the infinitely precise result.
Round toward zero (Truncate) | 11B | Rounded result is closest to but no greater in absolute value than the infinitely precise result.

Section 8.1.5 describes the fields of the x87 control word, and section 8.1.5.3 covers the rounding control field specifically. The 4 rounding modes are defined in figure 4-8 under section 4.8.4.
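The and/or steps in the inline assembly map directly onto the RC field in bits 10 and 11 of the control word. As a sketch of just that bit arithmetic in plain C (the helper names setRoundingBits and getRoundingBits are hypothetical, and 0x037F is used only as an example value, being the control word the x87 holds after initialization):

```c
#include <stdint.h>

#define RC_SHIFT 10
#define RC_MASK  (3u << RC_SHIFT)   /* bits 10 and 11, i.e. 0x0c00 */

/* Sketch only: returns cw with its RC field replaced by rc (0..3). */
uint16_t setRoundingBits(uint16_t cw, unsigned rc)
{
    /* Clearing with ~RC_MASK matches "and $0xf3ff" in the assembly */
    return (uint16_t)((cw & ~RC_MASK) | ((rc & 3u) << RC_SHIFT));
}

/* Sketch only: extracts the RC field (0..3) from a control word. */
unsigned getRoundingBits(uint16_t cw)
{
    return (cw >> RC_SHIFT) & 3u;
}
```

For example, setRoundingBits(0x037F, 3) selects round toward zero and yields 0x0F7F, and getRoundingBits(0x0F7F) returns 3.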
