[Solved] What is the most efficient way to zero all bits below the most significant set bit?

Question

There’s no single instruction that can do this. BMI1 blsi dst,src can isolate the lowest set bit, not the highest. i.e. x & -x. If x86 had a bit-reversed version of blsi, we could use that, but it doesn’t.

But you can do much better than what you were suggesting. An all-zero input is always going to be a special case for bit-scan and shift. Otherwise our output has exactly 1 bit set. It’s 1 << bsr(input).

;; input: x in RDI
;; output: result in RAX
isolate_msb:
    xor   eax, eax           ; tmp = 0
    bsr   rdi, rdi           ; edi = bit index of MSB in input
    jz    .input_was_zero
    bts   rax, rdi           ; rax |= 1<<edi

.input_was_zero:             ; return 0 for input=0
    ret

Obviously for 32-bit inputs, use only 32-bit registers. And if zero is not possible, omit the JZ. Using BSR instead of LZCNT gives us a bit-index, not 31-bitidx, so we can use it directly. But LZCNT is significantly faster on AMD.

The xor-zeroing is off the critical path, to prepare an input for BTS. xor-zero + BTS is the most efficient way to implement 1<<n on Intel CPUs. It’s 2 uops with 2c latency on AMD, so mov rax,1 / shl rax,cl would be better there. But worse on Intel because variable-count shifts are 3 uops, unless you use BMI2 shlx.

Anyway, the real work here is BSR + BTS, so that’s 3 cycle + 1 cycle latency on Intel SnB-family. (https://agner.org/optimize/)

In C / C++, you’d write this as

unsigned isolate_msb32(unsigned x) {
    unsigned bitidx = BSR32(x);
    //return 1ULL << bitidx;           // if x is definitely non-zero
    return x ? 1U << bitidx : x;
}

unsigned isolate_msb64(uint64_t x) {
    unsigned bitidx = BSR64(x);
    return x ? 1ULL << bitidx : x;
}

Where BSR32 is defined in terms of an intrinsic your compiler supports. This is where things get tricky, especially if you want a 64-bit version. There’s no single portable intrinsic. GNU C provides count-leading-zeros intrinsics, but GCC and ICC suck at optimizing 63-__builtin_clzll(x) back into just BSR. Instead they negate twice. There are builtins for BSR specifically, but those are even more compiler-specific than just MSVC vs. compilers that support GNU extensions (gcc/clang/ICC).

#include <stdint.h>

// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
    #ifdef __INTEL_COMPILER
        typedef unsigned int bsr_idx_t;
    #else
        #include <intrin.h>   // MSVC
        typedef unsigned long bsr_idx_t;
    #endif

    static inline
    unsigned BSR32(unsigned long x){
        bsr_idx_t idx;
        _BitScanReverse(&idx, x); // ignore bool retval
        return idx;
    }
    static inline
    unsigned BSR64(uint64_t x) {
        bsr_idx_t idx;
        _BitScanReverse64(&idx, x); // ignore bool retval
        return idx;
    }
#elif defined(__GNUC__)

  #ifdef __clang__
    static inline unsigned BSR64(uint64_t x) {
        return 63-__builtin_clzll(x);
      // gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
    }
  #else
    #define BSR64 __builtin_ia32_bsrdi
  #endif

    #include <x86intrin.h>
    #define BSR32(x) _bit_scan_reverse(x)

#endif

On the Godbolt compiler explorer, clang and ICC compile this branchlessly, even when they don’t know that x is non-zero.

All 4 compilers fail to use bts to implement 1<<bit. 🙁 It’s very cheap on Intel.

# clang7.0 -O3 -march=ivybridge   (for x86-64 System V)
# with -march=haswell and later it uses lzcnt and has to negate.  /sigh.
isolate_msb32(unsigned int):
        bsr     ecx, edi
        mov     eax, 1
        shl     rax, cl
        test    edi, edi
        cmove   eax, edi       # return 1<<bsr(x)  or  x (0) if x was zero
        ret

GCC and MSVC make branchy code. e.g.

# gcc8.2 -O3 -march=haswell
    mov     eax, edi
    test    edi, edi
    je      .L6
    bsr     eax, edi
    mov     edi, 1
    shlx    rax, rdi, rax    # BMI2:  1 uop instead of 3 for shl rax,cl
.L6:
    ret

Accepted Answer