There’s no single instruction that can do this. BMI1 blsi dst,src
can isolate the lowest set bit, not the highest. i.e. x & -x
. If x86 had a bit-reversed version of blsi
, we could use that, but it doesn’t.
But you can do much better than what you were suggesting. An all-zero input is always going to be a special case for bit-scan and shift. Otherwise our output has exactly 1 bit set. It’s 1 << bsr(input)
.
;; input: x in RDI
;; output: result in RAX
isolate_msb:
xor eax, eax ; tmp = 0
bsr rdi, rdi ; edi = bit index of MSB in input
jz .input_was_zero
bts rax, rdi ; rax |= 1<<edi
.input_was_zero: ; return 0 for input=0
ret
Obviously for 32-bit inputs, use only 32-bit registers. And if zero is not possible, omit the JZ. Using BSR instead of LZCNT gives us a bit-index, not 31-bitidx, so we can use it directly. But LZCNT is significantly faster on AMD.
The xor-zeroing is off the critical path, to prepare an input for BTS. xor-zero + BTS is the most efficient way to implement 1<<n
on Intel CPUs. It’s 2 uops with 2c latency on AMD, so mov rax,1
/ shl rax,cl
would be better there. But worse on Intel because variable-count shifts are 3 uops, unless you use BMI2 shlx
.
Anyway, the real work here is BSR + BTS, so that’s 3 cycle + 1 cycle latency on Intel SnB-family. (https://agner.org/optimize/)
In C / C++, you’d write this as
unsigned isolate_msb32(unsigned x) {
unsigned bitidx = BSR32(x);
//return 1ULL << bitidx; // if x is definitely non-zero
return x ? 1U << bitidx : x;
}
unsigned isolate_msb64(uint64_t x) {
unsigned bitidx = BSR64(x);
return x ? 1ULL << bitidx : x;
}
Where BSR32
is defined in terms of an intrinsic your compiler supports. This is where things get tricky, especially if you want a 64-bit version. There’s no single portable intrinsic. GNU C provides count-leading-zeros intrinsics, but GCC and ICC suck at optimizing 63-__builtin_clzll(x)
back into just BSR. Instead they negate twice. There are builtins for BSR specifically, but those are even more compiler-specific than just MSVC vs. compilers that support GNU extensions (gcc/clang/ICC).
#include <stdint.h>
// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
#ifdef __INTEL_COMPILER
typedef unsigned int bsr_idx_t;
#else
#include <intrin.h> // MSVC
typedef unsigned long bsr_idx_t;
#endif
static inline
unsigned BSR32(unsigned long x){
bsr_idx_t idx;
_BitScanReverse(&idx, x); // ignore bool retval
return idx;
}
static inline
unsigned BSR64(uint64_t x) {
bsr_idx_t idx;
_BitScanReverse64(&idx, x); // ignore bool retval
return idx;
}
#elif defined(__GNUC__)
#ifdef __clang__
static inline unsigned BSR64(uint64_t x) {
return 63-__builtin_clzll(x);
// gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
}
#else
#define BSR64 __builtin_ia32_bsrdi
#endif
#include <x86intrin.h>
#define BSR32(x) _bit_scan_reverse(x)
#endif
On the Godbolt compiler explorer, clang and ICC compile this branchlessly, even when they don’t know that x
is non-zero.
All 4 compilers fail to use bts
to implement 1<<bit
. 🙁 It’s very cheap on Intel.
# clang7.0 -O3 -march=ivybridge (for x86-64 System V)
# with -march=haswell and later it uses lzcnt and has to negate. /sigh.
isolate_msb32(unsigned int):
bsr ecx, edi
mov eax, 1
shl rax, cl
test edi, edi
cmove eax, edi # return 1<<bsr(x) or x (0) if x was zero
ret
GCC and MSVC make branchy code. e.g.
# gcc8.2 -O3 -march=haswell
mov eax, edi
test edi, edi
je .L6
bsr eax, edi
mov edi, 1
shlx rax, rdi, rax # BMI2: 1 uop instead of 3 for shl rax,cl
.L6:
ret
1
solved What is the most efficient way to zero all bits below the most significant set bit?