[Solved] loop unrolling not giving expected speedup for floating-point dot product

Your unroll doesn’t help with the FP latency bottleneck: without -ffast-math, sum + x + y + z is evaluated in the same order as sum += x; sum += y; … so you haven’t done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) is not … Read more
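
To illustrate the dependency-chain point, here is a minimal C++ sketch. The function names `dot_single` and `dot_multi`, and the choice of four accumulators, are assumptions made for illustration and do not come from the original question. The first function keeps one running sum, so each add must wait for the previous one; the second splits the work across independent sums that the CPU can overlap.

```cpp
#include <cstddef>

// Baseline: a single accumulator. Every addition depends on the
// result of the previous one, forming one long dependency chain.
float dot_single(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Sketch: four independent accumulators (the count is an assumption;
// a good value depends on the FP add latency and throughput of the CPU).
// The four chains do not depend on each other inside the loop.
float dot_multi(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // Combine the partial sums once, after the loop.
    float sum = (s0 + s1) + (s2 + s3);
    // Handle the leftover elements when n is not a multiple of 4.
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Because floating-point addition is not associative, `dot_multi` can round differently from `dot_single` on general inputs. That difference is why a compiler will not make this transformation on its own without an option such as -ffast-math.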

[Solved] Does (p+x)-x always result in p for pointer p and integer x in gcc linux x86-64 C++

Yes, for gcc 5.x and later, that specific expression is optimized very early to just p, even with optimization disabled, regardless of any possible runtime UB. This happens even with a static array and compile-time constant size. gcc -fsanitize=undefined doesn’t insert any instrumentation to look for it either. Also no warnings at -Wall -Wextra -Wpedantic … Read more
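
For concreteness, the expression under discussion can be isolated in a small function. The name `round_trip` is a hypothetical helper, not from the original question. In ISO C++, forming `p + x` is only defined while the result stays within the same array or one past its end, so the sketch below should only be relied on for in-bounds `x`; the folding for other values is the gcc-specific behaviour described above, not a language guarantee.

```cpp
#include <cstddef>

// Hypothetical helper wrapping the expression from the question.
// Well-defined in ISO C++ only when p + x stays inside the array
// that p points into (or one past its end).
int* round_trip(int* p, std::ptrdiff_t x) {
    return (p + x) - x;
}
```

Compiling this with something like `g++ -O0 -S` and reading the generated assembly is one way to check whether a particular gcc version folds the expression to a plain return of `p`.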