[Solved] loop unrolling not giving expected speedup for floating-point dot product

Your unroll doesn’t help with the FP latency bottleneck: without -ffast-math, sum + x + y + z is evaluated in the same order as sum += x; sum += y; …, so you haven’t done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) is not …
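
To make the dependency-chain point concrete, here is a minimal sketch in C. The function names dot_one_chain and dot_four_chains, and the choice of four accumulators, are illustrative assumptions and not taken from the original question. The first version is unrolled but still funnels every add through one accumulator; the second keeps four independent partial sums so the adds can overlap. How many accumulators are actually needed depends on the FP add latency and throughput of the target CPU.

```c
#include <stddef.h>

/* Unrolled by 4, but with a single accumulator. C evaluates
   sum + x + y + z left to right as ((sum + x) + y) + z, so each add
   still waits for the previous one: one serial dependency chain. */
float dot_one_chain(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum = sum + a[i] * b[i]
                  + a[i + 1] * b[i + 1]
                  + a[i + 2] * b[i + 2]
                  + a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by 4 with four independent accumulators (hypothetical
   choice of 4). The four chains do not depend on each other, so their
   adds can be in flight at the same time. */
float dot_four_chains(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);  /* combine partial sums once */
    for (; i < n; i++)          /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}
```

Note that the second version adds the products in a different order, so for general inputs its result can differ from the first in the last bits. That reordering is exactly what the compiler is not allowed to do on its own without -ffast-math, which is why it has to be written out by hand here.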