[Solved] loop unrolling not giving expected speedup for floating-point dot product


Your unroll doesn’t help with the FP latency bottleneck:

sum + x + y + z without -ffast-math is the same order of operations as sum += x; sum += y; sum += z; because + is left-associative, so you haven't done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) isn't the bottleneck; the 3-cycle latency of addss on Haswell is, so this unroll makes basically no difference.
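
A minimal sketch of that kind of unroll (function and variable names are made up for illustration, not from your code): every add still has to wait for the previous one, so the loop-carried chain is still one addss per element.

    #include <stddef.h>

    // Without -ffast-math, sum + a + b + c + d evaluates as
    // (((sum + a) + b) + c) + d, so each add waits for the one before it.
    float dot_unrolled_serial(const float *u, const float *v, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i + 4 <= n; i += 4) {
            sum = sum + u[i]*v[i]     + u[i+1]*v[i+1]
                      + u[i+2]*v[i+2] + u[i+3]*v[i+3];
        }
        return sum;   // tail elements omitted for brevity
    }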

What would work is sum += u[i]*v[i] + u[i+1]*v[i+1] + ... as a way to unroll without multiple accumulators, because then the sum of each group of elements is computed independently of the loop-carried sum; only the final add into sum per group is on that chain.
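
A sketch of that grouping (again with illustrative names): sum += expr evaluates the whole right-hand side first, independent of sum, so one addition per group of four elements sits on the loop-carried chain instead of four.

    #include <stddef.h>

    // The grouped unroll: the RHS is computed independently of sum,
    // so only one add per group is on the loop-carried dependency chain.
    float dot_unrolled_grouped(const float *u, const float *v, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i + 4 <= n; i += 4) {
            sum += u[i]*v[i]     + u[i+1]*v[i+1]
                 + u[i+2]*v[i+2] + u[i+3]*v[i+3];
        }
        return sum;   // tail elements omitted
    }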

It costs slightly more math operations that way, like starting with a mul and ending with an add, but the middle ones can still contract into FMAs if you compile with -march=haswell. See comments on AVX performance slower for bitwise xor op and popcount for an example of GCC turning a naive unroll like sum += u0*v0; sum += u1*v1 into sum += u0*v0 + u1*v1;. In that case the problem was slightly different: sum of squared differences like sum += (u0-v0)**2 + (u1-v1)**2;, but it boils down to the same latency problem of ultimately doing some multiplies and adds.
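
For comparison, the sum-of-squared-differences shape from that question looks like this with the same grouping trick applied (a sketch with made-up names); the latency argument and the FMA contraction work the same way.

    #include <stddef.h>

    // Same grouping idea for a sum of squared differences: the subtracts and
    // squarings are independent work; only one add per group touches sum.
    float ssd_unrolled_grouped(const float *u, const float *v, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i + 2 <= n; i += 2) {
            float d0 = u[i]   - v[i];
            float d1 = u[i+1] - v[i+1];
            sum += d0*d0 + d1*d1;   // the d1*d1 multiply-and-add can contract into an FMA
        }
        return sum;
    }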

The other way to solve the problem is with multiple accumulators, allowing all the operations to be FMAs. But Haswell has 5-cycle latency FMA and 3-cycle latency addss, so doing the sum += ... addition on its own, not as part of an FMA, actually helps with the latency bottleneck on Haswell (unlike Skylake, where FP add/sub/mul/FMA all have 4-cycle latency). The following all show unrolling with multiple accumulators, rather than adding groups together like the first suggestion (a step towards pairwise summation, which is closer to what you were attempting); a rough sketch follows after the list:

  • Why does mulss take only 3 cycles on Haswell, different from Agner’s instruction tables? (Unrolling FP loops with multiple accumulators)
  • When, if ever, is loop unrolling still useful?
  • Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell
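
A rough sketch of the multiple-accumulator approach (names are illustrative, not from the question): each accumulator is its own dependency chain, so the chains' latencies overlap instead of serializing.

    #include <stddef.h>

    // Four independent accumulators: four separate dependency chains whose
    // latencies overlap instead of serializing.
    float dot_multi_acc(const float *u, const float *v, size_t n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (size_t i = 0; i + 4 <= n; i += 4) {
            s0 += u[i]   * v[i];     // with -march=haswell each of these can
            s1 += u[i+1] * v[i+1];   // contract into one FMA (5-cycle latency),
            s2 += u[i+2] * v[i+2];   // so fully hiding that latency at 2 FMAs/clock
            s3 += u[i+3] * v[i+3];   // would take about 5*2 = 10 chains in flight
        }
        return (s0 + s1) + (s2 + s3);   // combine the chains outside the loop
    }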

FP math instruction throughput isn't the bottleneck for a big dot product on modern CPUs; latency is. (Or load throughput, if you unroll enough.)

As for the part of your question that asks: "Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00."

Each element takes 2 loads, and with only 2 load ports, that's a hard throughput bottleneck (see https://agner.org/optimize/ and https://www.realworldtech.com/haswell-cpu/5/).

I'm assuming you're counting an "element" as an i value, i.e. a pair of floats, one each from udata[i] and vdata[i]. The FP FMA throughput bottleneck is also 2/clock on Haswell (whether scalar, 128-bit, or 256-bit vectors), but a dot product takes 2 loads per FMA. In theory even Sandybridge, or maybe even K8, could achieve 1 element per clock with separate mul and add instructions, since both support 2 loads per clock and have wide enough pipelines to get the loads, mulss, and addss through with some room to spare.
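
Rough per-element accounting for the plain scalar loop (a sketch; exact uop scheduling depends on the compiler output):

    #include <stddef.h>

    // Per element (one i value), the scalar loop needs:
    //   load udata[i]              -> one of Haswell's 2 load ports
    //   load vdata[i]              -> the other load port
    //   mulss + addss (or one FMA) -> FP ports, 2/clock, not the limit here
    // 2 loads per element / 2 load ports = at least 1 cycle per element,
    // i.e. CPE >= 1.00 no matter how the FP math itself is arranged.
    float dot_scalar(const float *udata, const float *vdata, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++)
            sum += udata[i] * vdata[i];
        return sum;
    }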
