[Solved] loop unrolling not giving expected speedup for floating-point dot product

Your unrolling doesn’t help with the FP latency bottleneck: without -ffast-math, sum + x + y + z is evaluated in the same order as sum += x; sum += y; …, so you haven’t done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) is not …
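
To make the dependency-chain point concrete, here is a minimal sketch in C. It is not the code from the original question: the function names dot_single and dot_multi, the float element type, and the choice of four accumulators are all assumptions made for illustration. The first function has one chain of dependent additions; the second splits the work across independent accumulators so that several additions can be in flight at once.

```c
#include <stddef.h>

/* Hypothetical baseline: one accumulator, so every addition has to wait
   for the previous one (a single dependency chain through sum). */
float dot_single(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical variant: four independent accumulators give four separate
   dependency chains. Four is an arbitrary choice for this sketch; the
   useful number depends on the CPU's FP add latency and throughput. */
float dot_multi(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* leftover elements when n % 4 != 0 */
        s0 += a[i] * b[i];

    /* Combine the partial sums once, after the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

Because the second version adds the terms in a different order, its result can differ from the first in the last bits for general inputs. That reordering is exactly what a compiler is not allowed to do on its own without an option such as -ffast-math.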

[Solved] Why is processing a sorted array faster than processing an unsorted array?

You are a victim of branch prediction fail. What is branch prediction? Consider a railroad junction (image by Mecanismo, via Wikimedia Commons, used under the CC BY-SA 3.0 license). Now, for the sake of argument, suppose this is back in the 1800s, before long-distance or radio communication. You are the operator of a junction and …
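
As a rough illustration of the kind of loop where this matters, here is a small C sketch. It is an assumed example, not the code from the excerpted answer: the function name sum_at_least and its threshold parameter are hypothetical. The if inside the loop depends on the data, so the CPU has to guess which way it goes on every iteration.

```c
#include <stddef.h>

/* Hypothetical example: sum only the elements at or above a threshold.
   The condition depends on the data, so whether the branch is taken
   follows whatever pattern the array contents happen to have. */
long long sum_at_least(const int *data, size_t n, int threshold)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}
```

The result is the same whether or not the array is sorted. With sorted input the branch goes one way for a long run and then the other way, which is easy to predict; with random input the outcome changes unpredictably from one element to the next. A compiler may also turn the if into branchless code, in which case the difference can disappear.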