{"id":14518,"date":"2022-10-08T03:14:35","date_gmt":"2022-10-07T21:44:35","guid":{"rendered":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/"},"modified":"2022-10-08T03:14:35","modified_gmt":"2022-10-07T21:44:35","slug":"solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product","status":"publish","type":"post","link":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/","title":{"rendered":"[Solved] loop unrolling not giving expected speedup for floating-point dot product"},"content":{"rendered":"<div id=\"answer-71332825\" class=\"answer js-answer accepted-answer js-accepted-answer\" data-answerid=\"71332825\" data-parentid=\"71274300\" data-score=\"2\" data-position-on-page=\"1\" data-highest-scored=\"1\" data-question-has-accepted-highest-score=\"1\" itemprop=\"acceptedAnswer\" itemscope itemtype=\"https:\/\/schema.org\/Answer\">\n<div class=\"post-layout\">\n<div class=\"votecell post-layout--left\"><\/div>\n<div class=\"answercell post-layout--right\">\n<div class=\"js-endorsements\" data-for-answer=\"71332825\">\n<\/div>\n<div class=\"s-prose js-post-body\" itemprop=\"text\">\n<p>Your unroll doesn&#8217;t help with the FP latency bottleneck:<\/p>\n<p><code>sum + x + y + z<\/code> without <code>-ffast-math<\/code> is the same order of operations as <code>sum += x;<\/code> <code>sum += y;<\/code> &#8230; so you haven&#8217;t done anything about the single dependency chain running through all the <code>+<\/code> operations.  
Loop overhead (or front-end throughput) is <em>not<\/em> the bottleneck; the bottleneck is the 3-cycle latency of <code>addss<\/code> on Haswell, so this unroll makes basically no difference.<\/p>\n<p>What would work is <code>sum += u[i]*v[i] + u[i+1]*v[i+1] + ...<\/code> as a way to unroll without multiple accumulators, because <strong>then the sum of each group of elements is independent<\/strong>.<\/p>\n<p>It costs slightly more math operations that way, since each group starts with a mul and ends with an add, but the middle ones can still contract into FMAs if you compile with <code>-march=haswell<\/code>.  See the comments on &#8220;AVX performance slower for bitwise xor op and popcount&#8221; for an example of GCC turning a naive unroll like <code>sum += u0*v0;<\/code> <code>sum += u1*v1<\/code> into <code>sum += u0*v0 + u1*v1;<\/code>.  In that case the problem was slightly different: a sum of squared differences like <code>sum += (u0-v0)**2 + (u1-v1)**2;<\/code>, but it boils down to the same latency problem of ultimately doing some multiplies and adds.<\/p>\n<p>The other way to solve the problem is with multiple accumulators, allowing all the operations to be FMAs.  But Haswell has 5-cycle latency FMA and 3-cycle latency <code>addss<\/code>, so doing the <code>sum += ...<\/code> addition on its own, not as part of an FMA, actually helps with the latency bottleneck on Haswell (unlike on Skylake, where add\/sub\/mul all have 4-cycle latency).  The following Q&amp;As all show unrolling with multiple accumulators, instead of adding groups of elements together (like a first step towards pairwise summation) as you&#8217;re doing:<\/p>\n<ul>\n<li>Why does mulss take only 3 cycles on Haswell, different from Agner&#8217;s instruction tables? 
(Unrolling FP loops with multiple accumulators)<\/li>\n<li>When, if ever, is loop unrolling still useful?<\/li>\n<li>Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell<\/li>\n<\/ul>\n<hr>\n<p>FP math instruction <em>throughput<\/em> isn&#8217;t the bottleneck for a big dot product on modern CPUs, only latency.  Or load throughput if you unroll enough.<\/p>\n<blockquote>\n<p>Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.<\/p>\n<\/blockquote>\n<p>Each element takes 2 loads, and with only 2 load ports, that&#8217;s a hard throughput bottleneck.  (<a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/agner.org\/optimize\/\">https:\/\/agner.org\/optimize\/<\/a> \/ <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/www.realworldtech.com\/haswell-cpu\/5\/\">https:\/\/www.realworldtech.com\/haswell-cpu\/5\/<\/a>)<\/p>\n<p>I&#8217;m assuming you&#8217;re counting an &#8220;element&#8221; as an <code>i<\/code> value, a pair of floats, one each from <code>udata[i]<\/code> and <code>vdata[i]<\/code>. The FP FMA throughput bottleneck is also 2\/clock on Haswell (whether they&#8217;re scalar, 128-bit, or 256-bit vectors), but <strong>dot product takes 2 loads per FMA<\/strong>.  
In theory, even Sandybridge or maybe even K8 could achieve 1 element per clock, with separate mul and add instructions, since they both support 2 loads per clock, and have a wide enough pipeline to get load \/ mulss \/ addss through the pipeline with some room to spare.<\/p>\n<\/div>\n<div class=\"mt24\"><\/div>\n<\/div>\n<p>            <span class=\"d-none\" itemprop=\"commentCount\"><\/span> <\/p><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Your unroll doesn&#8217;t help with the FP latency bottleneck: sum + x + y + z without -ffast-math is the same order of operations as sum += x; sum += y; &#8230; so you haven&#8217;t done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) is &#8230; <a title=\"[Solved] loop unrolling not giving expected speedup for floating-point dot product\" class=\"read-more\" href=\"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/\" aria-label=\"More on [Solved] loop unrolling not giving expected speedup for floating-point dot product\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[320],"tags":[324,326,3798,3799,1034],"class_list":["post-14518","post","type-post","status-publish","format-standard","hentry","category-solved","tag-c","tag-cpu-architecture","tag-dot-product","tag-loop-unrolling","tag-x86-64"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>[Solved] loop unrolling not giving expected speedup for floating-point dot product - JassWeb<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, 
max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"[Solved] loop unrolling not giving expected speedup for floating-point dot product - JassWeb\" \/>\n<meta property=\"og:description\" content=\"[ad_1] Your unroll doesn&#8217;t help with the FP latency bottleneck: sum + x + y + z without -ffast-math is the same order of operations as sum += x; sum += y; &#8230; so you haven&#8217;t done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) is ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/\" \/>\n<meta property=\"og:site_name\" content=\"JassWeb\" \/>\n<meta property=\"article:published_time\" content=\"2022-10-07T21:44:35+00:00\" \/>\n<meta name=\"author\" content=\"Kirat\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Kirat\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\\\/\"},\"author\":{\"name\":\"Kirat\",\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/#\\\/schema\\\/person\\\/65c9c7b7958150c0dc8371fa35dd7c31\"},\"headline\":\"[Solved] loop unrolling not giving expected speedup for floating-point dot product\",\"datePublished\":\"2022-10-07T21:44:35+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\\\/\"},\"wordCount\":471,\"publisher\":{\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/#organization\"},\"keywords\":[\"c++\",\"cpu-architecture\",\"dot-product\",\"loop-unrolling\",\"x86-64\"],\"articleSection\":[\"Solved\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\\\/\",\"url\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\\\/\",\"name\":\"[Solved] loop unrolling not giving expected speedup for floating-point dot product - 
JassWeb\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/#website\"},\"datePublished\":\"2022-10-07T21:44:35+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/jassweb.com\\\/solved\\\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"[Solved] loop unrolling not giving expected speedup for floating-point dot product\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/#website\",\"url\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/\",\"name\":\"JassWeb\",\"description\":\"Build High-quality Websites\",\"publisher\":{\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/#organization\",\"name\":\"Jass 
Web\",\"url\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/jassweb.com\\\/wp-content\\\/uploads\\\/2021\\\/02\\\/jass-website-logo-1.png\",\"contentUrl\":\"https:\\\/\\\/jassweb.com\\\/wp-content\\\/uploads\\\/2021\\\/02\\\/jass-website-logo-1.png\",\"width\":693,\"height\":132,\"caption\":\"Jass Web\"},\"image\":{\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/#\\\/schema\\\/logo\\\/image\\\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/#\\\/schema\\\/person\\\/65c9c7b7958150c0dc8371fa35dd7c31\",\"name\":\"Kirat\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/wp-content\\\/litespeed\\\/avatar\\\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1777008400\",\"url\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/wp-content\\\/litespeed\\\/avatar\\\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1777008400\",\"contentUrl\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/wp-content\\\/litespeed\\\/avatar\\\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1777008400\",\"caption\":\"Kirat\"},\"sameAs\":[\"http:\\\/\\\/jassweb.com\"],\"url\":\"https:\\\/\\\/jassweb.com\\\/solved\\\/author\\\/jaspritsinghghumangmail-com\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"[Solved] loop unrolling not giving expected speedup for floating-point dot product - JassWeb","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/","og_locale":"en_US","og_type":"article","og_title":"[Solved] loop unrolling not giving expected speedup for floating-point dot product - JassWeb","og_description":"[ad_1] Your unroll doesn&#8217;t help with the FP latency bottleneck: sum + x + y + z without -ffast-math is the same order of operations as sum += x; sum += y; &#8230; so you haven&#8217;t done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) is ... Read more","og_url":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/","og_site_name":"JassWeb","article_published_time":"2022-10-07T21:44:35+00:00","author":"Kirat","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Kirat","Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/#article","isPartOf":{"@id":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/"},"author":{"name":"Kirat","@id":"https:\/\/jassweb.com\/solved\/#\/schema\/person\/65c9c7b7958150c0dc8371fa35dd7c31"},"headline":"[Solved] loop unrolling not giving expected speedup for floating-point dot product","datePublished":"2022-10-07T21:44:35+00:00","mainEntityOfPage":{"@id":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/"},"wordCount":471,"publisher":{"@id":"https:\/\/jassweb.com\/solved\/#organization"},"keywords":["c++","cpu-architecture","dot-product","loop-unrolling","x86-64"],"articleSection":["Solved"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/","url":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/","name":"[Solved] loop unrolling not giving expected speedup for floating-point dot product - 
JassWeb","isPartOf":{"@id":"https:\/\/jassweb.com\/solved\/#website"},"datePublished":"2022-10-07T21:44:35+00:00","breadcrumb":{"@id":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/jassweb.com\/solved\/solved-loop-unrolling-not-giving-expected-speedup-for-floating-point-dot-product\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/jassweb.com\/solved\/"},{"@type":"ListItem","position":2,"name":"[Solved] loop unrolling not giving expected speedup for floating-point dot product"}]},{"@type":"WebSite","@id":"https:\/\/jassweb.com\/solved\/#website","url":"https:\/\/jassweb.com\/solved\/","name":"JassWeb","description":"Build High-quality Websites","publisher":{"@id":"https:\/\/jassweb.com\/solved\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/jassweb.com\/solved\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/jassweb.com\/solved\/#organization","name":"Jass Web","url":"https:\/\/jassweb.com\/solved\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/jassweb.com\/solved\/#\/schema\/logo\/image\/","url":"https:\/\/jassweb.com\/wp-content\/uploads\/2021\/02\/jass-website-logo-1.png","contentUrl":"https:\/\/jassweb.com\/wp-content\/uploads\/2021\/02\/jass-website-logo-1.png","width":693,"height":132,"caption":"Jass 
Web"},"image":{"@id":"https:\/\/jassweb.com\/solved\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/jassweb.com\/solved\/#\/schema\/person\/65c9c7b7958150c0dc8371fa35dd7c31","name":"Kirat","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/jassweb.com\/solved\/wp-content\/litespeed\/avatar\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1777008400","url":"https:\/\/jassweb.com\/solved\/wp-content\/litespeed\/avatar\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1777008400","contentUrl":"https:\/\/jassweb.com\/solved\/wp-content\/litespeed\/avatar\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1777008400","caption":"Kirat"},"sameAs":["http:\/\/jassweb.com"],"url":"https:\/\/jassweb.com\/solved\/author\/jaspritsinghghumangmail-com\/"}]}},"_links":{"self":[{"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/posts\/14518","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/comments?post=14518"}],"version-history":[{"count":0,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/posts\/14518\/revisions"}],"wp:attachment":[{"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/media?parent=14518"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/categories?post=14518"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/tags?post=14518"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}