for i = 0 ... N-1 in steps of 2
    1. Load h[i] and h[i+1], starting from memory address &h+i, placing h[i] in the lower-half of register 1 and h[i+1] in the upper-half of register 1.
    2. Load x[i] and x[i+1], starting from memory address &x+i, placing x[i] in the lower-half of register 2 and x[i+1] in the upper-half of register 2.
    3. Multiply lower-half of register 1 by lower-half of register 2 and place 32-bit result in register 3.
    4. Multiply upper-half of register 1 by upper-half of register 2 and place 32-bit result in register 4.
    5. Add contents of register 3 to running sum stored in register 5.
    6. Add contents of register 4 to running sum stored in register 5.
end
Example 3: Vector dot product using packed data.
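
The loop in Example 3 can be sketched in portable C. This is a minimal illustration, not a definitive implementation: it assumes 16-bit signed elements packed two to a 32-bit register, and the names dot_packed, pack16, and sign_extend16 are hypothetical helpers introduced here. On real hardware, steps 1 and 2 would each be a single 32-bit load and steps 3 to 6 might map to a dual multiply-accumulate instruction; the code below only emulates that register layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack two 16-bit samples into one 32-bit word,
   lo in bits 0-15 and hi in bits 16-31. Stands in for the single
   packed load of steps 1 and 2. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Hypothetical helper: read the low 16 bits of v as a signed
   two's-complement value. */
static int32_t sign_extend16(uint32_t v)
{
    int32_t s = (int32_t)(v & 0xFFFFu);
    return (s >= 0x8000) ? s - 0x10000 : s;
}

/* Dot product of h and x, processing two elements per iteration.
   Assumes n is even and that the 32-bit running sum does not overflow. */
int32_t dot_packed(const int16_t *h, const int16_t *x, size_t n)
{
    assert(n % 2 == 0);
    int32_t sum = 0;                                  /* register 5 */
    for (size_t i = 0; i < n; i += 2) {
        uint32_t r1 = pack16(h[i], h[i + 1]);         /* step 1 */
        uint32_t r2 = pack16(x[i], x[i + 1]);         /* step 2 */
        int32_t r3 = sign_extend16(r1) * sign_extend16(r2);             /* step 3 */
        int32_t r4 = sign_extend16(r1 >> 16) * sign_extend16(r2 >> 16); /* step 4 */
        sum += r3;                                    /* step 5 */
        sum += r4;                                    /* step 6 */
    }
    return sum;
}
```

For h = {1, -2, 3, 4} and x = {5, 6, -7, 8}, dot_packed(h, x, 4) accumulates 5 - 12 - 21 + 32. The loop runs N/2 times instead of N, which is the point of packing two elements into each register.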