Author: | Wojciech Muła |
---|---|
Added on: | 2025-01-03 |
This is a follow-up text for Dividing unsigned 8-bit numbers. We checked if dividing 16-bit unsigned numbers is also feasible for SIMD instructions.
Apart from obvious path, where we use floating-point division (DIVPS), 8-bit numbers could also utilize the approximate reciprocal instruction RCPPS.
Unfortunately, for 16-bit numbers the latter instruction cannot be used directly. To properly divide 16-bit integers we need to perform a single step of the Newton-Raphson algorithm, which kills performance.
Summary:
All benchmark programs were compiled with -O3 -march=native options on each machine separately.
Procedure | Comments |
---|---|
scalar | plain 16-bit division |
AVX2 | division followed by integer casting with truncation (CVTTPS2DQ) |
AVX2 (x2) | the base AVX2 procedure unrolled twice |
AVX2 (x4) | the base AVX2 procedure unrolled fourfold |
AVX2 (rcp) | multiplication by approximate reciprocal followed by a single step of the Newton-Raphson algorithm |
AVX512 | division followed by integer casting with truncation (CVTTPS2DQ) |
AVX512 (x2) | the base AVX-512 procedure unrolled twice |
AVX512 (x4) | the base AVX-512 procedure unrolled fourfold |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
scalar | 1.783 | 1.759 | 1.0 | ████▏ |
AVX2 | 0.208 | 0.182 | 9.7 | ████████████████████████████████████████ |
AVX2 (x2) | 0.201 | 0.185 | 9.5 | ███████████████████████████████████████▎ |
AVX2 (x4) | 0.215 | 0.188 | 9.4 | ██████████████████████████████████████▋ |
AVX2 (rcp) | 0.323 | 0.302 | 5.8 | ████████████████████████ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
scalar | 3.714 | 3.368 | 1.0 | ████▍ |
AVX2 | 0.507 | 0.389 | 8.7 | ██████████████████████████████████████▋ |
AVX2 (x2) | 0.477 | 0.384 | 8.8 | ███████████████████████████████████████▏ |
AVX2 (x4) | 0.518 | 0.376 | 9.0 | ████████████████████████████████████████ |
AVX2 (rcp) | 0.531 | 0.440 | 7.7 | ██████████████████████████████████▏ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
scalar | 6.032 | 6.014 | 1.0 | ██▋ |
AVX2 | 0.630 | 0.628 | 9.6 | █████████████████████████▌ |
AVX2 (x2) | 0.628 | 0.626 | 9.6 | █████████████████████████▌ |
AVX2 (x4) | 0.404 | 0.401 | 15.0 | ████████████████████████████████████████ |
AVX2 (rcp) | 0.501 | 0.491 | 12.2 | ████████████████████████████████▋ |
AVX512 | 0.431 | 0.428 | 14.1 | █████████████████████████████████████▍ |
AVX512 (x4) | 0.407 | 0.405 | 14.8 | ███████████████████████████████████████▌ |
AVX512 (x2) | 0.415 | 0.413 | 14.6 | ██████████████████████████████████████▊ |
procedure | time in cycles per byte | speed-up | ||
---|---|---|---|---|
average | best | |||
scalar | 8.116 | 8.092 | 1.0 | █████▊ |
AVX2 | 1.235 | 1.212 | 6.7 | ██████████████████████████████████████▍ |
AVX2 (x2) | 1.236 | 1.219 | 6.6 | ██████████████████████████████████████▏ |
AVX2 (x4) | 1.234 | 1.210 | 6.7 | ██████████████████████████████████████▍ |
AVX2 (rcp) | 1.394 | 1.368 | 5.9 | ██████████████████████████████████ |
AVX512 | 1.197 | 1.183 | 6.8 | ███████████████████████████████████████▎ |
AVX512 (x4) | 1.196 | 1.181 | 6.9 | ███████████████████████████████████████▍ |
AVX512 (x2) | 1.185 | 1.164 | 7.0 | ████████████████████████████████████████ |
Sample implementation is available at GitHub.