Dividing unsigned 16-bit numbers

Author: Wojciech Muła
Added on:2025-01-03

Contents

Introduction

This is a follow-up text for Dividing unsigned 8-bit numbers. We checked if dividing 16-bit unsigned numbers is also feasible for SIMD instructions.

Apart from obvious path, where we use floating-point division (DIVPS), 8-bit numbers could also utilize the approximate reciprocal instruction RCPPS.

Unfortunately, for 16-bit numbers the latter instruction cannot be used directly. To properly divide 16-bit integers we need to perform a single step of the Newton-Raphson algorithm, which kills performance.

Experiment results

Summary:

All benchmark programs were compiled with -O3 -march=native options on each machine separately.

Tested procedures
Procedure Comments
scalar plain 16-bit division
AVX2 division followed by integer casting with truncation (CVTTPS2DQ)
AVX2 (x2) the base AVX2 procedure unrolled twice
AVX2 (x4) the base AVX2 procedure unrolled fourfold
AVX2 (rcp) multiplication by approximate reciprocal followed by a single step of the Newton-Raphson algorithm
AVX512 division followed by integer casting with truncation (CVTTPS2DQ)
AVX512 (x2) the base AVX-512 procedure unrolled twice
AVX512 (x4) the base AVX-512 procedure unrolled fourfold

Ryzen 7

  • Compiler: gcc (Debian 14.1.0-5) 14.1.0
  • CPU: AMD Ryzen 7 7730U with Radeon Graphics
procedure time in cycles per byte speed-up
  average best    
scalar 1.783 1.759 1.0 ████▏
AVX2 0.208 0.182 9.7 ████████████████████████████████████████
AVX2 (x2) 0.201 0.185 9.5 ███████████████████████████████████████▎
AVX2 (x4) 0.215 0.188 9.4 ██████████████████████████████████████▋
AVX2 (rcp) 0.323 0.302 5.8 ████████████████████████

AlderLake

  • Compiler: gcc (Debian 13.2.0-25) 13.2.0
  • CPU: 12th Gen Intel(R) Core(TM) i7-1255U
procedure time in cycles per byte speed-up
  average best    
scalar 3.714 3.368 1.0 ████▍
AVX2 0.507 0.389 8.7 ██████████████████████████████████████▋
AVX2 (x2) 0.477 0.384 8.8 ███████████████████████████████████████▏
AVX2 (x4) 0.518 0.376 9.0 ████████████████████████████████████████
AVX2 (rcp) 0.531 0.440 7.7 ██████████████████████████████████▏

Icelake

  • Compiler: gcc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2)
  • CPU: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
procedure time in cycles per byte speed-up
  average best    
scalar 6.032 6.014 1.0 ██▋
AVX2 0.630 0.628 9.6 █████████████████████████▌
AVX2 (x2) 0.628 0.626 9.6 █████████████████████████▌
AVX2 (x4) 0.404 0.401 15.0 ████████████████████████████████████████
AVX2 (rcp) 0.501 0.491 12.2 ████████████████████████████████▋
AVX512 0.431 0.428 14.1 █████████████████████████████████████▍
AVX512 (x4) 0.407 0.405 14.8 ███████████████████████████████████████▌
AVX512 (x2) 0.415 0.413 14.6 ██████████████████████████████████████▊

Skylake-X

  • CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
  • Compiler: gcc (GCC) 11.2.0
procedure time in cycles per byte speed-up
  average best    
scalar 8.116 8.092 1.0 █████▊
AVX2 1.235 1.212 6.7 ██████████████████████████████████████▍
AVX2 (x2) 1.236 1.219 6.6 ██████████████████████████████████████▏
AVX2 (x4) 1.234 1.210 6.7 ██████████████████████████████████████▍
AVX2 (rcp) 1.394 1.368 5.9 ██████████████████████████████████
AVX512 1.197 1.183 6.8 ███████████████████████████████████████▎
AVX512 (x4) 1.196 1.181 6.9 ███████████████████████████████████████▍
AVX512 (x2) 1.185 1.164 7.0 ████████████████████████████████████████

Source code

Sample implementation is available at GitHub.