# Speeding up multiple vector operations using SIMD

Author: Wojciech Muła 2018-11-14

# Introduction

One step of the k-means algorithm is calculating the distances between all centroids and all samples; then centroids are recalculated and samples are re-assigned. Both centroids and samples are fixed-size vectors.

I was curious how SIMD might help in this task (or similar ones).

# Baseline

C++ code that shows the idea of this step:

```cpp
// M - vector size
// K - number of centroids
// N - number of samples

// ...
using Vector = float[M];
// ...
Vector centroids[K];
float  distances[N];
Vector samples[N];
// ...

for (int i=0; i < K; i++) {
    for (int j=0; j < N; j++) {
        distances[j] = calculate_distance(M, centroids[i], samples[j]);
    }

    // somehow consume 'distances' vector
}
```

The distance function is often the sum of squares of differences (the squared Euclidean distance; the square root can be skipped, as it does not change which centroid is nearest):

```cpp
float scalar_dist_vectors(size_t n, float* a, float* b) {

    float ret = 0.0f;

    for (size_t i=0; i < n; i++) {
        const float d = a[i] - b[i];
        ret += d*d;
    }

    return ret;
}
```
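As a quick worked example, here is a self-contained copy of the function above (with `const` added to the pointers):

```cpp
#include <cassert>
#include <cstddef>

// Same logic as scalar_dist_vectors above (copied so this snippet compiles
// on its own); computes the squared Euclidean distance between a and b.
float scalar_dist_vectors(size_t n, const float* a, const float* b) {
    float ret = 0.0f;
    for (size_t i = 0; i < n; i++) {
        const float d = a[i] - b[i];
        ret += d * d;
    }
    return ret;
}
```

For a = (1, 2, 3, 4) and b = (0, 0, 1, 6) the result is 1 + 4 + 4 + 4 = 13.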

# Optimization

## Distance function

Since we operate on vectors, the most obvious use of SIMD instructions is to vectorize the distance procedure `calculate_distance`.

A decent compiler is able to auto-vectorize such a simple function; GCC and clang require the flag `-ffast-math` to enable this. Here is the inner loop of the assembly code generated by GCC 7.3.1:

```asm
.L7:
        movaps  (%r10,%rax), %xmm1
        movups  (%rcx,%rax), %xmm3
        addl    $1, %r8d
        addq    $16, %rax
        subps   %xmm3, %xmm1
        mulps   %xmm1, %xmm1
        addps   %xmm1, %xmm0
        cmpl    %r11d, %r8d
        jb      .L7
```

Below is a hand-coded version, which does not process the vectors' tail (it assumes n is a multiple of 4).

```cpp
float sse_dist_vector(size_t n, float* a, float* b) {

    __m128 ret = _mm_setzero_ps();

    for (size_t i=0; i < n; i += 4) {

        const __m128 ai = _mm_loadu_ps(a + i);
        const __m128 bi = _mm_loadu_ps(b + i);

        const __m128 d  = _mm_sub_ps(ai, bi);
        const __m128 d2 = _mm_mul_ps(d, d);

        ret = _mm_add_ps(ret, d2);
    }

    float v[4];
    _mm_storeu_ps(v, ret);

    return v[0] + v[1] + v[2] + v[3];
}
```
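If n is not guaranteed to be a multiple of 4, the leftover elements can be handled with a scalar loop after the vector part. A minimal sketch (the name `sse_dist_vector_tail` is mine):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Same SSE body as above, plus a scalar loop for the final n % 4 elements,
// so any vector length is handled correctly.
float sse_dist_vector_tail(size_t n, const float* a, const float* b) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }
    float v[4];
    _mm_storeu_ps(v, acc);
    float ret = v[0] + v[1] + v[2] + v[3];
    for (; i < n; i++) {                // scalar tail
        const float d = a[i] - b[i];
        ret += d * d;
    }
    return ret;
}
```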

## Multiple distances at once

In the inner loop the expression `centroids[i]` is constant. We can exploit this fact and calculate multiple distances at once, minimizing the number of memory fetches from the centroids array and also utilizing all SIMD resources.

So the algorithm becomes:

```cpp
for (int i=0; i < K; i++) {
    calculate_multiple_distances(centroids[i], samples, distances);
}
```

And the actual implementation of `calculate_multiple_distances`, which processes four vectors at once:

```cpp
void sse_dist_vector_many(size_t n, float const_vector[], size_t k, float* in[], float out[]) {

    for (size_t j=0; j < k; j += 4) {

        float* in0 = in[j + 0]; // Note: factoring out these pointers gave a pretty
        float* in1 = in[j + 1]; //       nice boost, as GCC 7.3.1 couldn't handle
        float* in2 = in[j + 2]; //       complex addresses like `in[j + 0][i]` well
        float* in3 = in[j + 3];

        __m128 out0 = _mm_setzero_ps();
        __m128 out1 = _mm_setzero_ps();
        __m128 out2 = _mm_setzero_ps();
        __m128 out3 = _mm_setzero_ps();

        for (size_t i=0; i < n; i += 4) {
            const __m128 vc = _mm_loadu_ps(const_vector + i);

            const __m128 x0 = _mm_loadu_ps(in0 + i);
            const __m128 x1 = _mm_loadu_ps(in1 + i);
            const __m128 x2 = _mm_loadu_ps(in2 + i);
            const __m128 x3 = _mm_loadu_ps(in3 + i);

            const __m128 d0   = _mm_sub_ps(x0, vc);
            const __m128 d0_2 = _mm_mul_ps(d0, d0);
            const __m128 d1   = _mm_sub_ps(x1, vc);
            const __m128 d1_2 = _mm_mul_ps(d1, d1);
            const __m128 d2   = _mm_sub_ps(x2, vc);
            const __m128 d2_2 = _mm_mul_ps(d2, d2);
            const __m128 d3   = _mm_sub_ps(x3, vc);
            const __m128 d3_2 = _mm_mul_ps(d3, d3);

            out0 = _mm_add_ps(out0, d0_2);
            out1 = _mm_add_ps(out1, d1_2);
            out2 = _mm_add_ps(out2, d2_2);
            out3 = _mm_add_ps(out3, d3_2);
        }

        float v[4];

        _mm_storeu_ps(v, out0);
        out[j + 0] = v[0] + v[1] + v[2] + v[3];

        _mm_storeu_ps(v, out1);
        out[j + 1] = v[0] + v[1] + v[2] + v[3];

        _mm_storeu_ps(v, out2);
        out[j + 2] = v[0] + v[1] + v[2] + v[3];

        _mm_storeu_ps(v, out3);
        out[j + 3] = v[0] + v[1] + v[2] + v[3];
    }
}
```

While the naive algorithm fetches values from the centroids table K * N times, here we fetch them K * N/4 times, just 25% of the original memory traffic.
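To check that the batched routine agrees with the scalar baseline, a compact self-contained harness can be used. This is my own simplified reimplementation (not the article's exact code), with the unrolled per-sample statements folded into a small inner loop; it assumes n and k are multiples of 4:

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// For each of k input vectors (k multiple of 4), computes the squared
// Euclidean distance to const_vector, four samples per outer iteration.
void sse_dist_vector_many4(size_t n, const float* const_vector, size_t k,
                           const float* const in[], float out[]) {
    for (size_t j = 0; j < k; j += 4) {
        __m128 acc[4] = { _mm_setzero_ps(), _mm_setzero_ps(),
                          _mm_setzero_ps(), _mm_setzero_ps() };
        for (size_t i = 0; i < n; i += 4) {
            const __m128 vc = _mm_loadu_ps(const_vector + i);
            for (int s = 0; s < 4; s++) {   // the unrolled body, as a loop
                const __m128 d = _mm_sub_ps(_mm_loadu_ps(in[j + s] + i), vc);
                acc[s] = _mm_add_ps(acc[s], _mm_mul_ps(d, d));
            }
        }
        for (int s = 0; s < 4; s++) {       // horizontal sums
            float v[4];
            _mm_storeu_ps(v, acc[s]);
            out[j + s] = v[0] + v[1] + v[2] + v[3];
        }
    }
}
```

The same shape also works as a regression test against `scalar_dist_vectors` on random inputs.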

# Evaluation

The procedures compared below are:

- scalar: the baseline,
- SSE: hand-written distance function,
- SSE (custom): multiple distances at once.

CPU: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz

Compiler: GCC version 7.3.0 (Debian 7.3.0-16)

Each cell shows average cycles, with the speedup relative to the SSE procedure in parentheses.

Vector size 64:

| procedure    | 8 vectors    | 16 vectors    | 32 vectors    | 64 vectors    | 128 vectors    | 256 vectors    |
|--------------|--------------|---------------|---------------|---------------|----------------|----------------|
| scalar       | 7.722 (0.77) | 17.460 (0.67) | 29.995 (0.78) | 59.743 (0.78) | 125.639 (0.82) | 243.951 (0.80) |
| SSE          | 5.918 (1.00) | 11.660 (1.00) | 23.435 (1.00) | 46.609 (1.00) | 103.063 (1.00) | 195.983 (1.00) |
| SSE (custom) | 4.316 (1.37) | 8.262 (1.41)  | 16.068 (1.46) | 31.796 (1.47) | 64.827 (1.59)  | 131.814 (1.49) |

Vector size 128:

| procedure    | 8 vectors    | 16 vectors    | 32 vectors    | 64 vectors    | 128 vectors    | 256 vectors    |
|--------------|--------------|---------------|---------------|---------------|----------------|----------------|
| scalar       | 6.324 (0.81) | 12.562 (0.80) | 27.698 (0.74) | 56.080 (0.77) | 112.804 (0.78) | 205.904 (0.86) |
| SSE          | 5.123 (1.00) | 10.104 (1.00) | 20.384 (1.00) | 43.358 (1.00) | 87.848 (1.00)  | 176.197 (1.00) |
| SSE (custom) | 4.071 (1.26) | 7.378 (1.37)  | 14.630 (1.39) | 29.439 (1.47) | 59.328 (1.48)  | 118.533 (1.49) |

Vector size 256:

| procedure    | 8 vectors    | 16 vectors    | 32 vectors    | 64 vectors    | 128 vectors    | 256 vectors    |
|--------------|--------------|---------------|---------------|---------------|----------------|----------------|
| scalar       | 5.673 (0.86) | 11.791 (0.84) | 24.314 (1.26) | 49.156 (0.84) | 98.680 (0.89)  | 199.404 (0.87) |
| SSE          | 4.870 (1.00) | 9.846 (1.00)  | 30.579 (1.00) | 41.315 (1.00) | 88.000 (1.00)  | 172.773 (1.00) |
| SSE (custom) | 3.577 (1.36) | 6.997 (1.41)  | 13.973 (2.19) | 28.017 (1.47) | 56.665 (1.55)  | 122.198 (1.41) |

Vector size 512:

| procedure    | 8 vectors    | 16 vectors    | 32 vectors    | 64 vectors    | 128 vectors    | 256 vectors    |
|--------------|--------------|---------------|---------------|---------------|----------------|----------------|
| scalar       | 5.512 (0.95) | 11.288 (0.94) | 23.021 (0.91) | 46.159 (0.92) | 96.044 (0.90)  | 193.548 (0.93) |
| SSE          | 5.263 (1.00) | 10.556 (1.00) | 21.044 (1.00) | 42.326 (1.00) | 86.618 (1.00)  | 179.824 (1.00) |
| SSE (custom) | 3.425 (1.54) | 6.830 (1.55)  | 13.708 (1.54) | 27.406 (1.54) | 58.562 (1.48)  | 123.773 (1.45) |

Vector size 1024:

| procedure    | 8 vectors    | 16 vectors    | 32 vectors    | 64 vectors    | 128 vectors    | 256 vectors    |
|--------------|--------------|---------------|---------------|---------------|----------------|----------------|
| scalar       | 5.537 (0.93) | 11.006 (0.93) | 22.110 (0.95) | 46.855 (0.90) | 93.629 (0.92)  | 186.649 (0.92) |
| SSE          | 5.148 (1.00) | 10.225 (1.00) | 20.995 (1.00) | 42.357 (1.00) | 85.820 (1.00)  | 172.297 (1.00) |
| SSE (custom) | 6.028 (0.85) | 6.840 (1.49)  | 13.595 (1.54) | 28.767 (1.47) | 58.689 (1.46)  | 118.137 (1.46) |

Vector size 4096:

| procedure    | 8 vectors    | 16 vectors    | 32 vectors    | 64 vectors    | 128 vectors    | 256 vectors    |
|--------------|--------------|---------------|---------------|---------------|----------------|----------------|
| scalar       | 7.279 (0.93) | 11.265 (0.95) | 22.857 (0.95) | 45.792 (0.93) | 99.772 (0.95)  | 359.419 (0.98) |
| SSE          | 6.805 (1.00) | 10.667 (1.00) | 21.734 (1.00) | 42.647 (1.00) | 94.667 (1.00)  | 352.802 (1.00) |
| SSE (custom) | 4.504 (1.51) | 7.500 (1.42)  | 14.748 (1.47) | 29.902 (1.43) | 64.656 (1.46)  | 302.095 (1.17) |

Vector size 8192:

| procedure    | 8 vectors    | 16 vectors    | 32 vectors    | 64 vectors    | 128 vectors    | 256 vectors    |
|--------------|--------------|---------------|---------------|---------------|----------------|----------------|
| scalar       | 8.293 (0.98) | 13.242 (0.94) | 24.831 (0.98) | 53.673 (0.99) | 180.724 (0.99) | 408.179 (1.00) |
| SSE          | 8.160 (1.00) | 12.426 (1.00) | 24.274 (1.00) | 52.892 (1.00) | 178.407 (1.00) | 406.330 (1.00) |
| SSE (custom) | 4.751 (1.72) | 7.466 (1.66)  | 15.167 (1.60) | 32.015 (1.65) | 145.880 (1.22) | 358.551 (1.13) |

# Conclusions

The results surprised me. I didn't expect that such a simple reorganization of computations could be so beneficial. In this case the speedups mostly range from 1.1 to 1.7, which is pretty nice, especially in the context of k-means, where this operation is the most time-consuming one.

# Source code

Sample code is available on GitHub.