Author: | Wojciech Muła |
---|---|
Added on: | 2018-11-14 |
One step of the k-means algorithm is calculating the distance between all centroids and all samples; then the centroids are recalculated and the samples re-assigned. Both centroids and samples are fixed-size vectors.
I was curious how SIMD might help in this task (or similar ones).
C++ code that shows the idea of this step:
```cpp
// M - vector size
// K - number of centroids
// N - number of samples
// ...
using Vector = float[M];
// ...
Vector centroids[K];
float  distances[N];    // one distance per sample
Vector samples[N];
// ...
for (int i=0; i < K; i++) {
    for (int j=0; j < N; j++) {
        distances[j] = calculate_distance(M, centroids[i], samples[j]);
    }

    // somehow consume 'distances' vector
}
```
The distance function is often the sum of squares of differences:
```cpp
float scalar_dist_vectors(size_t n, float* a, float* b) {
    float ret = 0.0f;
    for (size_t i=0; i < n; i++) {
        const float d = a[i] - b[i];
        ret += d*d;
    }

    return ret;
}
```
Since we operate on vectors, the most obvious use of SIMD instructions is to vectorize the distance procedure calculate_distance.
A decent compiler is able to auto-vectorize such a simple function; GCC and clang require the flag -ffast-math to enable this. Here is the assembly code generated by GCC 7.3.1:
```asm
.L7:
    movaps  (%r10,%rax), %xmm1
    addl    $1, %r8d
    movups  (%rcx,%rax), %xmm3
    addq    $16, %rax
    cmpl    %r11d, %r8d
    subps   %xmm3, %xmm1
    mulps   %xmm1, %xmm1
    addps   %xmm1, %xmm2
    jb      .L7
```
Below is a hand-coded version, which does not handle the vectors' tail (the case when n is not a multiple of 4).
```cpp
float sse_dist_vector(size_t n, float* a, float* b) {
    __m128 ret = _mm_setzero_ps();
    for (size_t i=0; i < n; i += 4) {
        const __m128 ai = _mm_loadu_ps(a + i);
        const __m128 bi = _mm_loadu_ps(b + i);
        const __m128 d  = _mm_sub_ps(ai, bi);
        const __m128 d2 = _mm_mul_ps(d, d);

        ret = _mm_add_ps(ret, d2);
    }

    float v[4];
    _mm_storeu_ps(v, ret);

    return v[0] + v[1] + v[2] + v[3];
}
```
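If n is not a multiple of 4, the missing tail handling can be added with a scalar remainder loop; a minimal sketch (the function name is made up):

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Same as sse_dist_vector, plus a scalar loop for the remaining 0-3 elements.
float sse_dist_vector_tail(size_t n, const float* a, const float* b) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (/**/; i + 4 <= n; i += 4) {
        const __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }

    float v[4];
    _mm_storeu_ps(v, acc);
    float ret = v[0] + v[1] + v[2] + v[3];

    for (/**/; i < n; i++) {        // scalar tail
        const float d = a[i] - b[i];
        ret += d * d;
    }

    return ret;
}
```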
In the inner loop the expression centroids[i] is constant. We can exploit this fact and calculate multiple distances at once, minimizing the number of memory fetches from the centroids array and also utilizing all SIMD resources.
So the algorithm becomes:
```cpp
for (int i=0; i < K; i++) {
    calculate_multiple_distances(centroids[i], samples, distances);
}
```
And here is the actual implementation of calculate_multiple_distances, which processes four vectors at once:
```cpp
void sse_dist_vector_many(size_t n, float const_vector[], size_t k, float* in[], float out[]) {

    for (size_t j=0; j < k; j += 4) {
        float* in0 = in[j + 0]; // Note: factoring out these pointers gave a pretty
        float* in1 = in[j + 1]; // nice boost, as GCC 7.3.1 couldn't handle
        float* in2 = in[j + 2]; // complex addresses like `in[j + 0][i]` well
        float* in3 = in[j + 3];

        __m128 out0 = _mm_setzero_ps();
        __m128 out1 = _mm_setzero_ps();
        __m128 out2 = _mm_setzero_ps();
        __m128 out3 = _mm_setzero_ps();

        for (size_t i=0; i < n; i += 4) {
            const __m128 vc = _mm_loadu_ps(const_vector + i);

            const __m128 x0 = _mm_loadu_ps(in0 + i);
            const __m128 x1 = _mm_loadu_ps(in1 + i);
            const __m128 x2 = _mm_loadu_ps(in2 + i);
            const __m128 x3 = _mm_loadu_ps(in3 + i);

            const __m128 d0   = _mm_sub_ps(x0, vc);
            const __m128 d0_2 = _mm_mul_ps(d0, d0);
            const __m128 d1   = _mm_sub_ps(x1, vc);
            const __m128 d1_2 = _mm_mul_ps(d1, d1);
            const __m128 d2   = _mm_sub_ps(x2, vc);
            const __m128 d2_2 = _mm_mul_ps(d2, d2);
            const __m128 d3   = _mm_sub_ps(x3, vc);
            const __m128 d3_2 = _mm_mul_ps(d3, d3);

            out0 = _mm_add_ps(out0, d0_2);
            out1 = _mm_add_ps(out1, d1_2);
            out2 = _mm_add_ps(out2, d2_2);
            out3 = _mm_add_ps(out3, d3_2);
        }

        float v[4];
        _mm_storeu_ps(v, out0);
        out[j + 0] = v[0] + v[1] + v[2] + v[3];
        _mm_storeu_ps(v, out1);
        out[j + 1] = v[0] + v[1] + v[2] + v[3];
        _mm_storeu_ps(v, out2);
        out[j + 2] = v[0] + v[1] + v[2] + v[3];
        _mm_storeu_ps(v, out3);
        out[j + 3] = v[0] + v[1] + v[2] + v[3];
    }
}
```
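To see the reorganization itself, separately from the intrinsics, here is a plain-scalar equivalent of the same blocking (the function name is made up; as above, k is assumed to be a multiple of 4):

```cpp
#include <cstddef>

// Scalar version of the same blocking: one pass over 'const_vector'
// serves four sample vectors, so each element of the centroid is
// fetched once per group of four samples instead of once per sample.
void scalar_dist_vector_many(size_t n, const float* const_vector,
                             size_t k, const float* in[], float out[]) {
    for (size_t j = 0; j < k; j += 4) {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        for (size_t i = 0; i < n; i++) {
            const float c  = const_vector[i];   // fetched once per 4 samples
            const float d0 = in[j + 0][i] - c;
            const float d1 = in[j + 1][i] - c;
            const float d2 = in[j + 2][i] - c;
            const float d3 = in[j + 3][i] - c;
            s0 += d0 * d0;
            s1 += d1 * d1;
            s2 += d2 * d2;
            s3 += d3 * d3;
        }
        out[j + 0] = s0;
        out[j + 1] = s1;
        out[j + 2] = s2;
        out[j + 3] = s3;
    }
}
```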
While the naive algorithm fetches the centroid vector from memory K * N times, this version does so only K * N/4 times — just 25% of the original.
procedure | description
---|---
scalar | baseline
SSE | hand-written distance function
SSE (custom) | multiple distances at once
CPU: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz
Compiler: GCC version 7.3.0 (Debian 7.3.0-16)
procedure | avg cycles (8 vec.) | speedup | avg cycles (16 vec.) | speedup | avg cycles (32 vec.) | speedup | avg cycles (64 vec.) | speedup | avg cycles (128 vec.) | speedup | avg cycles (256 vec.) | speedup
---|---|---|---|---|---|---|---|---|---|---|---|---
vector size 64 | ||||||||||||
scalar | 7.722 | 0.77 | 17.460 | 0.67 | 29.995 | 0.78 | 59.743 | 0.78 | 125.639 | 0.82 | 243.951 | 0.80 |
SSE | 5.918 | 1.00 | 11.660 | 1.00 | 23.435 | 1.00 | 46.609 | 1.00 | 103.063 | 1.00 | 195.983 | 1.00 |
SSE (custom) | 4.316 | 1.37 | 8.262 | 1.41 | 16.068 | 1.46 | 31.796 | 1.47 | 64.827 | 1.59 | 131.814 | 1.49 |
vector size 128 | ||||||||||||
scalar | 6.324 | 0.81 | 12.562 | 0.80 | 27.698 | 0.74 | 56.080 | 0.77 | 112.804 | 0.78 | 205.904 | 0.86 |
SSE | 5.123 | 1.00 | 10.104 | 1.00 | 20.384 | 1.00 | 43.358 | 1.00 | 87.848 | 1.00 | 176.197 | 1.00 |
SSE (custom) | 4.071 | 1.26 | 7.378 | 1.37 | 14.630 | 1.39 | 29.439 | 1.47 | 59.328 | 1.48 | 118.533 | 1.49 |
vector size 256 | ||||||||||||
scalar | 5.673 | 0.86 | 11.791 | 0.84 | 24.314 | 1.26 | 49.156 | 0.84 | 98.680 | 0.89 | 199.404 | 0.87 |
SSE | 4.870 | 1.00 | 9.846 | 1.00 | 30.579 | 1.00 | 41.315 | 1.00 | 88.000 | 1.00 | 172.773 | 1.00 |
SSE (custom) | 3.577 | 1.36 | 6.997 | 1.41 | 13.973 | 2.19 | 28.017 | 1.47 | 56.665 | 1.55 | 122.198 | 1.41 |
vector size 512 | ||||||||||||
scalar | 5.512 | 0.95 | 11.288 | 0.94 | 23.021 | 0.91 | 46.159 | 0.92 | 96.044 | 0.90 | 193.548 | 0.93 |
SSE | 5.263 | 1.00 | 10.556 | 1.00 | 21.044 | 1.00 | 42.326 | 1.00 | 86.618 | 1.00 | 179.824 | 1.00 |
SSE (custom) | 3.425 | 1.54 | 6.830 | 1.55 | 13.708 | 1.54 | 27.406 | 1.54 | 58.562 | 1.48 | 123.773 | 1.45 |
vector size 1024 | ||||||||||||
scalar | 5.537 | 0.93 | 11.006 | 0.93 | 22.110 | 0.95 | 46.855 | 0.90 | 93.629 | 0.92 | 186.649 | 0.92 |
SSE | 5.148 | 1.00 | 10.225 | 1.00 | 20.995 | 1.00 | 42.357 | 1.00 | 85.820 | 1.00 | 172.297 | 1.00 |
SSE (custom) | 6.028 | 0.85 | 6.840 | 1.49 | 13.595 | 1.54 | 28.767 | 1.47 | 58.689 | 1.46 | 118.137 | 1.46 |
vector size 4096 | ||||||||||||
scalar | 7.279 | 0.93 | 11.265 | 0.95 | 22.857 | 0.95 | 45.792 | 0.93 | 99.772 | 0.95 | 359.419 | 0.98 |
SSE | 6.805 | 1.00 | 10.667 | 1.00 | 21.734 | 1.00 | 42.647 | 1.00 | 94.667 | 1.00 | 352.802 | 1.00 |
SSE (custom) | 4.504 | 1.51 | 7.500 | 1.42 | 14.748 | 1.47 | 29.902 | 1.43 | 64.656 | 1.46 | 302.095 | 1.17 |
vector size 8192 | ||||||||||||
scalar | 8.293 | 0.98 | 13.242 | 0.94 | 24.831 | 0.98 | 53.673 | 0.99 | 180.724 | 0.99 | 408.179 | 1.00 |
SSE | 8.160 | 1.00 | 12.426 | 1.00 | 24.274 | 1.00 | 52.892 | 1.00 | 178.407 | 1.00 | 406.330 | 1.00 |
SSE (custom) | 4.751 | 1.72 | 7.466 | 1.66 | 15.167 | 1.60 | 32.015 | 1.65 | 145.880 | 1.22 | 358.551 | 1.13 |
The results surprised me; I didn't expect that such a simple reorganization of computations would be this beneficial. Here we see speedups roughly in the range 1.1 to 1.7, which is pretty nice, especially in the context of k-means, where this operation is the most time-consuming.
Sample code is available on GitHub.