# SIMD-friendly Rabin-Karp modification

Author: Wojciech Muła 2014-03-11 2016-01-17 (results from Core i5)

# Algorithm

Rabin-Karp algorithm uses a weak hash function to locate possible substring positions. This modification uses merely equality of the first and the last char of searched substring, however equality of chars can be done very fast in parallel, even without SIMD instruction.

Let packed_byte(x) is a function that fills a CPU register with byte x, for example on 32-bit architecture:

```// packed_byte(0xab) = 0xabababab
uint32_t packed_byte(uint8_t byte) {
return 0x01010101 * byte;
}
```

In a single iteration two registers are filled with part of string:

```const size_t n = strlen(string);
const size_t k = strlen(substring);

const size_t first = packed_byte(substring[0]);
const size_t last  = packed_byte(substring[k - 1]);

for (size_t i = 0; i < n - k; i += 4) {
const uint32_t block_first = string[i     .. i + 4];
const uint32_t block_last  = string[i + k .. i + k + 4];

...
}
```

Then parallel comparison is done with simple xor:

```const uint32_t first_zeros = block_first ^ first;
const uint32_t last_zeros  = block_last  ^ last;
```

Zero bytes in first_zeros and last_zeros indicate equality of chars. The positions of zero bytes have to be equal, so an additional bit or is needed:

```const uint32_t zeros = first_zeros | last_zeros;
```

Getting zeros requires only:

• two memory fetches,
• two bit-xor,
• one bit-or.

Now we have to check if zeros has any zero bytes, then iterate through zeros and perform byte-wise comparisons for all zero bytes:

```if (has_zero_byte(zeros)) {
for (int j = 0; j < 4; j++) {
if (is_zero(zeros, j) && memcmp(&string[i + j], substring, k) == 0) {
return i + z;
}
}
}
```

Function has_zero_byte(uint32_t word) could be implemented using algorithm from Bit Twiddling Hacks — Determine if a word has a zero byte. Function is_zero(uint32_t word, int k) may use results from has_zero_byte (warning: this method has a drawback).

The worst case complexity is O(n*m). However the method minimize memory fetches during a string scan, also comparisons are performed in parallel, and no preprocessing is required (except two packed_byte calls before the main loop).

# Sample programs

Sample implementation is available.

After run make analyze32 or make analyze64 famous document 386intel.txt is downloaded, split to words and all these words (~4500) are searched in the text.

Following function are measured:

• C++ std::string::find
• C strlen
• custom — implementation of the algorithm presented here

# Experiments

## Results from Core i5 (64-bit architecture)

GCC 4.9.2:

```custom vs std::string::find
----------------------------------------------------------------------------
slower:    12  ( 0.26%) =
same speed:    30  ( 0.65%) =
2.00 - 2.99 faster:   447  ( 9.75%) ====
3.00 - 3.99 faster:  3082  (67.22%) =================================
4.00 - 4.99 faster:   782  (17.06%) ========
5.00 - 5.99 faster:   160  ( 3.49%) =
6.00 - ...  faster:    68  ( 1.48%) =

custom vs strlen
----------------------------------------------------------------------------
slower:  4506  (98.28%) =================================================
same speed:    65  ( 1.42%) =
2.00 - 2.99 faster: N/A
3.00 - 3.99 faster: N/A
4.00 - 4.99 faster: N/A
5.00 - 5.99 faster: N/A
6.00 - ...  faster:    14  ( 0.31%) =
```

## Results from Pentium M (32-bit architecture)

```custom vs std::string::find
----------------------------------------------------------------------------
slower:    19  ( 0.41%) =
same speed:    78  ( 1.70%) =
2.00 - 2.99 faster:  3282  (71.58%) ===================================
3.00 - 3.99 faster:  1037  (22.62%) ===========
4.00 - 4.99 faster:    32  ( 0.70%) =
5.00 - 5.99 faster:     7  ( 0.15%) =
6.00 - ...  faster:     4  ( 0.09%) =

custom vs strlen
----------------------------------------------------------------------------
slower:    10  ( 0.22%) =
same speed:    49  ( 1.07%) =
2.00 - 2.99 faster:   806  (17.58%) ========
3.00 - 3.99 faster:  1520  (33.15%) ================
4.00 - 4.99 faster:  1003  (21.88%) ==========
5.00 - 5.99 faster:   928  (20.24%) ==========
6.00 - ...  faster:   219  ( 4.78%) ==
```

## Summary

• For the 32-bit code the presented method is around 2-4 times faster than C++ find, and around 2-6 times faster than C strlen.
• For the 64-bit code the presented method is faster than C++ find, however plan C strlen beats it.