Simple suggestions using popcount

Author: Wojciech Muła
Added on:2023-11-20

Contents

Introduction

One great feature that some of CLI applications and compilers recently gained is providing suggestions in the case of misspellings arguments or options. For instance, Python suggests possible method/fields names, like:

>>> 'lower'.is_upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'is_upper'. Did you mean: 'isupper'?

If we want to include similar feature in our program, then a quite obvious solution is to use Levenshtein distance. Its basic implementation is simple and short. If we want suggestion for larger corpus, we may use tries to speedup matching — see: Fast and Easy Levenshtein distance using a Trie by Steve Hanov.

Despite that, I was curious if simpler algorithm would do the job. We may assume that our users know what they are doing. Their inputs will be sane, with an exception of minor mistakes. Like omitting a single letter, swapping two adjacent letters or maybe hitting a wrong key from time to time.

The two additional assumptions I made: we search suggestions in small sets, and we use only ASCII letters (128 possible bytes).

Alternative approach

To estimate edit distance we use two histogram-based approaches:

  1. byte histogram — exact histogram for each byte of a string,
  2. bit histogram — mark in a bit-set which bytes from a string are present.

In the first case, histogram takes 128 bytes. Similarity is calculated as the sum of absolute values.

def diff1(h1, h2):
    s = 0
    for i in range(128):
        s += abs(h1[i] - h2[i])

    return s

In the second case histogram uses 128 bits = 16 bytes. Similarity uses Hamming distance — it is the number of 1 bits in the word calculated as a XOR of two corresponding words from histograms.

def diff2(h1, h2):
    # assume words are 64-bit numbers, thus we have 4-element arrays
    s = 0
    for i in range(4):
        d = h1[i] ^ h2[i]
        s += d.bit_count()

    return s

Let's assume a valid word "rebase" (the git command). And a misspelled sample is "revase" — the letter 'b' got replaced with the 'v'. The Levenshtein distance is 1.

Example

The byte histograms for the words are:

h1("rebase") = [..., 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, ...]
h2("revase") = [..., 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, ...]
                        ^                                                           ^

The estimated distance is 2.

Bit histograms are:

h1("rebase") = [... 0000000000001100000000000010011000000000000000000000000000000000 ...]
h2("revase") = [... 0000000001001100000000000010001000000000000000000000000000000000 ...]
                             ^                   ^

The estimated distance is also 2.

Evaluation

The two sets of words were tested:

Each word was taken and got modified in the following ways:

A hit is defined as a suggestion that yields the original word.

  average hits
set Levenshtein Byte histogram Bit histogram
git commands 99.64% 98.05% 96.14%
str methods 98.84% 95.32% 90.02%

Git commands

word cases Levenshtein Byte histogram Bit histogram
misses hits misses hits misses hits
clone 301 0 100.00% 0 100.00% 0 100.00%
init 244 0 100.00% 0 100.00% 0 100.00%
add 188 0 100.00% 2 98.94% 2 100.00%
mv 133 1 99.25% 4 96.99% 4 100.00%
restore 418 0 100.00% 26 93.78% 26 100.00%
rm 133 5 96.24% 33 75.19% 33 76.69%
bisect 359 0 100.00% 0 100.00% 0 100.00%
diff 244 0 100.00% 3 98.77% 3 100.00%
grep 244 0 100.00% 3 98.77% 3 97.95%
log 188 0 100.00% 4 97.87% 4 97.87%
show 244 0 100.00% 0 100.00% 0 100.00%
status 359 0 100.00% 2 99.44% 2 100.00%
branch 359 0 100.00% 0 100.00% 0 100.00%
commit 359 0 100.00% 2 99.44% 2 100.00%
merge 301 1 99.67% 62 79.40% 62 99.00%
rebase 359 0 100.00% 4 98.89% 4 99.44%
reset 301 1 99.67% 24 92.03% 24 94.35%
switch 359 0 100.00% 9 97.49% 9 97.49%
tag 188 3 98.40% 12 93.62% 12 96.81%
fetch 301 0 100.00% 0 100.00% 0 100.00%
pull 244 0 100.00% 4 98.36% 4 100.00%
push 244 3 98.77% 10 95.90% 10 97.54%
  average: 99.64% average: 98.05% average: 96.14%

Python str methods

word cases Levenshtein Byte histogram Bit histogram
misses hits misses hits misses hits
capitalize 601 0 100.00% 0 100.00% 0 100.00%
casefold 478 0 100.00% 0 100.00% 0 100.00%
center 359 0 100.00% 0 100.00% 0 100.00%
count 301 0 100.00% 4 98.67% 4 100.00%
encode 359 0 100.00% 3 99.16% 3 99.44%
endswith 478 0 100.00% 0 100.00% 0 100.00%
expandtabs 601 0 100.00% 0 100.00% 0 100.00%
find 244 0 100.00% 3 98.77% 3 98.77%
format 359 0 100.00% 0 100.00% 0 100.00%
format_map 601 0 100.00% 53 91.18% 53 100.00%
index 301 0 100.00% 6 98.01% 6 99.34%
isalnum 418 0 100.00% 0 100.00% 0 100.00%
isalpha 418 0 100.00% 1 99.76% 1 100.00%
isascii 418 0 100.00% 3 99.28% 3 100.00%
isdecimal 539 0 100.00% 2 99.63% 2 100.00%
isdigit 418 0 100.00% 0 100.00% 0 100.00%
isidentifier 728 0 100.00% 4 99.45% 4 100.00%
islower 418 0 100.00% 0 100.00% 0 100.00%
isnumeric 539 0 100.00% 0 100.00% 0 100.00%
isprintable 664 0 100.00% 0 100.00% 0 100.00%
isspace 418 0 100.00% 54 87.08% 54 99.04%
istitle 418 0 100.00% 31 92.58% 31 100.00%
isupper 418 0 100.00% 6 98.56% 6 100.00%
join 244 0 100.00% 4 98.36% 4 98.36%
ljust 301 0 100.00% 6 98.01% 6 100.00%
lower 301 2 99.34% 22 92.69% 22 92.69%
lstrip 359 3 99.16% 60 83.29% 60 85.24%
maketrans 539 0 100.00% 2 99.63% 2 100.00%
partition 539 0 100.00% 0 100.00% 0 100.00%
removeprefix 728 0 100.00% 2 99.73% 2 100.00%
removesuffix 728 0 100.00% 2 99.73% 2 100.00%
replace 418 0 100.00% 6 98.56% 6 100.00%
rfind 301 33 89.04% 30 90.03% 30 90.03%
rindex 359 34 90.53% 31 91.36% 31 91.92%
rjust 301 33 89.04% 37 87.71% 37 87.71%
rpartition 601 37 93.84% 601 0.00% 601 91.01%
rsplit 359 4 98.89% 359 0.00% 359 0.00%
rstrip 359 34 90.53% 27 92.48% 27 81.89%
split 301 29 90.37% 74 75.42% 74 96.68%
splitlines 601 0 100.00% 77 87.19% 77 100.00%
startswith 601 0 100.00% 3 99.50% 3 100.00%
strip 301 31 89.70% 301 0.00% 301 85.05%
swapcase 478 0 100.00% 42 91.21% 42 99.37%
title 301 3 99.00% 12 96.01% 12 92.03%
translate 539 0 100.00% 7 98.70% 7 98.89%
upper 301 2 99.34% 25 91.69% 25 92.69%
zfill 301 0 100.00% 8 97.34% 8 100.00%
  average: 98.84% average: 95.32% average: 90.02%

Source

All programs are available on Github.