Author: | Wojciech Muła |
---|---|
Added on: | 2023-11-20 |
Contents
One great feature that some of CLI applications and compilers recently gained is providing suggestions in the case of misspellings arguments or options. For instance, Python suggests possible method/fields names, like:
>>> 'lower'.is_upper() Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'str' object has no attribute 'is_upper'. Did you mean: 'isupper'?
If we want to include similar feature in our program, then a quite obvious solution is to use Levenshtein distance. Its basic implementation is simple and short. If we want suggestion for larger corpus, we may use tries to speedup matching — see: Fast and Easy Levenshtein distance using a Trie by Steve Hanov.
Despite that, I was curious if simpler algorithm would do the job. We may assume that our users know what they are doing. Their inputs will be sane, with an exception of minor mistakes. Like omitting a single letter, swapping two adjacent letters or maybe hitting a wrong key from time to time.
The two additional assumptions I made: we search suggestions in small sets, and we use only ASCII letters (128 possible bytes).
To estimate edit distance we use two histogram-based approaches:
In the first case, histogram takes 128 bytes. Similarity is calculated as the sum of absolute values.
def diff1(h1, h2): s = 0 for i in range(128): s += abs(h1[i] - h2[i]) return s
In the second case histogram uses 128 bits = 16 bytes. Similarity uses Hamming distance — it is the number of 1 bits in the word calculated as a XOR of two corresponding words from histograms.
def diff2(h1, h2): # assume words are 64-bit numbers, thus we have 4-element arrays s = 0 for i in range(4): d = h1[i] ^ h2[i] s += d.bit_count() return s
Let's assume a valid word "rebase" (the git command). And a misspelled sample is "revase" — the letter 'b' got replaced with the 'v'. The Levenshtein distance is 1.
The byte histograms for the words are:
h1("rebase") = [..., 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, ...] h2("revase") = [..., 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, ...] ^ ^
The estimated distance is 2.
Bit histograms are:
h1("rebase") = [... 0000000000001100000000000010011000000000000000000000000000000000 ...] h2("revase") = [... 0000000001001100000000000010001000000000000000000000000000000000 ...] ^ ^
The estimated distance is also 2.
The two sets of words were tested:
Each word was taken and got modified in the following ways:
A hit is defined as a suggestion that yields the original word.
average hits | |||
---|---|---|---|
set | Levenshtein | Byte histogram | Bit histogram |
git commands | 99.64% | 98.05% | 96.14% |
str methods | 98.84% | 95.32% | 90.02% |
word | cases | Levenshtein | Byte histogram | Bit histogram | |||
---|---|---|---|---|---|---|---|
misses | hits | misses | hits | misses | hits | ||
clone | 301 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
init | 244 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
add | 188 | 0 | 100.00% | 2 | 98.94% | 2 | 100.00% |
mv | 133 | 1 | 99.25% | 4 | 96.99% | 4 | 100.00% |
restore | 418 | 0 | 100.00% | 26 | 93.78% | 26 | 100.00% |
rm | 133 | 5 | 96.24% | 33 | 75.19% | 33 | 76.69% |
bisect | 359 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
diff | 244 | 0 | 100.00% | 3 | 98.77% | 3 | 100.00% |
grep | 244 | 0 | 100.00% | 3 | 98.77% | 3 | 97.95% |
log | 188 | 0 | 100.00% | 4 | 97.87% | 4 | 97.87% |
show | 244 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
status | 359 | 0 | 100.00% | 2 | 99.44% | 2 | 100.00% |
branch | 359 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
commit | 359 | 0 | 100.00% | 2 | 99.44% | 2 | 100.00% |
merge | 301 | 1 | 99.67% | 62 | 79.40% | 62 | 99.00% |
rebase | 359 | 0 | 100.00% | 4 | 98.89% | 4 | 99.44% |
reset | 301 | 1 | 99.67% | 24 | 92.03% | 24 | 94.35% |
switch | 359 | 0 | 100.00% | 9 | 97.49% | 9 | 97.49% |
tag | 188 | 3 | 98.40% | 12 | 93.62% | 12 | 96.81% |
fetch | 301 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
pull | 244 | 0 | 100.00% | 4 | 98.36% | 4 | 100.00% |
push | 244 | 3 | 98.77% | 10 | 95.90% | 10 | 97.54% |
average: 99.64% | average: 98.05% | average: 96.14% |
word | cases | Levenshtein | Byte histogram | Bit histogram | |||
---|---|---|---|---|---|---|---|
misses | hits | misses | hits | misses | hits | ||
capitalize | 601 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
casefold | 478 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
center | 359 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
count | 301 | 0 | 100.00% | 4 | 98.67% | 4 | 100.00% |
encode | 359 | 0 | 100.00% | 3 | 99.16% | 3 | 99.44% |
endswith | 478 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
expandtabs | 601 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
find | 244 | 0 | 100.00% | 3 | 98.77% | 3 | 98.77% |
format | 359 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
format_map | 601 | 0 | 100.00% | 53 | 91.18% | 53 | 100.00% |
index | 301 | 0 | 100.00% | 6 | 98.01% | 6 | 99.34% |
isalnum | 418 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
isalpha | 418 | 0 | 100.00% | 1 | 99.76% | 1 | 100.00% |
isascii | 418 | 0 | 100.00% | 3 | 99.28% | 3 | 100.00% |
isdecimal | 539 | 0 | 100.00% | 2 | 99.63% | 2 | 100.00% |
isdigit | 418 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
isidentifier | 728 | 0 | 100.00% | 4 | 99.45% | 4 | 100.00% |
islower | 418 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
isnumeric | 539 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
isprintable | 664 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
isspace | 418 | 0 | 100.00% | 54 | 87.08% | 54 | 99.04% |
istitle | 418 | 0 | 100.00% | 31 | 92.58% | 31 | 100.00% |
isupper | 418 | 0 | 100.00% | 6 | 98.56% | 6 | 100.00% |
join | 244 | 0 | 100.00% | 4 | 98.36% | 4 | 98.36% |
ljust | 301 | 0 | 100.00% | 6 | 98.01% | 6 | 100.00% |
lower | 301 | 2 | 99.34% | 22 | 92.69% | 22 | 92.69% |
lstrip | 359 | 3 | 99.16% | 60 | 83.29% | 60 | 85.24% |
maketrans | 539 | 0 | 100.00% | 2 | 99.63% | 2 | 100.00% |
partition | 539 | 0 | 100.00% | 0 | 100.00% | 0 | 100.00% |
removeprefix | 728 | 0 | 100.00% | 2 | 99.73% | 2 | 100.00% |
removesuffix | 728 | 0 | 100.00% | 2 | 99.73% | 2 | 100.00% |
replace | 418 | 0 | 100.00% | 6 | 98.56% | 6 | 100.00% |
rfind | 301 | 33 | 89.04% | 30 | 90.03% | 30 | 90.03% |
rindex | 359 | 34 | 90.53% | 31 | 91.36% | 31 | 91.92% |
rjust | 301 | 33 | 89.04% | 37 | 87.71% | 37 | 87.71% |
rpartition | 601 | 37 | 93.84% | 601 | 0.00% | 601 | 91.01% |
rsplit | 359 | 4 | 98.89% | 359 | 0.00% | 359 | 0.00% |
rstrip | 359 | 34 | 90.53% | 27 | 92.48% | 27 | 81.89% |
split | 301 | 29 | 90.37% | 74 | 75.42% | 74 | 96.68% |
splitlines | 601 | 0 | 100.00% | 77 | 87.19% | 77 | 100.00% |
startswith | 601 | 0 | 100.00% | 3 | 99.50% | 3 | 100.00% |
strip | 301 | 31 | 89.70% | 301 | 0.00% | 301 | 85.05% |
swapcase | 478 | 0 | 100.00% | 42 | 91.21% | 42 | 99.37% |
title | 301 | 3 | 99.00% | 12 | 96.01% | 12 | 92.03% |
translate | 539 | 0 | 100.00% | 7 | 98.70% | 7 | 98.89% |
upper | 301 | 2 | 99.34% | 25 | 91.69% | 25 | 92.69% |
zfill | 301 | 0 | 100.00% | 8 | 97.34% | 8 | 100.00% |
average: 98.84% | average: 95.32% | average: 90.02% |
All programs are available on Github.