Simple suggestions using popcount

Author:	Wojciech Muła
Added on:	2023-11-20

Contents

Introduction
Alternative approach
- Example
Evaluation
- Git commands
- Python str methods
Source

Introduction

One great feature that some of CLI applications and compilers recently gained is providing suggestions in the case of misspellings arguments or options. For instance, Python suggests possible method/fields names, like:

>>> 'lower'.is_upper()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'is_upper'. Did you mean: 'isupper'?

If we want to include similar feature in our program, then a quite obvious solution is to use Levenshtein distance. Its basic implementation is simple and short. If we want suggestion for larger corpus, we may use tries to speedup matching — see: Fast and Easy Levenshtein distance using a Trie by Steve Hanov.

Despite that, I was curious if simpler algorithm would do the job. We may assume that our users know what they are doing. Their inputs will be sane, with an exception of minor mistakes. Like omitting a single letter, swapping two adjacent letters or maybe hitting a wrong key from time to time.

The two additional assumptions I made: we search suggestions in small sets, and we use only ASCII letters (128 possible bytes).

Alternative approach

To estimate edit distance we use two histogram-based approaches:

byte histogram — exact histogram for each byte of a string,
bit histogram — mark in a bit-set which bytes from a string are present.

In the first case, histogram takes 128 bytes. Similarity is calculated as the sum of absolute values.

def diff1(h1, h2):
    s = 0
    for i in range(128):
        s += abs(h1[i] - h2[i])

    return s

In the second case histogram uses 128 bits = 16 bytes. Similarity uses Hamming distance — it is the number of 1 bits in the word calculated as a XOR of two corresponding words from histograms.

def diff2(h1, h2):
    # assume words are 64-bit numbers, thus we have 4-element arrays
    s = 0
    for i in range(4):
        d = h1[i] ^ h2[i]
        s += d.bit_count()

    return s

Let's assume a valid word "rebase" (the git command). And a misspelled sample is "revase" — the letter 'b' got replaced with the 'v'. The Levenshtein distance is 1.

Example

The byte histograms for the words are:

h1("rebase") = [..., 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, ...]
h2("revase") = [..., 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, ...]
                        ^                                                           ^

The estimated distance is 2.

Bit histograms are:

h1("rebase") = [... 0000000000001100000000000010011000000000000000000000000000000000 ...]
h2("revase") = [... 0000000001001100000000000010001000000000000000000000000000000000 ...]
                             ^                   ^

The estimated distance is also 2.

Evaluation

The two sets of words were tested:

git commands (clone, init, add, mv, restore, rm, bisect, diff, grep, log, show, status, branch, commit, merge, rebase, reset, switch, tag, fetch, pull, push),
Python's str methods (capitalize, casefold, center, count, encode, endswith, expandtabs, find, format, format_map, index, isalnum, isalpha, isascii, isdecimal, isdigit, isidentifier, islower, isnumeric, isprintable, isspace, istitle, isupper, join, ljust, lower, lstrip, maketrans, partition, removeprefix, removesuffix, replace, rfind, rindex, rjust, rpartition, rsplit, rstrip, split, splitlines, startswith, strip, swapcase, title, translate, upper, zfill).

Each word was taken and got modified in the following ways:

remove i-th character,
replace i-th with an ASCII letters,
swap i-th and j-th characters,
insert at i-th position an ASCII letter.

A hit is defined as a suggestion that yields the original word.

	average hits
set	Levenshtein	Byte histogram	Bit histogram
git commands	99.64%	98.05%	96.14%
str methods	98.84%	95.32%	90.02%

Git commands

word	cases	Levenshtein		Byte histogram		Bit histogram
word	cases	misses	hits	misses	hits	misses	hits
clone	301	0	100.00%	0	100.00%	0	100.00%
init	244	0	100.00%	0	100.00%	0	100.00%
add	188	0	100.00%	2	98.94%	2	100.00%
mv	133	1	99.25%	4	96.99%	4	100.00%
restore	418	0	100.00%	26	93.78%	26	100.00%
rm	133	5	96.24%	33	75.19%	33	76.69%
bisect	359	0	100.00%	0	100.00%	0	100.00%
diff	244	0	100.00%	3	98.77%	3	100.00%
grep	244	0	100.00%	3	98.77%	3	97.95%
log	188	0	100.00%	4	97.87%	4	97.87%
show	244	0	100.00%	0	100.00%	0	100.00%
status	359	0	100.00%	2	99.44%	2	100.00%
branch	359	0	100.00%	0	100.00%	0	100.00%
commit	359	0	100.00%	2	99.44%	2	100.00%
merge	301	1	99.67%	62	79.40%	62	99.00%
rebase	359	0	100.00%	4	98.89%	4	99.44%
reset	301	1	99.67%	24	92.03%	24	94.35%
switch	359	0	100.00%	9	97.49%	9	97.49%
tag	188	3	98.40%	12	93.62%	12	96.81%
fetch	301	0	100.00%	0	100.00%	0	100.00%
pull	244	0	100.00%	4	98.36%	4	100.00%
push	244	3	98.77%	10	95.90%	10	97.54%
		average: 99.64%		average: 98.05%		average: 96.14%

Python str methods

word	cases	Levenshtein		Byte histogram		Bit histogram
word	cases	misses	hits	misses	hits	misses	hits
capitalize	601	0	100.00%	0	100.00%	0	100.00%
casefold	478	0	100.00%	0	100.00%	0	100.00%
center	359	0	100.00%	0	100.00%	0	100.00%
count	301	0	100.00%	4	98.67%	4	100.00%
encode	359	0	100.00%	3	99.16%	3	99.44%
endswith	478	0	100.00%	0	100.00%	0	100.00%
expandtabs	601	0	100.00%	0	100.00%	0	100.00%
find	244	0	100.00%	3	98.77%	3	98.77%
format	359	0	100.00%	0	100.00%	0	100.00%
format_map	601	0	100.00%	53	91.18%	53	100.00%
index	301	0	100.00%	6	98.01%	6	99.34%
isalnum	418	0	100.00%	0	100.00%	0	100.00%
isalpha	418	0	100.00%	1	99.76%	1	100.00%
isascii	418	0	100.00%	3	99.28%	3	100.00%
isdecimal	539	0	100.00%	2	99.63%	2	100.00%
isdigit	418	0	100.00%	0	100.00%	0	100.00%
isidentifier	728	0	100.00%	4	99.45%	4	100.00%
islower	418	0	100.00%	0	100.00%	0	100.00%
isnumeric	539	0	100.00%	0	100.00%	0	100.00%
isprintable	664	0	100.00%	0	100.00%	0	100.00%
isspace	418	0	100.00%	54	87.08%	54	99.04%
istitle	418	0	100.00%	31	92.58%	31	100.00%
isupper	418	0	100.00%	6	98.56%	6	100.00%
join	244	0	100.00%	4	98.36%	4	98.36%
ljust	301	0	100.00%	6	98.01%	6	100.00%
lower	301	2	99.34%	22	92.69%	22	92.69%
lstrip	359	3	99.16%	60	83.29%	60	85.24%
maketrans	539	0	100.00%	2	99.63%	2	100.00%
partition	539	0	100.00%	0	100.00%	0	100.00%
removeprefix	728	0	100.00%	2	99.73%	2	100.00%
removesuffix	728	0	100.00%	2	99.73%	2	100.00%
replace	418	0	100.00%	6	98.56%	6	100.00%
rfind	301	33	89.04%	30	90.03%	30	90.03%
rindex	359	34	90.53%	31	91.36%	31	91.92%
rjust	301	33	89.04%	37	87.71%	37	87.71%
rpartition	601	37	93.84%	601	0.00%	601	91.01%
rsplit	359	4	98.89%	359	0.00%	359	0.00%
rstrip	359	34	90.53%	27	92.48%	27	81.89%
split	301	29	90.37%	74	75.42%	74	96.68%
splitlines	601	0	100.00%	77	87.19%	77	100.00%
startswith	601	0	100.00%	3	99.50%	3	100.00%
strip	301	31	89.70%	301	0.00%	301	85.05%
swapcase	478	0	100.00%	42	91.21%	42	99.37%
title	301	3	99.00%	12	96.01%	12	92.03%
translate	539	0	100.00%	7	98.70%	7	98.89%
upper	301	2	99.34%	25	91.69%	25	92.69%
zfill	301	0	100.00%	8	97.34%	8	100.00%
		average: 98.84%		average: 95.32%		average: 90.02%

Source

All programs are available on Github.