pyahocorasick stabilisation story

Author: Wojciech Muła
Added on: 2019-01-08


Introduction

pyahocorasick is a Python module I started in 2011. At that time I was interested in stringology, and the Aho-Corasick algorithm appeared to be quite challenging. That was a sufficient reason to program it. However, I also decided that the result shouldn't be another proof of concept that nobody — except me — would use. Since I like Python, I chose the form of a C extension, which nicely combines a friendly Python API with an efficient C implementation.

Fast forwarding, the module has gained a few users worldwide. It may not be the most popular package on PyPI, but people keep installing it. Many users have contributed to the code, documentation and infrastructure, or reported bugs and helped with debugging. Philippe Ombredanne helped a lot with different aspects of development — without him the project wouldn't be so great.

This text is the result of recent work on stabilising the module, propelled by fixing a long-standing bug. The bug had been driving me crazy for more than a year. I want to show which means were used to eliminate this and many other bugs, and also how the code quality improved as a side effect. I hope some of you will find inspiration or a solution here.

The bug

Before we start I have to describe the bug, so that nobody repeats my stupid mistake.

The bug was caused by misuse of the Python function Py_BuildValue, which was used by the pickling mechanism. Basically, pickling used to be done as a simple memory dump — the module created a single memory area filled with some binary data.

The invocation Py_BuildValue("y#", ptr, size) constructs a bytes object with a copy of the memory pointed to by ptr, of the given size. The problem is that with this format string the size is passed as an int. I wrongly assumed that on 64-bit machines int is a 64-bit number. That's not true: int has only 32 bits. Because of that, when the size of the memory area was larger than 2GB, strange things happened, as shown in the table below.

64-bit size                               outcome
0 to 0x7fffffff (up to 2GB)               no errors
0x80000000 to 0xffffffff                  int is negative; an empty buffer is created, but no error is reported
larger than 4GB, 32nd bit equal to zero   a buffer of (size & 0x7fffffff) bytes is created
larger than 4GB, 32nd bit equal to one    int is negative; an empty buffer is created, but no error is reported

The solution was pretty simple: the huge memory area is split into several smaller regions and a list of such regions is pickled. The size of a single region is limited to a few megabytes, so it never comes close to the 2GB boundary (although all the data together can still be larger than 2GB).
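To illustrate the idea, here is a minimal sketch of such chunked pickling (the names dump_as_chunks and CHUNK_SIZE are hypothetical, not the module's actual code); each chunk is created with PyBytes_FromStringAndSize, whose length parameter is Py_ssize_t, so no 32-bit truncation is possible:

#include <Python.h>

/* hypothetical sketch: pickle a large buffer as a list of bounded chunks */
#define CHUNK_SIZE ((Py_ssize_t)16 * 1024 * 1024)   /* 16MB per chunk */

static PyObject*
dump_as_chunks(const char* data, Py_ssize_t size) {
    PyObject* list;
    PyObject* chunk;
    Py_ssize_t offset;
    Py_ssize_t len;

    list = PyList_New(0);
    if (list == NULL)
        return NULL;

    for (offset = 0; offset < size; offset += CHUNK_SIZE) {
        len = (size - offset < CHUNK_SIZE) ? (size - offset) : CHUNK_SIZE;

        /* the length argument is Py_ssize_t, never squeezed through an int */
        chunk = PyBytes_FromStringAndSize(data + offset, len);
        if (chunk == NULL || PyList_Append(list, chunk) < 0) {
            Py_XDECREF(chunk);
            Py_DECREF(list);
            return NULL;
        }
        Py_DECREF(chunk);   /* the list holds its own reference now */
    }

    return list;
}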

Problems

The module is written in ANSI C, so it obviously suffers from the problems common to this language; there are also some Python-specific issues. Here is a brief list:

Cure #1 — unit tests

It might sound obvious, but unit tests are crucial when one develops a programming library. In pyahocorasick, unit tests have existed from the very start. They are written in Python using the standard library module unittest; there are no C-level tests.

Firstly, unit tests have to cover the basic functionality, i.e. prevent logic errors. For instance, pyahocorasick provides the functionality of a word dictionary, so unit tests check whether adding a word really adds it, whether looking up a non-existing word fails, etc. Most tests are really simple, but surprisingly they detected a lot of mistakes.

Secondly, unit tests can check some very specific scenarios and also serve as a repository of regression tests. For example, I wrote very detailed tests for the unpickling code, where some binary data is constructed manually and then handled by the module. This set of tests is useful during development; regular users don't need to run it.

Unit tests are the basis for any serious work and further refactoring.

Cure #2 — code coverage

Getting code coverage for a Python extension is really simple. Just compile the module with the -coverage option:

$ export CFLAGS="-coverage"
$ python setup.py build_ext --inplace

Then run unit tests, or other programs which use the module, and get a coverage report using — for instance — gcovr:

$ gcovr --html-details -o DIRECTORY

The coverage report reveals which parts of the C code are untested. We instantly see whether we are able to trigger an untested part via a regular Python test, or whether a branch relies on the result of an external function call. In the latter case we might consider intercepting that function and forcing it to return a certain value — this approach is described in the following sections.

Cure #3 — memory leaks detection

A problem with a C extension for Python is that your C code runs within the same process as the interpreter. The standard build of the CPython interpreter enables a custom allocator. That allocator makes valgrind output unreadable, as valgrind reports tons of leaks or invalid writes/reads, while in fact none of them are bugs.

It's possible to compile the Python interpreter with special valgrind-friendly flags. While compiling a Python interpreter is not a big deal, I decided instead to wrap the malloc, free and (recently) realloc procedures. A special compile-time definition enables the memory debug mode, so a normal compilation has no overhead.
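A rough sketch of what such wrappers can look like in the dump mode is shown below (the names memory_alloc, mem_log and MEMORY_DEBUG are hypothetical; the module's real wrappers differ in details). The log format matches the sample further down:

#include <stdio.h>
#include <stdlib.h>

#ifdef MEMORY_DEBUG

static FILE*  mem_log;       /* assumed to be opened during module initialisation */
static size_t mem_id;        /* unique id of every (re)allocation                 */

static void* memory_alloc(size_t size) {
    void* ptr = malloc(size);
    mem_id += 1;
    fprintf(mem_log, "A %zu %p %zu\n", mem_id, ptr, size);
    return ptr;
}

static void* memory_realloc(void* old, size_t size) {
    void* ptr = realloc(old, size);
    mem_id += 1;
    fprintf(mem_log, "R %zu %p %p %zu\n", mem_id, old, ptr, size);
    return ptr;
}

static void memory_free(void* ptr) {
    fprintf(mem_log, "F %p\n", ptr);
    free(ptr);
}

#else
#   define memory_alloc   malloc
#   define memory_realloc realloc
#   define memory_free    free
#endif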

There are two modes of debugging I implemented:

  • dump all memory-related events to a file;
  • inject memory allocation faults — described in the next section.

The output of the dump mode is a list of actions given in plain text, like this:

A 1 0x7fafd9741148 40
A 2 0x7fafd9741120 40
R 1 (nil) 0x7fafd9d89728 8
A 3 0x7fafd97410f8 40
R 2 (nil) 0x7fafd9d89738 8
F 0x7fafd9da51a0

"A 1 0x7fafd9741148 40" is read as: allocation #1 of 40 bytes retuned address 0x7fafd9741148; "R 2 (nil) 0x7fafd9d89738 8" is read as: "reallocation #1 of 8 bytes, old address was NULL, returned address 0x7fafd9d89738"; finally "F 0x7fafd9da51a0" means "free address 0x7fafd9da51a0".

A simple Python script parses such a file and points out which allocations have no corresponding free. Since each allocation/reallocation has a unique id, we can easily figure out where exactly in the code the leaked object was created. I also considered dumping backtraces (there's a nice backtrace function in GNU libc), but in practice it was much easier to run the program under gdb and obtain the backtrace from the debugger — I introduced an additional environment variable that triggers __builtin_trap() on allocation failure.

Dumping allocation events has helped several times during the last two years; this solution proved really valuable.

Cure #4 — injecting memory allocation faults

For a given allocation id we can force returning NULL at runtime; this is done by setting an environment variable, thanks to which no recompilation is needed when we want to check different scenarios.
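A minimal sketch of the idea, with a made-up environment variable name (ALLOC_FAIL_AT) and hypothetical wrapper names:

#include <stdlib.h>

static size_t alloc_id;      /* incremented on every allocation */
static size_t fail_at;       /* 0 means "never fail"            */

static void init_fault_injection(void) {
    const char* value = getenv("ALLOC_FAIL_AT");   /* made-up name */
    fail_at = (value != NULL) ? (size_t)strtoull(value, NULL, 10) : 0;
}

static void* memory_alloc(size_t size) {
    alloc_id += 1;
    if (fail_at != 0 && alloc_id == fail_at)
        return NULL;         /* simulate an out-of-memory condition */

    return malloc(size);
}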

It turned out that some users of pyahocorasick can allocate all the memory available on their machines, thus failures of malloc are likely. In such cases the expected behaviour of the C module is to properly report the error to the Python interpreter (usually by setting MemoryError), not to segfault.

Injecting memory faults turned out to be an amazing solution. I know the word "amazing" is overused, but in this context it is highly appropriate. While the cost of introducing this technique was minimal, it revealed many latent problems in the existing code and has prevented new ones ever since.

Before each release I get the number of allocations performed by the whole test suite and then check the failure of each single allocation. There are roughly 3000 allocations, so it's not a big problem, just time consuming.

Cure #5 — injecting Python API faults

Since injecting memory allocation faults was so successful, I decided to do a similar thing with the Python API calls. Although the number of public functions one can find in /usr/include/python is huge, the module itself uses a small portion of them; right now only about 25 functions are used across the code.

All wrapped calls share the same call counter, and we can choose, again via an environment variable, which call is expected to fail.

The wrappers are also enabled with a preprocessor definition. In the C code a wrapped invocation looks like this:

py_value = F(PyTuple_GetItem)(args, 1);

Here the "F" is a preprocessor definition which either expands to the original name of argument (PyTuple_GetItem) or into its wrapped counterpart (PyTuple_GetItem_custom). The definitions of wrapped functions are similar, and looks like this:

#define PyTuple_GetItem_custom(...) (check_and_set_error() ? NULL : PyTuple_GetItem(__VA_ARGS__))

The only actual C code is the function check_and_set_error, which is the same for all wrapped functions. Depending on the type of API function, the failure result can be a NULL pointer (as in the case of PyTuple_GetItem), a negative number or zero — this distinction is handled within the #define.
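A sketch of how check_and_set_error and a wrapper with a negative failure value might look (the environment variable name and the details are made up; the module's real code may differ):

#include <Python.h>
#include <stdlib.h>

static size_t call_id;        /* shared by all wrapped API calls */
static size_t call_fail_at;   /* 0 means "never fail"            */

static void init_api_fault_injection(void) {
    const char* value = getenv("PYCALL_FAIL_AT");  /* made-up name */
    call_fail_at = (value != NULL) ? (size_t)strtoull(value, NULL, 10) : 0;
}

static int check_and_set_error(void) {
    call_id += 1;
    if (call_fail_at != 0 && call_id == call_fail_at) {
        /* a failing API call must leave a Python exception behind */
        PyErr_NoMemory();
        return 1;
    }

    return 0;
}

/* a wrapper for an API function that signals failure with a negative value */
#define PyList_Append_custom(...) \
    (check_and_set_error() ? -1 : PyList_Append(__VA_ARGS__))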

A side note: in most cases returning NULL signals an exception. Python expects that a C extension also sets an error using one of the designated API functions. If NULL is returned but no error is set, the interpreter unconditionally aborts execution! As you might guess, I found a few cases where this problem appeared.
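For illustration, the usual calling pattern that the injected faults exercise looks roughly like this (a generic sketch with a made-up helper name, reusing the F macro described above, not a quote from the module):

static PyObject* get_second_item(PyObject* args) {
    /* made-up helper: fetch args[1] and propagate any failure */
    PyObject* item = F(PyTuple_GetItem)(args, 1);
    if (item == NULL) {
        /* the callee (or an injected fault) has already set an exception,
           so it is enough to return NULL and let the caller handle it */
        return NULL;
    }

    return item;
}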

This solution also helped to fix many issues, fortunately before any of them got reported by users. Not to mention that the code coverage went up.

Unresolvable #1 — string representation

The coexistence of Python 2 and Python 3 adds a lot of problems. As I mentioned earlier, the internal string representation differs between the major Python versions.

There are also problems with the Python 3 string representation itself. Depending on the minor version, the platform, compilation settings and other factors I have no idea about, strings can be stored as UCS-2 or UCS-4, and perhaps as UTF-8 — I got lost at some point.

All these issues mean that the module code has to deal with three different data types and resolve nuisances such as whether a string is temporary (so we need to keep a copy of it) or is a reference. Of course it's doable, but it costs dozens of defines, which lowers the testability of the code.

Unresolvable #2 — Windows build

The official Python builds for Windows are done with MSVC 2010. This compiler is maybe not ancient, but it does not support the C99 standard. In other words, declaring variables at an arbitrary place in the code is not possible — all variable declarations have to be placed before the first statement of a block.

Well, it's not a real problem, it's rather... a constraint. I just got used to it, but I wish I didn't have to.

Random bits

I described the most important methods above; here is a list of things that also help:

See also