pyahocorasick stabilisation story

Author: Wojciech Muła
Added on: 2019-01-08


Introduction

pyahocorasick is a Python module I started in 2011. At that time I was interested in stringology, and the Aho-Corasick algorithm appeared to be quite challenging. That was a sufficient reason to program it. However, I also decided that the result shouldn't be another proof of concept that nobody — except me — would use. Since I like Python, I chose the form of a C extension, which nicely combines a friendly Python API with an efficient C implementation.

Fast forwarding, the module has gained a few users worldwide. It may not be the most popular package on PyPI, but people keep installing it. Many users have contributed to the code, documentation and infrastructure, or reported bugs and helped with debugging. Philippe Ombredanne helped a lot with different aspects of development — without him the project wouldn't be so great.

This text is the result of recent work on stabilising the module, propelled by fixing a long-standing bug. The bug had been driving me crazy for more than a year. I want to show which means were used to eliminate this and many other bugs, and also how the code quality improved as a side effect. I hope some of you will find inspiration or a solution here.

The bug

Before we start I have to describe the bug, so that nobody repeats my stupid mistake.

The bug was caused by misuse of the Python function Py_BuildValue, which was used by the pickling mechanism. Basically, pickling used to be done as a simple memory dump — the module created a single memory area filled with some binary data.

The invocation Py_BuildValue("y#", ptr, size) constructs a bytes object with a copy of the memory pointed to by ptr, of the given size. The problem is that with this format string the size is passed as an int. I wrongly assumed that on 64-bit machines int is a 64-bit number. That's not true: int has only 32 bits. Because of that, when the size of the memory area was larger than 2GB, strange things happened, as shown in the table below.

64-bit size                               outcome
0 to 0x7fffffff (up to 2GB)               no errors
0x80000000 to 0xffffffff                  int is negative; an empty buffer is created, but no error is reported
larger than 4GB, 32nd bit equal to zero   a buffer of (size & 0x7fffffff) bytes is created
larger than 4GB, 32nd bit equal to one    int is negative; an empty buffer is created, but no error is reported

The solution was pretty simple: the huge memory area is split into several smaller regions and a list of such regions is pickled. The size of a single region is limited to a few megabytes, so it never comes close to the 2GB boundary (although all the data together can still be larger than 2GB).
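To illustrate the idea, here is a minimal sketch of such chunked pickling (the names dump_as_chunks and CHUNK_SIZE are hypothetical, not the module's actual code); each chunk is created with PyBytes_FromStringAndSize, whose length parameter is Py_ssize_t, so no 32-bit truncation is possible:

#include <Python.h>

/* hypothetical sketch: pickle a large buffer as a list of bounded chunks */
#define CHUNK_SIZE ((Py_ssize_t)16 * 1024 * 1024)   /* 16MB per chunk */

static PyObject*
dump_as_chunks(const char* data, Py_ssize_t size) {
    PyObject* list;
    PyObject* chunk;
    Py_ssize_t offset;
    Py_ssize_t len;

    list = PyList_New(0);
    if (list == NULL)
        return NULL;

    for (offset = 0; offset < size; offset += CHUNK_SIZE) {
        len = (size - offset < CHUNK_SIZE) ? (size - offset) : CHUNK_SIZE;

        /* the length argument is Py_ssize_t, never squeezed through an int */
        chunk = PyBytes_FromStringAndSize(data + offset, len);
        if (chunk == NULL || PyList_Append(list, chunk) < 0) {
            Py_XDECREF(chunk);
            Py_DECREF(list);
            return NULL;
        }
        Py_DECREF(chunk);   /* the list holds its own reference now */
    }

    return list;
}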

Problems

The module is written in ANSI C, so it obviously suffers from the problems common to this language; there are also some Python-specific issues. Here is a brief list:

Cure #1 — unit tests

It might sound obvious, but unit tests are crucial when one develops a programming library. In pyahocorasick, unit tests have existed from the very start. They are written in Python using the standard library module unittest; there are no C-level tests.

Firstly, unit tests have to cover the basic functionality, i.e. prevent logic errors. For instance, pyahocorasick provides the functionality of a word dictionary, so unit tests check whether adding a word really adds it, whether looking up a non-existing word fails, etc. Most tests are really simple, but surprisingly they detected a lot of mistakes.

Secondly, unit tests can check some very specific scenarios and also serve as a repository of regression tests. For example, I wrote very detailed tests for the unpickling code, where some binary data is constructed manually and then handled by the module. This set of tests is useful during development; regular users don't need to run it.

Unit tests are the basis for any serious work and further refactoring.

Cure #2 — code coverage

Getting code coverage for a Python extension is really simple. Just compile the module with the -coverage option:

$ export CFLAGS="-coverage"
$ python setup.py build_ext --inplace

Then run unit tests, or other programs which use the module, and get a coverage report using — for instance — gcovr:

$ gcovr --html-details -o DIRECTORY

The coverage report reveals which parts of the C code are untested. We instantly see whether we are able to trigger an untested part via a regular Python test, or whether a branch relies on the result of an external function call. In the latter case we might consider intercepting that function and forcing it to return a certain value — this approach is described in the following sections.

Cure #3 — memory leaks detection

A problem with a C extension for Python is that your C code runs within the same process as the interpreter. The standard build of the CPython interpreter enables a custom allocator. That allocator makes valgrind output unreadable, as valgrind reports tons of leaks or invalid writes/reads, while in fact none of them are bugs.

It's possible to compile the Python interpreter with special valgrind-friendly flags. While compiling a Python interpreter is not a big deal, I decided instead to wrap the malloc, free and (recently) realloc procedures. A special compile-time definition enables the memory debug mode, so a normal compilation has no overhead.
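A rough sketch of what such wrappers can look like in the dump mode is shown below (the names memory_alloc, mem_log and MEMORY_DEBUG are hypothetical; the module's real wrappers differ in details). The log format matches the sample further down:

#include <stdio.h>
#include <stdlib.h>

#ifdef MEMORY_DEBUG

static FILE*  mem_log;       /* assumed to be opened during module initialisation */
static size_t mem_id;        /* unique id of every (re)allocation                 */

static void* memory_alloc(size_t size) {
    void* ptr = malloc(size);
    mem_id += 1;
    fprintf(mem_log, "A %zu %p %zu\n", mem_id, ptr, size);
    return ptr;
}

static void* memory_realloc(void* old, size_t size) {
    void* ptr = realloc(old, size);
    mem_id += 1;
    fprintf(mem_log, "R %zu %p %p %zu\n", mem_id, old, ptr, size);
    return ptr;
}

static void memory_free(void* ptr) {
    fprintf(mem_log, "F %p\n", ptr);
    free(ptr);
}

#else
#   define memory_alloc   malloc
#   define memory_realloc realloc
#   define memory_free    free
#endif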

There are two modes of debugging I implemented:

  • dump all memory-related events to a file;
  • inject memory allocation faults — described in the next section.

The output of the dump mode is a list of actions given in plain text, like this:

A 1 0x7fafd9741148 40
A 2 0x7fafd9741120 40
R 1 (nil) 0x7fafd9d89728 8
A 3 0x7fafd97410f8 40
R 2 (nil) 0x7fafd9d89738 8
F 0x7fafd9da51a0

"A 1 0x7fafd9741148 40" is read as: allocation #1 of 40 bytes retuned address 0x7fafd9741148; "R 2 (nil) 0x7fafd9d89738 8" is read as: "reallocation #1 of 8 bytes, old address was NULL, returned address 0x7fafd9d89738"; finally "F 0x7fafd9da51a0" means "free address 0x7fafd9da51a0".

A simple Python script parses such a file and points out which allocations have no corresponding free. Since each allocation/reallocation has a unique id, we can easily figure out where exactly in the code the leaked object was created. I also considered dumping backtraces (there's a nice backtrace function in GNU libc), but in practice it was much easier to run the program under gdb and obtain the backtrace from the debugger — I introduced an additional environment variable that triggers __builtin_trap() on allocation failure.

Dumping allocation events has helped several times during the last two years; this solution proved really valuable.

Cure #4 — injecting memory allocation faults

For a given allocation id we can force returning NULL at runtime; this is done by setting an environment variable, thanks to which no recompilation is needed when we want to check different scenarios.
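A minimal sketch of the idea, with a made-up environment variable name (ALLOC_FAIL_AT) and hypothetical wrapper names:

#include <stdlib.h>

static size_t alloc_id;      /* incremented on every allocation */
static size_t fail_at;       /* 0 means "never fail"            */

static void init_fault_injection(void) {
    const char* value = getenv("ALLOC_FAIL_AT");   /* made-up name */
    fail_at = (value != NULL) ? (size_t)strtoull(value, NULL, 10) : 0;
}

static void* memory_alloc(size_t size) {
    alloc_id += 1;
    if (fail_at != 0 && alloc_id == fail_at)
        return NULL;         /* simulate an out-of-memory condition */

    return malloc(size);
}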

It turned out that some users of pyahocorasick can allocate all the memory available on their machines, thus failures of malloc are likely. In such cases the expected behaviour of the C module is to properly report the error to the Python interpreter (usually by setting MemoryError), not to segfault.

Injecting memory faults turned out to be an amazing solution. I know the word "amazing" is overused, but in this context it is highly appropriate. While the cost of introducing this technique was minimal, it revealed many latent problems in the existing code and has prevented new ones ever since.

Before each release I get the number of allocations performed by the whole test suite and then check the failure of each single allocation. There are roughly 3000 allocations, so it's not a big problem, just time consuming.

Cure #5 — injecting Python API faults

Since injecting memory allocation faults was so successful, I decided to do a similar thing with the Python API calls. Although the number of public functions one can find in /usr/include/python is huge, the module itself uses a small portion of them; right now only about 25 functions are used across the code.

All wrapped calls share the same call counter, and we can choose, again via an environment variable, which call is expected to fail.

The wrappers are also enabled with a preprocessor definition. In the C code a wrapped invocation looks like this:

py_value = F(PyTuple_GetItem)(args, 1);

Here the "F" is a preprocessor definition which either expands to the original name of argument (PyTuple_GetItem) or into its wrapped counterpart (PyTuple_GetItem_custom). The definitions of wrapped functions are similar, and looks like this:

#define PyTuple_GetItem_custom(...) (check_and_set_error() ? NULL : PyTuple_GetItem(__VA_ARGS__))

The only actual C code is the function check_and_set_error, which is the same for all wrapped functions. Depending on the type of API function, the failure result can be a NULL pointer (as in the case of PyTuple_GetItem), a negative number or zero — this distinction is handled within the #define.
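A sketch of how check_and_set_error and a wrapper with a negative failure value might look (the environment variable name and the details are made up; the module's real code may differ):

#include <Python.h>
#include <stdlib.h>

static size_t call_id;        /* shared by all wrapped API calls */
static size_t call_fail_at;   /* 0 means "never fail"            */

static void init_api_fault_injection(void) {
    const char* value = getenv("PYCALL_FAIL_AT");  /* made-up name */
    call_fail_at = (value != NULL) ? (size_t)strtoull(value, NULL, 10) : 0;
}

static int check_and_set_error(void) {
    call_id += 1;
    if (call_fail_at != 0 && call_id == call_fail_at) {
        /* a failing API call must leave a Python exception behind */
        PyErr_NoMemory();
        return 1;
    }

    return 0;
}

/* a wrapper for an API function that signals failure with a negative value */
#define PyList_Append_custom(...) \
    (check_and_set_error() ? -1 : PyList_Append(__VA_ARGS__))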

A side note: in most cases returning NULL signals an exception. Python expects that a C extension also sets an error using one of the designated API functions. If NULL is returned but no error is set, the interpreter unconditionally aborts execution! As you might guess, I found a few cases where this problem appeared.
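For illustration, the usual calling pattern that the injected faults exercise looks roughly like this (a generic sketch with a made-up helper name, reusing the F macro described above, not a quote from the module):

static PyObject* get_second_item(PyObject* args) {
    /* made-up helper: fetch args[1] and propagate any failure */
    PyObject* item = F(PyTuple_GetItem)(args, 1);
    if (item == NULL) {
        /* the callee (or an injected fault) has already set an exception,
           so it is enough to return NULL and let the caller handle it */
        return NULL;
    }

    return item;
}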

This solution also helped to fix many issues, fortunately before any of them got reported by users. Not to mention that the code coverage went up.

Unresolvable #1 — string representation

The coexistence of Python 2 and Python 3 adds a lot of problems. As I mentioned earlier, the internal string representation differs between the major Python versions.

There are also problems with the Python 3 string representation itself. Depending on the minor version, the platform, compilation settings and other factors I have no idea about, strings can be stored as UCS-2 or UCS-4, and perhaps as UTF-8 — I got lost at some point.

All these issues mean that the module code has to deal with three different data types and resolve nuisances such as whether a string is temporary (so we need to keep a copy of it) or is a reference. Of course it's doable, but it costs dozens of defines, which lowers the testability of the code.

Unresolvable #2 — Windows build

The official Python builds for Windows are done with MSVC 2010. This compiler is maybe not ancient, but it does not support the C99 standard. In other words, declaring variables at an arbitrary place in the code is not possible — all variable declarations have to be placed before the first statement of a block.

Well, it's not a real problem, it's rather... a constraint. I just got used to it, but I wish I didn't have to.

Random bits

I described the most important methods above; here is a list of things that also help:

See also