Benefits from the obsession

Author:	Wojciech Muła
Added on:	2015-12-13

Everything has started few years ago when I found John Regher's blog. If you don't know the blog I highly recommend it. Among other things (I like the photos!) the author studies bugs in compilers, undefined behaviours and similar things. The word "overflow" appears quite often in his posts due to the great number of errors related to improper use of the integer arithmetic. Well, I don't know when the obsession has exactly started, but recently I realized that I am alert of all integer operations in my programs. If a variable has type int and later it is incremented then I'm pretty sure that the application will eventually blow up. When I can't prevent it, I left comments like: "XXX: possible overflow". This post shows three cases when I haven't left a comment, but fixed the issue.

Case 1

On the other day I saw such innocent code:

string decode_base64(string& base64) {

    // some validation

    string result;
    const size_t n = 3*base64.size() / 4;
    result.resize(n);

    // decoding stuff

    return result;
}

The size of result is three fouth of the base64's size, so it is obviously smaller than the maximum value of size_t which is used to store the string size. However, the overflow can occur during multiplying by 3. On 64-bit machines the size of base64 have to be really huge to trigger the error — it is 1.23e+19. But on 32-bit machines it's "merely" 28 GB (try to imagine base64-encoded movie sent via e-mail...) Despite the CPU architecture, the problem still exists. And the solution is not very complicated.

Expression 3/4 * x is equivalent to x - 1/4 * x. Dividing x by 4 never cause an overflow, but since we operate on integers these two expressions are not equal. The latter expression have to be corrected (rounded up) with following conditional expression (x % 4 != 0) ? 1 : 0. Thus the final expression is:

const size_t x = base64.size();
const size_t n = x - x/4 + ((x % 4 != 0) ? 1 : 0);

The expression requires: 2 additions, 1 right shift, 1 comparison and 1 condition expression. And it's perfectly safe.

Case 2

Later I spotted a similar mistake in a procedure which encodes data to base64. The expression calculating the size of an output buffer:

string base64_encode(const string& input) {

    const size_t x = input.size();
    const size_t n = 4*x / 3;

    string result;
    result.resize(n);

    // ...

    return result;
}

Here the overflow is clearly visible, because value 4/3 * x could be greater than the maximum value of an integer. Again, to trigger the error on 64-bit CPUs the input size have to be greater than 4.61e+18. But on 32-bit CPUs the overflow will occur for 1GB inputs. OK, so how to detect the overflow in this case? After rewriting expression to x + x/3 we can easily detect overflow during addition:

size_t add(size_t x, size_t y) {
    if (std::limits<size_t>::max() - x < y) {
        throw "overflow";
    }

    return x + y;
}

string base64_encode(const string& input) {

    const size_t x = input.size();
    const size_t n = add(x, x/3);

    // ...
}

In this case the only additional cost is a single comparison before the addition. And, what is more important, the procedure detects the overflows, while the original version silently ignored them.

Case 3

The last example: joining a list of strings. The naive implementation looks like this:

string join(const string& sep, const vector<string>& items) {

    if (items.empty()) {
        return "";
    }

    string result;

    result = items[0];
    for (size_t i=1; i < items.size(); i++) {
        result.append(sep)
        result.append(items[i]);
    }

    return result;
}

In my version, before actual appending the items to the results, there is an extra loop which calculates the size of the result string, like this:

string join(const string& sep, const vector<string>& items) {

    // ...

    size_t size = items[0].size();
    for (size_t i=1; i < items.size(); i++) {
        size = add(size, sep.size());
        size = add(size, items[i].size());
    }

    string result;
    result.reserve(size);

    // building the result
}

This case is maybe not the most interesting (this is why is the last one). When joining a really huge number of items we will probably run out of memory before any overflow might occur. However, there are some arguments in favor of the solution. The first positive thing is that we detect the overflow at the early stage, before allocating any memory. The other good point is that there is single call to the reserve method, i.e. one allocation. Thanks to that later appending strings to the result only copies a data. Please remember that most of string implementations grow the memory by factor from 1.5 to 2.0, in result you can run out of memory even without arithmetic overflow.

Summary

To conclude the text: I showed how paying attention on minor details, considering possible but unlikely conditions led to eliminate some problems in the real code. I'm pessimist and I do believe that ignored problems will eventually came out. Remember the famous bug in the binary search implementation — everything was OK for many years, up to the day when somebody was butted.