Tuesday, March 27, 2007

Beat me, whip me, make me use uninitialized pointers

Well, the title is just a catchy line I "borrowed" from a friend's custom message. The problem I am about to discuss does not have to do with pointers, but it indeed has to do with uninitialized variables.

The software product that I work on is supported on three UNIX platforms (Solaris, AIX, Linux), and on several flavors of each. When a test case starts failing on some of the platforms, especially at random, it is fairly safe to assume that memory corruption has happened. The primary tool that we use to analyze memory corruption is IBM Rational Purify.

A few days back, some testcases in our test suite started failing because messages were missing from the log file - the failures were random, mostly on Solaris 9 and 10, and sometimes on AIX (almost never on Solaris 8 and Linux EE and OEE). I was almost certain that a memory corruption had been introduced in the code. What was surprising was that it was always one particular message that went missing, and that the failures appeared in only one stream, though it was not very different from two other streams on which no such occurrences were reported. But such is the nature of memory corruptions.

So, I ran Purify on one such testcase, but it reported no error.
Then, since I was fairly confident that it was nothing but a corruption, I tried Valgrind as well. Valgrind is free software, released under the GNU GPL, and available only on Linux (Purify is available for both Solaris and Linux). It does not have a fancy GUI like Purify, but then, it does not have a fancy price tag either. [My primary development platform is Solaris, and the company buys Purify licenses, so my first preference is to use Purify rather than any other tool.]
Valgrind did point out a read of uninitialized memory - the code tested the value of a bit-field to decide whether to issue the message under analysis, and this bit-field was not initialized in some scenarios.

The interesting part to note here is why the problem was not reported by Purify, which is usually quite accurate - it owes to the way bit-fields are stored in a structure or a class object, and retrieved from memory. When a structure (or an object) declares some bit-fields, these are packed together, and padded with unused bits to align the object at the word boundary. When the value of a bit-field is read, the compiled code loads the complete word and then masks out the field, rather than reading the individual field. Purify works at the granularity of a word, so it would report an uninitialized memory read if any of the bits of the word are not initialized. Now, the unused bits that were padded for alignment will obviously ALWAYS be uninitialized; so to avoid false warnings, in the default mode Purify suppresses the uninitialized read messages in case of bit-fields.
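To make the failure mode concrete, here is a minimal sketch in C. It is not the code from our product; the structure and all the names in it (log_flags, emit_notice, init_partial, init_full, should_log) are invented for illustration. It shows an initializer that forgets one bit-field, so the field keeps whatever bit happened to be in that memory, and a fixed initializer that zeroes the whole object first.

```c
#include <string.h>

/* Hypothetical flags structure; every name here is made up for illustration. */
struct log_flags {
    unsigned int verbose     : 1;
    unsigned int emit_notice : 1;  /* tested before the message is logged */
    unsigned int level       : 3;
    /* the remaining bits of the storage unit are unused padding */
};

/* Buggy pattern: only some fields are set, so emit_notice keeps
 * whatever value was already sitting in memory. */
static void init_partial(struct log_flags *f)
{
    f->verbose = 1;
    f->level = 2;
}

/* Fixed pattern: clear the whole object (padding bits included)
 * before setting individual fields. */
static void init_full(struct log_flags *f)
{
    memset(f, 0, sizeof *f);
    f->verbose = 1;
    f->level = 2;
}

/* The decision that went wrong: the message is logged only if the bit is set. */
static int should_log(const struct log_flags *f)
{
    return f->emit_notice != 0;
}

/* Simulates stale memory by pre-filling the object with a given byte,
 * then runs one of the two initializers and reports the decision. */
static int should_log_after(unsigned char stale_byte, int use_full_init)
{
    struct log_flags f;
    memset(&f, stale_byte, sizeof f);
    if (use_full_init)
        init_full(&f);
    else
        init_partial(&f);
    return should_log(&f);
}
```

With the partial initializer, should_log_after(0xFF, 0) and should_log_after(0x00, 0) give different answers, which is the kind of platform-dependent, apparently random behaviour described above. With the full initializer the answer no longer depends on what was in memory beforehand.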

For those who are familiar with Purify, the Purify error code for an uninitialized read is UMR [Uninitialized Memory Read]. For bit-fields, the warning that is issued (and which is suppressed by default) is UMC [Uninitialized Memory Copy].

What not to do in C programming

I have been programming for many years now, but I have occasionally claimed that I am an artist at heart, and a software professional by chance. Having said that, I must add that this chance has given me the dubious opportunity to encounter many interesting problems, which took hours of debugging only to reveal an amazingly simple solution.

I have primarily worked with C/C++ on UNIX so far, where I have come across problems that baffled many for a long time, but whose final solution turned out to be trivial, even plain common sense in many cases. These might interest [or perhaps even help] people who are programming in C or in any other language. In this blog, I am going to post problems as and when I come across them. And I would like to invite the readers [as and when they "find" this blog] to share their experiences, or to become contributors, if they would like to.