Thursday, November 1, 2007

Add-venture?

In my previous post, I was asked whether there was any specific reason to get the 20-year-old program compiled with the latest compiler. After all, the older executable should be expected to continue to run even on new versions of the OS.

Here is the answer.

There are many reasons for porting:

1) In the industry, there is a standard set of development tools (e.g., OS, compilers, etc.) used in any organization, and these standards are revised from time to time. By ensuring that we have a well-defined standard, we establish consistency within the development teams, as well as with customers, and avoid build/run time issues. When the organization moves to a new standard, all [or at least, a defined set] of the products comply with it. One can say that this is an organizational policy matter.
In our case, we moved from gcc 3.2.3 to gcc 4.1.1 company-wide, for Linux platforms.

2) In large product based environments, each product is not an independent entity in itself, but has several dependencies. While a particular product may not see many changes, the dependencies may change.
In our case, the dependencies include a few libraries that are actively changing. To remain compatible, we have to port this old product to the new platforms/compilers.

3) Although this specific code is quite old, and newer products have since been launched, there are still some loyal customers who need bug fixes in it. To deliver these fixes, the product has to be compiled anew, and any new compilation must use the current compiler standards.

Wednesday, October 31, 2007

Venturing Into Uncharted Territory

By default, C compilers place string constants (and other constant data) into protected memory (a read-only section), where they cannot be modified at runtime.

When the option "-fwritable-strings" provided by many C compilers on different platforms (gcc, Solaris, xlc) is used, the compilers allocate strings in the [writable] data segment, so that the program can write to them. This option is provided for backward compatibility for old applications [pre-ANSI, I think] that rely on writing to strings.

However, it is quite intuitive that writing into string constants is a very bad idea - "constants" should be constant. I believe it is for this reason that most compilers are now deprecating or discontinuing the "-fwritable-strings" option. Until recently, we had been using version 3.2.3 of gcc on Linux platforms. We then moved to 4.1.1, in which this option is no longer supported. The option had actually been deprecated in 3.3 and discontinued in 3.4, but at my workplace the next standard we adopted after 3.2.3 was 4.1.1. There are a number of other changes that 4.1.1 has enforced, but I'll discuss those some other time.

Now, we have a product which is almost 20 years old, and though there has been no active development in it for the last 5-6 years, it continues to be supported. So we need to support it on the new platform/compiler standards, and we therefore undertook the task of porting it to gcc 4.1.1. There are numerous places in this code where constant strings are written to [this, and some other typical usages, led me to believe that the code was written before the ANSI C standard had arrived]. One of the reasons given for allowing writes to string literals in old programs is that memory used to be severely limited, and allocation/deallocation was expensive. Apart from the build issues we faced due to stricter checks in the newer version, there were a large number of failures in regression - when you attempt to write to a constant string literal, you get a segmentation fault.

Here is a small program to illustrate the problem:

====================================

#include <stdio.h>
#include <ctype.h>

static char *keys[] =
{ "abc", "def", "ghi", "jkl" };

int my_sm_hash(char *strup)
{
    int modified = 0;
    char ch = '\0';

    while ((ch = *strup) != '\0')
    {
        if (islower(ch))
        {
            *strup = toupper(ch);
            modified = 1;
        }
        ++strup;
    }
    return modified;
}

int main(void)
{
    int i, j;

    for (i = 0; i <= 3; i++)
        printf("\nkeys[i] : %s", keys[i]);

    for (i = 0; i <= 3; i++)
    {
        j = my_sm_hash(keys[i]);
        printf("\nWord no. %d, modified = %d", i, j);
    }

    for (i = 0; i <= 3; i++)
        printf("\nkeys[i] : %s", keys[i]);

    printf("\nDone\n");
    return 0;
}
====================================

When I compile it with gcc (older version, 3.2.3), here is the output I get:
====================================
keys[i] : abc
keys[i] : def
keys[i] : ghi
keys[i] : jklSegmentation Fault
====================================

When I compile the same program with the option -fwritable-strings, the output is as follows:
====================================
keys[i] : abc
keys[i] : def
keys[i] : ghi
keys[i] : jkl
Word no. 0, modified = 1
Word no. 1, modified = 1
Word no. 2, modified = 1
Word no. 3, modified = 1
keys[i] : ABC
keys[i] : DEF
keys[i] : GHI
keys[i] : JKL
Done
====================================

In the standalone program, the solution does not look difficult. But in the context of a much larger program, where a function like my_sm_hash may be called from multiple places and there may be underlying assumptions elsewhere in the code, each of the possible solutions has associated disadvantages.

Solution 1: Use arrays instead of constant string literals: The gcc documentation states, "Use named character arrays when you need a writable string." What this means in practice is that you need to change lines like
char *myStr = "Hello, world";
to
char myStr[] = "Hello, world";
We could not use this approach: the string constants in our code were actually part of another array of structures (with constant values), so rewriting them as named arrays was not feasible for us. A sketch of that situation follows.
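To make the constraint concrete, here is a minimal sketch of the kind of table we had; the structure and field names are hypothetical, but the pattern is the same:

typedef struct
{
    int   code;
    char *name;   /* points to a string literal placed in read-only memory */
} key_entry;

static key_entry key_table[] =
{
    { 0, "abc" },
    { 1, "def" },
    { 2, "ghi" },
    { 3, "jkl" }
};
/* 'name' cannot simply be redeclared as a named character array here,
   which is why Solution 1 was not feasible for us. */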

Solution 2: Create, modify and return a new string: allocate a new string, copy the existing string into it, and then modify the new string. This may be done inside the function, or at the point of the call to the function.
There were two problems we discovered with this mechanism:
1) We could not determine when and where to free the newly allocated strings, so it could result in a considerable increase in the memory footprint.
2) Some other parts of the program relied on the original string being modified in place, so even though the crashes were avoided, testcases continued to fail.
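For reference, here is a rough sketch of what Solution 2 looks like. The helper name is invented; it reuses my_sm_hash from the example above and assumes <stdlib.h> and <string.h> are included:

char *upper_copy(const char *src)
{
    char *dup = (char *)malloc(strlen(src) + 1);
    if (dup == NULL)
        return NULL;
    strcpy(dup, src);      /* copy the literal into writable heap memory */
    my_sm_hash(dup);       /* now the modification is safe */
    return dup;            /* ownership passes to the caller - who must free it */
}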

Eventually, we realized we would have to live with memory leaks unless we wanted to re-architect a huge codebase that is no longer actively developed and is not well understood by the present team. We could at most try to control the extent of the damage. So, we used different mechanisms for specific scenarios:
  • There were some global tables [arrays of structures with constant data] that had constant strings. We reallocated the string fields of such structures once, during the initialization stage. This ensured only a one-time leak of a small amount of memory [instead of the repeated leaks that would have resulted from Solution 2 above].
  • At places where static strings were defined, modified and used within a local scope, we used a static variable to control the allocation, so that the string was allocated only the first time the function was called [see the sketch after this list].
  • We had cases where a char pointer was assigned a string constant as a default value, and this default value was overridden in different cases using a switch/case statement. This was modified so that the pointer was not preassigned, and only if none of the enumerated cases was encountered [i.e. an unexpected scenario] was the pointer assigned the constant string.
  • There were even cases where a static string was declared only to be modified [capitalized, in this case] and used immediately. This was trivial - use the upper case in the first place!
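Here is a sketch of the static-variable guard mentioned in the second bullet (names invented, error handling omitted): the writable copy is allocated only on the first call, and is deliberately never freed - a one-time, bounded leak.

const char *get_default_label(void)
{
    static char *label = NULL;   /* allocated once, on the first call */

    if (label == NULL)
    {
        label = (char *)malloc(strlen("default") + 1);
        strcpy(label, "default");
        my_sm_hash(label);       /* capitalize the writable copy, not a literal */
    }
    return label;
}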

Wednesday, October 3, 2007

Identity Crisis

Early this week, I got crashes in many testcases, in 64-bit mode, and only on Linux platforms (Solaris was fine in both 32- and 64-bit modes). Debugging such issues is a big problem, because they are almost certainly tricky memory corruptions, and debugging in a 64-bit environment is seriously hampered by the lack of proper support in debugging tools.

There are some specific techniques I follow to debug such corruptions or environment specific issues.

1) Reproduce the crash in an environment, where it is easier to debug.

Now, Workshop, the graphical debugging tool that I use on Solaris, supports both 32- and 64-bit modes well. I like it for its user-friendly interface, and for certain features like pop stack that are not provided by other debuggers. But since I could not reproduce the crash on Solaris, I had to start my investigation on Linux. On Linux, I use DDD, which is a GUI based on GDB.
I was in for a surprise: DDD did not load my executable. I checked the usual suspects - build, run arguments, paths, library paths, etc. This did not help.
After asking around, a helpful team-member told me that there is a separate 64-bit version of GDB. I set my path and library path to pick up this version, and was able to load the executable in DDD.

2) Compare the behavior on two platforms. Start with a top-level analysis, followed by a step-by-step comparison.

DDD showed me the point of the crash, but did not allow me to access the object a call to whose method generated the crash. Moreover, DDD does not honour the default values of function arguments that one can provide in C++ code.

The line of code was something like:

if (obj && obj->hasProp1())
{
    info = obj->getInfo(PROP1); // => the crash occurred here
}

I ran the testcase on Solaris, but the execution did not reach the offending line!

3) Quick, high-level debugging through "printf" statements.
I put a "printf" statement before the "if" statement, to display the name of the object 'obj' and the value returned by the hasProp1() call. I compared the output on Solaris and Linux, and it was exactly the same! The value was '0' on both platforms; how the code then entered the "if" condition is a mystery I could not solve till the end.

4) Use memory analysis tools. Try different tools, as it frequently happens that one tool catches a problem that another could not.

I ran Valgrind on Linux, but it did not show any errors. Valgrind is available only on Linux, but it has no build-time requirements.

One point to note is that one can catch corruptions equally well on any platform, since the problem exists everywhere, even if it does not manifest in certain scenarios.
Purify was initially available only on Solaris, though now it is available on Linux as well. So I am more comfortable using Purify on Solaris; besides, I have had problems using Purify on Linux in the past. So, I tried to run Purify on Solaris. However, I was not able to run the 64-bit Purify build - it crashed even before executing the first line. I tried building and running a number of times, with the same outcome. After struggling with it for a long while, a thought struck me - I was running it on S10. Our build platform is S8, but with the same build we also support runs on S9 and S10. I had been running the non-Purify build on S10, and I had continued to do so with the Purify build. So, in yet another attempt, I ran the testcase on S8. This time it did run! The 64-bit Purify build does not run on S10!
However, the run still did not help me much - it kept delivering SIGBUS indefinitely. I loaded the Purify build in Workshop, and put a breakpoint in the function purify_stop_here - this function is a debugging hook provided by Purify, called just before Purify reports a memory violation. I got to the point where the SIGBUS was reported, and analyzed that part of the code, but it only led me on a wild goose chase. But well, that is life!

5) When all else fails, ask around!

I asked many people in the group if they had debugged 64-bit binaries on Linux. Finally, one person told me he had done so, but that he did not have much faith in DDD, so he used GDB from the shell. He also gave me a pointer to the version of GDB he had (successfully) used. This was yet another version of 64-bit GDB!

6) Revisit steps

Meanwhile, I had also made a Purify build on Linux. I ran this Purify build under the new version of GDB. Purify reported an ABR (Array Bounds Read) on the same line that DDD had flagged earlier. I put a breakpoint in the function purify_stop_here, and continued till I reached the point of the ABR. This time, this version of GDB correctly reported the name of the object. I moved to Solaris Workshop once again, and looked for this particular object. Having the name of the object made it quite easy this time. And what I discovered was absolutely stunning - the object was actually of the base class, while the properties we were querying on it were functions and members of the derived class. The object had been returned by a call to a function which could return either kind of object. The fix was, of course, extremely simple - I modified the "if" condition to:
if (obj && obj->isDrvCls() && obj->hasProp1())
{
    ...
}

The class declarations were as follows:

class baseCls
{
private:
    ...
    char *name;
public:
    ...
    char *getName() { return name; }
    virtual int isDrvCls() { return FALSE; }  // base objects answer FALSE; virtual,
                                              // so the check in the fix above sees
                                              // the object's real type
};

class drvCls : public baseCls
{
public:
    ...
    int hasProp1() { return type_ & PROP1_MASK; }
    listCls *info() { return lst1_; }
    infoCls *getInfo(int propType) { return info()->findInfo(propType); }
    int isDrvCls() { return TRUE; }
private:
    ...
    int type_;
    listCls *lst1_;
};

Wednesday, June 27, 2007

My 64-bit porting experiences - VI

6. Excessive code-reuse

We have always been instructed to reuse code, as it makes the code more maintainable. But it can sometimes be taken too far (too much of a good thing ?!).

6.1 Objects of different sizes in a union

class A {
public:
    int getVal() { return bVal; }
    exp *getExpr() { return bExpr; }
    int hasExpr() { return isExpr; }
private:
    union {
        int bVal;
        exp *bExpr;
    };
    int isExpr;
};

The object of this class ‘A’ can have a data-member that is either a constant value (bVal), or an expression (bExpr) if the value is not constant. When the value is not constant, ‘isExpr’ is set to ‘1’, and the user is expected to use the expression returned by ‘getExpr’ call.

In the following piece of code, 'obj_a' points to an object of class 'A':

int tmp = 0;
exp *tmpExpr = NULL;

if (cond)
    tmp = 1;
else
    tmp = obj_a->getVal();

if (obj_a->hasExpr())
{
    if (tmp == 1)
        tmpExpr = createExpr(1);
    else
        tmpExpr = obj_a->getExpr();
}
else
    tmpExpr = createExpr(tmp);

Incorrect values in a testcase, in 64-bit mode, were traced to this piece of code. Even though 'obj_a' contained an expression (which represented a value other than '1'), and 'cond' was '0', this code inferred the value represented by 'obj_a' to be '1'.

Reason: 'obj_a' represented an expression, so getVal() should not have been used. But it was called, and it returned '1'. In 64-bit mode, the members of the union - int and pointer - have sizes of 32 and 64 bits respectively. When the integer value was queried, the higher 32 bits of the pointer were returned, and these constituted the integer value '1' [the pointer was 0x1_hhhh_hhhh]. This code worked without problems in 32-bit mode, because int and pointer have the same size there, and a valid pointer cannot be 0x1.
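A stripped-down C illustration of the same effect - not the original class, and the address is made up. In 32-bit mode the two union members coincide exactly; in 64-bit mode reading the int member returns only half of the stored pointer (which half depends on the machine's endianness):

#include <stdio.h>

/* Just the layout problem: an int and a pointer sharing storage in a union. */
union int_or_ptr
{
    int   bVal;
    void *bExpr;
};

int main(void)
{
    union int_or_ptr u;

    u.bExpr = (void *)0x1abcd1234UL;   /* hypothetical expression pointer */

    /* In 32-bit mode both members are 4 bytes, so this "works".  In 64-bit
       mode bVal overlaps only half of bExpr - on a big-endian machine the
       high half, i.e. the 0x1 described above. */
    printf("sizeof(int) = %lu, sizeof(void *) = %lu, bVal = %d\n",
           (unsigned long)sizeof(int), (unsigned long)sizeof(void *), u.bVal);
    return 0;
}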


6.2 Overriding functions through implicit casting

This problem was encountered at many places in the code, in different forms.

A hash-table to hash pointers had been implemented, and was widely used. The interface functions to insert new records, or find existing ones, accepted char** arguments (and dereferenced them to obtain the pointer). Later on, there was a requirement to hash integer values as well. Instead of implementing a new hash-table (or even writing new interface functions), the previous table and its interface were reused - by passing an int* as the actual argument. When the interface functions dereferenced the pointer passed as the actual argument, the resulting value was an int, and therefore not a valid address.
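A hedged sketch of the interface mismatch (all names here are invented; the real table and its record handling were more involved):

#include <stdio.h>

/* The original interface: keys are pointers, passed by address. */
static void hash_insert(char **keyp)
{
    void *key = (void *)*keyp;   /* the table keys on the dereferenced pointer */
    printf("hashing key %p\n", key);
}

int main(void)
{
    char *name = "some_key";
    int   id   = 42;

    hash_insert(&name);          /* intended use: *keyp is a real address */
    hash_insert((char **)&id);   /* the reuse for integers: *keyp now reads the
                                    value 42 (plus adjacent bytes on a 64-bit
                                    machine) and treats it as an address */
    return 0;
}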

Sunday, June 17, 2007

My 64-bit porting experiences -V

5. Use of compatible, but incorrect datatypes

An API function, ‘func1’ returns int*, which actually points to an array of two integers. The part of code that uses this function is as follows:

long *lp = func1(…);
int lval = lp[0];
int rval = lp[1];

The results were correct in 32-bit mode, because int and long are the same size. In 64-bit mode, the returned values were incorrect because of the difference in size between int and long. 'lp[0]' reads a 64-bit value, which contains both of the int values returned by 'func1'. When it is assigned to 'lval', the lower 32 bits are assigned - which is the value that should have gone to 'rval'. 'lp[1]' reads the next 64 bits from memory and assigns their lower 32 bits to 'rval'. So 'lval' gets an incorrect value, while 'rval' gets junk.
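The fix implied by the analysis above is simply to declare the receiving pointer with the type 'func1' actually returns (shown as a fragment, mirroring the snippet above):

int *ip = func1(…);     /* int *, matching the API's actual return type */
int lval = ip[0];       /* first of the two integers */
int rval = ip[1];       /* second of the two integers */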

Sunday, June 10, 2007

My 64-bit porting experiences - IV

4. Masking operations
The following may be considered an example of oversight, or of assumptions about datatype size. I have made it an independent section because masking operations are frequently used in C code for fast storage and access of data. Masks are usually implemented through macros, and problems in these macros are difficult to debug.

4.1 Storing additional information in a pointer
It might be thought that manipulation of pointer addresses would be very prone to errors. But given that addresses are always word-aligned, the last bit of a pointer value is always 0. To take advantage of this fact, applications sometimes store a boolean property in the last bit of the address, and save runtime memory. As long as care is taken to reset the last bit of such pointers before dereferencing them, the application is safe. The set and reset operations are usually coded as macros in C code, in the interest of runtime performance.

In the following code snippet, 'x' is the pointer under manipulation. M_IS_LIST queries the property on the pointer, M_SET_LIST sets the property on the pointer, and M_GET_LIST retrieves the clean pointer before it is de-referenced.
#define M_IS_LIST(x) (((unsigned long)(x)) & 0x1U)
#define M_SET_LIST(x) ((x) |= 0x1U)
#define M_GET_LIST(x) (((unsigned long)(x)) & ~0x1U)

This works for 32-bit mode. In 64-bit, I was getting corrupt pointers at some point in code, which was traced to these macros after considerable effort.

Reason: On its own, a constant such as 0x1U is treated as an unsigned int, i.e. a 32-bit value. The real damage happens in M_GET_LIST: ~0x1U is computed as a 32-bit value (0xFFFFFFFE) and is then zero-extended to 64 bits when it is ANDed with the pointer value, so the AND clears the upper 32 bits of the address.

Solution: To make the mask the same size as the variable:
#define M_IS_LIST(x) (((unsigned long)(x)) & 0x1UL)
#define M_SET_LIST(x) ((x) |= 0x1UL)
#define M_GET_LIST(x) (((unsigned long)(x)) & ~0x1UL)
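A small standalone demonstration of why the suffix matters (the address value is made up): ~0x1U is computed in 32 bits and then zero-extended, wiping out the upper half of an LP64 address, while ~0x1UL leaves it intact.

#include <stdio.h>

int main(void)
{
    unsigned long addr = 0x12345678abcdef02UL;   /* hypothetical 64-bit address */

    printf("addr & ~0x1U  = 0x%lx\n", addr & ~0x1U);    /* upper 32 bits cleared */
    printf("addr & ~0x1UL = 0x%lx\n", addr & ~0x1UL);   /* address preserved */
    return 0;
}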


4.2 Manipulating addresses

Yet another memory corruption was caused by the following code snippet, which purported to provide the next pointer to be processed:

if (((unsigned long)pstr) & 0x3U)
    pstr = (char *)(((unsigned long)(pstr + 4)) & ~0x3U);

It was very difficult to understand the purpose of this simple-looking piece of code, due to the use of bare constants and the absence of comments. I cannot stress enough the necessity of documentation in code, especially around tricky calculations like this.

Well, it re-aligns a pointer, based on the understanding that addresses are aligned at 4 bytes (the last two bits are 0). However, in 64-bit mode, addresses are aligned at 8 bytes (the last 3 bits are 0). Therefore, the calculation needed to be modified [though I wish it could be more generic], as follows:

#ifdef BUILD_64_BIT
#define addrMask 0x7UL
#else
#define addrMask 0x3UL
#endif

if (((unsigned long)pstr) & addrMask)
    pstr = (char *)(((unsigned long)(pstr + sizeof(long))) & ~addrMask);
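For what it is worth, here is the more generic variant I had in mind - a sketch only, with a hypothetical macro name, deriving the mask from the pointer size instead of hard-coding it per build:

#define ADDR_MASK ((unsigned long)(sizeof(void *) - 1))   /* 0x3 on 32-bit, 0x7 on 64-bit */

if (((unsigned long)pstr) & ADDR_MASK)
    pstr = (char *)(((unsigned long)(pstr + sizeof(void *))) & ~ADDR_MASK);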

Thursday, May 24, 2007

Who moved my cheese?

... er ... pointer ?

Here is another real-life example of buggy code that I found.

This piece of code aims to create a copy of a doubly linked list:
// - localHead is the head of the list of blocks being copied
// - new_block is the block next to the tail of the list being copied
// - bodyTail is the tail of the current list,
// at the end of which the newly copied list is appended
listBlk *tmpBlk = localHead;
listBlk *nxtBlk = NULL;
listBlk *dupFirstBlk = NULL;

while (tmpBlk && (tmpBlk != new_block))
{
    nxtBlk = tmpBlk->copy();
    if (dupFirstBlk)
    {
        nxtBlk->back(tmpBlk->back());
        nxtBlk->next(tmpBlk->next());
    }
    else
    {
        nxtBlk->setBackNext(bodyTail);
        nxtBlk->next(new_block);
        dupFirstBlk = nxtBlk;
    }
    tmpBlk = tmpBlk->next();
}
nxtBlk->next(new_block);


This had been lying around dormant for years .... (just because the list never had more than one block, owing to the way the code is structured)

And once again, what is the problem here ?

Wednesday, May 16, 2007

It's a bug!

[It is not a feature]

I was working on simulating break and continue statements in loops.

First I worked with for loops, and got them working. Then I started with while. I wrote a simple while loop, and a simple break statement in it:
j = 0;
while(j <= 7)
begin
if(j==4)
break;
out1[j] = 1;
j = j + 1;
end

This worked fine. So, to check out continue, I simply replaced break with continue [as I had done with for] and expected it to work right away.
j = 0;
while (j <= 7)
begin
    if (j == 4)
        continue;
    out1[j] = 1;
    j = j + 1;
end

But I was in for a shock! The tool got stuck! I looked at the testcase, and at the code, twice. Concluding that there was a bug in the tool's implementation, I consulted a colleague. He agreed that there was a problem in the tool. After playing around with the test for a few minutes, I realized that the tool was smart enough. We were the dumb party!!

[OK, so what is the bug?!]

Wednesday, May 9, 2007

My 64-bit porting experiences - III

3. Assumptions on datatypes

3.1 Assumptions on size of built-in datatypes

I traced crashes on some platforms to a corruption in hash-tables (though I am still wondering why these testcases did not crash on the rest of the platforms). In the hash-table designated for a certain kind of record, a different kind of record was being stored.

The reason, which proved very difficult to determine [I spent almost three days on this!], was an assumption about the size of built-in datatypes in the hash function (the computation of the bin number):

int hash_func(long key)
{
    int size = NUM_BINS;

    long hi = key >> 16, lo = key;
    return (int)(hi ^ lo) % size;
}

The long argument 'key' is derived from a pointer by casting it to long. On 32-bit systems, the function XOR-ed the higher and lower 16 bits, cast the result to int, and determined the bin number by applying the modulus operator. Since both long and int are 32 bits in size there, the cast did not actually change the value.

In 64-bit mode, when 'key' (which actually held a pointer address) was cast to int after the XOR operation, the 64-bit long value was truncated to 32 bits, and bit 31 of the result was interpreted as the sign bit of the int. This bit was '1', so the resulting integer was a negative number. The modulus operator retains the sign of the first operand, so the result of the modulus was also negative. This resulted in a negative bin number! When the record was stored into this bin, it was written into the space of another variable, which happened to be another hash-table, but for a different kind of record.

The fix was simple:
return (int)((hi ^ lo) % size);

Instead of casting the result of XOR operation to int, I cast the result of modulus to int.


3.2 Assumptions on size and format of user-defined datatypes

Assumptions about the size and format of user-defined datatypes can be equally fatal. Consider a structure defined as follows:

typedef struct
{
    long value;
    char *name;
    int type;
} mystruct;

On a 32-bit architecture, the size of 'mystruct' is 12 bytes, as expected from adding up the sizes of the individual members. On a 64-bit architecture, however, the size is 24 bytes rather than the expected 20 bytes (because of 8-byte alignment padding).

Let n = sizeof(mystruct) [n = 12 on 32-bit and n = 24 on 64-bit].

Now, suppose a sequence of long-pointer-int values is stored in a chunk of memory, and the user wants to populate an array of 'mystruct' from this chunk. The user reads the first 'n' bytes from the chunk and assigns them to an element of the array, reads the next 'n' bytes and assigns them to the next element, and so on. This works fine on a 32-bit architecture, because the number of bytes read from memory is exactly the size of the structure. On a 64-bit architecture, however, when the user reads the first 'n' bytes, he actually retrieves 4 bytes more than required, misaligning the next set of values to be read.
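A small sketch of both points - the size difference, and a way to populate the structure that does not assume sizeof(mystruct) equals the packed record size. The unpacking helper is hypothetical; the packed layout of long, pointer, int is taken from the description above:

#include <stdio.h>
#include <string.h>

typedef struct
{
    long value;
    char *name;
    int type;
} mystruct;

/* Copy one packed (long, pointer, int) record field by field, so the padding
   inside 'mystruct' never enters the picture.  Returns a pointer to the next
   packed record. */
static const char *unpack_one(const char *chunk, mystruct *out)
{
    memcpy(&out->value, chunk, sizeof(out->value)); chunk += sizeof(out->value);
    memcpy(&out->name,  chunk, sizeof(out->name));  chunk += sizeof(out->name);
    memcpy(&out->type,  chunk, sizeof(out->type));  chunk += sizeof(out->type);
    return chunk;
}

int main(void)
{
    char packed[sizeof(long) + sizeof(char *) + sizeof(int)] = { 0 };
    mystruct s;

    unpack_one(packed, &s);
    printf("packed record: %lu bytes, sizeof(mystruct): %lu bytes\n",
           (unsigned long)sizeof(packed), (unsigned long)sizeof(mystruct));
    return 0;
}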


3.3 Use of non-standard calls on API datastructures

APIs frequently need to provide interface functions to external users to allow them to manipulate the API data. If the API is coded in C++, the internal implementation can be hidden from the users very effectively, using the data encapsulation and function overriding capabilities of object-oriented programming. However, if the API is written in C, it sometimes becomes inevitable that the internal data-structures are exposed to the user. In that case, if the data-structures have to change because of new requirements, users' code may break - especially code that assumes the API data works in a certain way instead of going through the API interface.

Consider that the API defines a type 'a_type', which can be initialized with a value 'a_null', or through an interface function 'a_init'. If 'a_null' happens to be equivalent to NULL, users frequently use NULL directly to initialize variables of type 'a_type'. Now, if the definition of 'a_type' is changed so that it is no longer compatible with NULL or '0', and the definitions of 'a_null' and 'a_init' are changed accordingly, the users' code will fail to compile, because it was not using the API interface.
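A sketch of how the breakage shows up ('a_type', 'a_null' and 'a_init' are the names used above, but the definitions here are invented purely for illustration):

/* Old API (roughly): a_type was a plain pointer, so NULL "happened to work".
       typedef struct a_rec *a_type;
       #define a_null ((a_type)0)
 */

/* New API: a_type becomes a small structure, no longer NULL-compatible. */
typedef struct { unsigned int id; void *impl; } a_type;
static const a_type a_null = { 0, 0 };
a_type a_init(void);

void user_code(void)
{
    a_type v1 = a_null;        /* fine: goes through the API's initializer */
 /* a_type v2 = NULL; */       /* no longer compiles: a struct cannot be
                                  initialized with a null pointer constant */
}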

Friday, April 27, 2007

My 64-bit porting experiences - II

2. Assumptions on size of predefined datatypes

2.1 Explicit assumption

Ideally, an implementation should not make assumptions about the size of pre-defined datatypes, because these sizes are not fixed by the standard; they are largely at the discretion of the compiler/OS implementation, provided some boundary conditions are satisfied (e.g., an int shall be at least 16 bits). Most of the code that I was porting was based on the premise that int, long and pointer are all of the same size, and that long long is twice this size. Assertions were placed in the code to ensure this:

assert( sizeof(long) == sizeof(int) );

assert( 2 * sizeof(char *) == sizeof(unsigned long long));

On 64-bit, these assertions were violated. In fact, I had to re-architect a significant part of the code.

2.2 Implicit assumption

Implicit assumptions often take the form of casts between different datatypes.

For a hash-table that hashes pointers, the key computation uses the pointer address:

unsigned int key = (unsigned)ptr >> 2;

This specific usage (casting a pointer to an integer) was treated differently on different platforms: on AIX there was no warning message; on Linux, the compiler issued a warning; on Solaris, the C compiler gave no warning, while the C++ compiler issued an error.
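A hedged fix for this snippet (on the LP64 platforms discussed here, unsigned long is pointer-sized, so nothing is truncated):

unsigned long key = (unsigned long)ptr >> 2;   /* keeps the full pointer width on LP64 */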

Tuesday, April 24, 2007

Memory, memory everywhere ...

... and not a block to link!!

A junior developer came to me in the morning, seeking help with calloc. I briefly described calloc and malloc to her, but I was somewhat doubtful about her requirement, so I asked what specific problem she was facing.

She had allocated a string using malloc, and was appending to it using strcat. The result she got was junk characters.
str = (char *)malloc(N * sizeof(char));
for (i = 0; i <= m; i++)
    strcat(str, arr_of_str[i]);


The answer lies in the behavior of malloc and strcat. The memory allocated by malloc is not initialized, so when 'str' is allocated, it is filled with junk characters. The function strcat appends a new string to an existing string, and to find the end of the existing string it searches for the null character ['\0']. In this case, strcat appended the new string [arr_of_str[i]] wherever it first found a null character in 'str' - the initial junk characters remained, and that is what she saw. In fact, she was lucky to get away with just a garbled string. Had there been no null character within 'str', strcat would have written into the memory of some other variable [wherever it found a null character in the memory space adjoining 'str'], and possibly caused a crash.

The fix was simply to initialize the newly allocated 'str':
str = (char *)malloc(N * sizeof(char));
strcpy(str, ""); /* or alternatively, str[0] = '\0' */
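Incidentally, since her question had started with calloc: calloc zero-fills the block it allocates, so it would also have left 'str' properly terminated from the start.

str = (char *)calloc(N, sizeof(char));   /* zero-filled, so str[0] is already '\0' */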

Now, this sounds very obvious. But how often the obvious is overlooked is perhaps borne out by the fact that I had come across the very same problem myself not long back.

Tuesday, April 17, 2007

My 64-bit porting experiences - I

1. Oversight

1.1 Implicit functions

The following code examples functioned harmlessly while the code was compiled and run in 32-bit mode, but crashed in 64-bit mode. Many of these crashes could have been avoided if compiler and/or lint warnings had been attended to.

char *str = (char *)malloc(n * sizeof(char));

This is a perfectly valid way to allocate a string of 'n' characters dynamically and assign it to 'str', which is a pointer of the appropriate type. Strangely, on 64-bit, the pointer address returned was 32-bit, and the program crashed with SIGBUS. Several rounds of trial and error revealed that the 'C' file containing this code did not include stdlib.h, the standard header that provides the declaration of 'malloc'.

ptr = func_ret_ptr(…);

The function 'func_ret_ptr' returns a pointer of a specific type, which is assigned to a pointer variable of the same type. The function comes from another library, which also has a corresponding header file, and that header is included in the file containing the above line of code. In this case too, on 64-bit, the value assigned to 'ptr' was 32-bit, while another, similar function correctly returned a 64-bit value. It took hours of debugging to notice that the declaration of this particular function was missing from the header file.

In yet another case, a function that returned a pointer was defined in one file, and used in another. An extern declaration provided the declaration to the code file that used the function. However, the extern declaration incorrectly declared the return type of the function to be int.

The explanation is rather obvious. When the declaration of a function (or variable) is not found, the 'C' compiler implicitly assumes it to be int. In 32-bit mode, int and pointer are of the same size, so an integer value assigned to a pointer still represents a valid address. But in 64-bit mode, the 32-bit value held by an int is not a valid address, and a pointer assigned this value cannot be successfully de-referenced. [It is only now, as I write this, that it has become obvious to me that all these cases are the same. Earlier, when I worked on these problems, I had thought that the malloc issue was due to a difference in compiler behavior with the standard libraries.]
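A minimal reproduction of the malloc case, as I understand it (the surrounding program is invented): with the #include missing, a C89 compiler assumes an implicit 'int malloc()', and on LP64 the returned address gets squeezed through a 32-bit int on its way to the pointer.

/* note: <stdlib.h> is deliberately NOT included */
#include <stdio.h>

int main(void)
{
    /* gcc warns here ("implicit declaration of function 'malloc'"),
       and in 64-bit mode the address may come back truncated. */
    char *str = (char *)malloc(100 * sizeof(char));

    printf("str = %p\n", (void *)str);
    return 0;
}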

1.2 Needless casting

Type * ptr_var = (Type *)(int)(func_ret_ptr(…));

There is no need to cast the value returned by the function to int. This worked in 32-bit mode, but in 64-bit mode it caused the 32 most significant bits of the pointer returned by the function to be lost.

The 64-bit saga

I have spent the better part of the last six months porting the product(s) that I work on to the 64-bit architectures of the supported platforms. In this effort, I learnt a lot of new things, though (unfortunately or fortunately) I did not have to deal with endian-ness issues, due to the nature of the code and its applications. One of the most interesting parts was to see things come into practice that I had only read about in books (but I'll still swear by Kernighan and Ritchie!).

I have tried to document the problems that I found interesting (the 'interesting' adjective applies only in retrospect; at the time I encountered them, they were just plain nasty!). The current focus is on hidden problems in existing code, most of which are what I call Casting Ouches, though I usually refer to all of them as Coding Malpractice.

Many of the issues I am going to cite are very straightforward, yet it is surprising how often they are embedded in code. Most of these examples (and none of them is hypothetical) are simply bad coding practices. But they can be suicidal when the scenario changes - as it did for me when I compiled the code for a 64-bit architecture. The code that had been working fine on 32-bit architectures started throwing up problems the moment it was run on 64-bit. I was seeing crashes and incorrect results - I cannot say which is worse.

It goes without saying that on 32-bit architectures, pointers (addresses) are 32 bits long, and on 64-bit architectures, pointers are 64 bits long. Further, on the UNIX flavors (Solaris, Linux, AIX, HP) that the code is supported on, both int and long are 32-bit on 32-bit architectures. On 64-bit, int is 32-bit, while long is 64-bit.

P.S. Most of this work was done on Sun Solaris 8, using Workshop graphical debugging tool. In few cases, Purify and Valgrind lent a helping hand.

Tuesday, April 3, 2007

To be or not to be

... that is the question.

This is a problem that I encountered long time back, and one that re-iterated an important lesson.

As I have mentioned before, our software products are supported on different UNIX flavors - Solaris, Linux, AIX and HP. At one time, we were perplexed by a difference in a report on HP compared to the other platforms [it was either additional or missing messages, I do not remember which now; but the behavior on HP was incorrect for sure]. Suspecting memory corruption, we ran Purify, but it did not report any errors. And neither did Valgrind.

Then began the tedious process of debugging. In such cases, the way we usually work is to start two debugging sessions in parallel - one on the port where the behavior is correct, and one on the port with the incorrect behavior - and compare the step-by-step execution on the two platforms [a cumbersome process, especially if the testcase is non-trivial]. A colleague and I had spent a few hours on this problem when we finally noticed the difference: a variable was compared against 0 [variable > 0] - on the other platforms the result of this comparison was TRUE, while on HP it was FALSE. This was strange, as the variable seemed to have the same value in both places. And then inspiration struck - the variable in question was a single-bit integer bit-field, and the value stored in it was '1'.

Now the question is - what should the value of such a variable be? For signed integers, the MSB [most significant bit] is the sign bit. In the case of a single-bit integer bit-field, should that only bit [which is also the MSB] be treated as the value bit, or as the sign bit?

In normal case [platforms other than HP], the single bit was treated as the value, and value of the bit-field was interpreted as '1'. On HP, the single bit was treated as the sign bit, and the value of the bit-field was interpreted as '-1'.

Moral of the Story: Always declare bit-fields as 'unsigned' [by default they may be treated as signed]. And if a bit-field is expected to take negative values, specify an additional bit explicitly for the sign information.
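A small program that shows the difference (the struct is made up; the output of the first line is implementation-defined, which is exactly the point):

#include <stdio.h>

struct flags
{
    int s : 1;            /* plain int bit-field: signedness is implementation-defined */
    unsigned int u : 1;   /* always holds 0 or 1 */
};

int main(void)
{
    struct flags f;

    f.s = 1;
    f.u = 1;
    printf("signed bit-field reads back as %d, unsigned as %u\n", f.s, f.u);
    printf("(f.s > 0) is %s\n", (f.s > 0) ? "TRUE" : "FALSE");
    return 0;
}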

Tuesday, March 27, 2007

Beat me, whip me, make me use uninitialized pointers

Well, the title is just a catchy line I "borrowed" from a friend's custom message. The problem I am about to discuss does not have to do with pointers, but it indeed has to do with uninitialized variables.

The software product that I work on is supported on three different UNIX platforms (Solaris, AIX, Linux), on several flavors of each. When a test case starts failing on some of the platforms, especially on a random basis, it is fairly safe to assume that a memory corruption has happened. The primary software tool that we use to analyze memory corruptions is IBM Rational Purify.

A few days back, some testcases in our test suite started failing due to missing messages in the log file - the failures were random, mostly on Solaris 9 and 10, and sometimes on AIX (almost never on Solaris 8 and Linux EE and OEE). I was almost certain that a memory corruption had been introduced in the code. What was surprising was that one particular message went missing, and that the failures existed only in one stream, though it was not very different from two other streams on which no such occurrences were reported. But such is the nature of memory corruptions.

So, I ran Purify on one such testcase, but it reported no errors.
Then, since I was fairly confident that it was nothing but a corruption, I tried Valgrind as well. Valgrind is free software (released under the GPL), available only on Linux (Purify is available for both Solaris and Linux), and it does not have a fancy GUI like Purify. But then, it does not have a fancy price tag either. [My primary development platform is Solaris, and the company buys Purify licenses, so my first preference is to use Purify rather than any other tool.]
Valgrind did point out a read of uninitialized memory - the value of a bit-field was tested in order to issue the message under analysis, and this bit-field was not initialized in some scenarios.

The interesting part is why the problem was not reported by Purify, which is usually quite accurate. It owes to the way bit-fields are stored in a structure or class object, and retrieved from memory. When a structure (or an object) declares some bit-fields, they are packed together and padded with unused bits to align the object at a word boundary. When the value of a bit-field is read, the processor reads the complete word rather than the individual field. Purify works at the granularity of a word, so it would report an uninitialized memory read whenever some of the bits of the word are not initialized. Now, the padding bits added for alignment will obviously ALWAYS be uninitialized; so, to avoid false warnings, Purify in its default mode suppresses uninitialized-read messages for bit-fields.

For those who are familiar with Purify, the Purify error code for an uninitialized read is UMR [Uninitialized Memory Read]. For bit-fields, the warning that is issued (and which is suppressed by default) is UMC [Uninitialized Memory Copy].

What not to do in C programming

I have been programming for many years now, but I have occasionally claimed that I am an artist at heart, and a software professional by chance. Having said that, I must add that this chance has given me the dubious opportunity to encounter many interesting problems that required hours of debugging to arrive at an amazingly simple solution.

I have primarily worked with C/C++ on UNIX so far, where I have come across problems that baffled many people for a long time, yet whose final solution turned out to be trivial - even plain common sense in many cases. These might interest [or perhaps even help] people who are programming in C or in any other language. In this blog, I am going to post problems as and when I come across them. And I would like to invite readers [as and when they "find" this blog] to share their experiences, or to become contributors, if they like.