Friday, April 27, 2007

My 64-bit porting experiences - II

2. Assumptions on size of predefined datatypes

2.1 Explicit assumption

Ideally, code should not make assumptions about the size of pre-defined datatypes, because these sizes are not standardized; they are left to the discretion of the platform and compiler, provided certain boundary conditions are satisfied (e.g., an int shall be at least 16 bits). Most of the code that I was porting was based on the premise that int, long and pointer are all of the same size, and that long long is twice this size. Assertions were placed in the code to ensure this:

assert( sizeof(long) == sizeof(int) );

assert( 2 * sizeof(char *) == sizeof(unsigned long long));

On 64-bit, these assertions were violated. In fact, I had to re-architect a significant part of the code.

2.2 Implicit assumption

Implicit assumptions often take the form of casts between different datatypes.

For a hash-table that hashes pointers, the key computation uses the pointer address:

unsigned int key = (unsigned)ptr >> 2;

This specific usage (casting a pointer to an integer) was treated differently on different platforms: on AIX, there was no warning message; on Linux, the compiler issued a warning; on Solaris, the C compiler gave no warning, while the C++ compiler issued an error.
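
One portable way to do this, in retrospect, is to go through uintptr_t (a minimal sketch, assuming a C99 compiler with stdint.h; the function and parameter names are mine), so that the full width of the pointer participates in the computation before the key is narrowed:

#include <stdint.h>

unsigned int hash_ptr(const void *ptr, unsigned int table_size)
{
    /* uintptr_t is wide enough to hold a pointer on both 32- and 64-bit,
       so no address bits are silently discarded before hashing */
    uintptr_t addr = (uintptr_t)ptr;
    return (unsigned int)((addr >> 2) % table_size);
}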

Tuesday, April 24, 2007

Memory, memory everywhere ...

... and not a block to link!!

A junior developer came to me in the morning, seeking help with calloc. I briefly described calloc and malloc to her, but I was somewhat doubtful of her requirements, so I asked what specific problem she was facing.

She had allocated a string using the malloc call, and was appending to the string using strcat. The result she got was junk characters.
str = (char *)malloc(N * sizeof(char));
for (i = 0; i <= m; i++)
    strcat(str, arr_of_str[i]);


The answer lies in the behavior of malloc and strcat. The memory allocated by malloc is not initialized, so when 'str' is allocated, it is filled with junk characters. The function strcat appends a new string to an existing string, and to identify the end of the existing string it searches for the null character ['\0']. In this case, strcat appended the new string [arr_of_str[i]] wherever it first found a null character in 'str'; the initial characters remained junk, and this is what she saw. In fact, she was lucky to get away with a garbled string. Had there been no null character in 'str' at all, strcat would have written into the memory of some other variable [wherever it first found a null character in the memory adjoining 'str'], and could have caused a crash.

The fix was simply to initialize the newly allocated 'str':
str = (char *)malloc(N * sizeof(char));
strcpy(str, ""); /* or alternatively, str[0] = '\0'; */
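
Incidentally, since her original question was about calloc, that call would also have served: calloc zero-fills the allocated block, so the first byte is already a null terminator.

str = (char *)calloc(N, sizeof(char)); /* zero-filled, so str[0] is already '\0' */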

Now, this sounds very obvious. But how often the obvious is overlooked is perhaps borne out by the fact that I had come across the very same problem myself not long back.

Tuesday, April 17, 2007

My 64-bit porting experiences - I

1. Oversight

1.1 Implicit functions

The following code examples functioned harmlessly when the code was compiled and run on 32-bit, but crashed on 64-bit. Many of them could have been avoided if compiler and/or lint warnings had been attended to.

char *str = (char *)malloc(n * sizeof(char));

This is a perfectly valid way to allocate a string of 'n' characters dynamically and assign it to 'str', which is a pointer of the appropriate type. Strangely, on 64-bit, the pointer address returned was 32-bit, and the program crashed with SIGBUS. Several rounds of trial and error revealed that the 'C' file containing this code did not include stdlib.h, the standard header that declares 'malloc'.

ptr = func_ret_ptr(…);

The function 'func_ret_ptr' returns a pointer of a specific type, which is assigned to a pointer variable of the same type. The function comes from another library, which also has a corresponding header file, and that header is included in the file containing the above line of code. In this case too, on 64-bit, the value assigned to 'ptr' was 32-bit, while another, similar function correctly returned a 64-bit value. It took hours of debugging to notice that the declaration of this particular function was missing from the header file.

In yet another case, a function that returned a pointer was defined in one file and used in another, with an extern declaration making it visible to the file that used it. However, the extern declaration incorrectly declared the return type of the function to be int.

The explanation is rather obvious. When the declaration of a function (or variable) is not found, the 'C' compiler (pre-C99) implicitly assumes it returns int. In 32-bit mode, int and pointer are of the same size, so an integer value assigned to a pointer still represents a valid address. But in 64-bit mode, the 32-bit value carried by an int is not a valid address, and a pointer assigned this value cannot be successfully de-referenced. [It is only now, when I am writing this document, that it has become obvious to me that all these cases are similar. Earlier, when I worked on these problems, I had thought that the malloc issue was due to a difference in compiler behavior with the standard libraries.]
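
A minimal sketch of this failure mode (the file layout and the function name 'get_buffer' are made up for illustration):

/* lib.c */
#include <stdlib.h>
char *get_buffer(void) { return (char *)malloc(100); }

/* main.c -- note: no declaration of get_buffer is visible here */
int main(void)
{
    char *p;
    p = (char *)get_buffer(); /* pre-C99 compilers implicitly declare
                                 get_buffer as returning int; in 64-bit
                                 mode the upper 32 bits of the address
                                 are lost in the int return value */
    p[0] = '\0';              /* de-referencing the truncated address
                                 crashes with SIGSEGV/SIGBUS */
    return 0;
}

Compiling with warnings enabled (e.g., -Wall on gcc) flags the implicit declaration, which is exactly the kind of warning that was ignored.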

1.2 Needless casting

Type * ptr_var = (Type *)(int)(func_ret_ptr(…));

There is no need to cast the value returned by the function to int. This worked in 32-bit mode, but in 64-bit mode the intermediate cast to int caused the 32 MSBs of the returned pointer to be lost.
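
The fix was simply to drop the casts altogether; the function already returns the right type, so no conversion is needed:

Type * ptr_var = func_ret_ptr(…);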

The 64-bit saga

I have spent the better part of the last six months porting the product(s) that I work on to the 64-bit architectures of the supported platforms. In this effort, I learnt a lot of new things, though (unfortunately or fortunately) I did not have to deal with endian-ness issues, due to the nature of the code and its applications. One of the most interesting parts was seeing things come into practice that I had only read about in books (but I'll still swear by Kernighan and Ritchie!).

I have tried to document the problems that I found interesting (the 'interesting' adjective is only in retrospect; at the time I encountered them, they were just plain nasty!). The current focus is on hidden problems in existing code, most of which are what I call Casting Ouches, though I usually refer to all of them as Coding Malpractice.

Many of the issues that I am going to cite are very straightforward, yet it is surprising how often they are embedded in code. Most of these examples (and none of them is hypothetical) are simply bad coding practices. But they can be suicidal when the scenario changes; it happened to me when I compiled the code on a 64-bit architecture. The code that had been working fine on 32-bit started throwing up problems the moment it was run on 64-bit: I was seeing crashes and incorrect results, and I cannot say which is worse.

It goes without saying that on 32-bit architectures pointers (addresses) are 32 bits long, and on 64-bit architectures they are 64 bits long. Further, on the UNIX flavors (Solaris, Linux, AIX, HP) that the code is supported on, both int and long are 32-bit on 32-bit architectures. On 64-bit, int remains 32-bit, while long becomes 64-bit.
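
A quick sanity check, which I would suggest running on any new port, is to print the sizes directly (a minimal sketch; on ILP32 it should report 4/4/4, and on LP64, 4/8/8):

#include <stdio.h>

int main(void)
{
    /* ILP32 typically prints: int 4, long 4, pointer 4
       LP64 typically prints:  int 4, long 8, pointer 8 */
    printf("int %lu, long %lu, pointer %lu\n",
           (unsigned long)sizeof(int),
           (unsigned long)sizeof(long),
           (unsigned long)sizeof(void *));
    return 0;
}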

P.S. Most of this work was done on Sun Solaris 8, using the Workshop graphical debugging tool. In a few cases, Purify and Valgrind lent a helping hand.

Tuesday, April 3, 2007

To be or not to be

... that is the question.

This is a problem that I encountered a long time back, and one that reiterated an important lesson.

As I have mentioned before, our software products are supported on different UNIX flavors: Solaris, Linux, AIX and HP. At one time, we were perplexed by a difference in a report on HP compared to the other platforms [it was either additional or missing messages, I do not remember the case now; but the behavior on HP was incorrect for sure]. Suspecting memory corruption, we ran Purify, but it did not report any errors. And neither did Valgrind.

Then began the tedious process of debugging. In such cases the way we usually work is to start two debugging sessions in parallel, one on the port where the behavior is correct and one on the port with the incorrect behavior, and compare step-by-step execution on the two platforms [a cumbersome process, especially if the testcase is non-trivial]. A colleague and I had spent a few hours on this problem when we finally noticed the difference: a variable was compared against 0 [variable > 0]. On the other platforms the result of this operation was TRUE, while on HP the result was FALSE. This was strange, as the variable seemed to have the same value in both places. And then inspiration struck us: the variable in question was a single-bit integer bit-field, and the value was '1'.

Now the question is: what should the value of such a variable be? For signed integers, the MSB [most significant bit] is the sign bit. In the case of a single-bit integer bit-field, should the only bit [which is also the MSB] be treated as the value bit, or as the sign bit?

In the normal case [platforms other than HP], the single bit was treated as the value, and the value of the bit-field was interpreted as '1'. On HP, the single bit was treated as the sign bit, and the value of the bit-field was interpreted as '-1'.

Moral of the Story : Always declare bit-fields as 'unsigned' [for a plain 'int' bit-field, whether it is signed or unsigned is implementation-defined, which is exactly why the platforms disagreed]. And if the bit-field is expected to take negative values, explicitly specify an additional bit for the sign information.
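
A minimal sketch of both the trap and the fix (the struct and field names are mine):

#include <stdio.h>

struct flags {
    int          s : 1; /* plain bit-field: signedness is implementation-defined */
    unsigned int u : 1; /* unsigned bit-field: the single bit is the value bit */
};

int main(void)
{
    struct flags f;
    f.s = 1; /* where the plain bit-field is signed, this reads back as -1 */
    f.u = 1; /* always reads back as 1 */
    printf("s = %d, u = %u\n", f.s, f.u);
    printf("s > 0 is %s\n", (f.s > 0) ? "TRUE" : "FALSE"); /* the HP surprise */
    return 0;
}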