Thursday, May 24, 2007
Who moved my cheese?
Here is another real-life example of buggy code that I found.
This piece of code aims to create a copy of a doubly linked list.
// - localHead is the head of the list of blocks being copied
// - new_block is the block next to the tail of the list being copied
// - bodyTail is the tail of the current list,
//   at the end of which the newly copied list is appended
listBlk *tmpBlk = localHead;
listBlk *nxtBlk = NULL;
listBlk *dupFirstBlk = NULL;
while (tmpBlk && (tmpBlk != new_block))
{
    nxtBlk = tmpBlk->copy();
    if (dupFirstBlk)
    {
        nxtBlk->back(tmpBlk->back());
        nxtBlk->next(tmpBlk->next());
    }
    else
    {
        nxtBlk->setBackNext(bodyTail);
        nxtBlk->next(new_block);
        dupFirstBlk = nxtBlk;
    }
    tmpBlk = tmpBlk->next();
}
nxtBlk->next(new_block);
This had been lying around dormant for years ... (just because the list never had more than one block, owing to the way the code is structured).
And once again, what is the problem here?
Wednesday, May 16, 2007
It's a bug!
I was working on simulating break and continue statements in loops.
First I worked with for loops, and got them working. Then I started with while. I wrote a simple while loop, and a simple break statement in it:
j = 0;
while (j <= 7)
begin
    if (j == 4)
        break;
    out1[j] = 1;
    j = j + 1;
end
This worked fine. So, to check out continue, I simply replaced break with continue [as I had done with for] and expected it to work straight away.
j = 0;
while (j <= 7)
begin
    if (j == 4)
        continue;
    out1[j] = 1;
    j = j + 1;
end
But I was in for a shock! The tool got stuck! I looked at the testcase, and at the code, twice. Concluding that there was a bug in the tool's implementation, I consulted a colleague. He agreed that there was a problem in the tool. After playing around with the test for a few minutes, I realized that the tool was smart enough. We were the dumb party!!
[OK, so what is the bug?!]
Wednesday, May 9, 2007
My 64-bit porting experiences - III
3. Assumptions on datatypes
3.1 Assumptions on size of built-in datatypes
I traced crashes on some platforms to a corruption in hash-tables (though I am still wondering why these testcases did not crash on the rest of the platforms). In a hash-table designated for a certain kind of record, a different kind of record was being stored.
The reason, which proved very difficult to determine [I spent almost three days on this!], was an assumption about the size of built-in datatypes in the hash function (the computation of the bin number):
int hash_func(long key)
{
    int size = NUM_BINS;
    long hi = key >> 16, lo = key;
    return (int)(hi ^ lo) % size;
}
The long argument ‘key’ is derived from a pointer by casting it to long. On 32-bit systems, the function XOR-ed the upper and lower halves of the key, cast the result to int, and applied the modulus to determine the bin number. Since long and int are both 32 bits wide there, the cast had no effect on the value.
In 64-bit mode, when the XOR result (computed from what was actually a pointer address) was cast to int, the 64-bit long was truncated to 32 bits, and bit 31 of the original value became the sign bit of the int. That bit happened to be ‘1’, so the resulting integer was negative. The modulus operator retains the sign of its first operand, so the result of the modulus was also negative. This produced a negative bin number! When the record was stored into this bin, it was written into the space of another variable, which was incidentally also a hash-table, but one holding a different kind of record.
The fix was simple:
return (int)((hi ^ lo) % size);
Instead of casting the result of the XOR operation to int, I cast the result of the modulus to int.
3.2 Assumptions on size and format of user-defined datatypes
Assumptions about the size and layout of user-defined datatypes can be equally fatal. Consider a structure defined as follows:
typedef struct
{
    long value;
    char *name;
    int type;
} mystruct;
On a 32-bit architecture, the size of ‘mystruct’ is 12 bytes, as expected from adding up the sizes of the individual members. On a 64-bit architecture, however, the size is 24 bytes rather than 20 (because of 8-byte alignment padding).
Let n = sizeof(mystruct) [n = 12 on 32-bit and n = 24 on 64-bit].
Now, suppose a sequence of long-pointer-int values is stored in a chunk of memory, and the user wants to populate an array of ‘mystruct’ from this chunk. The user reads the first ‘n’ bytes from the chunk and assigns them to an element of the array, reads the next ‘n’ bytes and assigns them to the next element, and so on. This works fine on a 32-bit architecture, because the number of bytes read from memory is exactly the size of the structure. On a 64-bit architecture, however, each read of ‘n’ bytes retrieves 4 bytes more than the stored values occupy, misaligning every subsequent read.
3.3 Use of non-standard calls on API datastructures
APIs frequently need to provide interface functions that let external users manipulate the API's data. If the API is coded in C++, the internal implementation can be hidden from users very effectively, using the data-encapsulation and function-overriding capabilities of object-oriented programming. If the API is written in C, however, it sometimes becomes inevitable that internal data structures are exposed to the user. In that case, if the data structures need to change because of new requirements, any user code that assumed the data structures work a certain way, instead of going through the API interface, will break.
Consider that the API defines a type ‘a_type’, which can be initialized with a value ‘a_null’ or through an interface function ‘a_init’. If ‘a_null’ happens to be equivalent to NULL, users will frequently use NULL directly to initialize variables of type ‘a_type’. Now, if the definition of ‘a_type’ is changed so that it is no longer compatible with NULL or ‘0’, and the definitions of ‘a_null’ and ‘a_init’ are changed accordingly, those users' code will fail to compile, because it was not using the API interface.