Wednesday, May 9, 2007

My 64-bit porting experiences - III

3. Assumptions on datatypes

3.1 Assumptions on size of built-in datatypes

I traced crashes on some platforms to a corruption in hash-tables (though I am still wondering why these testcases did not crash on the rest of the platforms). In the hash-table designated for a certain kind of records, a different kind of record was stored.

The reason, which proved to be very difficult to determine [I spent almost three days on this!], was an assumption on size of built-in datatypes in the hash function (computation for the bin number):

int hash_func(long key)
{
int size = NUM_BINS;

long hi = key >> 16, lo = key;
return (int)(hi ^ lo) % size;
}

The long type argument ‘key’ is derived from a pointer by casting it to long. On 32-bit systems, XOR-ed the higher and lower 16 bits, cast the result to int and determined the bin number by applying modulus function. Since both long and int are 32-bits in size, the cast operation did not affect the size in actual.

In 64-bit mode, when ‘key’ (which actually referred to a pointer address) was cast to int after XOR operation, the 64-bit long variable was truncated to 32 bits, and the 32nd MSB was interpreted as the sign-bit of the int. This MSB was ‘1’, so the resulting integer was a negative number. The modulus function retains the sign of the first operand, therefore the result of the modulus was also a negative number. This resulted in a negative bin number! When the record was stored into this bin, it was stored into the space of another variable, which was incidentally also a hash-table, but of a different kind of record.

The fix was simple:
return (int)((hi ^ lo) % size);

Instead of casting the result of XOR operation to int, I cast the result of modulus to int.


3.2 Assumptions on size and format of user-defined datatypes

Assumption on size and format of user-defined dataypes can be equally fatal. Consider a structure defined as follows:

typedef struct
{
long value;
char *name;
int type;
} mystruct;

In 32-bit architecture, the size of ‘mystruct’ is 12 bytes, as expected by adding the size of the individual members. In 64-bit architecture, however, the size is 24 bytes, rather 20 bytes (because of the 8-byte alignment).

Let n = sizeof(mystruct) [n = 12 on 32-bit and n = 24 on 64-bit].

Now, suppose a sequence of long-pointer-int values is stored in a chunk of memory, and the user wants to populate an array of ‘mystruct’ from this chunk. The user reads first ‘n’ bytes from the chunk and assigns it to an element of the array, reads next ‘n’ bytes from the chunk and assigns it to the next element, and so on. This works fine on 32-bit architecture, because the numbers of bytes read from the memory is exactly the size of the structure. However, on 64-bit architecture, when the user reads first ‘n’ bytes and assigns it to memory, he actually retrieves 4 bytes more than required, misaligning the next set of values to be read.


3.3 Use of non-standard calls on API datastructures

APIs frequently need to provide interface functions to external users to allow them to manipulate the API data. If the API is coded using C++, the internal implementation of the API can be hidden from the users very effectively, using data encapsulation and function overriding capabilities of Object-oriented programming. However, if the API code is written in C, sometimes it becomes inevitable that the internal data-structures are exposed to the user. In this case, if the data-structures need to be changed because of new requirements, the users may be required API is assumed to function in a certain way, instead of using the API interface.

Consider that the API defines a type ‘a_type’, which can be initialized with a value ‘a_null’, or through an interface function ‘a_init’. However, ‘a_null’ is equivalent to NULL, and so the users frequently use NULL to initialize variables of type ‘a_type’. Now, if the definition of ‘a_type’ is changed so that it is no longer compatible with NULL or ‘0’, and the definition of ‘a_null’ and ‘a_init’ is also changed accordingly, the users’ code will still fail to compile, because it was not using the API interface.

No comments: