Monday, November 26, 2012

... and that's the way it is!

One of the errors reported by Coverity on our code was this:
Logical vs. bitwise operator (CONSTANT_EXPRESSION_RESULT)
~sense is always 1/true regardless of the values of its operand. This occurs as the bool operand of assignment. Did you intend to use '!' rather than '~'


The code it referred to was something like:

bool sense;
...
if(~sense)
...


Note: If we write:
bool result = ~sense;
instead, we get the same result and same error.

Now this was quite surprising. The size of bool is 1-bit, and for years now, we have understood that though it is not a good practice to use bit-negation instead of logical negation for a one-bit quantity, the result is same (it's obviously wrong for multi-bit).
Refusing to believe coverity, we wrote a small CPP code:
======================

  bool sense = true;

  cout << "Size: " << sizeof(sense) << endl;

  if(~sense)
    cout << "Always true" << endl;
  else
    cout << "Can be false" << endl;

  bool result = ~sense;
  if(result)
    cout << "Always true" << endl;
  else
    cout << "Can be false" << endl;


======================
 


The result was totally unexpected, and stumped three of us who saw it:

Size: 1
Always true
Always true


Of course, any variable will be aligned to the word (8-bit boundary), but we expected any 0-padding to take place after the negation operation, and not before.

Our understanding that the bit-negation and logical negation yield same result for single bit, has perhaps resulted from years of conditioning of wroking with Verilog. It appears that C/C++ does not work that way. and we just have to accept it.

Thursday, November 1, 2007

Add-venture?

On my previous post, I was asked if there was any specific reason to get the 20 year old program compiled on the latest compiler? It should be expected that the older executable would continue to run even on new versions of the OS.

Here is the answer.

There are many reasons for porting:

1) In the industry, there is a standard set of the development tools (eg OS, compilers, etc) used in any organization, and these standards are revised from time to time. By ensuring that we have a well-defined standard, we can establish a consistency within the development teams, as well as with customers, to avoid build/run time issues. When the organization moves to the new standard, all [or at least, a defined set] of the products comply with it. One can say that this is an organizational policy matter.
In our case, we moved from gcc 3.2.3 to gcc4.1.1 company-wide, for Linux platforms.

2) In large product based environments, each product is not an independent entity in itself, but has several dependencies. While a particular product may not see many changes, the dependencies may change.
In our case, dependencies include few libraries, which are changing actively. To remain compatible, we have to port this old product on to new platforms/compilers.

3) Although this specific code is quite old, and newer products have since been launched, there still are some loyal customers, who need bug fixes in it. So, to deliver these fixes, the product has to be compiled anew. For new compilation, one must use the new standards of compilers.

Wednesday, October 31, 2007

Venturing Into Unchartered Territory

By default, the C compilers place string constants and (other constant data) into protected memory (read-only section), where they cannot be modified during the runtime.

When the option "-fwritable-strings" provided by many C compilers on different platforms (gcc, Solaris, xlc) is used, the compilers allocate strings in the [writable] data segment, so that the program can write to them. This option is provided for backward compatibility for old applications [pre-ANSI, I think] that rely on writing to strings.

However, it is quite intuitive that writing into string constants is a very bad idea - "constants" should be constant. I believe it is for this reason, that most of the compilers are now deprecating or discontinuing the "-fwritable-strings" option. Till some time back, we had been using the 3.2.3 version of gcc on Linux platforms. Recently, we moved to 4.1.1, in which this option is no longer supported. The option had actully been deprecated in 3.3, and discontinued in 3.4, but at my workplace, the next standard that we adopted after 3.2.3 is 4.1.1. There are a number of other changes that 4.1.1 has enforced, but I'll discuss that some other time.

Now, we have a product, which is almost 20 years old, and though there is no active development in it for last 5-6 years, it continues to be supported. So, we need to support it on the new platform/compiler standards, and therefore, we undertook the task of porting it on gcc 4.1.1. There are numerous examples in this code where constant strings are written to [this, and some other typical usages led me to believe that this code was written when the ANSI standard for C had not yet arrived.] One of the reasons to allow wiritng to string literals in old programs is given to be that earlier memory was severly limited, and allocation/deallocation were expensive. Apart from the build issues we faced due to stricter checks in the newer version, there were a large number of failures in the regression - when you attempt to write to a constant string literal, you get a segmentation fault.

Here is a small program to illustrate the problem:

====================================

#include
#include
#include

static char * keys [] =
{ "abc", "def", "ghi", "jkl" };

int my_sm_hash(char *strup)
{
int modified = 0;
char ch = '\0';

while ((ch = *strup) != 0)
{
if(islower(ch))
{
*strup = toupper(ch);
modified = 1;
}
++strup;
}
return modified;
}

void main()
{
int i, j;

for(i = 0; i <= 3; i++)
printf("\nkeys[i] : %s", keys[i]);

for(i = 0; i<= 3; i++)
{
j = my_sm_hash(keys[i]);
printf("\nWord no. %d, modified = %d", i, j);
}

for(i = 0; i <= 3; i++)
printf("\nkeys[i] : %s", keys[i]);

printf("\nDone\n");
}
====================================

When I compile it with gcc (older version, 3.2.3), here is the output I get:
====================================
keys[i] : abc
keys[i] : def
keys[i] : ghi
keys[i] : jklSegmentation Fault
====================================

When I compile the same program with the option -fwritable-strings, the output is as follows:
====================================
keys[i] : abc
keys[i] : def
keys[i] : ghi
keys[i] : jkl
Word no. 0, modified = 1
Word no. 1, modified = 1
Word no. 2, modified = 1
Word no. 3, modified = 1
keys[i] : ABC
keys[i] : DEF
keys[i] : GHI
keys[i] : JKL
Done
====================================

In the stand alone program, the solution does not look difficult, but in the context of a much larger program, when the function my_sm_hash may be called from multiple places, and there may be underlying assumptions elsewhere in the code, there are associated disadvantages with the possible solutions.

Solution 1: Use arrays instead of constant string literals: The documentation of gcc states "Use named character arrays when you need a writable string." What this means in practice is that you need to change lines like
char* myStr = \"Hello, world\";
to
char myStr[] = {\"Hello, world\"};
We could not use this:
1) Since the string constants in our code were actually a part of another array of structures (with constant values), this approach was not feasible for us.

Solution 2: Create, modify and return a new string: allocate a new string, and copy the existing string in it. Then modify this new string. This may be done inside the function, or from the point of call to this function.
There were two problems we discovered with this mechanism:
1) We could not determine when and where to free the newly allocated strings, so it could result in a considerable increase in the memory footprint.
2) Some other parts of the programs relied on this part of the string to be changed, so even though the crashes were avoided, testcases continued to fail.

Eventually, we realized we will have to live with memory leaks unless we want to rearchitect a huge codebase which is not supported, and not understood by the present team. We could at the most try to control the extent of damage. So, we used different mechanisms for specific scenarios:
  • There were some global tables [array of structures with constant data], that had constant strings. We reallocated the string fields of such structures once, during initializaion stage. This ensured only one-time leak of small amount of memory [instead of repeated leaks, which would have resulted from solution 2 above.]
  • At places, where static strings were defined, modified and used within a local scope, we used a static variable to control the allocation - so that the string was allocated only the first time this function was called.
  • We had cases, where the char pointer was assigned a string constant as a default value. And this default value was overwridden in different cases using a switch/case statement. This was modified so that the pointer was not preassigned, and if any of the enumerated cases were not encounterd [i.e. an unexpected scenario], the pointer was assigned the constant string.
  • There were even cases, where a static string was declared, only to be modified [capitalized, in this case], and used immediately. This was trivial - use the upper case in the first place!

Wednesday, October 3, 2007

Identity Crisis

Early this week, I got crashes in many testcases, on 64-bit mode, and only on Linux platforms (Solaris was fine in both 32- and 64-bit modes). Debugging such issues is a big problem, because they are almost certainly tricky memory corruptions, and debugging in 64-bit environment is seriously hampered by unavailability of proper support in debugging tools.

There are some specific techniques I follow to debug such corruptions or environment specific issues.

1) Reproduce the crash in an environment, where it is easier to debug.

Now, Workshop, the graphical debugging tool that I use on Solaris supports both 32- and 64- bit modes well. I like it for its user-friendly interface, and certain features like pop stack, that are not provided by other debuggers. But since I could not reproduce the crash on Solaris, I had to start my investigation on Linux. On linux, I use DDD, which is a GUI based on GDB.
I was in for a surprise, since the DDD did not load my executable. I checked the usual suspects - Build, Run arguments, paths, library paths etc. This did not help.
After asking around, a helpful team-member told me that there is a separate 64-bit version of GDB. I set my path and library path to pick this version, and was able to load the executable in DDD.

2) Compare the behavior on two platforms. Start with a top-level analysis, followed by a step-by-step comparison.

DDD showed me the point of the crash, but did not allow me to access the object, a call to whose method generated the crash. Morover, DDD does not honour the default values of function arguments, that one can provide in CPP code.

The line of code was something like:

if (obj && obj->hasProp1())
{
info = obj->getInfo(PROP1); // => the crash occured here
}

I ran the testcase on Solaris, but the execution did not reach the offending line!

3) Quick, high-level debugging through "printf" statements.
I put a "printf" statement before the "if" statement, to display the name of the object 'obj' and the value returned by hasProp1() call. I compared the output on Solaris and Linux, and the output was exactly same! This value was '0' on both platforms, and how did the code entered the "if" condition, is a mystery I could not solve till the end.

4) Use memory analysis tools. Try different tools, as it frequently happens that one tool catches the problem that the other one coud not.

I ran Valgrind on Linux, but it did not show any errors. Valgrind is available only on Linux, but it has no build-time requirements.

One point to note is that one can catch corruptions equally well on any platform, since the problem exists everywhere, even if it does not manifest in certain scenarios.
Purify was initially available on Solaris only, though now it is available on Linux as well. So I'm more comfortable using Purify on Solaris, and also in the past, I've had problems using Purify on Linux. So, I tried to run Purify on Solaris. However, I was not able to run the 64-bit Purify build - it crashed even before executing the first line. I tried build/run several times a number of times, with the same outcome. After struggling with it for a long while, a thought struck me - I was running it on S10. Our buid platform is S8, but with the same build, we support run on S9 and S10 as well. I had been running non-purify build on S10, and I continued to do so with the purify build. So, in yet another attempt, I ran the testcase on S8. This time it did run! 64-bit purify build does not run on S10!
However, the run still did not help me much - it kept delivering SIGBUS infinitely. I loaded the purify build in Workshop, and put a breakpoint in the function purify_stop_here - this function is a debugging hook provided by Purify - it is called before Purify reports a memory violation. I got the point where the SIGBUS was reported, and analyzed that part of the code, but it only led me a wild goose chase. But well, that is life!

5) When all else fails, ask around!

I asked many people in the group if they have debugged 64-bit binaries on Linux. Finally, one person told me he had done so, but he does not have much faith in DDD. So he used GDB on the shell. He also gave me a pointer to the version of GDB he had (successfully) used. This was another version of 64-bit GDB!

6) Revisit steps

Meanwhile, I had also made a Purify build on Linux. I ran this Purify build with the new version of GDB. Purify reported an ABR (Array Bound Error) on the same line that DDD had reported earlier. I put a breakpoint in the function purify_stop_here, and continued till I reached the point of ABR error. This time, the version of GDB correctly reported the name of the object. I moved to Solaris Workshop once again, and looked for this particular object. Having the name of the object made it quite easy this time. And what I discovered was absolutely stunning - the object was actually of the base class, while the properties we were querying on it were functions and members of the derived class. The object had been return by call to a function, which could return either kind of object. The fix was, of course, extremely simple - I modified the "if" condition to:
if (obj && obj->isDrvCls && obj->hasProp1())
{
...
}

The class declarations were as follows:

class baseCls
{
private:
...
char *name;
public:
...
char *getName() { return name; }
int isDrvCls() { return TRUE; }

}

class drvCls : baseCls
{

public:
...
int hasProp1() { return type_ & PROP1_MASK; }
listCls *info() { return lst1_; }
infoCls *getInfo(int propType) { return info()->findInfo(propType; }
int isDrvCls() { return TRUE; }
private:
...
int type_;
listCls *lst1_;
}

Wednesday, June 27, 2007

My 64-bit porting experiences - VI

6. Excessive code-reuse

We have always been instructed to reuse code, as it makes the code more maintainable. But it can sometimes be taken too far (too much of a good thing ?!).

6.1 Objects of different sizes in a union

Class A{
int getVal {return bVal;}
exp *getExpr {return bExpr;}
int hasExpr {return isExpr;}
union {
int bVal;
exp *bExpr;
}
int isExpr;
};

The object of this class ‘A’ can have a data-member that is either a constant value (bVal), or an expression (bExpr) if the value is not constant. When the value is not constant, ‘isExpr’ is set to ‘1’, and the user is expected to use the expression returned by ‘getExpr’ call.

In the following piece of code , ‘obj_a’ is an object of class ‘A’:

int tmp = 0;
exp *tmpExpr = NULL;
if (cond)
tmp = 1;
else
tmp = obj_a->getVal();

if(obj_a->hasExpr())
{
if(tmp == 1)
tmpExpr = createExpr(1);
else
tmpExpr = obj_a->getExpr();
}
else
tmpExpr = createExpr(tmp);

Incorrect values in a testcase, in 64-bit mode, were traced to this piece of code. Even though ‘obj_a’ contained expression (which represented a value other than ‘1’), and ‘cond’ was ‘0’, this code inferred the value represented by ‘obj_a’ to be ‘1’.

Reason: ‘obj_a’ represented an expression, so obj_getVal() should not have been used. But it was called, and it returned ‘1’. In 64-bit mode, the members of the union – int and pointer – have sizes 32 and 64 bits respectively. When the integer value was queried, the higher 32 bits were returned, which constituted ‘1’ in integer value [the pointer was 0x1_hhhh_hhhh]. This code worked without problems in 32-bit, because int and pointer have same size, and a pointer cannot be 0x1.


6.2 Overriding functions through implicit casting

This problem was encountered at many places in the code, in different forms.

A hash-table to hash pointers had been implemented, and widely used. The interface functions to insert new records, or find existing ones, accepted char** arguments (and dereferenced them to obtain the pointer). Later on, there was a requirement to hash integer values as well. Instead of implementing a new hash-table (or even writing new interface functions), the previous table and its interface was reused – by passing int* on the actual. When the interface functions dereferenced the pointer passed at the actual, the resulting value was int, and therefore not a valid address.

Sunday, June 17, 2007

My 64-bit porting experiences -V

5. Use of compatible, but incorrect datatypes

An API function, ‘func1’ returns int*, which actually points to an array of two integers. The part of code that uses this function is as follows:

long *lp = func1(…);
int lval = lp[0];
int rval = lp[1];

The results were correct in 32-bit, because int and long are same size. On 64-bit, the return values were incorrect, because of the difference in size of int and long. ‘lp[0]’ represents a 64-bit value, which contains both the int values that are returned by ‘func1’. When it is assigned to ‘lval’, the lower 32 bits are assigned to ‘lval’, which was the value that needed to be assigned to ‘rval’. ‘lp[1]’ reads next 64 bits from memory and assigns the lower 32 bits of it to ‘rval’. So, ‘lval’ gets incorrect value, while ‘rval’ gets junk.

Sunday, June 10, 2007

My 64-bit porting experiences - IV

4. Masking operations
The following may be considered as example of oversight, or assumption on datatype size. I have made this an independent section, because masking operations are frequently used in C code for fast storage and access of data. Masks are usually implemented through macros, and problems in these macros are difficult to debug.

4.1 Storing additional information in a pointer
It may be considered that manipulation of pointer addresses will be very prone to errors. But given the fact that addresses are always word-aligned, the last bit in a pointer value is always 0. To take advantage of this fact, applications sometimes store a boolean property in the last bit of the address, and save runtime memory. As long as it takes care to reset the last bit of such pointers before dereferencing them, the applications are safe. The set and reset operations are usually coded as macros in ‘C’ code, in the interest of runtime.

In the following code snippet, ‘x’ is a pointer under manipulation. M_IS_LIST queries the property on the pointer, M_SET_LIST sets the property on the pointer, and M_GET_LIST retrieves the pointer when before it is de-referenced.
#define M_IS_LIST(x) (((unsigned long)(x)) & 0x1U)
#define M_SET_LIST(x) ((x) |= 0x1U)
#define M_GET_LIST(x) (((unsigned long)(x)) & ~0x1U)

This works for 32-bit mode. In 64-bit, I was getting corrupt pointers at some point in code, which was traced to these macros after considerable effort.

Reason: In its own context, a constant is treated as integer value. So, 0x1U is considered as a 32 bit value. Bit-wise AND with a pointer results in a 32-bit quantity (the smaller of the values being AND’ed).

Solution: To make the mask the same size as the variable:
#define M_IS_LIST(x) (((unsigned long)(x)) & 0x1UL)
#define M_SET_LIST(x) ((x) |= 0x1UL)
#define M_GET_LIST(x) (((unsigned long)(x)) & ~0x1UL)


4.2 Manipulating addresses

Yet another memory corruption was caused by the following code snippet, which purported to provide the next pointer to be processed:

if (((unsigned long)pstr) & 0x3U)
pstr = (char *)(((unsigned long)(pstr + 4)) & ~0x3U);

It was very difficult to understand the function of this simple line of code, due to use of constants, and absence of comments. I cannot stress enough the necessity of documentation in code, especially with tricky calculations like this.

Well, it re-aligns a pointer, based on the understanding that the addresses are aligned at 4 bytes (last two bits are 0). However, in 64-bit mode, the addresses are aligned at 8 bytes (last 3 bits are 0). Therefore, calculations needed to be modified [though I wish it could be more generic], as follows:

#ifdef 64_BIT_BUILD
#define addrMask 0x7UL
#else
#define addrMask 0x3UL
#endif

if (((unsigned long)pstr) & addrMask)
pstr = (char *)(((unsigned long)(pstr + sizeof(long))) & ~addrMask);