Wednesday, October 31, 2007

Venturing Into Uncharted Territory

By default, C compilers place string constants (and other constant data) in protected memory (a read-only section), where they cannot be modified at run time.

When gcc's "-fwritable-strings" option is used (compilers on other platforms, such as Solaris cc and xlc, provide comparable options under their own names), the compiler allocates strings in the [writable] data segment, so that the program can write to them. This option is provided for backward compatibility with old applications [pre-ANSI, I think] that rely on writing to strings.

However, it is quite intuitive that writing into string constants is a very bad idea: "constants" should be constant. I believe it is for this reason that most compilers are now deprecating or discontinuing options like "-fwritable-strings". Until recently, we had been using gcc 3.2.3 on Linux platforms. We then moved to 4.1.1, in which this option is no longer supported. The option had actually been deprecated in 3.4 and removed in 4.0, but at my workplace the next standard we adopted after 3.2.3 was 4.1.1. There are a number of other changes that 4.1.1 enforces, but I'll discuss those some other time.

Now, we have a product that is almost 20 years old, and though there has been no active development on it for the last 5-6 years, it continues to be supported. So we need to support it on the new platform/compiler standards, and we therefore undertook the task of porting it to gcc 4.1.1. There are numerous places in this code where constant strings are written to [this, and some other typical usages, led me to believe that this code was written before the ANSI standard for C had arrived]. One reason commonly given for old programs writing to string literals is that memory was severely limited in those days, and allocation/deallocation was expensive. Apart from the build issues we faced due to the stricter checks in the newer version, there were a large number of failures in the regression: when you attempt to write to a constant string literal, you get a segmentation fault.

Here is a small program to illustrate the problem:

====================================

#include <stdio.h>
#include <ctype.h>

static char * keys [] =
{ "abc", "def", "ghi", "jkl" };

/* Upper-cases strup in place; returns 1 if any character was changed. */
int my_sm_hash(char *strup)
{
    int modified = 0;
    char ch = '\0';

    while ((ch = *strup) != 0)
    {
        if (islower((unsigned char)ch))
        {
            /* This write is what faults when strup points at a string literal. */
            *strup = (char)toupper((unsigned char)ch);
            modified = 1;
        }
        ++strup;
    }
    return modified;
}

int main(void)
{
    int i, j;

    for (i = 0; i <= 3; i++)
        printf("\nkeys[i] : %s", keys[i]);

    for (i = 0; i <= 3; i++)
    {
        j = my_sm_hash(keys[i]);
        printf("\nWord no. %d, modified = %d", i, j);
    }

    for (i = 0; i <= 3; i++)
        printf("\nkeys[i] : %s", keys[i]);

    printf("\nDone\n");
    return 0;
}
====================================

When I compile it with gcc (older version, 3.2.3) without -fwritable-strings, here is the output I get:
====================================
keys[i] : abc
keys[i] : def
keys[i] : ghi
keys[i] : jklSegmentation Fault
====================================

When I compile the same program with the option -fwritable-strings, the output is as follows:
====================================
keys[i] : abc
keys[i] : def
keys[i] : ghi
keys[i] : jkl
Word no. 0, modified = 1
Word no. 1, modified = 1
Word no. 2, modified = 1
Word no. 3, modified = 1
keys[i] : ABC
keys[i] : DEF
keys[i] : GHI
keys[i] : JKL
Done
====================================

In the standalone program, the solution does not look difficult. In the context of a much larger program, however, where the function my_sm_hash may be called from multiple places and other parts of the code may carry underlying assumptions, each of the possible solutions has its disadvantages.

Solution 1: Use arrays instead of constant string literals. The gcc documentation states "Use named character arrays when you need a writable string." What this means in practice is that you need to change lines like
char* myStr = "Hello, world";
to
char myStr[] = {"Hello, world"};
This approach was not feasible for us, since the string constants in our code were actually part of another array of structures (with constant values).
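To make Solution 1 concrete, here is a minimal sketch for the simple case where the strings are not buried inside other structures. The names keys_writable, upcase_in_place and upcase_all_keys are hypothetical, chosen only for this illustration; the in-place conversion mirrors what my_sm_hash does in the program above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical table: a two-dimensional char array. Each row is a named,
   writable array initialised from a literal (3 characters plus '\0'),
   so nothing here points into read-only storage. */
static char keys_writable[][4] = { "abc", "def", "ghi", "jkl" };

/* Upper-cases s in place; returns 1 if any character was changed.
   The caller must pass writable storage. */
static int upcase_in_place(char *s)
{
    int modified = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (islower((unsigned char)*s)) {
            *s = (char)toupper((unsigned char)*s);
            modified = 1;
        }
    }
    return modified;
}

/* Upper-cases every row of the table; returns how many rows changed. */
static int upcase_all_keys(void)
{
    size_t i;
    int changed = 0;

    for (i = 0; i < sizeof keys_writable / sizeof keys_writable[0]; ++i)
        changed += upcase_in_place(keys_writable[i]);
    return changed;
}
```

Calling upcase_all_keys() modifies the rows without any special compiler option, because the characters belong to the array itself rather than to a literal.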

Solution 2: Create, modify and return a new string. Allocate a new string and copy the existing string into it, then modify the copy. This may be done inside the function, or at the point where the function is called.
We discovered two problems with this mechanism:
1) We could not determine when and where to free the newly allocated strings, so it could result in a considerable increase in the memory footprint.
2) Some other parts of the program relied on the original string being changed, so even though the crashes were avoided, testcases continued to fail.
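A rough sketch of Solution 2 follows. The function name upcase_copy is hypothetical, and the sketch assumes the caller is able to take ownership of the returned string, which is exactly the assumption that did not hold in our codebase.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated upper-case copy of src, or NULL if the
   allocation fails. The original string is never written to.
   The caller owns the result and must free() it. */
static char *upcase_copy(const char *src)
{
    size_t len, i;
    char *copy;

    assert(src != NULL);
    len = strlen(src);
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;

    for (i = 0; i < len; ++i)
        copy[i] = (char)toupper((unsigned char)src[i]);
    copy[len] = '\0';
    return copy;
}
```

Both problems listed above are visible here: every call allocates, so someone has to call free(), and the original string stays in lower case, so code that expects it to have changed will still misbehave.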

Eventually, we realized we would have to live with memory leaks unless we wanted to rearchitect a huge codebase that is no longer actively developed and is not understood by the present team. At most, we could try to control the extent of the damage. So we used different mechanisms for specific scenarios:
  • There were some global tables [arrays of structures with constant data] that contained constant strings. We reallocated the string fields of such structures once, during the initialization stage. This ensured only a one-time leak of a small amount of memory [instead of the repeated leaks that would have resulted from solution 2 above].
  • In places where static strings were defined, modified and used within a local scope, we used a static variable to control the allocation, so that the string was allocated only the first time the function was called.
  • We had cases where a char pointer was assigned a string constant as a default value, and this default value was overwritten in different cases using a switch/case statement. This was modified so that the pointer was not preassigned, and was assigned the constant string only if none of the enumerated cases was encountered [i.e. an unexpected scenario].
  • There were even cases where a static string was declared only to be modified [capitalized, in this case] and used immediately. This was trivial: use the upper case in the first place!
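The first two mechanisms can be sketched roughly as follows. Everything named here (struct entry, cmd_table, table_make_writable, local_label) is hypothetical and stands in for the real tables and functions, which are not shown in this post.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical global table whose string fields start out as literals. */
struct entry {
    char *name;
    int   code;
};

static struct entry cmd_table[] = {
    { "alpha", 1 },
    { "beta",  2 },
};

/* One-time initialisation: replace each literal with a heap copy that may
   be modified later. The copies are never freed, so this is a deliberate,
   bounded one-time leak. Returns 0 on success, -1 if an allocation fails
   (in which case a later call simply retries from the start). */
static int table_make_writable(void)
{
    static int done = 0;
    size_t i;

    if (done)
        return 0;
    for (i = 0; i < sizeof cmd_table / sizeof cmd_table[0]; ++i) {
        size_t n = strlen(cmd_table[i].name) + 1;
        char *copy = malloc(n);
        if (copy == NULL)
            return -1;
        memcpy(copy, cmd_table[i].name, n);
        cmd_table[i].name = copy;
    }
    done = 1;
    return 0;
}

/* Local-scope variant: a static pointer guards the allocation, so the
   writable copy is created (and upper-cased) on the first call only. */
static char *local_label(void)
{
    static char *label = NULL;

    if (label == NULL) {
        const char *initial = "ready";
        size_t i, n = strlen(initial);

        label = malloc(n + 1);
        if (label == NULL)
            return NULL;
        for (i = 0; i < n; ++i)
            label[i] = (char)toupper((unsigned char)initial[i]);
        label[n] = '\0';
    }
    return label;
}
```

In both cases the amount of leaked memory is fixed by the number of strings rather than by the number of calls, which is the trade-off we accepted.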

Wednesday, October 3, 2007

Identity Crisis

Early this week, I got crashes in many testcases in 64-bit mode, and only on Linux platforms (Solaris was fine in both 32- and 64-bit modes). Debugging such issues is a big problem, because they are almost certainly tricky memory corruptions, and debugging in a 64-bit environment is seriously hampered by the lack of proper support in debugging tools.

There are some specific techniques I follow to debug such corruptions or environment specific issues.

1) Reproduce the crash in an environment where it is easier to debug.

Now, Workshop, the graphical debugging tool that I use on Solaris, supports both 32- and 64-bit modes well. I like it for its user-friendly interface, and for certain features, like pop stack, that are not provided by other debuggers. But since I could not reproduce the crash on Solaris, I had to start my investigation on Linux. On Linux, I use DDD, which is a GUI based on GDB.
I was in for a surprise, since DDD did not load my executable. I checked the usual suspects: build, run arguments, paths, library paths etc. This did not help.
After asking around, a helpful team-member told me that there is a separate 64-bit version of GDB. I set my path and library path to pick this version, and was able to load the executable in DDD.

2) Compare the behavior on two platforms. Start with a top-level analysis, followed by a step-by-step comparison.

DDD showed me the point of the crash, but did not allow me to access the object whose method call generated the crash. Moreover, DDD does not honour the default values of function arguments that one can provide in C++ code.

The line of code was something like:

if (obj && obj->hasProp1())
{
info = obj->getInfo(PROP1); // => the crash occurred here
}

I ran the testcase on Solaris, but the execution did not reach the offending line!

3) Quick, high-level debugging through "printf" statements.
I put a "printf" statement before the "if" statement, to display the name of the object 'obj' and the value returned by the hasProp1() call. I compared the output on Solaris and Linux, and it was exactly the same! The value was '0' on both platforms; how the code entered the "if" block is a mystery I could not solve till the end.

4) Use memory analysis tools. Try different tools, as it frequently happens that one tool catches a problem that another could not.

I ran Valgrind on Linux, but it did not show any errors. Valgrind is available only on Linux, but it has no build-time requirements.

One point to note is that one can catch corruptions equally well on any platform, since the problem exists everywhere, even if it does not manifest in certain scenarios.
Purify was initially available on Solaris only, though now it is available on Linux as well. So I'm more comfortable using Purify on Solaris, and I've also had problems using Purify on Linux in the past. So I tried to run Purify on Solaris. However, I was not able to run the 64-bit Purify build: it crashed even before executing the first line. I repeated the build and run a number of times, with the same outcome. After I had struggled with it for a long while, a thought struck me: I was running it on S10. Our build platform is S8, but with the same build, we support running on S9 and S10 as well. I had been running the non-Purify build on S10, and I continued to do so with the Purify build. So, in yet another attempt, I ran the testcase on S8. This time it did run! The 64-bit Purify build does not run on S10!
However, the run still did not help me much: it kept delivering SIGBUS endlessly. I loaded the Purify build in Workshop, and put a breakpoint in the function purify_stop_here. This function is a debugging hook provided by Purify; it is called before Purify reports a memory violation. I got to the point where the SIGBUS was reported and analyzed that part of the code, but it only led me on a wild goose chase. But well, that is life!

5) When all else fails, ask around!

I asked many people in the group whether they had debugged 64-bit binaries on Linux. Finally, one person told me he had, but that he did not have much faith in DDD, so he used GDB from the shell. He also gave me a pointer to the version of GDB he had (successfully) used. This was another version of 64-bit GDB!

6) Revisit steps

Meanwhile, I had also made a Purify build on Linux. I ran this Purify build with the new version of GDB. Purify reported an ABR (Array Bounds Read) on the same line that DDD had reported earlier. I put a breakpoint in the function purify_stop_here, and continued till I reached the point of the ABR error. This time, GDB correctly reported the name of the object. I moved to Solaris Workshop once again, and looked for this particular object. Having the name of the object made it quite easy this time. And what I discovered was absolutely stunning: the object was actually of the base class, while the properties we were querying on it were functions and members of the derived class. The object had been returned by a call to a function that could return either kind of object. The fix was, of course, extremely simple. I modified the "if" condition to:
if (obj && obj->isDrvCls() && obj->hasProp1())
{
...
}

The class declarations were as follows:

class baseCls
{
private:
    ...
    char *name;
public:
    ...
    char *getName() { return name; }
    // Virtual, so the check reports the real type of the object;
    // a plain base-class object answers FALSE.
    virtual int isDrvCls() { return FALSE; }
};

class drvCls : public baseCls
{
public:
    ...
    int hasProp1() { return type_ & PROP1_MASK; }
    listCls *info() { return lst1_; }
    infoCls *getInfo(int propType) { return info()->findInfo(propType); }
    virtual int isDrvCls() { return TRUE; }
private:
    ...
    int type_;
    listCls *lst1_;
};