Early this week, many testcases started crashing in 64-bit mode, and only on Linux (Solaris was fine in both 32- and 64-bit modes). Debugging such issues is hard: they are almost certainly tricky memory corruptions, and debugging in a 64-bit environment is seriously hampered by poor support in the debugging tools.
There are some specific techniques I follow to debug such corruptions or environment-specific issues.
1) Reproduce the crash in an environment where it is easier to debug.
Workshop, the graphical debugging tool that I use on Solaris, supports both 32- and 64-bit modes well. I like it for its user-friendly interface and for certain features, like pop stack, that other debuggers do not provide. But since I could not reproduce the crash on Solaris, I had to start my investigation on Linux. On Linux, I use DDD, which is a GUI built on GDB.
I was in for a surprise: DDD did not load my executable. I checked the usual suspects - build, run arguments, paths, library paths etc. This did not help.
After asking around, a helpful team-member told me that there is a separate 64-bit version of GDB. I set my path and library path to pick this version, and was able to load the executable in DDD.
2) Compare the behavior on two platforms. Start with a top-level analysis, followed by a step-by-step comparison.
DDD showed me the point of the crash, but did not let me inspect the object whose method call generated the crash. Moreover, DDD does not honour the default values of function arguments that one can provide in C++ code.
The line of code was something like:
if (obj && obj->hasProp1())
{
info = obj->getInfo(PROP1); // => the crash occurred here
}
I ran the testcase on Solaris, but the execution did not reach the offending line!
3) Quick, high-level debugging through "printf" statements.
I put a "printf" statement before the "if" statement, to display the name of the object 'obj' and the value returned by the hasProp1() call. I compared the output on Solaris and Linux, and it was exactly the same! The value was '0' on both platforms. How the code then entered the "if" block on Linux is a mystery I could not solve till the end.
4) Use memory analysis tools. Try different tools, as it frequently happens that one tool catches a problem that another could not.
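A minimal sketch of the kind of trace statement I mean. The TracedObj struct and the 0x1 mask are hypothetical stand-ins for the real class, and formatTrace()/trace() are names made up for this illustration. The pointer value is deliberately left out of the line, so that the output from two platforms can be compared directly.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the real class: only the two calls used in the trace.
struct TracedObj {
    const char *name;
    int type;
    const char *getName() const { return name; }
    int hasProp1() const { return type & 0x1; }  // 0x1 is an assumed mask value
};

// Build the trace line as a string, so it can be printed or compared.
std::string formatTrace(const TracedObj *obj) {
    char buf[256];
    if (!obj) {
        std::snprintf(buf, sizeof buf, "TRACE obj=(null)");
    } else {
        std::snprintf(buf, sizeof buf, "TRACE name=%s hasProp1=%d",
                      obj->getName(), obj->hasProp1());
    }
    return std::string(buf);
}

// Print to stderr and flush, so the line is not lost if the process
// crashes on the very next statement.
void trace(const TracedObj *obj) {
    std::fprintf(stderr, "%s\n", formatTrace(obj).c_str());
    std::fflush(stderr);
}
```

Calling trace(obj) just before the "if" statement and diffing the stderr output of the two platforms gives the comparison described above.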
I ran Valgrind on Linux, but it did not show any errors. Valgrind is available only on Linux, but it has no build-time requirements.
One point to note is that one can catch corruptions equally well on any platform, since the problem exists everywhere, even if it does not manifest in certain scenarios.
Purify was initially available on Solaris only, though now it is available on Linux as well. So I'm more comfortable using Purify on Solaris, and I've also had problems using Purify on Linux in the past. So, I tried to run Purify on Solaris. However, I was not able to run the 64-bit Purify build - it crashed even before executing the first line. I tried the build and run several times, with the same outcome. After struggling with it for a long while, a thought struck me - I was running it on S10. Our build platform is S8, but with the same build, we support running on S9 and S10 as well. I had been running the non-Purify build on S10, and I continued to do so with the Purify build. So, in yet another attempt, I ran the testcase on S8. This time it did run! The 64-bit Purify build does not run on S10!
However, the run still did not help me much - it kept delivering SIGBUS endlessly. I loaded the Purify build in Workshop, and put a breakpoint in the function purify_stop_here - this function is a debugging hook provided by Purify, and it is called before Purify reports a memory violation. I got to the point where the SIGBUS was reported, and analyzed that part of the code, but it only led me on a wild goose chase. But well, that is life!
5) When all else fails, ask around!
I asked many people in the group if they had debugged 64-bit binaries on Linux. Finally, one person told me he had done so, but that he did not have much faith in DDD, so he used GDB from the shell. He also gave me a pointer to the version of GDB he had (successfully) used. This was another version of 64-bit GDB!
6) Revisit steps
Meanwhile, I had also made a Purify build on Linux. I ran this Purify build with the new version of GDB. Purify reported an ABR (Array Bounds Read) on the same line that DDD had reported earlier. I put a breakpoint in the function purify_stop_here, and continued till I reached the point of the ABR. This time, the new version of GDB correctly reported the name of the object. I moved to Solaris Workshop once again, and looked for this particular object. Having the name of the object made it quite easy this time. And what I discovered was absolutely stunning - the object was actually of the base class, while the properties we were querying on it were functions and members of the derived class. The object had been returned by a call to a function which could return either kind of object. The fix was, of course, extremely simple - I modified the "if" condition to:
if (obj && obj->isDrvCls() && obj->hasProp1())
{
...
}
The class declarations were as follows:
class baseCls
{
private:
...
char *name;
public:
...
char *getName() { return name; }
virtual int isDrvCls() { return FALSE; } // a plain base object is not a drvCls
};
class drvCls : public baseCls
{
public:
...
int hasProp1() { return type_ & PROP1_MASK; }
listCls *info() { return lst1_; }
infoCls *getInfo(int propType) { return info()->findInfo(propType); }
int isDrvCls() { return TRUE; }
private:
...
int type_;
listCls *lst1_;
};
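To make the bug and the fix concrete, here is a small self-contained sketch. It is not the real code: infoCls is reduced to a single field, the list lookup is replaced by one stored entry, the values of PROP1 and PROP1_MASK are assumed, and safeGetInfo() is a helper name invented for this illustration. The point it shows is that the type check must happen before any derived-class member is touched.

```cpp
#include <cstddef>

// Hypothetical, heavily simplified stand-in for the real infoCls.
struct infoCls { int propType; };

class baseCls {
public:
    explicit baseCls(const char *name) : name_(name) {}
    virtual ~baseCls() {}
    const char *getName() const { return name_; }
    // A plain base object reports that it is not a drvCls.
    virtual bool isDrvCls() const { return false; }
private:
    const char *name_;
};

class drvCls : public baseCls {
public:
    enum { PROP1 = 1, PROP1_MASK = 0x1 };  // assumed values
    drvCls(const char *name, int type)
        : baseCls(name), type_(type), info_{PROP1} {}
    bool isDrvCls() const override { return true; }
    bool hasProp1() const { return (type_ & PROP1_MASK) != 0; }
    // Simplified lookup: a single stored entry instead of a list.
    const infoCls *getInfo(int propType) const {
        return propType == info_.propType ? &info_ : nullptr;
    }
private:
    int type_;
    infoCls info_;
};

// Guarded access: check the dynamic type first, and only then use
// derived-class members. Without the isDrvCls() check, the cast below
// would read type_ and info_ from memory a baseCls object does not have.
const infoCls *safeGetInfo(const baseCls *obj) {
    if (!obj || !obj->isDrvCls()) {
        return nullptr;
    }
    const drvCls *drv = static_cast<const drvCls *>(obj);
    return drv->hasProp1() ? drv->getInfo(drvCls::PROP1) : nullptr;
}
```

With this guard, a base-class object yields a null result instead of an out-of-bounds read. In code where RTTI is enabled, dynamic_cast would be an alternative to a hand-written isDrvCls() check.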