Wednesday, October 31, 2007

Venturing Into Uncharted Territory

By default, C compilers place string constants (and other constant data) into protected memory (a read-only section), where they cannot be modified at runtime.

When the option "-fwritable-strings" is used (this is the gcc spelling; compilers on other platforms, such as the Solaris compiler and xlc, have their own options controlling the same behaviour), the compiler allocates strings in the [writable] data segment, so that the program can write to them. The option is provided for backward compatibility with old applications [pre-ANSI, I think] that rely on writing to strings.

However, it is quite intuitive that writing into string constants is a very bad idea - "constants" should be constant. I believe it is for this reason that most compilers are now deprecating or discontinuing the "-fwritable-strings" option. Until some time back, we had been using version 3.2.3 of gcc on Linux platforms. Recently, we moved to 4.1.1, in which this option is no longer supported. The option had actually been deprecated in 3.4 and removed in 4.0, but at my workplace, the next standard that we adopted after 3.2.3 was 4.1.1. There are a number of other changes that 4.1.1 has enforced, but I'll discuss those some other time.

Now, we have a product which is almost 20 years old, and though there has been no active development on it for the last 5-6 years, it continues to be supported. So, we need to support it on the new platform/compiler standards, and therefore we undertook the task of porting it to gcc 4.1.1. There are numerous places in this code where constant strings are written to [this, and some other typical usages, led me to believe that this code was written before the ANSI standard for C had arrived]. One of the reasons given for allowing writes to string literals in old programs is that memory was severely limited back then, and allocation/deallocation was expensive. Apart from the build issues we faced due to stricter checks in the newer version, there were a large number of failures in the regression tests - when you attempt to write to a constant string literal, you get a segmentation fault.

Here is a small program to illustrate the problem:

====================================

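#include <stdio.h>
#include <ctype.h>

/* Each element points at a string literal, i.e. at read-only memory. */
static char * keys [] =
{ "abc", "def", "ghi", "jkl" };

/* Upper-cases strup in place; returns 1 if any character was changed. */
int my_sm_hash(char *strup)
{
    int modified = 0;
    char ch = '\0';

    while ((ch = *strup) != 0)
    {
        if(islower((unsigned char)ch))
        {
            /* Writes through the pointer: this is what crashes
               when strup points at a string literal. */
            *strup = toupper((unsigned char)ch);
            modified = 1;
        }
        ++strup;
    }
    return modified;
}

int main(void)
{
    int i, j;

    for(i = 0; i <= 3; i++)
        printf("\nkeys[i] : %s", keys[i]);

    for(i = 0; i<= 3; i++)
    {
        j = my_sm_hash(keys[i]);
        printf("\nWord no. %d, modified = %d", i, j);
    }

    for(i = 0; i <= 3; i++)
        printf("\nkeys[i] : %s", keys[i]);

    printf("\nDone\n");
    return 0;
}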
====================================

When I compile it with gcc (older version, 3.2.3) without any special options, here is the output I get:
====================================
keys[i] : abc
keys[i] : def
keys[i] : ghi
keys[i] : jklSegmentation Fault
====================================

When I compile the same program with the option -fwritable-strings, the output is as follows:
====================================
keys[i] : abc
keys[i] : def
keys[i] : ghi
keys[i] : jkl
Word no. 0, modified = 1
Word no. 1, modified = 1
Word no. 2, modified = 1
Word no. 3, modified = 1
keys[i] : ABC
keys[i] : DEF
keys[i] : GHI
keys[i] : JKL
Done
====================================

In the standalone program, the solution does not look difficult. But in the context of a much larger program, where the function my_sm_hash may be called from multiple places and there may be underlying assumptions elsewhere in the code, each of the possible solutions has its disadvantages.

Solution 1: Use arrays instead of constant string literals. The gcc documentation states "Use named character arrays when you need a writable string." What this means in practice is that you need to change lines like
char *myStr = "Hello, world";
to
char myStr[] = "Hello, world";
We could not use this approach: the string constants in our code were actually part of another array of structures (with constant values), so converting them to named arrays was not feasible for us.

Solution 2: Create, modify and return a new string. Allocate a new string and copy the existing string into it, then modify the new string. This may be done inside the function, or at the point of the call to the function.
We discovered two problems with this mechanism:
1) We could not determine when and where to free the newly allocated strings, so it could result in a considerable increase in the memory footprint.
2) Some other parts of the program relied on the original string being changed, so even though the crashes were avoided, test cases continued to fail.

Eventually, we realized we would have to live with memory leaks unless we wanted to rearchitect a huge codebase which is no longer actively developed, and not understood by the present team. At most, we could try to control the extent of the damage. So, we used different mechanisms for specific scenarios:
  • There were some global tables [arrays of structures with constant data] that had constant strings. We reallocated the string fields of such structures once, during the initialization stage. This ensured only a one-time leak of a small amount of memory [instead of the repeated leaks which would have resulted from solution 2 above].
  • In places where static strings were defined, modified and used within a local scope, we used a static variable to control the allocation - so that the string was allocated only the first time the function was called.
  • We had cases where a char pointer was assigned a string constant as a default value, and this default value was overridden in different cases using a switch/case statement. This was modified so that the pointer was not preassigned, and only if none of the enumerated cases was encountered [i.e. an unexpected scenario] was the pointer assigned the constant string.
  • There were even cases where a static string was declared, only to be modified [capitalized, in this case] and used immediately. This was trivial - use the upper case in the first place!

2 comments:

Anonymous said...

What a nightmare! Was there any reason to get the 20 year old program compiled on the latest compiler? I would guess that the older executable would continue to run even on new versions of the OS.

Sigma said...

Hello Shuva,
Thanks for coming here. We are not talking about a new OS here, but a new compiler. We were moving to the new version of gcc, on the same Linux port(s).
I have discussed the reasons for such an exercise in the next post.