Monday, July 18, 2005

Smash stack attack!

Warning: Arcane programming terms on the way. Duck to avoid the shrapnel.

This blog started out as a tech/hack blog for me. But then I decided to get up on a soapbox and hold forth.

This is sort of a "return to basics". Non-programming enthusiasts may pass on this one. I expect to post more such 'insights': interesting little nuggets of programming wisdom gained the hard way, with all proprietary employer-related info stripped away. Writing style elements heavily borrowed from Alhad and Larry Ostermann.

So, I was working on this multi-threaded tool at my previous job. I was using the standard UNIX pthreads library. The standard call for creating a thread is:

int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);

I decided to look it up in my favorite systems-related quick reference book. The example code was something like this:

pthread_create(&tid, NULL, thread, NULL);

Which should have been fine for most purposes. But this is where things get interesting.

The tool I was working on had a set of arguments like this

tool -i inputfile0 -i inputfile1 -i inputfile2 ....

If you entered one input file, the tool would execute just one thing and exit. If you entered multiple input files, the tool would spawn off a thread for each one and perform a task based on it. It would wait on the threads, and exit when the last one was done. So, inputfile0 causes task0 to be executed, inputfile1 causes task1 to be executed, and so on.
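The overall shape (one thread per input file, then wait for all of them) looks roughly like the sketch below. This is not the actual tool: run_task, struct task, run_all and MAX_INPUTS are names invented for illustration, and the real per-file work is replaced by a stand-in that just measures the length of the file name. Note the NULL attributes argument, exactly as in the textbook example.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_INPUTS 16   /* arbitrary cap for this sketch */

/* Hypothetical per-file task description. */
struct task {
    const char *input;   /* file name taken from the command line */
    int         result;  /* filled in by the worker thread */
};

/* Thread entry point: must take and return void *. */
static void *run_task(void *arg)
{
    struct task *t = arg;
    int n = 0;

    /* Stand-in for the real work: count the characters in the name. */
    while (t->input[n] != '\0')
        n++;
    t->result = n;
    return NULL;
}

/* Spawn one thread per input, wait for all of them, copy out the results.
 * Returns the number of tasks completed, or -1 on any failure. */
int run_all(const char *inputs[], int count, int results[])
{
    pthread_t   tids[MAX_INPUTS];
    struct task tasks[MAX_INPUTS];
    int started = 0;

    if (count < 0 || count > MAX_INPUTS)
        return -1;

    for (int i = 0; i < count; i++) {
        tasks[i].input  = inputs[i];
        tasks[i].result = 0;
        /* NULL attributes: the thread gets the platform's default stack size. */
        if (pthread_create(&tids[i], NULL, run_task, &tasks[i]) != 0)
            break;
        started++;
    }

    /* Join only the threads that were actually created. */
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        results[i] = tasks[i].result;
    }
    return started == count ? count : -1;
}
```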

For each task singly, and some combinations of tasks, the tool would work just fine.

For instance, task0 and task1 together worked fine. Task0 and task2, ditto. Task0 and task3, however, busted the thing - segfault.

Trying everything individually worked wonderfully. Task3 was perfect individually. But task3 in a multi-threaded environment just blew up. Every time. I am happy that the bug was at least reproducible. (Heisenbugs are the worst.)

A lot of data on multi-threading was dug up. I went back to the basics, checking to see if all the APIs and objects I was calling were thread-safe. External advice was sought. After debugging this for almost 2 1/2 days (with IBM dbx, which is no fun to debug with at all. Why can't they use gdb now?), I finally gave up and asked a colleague (the resident dbx guru) for serious help.

Finally, the solution turned up.

The API uses these large strings inside its objects to store things. (How large? 64K bytes sound big enough for you? It's actually required. I can't tell you more.) It wasn't something I spent sleepless nights over - I didn't know they used char arrays for such things anymore. What's wrong with pointers? It wasn't my API, and I know I should have asked, but it never struck me. So, every time I passed one of those objects into the new thread I was spawning, I was effectively passing in 64K plus some other integer and float-type variables.

There were some other API calls involved, which meant my stack was growing large - into 150K+ for sure.

Now, this is not a problem normally. On 32-bit IBM AIX, the stack can grow to 4G without issues. (I'm not so sure about 64-bit.)

But, we are in multi-threaded mode here. Aha! The default stack size per thread for a multi-threaded application on IBM AIX is 96K. So, my multi-threaded, multi-call 96K stack was getting smashed - neither for fun nor for profit.
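If you want to know what default you are actually getting, a freshly initialised attribute object should report it. A small sketch, assuming a POSIX system that supports the thread stack size attribute; default_thread_stack_size is a made-up helper name, and the number it returns varies from platform to platform.

```c
#include <pthread.h>
#include <stddef.h>

/* Returns the default per-thread stack size in bytes, or 0 on failure.
 * Hypothetical helper: it only reads the value a new attribute object
 * is initialised with, it does not change anything. */
size_t default_thread_stack_size(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0)
        size = 0;
    pthread_attr_destroy(&attr);
    return size;
}
```

Printing this value once at startup (for example with printf and the %zu format) is a cheap way to find out how much room each thread really has before handing it large objects.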

I eat crow. It was my fault. Still, I have to point out - no one I worked with even knew of this gotcha. (At least two other people looked at my code and found it OK.) Everyone passes in NULL for the pthread_attr_t argument.

In fact, my colleague who helped me debug the issue spent the better part of a half hour debugging it, using arcane debugging commands that I'm sure he conjured out of thin air. (I sure didn't see anything like that in the online help.)

The solution was of course to ask for a larger stack (I think I used 512K) with pthread_attr_setstacksize. The argument was changed, and voila! Everything was hunky-dory again. Life was good, and there was peace on Middle Earth again.
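For the record, the fix looks something like this. A minimal sketch, not the original code: worker, run_with_stack and BIG_BUFFER are hypothetical names, and the 64K local array stands in for the API's big objects.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define BIG_BUFFER (64 * 1024)   /* stand-in for the API's 64K objects */

/* Worker that puts a large array on its own stack and touches all of it. */
static void *worker(void *arg)
{
    char buffer[BIG_BUFFER];

    memset(buffer, 'x', sizeof buffer);
    *(int *)arg = (buffer[BIG_BUFFER - 1] == 'x');
    return NULL;
}

/* Run the worker on a thread with an explicitly requested stack size.
 * Returns 0 on success, -1 if the size is rejected or the thread fails. */
int run_with_stack(size_t stack_bytes)
{
    pthread_attr_t attr;
    pthread_t tid;
    int ok = 0;
    int rc;

    if (pthread_attr_init(&attr) != 0)
        return -1;

    /* This is the call that replaces "just pass NULL". */
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, worker, &ok);

    /* The attribute object is no longer needed once the thread exists. */
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return -1;

    pthread_join(tid, NULL);
    return ok ? 0 : -1;
}
```

Calling run_with_stack(512 * 1024) mirrors the 512K mentioned above. The right number depends on how deep the call chain goes and how big the objects on it are, so treat it as something to measure rather than copy.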

Lesson Learnt: Never assume any default arguments unless you know pretty well what they imply. And yes, don't trust a student's textbook for the real thing. Real men use Stevens for their programming needs. Even more macho people use man pages only.

Links:

1. Threads library options

2. pthread_attr_setstacksize

2 comments:

alhad said...

These things happen. I spent the better part of 3 days debugging an issue where a certain sequence of operations led to a data abort.
If you are working on ARM, you can look at the contents of a particular register (R14) to get the physical address. Then you can subtract 8 from that address and you have the physical address of the instruction which aborted.
Yet, everything seemed fine with that instruction.. after trying everything, the solution finally came up: we were running out of filehandles in the system!! (It's an embedded device, it only has about 50-60 max file handles allowed). Lesson to be learnt here: Look at error codes for file handles (there is a special error code that gets returned when the system runs out of filehandles, and the API was eating it up - it was not passing enough detail to the application to know what happened).
Shit happens, the only problem being that worse shit also happens :-)
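For what it's worth, on a POSIX-style system the check described above might look like the sketch below. This is an assumption: the embedded platform in question may use a completely different API and different error codes. open_checked is a hypothetical helper; EMFILE and ENFILE are the standard errno values for running out of descriptors per process and system-wide.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Open a file read-only.
 * Returns a descriptor (>= 0) on success, or the negated errno on failure,
 * so the caller can still tell exactly what went wrong. */
int open_checked(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        int err = errno;   /* save it before any other call can change it */

        if (err == EMFILE || err == ENFILE)
            fprintf(stderr, "open(%s): out of file descriptors\n", path);
        else
            fprintf(stderr, "open(%s): %s\n", path, strerror(err));
        return -err;
    }
    return fd;
}
```

The point is simply that the error code travels up to the caller instead of being swallowed by the layer in between.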

Ajay said...

Sumeet, talk of code obfuscation :-). Potentially very dicey, esp. if some unexpected guy turns up for the presentation or something.

Alhad, your example kind of reinforces my point. Real programming is slightly different compared to assignments at school - you are likely to reach limits sooner than you think, and when there are other people coding other modules along with you, losing a 'handle' on the amount of resources you have is quite easy.