It was a curious thing. A website had been deployed to powerful production servers. These servers had many CPU cores.
Yet every morning the system administrators would notice several processes sitting at 100% CPU utilisation. At first it was just one or two rogue web application processes, which could easily be killed off. But as the days went by, more and more processes would be spinning at 100% CPU utilisation each morning.
I immediately became suspicious of the application code, and I asked to see any and all code with while loops in it.
What is risky about while loops?
Let me digress for a moment. Why would I be suspicious of while loops?
The issue is that, unlike a for loop, which has a dedicated place in its header for the update step, a while loop relies on the programmer to update the condition variable somewhere inside the loop body.
A simple while loop is difficult to get wrong. For example:
    /* iterate through users */
    counter = 0;
    while ( counter < numusers ) {
        printf( "User %s\n", user[counter] );
        counter++;
    }
This code will always complete because counter will eventually reach numusers and cause the loop to end.
The counter variable is the “condition variable” for this loop because the while loop has a condition dependent on the counter variable. And this loop does, indeed, update the condition variable every iteration (by the counter++; statement).
The problem lies in larger organisations when multiple people are making changes to source code – perhaps when making a fix in response to a bug report. They may add some code like this:
    /* iterate through users */
    counter = 0;
    while ( counter < numusers ) {
        /* bug report 14132 - ignore user "secretuser" */
        if ( strcmp( user[counter], "secretuser" ) == 0 ) {
            /* match! loop again to avoid printing secretuser */
            continue;
        }
        printf( "User %s\n", user[counter] );
        counter++;
    }
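Now we have a problem. If "secretuser" appears in the user[] array then, once the loop reaches that entry, continue jumps straight back to the condition check without executing counter++. The counter is never incremented again and the while loop will continue forever.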
A for loop wouldn’t have this issue. Imagine if the loop had started out as:
    /* iterate through users */
    for ( counter = 0; counter < numusers; counter++ ) {
        printf( "User %s\n", user[counter] );
    }
and then the maintainer added the bug fix:
    /* iterate through users */
    for ( counter = 0; counter < numusers; counter++ ) {
        /* bug report 14132 - ignore user "secretuser" */
        if ( strcmp( user[counter], "secretuser" ) == 0 ) {
            /* match! loop again to avoid printing secretuser */
            continue;
        }
        printf( "User %s\n", user[counter] );
    }
Because counter is incremented in the for loop header, a continue still runs the counter++ step before the condition is tested again. This ensures the loop will complete even though the maintainer added a continue statement inside it.
Thus for loops offer greater protection against accidental infinite looping.
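If a loop has to remain a while loop, one way to get similar protection is to read the current element and advance the condition variable at the very top of the body, before any continue can be reached. The following is only a minimal sketch of that idea: the function name print_visible_users and its return value (the number of names printed) are assumptions made for illustration and are not taken from the original application.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints every user except "secretuser" and
   returns how many names were printed. */
int print_visible_users( const char *user[], int numusers )
{
    int printed = 0;
    int counter = 0;

    while ( counter < numusers ) {
        /* Take the current element and advance straight away, so that no
           later continue can skip the update of the condition variable. */
        const char *name = user[counter];
        counter++;

        if ( strcmp( name, "secretuser" ) == 0 ) {
            /* safe: counter has already moved on */
            continue;
        }

        printf( "User %s\n", name );
        printed++;
    }

    return printed;
}
```

A later maintainer can add further continue statements below the counter++; line without any risk of stalling the loop, which is roughly the guarantee the for loop header provides for free.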
A tight while loop is bad
If a process finds itself in a tight while loop, things can be worse than one might ordinarily expect. That’s because it might never call an operating system function, which would give the operating system a chance to consider whether anything else needs to be done (outside the application) before returning control to the program.
Instead the CPU is executing instructions furiously over and over again – until a scheduled timer interrupt forces the CPU to visit the operating system and something else is given a chance to run.
So things are not utterly dire – other code does get a chance to run on the processor. But it’s not well behaved and polite like most programs are. Still, this kind of impolite hogging of the CPU does have consequences and is worth being aware of.
Back to the story
My suspicion was confirmed when I found a while loop that had a rogue continue statement without any accompanying code to update the condition variable.
Quite simply, I grepped for the string continue and closely reviewed any while loop that contained the keyword. And that is how I found the fault.
Indeed, it had been a bug fix. It was only triggered occasionally because it added an exception to the normal operation, and in testing this exception had not been encountered.
But in production the exception was being triggered on occasion by real customers – and when it was, their web session froze (never responded) and the process locked up at 100% CPU utilisation each time.
It was fortunate this issue was identified, and fixed, before whole servers became overwhelmed with CPU load.