The case of CPUs becoming 100% busy

It was a curious thing. A website had been deployed to powerful production servers. These servers had many CPU cores.

Yet every morning the system administrators would notice several processes running at 100% CPU utilisation. At first it was just one or two rogue web application processes, which could easily be killed off. But as the days went by, more and more processes would be spinning at 100% CPU utilisation each morning.

I immediately became suspicious of the application code, and I asked to see any and all code that contained while loops.

What is risky about while loops?

Let me digress for a moment. Why would I be suspicious of while loops?

The issue is that a for loop has a dedicated place in its header for updating the loop variable, whereas a while loop relies on the programmer to update the condition variable somewhere inside the loop body.

A simple while loop is difficult to get wrong. For example:

/* iterate through users */
counter = 0;
while ( counter < numusers ) {
  printf( "User %s\n", user[counter] );
  counter++;
}

This code will always complete because counter is incremented on every iteration, so it eventually reaches numusers and causes the loop to end.

The counter variable is the “condition variable” for this loop because the while loop has a condition dependent on the counter variable. And this loop does, indeed, update the condition variable every iteration (by the counter++; statement).

The problem arises in larger organisations, where several people make changes to the same source code, perhaps when making a fix in response to a bug report. A maintainer may add some code like this:

/* iterate through users */
counter = 0;
while ( counter < numusers ) {
  /* bug report 14132 - ignore user "secretuser" */
  if ( strcmp( user[counter], "secretuser" ) == 0 ) {
    /* match! loop again to avoid printing secretuser */
    continue;
  }

  printf( "User %s\n", user[counter] );
  counter++;
}

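Now we have a problem. In a while loop, continue jumps straight back to the condition test and skips everything after it in the body, including counter++. If "secretuser" appears in the user[] array, counter stops being incremented at that element and the while loop will continue forever.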
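One way the maintainer's fix could have been written safely is sketched below: the condition variable is advanced before the continue. This is only an illustration. It assumes user is an array of C strings, and the function name print_visible_users and its return value are hypothetical, added so the sketch stands on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints every user except "secretuser" and
 * returns how many names were printed. */
static int print_visible_users(const char *const user[], int numusers)
{
    int printed = 0;
    int counter = 0;

    /* assumption: a non-empty list must come with a valid array */
    assert(numusers == 0 || user != NULL);

    while (counter < numusers) {
        /* bug report 14132 - ignore user "secretuser" */
        if (strcmp(user[counter], "secretuser") == 0) {
            /* advance the condition variable BEFORE skipping,
             * otherwise the loop re-tests the same element forever */
            counter++;
            continue;
        }

        printf("User %s\n", user[counter]);
        printed++;
        counter++;
    }

    return printed;
}
```

Called with the three names alice, secretuser and bob, this prints two lines and returns 2. The weakness of this approach is that every future continue added to the loop must remember to do the same.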

A for loop wouldn't have this issue. Imagine if the loop had started out as:

/* iterate through users */
for ( counter = 0; counter < numusers; counter++ ) {
  printf( "User %s\n", user[counter] );
}

and then the maintainer added the bug fix:

/* iterate through users */
for ( counter = 0; counter < numusers; counter++ ) {
  /* bug report 14132 - ignore user "secretuser" */
  if ( strcmp( user[counter], "secretuser" ) == 0 ) {
    /* match! loop again to avoid printing secretuser */
    continue;
  }

  printf( "User %s\n", user[counter] );
}

Because counter is incremented in the for loop header, and a continue inside a for loop still runs that increment before the condition is tested again, this loop will complete even though the maintainer added a continue statement inside it.

Thus for loops offer greater protection against accidental infinite looping.

A tight while loop is bad

When a process is stuck in a tight while loop, things can be worse than one might ordinarily expect. That's because the loop may never call an operating system function, and such a call is what would normally give the operating system a chance to consider whether anything else needs to be done (outside the application) before returning control to the program.

Instead the CPU executes the same instructions furiously, over and over again, until a scheduled timer interrupt forces it back into the operating system and something else is given a chance to run.

So things are not utterly dire: other code does get a chance to run on the processor. But the looping process is not well behaved and polite like most programs are, and this kind of impolite hogging of the CPU does have consequences and is worth being aware of.

Back to the story

My suspicion was confirmed when I found a while loop that had a rogue continue statement without any accompanying code to update the condition variable.

The method was simple: I grepped for the string continue and closely reviewed every while loop that contained the keyword. That is how I found the fault.

Indeed, it had been introduced by a bug fix. It was only triggered occasionally because the fix added an exception to the normal operation, and in testing this exception had not been encountered.

But in production the exception was being triggered on occasion by real customers. Each time it was, the customer's web session froze (it never responded) and the process locked up at 100% CPU utilisation.

It was fortunate this issue was identified, and fixed, before whole servers became overwhelmed with CPU load.
