Risk Analysis of the Pentium Bug

Vaughan Pratt Computer Science Department Stanford University

Summary

In his Newsweek article (Newsweek 9/2/96 p.60), Andy Grove neglects to mention IBM's stated reason for halting Pentium shipments, namely IBM's risk analysis showing that typical financial data is far more vulnerable to the bug than the purely random data studied by Intel. My own study (http://boole.stanford.edu/bug1), predating IBM's, found that the Pentium miscalculates once every two hundred divisions, not nine billion, when dividing small integers ``bruised'' by rounding error such as 4.999999/14.999999, for which the Pentium gets 0.333329 instead of 0.33333329. Intel has declined to debate publicly or even acknowledge any risk analysis but its own. Intel needs to analyze the risks in the inevitable flaws in its future products more accurately and openly than it has done with the Pentium.

This web page describes how the Pentium divides incorrectly, shows how small bruised integers cause an extraordinarily high error rate, recounts some of the history of the bug, criticizes Intel's irresponsible head-in-the-sand attitude towards the bug's unacceptably bad statistical properties, and points the interested reader at more detailed reading.

How the Pentium Divides Incorrectly

The Pentium (of the buggy kind generating all the fuss, which quickly vanished from the market at the beginning of 1995) computes the quotient x/y using long division of a kind similar to that taught in school. Although the Pentium performs its division-related calculations in radix four, meaning that it leaves out the six digits from 4 to 9, this and all other differences from ordinary high school long division are immaterial to a basic understanding of the bug, and the story will be just as clear if you picture the Pentium performing division with ordinary decimal numbers.

First let us introduce the cast of characters. We are given a dividend x and a nonzero divisor y and are asked to produce a quotient q = x/y such that qy approximates x to high accuracy, equivalent to about 20 decimal digits in the case of the Pentium.

Long division produces the quotient q one digit at a time starting with the high order or leftmost digits and working steadily down to the low order or rightmost end. Each quotient digit d is selected on the basis of the first couple of digits of x and y, which the Pentium accomplishes by consulting a table of quotient digits. The digit d found in the table is multiplied by the divisor y, and the product dy, suitably aligned, is then subtracted from what remains of the dividend x, called the partial remainder. Each dy that is subtracted is aligned one further position to the right than at the previous such subtraction.

Crucial to the success of the method is that the partial remainder shrink at a rate that keeps pace with the steadily shifting dy.

The error committed by the Pentium is that five table entries were inadvertently programmed at the factory to contain zero instead of two. If any one of these entries is used, the quotient digit that is output is 0 instead of 2, making the quotient wrong by that much.

This is all you need to know about the bug in order to move on to the next section, which treats error rates. Nevertheless there is an insteresting story to be told about just what the Pentium does for the rest of the division after it gets that quotient digit wrong, which you can read or skip as you wish.

The Pentium, not realizing that the quotient is now toast, presses on. Zero is subtracted from the partial remainder instead of 2y, which therefore fails to shrink properly. At the next step dy has shifted further to the right and the partial remainder, now bloated relative to the shifted dy, is too large for the further subtractions of the shifting dy's to shrink it appreciably.

Not expecting anything so large to be left over, the Pentium reclaims the storage used by the digit to the left of the subtracted dy on the ground that it must by now be zero. This erases the unexpectedly nonzero digit at the left of the partial remainder, which the Pentium thereafter loses track of. (To be more precise, the whole partial remainder is shifted one place left and the digit falls off the end.) The Pentium then continues on as though nothing had happened, oblivious to this lost digit.

At the end of the division we have a quotient whose first erroneous digit is 0 when it should have been 2. However the error throws the rest of the process off track, messing up the subsequent quotient digits in a complicated way. This complexity notwithstanding, the resulting error can be described more simply than one might expect. If at the end of the division you multiply the final quotient q by y expecting to recover the original dividend x, you obtain instead a quantity that is less than x by exactly the partial remainder digit that was lost, which as it turns out is either 1 in the position where the error happened or, in one case where the erasure is delayed by one step, 3 in the next radix-four digit to the right. This exactly characterizes the error.

Size and Frequency of the Errors

The arrhythmia produced by the error is equally likely to strike at any time in the process after the point where dy and the partial remainder have shrunk by a factor of about 16,000 (corresponding to seven radix-four digits). Hence the error is just as likely to strike a quotient digit near the beginning as near the end, subject only to the restriction that the largest error possible is one part in 16,000. This maximal error is observed for the quotient 4195835/3145727 found in November 1994 by Vitesse Semiconductor's floating point designer Tim Coe, which the Pentium computes as 1.333739 instead of 1.33382045.

Now it so happens that the five zero table entries are very rarely encountered when dividing completely random numbers, which is why testing failed to bring them to light. Thus one could reasonably hope that their infrequent nature would render the problem nearly harmless. If the few erroneous combinations were all as random as Coe's example would seem to indicate, the severity of the bug would be just one error every nine billion divisions, which would be too small a rate to bother anyone in practice except those perversely computing 4195835/3145727 over and over to make the obvious but meaningless point that higher rates can happen.

Unfortunately it is not the case that an "average spreadsheet user" will only encounter random data. The true situation is usually quite different. Typical data exhibits patterns characteristic of both its origins and how it is further operated on by the program consuming it. These patterns are often highly nonrandom.

As a simple example, every accountant whether working for Bank of America or a mom-and-pop store generates data that in a perfect world would be an integer number of pennies. This makes the fraction part of that data highly nonrandom, in fact zero.

As luck would have it, the Pentium performs flawlessly on integers provided the total number of decimal digits in both operands together is at most 12. (Coe's example, a quotient of two 7-digit integers, requires 14 digits.) So this kind of nonrandomness would seem to render the Pentium bug harmless for divisions of this nature.

Now one might imagine that typical rounding errors that convert dollar amounts like 14.57 to 14.56999999 would randomize at least the lowest digits of the fractional part, thereby slightly improving the statistical fit to the Intel model of purely random data. However as we shall see, these little rounding errors actually make the likelihood of encountering the bug worse rather than better. And not just by 20%, or even by the risk factor of 17 on which the administration's present antitobacco drive is based, but by a factor of more than a million over the estimate based on the assumption of purely random data!

Repeated subtraction of dy from the partial remainder with varying d and at varying positions tends to randomize the partial remainder, and we therefore have no simple characterization of those dividends x that are likely to trigger the bug. The divisor y however remains unchanged throughout the process, and the error occurs only with certain divisors that are very scarce. Their extreme scarcity would save the day were it not for the fact that they are of a very special form.

Had these error-prone divisors been distributed at random, we would expect all reasonable computational processes to encounter the bug once every nine billion divisions. Unfortunately the bad divisors instead cluster exceedingly tightly just below each of the five integer multiples of 3 between 16 and 32, namely 18, 21, 24, 27, and 30, and all numbers obtainable from these by multiplying or dividing by powers of two (1, 2, 4, 8, 16, etc.), such as 27*8 = 216 and 21/4 = 5.25. (Incidentally, the erased quotient digit, which we saw much earlier is either 3 or 4, is 3 in just one of the five cases, namely when the divisor is 18.)

The effect of this clustering is felt with extraordinary strength for numbers of a kind commonly encountered in computation, which I have for want (remarkably) of a prior term called "bruised" integers. These are numbers such as 4.999999 which you will have often seen on ordinary calculators, and which are as often encountered on powerful computers as on tiny ones.

While you are highly unlikely to ever encounter the division problem 4195835/3145727, the problem 5/15 involves simple numbers that are encountered in practice far more commonly than purely random numbers. When the operands of this example are bruised so as to turn the problem into 4.999999/14.999999, the $500 Pentium calculates this as 0.333329 instead of the 0.33333329 that any $5 calculator will give you.

Here now is an astounding statistic. If one chooses at random any two integers from one to 100, subtracts 0.000001 (one millionth) from each, and divides the two results one by the other, the probability of encountering an error is not one in nine billion as the random model predicts but one in two hundred.

The exact amount of bruising is not critical, but it should be between one in 10,000 and one in a hundred million in order for the effect to be felt most strongly. With larger bruising the divisor leaves the critical region and errors cease altogether. With smaller but still nonzero bruising, while the probability of an error remains unchanged its size decreases. A rough rule of thumb for very slight bruising is that the quotient error, when divided by the correct quotient, called the relative error, is proportional to the amount of bruising. With no bruising at all however there are no errors.

As examples of errors of this kind where the relative quotient error is close to being as bad it ever gets, the Pentium computes the quotients in the following list incorrectly "in the fifth decimal digit," more precisely, with a relative error of one part in 100,000, when 0.00001 is subtracted from each numerator and denominator.

18/27, 20/30, 33/144, 36/108, 40/120, 44/192, 72/432, 72/864, 80/480, 80/960.

For example the Pentium evaluates 17.99999/26.99999 as 0.6666575 when it should have obtained 0.66666654,

A fair question is, how long does a computation have to run before errors in the fifth position show up in more damaging positions such as the third or even second?

The simple answer is, the error can propagate to the first position in one step. Moreever that one step uses only arithmetic that you can do in your head, as shown by the following variant of my 4.999999/14.999999 example (for which I am indebted to Norman Hirsch).

Which is larger, 4.999999/15 or 4.999999/14.999999? Obviously the latter, since it has the smaller denominator. But the Pentium computes the former correctly, namely as .33333327, whereas it computes the latter as 0.333329, which is smaller when it should have been larger. Hence in computing the truth of 4.999999/15 < 4.999999/14.999999, the Pentium makes an error in the "leading bit" of the answer!

This example also demonstrates failure of "monotonicity" of division. If you plot x/y for very slowly decreasing y, the quotient very slowly increases except when a bug is encountered, where it takes a sharp momentary dive back down before continuing slowly upwards again. If a buggy Pentium were used to control an elevator, one could with the right scale of parameters encounter occasional mysterious jolts in an otherwise smooth ride.

A Short History of the Bug

The public's attention was drawn to the bug on October 30 by Thomas Nicely, a mathematics professor at Lynchburg College, who sent an email message to Windows author Andrew Schulman asking whether anyone had experienced errors with floating point division on the Pentium. Schulman contacted Richard Smith, and on November 1 Smith posted Nicely's message to Compuserve's Canopus forum. EE Times reporter Alex Wolfe picked up on this, researched the bug, and posted his account to Norwegian programmer Terje Mathisen, who performed his own investigation on a Pentium he had access to, and posted his results to comp.sys.intel on November 3 followed by a series of followup postings. Meanwhile Wolfe's story appeared in EE Times on November 7.

Intel confirmed for Wolfe that there was indeed a bug. Intel later stated that they had known about the bug since summer, had studied its severity, and concluded that it would never be encountered in ordinary use and therefore did not require notifying users of its existence. In particular Intel scientists had calculated that the error would only occur once every nine billion dvisions, and that therefore an "average spreadsheet user would run into the problem only once every 27,000 years of spreadsheet use." [Andy Grove, Newsweek 9/2/96, p.61.]

On November 15 Tim Coe posted a reverse engineering of the bug to the Internet.

On November 27, in an unprecedented move by a president of a major company, Andy Grove posted a very conciliatory "Internet Message to Scientific Users".

Reading Coe's account two weeks later, I saw from its details that the bug would occur with very high probability on small bruised integers, and posted my findings about bruised integers to the Internet newsgroups comp.sys.intel and comp.arch on December 3, with a series of followup articles during the next six weeks.

On December 12 IBM announced that it was halting further sales of all IBM computers based on the Pentium, citing IBM's Study of the bug. I subsequently learned from the IBM team investigating the bug at the T.J. Watson Research Center in Yorktown Heights NY that their study was the main basis for IBM's decision to withdraw Pentium-based computers from sale.

Intel's immediate response to this was its press release cited in the previous section dismissing IBM's analysis as merely "standing under a meteor."

Intel's Faulty Risk Analysis

One would expect that the difference between Intel's estimate of 27,000 years between occurrences of the bug and IBM's figure of 12 days would force Intel to address this enormous discrepancy between the respective risk analyses by at least acknowledging that there was a disagreement, and preferably debating the question in public so that people's minds could be set at rest.

Intel had the following to say about IBM's December 12 announcement that it would stop shipping Pentium computers.

SANTA CLARA, CA, December 12, 1994 -- In response to an IBM press release, Intel reiterated that it has studied the Pentium(tm) processor flaw for months and has concluded that the frequency of encountering reduced precision in floating point divide operations is once in every nine billion random divide operations. Intel said it regards IBM's decision to halt shipments of its Pentium processor-based systems as unwarranted.

"Based upon the work of our scientists analyzing real world applications, and the experience of millions of users of Pentium processor-based systems, we have no evidence of increased probability of encountering the flaw," said Andrew S. Grove, president and chief executive officer.

"You can always contrive situations that force this error. In other words, if you know where a meteor will land, you can go there and get hit," Grove said.

Intel does not agree with the conclusions reached by IBM, but reiterates nonetheless that any customer who might encounter the problem with the Pentium processor in the course of their applications will be sent a replacement at any time during the life of their PC.

Intel, the world's largest chip maker, is also a leading manufacturer of personal computer networking and communications products.

Neither this press release nor Grove's Newsweek article cited earlier contains the slightest acknowledgement that the bug might in fact have undesirable statistical properties not noticed in Intel's initial analysis. Instead Intel has taken the position, both then and now, that the only relevant story about the statistics of the bug and its implications for risk analysis are exactly those originally determined by Intel, and that all other analyses are as irrelevant as the analysis that achieves a high rate by repeatedly computing 4195835/3145727, "standing under a meteor" as Grove puts it.

Although Intel has acknowledged the furor created by the Pentium bug, it has attributed that furor not to any problem with its original risk analysis but to other causes such as Intel's growing public visibility, new rules, and forces created by the emerging Internet. Grove makes clear in his Newsweek article the complete faith Intel placed, then and now, in the soundness of its own research staff's risk analysis, to the exclusion of all other assessments. Intel has really stuck its heels into the ground on this issue of risk analysis, in stark contrast to the immense and expensive amount of ground they have given on just about every other front in connection with the bug.

Despite those features of the bug that so enormously increase the likelihood of encountering it over Intel's estimate, that likelihood for many users would in practice still be too low for concern. That probability however does not excuse Intel's refusal to even consider the risk analyses performed by others. That the bug has the potentially damaging statistical characteristics described above is disputed by no one outside Intel. That Intel refuses to acknowledge even that much is the height of corporate irresponsibility.

The unintended result of Intel's hiding its head in the sand in this way seems to have been to stampede Intel's customers into demanding replacement processors. By refusing to acknowledge the statistical properties of the bug, it became impossible for Intel to provide its customers with meaningful advice on the practical significance of those properties. This left concerned customers with litle alternative but to avail themselves of Intel's unprecedented and exceedingly generous offer of a replacement chip. This offer cost Intel $475 million as a onetime writeoff, to say nothing of the ongoing expense of major overhauls to how Intel does business with its customers.

Had Intel engaged in constructive dialog about the statistical properties of the bug, a consistent story might have emerged. IBM, academics, Intel's competition, and Intel's customers could then have arrived at some sort of consensus, no doubt with some residual points of disagreement. IBM would probably have maintained its stand on not selling buggy Pentiums since any consensus that took into adequate account the anomalous statistics of the bug would have left IBM's rationale for withdrawal largely unaltered. However it is a distinct possibility that many customers who availed themselves of Intel's offer would have found this consensus sufficiently reassuring in their situations that they would have been satisfied with the status quo and not insisted as strenuously on a replacement CPU. Industry norms are such that a customer buys a new computer "as is," bugs and all, and does not expect a replacement computer every time a new bug is found, except in cases that are typically much more extreme than with the Pentium bug.

Instead, Intel's fortress mentality concerning the risk analysis of the bug unintentionally sowed the sort of fear, doubt, and uncertainty that IBM is legendarily accused of doing intentionally. This self-imposed isolation from the mainstream of discussion about the bug forced Intel to find other far more expensive ways to placate its many unhappy customers, including exceeding industry norms with their generous but expensive replacement program.

This inappropriate behavior on Intel's part raises the question of Intel's competence to perform adequate risk analysis of the bugs that will inevitably appear in future products, to which no manufacturer is immune. Although there is no such thing as a bug-free computer, there is such a thing as responsible evaluation of the severity of bugs taking all factors and opinions into unbiased account. Will Intel in future be more open-minded and realistic about the analysis of future bugs in its products than it was with the Pentium, or will it continue to err on the side of wild optimism and heed only its own risk analyses, even when they are so dramatically contradicted by the analyses of other technically qualified parties?

If Andy Grove's Newsweek article is any indication, not only can Intel not analyze risk reliably, it is not even aware of risk analysis as an issue in the Pentium affair.

Summary

How the Pentium Divides Incorrectly

Size and Frequency of the Errors

A Short History of the Bug

Intel's Faulty Risk Analysis

Other Reading