The following is my original announcement, posted to the Internet newsgroups comp.arch and comp.sys.intel on December 3, 1994, concerning an extraordinarily high error rate in the Pentium floating-point division hardware when dividing small bruised integers. Several days later IBM announced that they too had observed this remarkable property of the Pentium division bug; shortly thereafter IBM published a study of their findings and simultaneously announced, on the basis of those findings, that they were halting further shipment of their Pentium-based products.
It should be noted that this message was a response to a post by Joe Buck defending Intel's position and asking why all the fuss. Reacting in turn to my message, Buck had the following to say: ``From Vaughan Pratt's analysis, it appears that the Pentium bug is far more serious than Prof. Nicely's analysis, or Intel's remarks, would indicate, because the quotients that cause the bugs are far more likely to occur than a random distribution would indicate. ... I withdraw my previous remarks that the fuss has been overblown.''
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: A natural scenario with high FDIV bug probability (was: In Intel's Defense...)
Date: 3 Dec 1994 15:20:17 GMT
Organization: Computer Science Department, Stanford University.
Lines: 194
Message-ID: <3bq2bh$ohc@Radon.Stanford.EDU>
References: <3bdieq$lqf@engnews2.Eng.Sun.COM> <3bf666$6k4@hpsystem1.informatik.tu-muenchen.de> <3bnrgo$4gf@hermes.synopsys.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15209 comp.sys.intel:20065
In this message I give a scenario in which FDIV bugs are encountered once every three milliseconds or so, in contrast to Intel's default scenario in which they are encountered every 27,000 years or so. Furthermore, none of the encountered bugs involve obscure numbers: all of them take the form of small "bruised" rationals like 5/15, which when bruised as 4.999999/14.999999 yields 0.333329 on the Pentium and 0.33333329 elsewhere. Moreover, quite a few of the errors encountered are very large, as with this example. I leave the plausibility of this scenario for the reader to judge; my intent is that it come across as something that could well happen.
By way of motivation:
In article <3bnrgo$4gf@hermes.synopsys.com>,
Joe Buck writes:
>I am really quite amazed at all of this fuss. In one case out of nine
>billion, you get what in essence is a single-precision division instead of
>a double-precision division. The vast majority of Pentium users never do
>anything that requires even full single-precision: they run spreadsheets,
>write documents, and play games. Why should Intel pay about 1 billion
>(that's nine zeros) dollars to get all these lusers a new chip?
One in nine billion, NOT. It is a common fallacy that real numbers
arising in a user's program are uniformly distributed. Their actual
distribution depends critically on both where the program gets its data
and what it does with it, as with the scenario described below.
Correct to within single-precision, NOT. Unless, of course, you had in
mind a 16-bit word length; other examples of ~16-bit errors have
already been given, but the form of the ones we give here, such as the
above-cited 4.999999/14.999999, makes them particularly dramatic.
In this message I give a simple and plausible scenario, not involving
number theory, cryptography, differential equations, or matrix
inversions, but simply divisions by integers of one to three digits,
represented as reals, that have been slightly "bruised" by a process
all of us have seen many times. In this scenario the FDIV bug reveals
itself not once every 27,000 years but once every three milliseconds
or so, which is the rate at which my Pentium was encountering these
errors.
The critical postulate here is that the integers encountered in this
scenario have been "bruised" very slightly due to other processing and
then truncated, for whatever reason (e.g. the numbers might have been
obtained automatically from a decimal calculator), to some preordained
number of decimal digits, the *precision*. All integers are subjected
uniformly to this treatment, for a fixed precision. Thus if the
precision is say 6, and we are dividing say 7 by 18, then 7 is actually
6.999999 and 18 is 17.999999, that is, the same quantity, here 10^-6,
is subtracted from both operands.
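In C, for concreteness, this bruising at precision 6 amounts to nothing
more than the following toy fragment (an illustration only, separate
from the tabulating program appended at the end of this message; the
volatile merely keeps the compiler from doing the division itself at
compile time):

    #include <stdio.h>

    int main(void)
    {
        volatile double prec = 1e-6;   /* precision 6: a bruise of one millionth */
        double i = 7.0 - prec;         /* 7 bruised to 6.999999 */
        double j = 18.0 - prec;        /* 18 bruised to 17.999999 */
        printf("%.8f\n", i / j);       /* the bruised quotient 6.999999/17.999999 */
        return 0;
    }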
There are a million pairs of integers i,j with 1 <= i,j <= 1000. The
following table shows, as a function of the decimal precision going
down, and the *tolerance* (how far off a division must be to count as a
wrong answer) going across, how many of the one million possible
divisions i/j are wrong. For these quotients the IEEE-correct answers
can be expected to have a relative error of better than 10^-17. The
table below defines "wrong" as having a relative error of at least
10^-15 for the first column, at least 10^-13 for the second, etc.
(That is, going across the table we become less fussy and hence
recognize fewer quotients as wrong.)
Tolerance (10^-):    15    13    11     9     7     5     4
-----------------------------------------------------------
Decimal precision      Number wrong, out of a million
-----------------------------------------------------------
      15              4     4     0     0     0     0     0
      14            101   101    29     0     0     0     0
      13            871   864   252     0     0     0     0
      12            876   867   573     0     0     0     0
      11            871   868   822    52     0     0     0
      10            875   871   852   211     0     0     0
       9            886   885   875   527     6     0     0
       8            875   875   873   762    69     0     0
       7            833   833   833   818   258     0     0
       6            627   627   627   623   426    14     0
       5            233   233   233   232   205    10     0
       4             43    43    43    42    41     2     0
       3              0     0     0     0     0     0     0
With 6 digits of precision the truncation error is one millionth, an
easily remembered quantity. For this case, the reader may wish to
memorize two or three of the worst offenders, if only for production at
cocktail parties.
My favorite is 5/15, that is, 4.999999/14.999999, appraised by the
Pentium at
    0.33332922
when other appraisers would insist on
    0.33333329
          ^^
as a fairer answer.
Another bad case is 7/48 (6.999999/47.999999), for which the Pentium
guesses
    0.14583204
when a more inspired guess would be
    0.14583332
           ^^
One more candidate: 9/54 (8.999999/53.999999), where the Pentium
gambles on
    0.16666439
and loses petulantly to
    0.16666665
           ^^
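Readers wishing to try these at home can compile the following sketch
(mine, not part of the appended program; the volatiles keep the
compiler from doing the divisions at compile time). A flawed Pentium
should print numbers close to the Pentium values above; anything else
prints the fairer answers.

    #include <stdio.h>

    int main(void)
    {
        /* The three cocktail-party divisions, printed to eight places. */
        volatile double n1 = 4.999999, d1 = 14.999999;   /* bruised 5/15 */
        volatile double n2 = 6.999999, d2 = 47.999999;   /* bruised 7/48 */
        volatile double n3 = 8.999999, d3 = 53.999999;   /* bruised 9/54 */
        printf("5/15: %.8f\n", n1 / d1);
        printf("7/48: %.8f\n", n2 / d2);
        printf("9/54: %.8f\n", n3 / d3);
        return 0;
    }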
Let me emphasize the three essential features of examples such as
these. First, the basic arithmetic is with quite small integers
(albeit slightly "bruised"), greatly increasing the likelihood that you
will encounter these exact errors, whether or not you ever notice
them. Second, the bruising can be by any simple amount, like a
millionth or a ten-millionth; it does not have to be an obscure
quantity like 0.000000142883, which greatly increases the likelihood
that the kind of bruising arising in your situation will be bad for
you. Third, the error can be applied uniformly to both operands of
FDIV; one does not need to tweak the errors of the operands
separately, giving you an even better chance of encountering one of
these errors.
An odd feature of these three examples that for want of time I have not
pursued and that may or may not have a uniform explanation is that in
every case the error can be described approximately by saying that the
Pentium omitted two of the repeating digits, underlined with ^^ in the
above. This yields a simple heuristic for emulating a Pentium on these
three examples: calculate 4.999999/14.999999 or whatever on your pocket
calculator, then delete two of the repeating digits; the result is good
to one digit beyond the repeating string (actually two in the 5/15
case).
It should be noted that the relative error in these three examples is
considerably greater than the precision: for 5/15 it is about
1.2*10^-5, roughly a dozen times the bruise of 10^-6.
Only 26 pairs survive at tolerance 10^-5. Here for the record are the
14 wrong'uns for precision 6: 5/15, 5/30, 5/60, 5/120, 9/54, 9/108,
9/216, 9/432, 9/864, 10/60, 10/120, 10/240, 10/480, and 10/960. (7/48
is off by only 0.9*10^-5, just barely missing out on tolerance 10^-5.)
For precision 5 the 10 bad guys are 18/27, 20/30, 33/144, 36/108,
40/120, 44/192, 72/432, 72/864, 80/480, and 80/960. And for precision
4 the two divisions are 82/96 and 120/288; specifically,
81.9999/95.9999 yields 0.854156 when we expected 0.8541665, while
119.9999/287.9999 yields 0.416656 and we wanted 0.4166665. (Hmm, the
delete-two-repeats rule works here too; very interesting...)
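Anyone wishing to reproduce these lists can use the following variant
of the program appended at the end of this message; as shown it is set
for precision 6 and tolerance 10^-5, and it prints the offending pairs
instead of merely counting them (on bug-free hardware it prints
nothing).

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double prec = 1e-6;    /* precision 6 */
        double tol  = 1e-5;    /* tolerance 10^-5 */
        int i, j;

        for (i = 1; i <= 1000; i++)
            for (j = 1; j <= 1000; j++) {
                double ib = i - prec, jb = j - prec;     /* bruised operands */
                double pent = ib / jb;                   /* may hit the FDIV bug */
                double ref = (1.11 * ib) / (1.11 * jb);  /* the 1.11 trick dodges it */
                if (fabs(pent - ref) > tol * ref)
                    printf("%d/%d\n", i, j);
            }
        return 0;
    }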
Although we have been assuming radix ten for our notion of precision,
it is not special, and any other radix should yield similar results,
albeit with some other assortment of small rational erring divisions.
In particular if we had evolved thirteen fingers instead of ten, the
corresponding table going from 11 digits down to 3 digits at tolerance
10^-9 would have been 0, 31, 211, 547, 802, 784, 417, 109, 0.
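For the record, the thirteen-fingered experiment is the same
computation with the bruise stepping by powers of 1/13 rather than
1/10. A sketch, restricted to the single tolerance 10^-9 quoted above,
which on a flawed Pentium should reproduce that row of counts:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double tol = 1e-9;     /* the one tolerance column quoted above */
        int i, j, cnt, digits;
        double prec;

        /* Radix-13 precision runs from 11 digits down to 3;
           the bruise subtracted from each operand is 13^-digits. */
        for (digits = 11, prec = pow(13.0, -11.0); digits > 2; digits--, prec *= 13) {
            cnt = 0;
            for (i = 1; i <= 1000; i++)
                for (j = 1; j <= 1000; j++) {
                    double ib = i - prec, jb = j - prec;
                    double pent = ib / jb;
                    double ref = (1.11 * ib) / (1.11 * jb);
                    cnt += (fabs(pent - ref) > tol * ref);
                }
            printf("%2d digits: %3d\n", digits, cnt);
        }
        return 0;
    }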
While this scenario puts the FDIV bug in what looks at least to me like
a pretty bad light, I do not claim it is anything like the worst case
scenario. The two essential factors for any scenario are the rate at
which that scenario triggers the FDIV bug, and how often the essence of
that scenario arises in the real world. The "damage index" of a
scenario is the product of those two factors. A scenario that accounts
for thousands of hours of actual Pentium time but that only triggers
the bug say every ten minutes may well have a higher damage index than
my scenario above, whose strength is that it triggers the bug every few
milliseconds but whose weakness is the uncertainty as to how likely it
really is in practice. You must admit however that it is not a
completely implausible scenario.
-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
I append here the code that prints the above table. For those who
think if not harmonize in C, this is where to look for the exact
meanings of "precision" and "tolerance".
#include <stdio.h>
#include <math.h>

int main(void)
{
	/* The seven tolerances heading the columns of the table above. */
	double tols[] = {1e-15, 1e-13, 1e-11, 1e-9, 1e-7, 1e-5, 1e-4};
	int i, j, t, cnt, digits;
	double prec, tol, ibruis, jbruis, pent, true;  /* "true" is an ordinary
	                                                  identifier in pre-C23 C */

	/* Decimal precision runs from 15 digits down to 3; the bruise
	   subtracted from each operand is 10^-digits. */
	for (digits = 15, prec = 1e-15; digits > 2; digits--, prec *= 10) {
		printf("%4d\t\t", digits);
		for (t = 0; t < 7; t++) {
			tol = tols[t];
			cnt = 0;
			for (i = 1; i <= 1000; i++)
				for (j = 1; j <= 1000; j++) {
					ibruis = i - prec, jbruis = j - prec;
					pent = ibruis/jbruis;  /* the division under test */
					true = (1.11*ibruis)/(1.11*jbruis);  /* dodges the bug; see below */
					cnt += (fabs(pent - true) > tol*true);
				}
			printf("%4d ", cnt);
		}
		printf("\n");
	}
	return 0;
}
My heuristic for getting the true quotient, namely true =
(1.11*ibruis)/(1.11*jbruis), is the quick and dirty one of randomizing
the operands just enough to move the probabilities of hitting the FDIV
bug at least somewhat closer to Intel's estimate of once in 27,000
years. When I replaced 1.11 by 1.01 the results were identical, which
along with the Intel estimate seemed good enough for a program not
intended for building planes.
Vaughan Pratt