=================================================================
-------------------------PENTIUM REPORT # bug1----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: A natural scenario with high FDIV bug probability (was: In Intel's Defense...)
Date: 3 Dec 1994 15:20:17 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 194
Message-ID: <3bq2bh$ohc@Radon.Stanford.EDU>
References: <3bdieq$lqf@engnews2.Eng.Sun.COM> <3bf666$6k4@hpsystem1.informatik.tu-muenchen.de> <BURLEY.94Nov29090419@churchy.gnu.ai.mit.edu> <3bnrgo$4gf@hermes.synopsys.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15209 comp.sys.intel:20065

In this message I give a scenario in which FDIV bugs are encountered
once every three milliseconds or so, in contrast to Intel's default
scenario in which they are encountered every 27.000 years or so.
Furthermore none of the encountered bugs involve obscure numbers: all
of them take the form of small "bruised" rationals like 5/15, which
when bruised as 4.999999/14.999999 yields 0.333329 on the Pentium and
0.33333329 elsewhere.  Moreover quite a few of the errors encountered
are very large, as with this example.  We leave the plausibility of our
scenario for the reader to judge; my intent is that it come across as
something that could well happen.

By way of motivation:

In article <3bnrgo$4gf@hermes.synopsys.com>,
Joe Buck <jbuck@synopsys.com> wrote:
>I am really quite amazed at all of this fuss.  In one case out of nine
>billion, you get what in essence is a single-precision division instead of
>a double-precision division.  The vast majority of Pentium users never do
>anything that requires even full single-precision: they run spreadsheets,
>write documents, and play games.  Why should Intel pay about 1 billion
>(that's nine zeros) dollars to get all these lusers a new chip?

One in nine billion, NOT.  It is a common fallacy that real numbers
arising in a user's program are uniformly distributed.  Their actual
distribution depends critically on both where the program gets its data
and what it does with it, as with the scenario described below.

Correct to within single-precision, NOT.  Unless of course you had in
mind a 16-bit word length; other examples of ~16-bit errors have
already been given, but the form of the ones we give here such as the
above-cited 4.999999/14.999999 makes them particularly dramatic.

In this message I give a simple and plausible scenario, not involving
number theory, cryptography, differential equations, or matrix
inversions, but simply divisions by integers of one to three digits,
represented as reals, that have been slightly "bruised" by a process
all of us have seen many times.  In this scenario the FDIV bug reveals
itself, not once every 27,000 years, but rather once every three
milliseconds or so, the rate at which my Pentium was encountering
them.

The critical postulate here is that the integers encountered in this
scenario have been "bruised" very slightly due to other processing and
then truncated, for whatever reason (e.g. the numbers might have been
obtained automatically from a decimal calculator), to some preordained
number of decimal digits, the *precision*.  All integers are subjected
uniformly to this treatment, for a fixed precision.  Thus if the
precision is say 6, and we are dividing say 7 by 18, then 7 is actually
6.999999 and 18 is 17.999999, that is, the same quantity, here 10^-6,
is subtracted from both operands.

There are a million pairs of integers i,j with 1 <= i,j <= 1000.  The
following table shows, as a function of the decimal precision going
down, and the *tolerance* (how far off a division must be to count as a
wrong answer) going across, how many of the one million possible
divisions i/j are wrong.  For these quotients the IEEE-correct answers
can be expected to have a relative error of better than 10^-17.  The
table below defines "wrong" as having a relative error of at least
10^-15 for the first column, at least 10^-13 for the second, etc.
(That is, going across the table we become less fussy and hence
recognize fewer quotients as wrong.)

        Tolerance (10^-): 15    13    11     9     7     5     4
  --------------------------------------------------------------
  Decimal precision           Number wrong, out of a million
  --------------------------------------------------------------
	  15               4     4     0     0     0     0     0
	  14             101   101    29     0     0     0     0
	  13             871   864   252     0     0     0     0
	  12             876   867   573     0     0     0     0
	  11             871   868   822    52     0     0     0
	  10             875   871   852   211     0     0     0
	   9             886   885   875   527     6     0     0
	   8             875   875   873   762    69     0     0
	   7             833   833   833   818   258     0     0
	   6             627   627   627   623   426    14     0
	   5             233   233   233   232   205    10     0
	   4              43    43    43    42    41     2     0
	   3               0     0     0     0     0     0     0

With 6 digits of precision the truncation error is one millionth, an
easily remembered quantity.  For this case, the reader may wish to
memorize two or three of the worst offenders, if only for production at
cocktail parties.

My favorite is 5/15, that is, 4.999999/14.999999, appraised by the Pentium at
0.33332922 when other appraisers would insist on
0.33333329 as a fairer answer.
      ^^
Another bad case is 7/48 (6.999999/47.999999), for which the Pentium guesses
0.14583204 when a more inspired guess would be
0.14583332.
       ^^
One more candidate: 9/54 (8.999999/53.999999), where the Pentium gambles on
0.16666439 and loses petulantly to
0.16666665.
       ^^

Let me emphasize the three essential features of examples such as
these.  First, the basic arithmetic is with quite small integers
(albeit slightly "bruised"), greatly increasing the likelihood that you
will encounter these exact errors, whether or not you ever notice
them.  Second, the bruising can be by any simple amount, like a
millionth or a ten-millionth, it does not have to be an obscure
quantity like 0.000000142883, greatly increasing the likelihood that
the kind of bruising arising in your situation will be bad for you.
Third, the error can be applied uniformly to both operands of FDIV, one
does not need to tweak the errors of the operands separately, giving
you an even better chance of encountering one of these errors.

An odd feature of these three examples that for want of time I have not
pursued and that may or may not have a uniform explanation is that in
every case the error can be described approximately by saying that the
Pentium omitted two of the repeating digits, underlined with ^^ in the
above.  This yields a simple heuristic for emulating a Pentium on these
three examples: calculate 4.999999/14.999999 or whatever on your pocket
calculator, then delete two of the repeating digits; the result is good
to one digit beyond the repeating string (actually two in the 5/15
case).

It should be noted that the relative error in these three examples is
considerably greater than the precision.

Only 26 pairs survive at tolerance 10^-5.  Here for the record are the
14 wrong'uns for precision 6:  5/15, 5/30, 5/60, 5/120, 9/54, 9/108,
9/216, 9/432, 9/864, 10/60, 10/120, 10/240, 10/480, and 10/960.  (7/48
is off by only 0.9*10^-5, just barely missing out on tolerance 10^-5.)
For precision 5 the 10 bad guys are 18/27, 20/30, 33/144, 36/108,
40/120, 44/192, 72/432, 72/864, 80/480, and 80/960.  And for precision
4 the two divisions are 82/96 and 120/288; specifically,
81.9999/95.9999 yields 0.854156 when we expected 0.8541665, while
119.9999/287.9999 yields 0.416656 and we wanted 0.4166665.  (Hmm, the
delete-two-repeats rule works here too; very interesting...)

Although we have been assuming radix ten for our notion of precision,
it is not special, and any other radix should yield similar results,
albeit with some other assortment of small rational erring divisions.
In particular if we had evolved thirteen fingers instead of ten, the
corresponding table going from 11 digits down to 3 digits at tolerance
10^-9 would have been 0, 31, 211, 547, 802, 784, 417, 109, 0.

While this scenario puts the FDIV bug in what looks at least to me like
a pretty bad light, I do not claim it is anything like the worst case
scenario.  The two essential factors for any scenario are the rate at
which that scenario triggers the FDIV bug, and how often the essence of
that scenario arises in the real world.  The "damage index" of a
scenario is the product of those two factors.  A scenario that accounts
for thousands of hours of actual Pentium time but that only triggers
the bug say every ten minutes may well have a higher damage index than
my scenario above, whose strength is that it triggers the bug every few
milliseconds but whose weakness is the uncertainty as to how likely it
really is in practice.  You must admit however that it is not a
completely implausible scenario.

-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o

I append here the code that prints the above table.  For those who
think if not harmonize in C, this is where to look for the exact
meanings of "precision" and "tolerance".

#include <stdlib.h>
#include <math.h>

main(argc, argv)
	char *argv[];
{
	int i, j, cnt, digits;
	double prec, tol, ibruis, jbruis, pent, true;
	for (digits = 15, prec = 1e-15; digits > 2; digits--, prec *= 10) {
		printf("%4.d\t\t", digits);
		for (tol = 1e-15; tol < 1e-4; tol *= 100) {
			cnt = 0;
			for (i = 1; i <= 1000; i++)
				for (j = 1; j <= 1000; j++) {
					ibruis = i - prec, jbruis = j - prec;
					pent = ibruis/jbruis;
					true = (1.11*ibruis)/(1.11*jbruis);
					cnt += (fabs(pent - true) > tol*true);
				}
			printf("%4d  ", cnt);
		}
		printf("\n");
	}
}

My heuristic for getting the true quotient, namely true =
(1.11*ibruis)/(1.11*jbruis), is the quick and dirty one of randomizing
the operands just enough to move the probabilities of hitting the FDIV
bug at least somewhat closer to Intel's estimate of once in 27,000
years.  When I replaced 1.11 by 1.01 the results were identical, which
along with the Intel estimate seemed good enough for a program not
intended for building planes.
-- 
Vaughan Pratt			http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug1.ans----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15289 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!agate!howland.reston.ans.net!swrinde!sgiblab!chronos.synopsys.com!news.synopsys.com!jbuck
From: jbuck@synopsys.com (Joe Buck)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: A natural scenario with high FDIV bug probability (was: In Intel's Defense...)
Date: 5 Dec 1994 18:45:07 GMT
Organization: Synopsys Inc., Mountain View, CA 94043-4033
Lines: 26
Message-ID: <3bvn3j$nbf@hermes.synopsys.com>
References: <3bdieq$lqf@engnews2.Eng.Sun.COM> <BURLEY.94Nov29090419@churchy.gnu.ai.mit.edu> <3bnrgo$4gf@hermes.synopsys.com> <3bq2bh$ohc@radon.stanford.edu>
NNTP-Posting-Host: deerslayer.synopsys.com
Xref: Radon.Stanford.EDU comp.arch:15289 comp.sys.intel:20638

pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes:
>In this message I give a scenario in which FDIV bugs are encountered
>once every three milliseconds or so, in contrast to Intel's default
>scenario in which they are encountered every 27.000 years or so.
>Furthermore none of the encountered bugs involve obscure numbers: all
>of them take the form of small "bruised" rationals like 5/15, which
>when bruised as 4.999999/14.999999 yields 0.333329 on the Pentium and
>0.33333329 elsewhere.  Moreover quite a few of the errors encountered
>are very large, as with this example.

From Vaughan Pratt's analysis, it appears that the Pentium bug is far
more serious than Prof. Nicely's analysis, or Intel's remarks, would
indicate, because the quotients that cause the bugs are far more likely to
occur than a random distribution would indicate.

I withdraw my previous remarks that the fuss has been overblown.  I was
basing them on Nicely's remarks rather than on Intel's, though both simply
computed the fraction of divisor pairs that yield wrong answers and both
underestimated the maximum possible error considerably.

I think that the likelihood of hitting these values is definitely going to
be application-dependent, but it certainly seems that these near-integer
values are going to be far more common than random.
-- 
-- Joe Buck 	<jbuck@synopsys.com>	(not speaking for Synopsys, Inc)
Phone: +1 415 694 1729


=================================================================
-------------------------PENTIUM REPORT # bug2----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: A natural scenario with high FDIV bug probability (was: In Intel's Defense...)
Summary: 
Expires: 
References: <3bdieq$lqf@engnews2.Eng.Sun.COM> <BURLEY.94Nov29090419@churchy.gnu.ai.mit.edu> <3bnrgo$4gf@hermes.synopsys.com> <3bq2bh$ohc@Radon.Stanford.EDU>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

In article <3bq2bh$ohc@Radon.Stanford.EDU>,
Vaughan R. Pratt <pratt@Sunburn.Stanford.EDU> wrote:
>In this message I give a scenario in which FDIV bugs are encountered

In that message I showed that, by dividing "bruised" rationals with
numerator and denominator at most 1000, the probability that a randomly
chosen such rational would generate the FDIV bug was just under one in
a thousand.  It later occurred to me to ask for the corresponding
proportion when the limit was set to 100.  It turns out to be just
under one in a hundred.  However as the limit is further decreased the
rate does not increase further but stays roughly constant.  When the
limit is changed from 20 to 10 it drops to exactly zero.

Here is the table for the 10,000 bruised divisions i/j, see my previous
message under this heading for the meanings of the entries.

			   Tolerance
		  15    13    11     9     7     5
Precision  ---------------------------------------
  15               4     4     0     0     0     0  
  14              86    86    24     0     0     0  
  13              78    78    52     0     0     0  
  12              78    78    76     0     0     0  
  11              78    78    77    17     0     0  
  10              78    78    78    52     0     0  
   9              80    80    80    73     2     0  
   8              78    78    78    78    23     0  
   7              71    71    71    71    51     0  
   6              54    54    54    54    51     5  <---
   5              10    10    10    10    10     2  
   4               1     1     1     1     1     1  
   3               0     0     0     0     0     0  

To illustrate, the 5/15 example is one of the 5 items the arrow is
pointing to, because the Pentium computes 4.999999/14.999999 (precision
6) with a relative error worse than the tolerance for that column of
10^-5.  It shows up in both this table and the previous table because
both the numerator 5 and the denominator 15 are below 100, and hence
below 1000.

At precision 6, for which the typical division considered for this
table is 43.999999/87.999999 (that one is among the innocent majority),
51 of the 10,000 divisions are off by worse than one part in ten
million, while 5 of them are off by worse than one part in a hundred
thousand.  The former statistic means that if you do a lot of divisions
involving very small (less than 100) bruised integers, you have more
than a half a percent chance *at each division* of getting an error
worse than that expected from using single precision arithmetic.  If
you were doing mainly divisions, this would mean that you would see the
FDIV bug every few hundred microseconds, a far cry from Intel's 27,000
years.  (With the limit on numerator and denominator set at 1000, my
previous message had reported FDIV bugs coming at a rate of one every
three milliseconds for randomly chosen small bruised rationals.)

This bug is maximally insidious: it is about as large as it could
possibly be without actually triggering warning bells when people
review their columns of data.  In this way tiny errors of one part in a
hundred thousand can over a long period of time sneak into the
trillions of calculations done around the world and there is no
practical way to spot them short of doing a massively detailed audit of
a kind that would be have to be tailored to the Pentium FDIV bug and
would be entirely unnecessary for a reliable floating point unit.
Intel may argue that errors of one in a hundred thousand are too small
for the average business to care about.  I would argue on the basis of
the statistics of my last two messages that the Pentium has the ability
to do to the information superhighway what the Loma Prieta earthquake
did to San Francisco's highways.  That many of us did not witness that
damage at first hand does not mean that it was not serious.
=================================================================
-------------------------PENTIUM REPORT # bug3----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 20292 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch
Subject: The Bad Scenario for FDIV--Simplified Account (was: Intel's own description)
Date: 4 Dec 1994 07:56:12 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 55
Message-ID: <3brsms$dae@Radon.Stanford.EDU>
References: <3bq8re$sck@nntp.Stanford.EDU> <D09Bv0.23G@utnetw.utoledo.edu>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:20292 comp.arch:15232

In article <D09Bv0.23G@utnetw.utoledo.edu>,
Tim Arheit AKA Fool Boy <tarheit@cse.utoledo.edu> wrote:
>The problem is that my software doesn't use random number as data.  One
>of the postes in this group shows that simple division of ~integer numbers
>shows a significant error (something on the order of 100 times less accurate)
>4.99999999/14.999999  (5/15)  (I think these are some of the example
>numbers given, it's in this group somewhere) generated significant error.
>The point is that integer constants in a floating point equation are far
>from random.

Let me correct a minor detail and at the same time repackage what I
said in two earlier messages in a hopefully more digestible and
memorizable form.

The exact example here subtracted one millionth (easy to remember) from
each of 5 and 15, i.e. 4.999999/14.999999.  The correct value is
0.33333329, Pentium math makes it 0.333329, wrong in the fifth decimal
place, specifically a relative error of 1.2 in 10^5.

Other such rationals are 7/48 and 9/54.  For each of these, if you
subtract one millionth from both numerator and denominator before
dividing, you again obtain a relative error of about one in a hundred
thousand.

Among rationals with numerator and denominator both bounded by 1000,
each decremented by a millionth, there are 14 that behave as badly as
the three examples above, and an additional 412 that the Pentium
evaluates with a relative error between 10^-5 and 10^-7.  (For
comparison, using single precision instead of double yields a maximum
relative error of 6*10^-8, so all these errors are worse than would
result merely from using single precision instead of double.)

Hence if you compute with randomly chosen approximate (in the above
sense) rationals from that population, the probability of an error in
the fifth place is one in 70,000, while the probability of an error in
the sixth or seventh place is one in 2,000.

The P90 can do better than two divisions per microsecond, at which
rate, assuming the above population of approximate rationals, the small
error (sixth or seventh place) occurs once a millisecond on average
while the large error (fifth place) occurs once every 35 milliseconds.

When the division operands are bounded by 100 instead of 1000, the
1/2,000 figure for the small error drops to 1/200 (a factor of 10
worse) while the 1/70,000 rate for the large error drops to 1/2,000 (a
factor of 35 worse), i.e.  the large errors now happen once a
millisecond.

Detailed tables of such errors, along with a program for verifying my
conclusions for yourself, can be found in my two earlier messages on
this subject on comp.arch,comp.sys.intel.  Search for the phrase
"ibruis" (a variable in my program) or the subject "A natural scenario
with high FDIV bug probability (was: In Intel's Defense...)".
-- 
Vaughan Pratt			http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug4----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.ibm.pc.hardware.chips,comp.sys.ibm.pc.hardware.systems,comp.sys.ibm.pc.hardware.misc,comp.sys.intel,sci.engr,sci.engr.civil
Subject: Re: Intel Pentium Bug: Scary Business?
Summary: 
Expires: 
References: <3bfclk$q28@rc1.vub.ac.be> <3boqfn$crj@Radon.Stanford.EDU> <1994Dec04.044853.177619@zeus.aix.calpoly.edu> <hpa.39b10000.Heja.Sverige@ahab.eecs.nwu.edu>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

In article <hpa.39b10000.Heja.Sverige@ahab.eecs.nwu.edu>,
H. Peter Anvin <hpa@nwu.edu> wrote:
>Followup to:  <1994Dec04.044853.177619@zeus.aix.calpoly.edu>
>By author:    jcenter@tuba.aix.calpoly.edu (The Free Radical)
>In newsgroup: comp.sys.intel
>
>> Or is the case one of necessary precision.  I only use about 3 or 4 sig 
>> figs in nearly all of my calculations. If this is the case, can anyone 
>> (in any branch of engineering) describe what they are doing and why they 
>> need precision to 9 or so figures.
>
>You *use* 3-4 significant figures.  I would suggest taking a course in
>numerical methods which cover issues like stability, roundoff-error
>propagation etc, and you'll see why the number of significant figures
>matter.

The point was made earlier that lots of good science was done with
slide rules in the precalculator days.  If you only ever use the first
four digits of your *rational* calculations (those involving only the
four basic functions, +-*/) then the P90 will not let you down, in the
sense that there are no known 4th-place errors, and it is very unlikely
there are any.

There are *two* premises here, (i) the precision of the result and (ii)
how it is obtained.  Change *either* and you leave the safety of the
slide rule world and become vulnerable to the Pentium's FDIV bug.

(i) If you ever pay attention to the fifth digit of your answer then
you are at risk.  How much at risk?  Intel says you will see their bug
once every 27,000 years.  This is preposterous.  In recent posts on
comp.sys.intel,comp.arch I showed that divisions involving numbers of
the form a small integer (<1000) minus 10^-6 (or any other fixed
quantity of that general size) were at extraordinary risk.  For a
program doing only divisions of randomly chosen such numbers, 5TH PLACE
ERRORS OCCUR EVERY 50 MILLISECONDS at least.  If "small integer" means
<100 then that error rate for this sort of division jumps dramatically
by more than an order of magnitude.  (So if all Pentiums did this sort
of work that would total more than a billion 5th place errors a
second!)

Are reals that are very close to integers any more likely than other
reals?  Absolutely: (a) integers (reals with fractional part .000000)
occur as the result of a computation far more often than reals with any
other given fixed fractional part such as .708342, and (b) floating
point results of computations represented in floating point typically
include small errors.  These two factors conspire to make bruised
integers, those a tad less than an integer, far more likely than most
other reals.

(ii) If instead of individual rational calculations, i.e.
plus-minus-time-divide, you get your 4-digit numbers from more complex
calculations, you again become vulnerable.  This time it's because
those functions may be using FDIV during those complicated
calculations.  Even though you only consume 4 digits, the 3rd and 4th
places of your answer (and for sufficiently long or fragile types of
computation the second or even first place) can be easily affected by
5th place errors in intermediate results.  With yet longer or more
fragile computations the first four places start to depend on the 6th
and 7th places.

With the sorts of numbers considered above, a 6th or 7th place FDIV
error (and hence a 4th or even 1st place error later in the
computation) is ten times more likely than a 5th place error.  For
other values of both the error place and the amount by which these
small integers are decreased.  See the tables in my earlier posts for
the dependency of the error rate on these two parameters.

I should emphasize that these statistics are for quantities a hair
below a small integer.  The probability of encountering the FDIV bug on
the Pentium when dividing two randomly chosen floating point numbers is
considerably lower (but nowhere near as low as Intel tells us, see
below).  However floating point numbers arising in many kinds of
real-world computation are not randomly distributed, and some scenarios
can be expected to greatly increase the probability of dividing by
these Pentium-bad numbers.

Do not let anyone tell you such scenarios hardly ever happen.  Ask them
how they know this.

Intel's calculation of 27,000 years per error assumes that in all
computations, the numbers being fed to FDIV are uniformly distributed,
meaning that all floating point numbers are equally likely.  There are
few if any real-world situations where such complete randomness is in
fact the case.

On the other hand, as a method of estimating the rate of FDIV bugs,
this assumption should nevertheless yield a good estimate of the
expected rate in many situations, with two caveats.

NOT 27,000 years.  My tables show that 7th place errors occur at a rate
of 40 per million for small integers bruised by 10^-4.  Hence for
uniformly distributed reals the rate cannot be less than 10^-8 times
this, since 10^-4*10^-4 out of such pairs of reals will be within 10^-4
below an integer.  This comes to approximately one such error per
month, or about an error a second for Pentiums collectively.  27,000
years is irresponsibly wishful thinking.

And NOT in all situations.

A crucial question that Intel has ignored to date is, for how many
situations does this assumption of a uniform distribution grossly
underestimate the error rate?  I cannot answer this, BUT NEITHER CAN
INTEL.  Intel's public statements to date depend not only dubious
estimates for the uniform case but also on the proportion of nonuniform
cases being negligibly small.  Intel has given no reason for the
latter.  The greater likelihood of bruised integers over arbitrary
reals strongly suggests that the error rate in such situations is
several orders of magnitude greater than for uniform distributions.

Until the likely or even marginally likely occurrence of FDIV-hazardous
situations has been ruled out, Intel's public reassurances that the
Pentium is safe in practice are grossly irresponsible.  The responsible
thing for Intel to do is to state the full range of BOTH RATES AND
MAGNITUDES of FDIV errors that may occur in code performing some useful
function.

Ideally Intel would also give criteria for estimating approximately
where along that range various applications typically land, but there
is unfortunately little hard data today on which to base such an
estimate.  Hence the only data Intel can responsibly use at this time
in any public assessment of the Pentium FDIV bug is the FULL RANGE of
both the rates and magnitudes of errors possible in useful code.
Lacking further data, any less than that constitutes gross wishful
thinking and as such an abdication of public responsibility.
=================================================================
-------------------------PENTIUM REPORT # bug5----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 20369 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: How to find FDIV bug example
Date: 4 Dec 1994 20:14:31 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 38
Message-ID: <3bt7v7$5t5@Radon.Stanford.EDU>
References: <D05oE4.37pA@austin.ibm.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <D05oE4.37pA@austin.ibm.com>,
Sivasankar Chander <siva@bond.bocaraton.ibm.com> wrote:
>Remember, the more the "nice" numbers you have, the better the chances
>of getting a good jury award in a class-action suit.

For errors in the 5th place, here is a collection of "nice numbers"
I haven't posted before.
   [oops, just noticed this list buried in an earlier post, oh well. -vp]

18/27
20/30
33/144
36/108
40/120
44/192
72/432
72/864
80/480
80/960

Actually the Pentium computes all these divisions correctly, but if you
subtract .00001 from all of the above integers then the divisions when
performed on a Pentium all have relative errors of about .000012.  For
example the Pentium evaluates 17.99999/26.99999 as 0.6666575 when it
should have obtained 0.66666654,

If 18/27 is there, why isn't 18/54 on the list?  Because
17.99999/53.99999 is computed correctly on the Pentium, you have to go
to 17.999999/53.99998 to see an error.  Pretty nice, but not as nice as
when both operands end in a simple string of 9's, the case for all ten
examples above.

There are many more such "nice numbers" with varying degrees of error,
all worse than the error resulting from using single precision.  See my
previous posts especially my table of error rates as a function of
magnitude of error and how much is knocked off the numerator and
denominator.
-- 
Vaughan Pratt			http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug6----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Whiplash potential of the Pentium (Was Re: FDIV probability issue)
Summary: 
Expires: 
References: <gat-011294102834@milo.jpl.nasa.gov> <D09vuz.57zo@austin.ibm.com> <1994Dec4.175212@west.cscwc.pima.edu> <D0BGLx.MIB@cwi.nl>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

In article <D0BGLx.MIB@cwi.nl>, Dik T. Winter <dik@cwi.nl> wrote:
>In article <1994Dec4.175212@west.cscwc.pima.edu> 103t_english@west.cscwc.pima.edu writes:
> > In article <D09vuz.57zo@austin.ibm.com>, siva@bond.bocaraton.ibm.com (Sivasankar Chander) writes:
> > >   Thanks, and I don't want to spoil your euphoria, but Tim Coe found all
> > 
> > You've missed a significant point here: the bugs are easy to find if you go
> > 
>You are missing something.  Tim Coe did not come up with a "bug-generator-
>algorithm"; he came up with a Pentium FDIV emulator that predicted many
>bugs.

I hope no one will think I'm trying to pour crude oil on this storm in
a refined teacup if I describe the path from Tim Coe's analysis to my
particular "bruised integer" bug-generator, which hopefully will
clarify the respective contributions.  I'll then extend that path into
a new (I think) topic, how the FDIV bug impacts real-time
controllers---bottom line: worry (if only theoretically) about
whiplash.  All this of course is moot for the lucky few receiving clean
machines now and for all software incorporating the Moler et al
workaround, but I rather expect this still leaves a substantial
population of users desperately seeking a better grasp of exactly how
concerned they should be.

After getting lost wandering around comp.sys.intel when I was first
pointed at this issue last week, about which I previously knew nothing,
I eventually stumbled on Tim's illuminating table of where 90% of the
problem divisors fell, namely

    3.0 > divisor >= 3.0 - 36*(2^-22)
    9.0 > divisor >= 9.0 - 36*(2^-20)
    15.0 > divisor >= 15.0 - 36*(2^-20)
    21.0 > divisor >= 21.0 - 36*(2^-19)
    27.0 > divisor >= 27.0 - 36*(2^-19)

This was a breath of fresh air.  After staring at this table for a
while it became clear that if the Pentium were at any risk from the
bug, it would have to be where the divisor was a hair below an integer
(at least for divisors > 13.5, the largest noninteger matching the
above description).  Now integers tend to come up a lot in comparison
to other reals in many applications, even if represented in floating
point.  This got me to wondering what the 27,000 year figure turned
into when the division operands were integers.  Well, no bug there for
integers below about 250K since the table says the divisor has to be
*strictly below* these magic numbers in an interval too narrow to hit
any integers.  And for integers above 250K, although the interval
widens to >1, the proportion of those integers that are bad FDIV
divisors becomes very small.

But if a small integer is slightly bruised by computation, then a
nontrivial risk emerges.  How big?  Well, down in the small integers a
sizable proportion of the integers fit Coe's table, i.e. are 3 times a
power of 2 times 1, 3, 5, 7, or 9.  In fact 35 of them fit, namely the
5 numbers 3,6,9,12,15 plus the 30 numbers
{18,21,24,27,30}*{1,2,4,8,16,32}.  Call such integers pentads
(including the ones beyond 1000 obtained by extending the above powers
of two beyond 32---each such power of two gives another 5 pentads that
I'll call an *octave*).  35 out of 1000 gives us our first hint that
the <1000 region may be a sufficiently rich source of FDIV bugs to
cause some serious grief.

But merely being a hair under a pentad, while necessary, is not
sufficient for an FDIV error.  The Coe table tells where all the risk
is concentrated, but it gives no clue as to the *magnitude* of the
risk!

To measure this risk empirically I wrote a program that knocked 10^-6
off all one million pairs of integers from 1 to 1000, did the million
divisions on a Pentium, and checked which were buggy and by how much.
There were 627 bugs, of which 426 had a relative error worse than
10^-7, and 14 worse than 10^-5.

So if you're unlucky enough to do *all* your computing with such
bruised integers, and you do lots of divisions (the Pentium can do two
million divisions a second) you'll see the big 10^-5 errors 14*2 = 28
times a second or once every 35 microseconds, and the smaller errors,
between 10^-5 and 10^-7, 627*2 = 1254 times a second or more than one a
millisecond!

For more dilute concentrations of these unlucky numbers and slower
rates of division, starting from this basic limiting case you can
easily work out the corresponding chances yourself, which are in direct
proportion to each of those quantities.  For example if you are a
spreadsheet user who does 1000 FDIV's a day, Intel's official estimate,
with all numbers small bruised integers, you will see one small error
every 1000/627 = 1.6 days, and one 10^-5 error every 1000/14 = 70
days.  If only 1% of your numbers are bruised integers you're now up to
160 days and 7000 days (20 years) respectively.  That's IF the software
you invoke only does 1000 FDIV's a day, a question you might well ask
yourself.

After getting these basic numbers it was natural to try knocking other
small amounts off, and to tabulate the magnitudes of the resulting
errors more systematically.  This led to the program I posted on
December 3 that generated my two-dimensional table, whose vertical axis
P (precision) says how much we knock off the two operands before giving
them to the Pentium, and whose horizontal axis T (tolerance) gives the
amount the Pentium knocks off in response, and whose entries give the
number of relative errors greater than T over the million operations.
I also looked at what happens when other numbers besides powers of ten
are knocked off: the answer is that there is nothing magic about radix
ten and the statistics remain about the same for any amount knocked off
having the same general magnitude as any given power of ten.

Sometimes the Pentium knocks off more than we do, as when we knock
10^-6 off the divisor and it replies by knocking 10^-5 off the quotient
(way beyond what it should have *added* to the quotient).  In this case
Pentium division is not antimonotone when it should be, that is,
decreasing the denominator *decreases* the quotient when it should
*increase* it.  Consider for example the true arithmetic fact

	4.999999/15 < 4.999999/14.999999.

This inequality holds because the left-hand denominator is larger than
the right-hand denominator.  The Pentium however considers this
inequality to be false because when 15 is decremented by 10^-6, the
quotient instead of going *up* ~10^-7 instead goes *down* by ~10^-5.
(The very nice question of whether such a thing could happen at all,
which I hadn't thought to consider before, was posed to me by Norman
Hirsch.)

Such nonmonotonicity of any rational (+-*/) operator provides a
delightfully simple example of how a 5th place error can become a
leading-bit error with a single Pentium instruction, namely
comparison---the above comparison yields truth-value 0 on the Pentium
when it should have yielded 1.  This is in striking contrast to the
dozens or hundreds of instructions one usually imagines being required
to eventually propagate a 5th place error up to the leading bit.

REAL-TIME CONTROLLERS

This nonmonotonicity may sound like a minor issue, but it is at the
heart of a far more serious issue that has thus far received no
attention to my knowledge, namely the suitability of the Pentium for
real-time control applications in the light of the FDIV bug.  Fix an
unlucky (for FDIV) constant c and consider variables x and y where y
depends on x according to the relationship y = c/x.  Now let x increase
smoothly and watch what happens to y.  Whenever x approaches a pentad p
from below, suddenly y goes haywire: instead of continuing to approach
c/p smoothly from above as predicted by mathematics, y starts jumping
around like crazy, making moves dy thousands of times larger than one
would infer from the corresponding moves dx made by x.  Furthermore
these moves are made at a terrific rate, because the values where c/x
is wrong in that bad region are somewhat chaotically distributed (more
on this distribution in a future message).  Hence the derivative dy/dx
goes sky high, potentially inducing enormous forces.

If c/x is fed *directly* to a 16-bit DAC (CD players in fact use 18-bit
DACS to control the analog audio channel going to the speaker voice
coil), at worst its lower order bit is affected and these "sky-high
derivatives" won't be noticed---they're only sky high because the dx in
dy/dx is *extremely* tiny when dy is large.  But if c/x is instead fed
into some formula that picks off a small segment of the total curve and
feeds *that* to the DAC, it magnifies the FDIV noise in inverse
proportion to the size of that segment.  A designer expecting no noise
from FDIV might have done this as a side effect of copying some formula
and plugging in some values whose numerical analysis aspects he didn't
think too much about, but which shouldn't matter because the arithmetic
would work perfectly smoothly for a correct double-precision FDIV and a
16-bit DAC.  But with the Pentium, suddenly those sky-high derivatives
from the buggy FDIV that were just theoretical before are now magnified
to the point where the 4th or higher bit of the DAC may well be
affected and the forces become whatever the servomotor is capable of
applying.  If you're in a Pentium-controlled elevator, fasten your
seatbelt.

Over 99% (99.96% in the case of the top pentad of each octave) of the
bad divisors just below each pentad p are in the interval [kp,p) where
k is the number .9999975 (this is how the Coe table looks when
translated from bit-oriented computerese into simple math, with k
tweaked slightly to reconcile it with my own empirical measurements of
where the bad divisors live).  So when p is the pentad 960, the
interval is [959.9975,960).  Hence you only have to keep your seatbelt
fastened for a miniscule fraction of the distance between pentads,
about 1/40,000.  So if the ride takes you through one pentad per hour,
e.g. going from pentad 864 to pentad 960, the rough spots only last a
tenth of a second, just enough time to snap your head back and forward
again once.

One FDIV error per 27,000 years doesn't sound so bad.  One whiplash per
hour is a more daunting prospect.

Let me emphasize that this is pure theory, there are no reports to date
of any Pentium-controlled machinery actually doing any such thing.  And
for events lasting only a tenth of a second and occurring only once an
hour, most likely at much smaller magnitudes than in the
high-magnification scenario hypothesized above, such glitches could
occur many times a day in each of thousands of controllers around the
world and yet pass unnoticed at the typical noise levels of real-world
controllers.  This is not to downplay the above risks so much as to
emphasize the extreme difficulty of deciding whether they are at all
significant.
=================================================================
-------------------------PENTIUM REPORT # bug7----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15442 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Why there is no worst FDIV bug---large vs. likely
Date: 8 Dec 1994 21:20:31 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 100
Message-ID: <3c7tav$g0m@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15442 comp.sys.intel:21785

This is a brief note to highlight an important point that is getting
lost in the volume of technical traffic (to say nothing of the
nontechnical) on the severity of the FDIV bug.

There are two fundamental criteria for judging an arithmetic error,
magnitude and frequency.  (Those who follow my work on the duality of
time and information, see my .sig http, will understand the sense of
"fundamental" I mean here; physicists should identify magnitude with
time and frequency with energy, forming a conjugate pair.)

Any time you have two or more criteria for judging something, it
becomes possible to have no worst case.

Magnitude.  The worst FDIV bug with regard to magnitude is Tim Coe's
pair 4195835/3145727, for which the Pentium gets 1.333739 instead of
the correct 1.33382045.  The relative error here is .999 times 2^-14.
This is the largest error observed to date, and 2^-14 may well be the
maximum possible error for this bug.

Frequency.  The rate at which errors are encountered, which for this
bug is extremely dependent on the application, has to do with the
number and distribution of occurrences of the bug in operand space.
Thus one would not a priori expect that it made sense to talk about any
single bug as relevant to frequency.

Nevertheless the single pair 4.999999/14.999999, more memorable as 5/15
with a millionth shaved off each operand, yielding a 2^-16 error, does
tell us something.  It does not tell us about the total number of
bugs---after all, without further information it could be the only
bug.  Rather, it tells us something about the likelihood of
encountering an FDIV bug.

Suppose the 5/15 pair and the Coe pair were the only two bugs.  While
Tim's pair hurts more four times as badly as mine, I think I can safely
leave it to the reader to dream up plausible applications where my pair
is encountered at least four times as often as Tim's, e.g. data
obtained from a data logging device that is using an analog-to-digital
converter and measuring numbers that for some reason concentrate around
integers, or obtaining data from a decimal calculator that retains only
six digits after the point.

An abstract way of putting this is to say that 4.999999/14.999999 has
low *Kolmogorow complexity*.  The Kolmogorow complexity of any finite
bit pattern is the size in bits of the smallest Turing machine that,
started on a blank tape, writes down that pattern.  Even though
4.999999/14.999999 is as long as 4195835/3145727, it should have lower
Kolmogorow complexity on a non-Pentium.  (But moving the Coe pair to a
Pentium decreases its Kolmogorow complexity in principle if not in
practice because the Pentium can describe it as that pair of odd
integers x,y maximizing x-(x/y)*y; Kolmogorow complexity ignores
running time.  In practice today's architectures trade things off to
represent pairs of numbers rather more compactly than the above
program, but this need not hold for all architectures.)

This is the sense in which there is no worst pair.  Instead there are
two worst pairs, one unambiguously demonstrating how large the error
can get, the other somewhat smaller in magnitude but more likely to be
encountered in practice, depending heavily on the application.

This "two" is "up to isomorphism" as they say in algebra.  The Coe pair
has many siblings that do the same job: just scale either operand by a
pwer of two.  Likewise my pair has many siblings, which however are
explicitly *not* obtainable by scaling as I pointed out earlier, rather
they are those bugs of similar structural simplicity enumerated in my
table posted earlier.  5/15 is simply the most appealing (to me) of the
800 or so small (operands < 1000) fractions that are problematic in
this sense, making them at least cousins to 5/15.  Of these, 26 are
siblings in that they have relative errors of at least 10^-5.

Had all the large bugs been exceedingly unlikely, and had all the
likely bugs all been of very low magnitude, I would not dispute so
strenuously the emerging (this week) industry consensus that the FDIV
bug is not serious other than politically for the computer supply
side.  This is however not the case: the 5/15 bug shows that customers
can experience fairly large errors, a quarter of the largest possible,
in fairly likely numbers.

It is important to bring this consideration to the attention of
industry leaders, who in this week's news have been downplaying the
significance of the bug.  Until it is brought to their attention, one
cannot accuse them of *deliberately* putting their own concerns ahead
of their customers' by burying their collective heads in the sand in
this way.  The strongest accusation possible is that they are doing
this but unintentionally and without malice.

There is a chain of responsibility here.  A whole industry does not
listen to one person; rather the business community looks to the
technical community as a whole for advice on the matter.

I therefore call on the academic community to reach a meeting of the
minds on the severity of the problem so that it can present a united
front to industry on this matter.  One way *you* can help here is to
forward this message, not (necessarily) to management and the media,
but to those of your technical colleagues who are capable of assessing
the technical merits of the above arguments but who lack the time
required to read Usenet, particularly those newsgroups carrying the
highly contagious FDIV bug, which having infected the Pentium is now
infecting many news groups.
-- 
Vaughan Pratt			http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug7.5----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 21229 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Intel Bug Fix Delay for Good Reason
Date: 7 Dec 1994 05:59:09 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 34
Message-ID: <3c3ivd$q71@Radon.Stanford.EDU>
References: <3atfbj$rd5@jabba.cybernetics.net> <3bcm6oINN37ml@rs1.rrz.Uni-Koeln.DE> <KAUL.94Nov29113923@tchebysheff.btv.ibm.com> <D0D13y.Aoy@gasco.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <D0D13y.Aoy@gasco.com>, Ron Christian x5545 <roc@gasco.com> wrote:
>When I first heard about the bug, I thought it would be more a spin issue
>than a functional one.  But looking at the most common bug test posted here
>has convinced me otherwise.  Lots of trivial algorithms have a point where
>the value of A moves through the value of B in equations such as "C-(C/A)*B",
>and the Pentium introduces a significant disparity at the point where A equals
>B.  (256 does not equal 0.)  You don't have to be doing intricate scientific
>research to notice such a thing.  The first thing that ocurred to me was that
>this bug could throw my wife's freight claims off by a significant amount.

Sorry but this logic won't wash.

First, without knowing more about C and B, the odds of your wife
hitting an FDIV error in this scenario are not worth worrying about.
The figure of 27,000 years is not applicable because those people
limited to calculating only 1000 FDIV's per day put themselves at a
significant disadvantage to those of us earning more than 2 cents an
hour.  An error a month or a day might be more like it. but even so,
with relative errors of size 10^-5 your wife will never know what hit
her (in the good sense :).

Second, in talking about A as the one variable in the above (in the
sense that you seem to be holding B and C constant), you have a
hyperbola that passes through the origin.  Now your argument that an
FDIV error in C/A will cause an enormous *relative* error at A=B is
irrelevant without further information because we have no basis for
attaching significance to this relative error, which is calculated by
dividing the noise (error) found at points along your hyperbola by the
distance of those points from the origin.  In order to justify this
reasoning you first have to attach some significance to distance from
the origin.  An example of what I mean is 0 centigrade; that water
freezes there is irrelevant when there's no water nearby.
-- 
Vaughan Pratt			http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug8----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15446 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: Why there is no worst FDIV bug---large vs. likely
Date: 9 Dec 1994 02:28:07 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 29
Message-ID: <3c8fbn$mi0@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15446 comp.sys.intel:21849

====================================================================
FOR EASE OF REFERENCE I HAVE MADE MY FDIV POSTS AVAILABLE BY FTP AT
boole.stanford.edu:/pub/PENTIUM
====================================================================

In article <3c7tav$g0m@Radon.Stanford.EDU>,
Vaughan R. Pratt <pratt@Sunburn.Stanford.EDU> wrote:
>4.999999/14.999999 is as long as 4195835/3145727, it should have lower
>Kolmogorow complexity on a non-Pentium.

Oops, off by one: at 15 digits the first pair is a digit longer than
the second. :)

This leads into a natural question: what's the shortest integer
manifestation of the bug, or is Coe's pair it?  The answer is that
there are shorter pairs.  I went looking for 13-digit pairs just now
and came up with several, including 41811115/49151, which is 850.6666
but on the Pentium is 850.6640, an error of 3 parts per million.

I would be extremely interested in seeing a 12-digit pair.

This apparent difficulty of finding integer pairs with fewer than 13
digits is in contrast to the number of 3-digit integer pairs (5/15,
9/54) for the variant of division in which one first subtracts a
millionth from each operand.  Two such pairs might not seem like a lot,
but it is a high proportion of the fewer than 2000 candidates.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	      		"Eureka" --  Greek for "Oops, forgot my bathrobe again"


=================================================================
-------------------------PENTIUM REPORT # bug9----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: Why there is no worst FDIV bug---large vs. likely
Summary: 
Expires: 
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c8fbn$mi0@Radon.Stanford.EDU>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

====================================================================
For ease of reference i have made my FDIV posts available by ftp as
boole.stanford.edu:/pub/FDIV/bug*
====================================================================

In article <3c8fbn$mi0@Radon.Stanford.EDU>,
Vaughan R. Pratt <pratt@Sunburn.Stanford.EDU> wrote:
>This apparent difficulty of finding integer pairs with fewer than 13
>digits is in contrast to the number of 3-digit integer pairs (5/15,
>9/54) for the variant of division in which one first subtracts a
>millionth from each operand.  Two such pairs might not seem like a lot,
>but it is a high proportion of the fewer than 2000 candidates.

I'm starting to sound like Intel :), that should have been 6 out of 2000:

      relative error
5/15   1.2207e-05
5/30   1.2207e-05
5/60   1.2207e-05
7/48   8.71931e-06
7/96   8.71931e-06
9/54   1.35634e-05

This is a complete list.  All buggy three digit pairs have large
relative errors as the list indicates, there are none with relative
errors in the range 10^-6 to 10^-18, and none are of the form dd/d.

I'm now looking at the error rate when you perturb integers by
uniformly distributed small amounts, as opposed to by exactly .000001.
This corresponds to a situation where the values are random integers
being sampled in a randomly noisy environment.   A preliminary look
indicates that varying the perturbation randomly from one division to
the next is not much different from the fixed .000001 truncation.  I
will shortly post a 3D table showing error rate as a function of
maximum size integer, maximum perturbation of those integers, and
tolerance (minimum resulting error needed in order to register a bug
occurrence), along with the program that generated the table.
=================================================================
-------------------------PENTIUM REPORT # bug10----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.arch
Subject: Re: Try this one on the Pentium
Summary: 
Expires: 
References: <rlawD0ItH2.5ow@netcom.com>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

In article <rlawD0ItH2.5ow@netcom.com>, reginald law <rlaw@netcom.com> wrote:
>1048958.75/2.9999991
>
>On the 486 you get 349653.021xxx
>
>On a FDIV Pentium you get 349631.688xx
>
>Looks like a huge error to me.

Yes, no one has seen a bigger error than this to date.  On the other
hand your particular example has been known for some time now.  If you
multiply your numerator by 2^2 and denominator by 2^20 you get
essentially the Coe pair, 4195835/3145727, whose relative error (the
only error measure that counts here) is .9996*2^-14, aka 6e-5.

The thing about these FDIV bugs is, once you've found that x/y is a
bug, you know right away that 2x/y and x/2y are bugs as well, and more
generally when you multiply or divide either operand of a buggy pair by
any power of 2 you get another bug.  Note that this rule does not apply
to my bruised integers (integers where you subtract exactly .000001)
because the bruise does not scale with the integer.

The Coe pair showed up in yesterday's San Jose Mercury News on page 2G,
just below W.J. Sander's statement that "To me, it's a disclosure
issue, not a technical issue."  Jerry Sanders is of course CEO of a
determined Intel competitor, AMD, but his bias notwithstanding he's hit
the nail on the head (as have 4195835 contributors to comp.sys.intel).

Actually I would say it was even more an educational issue than a
disclosure issue---it is apparently not at all obvious to a lot of
people that the error rate can be high, there is a continuing
head-in-the-sand belief that the rate is too low to seriously affect
anyone.  This could not possibly be further from the truth!!!  Both the
industry and Pentium customers need to be educated as to the magnitude
and rate of errors, and the dependence of both these quantities on the
their particular application.  The tables I have been publishing,
including a new one I will post shortly, are intended to better
calibrate everyone involved in this mess on these issues.
=================================================================
-------------------------PENTIUM REPORT # bug11----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
X-ORIGINAL-NEWSGROUPS: comp.arch,sci.geo.meteorology
Newsgroups: sci.geo.meteorology
Subject: Re: Pentium bug; Real Applications
Summary: 
Expires: 
References: <863@moene.indiv.nluug.nl> <3c4uho$1bc@runner.ccr-p.ida.org> <3c6r1d$c2f@news-4.nss.udel.edu>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

In article <3c6r1d$c2f@news-4.nss.udel.edu>,
John D. McCalpin <mccalpin@perelandra.uucp> wrote:
>In article <3c4uho$1bc@runner.ccr-p.ida.org>,
>David desJardins <desj@ccr-p.ida.org> wrote:
>>
>>I'd certainly offer high odds that you could randomly change any one
>>cell value in the middle of the run by an amount much bigger than that,
>>and have no measurable effect on the final result.
>
>Whether you would win or lose depends on how long the model was run.
>You would be fairly safe for a 5-day forecast, but not for a seasonal 
>forecast. 

As the bet is stated, I think he'd win.  Any one random change is
*highly* likely to have disappeared at the end of the run.  However the
bet has little bearing on the Pentium issue since its wording is a bad
match to the issue.  The problem with the Pentium is not with single
random changes, the errors can come at *far* higher rates than people
are allowing.

If desJardins wants to bet on the situation where random changes are
made to the cells at a rate during the computation that reflects the
Pentium error rate, he needs to add some conditions to his bet that
rule out the scenarios where the rates approach a 1e-5 error every 10
milliseconds, namely running into a lot of slightly noisy integers.
=================================================================
-------------------------PENTIUM REPORT # bug12----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.ibm.pc.hardware.systems,comp.sys.ibm.pc.hardware.misc,comp.sys.intel,sci.engr,sci.engr.civil
Subject: TECHNICAL: Table of FDIV error rates, case of noisy integers
Summary: 
Expires: 
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c8fbn$mi0@Radon.Stanford.EDU>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

====================================================================
For ease of reference i have made my FDIV posts available by ftp at
boole.stanford.edu:/pub/FDIV/bug*
====================================================================

The following table gives empirically determined estimates of the
expected number of errors per million FDIV operations over a wide range
of operating conditions, for the case of *noisy integers*.  Executive
summary: under moderately restrictive but plausible conditions one may
encounter relative errors of magnitude 10^-5 at rates of one per
100,000 FDIV's, and errors of magnitude 10^-6 to 10^-7 at ten times
that rate and for less restrictive conditions to boot.

On Dec. 3 I posted similar tables for "bruised" integers, defined as
integers less some fixed amount, e.g. 0.001 or 0.000001, tabulating the
actual expected rate however rather than an empirically determined
estimate thereof.  Whereas a "bruise" in my sense is of fixed size,
noise is of constantly varying size, and a noisy integer therefore
differs from a bruised integer in that the amount by which the integer
is perturbed varies at random from one integer to the next.  Hence
generating noisy integers requires a random number generator, unlike
bruised integers.

The following three parameters are varied.

1.  Tolerance.  This measures the tolerated relative error magnitude:
an FDIV error is registered when its relative error exceeds the
tolerance.  The tabulated tolerances range from 10^-5 to 10^-11
inclusive, stepping by 10^-2 (so four tolerances, 5,7,9,11)

2.  Noise.  This measures, as a negative power of ten, the maximum
amount by which the integer can be increased or decreased.  Noise 3 for
example means that each integer is incremented by a random (uniformly
distributed) number in the interval [-0.001,0.001].  Noise 0
corresponds to completely noninteger data, since the perturbation can
be anything in the intervals to the next integer on either side.  Noise
1 uses only 1/5 of each interval between two consecutive integers (1/10
in from each endpoint), noise 2 only 1/50, and so on.  The tabulated
noise levels range from 10^0 = 1 to 10^-13.

3.  Maxint.  This gives the maximum permissible integer for FDIV
operands.  Integers are chosen with equal probability from the interval
[1,maxint].  The tabulated values of maxint are the positive powers of
ten up to 100,000 (five powers).

Here now are the number of FDIV errors encountered in a million random
samples, for each of the above combinations of (tol,noise,maxint).


================
Tolerance 1e-05
-------
     Noise:  0    1    2    3    4    5    6    7    8    9   10   11   12   13
-------------------------------------------------------------------------------
max  10:     0    0    0    0    0    0    0    0    0    0    0    0    0    0
int 100:     0    0    0    0    8   79   43    0    0    0    0    0    0    0
   1000:     0    0    0    0    2    2    3    0    0    0    0    0    0    0
  10000:     0    0    0    1    0    0    0    0    0    0    0    0    0    0
 100000:     0    0    0    0    0    0    0    0    0    0    0    0    0    0
================
Tolerance 1e-07
-------
     Noise:  0    1    2    3    4    5    6    7    8    9   10   11   12   13
-------------------------------------------------------------------------------
max  10:     0    0    0    0    2   61  517 1169 1181    0    0    0    0    0
int 100:     0    0    0    0   24  450 1391 1440  393    3    0    0    0    0
   1000:     0    0    0    1   22   63  104   41    8    1    0    0    0    0
  10000:     0    0    0    0    1    3    3    2    0    0    0    0    0    0
 100000:     0    0    0    0    0    0    0    0    0    0    0    0    0    0
================
Tolerance 1e-09
-------
     Noise:  0    1    2    3    4    5    6    7    8    9   10   11   12   13
-------------------------------------------------------------------------------
max  10:     0    0    0    0    3   77  571 1379 1535 1520 1423  920    0    0
int 100:     0    0    0    1   28  450 1517 2131 2156 2024 1309  268    0    0
   1000:     0    0    0    0   24  118  201  275  223  111   57    1    0    0
  10000:     0    0    0    0    2    6   21   14    8    1    0    0    0    0
 100000:     0    0    0    0    0    1    0    0    0    0    0    0    0    0
================
Tolerance 1e-11
-------
     Noise:  0    1    2    3    4    5    6    7    8    9   10   11   12   13
-------------------------------------------------------------------------------
max  10:     0    0    0    0    4   71  587 1427 1748 1694 1614 1571 1556 1368
int 100:     0    0    0    1   36  506 1584 2167 2297 2277 2272 2162 2140 1333
   1000:     0    0    0    2   21  100  182  280  275  274  223  224  137   24
  10000:     0    0    0    1    8    8   18   34   19   29   22    8    5    0
 100000:     0    0    0    0    0    1    2    0    0    0    0    0    0    0

==========
REMARKS.

1.  Noise.  The noise must get down to .001 (noise 3) in order to
notice anything at all in a million FDIV's.  At this noise level, one
in a thousand perturbations will be within .000001 of the integer, and
these are the most likely causes of error reports.  For large errors
(tolerance 10^-5) the level of noise generating the most errors is
around 10^-5 (noise 5).  Very small noises do not trigger very large
errors, but they trigger lots of small errors.  Over a wide range of
noise levels and tolerances, the Pentium is capable of returning errors
one or two orders of magnitude greater than the injected noise.

2.  Maxint.  The highest rates happen for the smallest integers.  As
the typical integer size increases, the character of this increasing
quantity gradually shades over from integer to real, so that at typical
size 100,000 errors occur negligibly often.  Being a large integer is
not much different from having noise 0: in either case any bit pattern
can happen.


CONCLUSIONS.  If for any reason your data contains a significant
proportion of slightly perturbed integers, you are at extraordinary
risk of a 10^-5 relative error by all reasonable computational
standards.  An extreme case in this respect is for tolerance 10^-5,
noise 10^-5, maxint 100.  The table entry of 79 here means that if you
first pick two random integers in the range 1 to 100, then for each one
add to it a random amount chosen from the range [-0.00001,0.00001], and
finally divide the first perturbed integer by the second, you have a
chance of just under one in 10,000 of getting an error larger than one
in 100,000.  (As it happens, the error if it occurs will almost
certainly be almost exactly one in 65,000, with the occasonial
occurrence of one in 16,000 errors.)

For three-digit operands at this tolerance and noise level the error
rate drops off sharply, but is still dangerously high.  At four digits
things calm down considerably; an isolated 10^-5 error was seen at
noise 10^-3, but this seems to have been just an unlucky chance.
Relatively few applications should be put seriously at risk by these
much lower rates, with the caveat of course that any algorithm might
just happen to be one of those unlucky few.

At tolerance 10^-7 (catering for all errors in the range from 10^-5 to
10^-7) the error rate rises sharply across the board, reaching a rate
that does not increase significantly thereafter as the tolerance
further decreases to 10^-11.  Here for any size operand up to 1000, the
error rate is extremely high for a range of noise levels over four
orders of magnitude, from 10^-4 to 10^-8.  While slide rule users will
not be affected by these 10^-6 and 10^-7 errors, the tendency of
today's complex algorithms to magnify errors as they propagate can
easily blow up these ostensibly minor errors into large errors.

BOTTOM LINE.  The Pentium may well give many or most users little
trouble.  However in our demonstration above that certain reasonably
general types of application are seriously at risk, the complexity of
the governing conditions reveals the great difficulty of deciding
whether any given application has hidden deep inside it circumstances
capable of evoking the full strength of the bug.  The decision whether
or not to use the Pentium in any application that touches the floating
point unit can then become as complex as the problem solved by the
application itself.

==========

Here is the program that generated this table (less some of the
headings):


#include <stdlib.h>
#include <math.h>
#include <stdio.h>

main(argc, argv)
	char *argv[];
{
	int b, i, cnt, maxint, samplesize=1e6;
	double num, den, tol, noise, randnoisyint();
	for (tol = 1e-5; tol >= 1e-11; tol *= .01) {
		printf("================\nTolerance %g\n-------\n", tol);
		for (maxint = 10; maxint <= 1e5; maxint *= 10) {
			printf("%7d: ", maxint), fflush(stdout);
			for (noise = .999, b = 0; b < 14; noise *= .1, b++) {
				for (i = 0, cnt = 0; i < samplesize; i++) {
					num = randnoisyint(maxint, noise);
					den = randnoisyint(maxint, noise);
					cnt += (((num-(num/den)*den)/num)>tol);
				}
				printf("%5d", cnt), fflush(stdout);
			}
			printf("\n");
		}
	}
}

/* Return random integer in [1,maxint], perturbed by [-1,1]*noise */
double
randnoisyint(maxint, noise)
	int maxint;
	double noise;
{
	return 1.0 + ((rand()&0x7fffffff)%maxint) + (2*drand48()-1)*noise;
}
=================================================================
-------------------------PENTIUM REPORT # bug13----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15511 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: Why there is no worst FDIV bug---large vs. likely
Date: 10 Dec 1994 07:34:21 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 75
Message-ID: <3cbllt$nqi@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU> <D0KEtK.LnC@bocanews.bocaraton.ibm.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15511 comp.sys.intel:22160

In article <D0KEtK.LnC@bocanews.bocaraton.ibm.com>,
Sivasankar Chander <siva@bond.bocaraton.ibm.com> wrote:
>Vaughan R. Pratt (pratt@Sunburn.Stanford.EDU) wrote:
>: ... Even though
>: 4.999999/14.999999 is as long as 4195835/3145727, it should have lower
>: Kolmogorow complexity on a non-Pentium.
>
>  Actually, Tim Coe found many pairs of the form xx.999999/yy.9999999 also,
>but apparently not your 4.99../14.99.. case. However, the *Kolmogorow
>complexity* of most of Tim's divisors are much lower than yours. Just look
>at the hexadecimal representation of the mantissa.

Quite right, I should have gone into this.  A common integer divisor
for these bugs is one less than 3 times an odd number times a power of
2.  Taking the odd number to be 1 presumably minimizes the Kolmogorow
complexity *of the divisor*.

However I was counting the numerator as part of the pair, and about
half of it looks pretty random to me, driving up the total Kolmogorow
complexity.

Our respective subjective estimates of the Kolmogorow complexity of
these examples suffer from being just that: at these very small sizes
Kolmogorow complexity is a pretty architecture-dependent measure, one
has to go out further along the asymptote before it becomes a
meaningful basis for comparisons.

>  Given that the error rate for uniformly distributed random operands is
>of the order of 1e-9, my *expectation* of the smallest decimally represented
>pair that fails is of the order of 9 digits (for both x and y). So I sense
>that at least one *nice* pair with as little as 5 digits in the dividend
>and 4 in the divisor (or (6,3) or (4,5) etc.) is eluding us. Note
>that it could be of the form .xxxxxx / y.yy etc.

There are getting to be various little languages here, in each of which
one can ask for small and/or simple examples.  The basic language is
just decimal digits and /, in which the Coe pair 4195835/3145727 is
stated.  You are willing to allow the decimal point as well in your
language.  My bruised-integer examples assumes a language in which 7/28
denotes 6.999999/27.999999.

All these languages except the basic one add questionable artifacts
that undermine to some extent the significance of their succinctness.
This takes much of the fun out of playing the shortest-example game for
all but the basic language.

Apropos of which, I recently pointed out that the Coe pair
4195835/3145727 had 14 digits, improved it to 13 digits with
41811115/49151, and asked for a buggy pair of integers with a total of
12 digits.  James Shearer pointed out to me that Andreas Kaiser had
already some time ago posted 1/3221224323, having only 11 digits.  I
hadn't thought to look in that region, but when I did I found it
contained quite a few 10-digit examples, including

7/402651871
7/805303742
57/50331637
383/3145725
383/6291450
861/4718589
861/9437178

A half hour search turned up no 9-digit examples,

With the greater expressiveness of your language, which allows a
decimal at an arbitrary location in each operand, your conjecture that
there should be a 9-digit example becomes reasonably plausible.  For
what it's worth, I ran an exhaustive search for two four-digit operands
each with a decimal point anywhere and turned up nothing.  Is 9 digits
with two decimal points better than 10 digits with no point?  Yet
another of life's imponderable little mysteries.  For this pointless
exercise I think I prefer the pointless language
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	      		"Eureka" --  Greek for "Oops, forgot my bathrobe again"


=================================================================
-------------------------PENTIUM REPORT # bug14----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15522 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: Why there is no worst FDIV bug---large vs. likely
Date: 10 Dec 1994 20:48:18 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 28
Message-ID: <3cd46i$ara@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c9tii$1tr@duck.ftn.us-ny.citicorp.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15522 comp.sys.intel:22251

In article <3c9tii$1tr@duck.ftn.us-ny.citicorp.com>,
Renato Ghica <ghica@fig.citib.com> wrote:
>duh..I'm no math genius, but isn't an error where the result is
>	256 instead of 0 a *MUCH* larger magnitude than 2^-14 ? ;-)

Only as an absolute error.  In situations like this absolute errors
(simple differences) are a less meaningful measure of accuracy than
relative errors, where one divides the absolute error by the correct
amount.

Suppose for the sake of argument you weigh a flea and get 10 gms
instead of 1 gm, and I weigh the puppy the flea is on and get 1100 gms
instead of 1000 gms.  Which of us would you say has taken more care to
make an accurate measurement?

Our respective absolute errors are 9 gms and 900 gms, suggesting you
were more careful.  Our respective relative errors however are 9 and
0.1.  Examples like this demonstrate that relative error agrees with
our intuitions about the notion of accuracy in measurement better than
does absolute error.

I would not have thought to even mention any of this were it not for
the remarkable number of posts based on an apparent belief in the
superiority of absolute error over relative even in these
applications.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	      		"Eureka" --  Greek for "Oops, forgot my bathrobe again"


=================================================================
-------------------------PENTIUM REPORT # bug15----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15535 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel,sci.math.num-analysis
Subject: TECHNICAL (Pentium): Distribution of problem divisors
Date: 11 Dec 1994 04:16:43 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 116
Distribution: inet
Message-ID: <3cdufb$r9u@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15535 comp.sys.intel:22326 sci.math.num-analysis:10047

In order to get a better grasp of the FDIV error rates, I sampled the
distribution of the problem divisors, assuming a random numerator
uniformly distributed in [0,1].  This assumption probably masks some
patterns that might otherwise leap out and offer an easier way of
estimating those probabilities.  Even with this masking however,
certain patterns were still visible.

The pentads as I've been calling them are the five integer multiples of
3 between 18 and 30 inclusive, namely 18,21,24,27,30, together with all
reals obtainable from these by repeated doubling or halving.  (What I
was previously calling a pentad I'll now call an integer pentad, ruling
out the pentad 4.5 = 18/4 for example.)  The pentads are sufficiently
uniformly distributed along the positive real line that every half-open
interval [x,2x) contains exactly five pentads.

Tim Coe's table of problem FDIV divisors indicates that each lies in
some interval [kp,p] where p is a pentad and k<1 is very close to 1.
Certainly k>.99978, and for most cases k > .999995.

For the pentad p = 7680 = 30*2^8 (as a convenient representative of all
pentads that are multiples of 30) I have looked at 10 million FDIV
errors arising from randomly chosen divisions and used how they tended
to cluster to divide up the interval [7679,7680] into 16 regions.  The
following table shows, as a function of the divisor d in the erroneous
division, what proportion of those ten million errors lies in each
range of possible values of p-d.  For example if 12345/7679.997 is in
error then p-d = 7678-7679.997 = .003, putting this error in the first
interval .0000-.0069.  For 7680-.0078 < d < 7680-.0069, errors go in
the second interval, so on.  Note that the proportion is for the whole
interval; dividing it by the width of the interval yields the average
error *density* in the interval (left to the reader).

The numerators used to generate this table were generated at random,
uniformly from the interval [0,1].  I have not attempted to correlate
the dependence of this distribution on the choice of numerator.

    p-d	       proportion of errors in that interval
---------
.0000-.0069	.8989954	
.0069-.0078	.0000454	
.0078-.0084	.0026463	
.0084-.0097	.0000260	
.0097-.0102	.0001228	
.0102-.0156	.0000846	
.0156-.0189	.0949042	
.0189-.0195	.0000010	
.0195-.0201	.0027801	
.0201-.0214	.0000120	
.0214-.0218	.0001364	
.0218-.0312	.0000082	
.0312-.0321	.0001803	
.0321-.0331	.0000002	
.0331-.0335	.0000180	
.0335-.1000	.0000391

Sorting this table by error density yields the following ranking of
intervals by danger level.

.0000-.0069	.8989954	
.0156-.0189	.0949042	
.0195-.0201	.0027801	
.0078-.0084	.0026463	
.0312-.0321	.0001803	
.0214-.0218	.0001364	
.0097-.0102	.0001228	
.0102-.0156	.0000846	
.0069-.0078	.0000454	
.0335-.1000	.0000391	
.0084-.0097	.0000260	
.0331-.0335	.0000180	
.0201-.0214	.0000120	
.0218-.0312	.0000082	
.0189-.0195	.0000010	
.0321-.0331	.0000002

So nearly 90% of the errors occur in the interval 7679.9937 to 7680,
while another 9.5% occur in the interval 7679.9811 to 7679.9844, 0.28%
in 7679.9799 to 7679.9805, and 0.26% in 7679.9916 to 7679.9922.  This
accounts for all but 0.067% sprinkled over the other intervals.

Within the first interval the distribution is not flat but seems to be
piecewise continuous (meaning an occasional discontinuity, i.e. sharp
jumps, typically downward) to within the resolution I was using.  At
p-d=0 it starts out nonzero and rises steadily with small wobbles to
reach twice the starting rate at p-d = .002, then drops abruptly to one
third of that rate for some time thereafter.  It then rolls off
somewhat unsteadily down to 0 very briefly, goes up a bit to wobble
around some more, then abruptly collapses to almost zero at .0069 where
it stays until .0078.  All this would be more visualizable if plotted.

The binary representation of the boundaries of these intervals is
significant.  The typical though not universal pattern as p-d increases
is that the likelihood of an error increases in general, with sudden
downward jumps whenever there is a carry, with bigger jumps for bigger
carries, particularly when the carry yields a power of two.

I have not had time to repeat this in full for the other four basic
pentads, which at the corresponding scaling are those shown in the left
column of the following table.  However I have looked briefly at them,
again with a random numerator in [0,1], and found the following
*number* of errors encountered during a 15-second random search in the
specified interval, for p-d as shown.

     Pentad p      Number of errors in 15 seconds for p-d in the range:
	              .00 --- .01 --- .02 ---.03
  3*256*6  = 4608          45     21       1
  3*256*7  = 5376          11      9       0
  3*256*8  = 6144          33     13       0
  3*256*9  = 6912          13      5       0
  3*256*10 = 7680          17     11       0

This shows that for all five of these pentads p, most of the errors
happen for p-d in [.000,.020].
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	      		"Eureka" --  Greek for "Oops, forgot my bathrobe again"


=================================================================
-------------------------PENTIUM REPORT # bug16----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15537 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel,sci.math.num-analysis
Subject: TECHNICAL (Pentium): Bad section of the noisy-integer table
Date: 11 Dec 1994 09:01:34 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 59
Distribution: inet
Message-ID: <3cef5e$4vl@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c8fbn$mi0@Radon.Stanford.EDU> <3c978o$r25@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15537 comp.sys.intel:22365 sci.math.num-analysis:10049

====================================================================
For ease of reference i have made my FDIV posts available by ftp at
boole.stanford.edu:/pub/FDIV/bug*
====================================================================

Here is a particularly bad section of the noisy-integer table I
recently posted, organized as a self-contained and reasonably short
account for ease of forwarding and posting elsewhere.  This section is
so bad that it pretty much subsumes (and looks rather grimmer than) the
previous tables I've posted.

============================================

Think of two real numbers x and y in the range 1 to 100, each at most
1/100,000 less than an integer.   For example x might be 35.9999947 and
y 53.9999982.  The possible choices are assumed equally likely.

Now calculate x/y on the Pentium.  If the difference between the
Pentium's answer and the correct value of x/y is more than 1/100,000 of
the correct value, declare the Pentium to be wrong.

For this situation the Pentium will be wrong with the following
probabilities.

1.  In the general case, the probability that Pentium will be wrong
is one in 3000.

2.  In special cases, namely when the two integers that x and y are
near appear on the following list, the Pentium will be wrong at least
300 times more often than for the general case.  The probability it
will be wrong for each such special case is shown to the right of the
pair.  In all of these cases the probability of being wrong is at least
one in ten.

36/54	0.23		40/60	0.21		10/60	0.20
18/54	0.19		10/30	0.19		9/54	0.18
9/27	0.16		20/60	0.15		18/27	0.14
15/36	0.14		36/27	0.12		20/30	0.12
15/72	0.11		10/15	0.10		5/60	0.10

3.  In order for the Pentium to be wrong, x and y must be near one of
the above 15 pairs or one of the following 34 pairs.  For the latter,
the probability that the Pentium will be wrong ranges from 0.096 down
to 0.001.

40/30  5/30 20/15 5/15 33/72 15/18 40/15 44/96 72/54 33/36 22/96
22/48 80/60 15/9 44/48 22/24 72/27 11/48 11/96 80/30 44/24 11/24
33/18 44/12 22/6 22/12 11/6 80/15 33/9 44/6 11/12 11/3 44/3 22/3

4.  If the Pentium is judged wrong when x and y are each at most
1/1,000,000 less than an integer instead of 1/100,000, and the error is
at least 1/1,000,000 instead of at least 1/100,000, then the Pentium
will now be wrong with probability one in 730, and an additional 181
pairs besides the above can give the wrong answer.  In this case 60/72
is the most likely source of a wrong answer, with associated
probability 0.34.  This brings to 230 the total number of possibly
wrong pairs, out of the 10,000 pairs of integers from 1 to 100.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	      		"Eureka" --  Greek for "Oops, forgot my bathrobe again"


=================================================================
-------------------------PENTIUM REPORT # bug17----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15547 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel,sci.math.num-analysis
Subject: Re: TECHNICAL (Pentium): Bad section of the noisy-integer table
Date: 11 Dec 1994 17:14:20 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 33
Distribution: inet
Message-ID: <3cfc1c$ihs@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c8fbn$mi0@Radon.Stanford.EDU> <3c978o$r25@Radon.Stanford.EDU> <3cef5e$4vl@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15547 comp.sys.intel:22413 sci.math.num-analysis:10053

In article <3cef5e$4vl@Radon.Stanford.EDU>,
Vaughan R. Pratt <pratt@Sunburn.Stanford.EDU> wrote:
>
>4.  If the Pentium is judged wrong when the error is at least
                                        ^
>1/1,000,000 instead of at least 1/100,000, then the Pentium will now be
>wrong with probability one in 730, and an additional 181 pairs besides

Insert: "x and y are at most 1/1,000,000 less than an integer and"
(I forgot that my program changes these simultaneously, as a heuristic
for keeping down the number of parameters in this simplification of the
more comprehensive picture given in my first post, /pub/FDIV/bug1 on
boole.stanford.edu.)

Apropos of this, it is well worth bearing in mind the following useful
empirically determined rule of thumb about truncated integers, which
can be read off the table in my first post (bug1).

RULE: For a truncated integer divisor, the Pentium's relative error is
at most about two orders of magnitude greater than the amount of the
truncation.

Please let me know of any counterexamples.

The significance of this is that for extremely small truncations, the
associated errors will be negligibly small for all but extremely
sensitive applications.  In particular truncations down near the last
two or three bits of double precision significance should affect no one
at all, taking into account that these effects are further diluted by
the probability of any error at all being much less than 1.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	      		"Eureka" --  Greek for "Oops, forgot my bathrobe again"


=================================================================
-------------------------PENTIUM REPORT # bug18----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Re: Pentium bug and small numbers?
Summary: 
Expires: 
References: <3cdqa4$rce@csc.mc.edu>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

In article <3cdqa4$rce@csc.mc.edu>,
Stephen Wilhelm <swilhelm@csc.mc.edu> wrote:
>I have seen plenty of examples of numbers that causes the Pentium to
>give wrong results.  Usually, they are 4195835.0 and 3145727.0, or
>some multiple of them.  Both are pretty big numbers.  I recently heard
>a rumor that the FDIV bug also affects some small numbers, but I am
>very suspicious of this claim because I have neither seen nor heard of
>any examples of this claim.

Close.  The so-called "small" numbers are divisions like 5/15, which of
themselves cause no problem.  But if you truncate them for any of
various reasons, e.g. limited decimal precision of externally obtained
data, then the bug hits you.  For 5/15, truncating as
4.999999/14.999999 yields an error in the 16-th significant bit,
i.e. > 10^-5.

About 800 out of the 10,000 possible two-digit divisions are at risk
here, of which 230 are at risk of a 10^-6 error, and 49 at risk of a
10^-5 error.  For a quick tour of this situation see my post yesterday
on this, also available as boole.stanford.edu:/pub/FDIV/bug16.  For
mind-numbingly more detailed tables see bug1 treating truncated
integers of the above kind, and bug12, "noisy" integers with zero-mean
rectangularly distributed noise.

The shortest "honest" examples I know of, no truncation tricks or
decimal points anywhere, are given in bug13, namely the 10-digit
examples 7/402651871, 7/805303742, 57/50331637, 383/3145725,
383/6291450, 861/4718589, and 861/9437178.  It is open whether there
are any 9-digit examples, and still open even if one allows decimal
points (not counting the points themselves).  Truncation (what I was
previously calling bruising but that term though catchy is less clear
and standard than truncation) is the only way proposed so far for
getting to the really small examples.
=================================================================
-------------------------PENTIUM REPORT # bug19----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15585 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel,sci.math.num-analysis
Subject: TECHNICAL (Pentium): Exact number of FDIV error-pair FLOATs (was: Bad section of the noisy-integer table)
Date: 12 Dec 1994 18:27:59 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 142
Distribution: inet
Message-ID: <3ci4nf$ar2@Radon.Stanford.EDU>
References: <3cef5e$4vl@Radon.Stanford.EDU> <872@moene.indiv.nluug.nl>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15585 comp.sys.intel:22652 sci.math.num-analysis:10072

Summary:  A search of the 2^46 pairs of single-precision mantissas
turned up 1281 erroneous FDIV pairs x/y, under the error criterion that
x - (x/y)*y is at least 3 for an exponent that assigns unit weight to
the least significant bit.  The search was such that there should be
fewer than half a dozen error pairs remaining unobserved, which a
longer search will produce tomorrow.  For Toon Moene's weather
forecasting application this corresponds to an error rate of one every
four days for uniformly distributed FDIV operands, and far more for
operands with even an imperceptible tendency to cluster around
integers.

In article <872@moene.indiv.nluug.nl>,
Toon Moene <toon@moene.indiv.nluug.nl> wrote:
>In article <3cef5e$4vl@Radon.Stanford.EDU> pratt@Sunburn.Stanford.EDU  
>(Vaughan R. Pratt) writes:
>> Here is a particularly bad section of the noisy-integer table
>
>This is very interesting stuff, but for my application it would be far  
>more useful to simply have a list of dividend / divisor / Pentium-quotient  
>triples of mantissa's for the cases where the Pentium is wrong. As my  
>application uses almost exclusively 32-bit floating point arithmetic - and  
>is stable at that - I'm currently only left with the 8 (!) 32-bit dividend  
>/ divisor pairs in Tim Coe's text (The Pentium Papers entry at  
>http://www.mathworks.com). Now 8 error pairs in a sea of (2^24)^2 possible  
>mantissa pairs is a vanishingly small stick in a haystack.

This haystack contains only (2^23)^2 pairs if we stick to positive
floats, or 70 trillion.  Even so, although the Pentium is a fast
machine brute-force search (perebor as Russian computer scientists call
it) of that haystack would take 2.25 years at 100 MHz, though a labfull
of 100 Pentiums could do it in 8 days.

However Tim Coe has shown that the bad divisors lie in at most 5/4096
of the possible single-precision mantissas.  By only examining those
mantissas the search time on one Pentium is reduced to 24.0 hours
(coincidence).

Coe claims that over 90% of the bad divisors lie in the top 36/2048 of
*that* already small space, and my own observation shows (for pentad 30
at least) that over 99.9% of them lie in the top 64/2048.  Limiting the
search to that space saves a further factor of 32, bring it down to 45
minutes.

I did this.  For that region the search program below found 1281 bad
pairs.  I've started up the 24-hour search and unless the other four
pentads are very different I expect fewer than half a dozen more
pairs.  So this 1281 number is not only a lower bound but I expect will
also turn out to be a sufficiently good approximation to the exact
number for all practical purposes.

>Earlier I wrote  
>that our Numerical Weather Forecasting code does on the order of 3.5 *  
>10^9 divisions per forecast (or 4 * 365 * 3.5 * 10^9 ~ 5 * 10^12 divisions  
>per year), which looked like _something_ against an error occurrence rate  
>of 1 in 9 * 10^9 - but now it seems that for 32-bit operations the error  
>occurrence rate is only 8 * 2^-48 ~ 3 * 10^-14. That's quite something  
                                ^^
>else than 1 in 9 * 10^9 ...

This should be 46 (this is for positives only, I'm assuming that the
negatives behave essentially the same), giving an error rate of 1.14 *
10^-13 assuming only 8 erroneous pairs.

With 1281 erroneous pairs the rate jumps by a factor of 160 to 1281 *
2^-46 ~ 1.82 * 10^-11.  IF your FDIV operands do not cluster around
integers *at all* (and let me emphasize the "at all" here so you do not
come back a month later and include me with Intel in a suit alleging
failure to disclose due to some tendency towards such clustering
throwing these numbers *way* off), then your forecast has a probability
of 0.064 of containing an error.  Thus at 4 forecasts per day you will
see an error every four days on average.  Not exactly 27,000 years, but
then Intel's hypothetical FDIV-averse "average spreadsheet user" is
happy to let you tell her whether she'll need an umbrella tomorrow.

An error every four days is a butterfly wing that is very unlikely to
affect your estimate of the chance of showers.

Toon, how do you know you do that many FDIV's?  Did you instrument your
code to count them?  If so, could you conveniently instrument it to
count errors?  You should certainly see a couple of errors every week,
and any more than that would give some idea of how the nonuniformity of
your data was affecting things.  Of course it is conceivable (but
highly unlikely in my opinion) that your data *stays away from* the
problem areas, giving a lower error rate.  It would be very nice to
know for sure.  Not that I speak for Intel, but I would be somewhat
surprised if they were unwilling to underwrite the associated expenses
if that were an obstacle to doing this experiment.

Here's the program that produced the above statistic.  As a curiosity,
the only thing tipping one off to the fact that all the arithmetic is
done in floating point is the float declaration itself.  I hope the ">
2" is close enough to being the appropriate test here, if not please
spend 45 minutes running it again with whatever you believe to be the
right test.  To see what proportion of 15th-bit errors occur for
example, run it with "> 2" replaced by ">= 512."  Bear in mind that the
"first bit" is the sign bit in IEEE arithmetic, and the least
significant bit is thus the 24th bit.  In general an n-th bit error is
of size 2^(24-n) for this program, e.g. a 15th bit error is of size
2^(24-15) = 512, confirming my ">= 512" above.

main()
{
	int wrong = 0;
	float p, x, y;
	for (x = 1<<23; x < 1<<24; x++)
		for (p = 18*(1<<19); p <= 30*(1<<19); p += 3*(1<<19))
			for (y = p-1; y >= p-64; y--)
				wrong += (x - (x/y)*y > 2);
	printf("%d\n", wrong);
}

I will do the all-day run next (simply increase the -64 to -2048) and
report the exact answer tomorrow, not that anyone is likely to care
unless it uncovers an unexpected difference in one of the other four
pentads.  I have previously posted the distribution for pentad 7680 for
the case of double precision, available as /pub/FDIV/bug15 on
boole.stanford.edu, which has less than .1% of the errors outside the
region in which the 1281 errors were found.  This article will be
bug19.

This program is "embarrassingly parallelizable" as they say, and
someone with a dozen Pentiums in their lab could easily beat me to the
punch by parallelizing this program thereby getting the exact answer in
two hours instead of a day.

Let me close with two caveats.  First, the single-precision domain is
relevant for people like Toon Moene who for whatever reason work
there.  (But I would not be enormously surprised if the
single-precision error rate turned out to be a good predictor of the
double-precision Pentium error rate.)

Second, let me emphasize yet again that clustering around integers can
magnify these low error rates considerably.  In order to go from a rate
of one error per four days in your forecast to one per four minutes
requires only a very tiny bulge around integers in the distribution of
your FDIV operands, a bulge you would never notice if plotted on a
graph.  By the time the bulge became noticeable errors would be
happening more than one a second (back of the envelope, your envelope
may vary:-).
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	      		"Eureka" --  Greek for "Oops, forgot my bathrobe again"


=================================================================
-------------------------PENTIUM REPORT # bug20----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15647 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,sci.math.num-analysis
Subject: TECHNICAL (Pentium) Distribution of error MAGNITUDES
Date: 14 Dec 1994 01:42:18 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 155
Sender: pratt@cs.stanford.edu
Distribution: inet
Message-ID: <3clihq$1kt@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Keywords: Pentium, FDIV, error, magnitude
Xref: Radon.Stanford.EDU comp.sys.intel:23313 comp.arch:15647 sci.math.num-analysis:10114

Thus far I've been writing about the distribution of RATES of FDIV
errors on the Pentium, as a function of the distribution of operand
*values*, in particular how near an integer they are.  In this post
I'll address the distribution of MAGNITUDES of those errors, as a
function of operand *precisions*, in particular whether the operands
are double-precision or something less, meaning that the mantissa is
zero in the last n bits for some fixed n; n=29 is for IEEE
single-precision.  For this note I will fix n at 29 and hopefully will
have time to explore smaller (and maybe larger) n later on.

As has been pointed out and acted on long ago by both Tim Coe and Mike
Carlton (and my face is red for not being aware of this yesterday when
I did this), there are sufficiently few single precision mantissas,
namely a mere 8 million, that it becomes feasible to explicitly examine
every single-precision mantissa pair whose divisor is in one of the 5
danger regions.  There are 64 trillion pairs, 5/4096 of which have
their divisors in one of those critical regions.  This yields a space
that can be exhaustively searched by a single Pentium in 24 hours
(exactly!), and in exactly 24/p hours on p Pentiums (what the parallel
computing community calls an "embarrassingly parallelizable"
computation).  A further decrease to 1/32 of this space, taking only 45
minutes, is made possible at the cost of overlooking only about 2.5%
error pairs.  I pointed this out in my previous two posts, where I
reported a total of 1312 errors off by 3 or more for numbers scaled to
assign unit weight to the least significant bit of a single-precision
mantissa, of which 1281 were found by the 45-minute search.

The following both subsumes and improves on the 1281 number, while
greatly clarifying (at least in my mind) a number of issues about
density that I had not previously appreciated.

Running the 45-minute search, I collected all errors at *all*
magnitudes, even those that were up to 11 bits more precise than even
double precision!  (Unless I've misunderstood something, the Pentium
appears to retain 64 bits of precision throughout the evaluation of the
code resulting from the C expression (x/y)*y.  I'd love to know more
about what's going on here, can someone more knowledgable in this area
please explain?)

The quick search found 9236 such errors, which should be within 2.5% of
what the 24-hour search would produce.

I then classified these errors simply according to their exponents.
Each such class therefore represents one "octave" of precision, and the
full range of possible errors is then 63 octaves, one per bit of
precision in the mantissa plus the extra 11 bits mentioned above.  The
biggest known FDIV error is in the 15th bit (I'm calling the leading
bit of the true mantissa the 1st bit, so that an answer of 4 = 100
instead of 5 = 101 would be wrong in the 3rd bit), whence the first 14
octaves are all empty.

Starting at octave 15, the 9236 errors found by the quick search are
distributed as follows.

	15  67	22 130	29 256	36 137	43 231	50 139	57 236 
	16 105	23 272	30 138	37 226	44 153	51 238	58 146 
	17 167	24 146	31 262	38 155	45 254	52 142	59 214 
	18 148	25 242	32 148	39 236	46 143	53 222	60 160 
	19 273	26 157	33 252	40 139	47 246	54 153	61 236 
	20 131	27 272	34 148	41 223	48 142	55 239	62 139 
	21 251	28 131	35 249	42 141	49 233	56 145	63 223 

To withing 2.5%, this tells us that, of the 2^46 single-precision pairs
of mantissas, 67 of them will produce an error in the 15-th bit.  The
most-frequently cited Coe pair, 4195835/3145727, is among these because
both operands are representable exactly in single precision
(4195835 < 16777216 = 2^24, don't forget the hidden leading bit, the
true mantissa has 24 bits).

In the next octave, consisting of errors of half that magnitude, there
are 105 errors.  And so on down the line.

A very striking feature of this distribution is that, except for the
ramp-up at first three entries, it is so beautifully uniform.  The one
nonuniformity is a "high-frequency oscillation" in which each even
octave has around 240 errors while each odd octave has only 140 or so
errors.  Had we been collecting octaves two at a time (or four at a
time had we used the same method on an 1960's IBM 360) we would have
seen only a nearly perfectly flat distribution; the 2x or 4x slower
sampling would have filtered out the high frequency (with no aliasing
or beats since the "buzz" period is exactly 2 octaves).

This table is extremely useful for predicting error rates.  The
likelihood of a worst-case error is 67/2^46 or 1 in a trillion.  The
likelihood of a 2^-n error is obtained by adding up the number of
errors in the octaves up to n+1 inclusive.  (This is to within a factor
of 2, or certainly at most 4, I haven't yet analyzed the range
resulting from variation within an octave and variation of the dividend
within its octave.)  Thus a 10^-9 error is a 2^-30 error, so we add up
the octaves up to 31, getting 3148, and the rate of errors at least
this size is then 3148/2^46.

This table actually goes a little further:

64      329602833
65      46

I presume that octaves 64 and 65 together count the number of
single-precision operands for which (x/y)*y is not equal to x, for the
region searched.  Although that region closely reflects the populations
of octaves 63 and below, octave 64 would be much larger had we examined
all 2^46 pairs.  As it was the quick search examined 2^23*5*64 =
2684354560 pairs, almost exactly 8 times the population of octave 64,
which I can well believe is the theoretically correct number of (x/y)*y
!= x cases.  In which case, well done, Intel, too bad about the missing
entries.

I would love to know how other products stack up against this, and to
this end I append my code generating the above table so people can try
it out on their favorite machine to see just what it does at these very
high precisions.  Note that octaves 51 and below had better be empty if
it isn't a Pentium, you are looking to see in which octave the number
2^23*5*64/8 = 335544320 (or so) shows up.  This program takes 45
minutes on a Pentium.  I haven't tried this on any other machine yet,
so there may be glitches.  For example a big-endian machine like a
Sparc will need [0] instead of [3] in extracting the exponent from
err.

To mount a 24-hour search on the Pentium that catches the remaining 2.5%
of error pairs, change "-64" to "-2048", covering the whole of the five
Coe intervals.


#include <stdlib.h>
#include <math.h>

int bit[1<<11];

main()
{
	int i;
	float x, y, p;
	double err;
	for (x = 1<<23; x < 1<<24; x++)
		for (p = 18*(1<<19); p <= 30*(1<<19); p += 3*(1<<19))
			for (y = p-1; y >= p-64; y--)
				if (err = fabs(x - (x/y)*y))
					bit[((short*)(&err))[3]/16]++;
	for (i = (1<<11)-1; i >= 0; i--)
		if (bit[i])
			printf("%d %d\n", 1047-i, bit[i]);
}

What is needed next is to generalize this program by making the fixed
single-precision assumption (implicit in all the non-unity constants in
lines 6-8, the three for-loop controls) into a parameter.  This way one
can study the effect of precision in the operands on the above table.
The octave populations must grow monotonically with increasing
precision since every 37-bit-precision mantissa is automatically a
38-bit-precision one, and so on.  One question this would help one answer
is how many double-precision errors there are.  The catch is that each
additional bit of precision in the operands quadruples the search space.
However I believe this can be handled by suitable heuristics.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug21----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 23479 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Pratt to Intel: prove it's safe (Was: Who do you believe?)
Date: 14 Dec 1994 09:57:00 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 184
Message-ID: <3cmfhc$ies@Radon.Stanford.EDU>
References: <3clu4g$dpe$1@garnet.msen.com>
NNTP-Posting-Host: sunburn.stanford.edu
Keywords: Intel Pentium Detroit Free Press IBM

In article <3clu4g$dpe$1@garnet.msen.com>,
Hiawatha Bray <watha@garnet.msen.com> wrote:
>I'm a columnist for the Detroit Free Press.  A couple of weeks ago, I 
>wrote that the Pentium problem seemed like a tempest in a teacup to me, 
>though Intel should have come clean a lot sooner.  Then the Monday 
>announcement by IBM got me thinking I'd taken the matter too lightly.  
>
>Now I'm having third thoughts.  If the Pentium is as bad as Big Blue 
>says, thousands of users worldwide would surely have gone howling to 
>their computer dealers, complaining of bizarre arithmetic errors.

For a change of pace from my gruelling postings of tables of numbers,
let me wax political for a little bit.

The media has asked repeatedly why Intel is being beaten up, and has
speculated at length on the phenomenon, quoting AMD's Jerry Sanders,
"The Internet is democracy at its ugliest."  Intel for its part asks
plaintively why is it being made the guinea-pig "in the new era of
crisis management," and protests that it has learned nothing about the
bug from the Internet except how emotional people can get (San Jose
Mercury News 12/12, page 2D).

There is an answer, one that is like a short proof of a hard problem.
The sort of proof that you don't think of right away, but once you see
it you slap your forehead and say, of course, it's obvious, that's what
it has to be.

First let me clarify my adversarial role in all this.  Intel is not my
enemy, my only enemies are ignorance and slowness.  "What you don't
know won't hurt you" only works in a society with no information
highway and whose regular highway has a 4 mph speed limit for horseless
carriages.  Speed is great, and more of it is where we're all headed.
But the faster things go, the more alert you have to be to risks.  It's
that way with Ferraris, and it's that way with Pentiums.  If you hear a
little tapping sound coming from the engine at 130 mph, it's heads up
or take your chances.

Intel is an effective player in technology and I very much hope it
continues to be.  Intel along with many other companies around Silicon
Valley is family to me.

This said, I must side with IBM (I have no IBM stock or other material
incentive to take sides here) and state without qualification that I
believe the Pentium to be for all practical purposes exactly as bad as
IBM says it is.

Now in saying it is bad, IBM is not also saying that thousands of users
are currently reporting bizarre arithmetic errors, as implied in your
question.  What IBM has pointed out are neither anecdotes nor case
studies but possible risks.  For this they used simple examples that
the financial community can easily relate to, and that it can use as a
basis for deciding for itself (perhaps not so easily) whether these are
risks they are willing to take.

There are two basic technical aspects to this bug that are insidious in
the extreme.  One is the astronomical range of error rates, from one in
every 27,000 years if your coffee breaks leave you time for only a
thousand divisions a day *and* your data is perfectly uniformly
distributed, to one every millisecond while frantically dividing hordes
of small bruised integers.  The other is the uncertainty of where the
world's applications lie along this spectrum.

Neither end of the spectrum is a likely candidate for any ordinary
application.  No Pentium user limits herself to 1000 divisions a day,
and no one's data is perfectly random or they could use it for a random
number generator.  At the other end, no one does a division every third
instruction, and no one's data consists exclusively of small bruised
integers.

Ok, so since no application operates at the extreme low-rate end of the
range, where Intel seems to think almost everyone lives, and since no
application operates at the worst-case high-rate end either, where the
doomsayers like to congregate, where *does* the typical application
work and play?

No one knows.  Not me, not you, not Intel.

If Intel can show that people's data are distributed so that their
division operands are sufficiently random as to achieve close to the
one-in-9-billion errors-per-division rate, let Intel produce its
argument.  To date it has not come close to doing so.

Meanwhile various qualified parties (and not-so-qualified parties put
there by an invisible force to make interpretation of all these claims
more of a challenge) have pointed out that there exist other points on
the error-rate spectrum besides the one Intel continues to cling to
grimly.  And, moreover, that these other points are not just miles but
light-years up this spectrum from Intel's cherished low-rate end.

Let me convey some idea of just how hard it is to make any plausible
argument about randomness stick.  You might look carefully at the
actual operands to the division instruction arising in practice,
observe that they seem to be spread extremely uniformly across the
whole range, and deduce complete randomness.  But now suppose that a
mere one percent of those operands turn out to be crowded around
integers, to within 0.00001 of the nearest integer, the "bruised
integer" danger region" for the Pentium's division instruction.
Without microscopic examination of the distribution, this little blip
so near the integers could pass completely unnoticed.  Yet now the
proportion of bruised integers has gone up from one in a hundred
thousand (for random data) to one in a hundred, this being the
percentage now crowding around integers.  This increase of a factor of
a thousand in bruised integers, in both the divisor and the dividend,
will increase the error rate by a thousand squared, that is, a
million.  Yet this staggering increase has resulted from a one percent
change to the distribution, a change almost as invisible as the bug
itself.  Such extreme sensitivity to the ambient conditions verges on
chaos.

It is *very* hard to prove what Intel wants us all to take
unquestioningly for granted.

Now, another automotive analogy.  A question has arisen inside the
Duesenberg Motor Car Company's as to the safety of the brakes of this
year's model.  The question involves a very peculiar problem brought to
light by an engineer.  No car in that line has yet had a brake failure
attributable to this problem; moreover it is quite on the cards that
none ever will.  Unless, that is, some people brake in a certain way
that though peculiar is not really all that difficult to do and which
you could conceivably imagine your kid doing while out for a spin, for
whatever strange reason, even if you've never seen him do it.  The sort
of car the GOP might once have wished Rosemary Wood drove.

Now what would you recommend DMC do under these circumstances?  Should
they hush up this discovery and cross their fingers that this problem
will quite possibly never happen even once, and that if by very bad
luck it should ever happen, no one will correlate it with the brake
design.  Or, should they make every effort to warn every owner of this
problem?  Or should they mount a tremendously expensive recall of every
vehicle?

If you think they should insist on recalling the whole line, well,
automotive safety being the issue it is in today's litigious society,
perhaps this is really their only financially viable option, not to
mention the decent thing to do and of course great for their image.

In a less litigious world one can imagine DMC simply disclosing their
understanding of the situation in full, and pointing out the very
obvious fact that there are many other things you can easily do with
the pedals that are far more likely to run you into a tree at 90 mph
than this obscure brake problem.  Here is what we know, here is what we
don't know, we have hidden nothing bearing on this risk from you,
decide for yourself.

But if you think DMC's optimal strategy is simply to deny that the
problem could present any real risk, then Intel's handling of this
inconsequential flaw that will bite you only on your 27,000th birthday
will likewise seem entirely appropriate to you.

As long as Intel doesn't know for sure how badly and how often its
customers can be tickled, injured (see bug6 of my FDIV papers
boole.stanford.edu:/pub/FDIV/bugs for how the bug could cause
whiplash), or killed by the bug, it has no business reassuring its
customers that they'll never notice a thing.  In the spirit of full
disclosure Intel should state the uncertainty of the situation.  It
should point out, or at least allow for the fact, that the pessimists
naturally drift to one end of the spectrum and the optimists to the
other.  And it should strongly resist the understandable temptation to
join the most extreme optimists, and instead stick to the facts, in
particular to what is actually known about just where along this very
very long spectrum applications really do fall.

Exactly the same problem comes up with publicly held companies when
they communicate with industry analysts.  Companies are naturally
wildly optimistic about their prospects, it is hard to keep going when
you're not.  But to communicate that optimism to the stock market
through the analysts when it is sufficiently misplaced is to leave
yourself wide open to actions by disgruntled stockholders, who will
bring your grandiose but naive promises before the jury as witness to
your malicious lies.

The marketplace does not suffer cockeyed optimists gladly.  My
great-grandfather, member of Parliament for Collingwood and one of
Victoria's most successful promoters of land speculation late last
century, was one, and paid for it with six months in jail when the land
boom bubble burst in 1892.

Intel should conduct its optimistic speculations in private, in public
it needs to stick to the facts.  Nothing less will mollify this net.
Agreed, netters?
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	 		My every word is copyright, especially those
			not in the dictionary.


=================================================================
-------------------------PENTIUM REPORT # bug22----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 23742 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: sci.math,sci.math.num-analysis,comp.sys.intel
Subject: Re: Pentium Chip Calulation Bug
Date: 14 Dec 1994 22:08:09 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 51
Distribution: inet
Message-ID: <3cnqc9$i60@Radon.Stanford.EDU>
References: <rlawD0Ituu.6nu@netcom.com> <D0tHKC.6FJ@cnn.nas.nasa.gov>
NNTP-Posting-Host: sunburn.stanford.edu
Keywords: FDIV
Xref: Radon.Stanford.EDU sci.math:47431 sci.math.num-analysis:10139 comp.sys.intel:23742

In article <D0tHKC.6FJ@cnn.nas.nasa.gov>,
Daniel A. Asimov <asimov@wk223.nas.nasa.gov> wrote:
>Does anyone have precise information about just what class of
>arithmetic problems results in a buggy calculation?

The nature of the bug is such that divisors a tad under an integer are
at far greater risk.  The risk is maximized when that integer is 3
times an odd digit times a power of 2 (quantities that are 2^-n times a
risky integer are equally at risk) with "tad" no more than 10^-4 and
becoming more significant at lower values.  The risk is yet further
increased when the dividend is near an integer, but the conditions are
more complex.  See tables governing this phenemonon in
boole.stanford.edu:/pub/FDIV/bugs or individual.bugs/bug1 for the
shorter file, see also the README.

The above will give an excellent idea of error rates, which can exceed
one in a thousand for randomly chosen "small bruised integers," defined
as numbers of the form n-k where k is some fixed negative power of 10
(or any other radix, 10 is not special here), e.g. 10^-6.

While it does not bear on your specific question about rates, the
following example I contrived just now has a certain appeal.  A useful
pair of statistics to keep on a township comprised of sorts A and B is
the size s=a+b of the total population and the proportion p=a/(a+b) of
members that are of sort A.  From this other questions such as the
proportion of sort B, the ratio of the two sorts, the number of people
of sort A, and the number of sort B, can all be determined.  In
particular we can reckon a as sp.

Now suppose this useful technique is applied in a large city having
a=5,505,021 and b=3,932,159.  For our records we compute s=9,437,180
and p=0.583333263, sufficient precision for the expected questions
about this population.  We may later recover a as the product of these
two, namely sp=5505020.9746 which we round to the nearest integer,
which unsurprisingly is our original a.

But if we do our record keeping on a Pentium, when we come to compute p
it turns out to be p=0.583328176.  Later when we go to retrieve a as sp
we find it to be 5504972.996, clearly the integer 5,504,973.  But now
48 people have disappeared!

The Pentium is a blazingly fast computer, and can easily make 20
million people a second disappear.  It is also priced right for
quantity purchases, permitting even greater rates.  As supplies of the
buggy Pentium dry up next year, expect their price to rise sharply as
certain countries anxious to please the UN about their human rights
compliance snap them up for their bookkeeping.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	 		My every word is copyright, especially those
			not in the dictionary.


=================================================================
-------------------------PENTIUM REPORT # bug23----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 23755 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Ne wYork Times does better than before covering Pentium problem
Date: 14 Dec 1994 22:41:30 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 37
Message-ID: <3cnsaq$mjs@Radon.Stanford.EDU>
References: <oppedahl.1212.2EEF40EA@patents.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <oppedahl.1212.2EEF40EA@patents.com>,
Carl Oppedahl <oppedahl@patents.com> wrote:
>
>corporations buying machines for use within the company.  It identifies a 
>Vaughan Pratt, a Stanford computer scientist who has conducted his own 
>analysis of the frequency of errors occurring in Pentium calculations and who 
>has found it to be "significantly higher" than what Intel has reported.

I hope the "be" is not what the article actually said.  I have never
made a stronger statement than "could be."  Intel is the one making the
strong statement when they say it can't be, for average users.  Not
enough is known to support Intel's position; in fact the strange nature
of the bug tends ironically to put average users *more* at risk than at
least some scientific users.  *Quite* enough is known to support my
position, which is in full agreement with IBM's statement.

In case the media is oversimplifying things for their readers, it
should be made clear that there are two different (in fact dual) kinds
of noise involved in these statistics, the distribution of operands in
a given class of application, and our uncertainty about those
distributions, which one might imagine formalizing as a distribution
itself, namely of distributions.  This is like disjunctive normal form:
each disjunct as a conjunction of atoms corresponds to a distribution
of values, and the disjunction of those conjunctions corresponds to a
distribution of distributions.  This is the sense in which the two
kinds of noise are dual: it is the De Morgan duality of conjunction and
disjunction, but for distributions in place of simple logical
connectives.  These ideas about nested distributions permeate both
quantum mechanics (mixed states consisting of distributions of pure
states which are complex-fuzzy conjunctions of possibly contradictory
assertions such as Schr"odinger's_cat_is_dead and
Schr"odinger's_cat_is_alive) and game theory, important in both
economics and computer science.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	 		My every word is copyright, especially those
			not in the dictionary.


=================================================================
-------------------------PENTIUM REPORT # bug24----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 24025 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Dataquest Report: The Great Pentium Fire Drill
Date: 15 Dec 1994 10:37:59 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 228
Message-ID: <3cp6a7$q5n@Radon.Stanford.EDU>
References: <3cnceb$fms@newsbf01.news.aol.com> <dmurdoch.1294.787452544@mast.queensu.ca>
NNTP-Posting-Host: sunburn.stanford.edu

In article <dmurdoch.1294.787452544@mast.queensu.ca>,
Duncan Murdoch <dmurdoch@mast.queensu.ca> wrote:
>That was an interesting article, but it contained one very substantial error:

I found this article inaccurate biased rubbish from start to finish.
The one thing I did not find offensive in this regard was the article's
insistence on the economic havoc a total recall would wreak.  This may
well be the case, though it is far beyond my competence to judge.  What
I *have* been recommending is not full recall but full disclosure,
which I take to include the acknowledgement that the range of error
rates is far wider than Intel, Dataquest, and some other wishful
thinkers are allowing.

A total recall sounds horrendously expensive and is not the sort of
thing that one can productively encourage without a lot better
understanding of the tradeoffs than I could muster.  Full disclosure is
not at all expensive, quite the opposite.  First, full disclosure would
restore the confidence of us Intel customers in Intel's forthrightness
in its dealings with us.  Second, it might save some unfortunate user
from a watery grave or a repeated experiment.  Both are very
important.

How Intel imagines it is benefitting by refusing to even look in the
direction of the high error rates spelled out in the IBM report and my
posts is totally beyond my comprehension.  Is Intel concerned that the
possibility of such rates will turn meekly accepting customers into
angry users all demanding new chips.  How is hiding its head in the
sand about the very possibility of high error rates helping Intel when
it is obvious to anyone with half an eye that they are perfectly
possible?  (Whether they are likely in say Lotus-123 is another
matter.) Who does Intel think it's kidding?

Here are my detailed comments bearing out my above analysis of this
unfortunate article's merits, or lack thereof.

	The errors are extremely small--in many cases, insignificant--and occur
	in approximately one out of every billion calculations involving a
	reciprocal.

Wrong.  1/16,000 is an extremely small error for a bathroom scale, it
is an extremely large error for even a $3 calculator let alone a $3,000
computer.

	For the set of all divisors and dividends, we accept Intel's
	claim of one failure in every nine billion pairs.

Qualification: "All" here refers to double precision.  For the set of
all single-precision divisors and dividends, which the authors seem to
be assuming a little later, the failure rate is less than one in forty
billion pairs.

	The most famous calculation, now known as Coe's Ratio, and
	undoubtedly the most common single division performed on the
	Pentium, is 4195835/3145727.

Strong competition: 1/1, 0/1.

	Internet correspondence indicates that this is about the
	largest error obtainable, and that there are approximately 1,738
	unique pairs that can generate an error with this big.

Wrong.  This is the approximate number, depending only on how you judge
"maximum error," not on any other uncertainty, of single precision
pairs generating an error of any magnitude greater than allowed by the
IEEE standard for single-precision results.  The number of error pairs
generating the maximal error depends on the precision of the operands.
For single precision operands it is 67 (depending on exactly how you
define "maximal error"), for double precision operands it is on the
order of 2 * 10^19.

	It is extremely difficult to contrive a situation in which the
	divide error could manifest itself in a way that is material to
	an end result.

Wrong.  It took me 10 minutes of looking at Coe's table to come up with
a simple and plausible situation producing 10^-5 errors at very high
rates.  See bug1, my first post on this subject.

	This is because the error is normally very small,

Define "normally."

	occurs at a rate that is extremely rare,

Wrong: it occurs at a rate that depends on the application, and can
vary from not at all (which is even less than extremely rare) to as
fast as a thousand times a second.  This is a fundamental point that
Intel refuses to acknowledge.  This is unconscionable.

	and should have no effect on the huge majority of Pentium users.

Agreed provided only that "huge" is deleted, which without more data is
just more of the wishful thinking pervading this whole miserable
article.

	The only class of user likely to be affected is that of
	mathematicians working in the field of number theory, the class
	of user from which the bug was originally reported.

This inference is based on a population of one.  Number theorists using
only integer arithmetic, the arithmetic of numbers, will be unaffected
altogether.  IBM makes the point that financial users are at some
risk.  The facts of the IBM report are unquestionable, leaving the
interpretation of those facts as the only possible point of rebuttal.
Interpreting those facts as that financial users are at *some* risk
cannot possibly be considered controversial.  The debate is over the
extent of that risk, but IBM has chosen not to address that issue,
leaving no grounds on which they can be faulted.

	Clearly, users of a Pentium-based system designing devices
	where a divide error could conceivably be life-threatening--
	for example, a bridge, a ship, or a plane--should consider an
	upgraded part even though the error is unlikely to occur and,
	if it does, is effectively certain to be insignificant. Much
	engineering design work involves iterative calculations that
	work toward a solution. In these cases, the likelihood of the
	bug occurring is increased, but its effect is totally negated
	by the subsequent iterative calculation.

"Totally" is not appropriate here.  The presumed model is that every
individual error must decay after further computation.  I would expect
that even in this iterative situation a single error *could* ruin your
whole day, although I am also comfortable believing it may very likely
decay as claimed.  I would be even more comfortable hearing all this
from an expert.

	If the flaw were to somehow cause a nonconvergence, the process
	would be started over with a new set of initial conditions, and
	the error would be eliminated.

While I find this quite plausible, I would again defer to an expert in
this area.  There is a great variety of types of iterative calculation,
and moreover some iterative processes can be sped up with noniterative
algorithms, making arguments about the advantages of iterative
algorithms for buggy processors specious.

	Financial users could conceivably be affected by the bug.
	However, the size of the error is again likely to be
	immaterial, representing hundreds of dollars in a million-
	dollar transaction.

Ingenuous in the extreme.  Transactions depend critically on
maintaining sufficient accuracy to avoid diluting tight profit
margins.  Is Dataquest offering to make up the difference?  And why
this insistence on using buggy computers for transactions involving
millions of dollars.  What exactly is the drawback to using reliable
computers?  Does reliability command that big a premium?

	Although larger transactions could be affected, no financial
	institution would allow a transaction to take place without some level
	of cross-checking. The cross-check process would inevitably surface the
	error, and the rarity of the bug would prevent compensating errors from
	occurring or for multiple errors to compound.

Upping the ante merely amplifies my previous remarks.  What accountant
who knows the meaning of the term "job security" would allow a buggy
computer remotely near any transaction of this magnitude?

	In summary, the low level of the error relative to real-world
	situations, coupled with the extreme rarity of the bug,
	guarantees that even calculations involving lives and money are
	safe on a Pentium.

As opposed to on what?  This is a quotable quote for Newsweek.  Earlier
Dataquest stated that doing this sort of thing was "clearly" not on,
now they seem to be taking that back.  In so doing Dataquest is going
out on *really* thin ice.

	Numerologists and some extremely theoretical physicists,
	however, stand at risk of producing bad results and should
	request replacements.

Speak for yourself.

	intensive numerical calculations that fill
	memory with intermediate values could turn up an occasional
	incorrect answer once every few hundreds of machine years. We
	believe that this class of error is as significant or more so
	than the Pentium divide bug, as it occurs more frequently and
	can result in a much larger change in the number.

What about applications with high error rates, like one a day or a
minute or a second?  Oh, right, these don't exist.

	Having established that the bug itself is immaterial (and we
	would be happy to talk about this at great length--it is
	extremely difficult to contrive a real-world calculation in
	which it matters),

Define real-world.  If I contrive one, how do I know you won't
disqualify it on the ground that contrived calculations are by
definition not real-world?  Provided there's no such catch-22, let me
contrive the following, taken from bug22, which is as real-world as a
contrived example can get.

Santa Clara county has 5,505,021 rock lovers and 3,932,159 jazz
lovers.  (Not the real numbers as Dave Barry would say, but Santa Clara
is a real place, I live there.)  We record all this in our database in
the form of the total population 9,437,180 of such music lovers, along
with the proportion that love rock, namely 0.583333263, these two
parameters being the most frequently requested.  To retrieve the number
of rock lovers from our database, a less frequent request but still of
interest, we just multiply 9,437,180*0.583333263 = 5505020.9746 which
we then round to the nearest integer.  Hey, this works great.

All well and good had we done this on a true-math computer.  On a
Pentium the key difference is that when we compute the proportion that
love rock, the Pentium pegs it at 0.583328176 (ouch!).  When later on
we go to look up the number of rock lovers we get 9,437,180*0.583328176
= 5504972.996, which when rounded yields an answer 48 below the correct
number.  48 rock lovers have disappeared from Santa Clara.  There is a
plot afoot.

And you say this calculation doesn't matter.  Define "matter."

	...

        Why a Total Recall Is Dangerous to the Industry
	Several writers have suggested the idea of a total recall of
	the part. We at Dataquest believe that this is an imprudent and
	irresponsible suggestion;

Probably right, but this is getting outside my sphere of competence so
I'll pass on the rest of this amateurishly written article.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	 		My every word is copyright, especially those
			not in the dictionary.


=================================================================
-------------------------PENTIUM REPORT # bug25----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Re: Ne wYork Times does better than before covering Pentium problem
Summary: 
Expires: 
References: <oppedahl.1212.2EEF40EA@patents.com> <3cnsaq$mjs@Radon.Stanford.EDU> <3cprkd$idc@newstand.syr.edu>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

Readers of comp.sys.intel might be so good as to forward this message
to those who should see it but don't have the time to wade through the
incredible volume.  It corrects a misattribution to me in the New York
Times, which is read by many people in a position to understand the
technicalities of the error but who are in no position to read
comp.sys.intel.

In article <3cprkd$idc@newstand.syr.edu>,
Mark Levitt <melevitt@mailbox.syr.edu> wrote:
>Vaughan R. Pratt (pratt@Sunburn.Stanford.EDU) wrote:
>>I hope the "be" is not what the article actually said.  I have never
>>made a stronger statement than "could be."

>   Sorry, the times are says "and found it to be significantly higher..."

I took this up with the reporter, John Markoff, who had interviewed me
for this article.  He understood exactly the problem, acknowledged the
error, and apologized profusely.

I *really* wish he had said "could be" instead of "to be" here.  While
this distinction may be lost on the average reader, more technically
qualified readers will see it and wonder.

This error was greatly compounded by an unfortunate "transition" as
editors call it that was added to Markoff's article by an editor and
which Markoff did not subsequently catch.  The objectionable phrase was
"Yet some other scientists said that it was difficult to predict
average error rates," in a context that implied that both IBM (in their
Pentium study) and I were saying something else.  Yet my whole
interview with Markoff had dwelt on just how difficult any such
prediction was.  The existence of high error rate applications that
Intel to date has refused point blank to acknowledge makes such
predictions much harder than Intel is claiming.

Intel is going out on a limb in its strenuous efforts to persuade
customers that the range of possible error rates never drops to a point
that would cause anyone any concern, allowing them to make a concrete
prediction that there is no serious problem.  If the complexities of
this situation are this far beyond Intel's management to grasp, think
what it must be like for the poor customer trying to make sense of the
situation.  Intel says there's no risk, IBM says there's some risk.
Who do you believe?

I have not made the sort of prediction that Intel is making, and I see
nowhere in IBM's Pentium study where they do either.  All that I claim
is that there is an astronomically wide range of error rates, and that
perfectly plausible applications exist at all points along this
spectrum from an error rate of one a second (hundreds a second for
slightly less plausible ones) to the extreme low of no errors at all.
Oddly, Intel neglects *both* these extremes.

It seems to me that it is very hard to defend Intel's position that all
real-world applications must be at the low-rate end of the range.  And
it is completely impossible to defend their failure to even *mention*
that the high-rate end of the spectrum exists, other than in terms of
running to stand under a meteor as Andy Grove puts it (where the error
rate is of course *millions* a second, just keep dividing with Coe's
pair).  This shoe is on the other foot:  Intel is running very hard to
stand under the Pentium division meteor, and might even get there if
they don't turn around soon.

>  However, I agree that this is the most accurate and balanced coverage I
>have seen to date.

Indeed.  Markoff says he and the editor worked for an hour to get the
article as a whole right, they've done a super job with this, and he
deeply regrets that there were still two fairly basic errors left in
that little part of the article when it was done.

The article does unfortunately leave the impression that there may be
three points of view, Intel's, IBM's and mine, and "other scientists"
(whom the article leaves unnamed).  I hope this posting corrects this
by establishing that there are only two points of view, Intel's, that
the average user need have no concern about high error rates, and the
opposing view, that it is very hard to make such a prediction in the
light of the very plausible high rate scenarios that can arise even in
simple financial transactions of the kind detailed by the IBM study.

Hopefully future interviews will average out this sort of noise, which
as Markoff pointed out is inevitable for any operation under time
pressure, such as getting fast microprocessors to market.  In this case
NYT's short burst of noise here is not to be compared to the ongoing
noise being generated today, both inside Pentiums and in the media, by
the missing five entries in the FPU's P-D plot!

>  Just my  $0.199935454  (Kudos to the originator of this joke...)

On Intel's behalf I would like to state that, whatever its other flaws,
the Pentium does not move the decimal point one place to the left. :)

I would also like to state on Intel's behalf that, except for the bug,
the Pentium FPU does a gorgeous job on the low order 11 bits in the
internal 80-bit (64 bits of mantissa) representation of the FPU, much
better than on some other FPU's I've seen.  I am very much looking
forward to extraordinarily accurate computations with the replacement
chip they promised early last week to send me. I've never had this much
floating point accuracy before, and I am confident that the occasional
glitches I used to get at single-precision with my circle-tracking
software for my font-capture program, whose glitch rate was greatly
reduced by passing to double-precision (back in 1987, long before the
Pentium existed), will be essentially eliminated altogether on the
repaired Pentium and any other processors that handle with kid gloves
those 11 low-order bits hiding out there in the FPU below the 52nd bit
of the visible mantissa.

=================================================================
-------------------------PENTIUM REPORT # bug26----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 24514 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: FDIV Myths (with additions, re-posted by popular demand)
Date: 16 Dec 1994 09:36:31 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 134
Message-ID: <3crn2v$i75@Radon.Stanford.EDU>
References: <PETERM.94Dec16161352@whio.grace.cri.nz>
NNTP-Posting-Host: sunburn.stanford.edu

In article <PETERM.94Dec16161352@whio.grace.cri.nz>,
Peter McGavin <peterm@maths.grace.cri.nz> wrote:
><Many myths debunked>

This is excellent work.  Just a few small cavils:

>Myth: The flaw affects the precision of the result only after the 9th
>decimal place.
>
>Answer: No, operands can be scaled so that the error occurs in the
>tens place or the millions place or any other place.  It is much
>better to consider significant digits.  The largest possible errors
>occur in about the 4th or 5th significant digit, but most errors are
>much smaller than that.
 ^^^^

The "much smaller" in this last clause is open to misinterpretation.
As Intel has pointed out in their white paper, released to their WWW
site on Tuesday, "Statistical measurements using over a trillion test
points indicate that the inaccuracy is is equally likely to manifest
itself in bit positions 12 through 52 to the right of the binary
point."  Another way of saying this is that the the errors are
uniformly distributed by their exponents.  This would imply that 1/41
of the errors have their leading bit in the 12th position (13th
position if you are using 1-origin indexing for the true mantissa, as I
did in my table), 1/41 in the 13th position, and so on down the line.

I undertook to investigate this phenomenon myself before the white
paper was released (I saw it for the first time yesterday, Wednesday).
I posted my preliminary conclusions late Tuesday night, available as
boole.stanford.edu:/pub/FDIV/individual.bugs/bug20.  I observed the
same basic uniformity in error magnitude distribution noted by Intel.
However I have some minor concerns with their results, ranked here in
decreasing order of importance in my estimation.

(The numbering system followed below numbers the bits of the mantissa
from 1 (most significant) to either 23 or 52 (least significant)
depending on whether the operand is single or double precision.)

(i)  Since the largest observed error is known to be 2^-14, for Intel
to be observing errors in bit 12 suggests that their definition of
where the first bit is in error is one that would make 1000000
different from 0111111 in bit 1.  Readers may draw their own
conclusions about the usefulness of that measure.  Instead I
*subtracted* the two quantities and used the position of the leading
bit of that difference.  (Normalization after the subtraction moves
that bit up, but one can easily read off the old bit position from the
now-decreased exponent.)  Applying this criterion to the above example,
one gets the difference at bit 7 instead of bit 1, a more faithful
reflection of the actual difference.

In this way Intel would appear to have overstated the mean magnitude
of the error by a factor of perhaps 2, i.e. one bit position (would
someone please do the exact statistics here?), and greatly increased its
variance (modulo the "buzz" observed below).

(ii) Instead of performing statistical measurements as an indicator, I
used exhaustive search to count the exact number of single precision
operands giving rise to each magnitude of error as defined above.  Thus
my numbers are not statistical but exact counts.  As such they are
exact predictors of what a statistical search, in the space of
single-precision pairs, would come up with in the limit assuming a
perfect random number generator.

Here are these numbers for the first 10 nonzero bit positions, namely
positions 14 through 23 inclusive (the possible error positions in the
single-precision mantissa).  (The numbers I posted on Tuesday were from
a quick search, these are from an exhaustive search that ended Thursday
morning, which turned up an additional 7%, not 2.5% as I had
predicted.  This complete search also recorded in a file (all1precerrs
on boole.stanford.edu as above) the set of actual erroneous pairs
themselves, so one does not need to repeat this search ever again, at
least for single-precision operands.)

67 108 170 150 276 138 259 135 283 152

This starts out a little low and takes about three bit positions to
ramp up to a steadily alternating ("buzzing") value, plus a little
white noise.

These numbers continue with the 29 bits numbered 30 through 52 as
follows.

262 163 300 140 279 147 280 159 274 158 272 145 250 166 257
148 250 149 256 163 278 152 268 152 256 147 259 150 245

This shows that the uniform distribution of log magnitudes of errors
continues down into double precision.

Intel did not investigate whether the distribution remained uniform for
the 11 bits of extended precision beyond bit 52.  Here is my data
for positions 53 to 62 inclusive.  (Position 64, the LSB, is
meaningless, being where normal truncation errors accumulate.)

161 265 153 259 154 236 169 258 149 248

This shows that the bug extends into the 11-bit region, with no
essential difference from the rest of the mantissa.

>Myth: Approximately 1800 different operand pairs to FDIV produce the
>wrong answer.
>
>Answer: No, that is only the number of unique pairs of
>single-precision mantissas.

To be more precise, 1738 such pairs yield single-precision errors,
defined as errors down to bit 23, the least significant
single-precision bit, while including bit 24 makes this 2000 exactly.
This is obtained by adding up bit positions 14 through 23 inclusive.
For various definitions of single-precision error, the number of errors
meeting this criterion will lie between 1738 and 2000.  Adding up
positions 14 through 52, namely 7863, gives the number of
double-precision errors for single-precision operands, again modulo the
exact definition of double-precision error.


>The total number of double-precision
>pairs is vastly larger and is probably impossible to count exactly.

We'll see (about getting the exact count).

>Myth: IBM says a typical spreadsheet user will see an error every 24
>days.
>
>Answer: No, the press release says "as often as once every 24 days".

A subtle but very important distinction.  I was similarly misquoted in
Wednesday's New York Times as saying the error *was* high when I said
it *could* be high.  For these distinctions to be lost on the average
Pentium customer makes computer-buying a relatively risky enterprise.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	 		My every word is copyright, especially those
			not in the dictionary.


=================================================================
-------------------------PENTIUM REPORT # bug27----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 24522 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: sci.math.num-analysis,comp.arch.arithmetic,comp.sys.intel
Subject: Re: Variance of FDIV error
Date: 16 Dec 1994 10:00:09 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 34
Distribution: inet
Message-ID: <3crof9$ior@Radon.Stanford.EDU>
References: <jgkD0uCKL.4KL@netcom.com>
NNTP-Posting-Host: sunburn.stanford.edu
Keywords: bug
Xref: Radon.Stanford.EDU sci.math.num-analysis:10190 comp.arch.arithmetic:819 comp.sys.intel:24522

In article <jgkD0uCKL.4KL@netcom.com>, Joe Keane <jgk@netcom.com> wrote:
>About the Pentium FDIV bug, there is of course a tradeoff between frequency
>and magnitude of errors.  There's a certain chance that any error occurs, and
>a smaller chance that an error of some given magnitude occurs.  We can't
>immediately say which is worse, a bigger error or a more frequent one.
>
>I figure that a reasonable way to combine these characteristics is to compute
>the variance of the error, and i was wondering if anyone has tried to do this.

This was already answered implicitly on this list when I posted a
within-7% estimate of the actual distribution on Tuesday night.  Here
is the exact number of errors of each magnitude (leading bit position
of x-(x/y)*y) for the 2^46 pairs of single-precision mantissas.

bits 14 to 23 (single precision)
67 108 170 150 276 138 259 135 283 152
bits 24 to 52 (double precision)
262 163 300 140 279 147 280 159 274 158 272 145 250 166 257
148 250 149 256 163 278 152 268 152 256 147 259 150 245
bits 53 to 62 (extended precision)
161 265 153 259 154 236 169 258 149 248

Except for a period-2 "buzz" which won't affect central moments below
the 25th or so, this is effectively a rectangular distribution.  The
variance of this density as a function of ln e, the *log* of the error
e is that of a rectangular distribution (so 1/12 the width? or is that
the standard deviation?).  The density as a function of the error e
itself is a hyperbola 1/e, whose variance I suppose must be the
integral of e^-2, namely 1/e (for a suitable value of 1).  Someone
might check this more carefully than I can this late at night.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	 		My every word is copyright, especially those
			not in the dictionary.


=================================================================
-------------------------PENTIUM REPORT # bug28----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15743 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch
Subject: Re: TECHNICAL (Pentium): Exact number of FDIV error-pair FLOATs (was: Bad section of the noisy-integer table)
Date: 16 Dec 1994 10:11:16 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 18
Message-ID: <3crp44$j1p@Radon.Stanford.EDU>
References: <874@moene.indiv.nluug.nl>
NNTP-Posting-Host: sunburn.stanford.edu

In article <874@moene.indiv.nluug.nl>,
Toon Moene <toon@moene.indiv.nluug.nl> wrote:
>Besides, to be really able to assess the effects of this particular bug, I  
>have to have the 1281 triples of (dividend, divisor, Pentium-quotient) in  
>a look-up table in my code, to do the right :-) thing. I haven't seen this  
>list up till now.

This is now available (expanded to catch the few additional pairs
overlooked by the quick search, and extended down to all precisions of
errors) in the file all1precerrs in /pub/FDIV on boole.stanford.edu.
The errors are collected under headings, one for each magnitude
(position of leading bit) of error.  Only operands in [2^31,2^24) are
listed, all other buggy pairs of single-precision operands are these
with arbitrary exponents.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	 		My every word is copyright, especially those
			not in the dictionary.


=================================================================
-------------------------PENTIUM REPORT # bug29----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 24965 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: SAS Institu's Position
Date: 17 Dec 1994 15:41:17 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 41
Message-ID: <3cv0qt$pca@Radon.Stanford.EDU>
References: <3cs3ug$b8d@news.cloud9.net>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3cs3ug$b8d@news.cloud9.net>,
Andrew Beveridge <beveridg@cloud9.net> wrote:
>According to Pratt's table, if the amount of
>bruising is less than 1E-12, the result is accurate to at least
>10 significant digits.

This is only true for integer operands less than 1000, the range
covered by my table.  Dividing the integer 4194687 by 3 bruised in the
least significant bit, an amount of bruising less than 1E-15, yields an
error of 1E-5.  Ramesh Agarwal has given two simple ways that such a
very slightly bruised 3 can arise: as .3/.1, and as 4.1-1.1.  In light
of item 2 below it should pointed out that these are not the only
ways.

>If you have a DATA step division
>involving a value that you know should be a small integer in
>exact arithmetic but that may have been bruised, you can use the
>ROUND function to make the value an exact integer.

1.  "Small" can be omitted from this rule, which is sound for all
positive integers and for bruises < 0.5.

2.  This would be just one rule of thousands for the production of
"Pentium-safe" software.  For example if you know the value should have
been a multiple of 1/2 but may be slightly bruised, then you should
round to the nearest 1/2; all the same reasoning applies.

3.  This method only works if you know *for sure* that it is an
integer; if it is merely an integer most of the time then the method is
inapplicable, yet you still remain just as vulnerable to the bruised
integer problem.

The right thing to do for *any* FDIV, whether or not you know the
divisor should be an integer, is to invoke the Mathworks Pentium-safe
division routine.  Don't waste huge amounts of time messing with huge
amounts of code trying to catch huge numbers of obscure special cases
like the above.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	 		My every word is copyright, especially those
			not in the dictionary.


=================================================================
-------------------------PENTIUM REPORT # bug30----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 25106 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,sci.math.num-analysis
Subject: News media deja vu: IBM misquoted again
Date: 18 Dec 1994 05:57:49 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 172
Distribution: inet
Message-ID: <3d0j0t$n7r@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:25106 comp.arch:15785 sci.math.num-analysis:10228

====================================================================
Readers of comp.sys.intel might be so good as to forward this message
to those who should see it but lack the time to cope with the
extraordinary volume of correspondence.  It corrects a seemingly slight
yet significant misrepresentation of the IBM study in the San Jose
Mercury News, which is read by many people in a position to understand
the technicalities of the error but who are in no position to read
comp.sys.intel.
====================================================================

In the business section of Wednesday's New York Times I was quoted as
having found the frequency of errors in the Pentium chip *to be*
significantly higher than what Intel has reported.  I corrected this
misattribution at the time in this forum, and that *could be* was the
strongest form of this statement I would stand by.  See bug25 on
boole.stanford.edu:/pub/FDIV/individual.bugs for further details.

Today's San Jose Mercury News (12/17, page 13C) repeats this error,
this time for IBM, reporting that "IBM said that the typical
spreadsheet users *would* see an error every 24 days" (my italics).
Had IBM indeed said such a thing, one could understand the contrast
drawn here with Prof. William Kahan's complaint in the same article, "I
wish people would stop trying to estimate probabilities based on
incomplete information about what is a typical user."  I am in full
agreement with the sentiment of this wish.

As with the NYT article, I am quoted in the SJMN article as fully
agreeing with IBM's conclusion, which I certainly am.  I would have no
problem with being linked to IBM in this way were it not in the context
of these repeated misrepresentations of IBM's conclusion.  By way of
clarifying how it could be that I find myself in any agreement at all
with both IBM and Prof. Kahan when the papers appear to be reporting
them at odds with each other, I would like to draw attention once again
to what it was that IBM actually said in their Pentium study,
*especially* their conclusion, which clearly refutes the Mercury's
interpretation.

As I've said a number of times, both in this forum and to reporters, we
do not know where the great bulk of applications lie on the rate
spectrum.  An excellent quote in this context is Hennessy and
Patterson's "Fallacy: there is such a thing as a typical program"
("Computer Architecture" p.183).

The low-rate end of the Pentium's error-rate spectrum has been
estimated by Intel at one error in 9 billion divisions for random
double-precision data, an estimate not in dispute.  (The yet lower rate
of no errors at all is an even more welcome possibility for
applications doing divisions with suitably constrained data, e.g. small
integers.) There is no controversy here; the controversy arises when
Intel makes the considerably stronger statement that this is where one
can expect to find the typical applications.

IBM's study differs from Intel's in the following two key respects.

First, IBM demonstrates that the error-rate spectrum is a lot wider
than contemplated by Intel, by showing the existence of high-rate
scenarios of a rather more plausible kind than repeatedly performing
4195835/3145727.  Intel in contrast does not raise the question of
whether anywhere near such a high rate might be achieved in any
practical situation.

Second, IBM makes no claim as to where along this spectrum any given
application might happen to reside.  Intel in contrast takes a definite
stand on the low probability of the typical spreadsheet user
encountering the bug once, not just in his own lifetime but in several
hundred lifetimes.

The closest statement IBM makes about actually experiencing any given
rate is with their statement that a user *could* make a mistake every
24 days.  Both this statement and its context make clear that it is a
purely hypothetical statement based on clearly stated assumptions
which, when met, fully and rigorously justify the claimed rate.

No suggestion is made that these assumptions are actually satisfied in
any existing situations.  Readers must draw their own conclusions about
actual likelihoods of encountering such scenarios.  Evidently the
Mercury article has drawn its own conclusion, whether or not warranted,
and then drawn the additional unwarranted conclusion that IBM must have
done the same.

Even had IBM's reference to 24 days inadvertently led some readers
astray, the conclusion of the study should dispel all doubt as to the
basic message of the report.  The final two sentences of the brief
conclusion read as follows.  "In reality, a user could face either
significantly more errors or no errors at all.  If an error occurs, it
will first appear in the fifth or higher significant digit and it may
have no effect or it may have catastrophic effects."

Nowhere in its study does IBM make a claim substantially exceeding this
clear summary.

==========

A possible small objection to IBM's study is that it does not explore
very far the range of dependencies on some of the assumptions made in
their hypothetical scenarios, such as the assumption in some examples
that operands have two decimal digits.  While this is a reasonable
assumption for monetary amounts represented in floating point, one
would like some sense of whether this is an extreme case or whether
other precisions are equally afflicted.

A possible objection to my own studies of small bruised integers is
that it is not clear to the typical spreadsheet user what inference to
draw for typical calculations.  When one knows the incidence of both
integers and bruising in one's spreadsheet it is straightforward to
take my raw error rates for a steady stream of small bruised integers
and dilute them according to the proportion of small bruised integers
encountered in one's own daily use.  But thanks to roundoff when
displaying data to only 6 or 8 decimal digits, bruising tends to
disappear like Mr. Snuffleupagus or an electron when you try to look
directly at it.  Hence most users remain blissfully unaware of the
extent to which their data is subject to bruising.  The typical user's
reaction to the observation that the difference between 4.1 and 1.1 is
not exactly 3 but a slightly bruised 3 is one of incredulity; it took
me half an hour to convince a Byte reporter of the reality of this
phenomenon (we tracked every bit).  This makes the *practical*
significance of my tables quite unclear to most spreadsheet users.

To address these objections I plan to write a short program based on
relatively unobjectionable features of these studies to permit people
to determine for themselves the rate, seriousness, and cumulative
effect of errors for the evaluation of simple expressions involving
division.  Unlike my study, there is no subtraction of small quantities
to force integers to be bruised, and unlike the IBM study the
experimenter gets to choose for herself the distribution of operands by
both size and precision.  Bruising still happens, but now only to the
same extent as in routine spreadsheet calculations, as a result of
arithmetic done with operands prior to their arrival at the division
operation.  My program limits this arithmetic to a single addition of
nonnegative reals, a particularly common operation performed throughout
typical spreadsheets.  The omission of both subtraction and negative
operands removes any concern about the possibility of mistaking the
implicit numerical instability of x/(ya-yb) for a large Pentium error
(and besides, what substance is there in negative numbers?:).

I will release the C program as source; it will be quite short and
hopefully reasonably readable, which should leave no doubt as to what
is going on behind the scenes when running experiments.

The limitation to a single addition for each operand of division
implies that the expression to be analyzed is of the form
(xa+xb)/(ya+yb) with xb and yb optional.  This severe restriction of
the language greatly increases the chances of sampling a fair
cross-section of the available expressions; when one complicates the
language with more constructs, the resulting combinatorial explosion in
variety of expressions makes it far harder to tell whether any given
set of expressions is a fair cross-section, and the results of any
feasible number of experiments then have a less clear significance.

I will include with the program the results of a few experiments
illustrating the full range of error rates, from no errors at all (an
even lower rate than Intel's estimate) to rates higher than Intel's,
all obtainable merely by changing sizes and precisions independently
for the numerator operands and the denominator operands.  The rate
claimed by Intel for typical spreadsheet users is achieved for data
with generous limits on both size and precision of input data.

Besides the overall error rates and cumulative and average magnitudes,
the program will also report, for the much smaller population
consisting of those divisions having any nontrivial error (e.g. at
least one cent in a financial transaction), the minimum, mean, and
maximum error magnitudes, together with their standard deviation.  This
data tells the complaints office whether to expect a large number of
trivial complaints or a much smaller number of serious injuries.  In
the former case a complaints office is neither a worthwhile return on
the investment nor urgently needed; in the latter however it would be
market suicide not to set up an efficient complaints office geared
towards the most likely complaints.
-- 
Vaughan Pratt		http://boole.stanford.edu/boole.html
	 		My every word is copyright, especially those
			not in the dictionary.


=================================================================
-------------------------PENTIUM REPORT # bug31----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15798 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic
Subject: Re: seriousness of intel FDIV bug
Date: 18 Dec 1994 18:52:39 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 69
Message-ID: <3d20dn$e80@Radon.Stanford.EDU>
References: <afraser.122.2EDD21BD@hookup.net> <3csbasINNahs@life.ai.mit.edu> <1994Dec17.222208.11858@mole-end.matawan.nj.us> <3d1cta$q33@b.stat.purdue.edu>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:25188 comp.arch:15798 comp.arch.arithmetic:835

In article <3d1cta$q33@b.stat.purdue.edu>,
Herman Rubin <hrubin@b.stat.purdue.edu> wrote:
>Before a bug is known, there should only be ordinary
>liability.  But when a bug becomes known to a manufacturer or vendor,
>it becomes incumbent on the INDIVIDUALS responsible to take all 
>reasonable steps to see that the users are notified.

This is not at all clear-cut.

PAST.  You have to understand the social hierarchy of bugs, there's a
nontrivial entomological pecking order here.  Showstoppers mean war:
they are hunted down mercilessly and shot on sight, no questions
asked.  Felons putting users at serious risk generate an all-points
bulletin.  Pickpockets are everywhere, and when there's a rash of them
people are simply warned to watch their wallet.

Industry practice today is not to report every known pickpocket to
every tourist entering town.  Too expensive and no one cares ("You have
pickpockets?  Oh dear, now how are your restaurants?").  The really
hard question in the case of the Pentium bug is, when did the
seriousness of this bug for at least number crunching and high finance
become clear to Intel?  Up to a certain point, industry standards
permit Intel to simply log the bug and deal with it in due course
without notifying users.

Just as it is impossible today to quantify how typical applications are
distributed along the error-rate spectrum, so is it impossible today to
fault Intel for tardiness in reporting the bug: Intel is innocent until
proven guilty.  Neither issue can be resolved without a lot more work.

PRESENT.  Why has Intel not modified its stand on the likelihood of
errors?  My guess: they placed their bet when their horse, Low Rate,
looked to them like an odds-on favorite, back when Intel was the only
one making book on the event, for the simple reason that Low Rate
seemed then to be the only horse in the race.  Then up rolls a slew of
owners and bettors, whom the media can't tell apart, the owners put
their horses in the race, the bettors plunge heavily on the newcomers,
and pretty soon Low Rates' odds start stretching out.  What do you do?
Do you take back your bet?  If you really believe in Low Rate yourself
all along and she wins, you'd look a complete ninny if you had in the
meantime taken your bet back merely under peer pressure.

Intel is both owner of and heaviest bettor on Low Rate.  IBM and I own
Two Digits and Bruiser respectively, but have not joined the crowd at
the bookie's.  (IBM pokes two digits in the divisor's eye, I give it my
regulation bruising.)  I can't speak for IBM when I say that I would be
very happy for all the bettors on Low Rate if she wins, and very sorry
for Intel if not---in fact it would be a major tragedy.  That it
*would* be a tragedy however is no excuse for sticking your head in the
sand during the race, like the Dataquest "fire drill" report which with
Monty Pythonesque logic solemnly advises us all to bet on Low Rate or
face industry havoc.

FUTURE.  Should Intel recall?  With the numerical literacy test toned
down or eliminated (how's that doing, btw?), Intel's present course
seems fine: put users who want a replacement in the queue.  Here Intel
can bet that many users will assess the bug's overall impact on both
their work and eventual resale value as down in the noise compared to
the hassle and downtime of unplugging and shipping the chip and
reinstalling the new one.  For Intel not to place this bet confidently
is to betray a lack of confidence in their bet on Low Rate.

The numerical intricacy of the bug, the guaranteed transience of this
phenomenon (a more satisfactory escape in this regard than inventing
TeX), and the apparent discrepancy between these two bets, are
combining to maintain my intense and hopefully brief interest in the
bug.
-- 
Vaughan Pratt			http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug32----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 10234 of sci.math.num-analysis:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,sci.math.num-analysis
Subject: Re: Dataquest Report: The Great Pentium Fire Drill
Date: 18 Dec 1994 22:11:57 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 61
Distribution: inet
Message-ID: <3d2c3d$na8@Radon.Stanford.EDU>
References: <3cnceb$fms@newsbf01.news.aol.com> <D0vqI2.3J3G@austin.ibm.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:25240 comp.arch:15805 sci.math.num-analysis:10234

In article <D0vqI2.3J3G@austin.ibm.com>,
Sivasankar Chander <siva@bond.bocaraton.ibm.com> wrote:
> There 1738 single-precision pairs that generate errors, not necessarily
>this big. And there are only 2^46 single precision mantissa pairs, which
>is only ~ 7 * 10e13. *Big difference*.

That's the number generating *single-precision* errors.  Although these
are the natural kind to associate with single-precision operands,
dividing two floats yields a double which is only truncated to a float
after being transferred from the FPU to a 32-bit general-purpose
register.  Thus 1/3 is correct to 52 bits internally in the FPU, and
that's what you get if you unload it into 64 bits.

Here's a table for all combinations of input and output precision.


			NUMBER OF PENTIUM ERRORS
		by precision of operands and quotient

				 Quotient
				Precision
		
                  --------------------------------
		  |	Single	   |	Double	 |
                  |------------------------------|
	Single    |	 1738	   |	 9915	 |
 Operand          |		   |		 |
Precision         |		   |		 |
	Double	  |   ~4.0*10^20   |   ~23*10^20 |
                  --------------------------------

The figures for single precision operands are exact to within choice of
definition of error.  The figures for double-in-double-out are obtained
dividing the 2^104 mantissa pairs by 9 billion, Intel's figure for the
number of divisions per error.  I obtained double-in-single-out by
scaling in proportion to the line above.

A naive model of the passage from single to double precision operands
is to simply assume that x/y is in error just when (float)x/(float)y is
in error.  This is not true, but the hope is that what you lose on the
swings you win back on the merry-go-round.  With that model the lower
line becomes 5.0*10^20 and 28*10^20.  I don't understand what diluting
mechanism leads to a figure 20% lower than this simplistic estimate,
but Intel arrived at its figure by two independent means whereas so far
I've only used one.  There may well be a good explanation for the Intel
figure, so that's what I've gone with.

There is no unique or even obviously best definition of
single-precision errors.  I take a single-precision error to be one for
which x - (x/y)*y before normalization differs from x in the
single-precision part of the mantissa (first 23 bits not counting the
hidden bit).  This differs very slightly from some other definitions,
and by a potentially very large amount from the definition Intel uses,
namely the first bit position where the quotient is in error---this is
how Intel finds the largest possible FDIV error to be in the 12th bit
when my definition locates it in the 14th.  The error in 10000000 vs.
01111111 is in bit 1 or bit 8 according to whether you use Intel's
definition or mine respectively.
-- 
Vaughan Pratt			http://boole.stanford.edu/boole.html
				4.999999/14.999999 has 6 3's, not 4


=================================================================
-------------------------PENTIUM REPORT # bug33----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 25253 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Dataquest Report: The Great Pentium Fire Drill
Date: 18 Dec 1994 23:05:31 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 41
Message-ID: <3d2f7r$p0n@Radon.Stanford.EDU>
References: <1994Dec14.203004.655@news.media.mit.edu> <3d1np2$m4o@newsbf01.news.aol.com> <3d230j$fom@Radon.Stanford.EDU> <dmurdoch.1324.787786000@mast.queensu.ca>
NNTP-Posting-Host: sunburn.stanford.edu

In article <dmurdoch.1324.787786000@mast.queensu.ca>,
Duncan Murdoch <dmurdoch@mast.queensu.ca> wrote:
>In article <3d230j$fom@Radon.Stanford.EDU> pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes:
>>
>>If someone you've never heard of claims that 2+2=5, would you hold your
>>tongue on the off-chance that they might just have the necessary
>>authority to back it up?  Your logic is no better than Dataquest's.
>
>Dataquest's article was consistent and reasonable; it had no logical 
							      ^^^^^^
(Linear logic's inventor Jean-Yves Girard would be very pleased with
this identification of logic and arithmetic.  A Good Thing too.)

>inconsistency like 2+2=5.

To the contrary, Dataquest endorsed much *larger* arithmetic errors
than 2+2=5, and I quote, "Financial users could conceivably be affected
by the bug.  However, the size of the error is again likely to be
immaterial, representing hundreds of dollars in a million-dollar
transaction."

(Granted that the *relative* error of hundreds out of a million is
smaller than 1 out of 4, but I hope I will be permitted one brief use
of absolute error to make my point...)

Ross Perot is entitled to claim 2+2=5 if he's willing to make up the
difference with his own money.  Not that it would last very long.  If
the Pentiums in the US are collectively chugging along at 10^14
instructions per second, with as few as one FDIV per million
instructions, that's still 9*10^9*10^6/10^14 = 90 seconds between
errors even for random data.  But that rate of divisions might be way
off, and it's anyone's guess what proportion of the daily fluctuations
in the deficit is attributable to undetected division errors.  At what
odds would you feel safe betting it's less than $1000?

To Keith Sullivan ("Yes, that you're windy"): if I told him to go fly a
kite he'd do it to make his point.  I do worry about it though, just
not enough to do any good.
-- 
Vaughan Pratt			http://boole.stanford.edu/boole.html
				4.999999/14.999999 has 6 3's, not 4


=================================================================
-------------------------PENTIUM REPORT # bug34----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15810 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic
Subject: Re: seriousness of intel FDIV bug
Date: 19 Dec 1994 00:39:32 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 42
Message-ID: <3d2ko4$rkc@Radon.Stanford.EDU>
References: <afraser.122.2EDD21BD@hookup.net> <3d20dn$e80@Radon.Stanford.EDU> <jnavasD10u2w.3Fp@netcom.com> <3d2fv3$q64@Sequoia.picosof.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:25276 comp.arch:15810 comp.arch.arithmetic:838

In article <3d2fv3$q64@Sequoia.picosof.com>,
Bob Richardson  <brichard@sensemedia.net> wrote:
>   His contention about industry standards is quite correct.  I have

Thanks, Bob.  Quite apart from saving my bacon, this was a wonderfully
informative post.  I'll resist the strong temptation to reply to it
all, and take up just one very interesting point that others have
already addressed, but not nearly as often and vociferously as it
deserves.

>   However, this brings me to an interesting point vis-a-vis Professor
>Pratt's previous post regarding the error rate, and the difficulty
>of determining a valid error rate since we can't define a "typical
>user".  Is the cat in the box dead or alive?  Well, I'd say we can
>infer a rather high error rate given the rather short time which
>elapsed since the i486 price cuts and the conversion of the marketing
>channel to Pentium based machine and the discovery of the error(s).

Like any crime, the pecking order for numerical errors is, undetected,
detected, corrected.  Your estimate of the *detected* error rate (if
I've interpreted you correctly) is probably not too different from any
else's, and Mathworks is addressing the corrections issue.  Trouble is,
no one has so far convincingly subtantiated their estimate of the
*undetected* rate.

If the Pentium's current contribution to the deficit measured in
dollars per second can be shown to be either more or less than the cost
of aerosol-induced global warming I would love to see the
calculations.

These errors are insidiously small---just small enough to get in under
the radar, just large enough that in the enormous quantities implied by
millions of Pentiums, Andy's meteors have the potential to bomb the
place to hell in a handbasket in a matter of hours.  I cannot imagine a
more effective way of screwing up the numerical computing world than
omitting a few little entries from the partial divisor table.  Andy
Grove nothing, this is interplanetary sabotage.  Aliens stole Intel's
brain, and it's anyone's guess what they are now planning to do with
it.
-- 
Vaughan Pratt			http://boole.stanford.edu/boole.html
				4.999999/14.999999 has 6 3's, not 4


=================================================================
-------------------------PENTIUM REPORT # bug35----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 25326 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Pentium 90 for a Pentium 100?
Date: 19 Dec 1994 04:18:20 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 37
Message-ID: <3d31ic$4j4@Radon.Stanford.EDU>
References: <1994Dec18.182746.80926@kuhub.cc.ukans.edu>
NNTP-Posting-Host: sunburn.stanford.edu

In article <1994Dec18.182746.80926@kuhub.cc.ukans.edu>,
Jim Wilson  <dwilfong@falcon.cc.ukans.edu> wrote:
>Has anyone tried when they got there replacement chip asked if
>you can give them some extra dough and just get a P100?  Just curious
>cause that is what I would like to do if at all possible, since I have
>to get a new cpu anyway might as well get the best they got :)

Disclaimer: the following is intended purely as a facetious
recommendation; you follow it at your own risk.

I estimate the probability of encountering an error running a randomly
selected P90 at 100 MHz to be < one millionth of the chance of
encountering the Pentium bug.  This on the basis of having seen the
exact same million bugs on two P90's running at respectively 90 and 100
MHz, and never having had any other problem.

You could be unlucky: due to manufacturing variations my P90 might be
be especially good at adapting to 100 MHz (making all readers of this
message unlucky), or your P90 might be especially bad at it (just you
are then unlucky).  Unlike the division error, this is a crap shoot.
The division error guarantees you the same bugs on every processor.

So put your faith in those of us who understand statistics better than
you, and your sped-up P90 will be good for the next 27 billion years.
Leaves Japanese quality in the dust, not to mention you.  But you will
have been able to leave that extra dough to your kids.

These remarks are to be read in the context of table 5-1 on page 13 of
the Intel White Paper, which gives mean times between failure for 8Mb
of DRAM without ECC, particle defects in the Pentium chip, 8Mb of DRAM
with ECC, and Pentium FDIV on a typical spreadsheet, as respectively 7,
200-250, 700, and 27,000.  Units are years.

Reread the disclaimer at the top.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug36----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 25328 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: alt.sys.pc-clone.gateway2000,comp.sys.intel
Subject: Re: Why does Intel want the bad ones back?
Date: 19 Dec 1994 04:30:15 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 34
Message-ID: <3d328n$4sc@Radon.Stanford.EDU>
References: <mckinley.155.00008578@msupa.pa.msu.edu> <jwalper.29.0012FB99@bbs.sd68.nanaimo.bc.ca> <3d2vk9$kv4@sundog.tiac.net>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU alt.sys.pc-clone.gateway2000:25531 comp.sys.intel:25328

In article <3d2vk9$kv4@sundog.tiac.net>,  <bobz@kirwaido.trystero.com> wrote:
>In <jwalper.29.0012FB99@bbs.sd68.nanaimo.bc.ca>, jwalper@bbs.sd68.nanaimo.bc.ca (John Walper) writes:
>|In article <3csq40$f4p@lynx.dac.neu.edu> csolomon@lynx.dac.neu.edu (Chad Solomon) writes:
>|>Subject: Re: Why does Intel want the bad ones back?
>|>From: csolomon@lynx.dac.neu.edu (Chad Solomon)
>|>Date: 16 Dec 1994 14:34:24 -0500
>|
>|>> Does anyone know why on earth Intel would want to pay Fed Ex to return
>|>> the bad chips? Since when can they be repaired? Just curious ...
>|
>|> Isn't it obvious, they are probably going to turn around and reuse
>|>them. ;-)
>|...
>
>It's the only way Intel has of confirming they are replacing a 
>chip rather than just sending them to whoever claims to have one.
>I don't think this is an issue.

If the system is still under warranty the same procedure should apply
as for any system under warranty: you get a replacement part and keep
the old part.  This point was made in a letter in today's SJ Mercury
News.

This particular fault in the warranted merchandise is a little out of
the ordinary however.  Usually something either works or it doesn't,
more or less.  Here it works for everyone, except for very occasional
and barely noticeable changes to one's data.  It seems to me that the
enormous temptation to keep replaced units in service or to sell them
as barely flawed merchandise makes a whole new ballgame out of the old
warranty rules.  High school debating teams would have a field day with
this one.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug37----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 25703 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: alt.sys.pc-clone.gateway2000,comp.sys.intel
Subject: Re: Why does Intel want the bad ones back?
Date: 20 Dec 1994 05:26:26 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 46
Message-ID: <3d5pu2$ove@Radon.Stanford.EDU>
References: <mckinley.155.00008578@msupa.pa.msu.edu> <3d2vk9$kv4@sundog.tiac.net> <3d328n$4sc@radon.stanford.edu> <1994Dec20.013555.14346@mercury.ncat.edu>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU alt.sys.pc-clone.gateway2000:25652 comp.sys.intel:25703

In article <1994Dec20.013555.14346@mercury.ncat.edu>,
 <hkennedy@mercury.ncat.edu> wrote:
>In article <3d328n$4sc@radon.stanford.edu> pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes:
>>If the system is still under warranty the same procedure should apply
>>as for any system under warranty: you get a replacement part and keep
>>the old part.  This point was made in a letter in today's SJ Mercury
>>News.
>
>I just got a note from intel saying the reason that the chips were being
>collected and destroyed was to avoid the possiblity that someone would buy
>a buggy chip and ask for a new chip again. :-) Sounds like to me that would
>happen. I could see it know!

The following are only my impressions, I'll gladly defer to the facts.

Apples and oranges.  Intel is not replacing chips under warranty.  If
you tell Intel you have a chip they will mail you one.  Anyone in the
US can do this.  You must pay for what they send you within 30 days,
either on your credit card or by sending them a chip.

A warranty replacement of the chip would follow a procedure that
included an audit trail of where and when you acquired the chip,
including showing receipts when requested.  Unlike Intel's procedure,
it should not be possible to get two warranty replacements on the one
receipt.

There is nothing wrong with Intel's procedure, it simply is a different
animal from a replacement under warranty.

My take on the mini-exam is that Intel was trying to sort these phone
orders for good Pentiums by urgency of need.  Given the limited supply,
the presumed goal of giving existing customers priority, and the ease
with which anyone can call up to order a good Pentium, one can
understand Intel trying to come up with *some* workable solution.  Too
bad they came up with the exam solution; with 20/20 hindsight it's hard
to imagine a worse one PR-wise, just like it's hard to imagine a worse
way to mess up the FPU than with radar-invisible mosquitoes delivering
malaria most people won't catch, and which took doctors more than a
year to notice.

It is conceivable that this incident will spawn individual component
warranties for expensive components in the future.  Warranties give
manufacturers a great competitive edge provided they can afford them.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug38----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 25776 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: REPOST: The Hard-Wired Example (confirming IBM's findings)
Date: 20 Dec 1994 12:19:01 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 172
Message-ID: <3d6i3l$6j0@Radon.Stanford.EDU>
References: <3d4i5t$t6@tuba.cit.cornell.edu>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3d4i5t$t6@tuba.cit.cornell.edu>,
Allen Kim <ak10@crux4.cit.cornell.edu> wrote:
>
>		A			B
>1	("alpha")		(Number of months)
>	= 1/12 = 0.083333333	22.5
>
>A2, B1, and B2 are hard-wired, and the rest are calculated values.
>
>Notice that in this implementation, I got no error!...Wondering if IBM lied, I
>decided to change A1 from a calculation into a hard-wired number, so
>I typed 0.083333333 exactly into the cell, and then here are the
>results: ....
>This case gives the error that IBM publishes.  The only difference was
>that in the first example, "alpha" was calculated from 1/12, and the
>second example, "alpha" was typed straight in.
>...
>The key words in IBM's example are "hard wired."  In other words, you
>must do IBM's example *exactly* as explained in order to hit the
>Pentium bug.
>
>Now what does this say about IBM and its text on the Pentium bug?
>After all, IBM did say that their example was a case where
>"hard-wired" constants can be put into a program or spreadsheet, and
>that in this case, the hard-wired constant can lead to an error.
>
>But this example seems too artificial for me.

The same thing bothered me too, and I asked IBM's Fred Gustavson about
this.  He said that many organizations mandate, as specific truncated
decimals, constants that you and I would find more natural to write as
rationals.  I challenged him to produce examples.  He complained he was
too busy with management stuff, but a day later he sent me the IBM
federal credit union's quoted daily rates for secured, realty,
signature (?) and auto loans.  Expressed as yearly rates these are
respectively 9%, 9.5%, 13.5% and 10%.  The credit union however gives
you actual daily rates that *they* calculate, not you (who knows, maybe
IBM has a system where they call up Fred to ask him what number of
decimal digits will put the Pentium in the worst light :).  The current
daily rates are specified in the right hand column as

	 9/365      ~	.024658
	 9.5/365    ~	.026027
	13.5/365    ~	.036986
	10.0/365    ~	.027397

Ok, so these are real examples culled off the wall (or wherever they're
posted) of a working office.  Let's see whether any of them them lead
to an error.  Fred's suggestion is to convert them back into yearly
rates (which will be more official than the original annual rates they
were derived from if the daily rates happen to be the official ones, I
don't know in this case but say they were).  So we multiply them all by
365.

Now we want to know if any of these four numbers are bad divisors.
(Why are we dividing by them?  Who knows, maybe to get a ratio with
some competing rate, my accountant says such comparisons are done this
way a lot.)  To do this it helps to understand the five *Coe intervals*
as I'll call them found in every octave [x,2x).  Within octave [16,32)
they consist of all half-open intervals of width 1/256 ending at an
integer multiple of three, of which there are five, namely
[18-1/256,18), [21-1/256,21), [24-1/256,24), [27-1/256,27), and
[30-1/256,30).  To within a power of two, all *known* bad divisors
(this is not yet a theorem!) lie in one of these intervals, with the
riskiest divisors concentrated close to the top of each interval.

It is easier to work with unit intervals, so we scale everything by 256
to the octave [4096,8192), where these intervals become [4607,4608),
[5375,5376), [6143,6144), [6911,6912), and [7679,7680).

Doing all this for our four rates, we get:
							  Nearest pentad
	.024658*365*512  =  9.000170*512  =  4608.087040    4608
	.026027*365*512  =  9.499855*512  =  4863.925760    4608
	.036986*365*512  =  13.49989*512  =  6911.943680    6912
	.027397*365*512  =  9.999905*512  =  5119.951360    5376

The second and fourth rates are very safe, nowhere near any pentad.
The first rate just *barely* misses the first Coe interval (whew); if
the credit union had just rounded it down, giving 4607.900160, they'd
have landed in the top of that interval.  The third rate however lands
right in the fourth Coe interval, very close to the top.

Given that these intervals constitute only 5/4096 of this octave, it is
scary that half of these "bruised annual rates" land in or almost in
one, and at the high risk end at that.

Ok, so 13.49989 is risky as a divisor, but how risky?  To test this I
divided it into random reals in [0,1], generated by drand48().   The
error rate was one in every four million divisions.  The errors were
uniformly distributed (by exponent) starting from 8*10^-6 and going
down all the way to 10^-18 from there.  A mildly bad divisor.

An example bad dividend for this divisor is 124.31147, which gives a
relative error of 6*10^-6.  Here are some more high-error dividends,
with their corresponding relative errors.

32.189679	2.4*10^-07
45.689569	1.7*10^-07
62.155735	7.9*10^-06	this is one is off by 1 in 130,000
64.379358	2.4*10^-07
91.379138	1.7*10^-07

You might say that this one in 4 million rate was so much faster than
one in 9 billion because we stood under the meteor 13.49989.  But this
meteor came from one of four mandated loan rates Fred got from a credit
union and a perfectly natural multiplication.  Not that I know what a
signature loan is.

==========

If the interval-checking arithmetic seems like too much work, you can
take the easy way out and do what I really did (I worked out the above
table after the fact so people could see behind the scenes).  I just
applied the following bit-hacking algorithm that checks whether its
argument lies in a Coe interval.  I use this procedure to find the
5/4096 bad divisors cases when collecting statistics by a search
involving many divisors.

risky_divisor(double x) {
	return ((char*)(&x))[5]==-1 && ((0x2492>>(((short*)(&x))[3]&0xf))&1);
}

The first half of the test checks that bits 5-12 are all 1's (for
safety this might better be 5-10).

==========

>First of all, people
>using spreadsheets would not convert from months to years by
>hand-typing a decimal factor that goes to the ninth digit.  After all,
>it's much more convenient to type in X*(1/12) than X*0.083333333, plus
>it's more readable, too.  People looking at the spreadsheet can easily
>understand what's going on.

Indeed.  When will the people who dictate these constants discover
rationals?

>Andy Grove stated that IBM is "finding exactly when and where a meteor
>will strike and going there themselves."  Somehow, Mr. Grove's
>statement doesn't sound so ridiculous anymore.

Your arguments point up the extreme reasonableness of Andy Grove's
position.  Gustavson's arguments point up the ease with which
probabilities can be grossly misestimated using such perfectly
reasonable reasoning.  (Marilyn vos Savant (Ask Marilyn) gets a lot of
mileage out of this sort of thing; she's not so hot on physics but
she's got good statistics intuition.)

>Then there are also people (like me)
>who think the issue is exaggerated and will blow over in about two
>months, depending on how fast Intel can produce fixed chips.

I very much hope it does.  It might blow over (or at least defuse
itself a little) if only Intel would enter into more open dialogue
about all these probabilities and possibilities everyone is bandying
about.  Unfortunately the litigation aspects may be putting a big crimp
in this.  Earlier I suggested aliens stole Intel's brain, but the truth
may well be that it was lawyers.  What *they're* going to do with it is
plain as day: make scads of money off it.

>The point is that IBM has succeeded in creating more fear over the
>Pentium.

Years of practice, and don't forget the uncertainty and doubt.  With
the astronomical range of error rates and the general lack of
information about most applications' profiles, it's not half surprising
that there's so much uncertainty and doubt about this issue, with or
without IBM.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug39----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 10271 of sci.math.num-analysis:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,sci.math.num-analysis
Subject: Three precisions if you count extended (was Fire Drill)
Date: 20 Dec 1994 13:09:46 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 75
Distribution: inet
Message-ID: <3d6l2q$7v1@Radon.Stanford.EDU>
References: <3cnceb$fms@newsbf01.news.aol.com> <D0vqI2.3J3G@austin.ibm.com> <3d2c3d$na8@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:25783 comp.arch:15861 sci.math.num-analysis:10271

In article <3d2c3d$na8@Radon.Stanford.EDU>,
Vaughan R. Pratt <pratt@Sunburn.Stanford.EDU> wrote:
>Here's a table for all combinations of input and output precision.
>
>
>			NUMBER OF PENTIUM ERRORS
>		by precision of operands and quotient
>
>				 Quotient
>				Precision
>		
>                  --------------------------------
>		  |	Single	   |	Double	 |
>                 |------------------------------|
>	Single    |	 1738	   |	 9915	 |
> Operand         |		   |		 |
>Precision        |		   |		 |
>	Double	  |   ~4.0*10^20   |   ~23*10^20 |
>                  --------------------------------

Well, I blew this big time.  I just realized that the 9915 figure I
measured counts *all* Pentium bug errors, *including* those that happen
in the 12-bit extension extending the 52-bit mantissa to 64 bits.  I'd
checked the number itself several times and couldn't find anything
wrong with any of the arithmetic, so I finally went with it.  This led
to a 20% discrepancy between Intel's estimate of the error rate and the
naive 2^58 scaling of 9915, which I remarked on as unexplained in my
post.  The explanation turned out to be my stupidity.

Here's a corrected table, which moves the 4 entries in the above table
apart to make room for 5 more entries in a 3x3 table, which makes
extended precision a possible precision for both operands and result
independently.  I obtained line 1 by an exhaustive search, see
/pub/FDIV/individual.bugs/bug26 on boole.stanford.edu.

			    NUMBER OF PENTIUM ERRORS
		     by precision of operands and quotient

				 Quotient
				Precision
		____________________/\__________________
               /                                        \
              ---------------------------------------------------------
             |     Single     |   Double     |  Extended  |   Total    |
             |---------------------------------------------------------|
             |                |              |            |         13 |
       Single|      1738      |     7863     |    9915    | 7.036*10   |
Operand      |                |              |            |            |
Precision    |           20   |          21  |         21 |         31 |
       Double|   5.009*10     |  2.266*10    | 2.858*10   | 2.028*10   |
             |                |              |            |            |
             |           27   |          28  |         28 |         38 |
     Extended|   8.404*10     |  3.802*10    | 4.795*10   | 3.403*10   |
             |                |              |            |            |
              ---------------------------------------------------------

The heuristic from getting lines 2 and 3 from line 1 is simply to
multiply by 2^58 and 2^82 respectively, on the win-some lose-some
theory that makes "buggy(x/y) iff buggy((float)x/(float)y)" a good
indicator despite not being a fact.

For any row, dividing the Total column by the Double column yields the
infamous 9 billion number, 8.95 to be precise.  This is exact for
single precision operands with double precision results, but scaling by
2^58 is merely a heuristic for extending this to the next row, and as
such should be taken with a (hopefully small, possibly very small)
grain of salt.

Modulo any fine tuning of lines 2 and 3 as more data comes to hand, and
any picky complaints about my definition of quotient precision (other
definitions should have only a tiny effect), I am optimistic that this
table has converged at last.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug40----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 25797 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: A simple example for Pentium FDIV flaw
Date: 20 Dec 1994 14:08:21 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 94
Message-ID: <3d6ogl$a2t@Radon.Stanford.EDU>
References: <19941219.151430.414@almaden.ibm.com>
NNTP-Posting-Host: sunburn.stanford.edu
Keywords: Pentium FDIV flaw simple example

In article <19941219.151430.414@almaden.ibm.com>,
 <agarwal@watson.ibm.com> wrote:
>
>Here is another simple example which demonstrates the FDIV bug.
>
>     Q = (2148959.8 - 1100000.8) / (2.3 - 0.8)
>       =   699263.33  -  Pentium Answer
>           699306     -  correct answer is an exact integer

When I saw Agarwal's first example of this kind, 4.1-1.1 as bruised 3,
I was both startled and impressed.  I have since been looking at this
general class of examples to get a sense of how serious they really
are.  I have measured the following densities of "bad divisors,"
defined as reals lying in one of the Coe intervals.

The first line of the following table gives the probability that a
randomly chosen pair of the form d.d-d.d > 0 (e.g. 2.3-0.8 as above)
will be a bad divisor.  There are roughly 5000 candidates, and 31 are
in the region of risk.  (The problem of counting these turned out to be
riddled with traps for young players, it is easy to generate other
kinds of noise besides the one you're studying, as a side of effect of
the search process itself.)

d.d - d.d       1 in 160        (31/5000)
dd.d - dd.d     1 in 360        (1392/500000)
d.dd - d.dd     1 in 750        (666/500000)
dd.dd - dd.dd   1 in 2300       (21936/5*10^7)
Random          1 in 800        (5/4096)

When a digit is added to the left (numbers up to 99.9) the probability
drops by more than a half.  When added to the right however,
corresponding to dollar amounts up to $9.99, it drops more than
fourfold.  Dollar amounts up to $100 are at such low risk that they are
lower than random, *assuming a uniform risk in each risky interval*.

This is a bad assumption.  The reason is that each risky region
concentrates most of the risk right at its high end.  Thus a random
number in the risky region is not automatically put at *high* risk.  In
contrast, when d.d-d.d enters a risky region, it will be because it is
bruised in the least significant bit.  I'd like to say this puts it at
significantly higher risk, but anything *that* close to a pentad (the
right end of the risky region) is I suspect somewhat or even a lot
lower risk than just a tiny amount lower down.  This needs to be
measured more carefully.

My conclusion pending such measurements is that d.d-d.d may well be
reasonably serious, because of its relatively high probability and
because this size of data may arise sufficiently often as to present a
problem.  The three-digit operands to subtraction are on somewhat
weaker ground, and dollars up to $99.99 would need to be investigated
more closely before I would accept any claim about their high
probability.

I would like to contrast isolated examples such as 4.1-1.1 and 2.3-0.8
with my various tables, starting with those in bug1 and bug2 (on
boole.stanford.edu in /pub/FDIV/individual.bugs).  There I simply
*defined* three parameters without setting them to specific values,
number of digits to the left of the decimal point (<= 3 in bug1, <= 2
and <=1 in bug2, >3 still needs to be explored but I am sure the
probabilities drop off very fast), amount of bruising (ranging from
10^-3 to 10^-15), and tolerance for resulting error (similar very wide
range), and asked what was the *exact* probability of an error as a
function of those parameters.  Since the sample space is feasibly small
the probability can be measured directly by exhaustive search instead
of via Monte Carlo methods.  One then obtains an interesting shape from
which it becomes clear where the highest risks are.  This shape can be
put to various uses; for example one can integrate over regions of this
shape when one knows that this region is where one's program remains.

In the light of Toon Moene's just-posted observations about his
numerical weather prediction program and its low rate of buggy divides,
which he attributes to his data being sufficiently random (exactly as
Dik Winter predicted for Toon a few days ago), it would seem that
operands with enough decimal digits, perhaps four, perhaps six, with
these being reasonably uniformly distributed, should be subject to the
basic 1 in 9 billion rate (see my recently posted 3x3 table for how
this depends on input and output precision).  For *clean* (i.e. integer
or unbruised) data with only three decimal digits, there is no risk at
all, but with any bruising the probabilities go up, and with bruising
at around 10^-7 they go up very sharply.  Driving the number of digits
down to two digits increases it further yet.

It seems highly plausible to a number of people that these small
integers should arise often, and in a context where they may be subject
to bruising.  However there have not to date been any sightings of
ominous clouds of such mosquitoes gathering for a strike.  If it turns
out that these are rarely encountered by most users if at all, then
this potentially major source of concern may in practice not be so big
a deal.  This was a distinct possibility that I discussed carefully at
the end of bug1.  Ruling this sort of possibility out to any given
user's satisfaction however may be very hard.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug41----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel,comp.arch.arithmetic
Subject: Re: Pentium FDIV bug details and detection with OBDDs
Summary: 
Expires: 
References: <3d6pjo$3i@cantaloupe.srv.cs.cmu.edu>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

In article <3d6pjo$3i@cantaloupe.srv.cs.cmu.edu>,
Randy Bryant <bryant+@N3.SP.CS.CMU.EDU> wrote:
>In article <3d6lk2$87d@Radon.Stanford.EDU>
>>Vaughan Pratt <pratt@boole.stanford.edu> responded to my earlier post:
>
>Like any bug, once you've identified it, there's no need for a fancy
>tool to detect its presence.  What I'm trying to establish is whether
>the Intel engineers could have used OBDDs to detect a design flaw
>prior to fabrication.  The advantage of OBDDs, vs. other tools such as
>theorem provers, is that we can generate an OBDD representation of the
>circuit functions directly from a gate-level representation of the
>circuit.

Oh, *a* design flaw, not this one.  OBDDs should be very good for
finding design flaws.  I thought you were trying to get some mileage out
of The Bug.

Btw, BDD's may be state of the art for CAV, but I predict a backlash
within a year or so from the deduction community, for whom BDD's are a
nice heuristic but nevertheless a simplistic bandaid.  BDD's need logic
as an equal partner, not as a discarded aging spouse.  (Not only am
I not acting in the service of Intel or IBM, I don't do logic and I
don't do category theory, I do set theory and menu (antiset) theory,
a potent pair.  See my url/ftp below.

>As has been stated by others: this bug is maximally
>insidious---it doesn't show up on enough cases and doesn't create
>enough error to be caught by even extensive simulation, but it's bad
>enough to be unacceptable for at a large class of computations.
>
>I'd like to demonstrate the feasibility of my approach by generating
>a model for one iteration of the division algorithm.  That's why I'm
>trying to learn more details. 

Your last paragraph has no logical connection to your second last.  To
make a connection you need a bug that resulted from missing a case in
designing the table.  BDD's are overkill for verifying that P = T
(copying the table) was executed successfully, you can do that with
verification techniques known to Alan Turing half a century ago.

Unless, that is, you were proposing say to develop a version of the
division algorithm that could survive any 5 random hits on its table.
Now *that* would be *really* something that BDD's could go to town on!
=================================================================
-------------------------PENTIUM REPORT # bug42----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Re: New documents available on http://www.intel.com/
Summary: 
Expires: 
References: <1994Dec20.153214.3161@news.media.mit.edu>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

In article <1994Dec20.153214.3161@news.media.mit.edu>,
Ron Newman <rnewman@media.mit.edu> wrote:
>
>    ANALYSIS OF FLOATING POINT DIVIDE CALCULATIONS IN SPREADSHEET
>    APPLICATIONS IN THE COMMERCIAL PC MARKETPLACE
>
>    by  M. L. Barton, Ph.D., Staff Computational Scientist, and
>    R. A. Passov, Senior Treasury Manager, Intel Corporation
>
>This paper explicitly disputes both IBM's and Vaughan Pratt's analyses
>of the "bruised integer" divisor problem.

Not true in my case, they only explicitly disputed IBM's analysis.  Had
they disputed mine I would be interested to know the particular sentence
I wrote that they found objectionable, and would be happy to focus on
it.  As it stands however this argument is strictly between Intel and
IBM, my analysis simply does not enter into their paper.

Except possibly at one point, where they be inferring from my bug1
paper that because no small bruised integer generates errors more than
100x times the bruise, therefore *no* bruised integers do so.  This is
a serious fallacy: the 4.1-1.1 example can generate bits in the 14th
bit, as Intel could and should have easily verified for themselves, and
would have were there as much incentive for them to consider this
particular case carefully as there evidently is for IBM.
=================================================================
-------------------------PENTIUM REPORT # bug43----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15884 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: THE Pentium WAR IS WON! (Intel now replaces Pentiums on demand)
Date: 21 Dec 1994 00:32:14 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 57
Message-ID: <3d7t2e$2p8@Radon.Stanford.EDU>
References: <BAI.94Dec20220839@xi.iesd.auc.dk>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15884 comp.sys.intel:26083

In article <BAI.94Dec20220839@xi.iesd.auc.dk>,
Bo Nygaard Bai <bai@iesd.auc.dk> wrote:
>
>BTW., it's customary to leave the battlefield after a battle :-)

This was just a small ~zero-sum skirmish, and is what my
PAST-PRESENT-FUTURE message called the future:  eventually Intel would
do this.  It wasn't the battle some of us were fighting (they promised
me my replacement two weeks ago), and it seemed obvious from the
beginning that as supplies improved Intel would soften.  The outcry
simply softened Intel faster than it had planned.  The
hardline-in-the-beginning strategy backfired horribly and in retrospect
was a PR blunder of the first magnitude.  (You may see, but I can't, a
reason to presume that they ever seriously intended to hold their hard
line forever.)

PAST.  Some feel that Intel should have notified users as soon as it
found out about this bug.  This one's probably going to find its way
into court and is way outside my ability to judge.  As I've said
before, Intel's initial perception of the bug *might* have been that it
was just another of a hundred annoying little bugs a month; proving the
opposite may turn out harder than some people are estimating.

PRESENT.  Intel wants its customers to believe that a buggy Pentium is
safe.  Motivation: minimize the number of people going to the bother of
actually replacing their chip.  I think this is an excellent strategy
for Intel, with one caveat: no fair wilfully misleading the customer in
order to get this effect.  The two facts, that the temptation to do so
is enormous, and that Intel has not issued what most of us would regard
as a reasonable statement discussing the risks, put Intel's atypical
customers, whatever the notion of atypical eventually turns out to be,
at unreasonable risk, and at the same time makes it much harder to be
sure that one really is a typical Intel customer.  How do you know
Intel is being forthright with you when it will not enter into any
two-way public dialogue on the risks.

The best-case scenario all round would be if everyone turned out to be
a typical Intel customer.  However we have heard quite enough by now
from Intel about what a low risk many or most customers are at, and
Intel is starting to sound like a cracked CD on this aspect of the
situation, as are the rest of us (did-didn't-did-didn't).  When are we
going to get some unprejudiced discussion from Intel of the danger
regions, who might be at risk from them, and when people *should*
worry?  They have to be the authorities on what their chip might or
might not do to people and their bank balances and their planes.  They
can't leave it to IBM and random windy netters like myself churning out
messages that seem in direct contradiction to Intel.  To do so is to
put at risk all those customers not covered by the loudly implied
"Intel warranty," that if you are typical you are safe.

On this present point the battle has only just begun.  It will be
interesting to see if it turns out in a way in which there are any
winners at all.  Like all the important battles of this world, this
battle is not a zero-sum game.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug44----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15896 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel,rec.crafts.jewelry
Subject: Re: THE Pentium WAR IS WON! (Intel now replaces Pentiums on demand)
Date: 21 Dec 1994 04:56:41 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 41
Message-ID: <3d8ci9$ab1@Radon.Stanford.EDU>
References: <BAI.94Dec20220839@xi.iesd.auc.dk> <3d7t2e$2p8@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15896 comp.sys.intel:26202 rec.crafts.jewelry:2013

In article <3d7t2e$2p8@Radon.Stanford.EDU>,
Vaughan R. Pratt <pratt@Sunburn.Stanford.EDU> wrote:
>This was just a small ~zero-sum skirmish

It occurred to me after thinking about this a bit more that if everyone
had stopped screaming bloody murder about nonreplacement during the
past seven days, Intel might have taken heart and not repented.  If
indeed the case (how will we ever find out?) then this *was* a real
battle, and well fought, and I have no business downplaying it as a
"skirmish" merely because I had different fish to fry.

My expectation for the past week or more had been that the battle was
already over and that Intel would be meeting all requests as soon as
supplies loosened up.

What gave me pause was hearing Intel say they would take a Q4 hit for
repenting.  My first reaction was, so what else is new?  I'd figured
that any Intel shareholder who hadn't discounted this a week ago should
not be in the stockmarket.

But if Intel honestly expected that this announcement opened flood
gates that would otherwise have been secure, then the "skirmish" really
was a battle.  This could only be the case if Intel were more out of
touch with reality the whole past week than a smaller company could
afford to be for more than ten minutes at a time.

The jewelry is incredibly beautiful, btw.  I spent several shares of
Sun stock on a bunch of it for Chrissy presents.  These are the chips
themselves, no carrier but the chip is mounted like a triplex opal,
i.e. a layer of quartz or more likely acrylic or something protecting
the chip.  It has more flash than an opal (just trying to get a rise
out of rec.crafts.jewelry :), with the submicron lines forming a
diffraction grating reflecting rainbows in each of a dozen on-chip
units (would someone please tell me where the FPU is, especially the
P-D table, so I can point at people's ears and say that's my research
area).  I got some earrings, bracelets, and cufflinks (cufflinks, who
wears cufflinks?  well, they were too pretty to resist).  No
commissions yet but I can always hope. :)
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug45----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 10312 of sci.math.num-analysis:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch.arithmetic,sci.math.num-analysis,comp.sys.intel
Subject: Re: Why didn't Intel test the Pentium FDIV *tables*?
Date: 21 Dec 1994 16:08:45 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 29
Distribution: inet
Message-ID: <3d9jud$sm5@Radon.Stanford.EDU>
References: <1994Dec14.200603.28449@wdl.loral.com> <3d84mj$44v@news1.deltanet.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch.arithmetic:881 sci.math.num-analysis:10312 comp.sys.intel:26367

In article <3d84mj$44v@news1.deltanet.com>,
David K. Every <dke@deltanet.com> wrote:
>As I understand it... it is not testing the entries that causes
>problems... but you have to test various SEQUENCES through the tables.
>Which gets you back to the first problem... lots of potential
>sequences.

A simple but effective check on one's program is a test suite that
exercises every branch of the program.  This does not constitute
complete verification by any means, yet it catches a remarkable number
of bugs, and is among the most cost effective software tests.

For hardware the simplest failure mode is for the power cord to be left
unplugged.  The second simplest is for wires to be stuck at 0 or 1.

Any manufacturer planning to sell more than a billion dollars worth of
a chip, that does not include among its test suite for that chip a test
that exercises every wire to verify that it is stuck at neither 0 nor
1, does a gross disservice to its stockholders.

In the case of the Pentium division flaw, such a test would have
immediately found all five missing table entries.

This is in hindsight.  I am amazed that it takes a problem of the
magnitude of the Pentium division error to wake us all up to such
simple realities.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug46----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15914 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic
Subject: Re: seriousness of intel FDIV bug
Date: 21 Dec 1994 16:49:05 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 26
Message-ID: <3d9ma1$n7@Radon.Stanford.EDU>
References: <afraser.122.2EDD21BD@hookup.net> <jnavasD10u2w.3Fp@netcom.com> <3d2fv3$q64@Sequoia.picosof.com> <3d4h0o$17qv@locutus.rchland.ibm.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:26381 comp.arch:15914 comp.arch.arithmetic:882

In article <3d4h0o$17qv@locutus.rchland.ibm.com>,
Del Cecchi <cecchi@vnet.ibm.com> wrote:
> Problems which can cause data integrity or undetected errors are taken very
>seriously by some companies.  Much more seriously than bugs which could cause a
>program to crash once in a rare while.  

Time is money, and a crash is downtime and lost money.  This is
liveness.  Information is also money, and damaged data is therefore
also lost money.  This is safety.

In comparing safety failures to *rare* liveness failures you are
comparing apples and oranges.  What if the data damage or undetected
error happens only exceedingly rarely, and moreover the undetected
error is so tiny that it is equivalent to the error from using single
precision?  Intel continues to insist that both of these are the case,
and may well be able to persuade a jury that it was reasonable to
believe these before December, if not after.

Competitive companies take both safety and liveness very seriously.  I
conjecture that the most competitive companies are those that balance
these fundamental concerns the best.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html
In public, for the sake of decency I clothe my speculation in facts.
Privately I speculate occasionally, but you can go blind doing it too often.


=================================================================
-------------------------PENTIUM REPORT # bug47----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 26421 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Intel shouldn't give in!
Date: 21 Dec 1994 18:03:07 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 24
Message-ID: <3d9qkr$2t1@Radon.Stanford.EDU>
References: <3d9dk1$5hs@newsbf02.news.aol.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3d9dk1$5hs@newsbf02.news.aol.com>,
JohnL25490 <johnl25490@aol.com> wrote:
>          All this is gonna do is to force the manufacturers and software
>companies to put extra effort into longer, redundant testings and thus
>drives up the cost and delays the product release.

What it might do is encourage them to test every product with a suite
that exercises every component at least once.  This is a relatively
simple and short test for whether the component is present and whether
it is doing the right thing on at least one case.  Such a simple test,
which has been strongly recommended for software for years, but
evidently not for chips, would have caught the missing entries in the
Pentium FPU's partial divisor table.

This only shows the continuing enormous gap between the worlds of
hardware and software.  Closing this gap will be hard, but any
companies that do not make a strenuous effort to do so from now on put
themselves at great risk of being judged negligent by juries
investigating snafus like Intel's.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html
In public, for the sake of decency I clothe my speculation in facts.
Privately I speculate occasionally, but you can go blind doing it too often.


=================================================================
-------------------------PENTIUM REPORT # bug48----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 26409 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: THE Pentium WAR IS WON! (Intel now replaces Pentiums on demand)
Date: 21 Dec 1994 17:36:35 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 29
Message-ID: <3d9p33$276@Radon.Stanford.EDU>
References: <BAI.94Dec20220839@xi.iesd.auc.dk> <3d7t2e$2p8@Radon.Stanford.EDU> <3d8ci9$ab1@radon.stanford.edu> <3d9c0v$gep@news.iastate.edu>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15916 comp.sys.intel:26409

In article <3d9c0v$gep@news.iastate.edu>,
John Hascall <john@iastate.edu> wrote:
>Vaughan R. Pratt <pratt@Sunburn.Stanford.EDU> wrote:
>}What gave me pause was hearing Intel say they would take a Q4 hit for
>}repenting.  My first reaction was, so what else is new?  I'd figured
>}that any Intel shareholder who hadn't discounted this a week ago should
>}not be in the stockmarket.
>
>    Intel stock actually gained back 3pts after announcing
>    that they would `do the right thing'.

The stock market was invented in response to the requests of people who
have no business being in it but just can't resist a flutter.  "More
fools they" was the response, and it was born.

Some people have the good sense to use good money managers.  The three
dollars are easily accounted for by myopic money managers and knee-jerk
speculators.

Someone asked me a while back if I was going to sell Intel short.  I
said that would be insane.  Selling a company short is tying one leg to
the ground and the other leg to the company to win a bet that the
company will back up before it pulls you apart.  Would you do that with
a juggernaut like Intel?
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html
In public, for the sake of decency I clothe my speculation in facts.
Privately I speculate occasionally, but you can go blind doing it too often.


=================================================================
-------------------------PENTIUM REPORT # bug49----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Re: Intel shouldn't give in!
Summary: 
Expires: 
References: <3d9dk1$5hs@newsbf02.news.aol.com>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 

In article <3d9dk1$5hs@newsbf02.news.aol.com>,
JohnL25490 <johnl25490@aol.com> wrote:
>          All this is gonna do is to force the manufacturers and software
>companies to put extra effort into longer, redundant testings and thus
>drives up the cost and delays the product release.

What it might do is encourage them to test every product with a suite
that exercises every component at least once.  This is a relatively
simple and short test for whether the component is present and whether
it is doing the right thing on at least one case.  Such a simple test,
which has been strongly recommended for software for years, but
evidently not for chips, would have caught the missing entries in the
Pentium FPU's partial divisor table.

This only shows the continuing enormous gap between the worlds of
hardware and software.  Closing this gap will be hard, but any
companies that do not make a strenuous effort to do so from now on put
themselves at great risk of being judged negligent by juries
investigating snafus like Intel's.
=================================================================
-------------------------PENTIUM REPORT # bug50----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 26963 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: THE Pentium WAR IS WON! (Intel now replaces Pentiums on demand)
Date: 23 Dec 1994 04:58:02 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 23
Message-ID: <3ddlcq$hmi@Radon.Stanford.EDU>
References: <BAI.94Dec20220839@xi.iesd.auc.dk> <3d7t2e$2p8@Radon.Stanford.EDU> <docjoe-2212940840310001@ip39.wilmington.de.interramp.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15963 comp.sys.intel:26963

(Hi, Mandy.)

In article <docjoe-2212940840310001@ip39.wilmington.de.interramp.com>,
Joe Ragosta <docjoe@interramp.com> wrote:
>
>In case you missed it, Intel is still holding their hard line. If you
>re-read their announcement, they offer to replace chips for the lifetime
>of the computer only for users WHO CAN SHOW THAT THEY HAVE BEEN AFFECTED.
>
>Clever wording, and it apparently got past a lot of people.

Sheesh, the noise level on this newsgroup.  How about saying where we can find
this announcement?  The word "affected" does not appear in Intel's Dec. 20
press release on http://web.jf.intel.com/about-intel/press/replcrel.html,
which is very unambiguous: Intel offers to exchange a processor for a
concerned user.  This is a fair exchange, we urgently need the processors
and presumably Intel can make a small profit turning the concerned users
into jewelry.  Volunteers please take one step forwards.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html
In public, for the sake of decency I clothe my speculation in facts.
Privately I speculate occasionally, but you can go blind doing it too often.


=================================================================
-------------------------PENTIUM REPORT # bug51----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15974 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.arch.arithmetic,comp.sys.intel
Subject: Re: Pentium Errors and SPEC
Date: 23 Dec 1994 21:35:57 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 40
Message-ID: <3dffrt$mq8@Radon.Stanford.EDU>
References: <3dab5t$gr9@nntp.Stanford.EDU> <879@moene.indiv.nluug.nl>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15974 comp.arch.arithmetic:920 comp.sys.intel:27183

In article <879@moene.indiv.nluug.nl>,
Toon Moene <toon@moene.indiv.nluug.nl> wrote:
>In article <3dab5t$gr9@nntp.Stanford.EDU> oberman@misd.stanford.edu  
>(Stuart Oberman) writes:
>> The following data resulted from executing 10 of the SPECfp92 benchmarks
>> with their reference input data sets:
>> 
>> Errors      Total Divides     Percent     SPECmark
>> ------      -------------     -------     --------
>> 34260       21187229          0.162 %     015.doduc
>><etc>
>
>This is veeeerrrrry interesting. I have been running our numerical weather  
>forecasting model* (instrumented to analyse the most frequently occurring  
>division operations) for a test case based on input data for a 36-hour  
>forecast starting on December 9'th, 0 GMT.

Stuart's more recent post on comp.arch.arithmetic and comp.sys.intel
observes that his simulator is counting some false positives.  He was
kind enough to let me compare his simulator to the real thing, making
it possible to quantify the false positives.  With the divisor fixed at
the Coe divisor 3145727, I compared the simulator's error rate to that
of the Pentium, obtaining a result which when applied to the above rate
for 015.doduc reduces the 0.162% figure to .0000255%.  Whether this
extrapolation changes for other divisors would take a rather longer run
to investigate and it seemed more productive to let Stuart work on it
further before doing any more testing.

As a couple of posters have pointed out, only counting hits on missing
table entries may overestimate the error rate.  The self-correcting
nature of the algorithm makes this particularly true, in that it tends
to neutralize some of the errors resulting from hits on the missing
entries.  If this turns out to be main reason for this overreporting of
errors, then this tendency must be remarkably strong.  Properly
interpreted, Stuart's measurements have the potential to yield very
useful data about the relationship between errors in the P-D plot and
errors in the quotient.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug52----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 27202 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.powerpc,comp.sys.intel,comp.os.misc,comp.unix.bsd,comp.unix.pc-clone.32bit,comp.unix.sys5.r4,comp.unix.misc,comp.os.linux.development,comp.os.linux.misc,comp.os.linux.misc,comp.os.386bsd.development,comp.os.386bsd.misc
Subject: Re: Interested in PowerPC for Linux / FreeBSD / NetBSD?
Date: 23 Dec 1994 23:31:38 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 164
Message-ID: <3dfmkq$39@Radon.Stanford.EDU>
References: <3cphs0$l6e@ddi2.digital.net> <stephenkD0vyyn.1GG@netcom.com> <D12MGt.1L9@tyrell.net> <D16ony.K5z@cogsci.ed.ac.uk>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.powerpc:35706 comp.sys.intel:27202 comp.os.misc:1718 comp.unix.bsd:4594 comp.unix.pc-clone.32bit:6697 comp.unix.sys5.r4:7989 comp.unix.misc:9703 comp.os.linux.development:24526 comp.os.linux.misc:36614 comp.os.386bsd.development:2815 comp.os.386bsd.misc:4859

In article <D16ony.K5z@cogsci.ed.ac.uk>,
Richard Tobin <richard@cogsci.ed.ac.uk> wrote:
>In article <D12MGt.1L9@tyrell.net> john@tyrell.net (John A. Matzen) writes:
>>If a bank is using the FPU to do numerical calculations, they could be loosing
>>hundreds of pennies every day due to rounding errors.  It doesn't matter
>>how many digits of precision the FPU is using.
>
>Sorry to be rude, but this is just drivel.  IEEE floating point is
>guaranteed (when correctly implemented!) to be exact for operations
>such as addition on integers up to 53 (?) bits.  Floating point isn't
>just any old approximation; it's a well-defined approximation that
>can be safely used by people who understand it (except on Pentiums).

The IBM examples show you to be guilty of a non sequitur at the top of
your voice.  For example if you have $4.10 and you take away $1.10 then
you have strictly less than $3.

Your observation *is* relevant in the case of programs that take 1.0 to
denote a penny, and that restrict their divisors to powers of 2.
However my accountant tells me that most financial spreadsheet software
takes 1.0 to be a dollar, the case the IBM examples are for.

Normally this is no great loss; on other than a Pentium it would take
years of double precision rounding errors before you ever lost a
penny.  This is forestalled in practice by the device of keeping
relatively short those calculations that do not round their results to
the nearest penny; I don't know how good spreadsheets are at observing
this commonsense rule.

Binary arithmetic inevitably does injustice to decimals, in fact poetic
injustice:

		What you see when you add
		Isn't what you've got.
		Four quarters make a dollar,
		Ten dimes do not.

Whether you add .25+.25+.25+.25 or .10+.10+.10+.10+.10+.10+.10+.10+.10+.10,
the amount *displayed* is 1.00.  Internally however things aren't what
they appear.  The former sum is, in binary, 1.000...000, no matter how
you compute it, because 0.25 has an exact finite binary representation,
namely .01.  A dime however is .0001100110011..., and when ten of them
are added up at double precision, the result is not 1.000...000 as one
might expect but instead 0.111...111.  At both single precision and
extended precision (if I've interpreted things correctly) ten dimes
come to 1.000...001.

Paradoxically, five dimes *do* make half a dollar.  This provides a
graphic illustration of the nonassociativity of IEEE addition: adding
ten dimes one at a time is not exact while adding together two sums of
five dimes each is exact.  The difference between adding a dime to
nothing and adding it to an exact half dollar is in the alignment: dime
6 has to be shifted three places further to the right than dime 1.
Dimes 7-10 are shifted 2 or 1 places further right than dimes 2-5
respectively, bruising them also.  The dollar emerges from this ordeal
thoroughly cowed by its beating; it's feeble protest to the FPU is
greeted by a stern "We don't do things by halves in decimal
arithmetic."

Here in hex are the low order 16 bits of the 10 single precision sums
of one to ten dimes.  It helps to understand the full details of
alignment, normalization, and rounding to see why each bit is just so,
see David Goldberg's beautifully done Appendix A (A.4 treats addition)
in Hennessy and Patterson's "Computer Architecture."

cccd cccd 999a cccd 0000 999a 3334 ccce 6668 0001 

Here are the corresponding trailing bits for double precision:

999a 999a 3334 999a 0000 3333 6666 9999 cccc ffff 

I tried doing this for extended precision.  Unfortunately I don't have
a Pentium manual and my 486 manual isn't clear about all the details of
80-bit extended.  The main problem is to suck out the 12 hidden
extension bits without falling foul of the rounding rules.  The trick
I've used here is to subtract 819/8192. (or 1638/16384., or
3276/32768., or 6553/65536., or 13107/131072., get the pattern?) from
.1, which turns out to create a number that normalizes by shifting up
by 12, 12, 12, 15, or 16 bits respectively (this is not an arithmetic
progression because bits 14 and 15 of unnormalized .1 are 0).  The
following illustrates this for the last of these, for a 16-bit shift.
The data shown is hex for the trailing end of .1, K, .1-K, etc.  The
same 16-bit shift is realized for .2-2*K, .3-3*K, etc., which is very
convenient.  If you read C, see the code below for a more detailed
explanation; normalization is faked in the printout as a corollary of
printf printing numbers left justified.

Extended precision:
    0.1 = 1999999
      K = 1999800 (16 bits of 0.1)
  0.1-K = 199 (normalize = shift left 16)
    0.5 = 8000000
0.5-5*K = 800
    0.7 = b333333
0.7-7*K = b33
    0.9 = e666666
0.9-9*K = e66
    1.0 = 10000000
1.-10*K = 1000

This 16-bit normalization pulls the hidden 12 bits out into the open,
where they can then be stored, printed, etc.

Here is what the 12 extension bits look like for dimes 1 to 10; they
are the top 12 bits of the following ten 16-bit hex quantities.

99a0 99a0 3340 99a0 0000 3340 6680 99c0 cd00 0020 

Assuming the extension really is 12 bits, I calculated the third entry
to be 3330 instead of 3340; why the discrepancy?  Are there really only
11 bits?  This would make sense if the purported 64-bit internal
mantissa includes the hidden leading bit, which in turn would make
sense because this would then make it just a normal 64-bit integer
register.

Here's the program that produced this.  No comments, I don't distribute
my jigsaw puzzles assembled.


#include <stdlib.h>

#define K	(((1<<17)/10)/((1<<17)+0.0))	/* first 16 bits of 0.1 */

main()
{
	int i;
	float x;
	double t, y, z[11];
	t = 10;
	printf("Single precision:\n");
	for (x = 1./10; x < 1.01; x += 1./10)
		printf("%.4x ", ((unsigned*)&x)[0]&0xffff);
	printf("\nDouble precision:\n");
	for (y = 1./10; y <= 1; y += 1./10)
		printf("%.4x ", ((unsigned*)&y)[0]&0xffff);
	printf("\nExtended precision:\n");
	printf("    0.1 = %x\n", (int)(.1*(1<<4)*(1<<16)*(1<<8)));
	printf("      K = %x (16 bits of 0.1)\n", (int)(K*(1<<4)*(1<<16)*(1<<8)));
	printf("  0.1-K = %x (normalize = shift left 16)\n", (int)((.1-K)*(1<<4)*(1<<16)*(1<<8)));
	printf("    0.5 = %x\n", (int)(.5*(1<<4)*(1<<16)*(1<<8)));
	printf("0.5-5*K = %x\n", (int)((.5-5*K)*(1<<4)*(1<<16)*(1<<8)));
	printf("    0.7 = %x\n", (int)(.7*(1<<4)*(1<<16)*(1<<8)));
	printf("0.7-7*K = %x\n", (int)((.7-7*K)*(1<<4)*(1<<16)*(1<<8)));
	printf("    0.9 = %x\n", (int)(.9*(1<<4)*(1<<16)*(1<<8)));
	printf("0.9-9*K = %x\n", (int)((.9-9*K)*(1<<4)*(1<<16)*(1<<8)));
	printf("    1.0 = %x\n", (int)(1.0*(1<<4)*(1<<16)*(1<<8)));
	printf("1.-10*K = %x\n", (int)((1.-10*K)*(1<<4)*(1<<16)*(1<<8)));
	z[1] = 1./t-1*K;
	z[2] = 1./t+1./t-2*K;
	z[3] = 1./t+1./t+1./t-3*K;
	z[4] = 1./t+1./t+1./t+1./t-4*K;
	z[5] = 1./t+1./t+1./t+1./t+1./t-5*K;
	z[6] = 1./t+1./t+1./t+1./t+1./t+1./t-6*K;
	z[7] = 1./t+1./t+1./t+1./t+1./t+1./t+1./t-7*K;
	z[8] = 1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t-8*K;
	z[9] = 1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t-9*K;
	z[10] = 1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t-10*K;
	for (i = 1; i <= 10; i++)
		printf("%.4x ", ((unsigned*)&z)[2*i]&0xffff);
	printf("\n");
}
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug53----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15978 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: What is 'Virtually full double precision accuracy'?
Date: 24 Dec 1994 00:49:01 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 44
Message-ID: <3dfr5t$291@Radon.Stanford.EDU>
References: <ertemD170wG.7x6@netcom.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15978 comp.sys.intel:27225

In article <ertemD170wG.7x6@netcom.com>, M. C. Ertem <ertem@netcom.com> wrote:
>I was just going through intel's www site and I ran across the
>following quote [1]:
>
>   'But if 3.0 is approximated to 16 digits as 2.999999999999999, then
>    the accuracy expected in the quotient is 13 to 14 significant digits,
>    virtually full double precision accuracy even in the presence of a
>    flawed result.'
><...>
>So, what in the world is 'virtually full double precision accuracy'?

This is a meaningful notion in context, where Mike Barton was
addressing the question of whether very slight bruising can yield much
larger errors.  Mike claimed a negative answer as a corollary of my
bug1 results (boole.stanford.edu:/pub/FDIV/README).  By "virtually full
double precision" Mike meant that superlight bruising as with IBM's
4.1-1.1 divisor can yield only superlight errors.

However the domain of bug1 is restricted to small bruised integer
*operands* (both dividend and divisor).  This domain is of interest
because there the error rate approaches one 10^-7 or larger error in a
2,500 divisions at 10^-6 bruising, along with one 10^-5  or larger
error in 70,000 divisions.  Unfortunately this is not the only domain
that users have to worry about when 4.1-1.1 shows up as a bad divisor.
When the dividend is purely random (equivalently, a very large
integer), the error rate is only one in several hundred thousand, but
offsetting this lower rate is a much larger error for superlight
bruising than possible for small bruised integer dividends where any
errors that do occur do so at higher rates.  With superlightly bruised
divisor 4.1-1.1, errors on the order of 10^-5 can occur at a rate of
one every three or four million divisions.  This is the situation that
the IBM observations are most relevant to.

Heisenberg's principle is at work in heavy disguise here (the duality
of safety/information and liveness/time, see the papers on Boole on
this topic, particularly those on Chu spaces).  With very light
bruising one can get either relatively large errors or relatively high
error rates; to get both together one must increase the bruising.  The
table in bug1 shows that bruising at 10^-6 is about the level of
irritation that drives the Pentium bug into a frenzy.  Much more and
you knock it unconscious.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug54----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 15987 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch
Subject: Whose Pentium bug is this?  Intel's, gnu's, or the programmer's?
Date: 24 Dec 1994 09:07:19 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 65
Message-ID: <3dgoc7$i3a@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27283 comp.arch:15987

I'm running Linux with gcc 2.5.8 on my Pentium.  The following program
came up while I was investigating the Pentium FDIV bug.  When I compile
this program with -O (optimize) and then run it, it outputs "here
today, gone tomorrow."  If you follow the logic of the main loop you
will see that it can only execute once, when a==6, since adding one to
6 yields exactly 7.  Hence K-(K/b)*b should not change from one
invocation to the next.  Hence depending on whether it is zero or not,
the program should print either "here today, here today" or "gone
tomorrow.\ngone tomorrow."  It certainly should not print something
different when called with the same value each time.

#define K (10-1e-13)

f(double x)
{
	printf("%s", x? "here today, ": "gone tomorrow.\n");
	
}

main()
{
	double a,b;
	for (a = 6; a < 7; a++) { 
		b = 8.1 - a/10.;
		f(K-(K/b)*b);
	}
	f(K-(K/b)*b);
}

I presume what is happening is that the first (K-(K/b)*b) is calculated
on the stack using the value of b left on the stack by the previous
subtraction, which will differ from 7.5 in the hidden extension bits,
and which is a buggy division.  However the second (K-(K/b)*b)
presumably no longer has b on the stack and so has to get it from
variable b, where it has been truncated to double precision, becoming
exactly 7.5, where the division is not buggy.

When this program is compiled without -O it prints "gone
tomorrow.\ngone tomorrow." as one would expect.

The question for people who keep their buggy Pentiums then becomes, how
do you reason about your program given that the optimizer can change
the program's behavior in this way?

What I'd like to know is, whose fault is this strange dependence on the
-O flag?  Intel's, Gnu's, or mine for writing such a silly program?

I am rather annoyed at this quirk, since I wasted several hours trying
to figure out whether I was going crazy or the Pentium was more broken
than I had realized.  The above is the distillation of the bug down to
a small program starting with a much larger one, whose seemingly almost
nondeterministic behavior had me totally mystified.  The large one had
no constants in it except 0, 1, and 10, and was far from silly when I
wrote it.  The constants in this distillation are for the sake of
having a simple concrete example.

This incident points up a hazard of having an FPU that runs at a
different precision (80 bits) from the precision of the operands being
loaded and unloaded (64 bits).  RISC machines that have 32-bit, 64-bit,
and 128-bit floating point should not encounter the above problem when
arithmetic is done at the same precision as the operands.  Anyone care
to comment?
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug55----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 27286 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: TECHNICAL: What is the simplest example ?
Date: 24 Dec 1994 10:00:38 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 29
Message-ID: <3dgrg6$jjn@Radon.Stanford.EDU>
References: <3dasr6$7ad@news.primenet.com> <3dd9sc$dn@nntp.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3dd9sc$dn@nntp.Stanford.EDU>,
 <Christopher_Lane@Med.Stanford.EDU> wrote:
>Lawson English writes
>> What we need is the simplest example of the FDIV bug.
>> 
>> The example posted earlier had was a bit to large to make the point.
>
>I don't know if it fits your constraints, but this is the minimal FDIV  
>test I've whittled my way down to based on postings to this group:
>
>void main() {
>    float x = 3.0 - (4.1 - 1.1);
>    puts(((((5 - x) / (15 - x)) * (3 + x)) >= 1) ? "good" : "bad");
>    }

Try 1/(8.3-8.2)/(8.2-.7).  Theoretically this equals 1/.1/7.5 = 4/3 =
1.3333333333333.  The Pentium gets 1.3333333330850.  Not a spectacular
relative error at only 1.8*10^-9, but clearly wrong.

When testing this in C, don't write it as a constant expression since C
will try to compute the whole constant very accurately at compile
time.  Prevent this by assigning a subexpression to a variable, as
follows.

	double x;
	printf("%.12f\n", (x = 1/(8.3 - 8.2))/(8.2 - 0.7));
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html

Followup:

Incidentally, of the 25 million or so expressions of the form
1/(d.d-d.d)/(d.d-d.d) where both differences are positive, 814 are in
error.  This constitutes an error rate of one in thirty thousand, or
thirty a second if all you are doing is evaluating random expressions
of this form.
=================================================================
-------------------------PENTIUM REPORT # bug56----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 27346 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch
Subject: Re: Whose Pentium bug is this?  Intel's, gnu's, or the programmer's?
Date: 24 Dec 1994 21:52:51 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 96
Message-ID: <3di57j$9mb@Radon.Stanford.EDU>
References: <3dgoc7$i3a@Radon.Stanford.EDU> <D> <D1BA31.EBw@dsg.tandem.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27346 comp.arch:15993

In article <D1BA31.EBw@dsg.tandem.com>,
Anil Das <anil@nskernel.tandem.com> wrote:
>In article <3dgoc7$i3a@Radon.Stanford.EDU>, pratt@Sunburn.Stanford.EDU (Vaughan
>R. Pratt) writes:
>> [gcc gives too much precision with -O switch].
>
>> What I'd like to know is, whose fault is this strange dependence on the
>> -O flag?  Intel's, Gnu's, or mine for writing such a silly program?
>
>	I think the fault is yours. The C standard doesn't say
>there shouldn't be too much precision. (Maybe the Fortran standard
>does?)

This is naive.  If people are going to take a complex-integral
viewpoint of the C standard as the sum of its loopholes, then the C
standard can go jump in a lake.

It is C's place to prescribe some things and preserve others.
Precision falls into the latter category; C should preserve the
precision of the platform at hand.  If the C standard does not say this
then the C standard is at fault, not me.

So are you saying that it wouldn't bother you if a program was able to
find the inverse of some matrix when not optimized but after
optimization, still on the same platform, it changed its mind and
decided the matrix was singular?

I've never before encountered a situation where this sort of thing
actually happened (as opposed to being a contrived hypothetical case)
and the blame was not placed on the optimizer.   IEEE arithmetic obeys
certain laws, including commutativity of addition and multiplication,
but not associativity.  I would argue that an optimizer should not
assume *as a default* any falsifiable law.  I have no objection to the
programmer lying about such laws to the compiler, e.g. as a compiler
option, in those rare instances when there is a detectable performance
gain.  It should however be possible for the user of the binary to
interrogate it at any time about all liberties taken with the truth in
its compilation.

Incidentally I should clarify the nature of the dependency.  As I said,
I was able to change the behavior of the program just by turning the -O
flag on and off.  But I was also able to change the behavior of the
program without touching the -O flag (leaving it on), simply by adding
a certain statement that came *several instructions after* the point
where the behavior changed!  It looked for all the world as though the
computer could sense whether this new statement was going to happen.
Hopefully this makes it clearer why I was tearing my hair.  Now tell me
*that's* my fault.

So why didn't I just stop typing -O?  Well, I hadn't even noticed it.
I was compiling the program, hack.c, with "make hack", but without a
Makefile (the "make" command defaults to an ordinary compilation when
there is no Makefile).  Unfortunately there *was* a makefile in the
current directory, and although it did not contain any reference to
hack.c, it did set CFLAGS to -O, which I had not thought to pay any
attention to.  Needless to say I removed it immediately once I found
that it was the cause of the problem.

>	You could have an initialization routine called to
>set the control word of the FPU to `double precision'. That should
>solve the problem, since it will hold the precision down to 64
>bits (shouldn't we say 54 bits, seeing as how the mantissa is
>actually 54 bits including the hidden bit?) even when the operands
>are on the FPU stack.

(53 bits as Ron Newman points out.)  So when do I apply this solution?
Five hours after tearing my hair out?  Five minutes after hearing from
my client's lawyer that my program encountered an unexpected singular
matrix and shot down a helicopter?  Or permanently?

None of these are appealing.  With regard to the last, why should I
have to give up the additional precision possible within an arithmetic
expression, a benefit derived at every instruction, just because of a
million to one chance of running into an extremely freaky effect?

Having slept on this question, I now firmly believe the optimizer is at
fault here.  The optimizer's crime is *failure to preserve precision
casts.*  For the case at hand, b = 8.1 - a/10.; f(K-(K/b)*b);, the
unoptimized code casts b to double precision.  The optimized code
however is schizophrenic about b's precision.  After the assignment
8.1-a/10 is stored in two locations, the FPU stack and variable b,
where it has respective precisions extended and double.  The optimizer
pays no attention to this difference and simply takes b from wherever
works best.  This is insidious.  That the choice can be affected by
downstream instructions makes it insidiously insidious.  I could go
bald.

In order to preserve the semantics implied by the unoptimized code, the
optimizer should either take b from the variable or round the FPU entry
to double precision.  The half microsecond of computer time saved by
capitalizing on schizophrenia is nothing compared to hours of lost
programmer time tracking down the mind-bogglingly inscrutable behavior
possible when this is not done.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug57----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
From pratt@Sunburn.Stanford.EDU Sat Dec 24 14:51:41 PST 1994
Article: 15994 of comp.arch
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch
Subject: Re: Whose Pentium bug is this?  Intel's, gnu's, or the programmer's?
Date: 24 Dec 1994 22:42:38 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 22
Message-ID: <3di84u$b4e@Radon.Stanford.EDU>
References: <3dgoc7$i3a@Radon.Stanford.EDU> <D> <D1BA31.EBw@dsg.tandem.com> <3di57j$9mb@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27358 comp.arch:15994

In article <3di57j$9mb@Radon.Stanford.EDU>,
Vaughan R. Pratt <pratt@Sunburn.Stanford.EDU> wrote:
>
>Having slept on this question, I now firmly believe the optimizer is at
>fault here.  The optimizer's crime is *failure to preserve precision
>casts.*  For the case at hand, b = 8.1 - a/10.; f(K-(K/b)*b);, the
>unoptimized code casts b to double precision.  The optimized code
>however is schizophrenic about b's precision.

At Ron Newman's suggestion I tried declaring b as long double (should
have tried this earlier).  This cleared up the apparent randomness by
removing the precision casts and associated schizophrenia.  Now with -O
either on or off it prints "here today, here today, " as one would
predict.  When b was double, the unoptimized code printed "gone
tomorrow\ngone tomorrow\n" as expected while the optimized code printed
either that or "here today, gone tomorrow\n" depending on the phase of
the moon.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html
				What you see when you add isn't what you've got
				Four quarters make a dollar, ten dimes do not


=================================================================
-------------------------PENTIUM REPORT # bug58----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 16000 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic
Subject: Re: TECHNICAL: FDIV bug according to Intel, IBM and PC Week...
Date: 25 Dec 1994 06:47:56 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 77
Message-ID: <3dj4is$nhh@Radon.Stanford.EDU>
References: <3diutq$ing@news.primenet.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27404 comp.arch:16000 comp.arch.arithmetic:927

In article <3diutq$ing@news.primenet.com>,
Lawson English <english@primenet.com> wrote:
>
>A comparison of the analyses of the FDIV bug by Intel, IBM and PC Week 
>leads to one conclusion:
>
>None of these addresses the most likely problem scenarios:
>
>"bruised-integer divisions" whose numerator and denominator both lie 
>between 10 and 100.

It might be appropriate here to repeat the concluding paragraph of my
bug1 post (from boole.stanford.edu:/pub/FDIV).

+While this scenario puts the FDIV bug in what looks at least to me like
+a pretty bad light, I do not claim it is anything like the worst case
+scenario.  The two essential factors for any scenario are the rate at
+which that scenario triggers the FDIV bug, and how often the essence of
+that scenario arises in the real world.  The "damage index" of a
+scenario is the product of those two factors.  A scenario that accounts
+for thousands of hours of actual Pentium time but that only triggers
+the bug say every ten minutes may well have a higher damage index than
+my scenario above, whose strength is that it triggers the bug every few
+milliseconds but whose weakness is the uncertainty as to how likely it
+really is in practice.  You must admit however that it is not a
+completely implausible scenario.

The point is that *neither* of the extremes represented by random data
and SBI's (small bruised integers) need be the worst case.  You have to
weigh (at least) three things: the error rate, the error size, and the
likelihood of encountering that scenario.

What Lawson English calls "the most likely problem scenarios" are what
I would prefer to call "very bad case scenarios should they ever
arise." By the same token pure random data is "a very good case
scenario should it ever arise."

Neither the Intel scenario nor my SBI scenario is *the* worst or best
case.  To get the worst case, repeatedly calculate 4195835/3145727; for
the best case, simply change this to 4195836/3145727, whose error rate
is zero (and back up to one again at 4195836/3145727.25).

I would prefer to rephrase "most likely problem scenario" to "scenario
with the highest damage index."  It is easy to classify scenarios by
how bad they would be *if* encountered.  The difficulty in assessing
the damage index of a scenario is in determining how commonly that
scenario is encountered (one hundredth vs. one millionth of all
computer usage), and how serious are the consequences of errors (a
little bank or a big one, a four-seater plane or a jumbo jet, etc.).

>the above 
>suggests that we have yet to get even a rough estimate for the potential 
>"real world" problems that can arise from the FDIV bug.

This is exactly right.

>My impression is that the potential problem for financial spreadsheet 
>users is probably at least an order of magnitude worse than what PC Week has 
>suggested.

I have *no* idea.  I neither wish to play Ralph Nader nor encourage an
era of casual disrespect for the consequences of small arithmetic
errors either by virtue of their accumulation or their unfortunate
location.

The potential magnitude of the problem combined with the enormous
uncertainty makes the question of its significance a most urgent one.
Thus far Intel's position that it leaves most if not all users none the
worse for wear has been neither refuted nor proved in a convincing
way.  Whether these small division errors present a serious risk to
some or are no worse than a piddling nuisance to everyone still remains
to be determined.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html
				What you see when you add isn't what you've got
				Four quarters make a dollar, ten dimes do not


=================================================================
-------------------------PENTIUM REPORT # bug59----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 27472 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Intel - The coverup continues
Date: 26 Dec 1994 01:31:00 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 20
Message-ID: <3dl6ck$md4@Radon.Stanford.EDU>
References: <3dg3ne$h52@also.hooked.net>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3dg3ne$h52@also.hooked.net>,
Keith McDonald <kmacdona@also.hooked.net> wrote:
>What independent observers? I have yet to see
>one person not connected with IBM, AMD, Motorolla,
>or Apple (like yourself) back their "study."

Let me encourage you to read comp.sys.intel more closely, to which I
have posted some 50 100-line messages (on average) reporting on my
independent study of the problem.  These are also available by
anonymous FTP from Boole.stanford.edu collected in one file as
/pub/FDIV/bugs, also available from mathworks.com's pentium directory
as Pratt.txt.

(i)  You should find my research backs IBM's study quite well.
(ii) I have no connection with IBM.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html
				What you see when you add isn't what you've got
				Four quarters make a dollar, ten dimes do not


=================================================================
-------------------------PENTIUM REPORT # bug60----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 27504 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic
Subject: Re: TECHNICAL: FDIV bug according to Intel, IBM and PC Week...
Date: 26 Dec 1994 07:17:06 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 117
Message-ID: <3dlqli$1ur@Radon.Stanford.EDU>
References: <3diutq$ing@news.primenet.com> <3dj4is$nhh@Radon.Stanford.EDU> <3dje1n$7om@news.primenet.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27504 comp.arch:16007 comp.arch.arithmetic:933

Summary: PC Week may well not have focused on the most serious
manifestations of the Pentium bug.  However the goal is not so much to
come up with a single number characterizing how serious the bug is, but
rather to give users enough insight into its nature to permit them to
assess its seriousness for their own applications.  It is possible to
alert people to the need to pay attention to the problem, a service
that PC Week seems to have rendered its readers, without depending on
examples that "push the badness envelope."

In article <3dje1n$7om@news.primenet.com>,
Lawson English <english@primenet.com> wrote:
>
>I didn't say what I meant to say, or at least not terribly well.
>
>My point was that PC Week took you at your word with the phrase "small 
>bruised integers," and evaluated the risk for the smallest integers 
>possible. (single digit).

I should double-check here: you mean the PC Week white paper,
http://www.ziff.com/~pcweek/reviews/dec_1994/wh_paper.html, the one
posted here by Allen Kim on Dec. 20?

That paper talks about calculations involving numbers of the
form a.b for single digits a and b.  Is that what you mean by "single
digit integers"?

Numbers of the form a.b are a "natural" source of superlight bruising
by virtue of (i) being decimal and hence not all exactly representable
in binary, and (ii) having a better than one in ten chance of producing
an integer or a "small" binary rational, meaning m/2^n where m and n
are small integers, when added or subtracted.  When three or more are
added, or two or more subtracted, any resulting integer or binary
rational has a chance of about one in two of not being exact, in which
case it will be superlightly bruised.

Now the populations I talked about on December 3 in bug1 and bug2 are a
bit different.  Instead of starting with unbruised decimals a.b and
letting bruising arise naturally by addition and subtraction, I started
with already bruised small integers, with no particular model of what
caused the bruising.  Bruising can happen in various ways and at
various rates and magnitudes.  To understand this aspect of bruising
one would need to explore a lot of scenarios that might lead to
bruising, not just the one that PC Week focused on in their December 16
article.

It seemed to me that the most useful information would be obtained by
directly studying the impact of bruising, *independently of* its cause,
on error rates and magnitudes, as a function of the magnitude of the
bruising.  One could then extrapolate from this raw data to any
particular scenario involve bruising at a given rate and level.  My
experiments yielded the following basic properties of bruised
integers.

1.  The level of bruising that irritated the bug the most is around
10^-6.  Higher than 10^-4 and the bug no longer recognized it as
bruising since it was interfering with the requisite pattern of 4
special bits followed by 8 1's.

2.  For integers in the range 1 to 1000, with 10^-6 bruising, an error
greater than 10^-5 happens every 70,000 divisions, while one greater
than 10^-7 happens every 2,500 divisions.

3.  Reducing the limit from 1000 to 100 reduces the above 1/70,000 rate
to 1/2,000 and the above 1/2,500 rate to 1/200.

I agree this is a lot higher than anything PC Week found.  I also agree
with those posters who have questioned PC Week's interpretation of
their own results as placing them *between* IBM and Intel.  If anything
it makes PC Week more extreme than IBM (but not as extreme as the above
data on pure sources of small bruised integers).

It is natural to ask whether small bruised integers can arise in
practice and in what quantities and at what levels of brusing.  One way
that they can arise naturally is from any data source that does its own
decimal truncation.  Such a source might well produce 10^-6 bruising at
a very high rate.  And one can come up with other scenarios for
bruising at these levels.  One such is the Pentium itself, whose errors
are at just the right level to really arouse the bug: the Pentium will
get much more angry with its own 10^-5 to 10^-7 errors than those
arising via normal truncation on any machine (superlight bruising) at
around 10^-18.

But in view of the enormous variability of both applications and data
encountered by those applications, I don't see that a terribly useful
purpose can be served by speculating on what *might* happen.  Far
better for each user to examine their own application to determine
whether bruising is likely to constitute a serious problem for them.
Until a user has determined such statistics for the typical ranges of
data encountered in one's own application (as opposed to what Intel
says you encounter), that user cannot rule out with certainty error
rates and magnitudes as high as those indicated in item 3 above.

Another point to bear in mind is that there may lurk yet higher rates
via some other mechanism than small bruised integers.  Users need to be
on the lookout for such situations.

So yes, PC Week may have understated the seriousness of the problem.
But I think even the picture they have painted is serious enough that
users should be very concerned as a result of the PC Week article.

PC Week's main purpose seems to be the same as mine: to send a warning
signal alerting unsuspecting users to pay close attention to their
applications.  This should include learning more about the bug than PC
Week has to offer, and hopefully people will install signposts to
indicate where to find such additional information.  The IBM report is
one such place, mathworks.com's ftp/www site is another, my /pub/FDIV
directory on Boole.Stanford.EDU is yet another, and hopefully other
helpful information on the bug will come to light as more people join
in its investigation and more is learned about it.

There is a lot more to this business than pure technology, a point Intel
made in its December 20 repentium.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html
				What you see when you add isn't what you've got
				Four quarters make a dollar, ten dimes do not


=================================================================
-------------------------PENTIUM REPORT # bug61----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 27510 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: TECHNICAL: What is the simplest example ?
Date: 26 Dec 1994 07:48:57 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 15
Message-ID: <3dlsh9$2ms@Radon.Stanford.EDU>
References: <3dasr6$7ad@news.primenet.com> <3dgrg6$jjn@Radon.Stanford.EDU> <3di9lh$bkp@radon.stanford.edu> <3dif6o$pv3@news.jf.intel.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3dif6o$pv3@news.jf.intel.com>,
Kumar Anshumali <kanshuma@ichips.intel.com> wrote:
>So, what is the bottomline ? It is that you *need* a statistical measure of user-space 
>of numbers and computations. The only published studies on this so far are Intel, IBM 
>and PC Week. Do you have some new usage data ? 

I was unaware of this.  Could you be more specific as to what "usage
data" is currently in print from those sources, particularly from IBM
and PC Week?  I'm aware of Intel's argument that typical users use
random numbers.  What other usage data were you referring to?
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html
				What you see when you add isn't what you've got
				Four quarters make a dollar, ten dimes do not


=================================================================
-------------------------PENTIUM REPORT # bug62----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 27591 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: "The Official Truth" from Intel appears to be winning
Date: 26 Dec 1994 20:52:33 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 80
Message-ID: <3dnaeh$svf@Radon.Stanford.EDU>
References: <D1FEw5.40M@murdoch.acc.Virginia.EDU>
NNTP-Posting-Host: sunburn.stanford.edu

In article <D1FEw5.40M@murdoch.acc.Virginia.EDU>,
Greg Fife <fife@uvacs.cs.Virginia.EDU> wrote:
>
>Another part of the problem is that this bug has no immediate or direct
>impact on the average citizen.  It is a lot harder to tell a good story about
>an inaccurate Computational Fluid Dynamics result than it is to tell a good
>story about a broken word processor.  The typical home user doesn't even own
>a Pentium, and they don't see this particular problem as a threat even if they
>were to own one.  Thus, they don't care.

The typical home Pentium owner probably *doesn't* need to care.

The Pentium is like a car whose exhaust emission has a 10% level of
carbon monoxide.  California legislation today limits it to 1%, but
imagine the response you would get from the auto trade journals in 1935
if you tried making a fuss about a car that you felt had an unduly high
level of CO emissions.  Everyone would be asking who's ever been
injured by such a small amount of CO injected into the great outdoors.
Sure people have died, intentionally or not, by venting it into their
garage, but that's not the typical situation, that's asking for trouble
by standing under a meteor.

Many Pentium users don't give a hoot and probably don't need to.
Provided they keep their Pentium pollution in their own closed-system
incinerator there's no reason they should.  It would cost Intel a
bundle to replace all those processors, and I see no point in making
life hard for Intel by getting users worked up unnecessarily about a
problem they're unlikely to be seriously affected by.

Of much greater concern is Pentium pollution which is vented into the
"great outdoors," namely those systems that don't live in a closed
little world of their own.

The industry has correctly intuited that there's a great deal of
capacity out there to absorb a steady stream of numerical pollution,
especially given how small that stream is, just like a 10% level of CO
streaming out of a car exhaust.  What the industry has *not* done
howver is to compare that capacity to the likely pollution rate and to
establish with some reasonable certainty that the several million
polluting Pentiums out there will not present a serious hazard over
their expected lifetime.  The nearest it has come to this is to perform
some naive back-of-the-envelope calculations that we have all seen can
vary by many orders of magnitude depending on whose envelope you read.

The truth of the matter is that no one has a clue, and most simply have
their head in the sand and refuse to admit that there could be any
problem.

We don't yet know what the full range of Pentium hazards are.  The
small-bruised-integer hazard is new as of this month.  A hazard that
had not occurred to anyone until a couple of days ago is that the
Pentium error rate can feed on itself, as follows.  The worst level of
bruising for small integers, around 10^-5-10^-7, is also the level of
the Pentium pollution.  Thus if you have a million Pentiums sending
data to each other, whether by bus, serial line, ethernet, or mag tape,
then in addition to normal truncation-induced bruising at the 10^-18
level, there will also be Pentium-induced bruising at the 10^-6 level.
At *that* level of bruising the bruising rates for bruised integers
less than 100 goes through the roof, hitting a rate of one >10^-7 error
every 200 divisions by bruised integers and one >10^-5 error every
2,000 such divisions.  A network of Pentiums is like a saloonful of
hooligans ready for a barroom brawl.

Other hazards that have been pointed out this month include any
processes that can be understood as either differentiation, where small
changes in x or y can lead to large changes in x/y (the whiplash
example), or as integration, where a steady stream of small errors can
accumulate to form a large error.

We really have not come to grips with the full range of hazards
presented by the Pentium.  In view of how bad the errors *can* get
under certain not entirely unnatural circumstances, anyone who insists
that a buggy Pentium is safe for use in applications where it can
affect either people or other computers has a responsibility to
demonstrate that none of the known hazardous modes of the Pentium
present a serious risk for that application.  Whether they also have to
rule out as-yet-undiscovered hazards is a good question.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug63----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 27633 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: clari.net.talk,comp.sys.intel
Subject: Re: "Intel Boo-Boo Worries Some" -- AP/ClariNet article
Date: 27 Dec 1994 01:23:33 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 88
Message-ID: <3dnqal$78s@Radon.Stanford.EDU>
References: <mad-scientistsURa4_4DN@clarinet.com> <mad-scientistsUR928_4DN@clarinet.com> <1994Dec26.143453.10150@news.media.mit.edu>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU clari.net.talk:879 comp.sys.intel:27633

In article <1994Dec26.143453.10150@news.media.mit.edu>,
Ron Newman <rnewman@media.mit.edu> wrote:
>From an un-bylined Associated Press article that was posted to the
>newsgroups clari.biz.features, clari.tw.computers,and
>clari.living.consumer with the Message-ID
><mad-scientistsUR928_4DN@clarinet.com>.  The article appeared in this
>morning's (December 26) Boston Globe, and presumably in many other
>newspapers today.
>
>>	For each of the 1,700 combinations that could go wrong, 9
>>billion others can be divided accurately.
>
>This is surely inaccurate.  If the bug occurs for a particular
>pair of floating-point numbers, then it will also occur if either or
>both numbers are scaled by any power of 2.  That's because it depends
>only on the base-2 mantissas of the numbers involved, not on their
>base-2 exponents.

For each of the 1,700 reporters capable of getting this statistic
exactly right, 9 billion others will find some way to screw it up.

Another problem with the above is that the number 1738 is for
single-precision in, single-precision out, where the rate is not one in
9 billion but in 40 billion.  The 1 in 9 billion rate is for
double-precision out (and any precision in).

Not that it is going to solve this reporting problem, but it occurred
to me after posting my 3x3 matrix of error counts that this matrix was
what is sometimes called an outer or exterior product, by which I mean
one representable as uv^T where u is a column vector and v^T is a row
vector.  This means that one does not need to know all 9 numbers to
recreate the table, only 6, three in the vector u, three in v.

The column vector u is that of total population, i.e. the number of
possible operand pairs, with vector entries for single precision,
double precision, and extended precision.  The row vector v is that of
error rates, with entries 1 in 40 billion for single precision, 1 in 9
billion for double, and 1 in 7 billion for extended.  (You did pay for
extended precision when you bought your Pentium, so don't let Intel get
away with the unqualified statement that the error rate is one in 9
billion for random operands, that's only for double precision errors.)

To remember the population sizes it suffices to know that each operand
has 23, 52, or 63 bits of information.  (The table I posted previously
was based on the last being 64, but I did not then know that 64
included the constantly-1 leading bit.)  The total population of all
pairs is then 4 to the number of bits per operand, e.g. 4^23 = 7.037*10^13.

Remembering the double precision error rate is easy: everyone knows the
magic number 1 in 9 billion, which understates the rate by only half a
percent.  Giving the single precision rate as 1 in 40 billion
overstates it by 1%, as does giving the extended precision rate as 1 in
7 billion.

Now you have all 6 numbers, which form respectively the right hand
column ("TOTAL") and bottom row ("ERROR RATE") of the following table.
To get the entry on row i, column j, multiply the total for row i by
the rate for column j, e.g. 2.470*10^-11 * 7.037*10^13 = 1738.


			    NUMBER OF PENTIUM ERRORS
		     by precision of operands and quotient

				 Quotient
				Precision
		____________________/\__________________
               /                                        \
              ---------------------------------------------------------
             |     Single     |   Double     |  Extended  |   TOTAL    |
             |---------------------------------------------------------|
             |                |              |            |         13 |
       Single|         1738   |     7863     |    9915    | 7.037*10   |
Operand      |                |              |            |            |
Precision    |           20   |          21  |         21 |         31 |
       Double|   5.009*10     |  2.266*10    | 2.858*10   | 2.028*10   |
             |                |              |            |            |
             |           27   |          27  |         28 |         37 |
     Extended|   2.101*10     |  9.502*10    | 1.199*10   | 8.507*10   |
             |                |              |            |            |
              ---------------------------------------------------------
             |                |              |            |            |
             |           -11  |          -10 |         -10|            |
   ERROR RATE|   2.470*10     |  1.117*10    | 1.409*10   |            |
             |                |              |            |            |
              ---------------------------------------------------------
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug64----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 27796 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Santa brought me a GOOD PENTIUM!
Date: 28 Dec 1994 01:52:27 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 53
Distribution: usa
Message-ID: <3dqgcr$h0n@Radon.Stanford.EDU>
References: <evanmD1C07G.HqF@netcom.com> <1994Dec25.020010.4514@mercury.ncat.edu> <3dltcv$36s@Radon.Stanford.EDU> <D1GwnA.Enn@dsg.tandem.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <D1GwnA.Enn@dsg.tandem.com>,
Anil Das <anil@nskernel.tandem.com> wrote:
>In article <3dltcv$36s@Radon.Stanford.EDU>, pratt@Sunburn.Stanford.EDU (Vaughan
>R. Pratt) writes:
>> Same here (December 5).  How many who joined the queue on or around
>> that date who have/haven't got their replacement yet?  (I'm not
>> complaining, I get to keep finding out interesting new things about the
>> bug while I'm waiting.)
>
>	Looks like once Prof. Pratt and other technical people
>get there replacement Pentiums, studies about the bug will stop,
>except maybe inside Intel. I guess that is not very important
>now that Intel is replacing the chips on demand.

I was about ready to call Intel to ask how things were progressing when
as luck would have it a Fed Ex truck showed up this afternoon (Dec.
27) with my replacement.  It bore a ship date of Dec. 23 (from Oregon)
and said "Priority Overnight."  I guess Christmas must be that holiday
which outlaws such miracles.

It took about one minute to replace, and bingo!  Reliable arithmetic.
The flaky chip was fun for a short fling (and none of my data seems to
have been contaminated), but for the long haul I like my FPU to be
boringly reliable.

But the flaky one at least had a quiet fan.  The fan that came already
mounted on my Repentium has a whine that completely drowns out its
predecessor.  Also its plug was the large 3/4" size, not the little
.2" one I'd been plugging into the two-pin header right beside the CPU
on my Intel Plato motherboard.  After I've run it for two days (infant
mortality check), I'll kill two birds with the one stone by switching
fans.  The heatsink seems to be *very* solidly epoxied to the lid but
the fan should unscrew easily from it.

I am very grateful to Intel for replacing this at no charge at all, not
even the return shipment.  I've written off the time invested in
investigating the bug against the "curiosity driven" side of my
research, hopefully also with a possible benefit to at least some
people hoping to make a more informed decision about whether to buy,
use, or replace a buggy Pentium.  And I was brought up to date somewhat
on the state of the art in FPU design, IEEE arithmetic, and the proper
management of mixed precision.

We still have some buggy Pentiums around the department, in fact I have
a X-window open on one right now, so at some point when I'm in need of
excitement I may yield to the temptation of another fling with fickle
floating point, whose table was so cruelly deep-fived.  Meanwhile back
to writing about duality theory and its applications.

This post is bug64, btw.  Very appropriate.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug65----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 28153 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,sci.math.num-analysis,comp.arch
Subject: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Contaminated Data Persists)
Date: 30 Dec 1994 06:26:58 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 381
Distribution: inet
Message-ID: <3e097i$952@Radon.Stanford.EDU>
References: <3dn52g$ohe@newsbf02.news.aol.com> <hbakerD1FuHy.GJC@netcom.com> <3dp96r$h7r@mhaal.inhouse.compuserve.com> <hbaker-2712940850570001@192.0.2.1>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:28153 sci.math.num-analysis:10442 comp.arch:16045

Summary: We give an empirically determined upper bound on the time it
would take a chain reaction in a critical mass of Pentiums to melt down
a database, together with the program used to determine it.

In article <hbaker-2712940850570001@192.0.2.1>,
Henry Baker <hbaker@netcom.com> wrote:
>In article <3dp96r$h7r@mhaal.inhouse.compuserve.com>, "M. L."
><mlane@csi.compuserve.com> wrote:
>> > No, the fallout from the Pentium disaster is likely to be worse than
>> > Chernobyl (in the U.S., at least), since most of the contaminated data
>> > is here.
>> This is without a doubt the most idiotic statement found in the
>> sea of stupidity flowing through this news group.
>Perhaps you've never attempted to do any serious floating point work.
>I once worked at a research hospital where we did multiple regression
>analyses that went on for days.  We were attempting to separate out
>the influences from different variables in very noisy data.
>A problem like the Pentium bug is the worst possible problem.  It isn't
>large enough to be 'obviously' wrong, but you have to re-do _all_ of
>your calculations.

In the physical sense it is absurd to compare the Pentium with
Chernobyl.  I've been sitting beside mine for weeks; had it been
remotely comparable to Chernobyl I wouldn't be able to lift a finger to
type.

Of course that's not what Henry had in mind, he meant a suitable
analogy in the data processing world.  What I found myself wondering
was, just how broad does this analogy have to be in order to be apt?
On investigating this I found that the analogy can be taken far more
literally than even M. Lane presumably had in mind.

Chernobyl happened when its operators lost control of the reactor's
chain reaction while conducting an ill-considered experiment.  When the
world's Pentiums ship each other flawed data, a natural question to ask
is whether the data processing equivalent of a sustainable and
uncontrolled chain reaction is possible.  Continuing the analogy, is
the Pentium's equivalent of a nuclear half-life (the time between
errors) sufficiently short, and the decay products (the error
magnitudes) sufficiently strong that a chain reaction in a sufficiently
dense mass of Pentiums (relative to the number of participating
non-Pentiums) could both start in a reasonable time and sustain
itself?

Now we have seen many examples that trigger the bug with a probability
greater than the official one in nine billion.  The most questionable
examples are those based on specific quotients such as 4195835/3145727
and 4.999999/14.999999, or somewhat more generic computations such as
x/(d.d-d.d); many have wondered just how often these examples are
really encountered in everyday computation, while many others are
convinced that they are quite likely enough to present a real risk.
Also questionable are the examples that embed some understanding of the
nonuniformities in the distribution of the bad operands, e.g. my
small-bruised-integer measurements.  Just how at-risk really is a
typical program that is ignorant of those nonuniformities?

With these concerns in mind, I designed the following "lab experiment,"
with a Pentium as the lab.  The design of the experiment is not at all
specific to the Pentium, and is simple enough both to implement in a
reasonable time and to at least partially understand its nature.  Its
purpose is to allow one to observe the rate at which errors appear, and
to study their propagation.

The upshot was that errors appeared about a thousand times sooner than
the random model predicts, and they propagated energetically enough to
sustain a chain reaction essentially analogous to those taking place in
atomic bombs and nuclear reactors.

The experiment simulated a computational world of indeterminate scope,
one that could correspond either to a distributed database run by many
computers or a centralized one run by one computer.  This database's
knowledge is obtained partly from "axioms" or "sensory input", i.e.
clean data from outside the scope of the experiment, and partly by
"deduction," i.e.  computation performed within its scope.  The latter
is where the Pentium-induced damage arises.

I had originally considered taking the "sensory input" to consist of
financial amounts such as $9.98.  However the more complicated the
source the more suspect it becomes in terms of reflecting a possible
bias towards Pentium-bad numbers.  So in the end I settled for a
constant incoming stream of 1's, on the theory that if the number 1 all
by itself is Pentium-bad, things are much worse than we thought!

"Deduction" was simply the inference of new numbers from previously
obtained ones via arithmetic operations chosen at random.  All
divisions were by flawed Pentium, corresponding to all computers
participating in the maintenance of this database being such.  I did
not investigate in this experiment what lesser density of Pentiums
might still constitute a critical mass for a chain reaction.

In more detail, I first set up a numeric database consisting of a
single number, namely 1.  I then set a flawed Pentium (not mine, it's
been fixed) to work adding new numbers to this database.  These numbers
came from two sources.  The first source was the database itself; two
numbers were selected at random from the database, then combined by an
operation randomly selected from among the four operations of addition,
subtraction, multiplication, and division, then stored in the
database.  The second source was a stream of 1's, constituting "fresh
data" moderating any chain reaction that might start up like a graphite
rod in a reactor core.  These sources were weighted equally for the
groups plus-minus, times-divide, and the ones, one third each.  This
was achieved simply by viewing these as six sources: four for the
arithmetic operations and two for the stream of 1's treated as a pair
of such streams.  These six sources were sampled randomly, more
precisely pseudorandomly with a uniform distribution using the Unix
library integer rand() function.

I did not want the database to fill up with improbably large or tiny
numbers, whose absurd sizes might raise eyebrows; on the other hand I
also did not want an artificial cliff in the distribution, whose
precise location might raise questions.  I therefore wrote a macro
ablg(x) to compute (int)(fabs(log_2(x)/32)) fast and used it in
((rnd=(rnd%15)+1)%(ablg(cl)+ablg(di)+1) == 0) as a fast stochastic
filter of high-magnitude-exponent numbers whose effect is first felt on
positive numbers lying outside [2^-31,2^32] (and symmetrically for
negative numbers).  This filter rolls off gently, with probability 1/3
of retaining a number in [2^32,2^96] (and symmetrically in
[2^-95,2^-31]), down to 1/15 for [2^416,2^480] (in general 1/(2n+1) for
[2^{32(2n-1)},2^{32(2n+1)}] for n=1,...,7), and probability 0
thereafter (via the %15).

After the database fills up, the program creates space for each new
number by overwriting a randomly selected old number.  Although I have
32Mb, I set the storage limit to 750,000 double-precision numbers
(.75*8*2 = 12 Mbytes, there being two copies of the database) so that
the program would run on 16Mb machines without swapping (the accesses
are highly nonlocal and would cause thrashing if swapping happened).

To measure errors I ran two such databases, clean and dirty, in
lockstep, performing the same operations on both using the same operand
addresses and making all decisions identically for both (e.g the above
stochastic filter resolves disagreement between the clean and dirty
numbers by in effect taking their geometric mean).  The only difference
is that clean division uses the Mathworks' Matlab solution.  (For this
sort of work a dirty Pentium is much faster than a clean one because it
is much faster to emulate clean division on a dirty Pentium than vice
versa.)

Before giving the results of this experiment it is worth considering
what might plausibly have been expected.  Because no effort was made to
produce Pentium-sensitive numbers of the kind familiar from the IBM
report and my small-bruised-integer response functions (bug1 and bug2
in /pub/FDIV/individual.bugs on boole.stanford.edu), one might expect
only one out of every 9 billion divisions, and hence one out of every
6*9 = 54 billion numbers produced, would be in error.  One would
therefore expect to wait about four days for the first error, and only
then would the interesting question arise as to whether that error
would have time to clone itself sufficiently before it and its first
few clones were all overwritten by the steady inbound flow of clean
data.  Even if the error survived, its expected size would be very
small: only one in every 40 or so errors is in the 5th decimal digit,
and hence occurs only every 160 days at full speed (or every 40*27,000
~ one million years at the 1000 divisions per day rate).

In fact what happened was that my database effectively "melted down"
altogether in 280 "cycles" taking 32 minutes, with 20% of its 750,000
numbers having a relative error more than 0.1 and more than 6% having a
relative error more than 1 (i.e. meaningless).  Although a few more
cycles would have seen yet further deterioration, my experiment was
programmed to stop at that point: enough is enough!

A "cycle" is defined here as the arrival of just enough numbers to fill
the database, less a few filtered-out oversize numbers.  On an isolated
Pentium a cycle thus consists of 750,000 consecutive operations
(including 1's), most of which put a number into the database.  My P90
gets through a cycle in 7 seconds; were it not for other issues like
locality and propagation delays a network of a thousand Pentiums would
accomplish a cycle in 7 milliseconds.  Meltdown took 32 minutes on my
machine, but if the same operations were to be performed in parallel it
could occur in seconds in a sufficiently large and fast network of
Pentiums.  (Ordinarily a sequential machine can create long sequential
dependencies that would invalidate this reasoning by creating long
error propagation paths that parallel computers would not have time to
develop, but the random access pattern of this reactor creates
dependency chains of an expected length sufficiently low as to be
consistent with those arising in much more parallel computation,
usually an essential component of "embarrassingly parallelizable.")

Here is the printout detailing the meltdown.  The vertical axis is
time, increasing downwards linearly; each line is printed after 20
cycles (so line 9 comes at the end of cycle 180).  The horizontal axis
is space: each of the 50 or so dots on a line denotes 2% of the
database, the whole space being ordered left to right by decreasing
relative error, with hexadecimal digits denoting precision boundaries:
6 for 10^-6, c for 10^-12, etc.

   1  0123456789a.b.c.d................................................
   2  0123456789.a.b.c.d................................................
   3  012345.6.7.8.9.a.b.c.d...............................................
   4  012345.6.7.8.9.a.b...c...d..........................................
   5  012345.6.7.8.9..a....b.......c....d.................................
   6  0123.4.5.6.7.8...9.....a......b.....c..d............................
   7  012.3.4.5.6.7...8........9......a....b..c.d...........................
   8  012.3.4.5..6....7.......8.......9...a..b.c.d...........................
   9  .0.1.2.3.4..5......6.......7.....8...9.a.b.c.d..........................
  10  .0.1.2.3..4.....5.........6....7..8.9.a.b.c.d.........................
  11  .0.1.2...3......4.......5.....6..7.8.9.a.b.c.d.........................
  12  .0.1...2........3......4....5..6.7.8.9.a.b.c.d........................
  13  .0....1......2........3...4..5.6.7.8.9.a.b.c.d........................
  14  ....0........1......2....3..4.5.6.7.8.9.a.b.c.d........................

The dots immediately following a hex digit or at the start of the line
actually represent anywhere from a single entry to 2% of the database
(whence the slight line-to-line variations in number of dots), the
other dots represent exactly 2%.  On line 6, the three dots between 8
and 9 indicate that between 4% and 6% of the database has relative
errors between 10^-8 and 10^-9, while the single dot between 3 and 4 on
line 6 means that there is at least one error in the range 10^-3 down
to 10^-4.  Absence of a dot, e.g. between 2 and 3 on line 6, means no
errors at all in the range 10^-2 to 10^-3.  Dots to the right of d
denote errors less than 10^-13, including correct results.  Because 1/3
of the incoming data consists of 1's, which are automatically correct,
this region does not shrink below 1/3 of the whole database, which is
why d slows down instead of going all the way across the page.

The boundaries start out squished together in the upper left.  Each
boundary takes turns rushing across the no-man's land in the middle, to
join its comrades huddling together in the lower right.  In terms of
error populations this means that the correctness of the database
deteriorates explosively; errors blow up by an order of magnitude every
twenty cycles or so.  The experiment stops when more than 2% of the
data has unit relative error (is meaningless); here it had reached 6-8%
by the time this stopping condition was examined, at which point at
least 20% of the data base had relative errors above 0.1.  Had it not
been stopped the error explosion would have continued at the same speed
and the data base would in a few dozen more cycles be a mixture of 1's
and completely meaningless numbers, total meltdown.

Caveats.  There are several important caveats in interpreting these
results.  The most basic caveat is that any chain reaction, whether
physical or computational, is highly sensitive to the prevailing
conditions.  Thus small changes to various parameters may increase or
decrease both the speed of onset of the reaction and its ability to
sustain itself.  In particular I would not expect it would take very
much of a dilution, in the form of either non-Pentiums or clean
Pentiums contributing to the process, or a different mix of incoming
data and operations performed on the data, to moderate this reaction to
below the level at which it can sustain itself.  Thus few if any
real-world computing environments would be likely to provide the right
conditions for this chain reaction.  With the large number and variety
of today's environments however the possibility that some might be at
some risk of "going critical" should be kept in mind.

Offsetting these reassuring considerations is the less reassuring
possibility that the incoming data may include either many small dollar
amounts, shown by the IBM work to be at well above average risk on the
Pentium, or my explicitly bruised integers, e.g. rationals as decimals
truncated by mandate, or data read from a decimal calculator showing
4.999999 or from an analog-to-digital device that is positioned almost
on an integer.  When the stream of 1's is replaced by such
Pentium-sensitive data the onset of a chain reaction could conceivably
be much faster, and/or the critical mass of Pentiums (as measured by
the rate of flawed divisions) required to sustain it could conceivably
be much lower.

All these variations almost certainly will have some effect.  How much
awaits further experimental investigation, as does the question of
which variations have any likelihood of arising in real computing
environments.

As an independent application, by avoiding Pentium-specific knowledge
the experiment might make a useful general-purpose FPU test when left
to run for a week, provided its "clean" source is obtained from a
reference machine.

Here is the program, which may be freely distributed and modified
without my permission.  It runs under Linux but should require little
or no modification for other OS's (on a Pentium of course).  It would
ease my mind a bit if when you hack it up you put your name, date, and
hack at the top, in reverse chronological order, so I get neither the
credit nor the blame due you.

==========clip here=== reactor.c ======

/* Reactor simulator; generates chain reaction using Pentium
 * division errors as decay products.
 * Please log changes/author/date here in reverse chronological order
 * Author: V. Pratt, 12/29/94, pratt@cs.stanford.edu
 */
#include <stdlib.h>
#include <math.h>
#include <stdio.h>

#define MAXW 750000	/* steady-state population (should fit in 16Mb) */
#define MAXE 14		/* report error distribution down to 10^-14 */
#define PR   20		/* # cycles between error distribution reports */
#define ablg(x) abs(16-(0x7c&(2+(((char*)&(x))[7])))/4)
			/* fast approx. to (int)(fabs(log_2(x)/32)) */

int tick, cycle, W, grp[MAXE+1];
double clean[MAXW], dirty[MAXW];

main(argc, argv)
	char *argv[];
{
	int opd1, opd2, opr, ans, rnd, relcnt, watch;
	int i, j, e;
	double cl, di, rel, tem;
	watch = rnd = relcnt = 0;
	if (argc == 2)		/* any argument -> watch errors that */
		watch = 1;	/* are large relative to recent examples */
	rel = 1e-15;		/* initial setting for "relatively large" */
	/* initialize world with one element, 1.0 */
	clean[0] = dirty[0] = 1.0, W = 1;
	for (cycle = 1; grp[0] < MAXW/50; cycle++) {
		for (tick = 0; tick < MAXW; tick++) {
			/* Choose random expression "opd1 opr opd2" */
			opd1 = rand()%W;
			opd2 = rand()%W;
			opr = rand()%6;
			/* Evaluate expression */
			switch (opr) {
			case 0:	/* add operands */
				cl = clean[opd1] + clean[opd2];
				di = dirty[opd1] + dirty[opd2];
				break;
			case 1:	/* subtract operands */
				cl = clean[opd1] - clean[opd2];
				di = dirty[opd1] - dirty[opd2];
				break;
			case 2:	/* multiply operands */
				cl = clean[opd1] * clean[opd2];
				di = dirty[opd1] * dirty[opd2];
				break;
			case 3:	/* divide operands */
				cl = at_risk(clean[opd2])?
					((15./16)*clean[opd1])/
					((15./16)*clean[opd2]):
				     clean[opd1]/clean[opd2];
				di = dirty[opd1] / dirty[opd2];
				break;
			default:/* output 1.0 */
				cl = di = 1.0;
				break;
			}
			/* reduce probability of values far from 1 */
			if ((rnd=(rnd%15)+1)%(ablg(cl)+ablg(di)+1) == 0) {
				/* result passed, store it */
				ans = W<MAXW? W++: rand()%MAXW;
				clean[ans] = cl;
				dirty[ans] = di;
			}
			/* if watching, show relatively large discrepancies */
			if (watch && fabs(cl-di) > rel*fabs(cl)) {
				printf("%4d:%-8d%18.14g %18.14g %7.2g\n",
					cycle, tick, cl, di, (cl-di)/cl);
				if (((relcnt++)%10) == 0)
					rel *= 2;
			}
		}
		/* Every PR cycles, print distribution of errors */
		if ((cycle%PR) == 0) {
			printf("%4d  ", cycle/PR);
			for (i = 0; i <= MAXE; i++)
				grp[i] = 0;
			for (i = 0; i < W; i++) {
				tem = (clean[i]-dirty[i])/clean[i];
				e = .301*(1026-(((short*)&tem)[3]&0x7ff0)/16);
				e = e < 0? 0: e > MAXE? MAXE: e;
				grp[e]++;
			}
			for (i = 0; i <= MAXE; i++) {
				for (j = 0; j < grp[i]; j += W/48+1)
					printf(".");
				printf(i<MAXE? "%x": "\n", i);
			}
			fflush(stdout); /*needed when stdout is a file! Bug? */
		}
	}
}

at_risk(double x)
{
	return	(((char*)(&x))[5]&0xfc)==0xfc	&&
		((0x2492>>(((short*)(&x))[3]&0xf))&1);
}

===== end of reactor.c ==========
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4   http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug66----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 10467 of sci.math.num-analysis:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,sci.math.num-analysis,comp.arch
Subject: Re: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Contaminated Data Persists)
Date: 1 Jan 1995 09:59:24 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 343
Distribution: inet
Message-ID: <3e5uds$9v5@Radon.Stanford.EDU>
References: <3dn52g$ohe@newsbf02.news.aol.com> <hbaker-2712940850570001@192.0.2.1> <3e097i$952@radon.stanford.edu> <phrD1np8y.M1K@netcom.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:28381 sci.math.num-analysis:10467 comp.arch:16078

In article <phrD1np8y.M1K@netcom.com>, Paul Rubin <phr@netcom.com> wrote:
>In article <3e097i$952@radon.stanford.edu>,
>Vaughan R. Pratt <pratt@Sunburn.Stanford.EDU> wrote:
>>In fact what happened was that my database effectively "melted down"
>>altogether in 280 "cycles" taking 32 minutes, with 20% of its 750,000
>
>Is this an interesting result?  It sounds like your program computes
>the evolution of what must be a chaotic system (your database) (is
>*that* an interesting result?).  You've shown that after a certain
>amount of time the dirty-Pentium results differ substantially
>from correct IEEE results, but the IEEE results might *also*
>be totally numerically meaningless.

Oops.  Indeed.  It's not terribly interesting to know that A took n
cycles to melt down relative to B if both had long before then melted
down relative to the truth.

>What happens if you do the same calculation with extended precision
>(i.e. storing all numbers as 80-bit floats) and comparing the results
>with "clean" double (64-bit) precision?

Sigh.  It turns out that if you take this hit on every instruction (in
contrast to the hit from the FDIV bug, which though much bigger in
magnitude is much smaller in rate, about one in every few million
instructions in this setup as it turns out) then the meltdown of double
precision relative to extended precision takes 160 cycles, in contrast
to 280 cycles for the Pentium error relative to double precision.

I had put off doing this experiment because I had been expecting this
difference to be so small as to take far longer to propagate, but in
practice it is not.  In hindsight this was a big mistake.

Databases don't seem to melt down relative to the world, not even on
account of using single precision arithmetic.  Why they don't is an
excellent question.  I thought about this for a bit and decided that a
basic difference between my database and a real one is that mine is
untyped: it lets any two things be combined in any way.  Although we
aren't counting apples and oranges, one could imagine that each of the
1's streaming in counted one kumquat, the type Brian Reid would use
here.  Now it is fine to add 1 kumquat plus 1 kumquat to get 2 two
kumquats.  And it is also fine to multiply them, in which case you get
one square kumquat.  What is bad is to add kumquats to square
kumquats.  But it is ok to multiply them, getting cubic kumquats.  And
you can divide a kumquat by a kumquat to get a pure number, call this
type *unit*.  If you now divide 3 cubic kumquats into 12 units, you get
4 units per cubic kumquat.  And so on.

So I converted my untyped reactor into a typed one, somewhat
arbitrarily set limits of plus and minus 3 (cubic and cubic^-1) for the
types, and only permitted correctly typed operations.  This seemed to
bring the *size* of numbers under some control; they didn't seem to
want to fluctuate so much.  It also postponed the meltdown *a little*,
getting to 200 instead of 160.  But this was still much sooner than
with the Pentium numbers, which got to 280 cycles before reaching the
corresponding point.

So what other differences might there be between my database and a real
one?  Well, a number that has been obtained by a long chain of
calculations is at great risk of accumulating a lot of errors along the
way.  I therefore decided to not trust such numbers and to omit them
from the database.  I defined the depth of each incoming 1 as 0, and
the depth of each calculated number as 1 plus the maximum depth of its
operands, and arbitrarily set 100 as the maximum permissible depth.

This had the following dramatic effect (see my original message on
chain reactions for the meaning of this printout).

   1  012345.6.7.8.9.a.b.c.d................................................
   2  012345.6.7.8.9.a.b..c...d............................................
   3  0123456.7.8.9.a.b.c.d................................................
   4  01234567.8.9.a.b.c.d................................................
   5  0123456.7.8.9.a.b.c.d................................................
   6  01234567.8.9.a.b.c.d................................................
   7  0123456789.a.b.c.d................................................
   8  01234567.8.9.a.b.c.d................................................
   9  012345678.9.a.b.c.d................................................
  10  012345678.9.a.b.c.d................................................
  11  01234567.8.9.a.b.c.d................................................
  12  0123456789.a.b.c.d................................................
  13  01234567.89.a.b.c.d................................................
  14  0123456789a.b.c.d................................................
  15  0123456789a.b.c.d................................................
  16  012345678.9.a.b.c.d................................................
  17  012345678.9.a.b.c.d................................................
  18  012345678.9.a.b.c.d................................................
  19  01234567.89.a.b.c.d................................................

For the first 20 cycles (line 1) the database started to melt down as
before, with errors reaching above 10^-6.  But at cycle 40 (line 2)
things had not gotten any worse, and after that the errors that had
initially rushed down to 10^-5 started to get cleaned out by the
incoming clean 1's.  By cycle 140 (line 7) all errors greater than 10-9
were gone.  Thereafter occasional errors would creep in, all less than
10^-7, but sometimes there would be no errors bigger than 10^-10, as at
cycles 280 and 300.

So avoiding long dependency chains seems like an excellent strategy for
avoiding database meltdown.  Types in contrast merely slowed down the
onset of meltdown by about 20%.  I've always been a bit atheistic about
types (10 years at typeless MIT can do that to one :), and I've also
long had a deep mistrust of long proofs (I've never met a long proof I
understood).  These quantitative observations about meltdown provide
some kind of confirmation of both of my intuitions.

Now, back to the Pentium bug.  With the reference database now acting
more stably, it becomes a much more meaningful exercise to compare
clean and Pentium-flawed division with respect to the rate at which the
latter can melt things down.  I chose to run both databases at double
precision, assuming that the outcome would not be very different from
running both at extended precision (= long double) (I don't know how
true this is).

Well, the outcome was mixed.  On the one hand, an error appeared on
average once every million divisions, at least for the first 200
cycles.  Judging by their behavior (they tended to come in bunches
within a few cycles, with gaps of dozens of error-free cycles in
between) I would estimate that a third of these were FDIV errors and
the other two thirds inherited those errors by arithmetic before being
wiped out by the stream of 1's.  This corresponds to 1 FDIV error in 3
million divisions, which is 3,000 times the 1 in 9 billion rate
assuming random data.  This is significant given that, although the
incoming data was not at all random, being all 1's, the operands and
operations used in obtaining new values from old were completely
random.  So even though no Pentium-specific knowledge at all was built
into this model, erroneous operands were found by the model at a
relatively high rate.  The errors covered the spectrum fairly uniformly
(by exponent) from 10^-6 down to 10^-13.

But then I ran the experiment to 400 cycles and no further errors
appeared, diluting all the numbers of the preceding paragraph by a
factor of 2.  But then from cycles 402 to 420, 160 errors appeared!  So
the number of errors can fluctuate wildly.  But this particular big
bunch of errors all stayed within the range 10^-10 to 10^-14, as such
pretty inconsequential.

On the other hand all the above FDIV-induced errors were swept up by
the replacement process much more quickly than they were generated.  In
constrast, the constant stream of truncation errors at double precision
as measured relative to extended precision came faster than they could
be swept up, maintaining a population of errors that extended down to
10^-10 with occasional excursions down to 10^-8 after stabilizing, as
we saw above.

I then did another experiment, changing the source of numbers from
constant 1 to random numbers of the form d.d, e.g. 6.7 or 0.2.  These
seem particularly bad in the context x/(d.d-d.d), and so might be
expected to give more errors.  And in fact cycles 11-36 showed 120
errors; had we not seen the 160 errors in cycles in 402-420 with 1's as
inputs, we might conclude that d.d was much worse than 1's.  But then
in cycles 37-200, the only errors were a batch of 9 errors in 74-78 and
one error in cycle 139.  The enormous variation within experiments thus
seems to dominate any possible variation one might suspect between
these two experiments.  Hence for this reactor program at least one
cannot infer much of an important difference between the sources 1 and
d.d.

My interpretation of all this is that, from a very pragmatic point of
view, Intel could afford to weaken its 1 in 9 billion claim to 1 in a
few million, while at the same time pointing to the distribution of
errors, which are known to occur uniformly across the spectrum of
leading bit 14 down to leading bit 52, as greatly diluting the apparent
seriousness of this thousand times higher rate.  In particular, for at
least this random model of data replenishment, such errors propagate
only a little, averaging few offspring in their brief lifetimes, with
however occasional big bursts.

The obvious caveat here is that my reactor program doesn't claim to
represent faithfully *any* real world application, which could perform
either better or worse than reactor.c with regard to their response to
the Pentium FDIV error.  For example a nonlinear least-squares fitting
program could reasonably be expected to be far more sensitive to such
errors than this naive model of database maintenance.

Nevertheless it seems to me a very interesting data point serving to
calibrate "high" error rates such as one in a few million (as opposed
to 1 in 9 billion) as to the seriousness of their ultimate impact on
relatively insensitive programs such as this one, namely not very
serious.  Despite its artificial and simplistic nature, my program may
be a sufficiently indicative model of what to expect from a "typical"
application, for those times when you *have* to have some expectation
and you have no a priori reason to expect circumstances sufficiently
different as to indicate a substantially different outcome.

I cannot recommend such an expectation for anything that puts either
people or large sums of money at any risk.  But for many relatively low
risk applications this sort of reasoning may be a perfectly adequate
justification for continuing to use a buggy Pentium.

Again I enclose the program.  Please bring any bugs in it to my
attention, in case they are responsible for any erroneous conclusions.
Remove the comments around the two instances of "long," and simplify
clean division by eliminating the at_risk clause, if you want to do the
double-vs-extended experiment on any system (not necessarily an x86)
that distinguishes long double from double.  (On a big-endian system
definitely remove at_risk since it assumes little-endian order.)

=======clip here====reactor.c========

/* Reactor simulator; generates chain reaction using Pentium
 * division errors as decay products.
 * Author: V. Pratt, 12/29/94, pratt@cs.stanford.edu
 */
#include <stdlib.h>
#include <math.h>
#include <stdio.h>

#define MAXW   750000	/* steady-state population (should fit in 16Mb) */
#define MAXE   14	/* report error distribution down to 10^-14 */
#define PR     20	/* # cycles between error distribution reports */
#define TYPLIM 3	/* no types higher than cubic kq or per cubic kq */
#define MAXDEP 100	/* no dependency chains longer than 100 */
#define ablg(x) abs(16-(0x7c&(2+(((char*)&(x))[7])))/4)
			/* fast approx. to (int)(fabs(log_2(x)/32)) */

int tick, cycle, W, grp[MAXE+1];
/* long */ double clean[MAXW];
double dirty[MAXW];
char typ[MAXW];
int depth[MAXW];

main(argc, argv)
	char *argv[];
{
	int opd1, opd2, opr, ans, rnd, relcnt, watch;
	int i, j, e, ty, de;
	/* long */ double cl;
	double di, cld, rel, tem;
	watch = rnd = relcnt = 0;
	if (argc == 2)		/* any argument -> watch errors that */
		watch = 1;	/* are large relative to recent examples */
	rel = 1e-16;		/* initial setting for "relatively large" */
	/* initialize world with one element, 1.0 */
	clean[0] = dirty[0] = typ[0] = W = 1;
	depth[0] = 0;
	for (cycle = 1; grp[0] < MAXW/50; cycle++) {
		for (tick = 0; tick < MAXW; tick++) {
			/* Choose random expression "opd1 opr opd2" */
			opd1 = rand()%W;
			while (depth[opd1] >= MAXDEP && ++opd1 < W);
			if (opd1 == W)
				continue;
			opd2 = rand()%W;
			opr = rand()%6;
			ty = typ[opd1];
			i = -1;
			/* require correct types and bounded depth */
			switch (opr) {
			case 0: case 1:
				while ((typ[opd2] != ty ||
					depth[opd2] >= MAXDEP) && ++opd2 < W);
				break;
			case 2: i = 1;
			case 3: do { ty = typ[opd1] + i*typ[opd2];
				} while ((ty<-TYPLIM || TYPLIM<ty ||
					depth[opd2] >= MAXDEP) && ++opd2<W);
				break;
			default:ty = 1;
				de = 0;
				break;
			}
			if (opd2 == W)
				continue;
			if (opr < 4)
				de = 1+(depth[opd1]>depth[opd2]?
					depth[opd1]: depth[opd2]);
			/* Evaluate expression */
			cl = di = 0;
			switch (opr) {
			case 0:	/* add operands */
				cl = clean[opd1] + clean[opd2];
				di = dirty[opd1] + dirty[opd2];
				break;
			case 1:	/* subtract operands */
				cl = clean[opd1] - clean[opd2];
				di = dirty[opd1] - dirty[opd2];
				break;
			case 2:	/* multiply operands */
				cl = clean[opd1] * clean[opd2];
				di = dirty[opd1] * dirty[opd2];
				break;
			case 3:	/* divide operands */
				cl = at_risk(clean[opd2])?
					((15./16)*clean[opd1])/
					((15./16)*clean[opd2]):
				     clean[opd1]/clean[opd2];
				di = dirty[opd1] / dirty[opd2];
				break;
			default:/* output 1.0 */
				cl = di = 1.0;
				break;
			}
			cld = cl;
			/* reduce probability of values far from 1 
			if ((rnd=(rnd%15)+1)%(ablg(cld)+ablg(di)+1) == 0) { */
			if (!(ablg(cld) || ablg(di))) {
				/* result passed, store it */
				ans = W<MAXW? W++: rand()%MAXW;
				clean[ans] = cl;
				dirty[ans] = di;
				typ[ans] = ty;
				depth[ans] = de;
				/* if watching show discrepancies */
				if (watch && fabs(cld-di) > rel*fabs(cld)) {
			   printf("%4d:%-8d%2d %18.14g %18.14g %7.2g%12d\n",
					cycle, tick, ty, cld,
					di, (cld-di)/cld, depth[ans]);
				   fflush(stdout);
				   if (((relcnt++)%10) == 0)
					rel *= 2;
				}
			}
		}
		/* Every PR cycles, print distribution of errors */
		if ((cycle%PR) == 0) {
			printf("%4d  ", cycle/PR);
			for (i = 0; i <= MAXE; i++)
				grp[i] = 0;
			for (i = 0; i < W; i++) {
				tem = (clean[i]-dirty[i])/clean[i];
				e = .301*(1026-(((short*)&tem)[3]&0x7ff0)/16);
				e = e < 0? 0: e > MAXE? MAXE: e;
				grp[e]++;
			}
			for (i = 0; i <= MAXE; i++) {
				for (j = 0; j < grp[i]; j += W/48+1)
					printf(".");
				printf(i<MAXE? "%x": "\n", i);
			}
			fflush(stdout);
		}
	}
}

at_risk(double x)
{
	return	(((char*)(&x))[5]&0xfc)==0xfc	&&
		((0x2492>>(((short*)(&x))[3]&0xf))&1);
}

=========end of reactor.c==========
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/ABSTRACTS
				      http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug67----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 16081 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,sci.math.num-analysis,comp.arch
Subject: Re: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Contaminated Data Persists)
Date: 1 Jan 1995 20:58:06 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 95
Distribution: inet
Message-ID: <3e750u$rv1@Radon.Stanford.EDU>
References: <3dn52g$ohe@newsbf02.news.aol.com> <3e097i$952@radon.stanford.edu> <phrD1np8y.M1K@netcom.com> <D1q41r.39E@dsg.tandem.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:28427 sci.math.num-analysis:10476 comp.arch:16081

In article <D1q41r.39E@dsg.tandem.com>,
Anil Das <anil@nskernel.tandem.com> wrote:
>
>	With (correct) double precision math the only way
>21E7 operations can give relative errors more than 1
>(or even 1E-8) is for a number of subtractions of nearly equal
>quantities to occur. Maybe the experiment should be rerun after
>removing the subtract operation?

While ten million is a theoretical upper bound on chain lengths, the
observed upper bound was roughly 200,000 times less (idle conjecture:
750,000/e less?).  This was still enough to let the considerable
heating effect of subtraction do its work, to my great surprise not to
mention mortification (*so* embarrassing...).

I'm not sure how meaningful the results are without subtraction, but I
tried it anyway for both situations, double vs. extended precision.
and Pentium divide vs. correct.  For the former the (many) errors got
no worse than 5*10^-14 at 280 cycles, i.e. not much error propagation.
For the latter, in 240 cycles there were 459 errors, i.e. not many and
they kept being swept up within a few cycles.  Most were under 10^-10,
but there were five relatively large ones: 1.5e-08, 3.4e-06, 6e-08,
4.3e-08, and 6e-08.

Other experiments suggest themselves here, such as greatly diluting the
1's to see what happens when their mopping-up or moderating influence
is removed; those few but large errors mentioned just above might make
a much bigger nuisance of themselves then.  However a more meaningful
experiment than simply removing subtraction might be to allow it but
put some limit on how small the difference can get relative to the
operands, e.g. .001 or .5.  This could be taken as a crude model of the
improved stability obtained for ill-conditioned data by careful
pivoting, Householder triangulations, etc.  I would guess that the .001
limit would not make a big difference but that .5 would not be terribly
different from no subtraction at all with regard to rate of meltdown.
Too many fun experiments possible, too little time.

>	a) Has any attempt been made to run the experiment
>with a different random number generator or the same random
>number generator with a different seed? That will show how
>sensitive the results are to initial conditions. The more
>sensitive, the more chaotic the whole thing is.

I ran quite a few experiments with other parameters being varied, which
achieves the same end.  In general the setup is quite chaotic as I
indicated under "Caveats" in my original message.  I did in fact try
varying only the seed early on, just to be sure, and saw the same
chaotic variations as when varying other parameters.

>	b) Is a count of the actual number of erroneous divisions
>that occured available?
>
>	di = dirty[opd1] / dirty[opd2];
>	if (at_risk(clean[opd2])) {
>		cl = ((15./16)*clean[opd1]) / ((15./16)*clean[opd2]) ;
>		if (cl != di && dirty[opd1] == clean[opd1] 
>			&& dirty[opd2] == clean[opd2])
>			bad_division_count++ ;

No, something like this would be a good addition, to distinguish
original from inherited FDIV errors.  But I would prefer to count all
FDIV bugs, not just those on clean data as per the above, since it is
possible for the bug to "feed on itself."  Tim Coe showed me a very
nice example of this, which I've expressed here in C:

main()
{
	int i;
	double x = 9.2, y = 19.4;
	x -= 4.2, y -= 4.4;
	for (i = 0; i < 4; i++) {
		printf("want %20.15f\n got %20.15f  %g\n",
			(1.01*x)/(1.01*y), x/y, (x - (x/y)*y)/x);
		x = (x/y)*y;
	}
}

This starts out with x = 9.2 - 4.2, y = 19.4 - 4.4, and repeatedly
divides x by y and then multiplies the y back in.  Here's how the
quotient and relative error looks after doing this up to 4 times---it
stabilizes at the 4th step at a relative error of 3*10^-6.

want    0.333333333333333
 got    0.333333333332363  2.91038e-12
want    0.333333333332363
 got    0.333333329358720  1.19209e-08
want    0.333333329358720
 got    0.333332312106116  3.05176e-06
want    0.333332312106116
 got    0.333332312106116  0

Your term "cyberentomologist" has a catchy ring to it.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/ABSTRACTS
				      http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug68----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 28507 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: No one seems to understand FP arithmetic :
Date: 2 Jan 1995 22:26:32 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 206
Message-ID: <3e9uio$cf2@Radon.Stanford.EDU>
References: <hbaker-2812940858100001@192.0.2.1> <1995Jan1.112135.2002@zcon.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <1995Jan1.112135.2002@zcon.com>,
Syed Zaeem Hosain <szh@zcon.com> wrote:
>
>>Perhaps if someone can communicate in 1000 words or less what
>>the shouting is about, we can more easily avoid embarrassments
>>like this in the future.
>
>
>Sure thing. The flaw is that there are a set of numbers for which the

Beautifully written and way under budget (500 words).  Some picky
points, and some not so picky ones in the second half.  I think I can
get these close to the 1000 word limit.

>division instructions (FDIV, FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR,
>FPREM, FPREM1) and the instructions that use the division algorithms
>(FPTAN, FPATAN, FYL2X and FYL2XP1) in the Pentium processor will not
>provide correct results starting in the 9th bit (approx 4th decimal
>digit) onwards

No, 14th bit (counting the hidden mantissa bit as bit 0), 5th decimal
(a relative error of 6*10^-5).  For x normalized (i.e. in [1,2)), when
an absolute error e is added to x, the best definition of "bit position
of the error e" is the distance of e's leading 1 from the binary point,
with e=1/2 being an error in bit 1.  When the bit position of the error
is defined as the first position where x differs from x+e, one gets
meaningless results, as in the case x = 1.0111111, e = 0.0000001.
Comparing x+e = 1.1000000 bit by bit with x locates the error in bit 1
when it is obviously in bit 7.

>- with about equal probability that the error will be in
>any one of the bits after that position.
>Specifically numbers where the
>divisor has the following bit patterns in the most significant bits:
>1.0001, 1.0100, 1.0111, 1.1010 and 1.1101.

...and the next six bits must be 1, and another eight or more 1's
greatly increases the probability of error.

>and incorrect numerical calculations under certain conditions. If
>looked at in the space of all potential floating point calculations
>(randomly distributed), about 1 in 9 billion divisors provides an
>incorrect result.

No, 1 in 50,000 divisors provides an incorrect result (159 single
precision divisors out of 2^23 yield single precision errors, 168
single precision divisors yield double or extended precision errors).
Your figure is for operand pairs (divisor and dividend together), and
furthermore is for the case of double precision results (see the table
in boole.stanford.edu:/pub/FDIV/individual.bugs/bug63, Internet posting
of Dec. 27).  For single and extended precision results 9 billion is
replaced by respectively 40 billion and 7 billion.  (These rates are
unexpectedly more accurate than their rounding to integers would
suggest, being off by only +.5%, -1.2%, and -1.4% respectively.)

>However, this does *not* mean that the probability of
>occurrence of the flaw is random at that rate - if you happen to be
>using one of the faulty divisors for calculations, it will occur
>consistently for *every* division done with that number. Thus the
>error rate can be a lot higher - depending on the situation. And
>unknowingly so!

Any nontrivial distribution has the property you cite.  What matters is
whether either distribution, faulty pairs or faulty divisors, has a
positive correlation with distributions of operands that could arise in
practice.  Some distributions are quite harmless, in fact the irony is
that *most* (>99.99%) of the possible distributions of faulty operands
would cause no more of a problem than Intel has claimed based on the 1
in 9 billion figure, in fact even less when one takes into account that
the significance of better than 70% of those errors is at best
theological given the combination of their low rate and size.

It is therefore extremely unfortunate for the Pentium, for Intel, and
for Intel's customers, that the bad divisors have a particularly deadly
distribution that lands them hard up against certain small integers.
When computing with random expressions applied iteratively starting
from data having relatively few digits (the constant 1 will do all by
itself!), error rates are a thousand times higher than they would have
been had the problem pairs been randomly distributed.  And when the
operand pairs are "small bruised integers," for which plausible
scenarios exist that produce these in large quantities, the rates jump
to a million times higher!

In inexplicably sticking to its widely advertised "one in 27,000 years"
figure, Intel commits not one but two fundamental "off by a million"
exaggerations, namely in two rates, the rate of divisions and the rate
of errors in those divisions.

The former exaggeration is the more blatant.  A spreadsheet user doing
only 1000 divides per day does not need a Pentium except to keep up
with the Jones.  If Pentiums are to be taken seriously as workstations
as opposed to mere calculators, that rate must be increased a
thousand-fold to a million-fold, the latter being about the rate at
which divisions were done on Toon Moene's numerical weather prediction
program, on which he has been studying the impact of the Pentium bug.

For the latter exaggeration, Intel assumes randomly distributed inputs,
where their analysis is indeed valid and furthermore is almost
certainly applicable to many applications.  This assumption is grossly
invalid however for applications of the types mentioned above that can
give rise to thousand-fold or even million-fold increases in rate.

Intel's two exaggerations combine multiplicatively to give a net
exaggeration of anywhere between a thousand and a trillion depending on
circumstances.  But how could a trillion even be possible?  Surely not
even the scenario mentioned above in which one repeatedly performs the
same flawed division over and over could go that high.  Indeed it
could, in fact here Intel's exaggeration can be easily calculated to be
a *thousand* trillion, and even worse (taking distribution of
magnitudes into account) if you pick a pair that produces the maximum
error.  But that's a scenario we can all agree with Intel president
Andy Grove would be like running to stand under a meteorite.

>Intel has finally decided to do this replacement, no questions asked,
>while [apparently still] continually grumbling about how they are being
>forced to do so. This "we are put-upon", "the Internet forced us"
>martyr act is getting a little thin.

No large company to my knowledge has worked as hard as Intel to draw
the wrath of the Internet in just a few months.  Consider.

(1)  Intel kept the flaw a secret from its customers for four months.

(2)  When discovered by Intel customer Thomas Nicely, the flaw was
minimized by Intel in terms of 27,000 years and 1 in 9 billion pairs.

(3)  Intel initially declined replacement processors for customers who
could not pass both a numerical literacy test (what exactly is your
problem, exactly?) and a need test (has this problem affected you and
if so how?).

In boole.stanford.edu:/pub/FDIV/individual.bugs/bug31, Internet posting
of Dec. 18, under the respective headings Past, Present, Future, I
argued that (1) might be hard to make stick; in any event it is water
under the bridge except for those bent on teaching Intel and/or the
industry a lesson.  And I argued that (3) was merely a PR-problem,
being the absolutely worst conceivable way of coping with an
understandably limited supply of good chips.  No one has to explain how
a fuel injector works in order to get it replaced.  And replacing items
at risk of failure is preventive maintenance; any company turning down
those of its customer cluey enough to request specific preventive
maintenance is asking to lose its cluey customers.  A more
customer-aware solution to the supply problem would do wonders for
Intel's PR while changing nothing of any significance given the limited
supply.

Intel's "repenting" of (3) on Dec. 20 was therefore the perfect PR move,
in three basic ways.

First, it acknowledged that there was a problem, confirming what had up
to that point been only a vague growing suspicion in the public's
mind.  More significantly, it identified for the public what the
problem must have been.  This effectively drew attention away from (1)
and (2), whose seriousness had previously been no more clearly defined
in the public's mind than that of (3).  And the acknowledgement defused
a crowd situation that was threatening to get out of control.

Second, it satisfied those who had been complaining about (3), by
acknowledging explicitly what customers had previously only been
permitted to read between the lines, namely that they had to wait their
turn for their chip and that there would be a long delay.  (It seems
highly unlikely that Intel could have successfully continued to
stonewall requests for replacements once supplies had improved.)

Third, it was not an unreasonably expensive move.  (i) Intel cannot
replace chips any faster than it had been doing before, so that
particular cost is not going to increase.  (ii) The "charge" mentioned
in the announcement was one that could at best have been postponed
under the circumstances.  (iii) The loss in goodwill per denied
customer far exceeds the gain in revenues per chip saved: Intel *makes*
money with this move.

Optimists are calling Intel's overnight change from Charles to Di a
shrewd move.  Pessimists question whether catching the ball once after
three fumbles qualifies one for the Heisman.  Either way it was a good
move.

This leaves just (2).

My expertise is in neither the legal issues raised by (1) nor the PR
issues raised by (3).  Rather it is in numbers, which is the issue
raised by (2).

When *any* organization, be it Intel, IBM, or Boys Town, puts other
people's money and lives at unknown but potentially great risk by
quoting numbers that are off by multiple orders of magnitude, whether
by a thousand or a trillion, in order to save its own bacon, this is
gross irresponsibility.  When it declines to enter into any public
discussion of these risks this is further abdication of that
responsibility.

There has been much talk on the net about how errors of 10^-5 can grow
by propagating.   Intel is living proof of this: the tiny .00001 error
in the Pentium has propagated around Intel and emerged in press
releases as potentially a trillion-fold error.  Now *that's* meltdown.

Intel needs to address these issues, not by pronouncements issued from
behind closed doors but in a public forum where the extent of agreement
or disagreement of the affected parties can be judged by the public.
As long as there are millions of these flawed Pentiums hard at work
every day of the year, with no official recommendation for what exactly
to put on their warning labels beyond "Intel Inside", *all* of us are
potentially affected parties.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/ABSTRACTS
				      http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug69----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 28525 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Cont
Date: 3 Jan 1995 00:24:29 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 101
Message-ID: <3ea5ft$g6f@Radon.Stanford.EDU>
References: <3dn52g$ohe@newsbf02.news.aol.com> <phrD1np8y.M1K@netcom.com> <3e5uds$9v5@Radon.Stanford.EDU> <5d5eE776NgB@me-tech.PFM-Mainz.DE>
NNTP-Posting-Host: sunburn.stanford.edu

In article <5d5eE776NgB@me-tech.PFM-Mainz.DE>,
Michael Schmidt <mschmidt@me-tech.PFM-Mainz.DE> wrote:
>  In article  <3e5uds$9v5@Radon.Stanford.EDU>
>              pratt@Sunburn.Stanford.EDU   (Vaughan R. Pratt)
>>
>> This had the following dramatic effect (see my original message on
>> chain reactions for the meaning of this printout).
>>
>>    1  012345.6.7.8.9.a.b.c.d......................................
>...some stuff deleted...
>
>I think it's important enough and would be fine (I can imagine  
>enough others would appreciate it too) if you could repost your  
>whole original message, where the meanings are explained, to the  
>comp.sys.intel newsgroup.

Let me just do it from scratch.  Anyone who wants the original message
can get it by anonymous ftp as bug65, available in
/pub/FDIV/individual.bugs on boole.stanford.edu.  The above example
comes from bug66.

For starters let me just read it out loud.  "At cycle 20 (= line 1) the
database is only mildly corrupted, with a very small scattering of
errors across much of the spectrum, none greater than 10^-5 though it
contains at least one relative error greater than 10^-6.  At least 98%
of the database is intact in the sense of containing no relative errors
larger than 10^-13."

Now here's how I got that out of the above.

Each nonleading dot in a run of dots represents about 2% of the
database entries (100/48 to be exact), arranged in order of increasing
accuracy (decreasing relative error) from left to right.  (2% of
750,000 entries is 15,000 errors.)  The leading dot in each run denotes
less, namely anywhere from a single entry to 2% of the database.

Each hex digit (0-9a-d) marks an error size boundary; digit 8 is the
boundary for relative errors of size about 10^-8.  (Note that since
digits take up space they displace dots, which therefore move around a
little as an artifact of this method of display, depending on where the
digits are concentrated.)  Adjacent hex digits, e.g. 23, indicate
complete absence of errors in that decade, namely between 10^-2 and 10^-3.

So in the line above, the most erroneous entry is between 10^-5 and
10^-6, and it is at least consistent with these rules, though very
unlikely, that there are only 8 entries to the left of d (i.e. less
accurate than 10^-13).  The other extreme would seem to be that 16% of
the database is left of d.  This seems like pretty crummy notation, and
it was really only meant for eyeballing the general effect, but there
is however a trick to reading more into this.  There were 48
non-leading dots in the above line in my original message (you deleted
ten), all to the right of d, but that accounts for the whole database
to within a dot.  Hence less than 2% of the entries can be left of the
d.

The number at the start of the line is the cycle number divided by 20
(more recently I changed the program to not divide by 20).  (A cycle is
the number of steps required to refresh the database with as many
numbers as fit in the database; Mercedes claims their climate control
does this to the car's air in 20 seconds, but they certainly don't
claim to have replaced every molecule in that time, or even in 20
minutes = 60 cycles.)  So the example is a snapshot at the end of cycle
20.

At cycle 280 of the same run we see

  14  0123456789a.b.c.d................................................

which indicates that all errors above 10^-10 have been swept out of the
database, mostly by the incoming 1's constituting 1/3 of the incoming
numbers (computing x/x will also do this but with very low probability;
x-x will not do it at all because 0's are discarded along with other
quantities below 2^-32).  The source of errors in that run were
double-precision truncations, measured using extended precision as the
standard, and they came from the arithmetic operations supplying 2/3's
of the incoming numbers.  Though tiny they were enormous in quantity
compared to FDIV errors, giving them enough "momentum" to sweep down to
10^-10 by being magnified by arithmetic before the diluting effect of
the 1's was finally able to sweep them all away.  But this is a chaotic
process and some errors would from time to time reach 10^-7, though
after cycle 120 none got past that threshold (at least none were there
at each 20-cycle snapshot).

I learned a lot about doing experiments from this experiment, but the
most useful thing I learned about the Pentium itself was I think the
fact that if you start from low precision data (and it doesn't seem to
matter too much whether it is exclusively the number 1, or random
decimal numbers of the form d.dd) and compute at random while
respecting type (don't add kumquats to square kumquats) and keeping
dependency chains short (< 100), then the kind of numbers you end up
with, when selected among at random and fed into FDIV, trigger the bug
every few million divisions.  This is three orders of magnitude higher
than for random data in Intel's sense, operands selected from the 2^23
mantissas with equal probability.  I'd like to measure this more
carefully and under a greater variety of conditions just to be sure,
but I certainly did not have to wait 24 days for an FDIV error as the
IBM report suggests, they popped up every couple of minutes or so!
27,000 years?  Give me a break.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/ABSTRACTS
				      http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug70----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 28609 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,sci.math.num-analysis,comp.arch
Subject: Re: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Contaminated Data Persists)
Date: 3 Jan 1995 18:15:14 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 39
Distribution: inet
Message-ID: <3ec47i$ki5@Radon.Stanford.EDU>
References: <3dn52g$ohe@newsbf02.news.aol.com> <D1q41r.39E@dsg.tandem.com> <3e750u$rv1@Radon.Stanford.EDU> <3ebjds$dkl@doc.armltd.co.uk>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:28609 sci.math.num-analysis:10504 comp.arch:16099

In article <3ebjds$dkl@doc.armltd.co.uk>,
David Seal <dseal@armltd.co.uk> wrote:

>pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes:
>
>>Other experiments suggest themselves here, such as greatly diluting the
>>...
>>Too many fun experiments possible, too little time.
>
>One that suggests itself to me: run three versions in parallel, one
>for buggy Pentium, one for correct double precision and one for
>correct extended precision. Prune a calculation out of the database if
>the inaccuracy of the correct double precision result relative to the
>correct extended precision result becomes too large.
>
>The rationale behind this: it is an attempt to model the (somewhat
>questionable) assumption that the programs people run have been
>designed to be well-behaved in the presence of double precision
>rounding errors: as long as double precision remains accurate relative
>to extended precision, the calculation is likely to be one that you
>could reasonably expect to come up in people's programs. 

A roughly equivalent effect turned out to be achieved by keeping
dependency chains below 100.  Above 200 the double-precision-truncation
errors were blown up to the "danger region" of relative errors above
10^-5.  (Basis: this experiment is intended to factor in considerations
of error propagation, and not too many engineering applications depend
directly on the 6th decimal digit.)

Pruning only those computations actually creating these errors rather
than the more draconian measure of pruning all that are at any risk of
them due to much arithmetic may possibly give different results.  I'm
not sure which corresponds more closely to actual computational
practice; I suppose one could interpret careful pivoting as being of
the former kind, but other computations avoid this sort of magnification
by avoiding excessively chaining of dependencies.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/ABSTRACTS
				      http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug71----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 28631 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Intel does not deliver has promised!
Date: 3 Jan 1995 20:59:53 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 36
Message-ID: <3ecds9$srd@Radon.Stanford.EDU>
References: <3ec0ou$r7b@goudron.gel.ulaval.ca>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3ec0ou$r7b@goudron.gel.ulaval.ca>,
Jean-Sebastien Dion <jsdion@etu.gel.ulaval.ca> wrote:
>
>I don't know how many of you received their new pentium chip, but I really
>think it's very rare.

I received two phone calls early in December, one to ask if I needed a
replacement and establish my eligibility, the other to get my address
and Visa number.  That was it as far as phone conversations with Intel
directly concerning the replacement were concerned.

On Dec. 27 a Fed-Ex truck arrived with a Priority Overnight package
shipped from Oregon Dec. 23, containing a fixed chip (build date Dec.
21).

Just five minutes ago (Jan. 3) the same Fed-Ex truck and driver arrived
with a package labeled with the same product code shipped Dec. 30 (build
date Dec. 23).

Conclusions.

1.  I've inadvertently signed up with the Pentium-of-the-week club.
We'll see what Jan. 10th brings. :)

2.  Fed-Ex seems reliable:  "Priority Overnight" apparently means
four days.

3.  Build-to-ship went from 2 days to 7, but Xmas may be a factor.
Does build-to-ship give any indication of volume?

I was planning to buy another Pentium anyway, my real research being
concurrency.  Intel closes another sale.  At this rate I'll be up to
200 megaflops by Easter.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/ABSTRACTS
				      http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug72----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Re: How many clocks?
Summary: 
Expires: 
References: <013873XIAEGMJREVNGYB@abq-ros.com> <3er7tr$o8f@vkhdsu01.hda.hydro.com>
Sender: 
Followup-To: 
Distribution: 
Organization: Computer Science Department,  Stanford University.
Keywords: 
Cc: 


In article <3er7tr$o8f@vkhdsu01.hda.hydro.com>,
Terje Mathisen <Terje.Mathisen@hda.hydro.com> wrote:
>
>The clock counts are 19, 33 and 39 for single, double and extended prec
>...
>  cycles = ((bits_in_mantissa + 1) >> 1) + 6;

Assuming 24, 53, and 64 bits in the respective mantissas (counting the
leading bit), this formula gives 18, 33, 38 cycles.  The formula
bits/2 + 7 gives the desired result.

>(The first iteration on the SRT radix-4 divider will only generate a single
>bit, while all subsequent cycles gives two more bits.)

I think I know what you mean, but it should be pointed out that the
first iteration can generate two bits.  Consider the case
1.1111/1.0000, for which P > 5/3D and hence for which the plot on page
6 of the Intel white paper shows that digit +2 is mandatory.  The
example in Figure 4-2 indicates that at the first iteration the
operands are aligned.  (rp(0) in the figure seems to be a typo, I think
it should be P(0).)  Hence 1.1111/1.0000 (or any dividend above
1.10000) forces this case to produce the quotient 10, two bits, at the
first iteration.

I think what you mean is that ultimately (possibly not until the very
end if the next 30 quotient digits are 0) that leading 1 is wiped out by
a subtraction.  Hence it should not be counted as one of the developed
quotient bits for the purpose of stopping when you have enough quotient
bits.

But then the case 1.0000/1.1111 (more generally, whenever D<P) wipes out
both leading bits, so to get a full 64 bits you need an additional
quotient bit.

But the complexity doesn't stop there either.  Don't assume that 2n
bits of precision can be developed in n cycles starting from the
production of the first 1 that ultimately survives.  For a messy
function like division, even n+1 or n+2 cycles is pushing it.  The
register in which the quotient is developed might well be more than 64
bits, quite apart from the precision of the operands (which too may be
more than 64 bits for use by the transcendentals).  In that case an
additional 1 or even 2 of those 7 overhead cycles may belong to the
division process proper.  For IEEE correct rounding, I believe a
comparison between the remainder and the divisor at the end is
mandatory (anyone know for sure?); this will be after the division
proper and will be part of the surviving 5 or 6 non-division cycles.
Article 29449 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: How many clocks?
Date: 9 Jan 1995 18:40:06 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 43
Message-ID: <3ervu6$5ag@Radon.Stanford.EDU>
References: <013873XIAEGMJREVNGYB@abq-ros.com> <3er7tr$o8f@vkhdsu01.hda.hydro.com>
NNTP-Posting-Host: sunburn.stanford.edu


In article <3er7tr$o8f@vkhdsu01.hda.hydro.com>,
Terje Mathisen <Terje.Mathisen@hda.hydro.com> wrote:
>
>The clock counts are 19, 33 and 39 for single, double and extended prec
>...
>  cycles = ((bits_in_mantissa + 1) >> 1) + 6;

Assuming 24, 53, and 64 bits in the respective mantissas (counting the
leading bit), this formula gives 18, 33, 38 cycles.  The formula
bits/2 + 7 gives the desired result.

>(The first iteration on the SRT radix-4 divider will only generate a single
>bit, while all subsequent cycles gives two more bits.)

I think I know what you mean, but it should be pointed out that the
first iteration can generate two bits.  Consider the case
1.1111/1.0000, for which P > 5/3D and hence for which the plot on page
6 of the Intel white paper shows that digit +2 is mandatory.  The
example in Figure 4-2 indicates that at the first iteration the
operands are aligned.  (rp(0) in the figure seems to be a typo, I think
it should be P(0).)  Hence 1.1111/1.0000 (or any dividend above
1.10000) forces this case to produce the quotient 10, two bits, at the
first iteration.

I think what you mean is that ultimately (possibly not until the very
end if the next 30 quotient digits are 0) that leading 1 is guaranteed
to be wiped out by a subtraction.  Hence it should not be counted as
one of the developed quotient bits for the purpose of stopping when you
have enough quotient bits.

But if that's what you mean, then the other extreme 1.0000/1.1111 (more
generally, whenever D<P) wipes out both leading bits, so to get a full
64 bits you yet another quotient bit.  So to determine how many cycles
are needed, one must assume that the first iteration may generate *no*
bits!

Information theoretically the loss in the leading bits is exactly 2
bits.  This follows from the residual uncertainty of a factor of 4 in
D/P after they are aligned, the endpoints being 1/2- and 2-/1.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/ABSTRACTS
				      http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug73----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 31029 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: PRESIDENT OF INTEL SPEAKS ABOUT BUG
Date: 20 Jan 1995 09:30:47 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 183
Message-ID: <3fnvs7$qu4@Radon.Stanford.EDU>
References: <3f9qlq$9ds@ornews.intel.com> <dclineD2GzB6.B2D@netcom.com> <3fh47f$mgq@news.jf.intel.com> <D2MwxG.5L9@dsg.tandem.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <D2MwxG.5L9@dsg.tandem.com>,
Anil Das <anil@nskernel.tandem.com> wrote:
>In article <3fh47f$mgq@news.jf.intel.com>, proussel@pdx144.intel.com (Patrice
>Roussel) writes:
>> All the entries of the guess PLA of the Pentium were exercized (something
>> you
>> can do with few thousand (dividend,divisor) pairs). Unfortunately the
>> problem was that some entries were missing :-(.
>
>	What is meant by `missing' here? If dividend/divisor pairs
>that use those entries where tested, they will come out as 0 instead
>of 2, and the problem would have been found, right?

This depends on how they came to be 0.  The Intel white paper
attributes the missing entries to an error in a script written to
download entries into a PLA, in which case such a test would presumably
have caught this.

However at a lecture I gave on the Pentium bug a week ago at Stanford,
William Kahan who was in the audience said that the five entries were
missing because it had been proved (erroneously) that they were never
accessed by the division algorithm, like all the entries above them in
the table.  Obviously one cannot test any of the inaccessible entries
by running the algorithm, so if you believe these five are inaccessible
then you will of course not provide a test case for them.

Having heard such conflicting accounts from Intel and Kahan, neither of
whom I am competent to contradict, I am now thoroughly mystified as to
the real origin of the bug.  At this point the only apparently relevant
fact about the table that I am 100% certain of is the following.

In the five erroneous columns of the PD-plot (these being the ones we
can measure, thanks to the bug), the six thresholds (three each in the
positive and negative halves) are set to values I have tabulated
below.  The relevant fact is that these values would be perfectly
symmetric about -1 had the erroneous threshold (at the top of the
positive half) been set correctly, namely one cell higher.  This higher
threshold would then not illegally penetrate the 8/3 D line (shown in
the Intel white paper), the root cause of the problem.

(The reason -1 is the appropriate midpoint is that sampling truncates
the partial remainder towards -oo in both halves, by an amount between
0 and 2 thanks to the redundant carry-save representation of the
partial remainder.  Had an irredundant representation been used the
truncation would have been by an amount between 0 and 1 and the
appropriate midpoint would have been -0.5.  Had the sampled partial
remainder been truncated towards oo instead of -oo the midpoint would
have been positive instead of negative.)

In order for the bug to have been caused by a faulty proof, both those
who wrote the proof and those who read it would need to believe that a
correct table could have the inner two threshold pairs (between
absolute-value quotient digits 0 and 1, and between 1 and 2) perfectly
symmetric yet have the outer pair (between 2 and "3") asymmetric.  I
have been unable to come up with any plausible failure mode for the
proof which would reasonably account for confidence in such a
peculiarly isolated asymmetry.

Perhaps none of the parties examining the table noticed that the inner
two pairs of thresholds were symmetric.  Or perhaps some reason was
advanced for why there should be a single exception to the symmetry
that somehow everyone involved found convincing.  I find neither of
these scenarios terribly plausible, and am therefore mystified as to
what form a faulty proof could have taken.

Here are the observable thresholds, originally told to me by Tim Coe
and which I have since confirmed independently for myself by verifying
that any other choice of thresholds would be inconsistent with the
observed behavior of the buggy Pentium.  I've laid this out here as a
transposed PD-plot, with the five observable divisors (17,20,23,26,29)
indexing five rows (oriented as columns in the white paper), and with
the entries (quotient digits) in each region of the table indexing the
columns.  The entries in my table are the ranges of chopped (sampled)
partial remainder which yield the indicated quotient digits.  The six
vertical bars in the table separate the seven regions and constitute
the thresholds between the regions.  (An unambiguous way to realize
these as numeric thresholds might be as the mean of the two adjoining
chopped partial remainders, e.g. at D=17 the 1-2 threshold, between 11
and 12, could be considered to be set at 11.5.)

    \    q = 0("-3")    -2         -1        0       1        2      0("3")
  D_5\---------------------------------------------------------------------
17=1.0001   ...-26 | -25..-14 | -13..-5 |  -4..2  | 3..11 | 12..22 | 23...
                   |          |         |         |       |        |
20=1.0100   ...-30 | -29..-16 | -15..-6 |  -5..3  | 4..13 | 14..26 | 27...
                   |          |         |         |       |        |
23=1.0111   ...-34 | -33..-18 | -17..-6 |  -5..3  | 4..15 | 16..30 | 31...
                   |          |         |         |       |        |
26=1.1010   ...-38 | -37..-20 | -19..-7 |  -6..4  | 5..17 | 18..34 | 35...
                   |          |         |         |       |        |
29=1.1101   ...-42 | -41..-22 | -21..-7 |  -6..4  | 5..19 | 20..38 | 39...
		   ok	      ok        ok       ok       ok      ouch

The first five thresholds are marked as ok.  The sixth, marked ouch, is
one too low.

Although the 11 other chopped divisors, namely D=16,18,19,21,22,24,25,
27,28,30,31, are not observable via the bug, Tim conjectures that their
thresholds are those of their "ceilings" defined as the smallest higher
buggy divisor, e.g. D=18 and 19 have the exact same thresholds as
D=20.  This leaves D=30 and 31 with unmeasured thresholds, but by
extending the evident arithmetic progressions these are reasonably
conjectured to be

31=1.1111   ...-46 | -45..-24 | -23..-8 |  -7..5  | 6..21 | 22..42 | 43...

With that regularity the table would only require 6 "fat" columns (one
per three chopped divisors, less one at each end) instead of 16 (one
per chopped divisor), shrinking the table to 6/16 its original size.
Sampling the divisor to determine which of the 6 columns to use would
only be done once, at setup.

In contrast, sampling the partial remainder is done at every cycle of
the algorithm, and it may well be preferable to keep separate positive
and negative halves of the table to avoid the extra time required to
<test the chopped partial remainder for sign and if negative form its
1's-complement (NOT) and subtract 1>.

That the actual table is asymmetric is very strong evidence that the
positive and negative halves are represented separately.  (But not
conclusive evidence---a simple logic circuit could in principle have
been used to lower the top threshold by 1 for positive divisors.  Such
an ad-hoc addition to the logic is however almost surely excluded on
the ground that it would serve no purpose even if the lower threshold
*had* been correct.)

Here is a slick and easily proved closed form formula for the entries
of the positive half of a *correct* table.  If D (between 16 and 31) is
the chopped divisor, let d = D/3 + 1 (between 6 and 11, serving as an
index to the 6 "fat" columns), and let p be the chopped partial
remainder, p >= 0.   Take the PD-plot table entry at row p and fat
column d to be given by the closed form formula

		(d <= 2*p) + (2*d <= p)

where Booleans are numerically 0 if false and 1 if true (as in C).

Thus at D = 17 we have d = 17/3 + 1 = 5 + 1 = 6, and we then get for
p=2,3,11,12 the values

		 p	d <= 2*p      2*d <= p     (d<=2*p) + (2*d<=p)	
		----------------------------------------------------
		 2	   0		   0		    0
		 3	   1		   0		    1
		11	   1		   0		    1
		12	   1		   1		    2

These four examples show that this closed form formula locates the two
inner thresholds exactly where the Pentium puts them, at least for
D=17; similar calculations confirm that this formula tracks the
Pentium's inner two thresholds perfectly at the other divisors.

One can imagine ways to evaluate this formula sufficiently fast during
each division cycle as to obviate the need for a precomputed table of
its values.

This formula makes the inaccessible entries 2.  If it is desired to
reduce power by setting these to something else, e.g. 0 or 3, this can
be safely accomplished using the easily proved fact that, with one
exception, they are precisely those entries satisfying 4*d <= p.  Note
that this formula does *not* produce the notorious bug, which
corresponds to using the more complicated formula 4*d <= p+1 (in the
positive half), which compromises the rightmost third of each fat
column.  The one exception mentioned above is for d = 11, p = 43, where
thanks to the absence of column D=32 this more complicated formula does
not produce the bug, else there would be 6 groups of problem divisors
instead of 5.  However I would imagine the tiny economy of setting one
additional entry to 0 is not worth the trouble and attendant risk of
error; stick to 4*d <= p uniformly and let d=11, p=43 stay at 2.

The negative half of the table is obtained simply by reflecting the
positive half about -1, also easily proved correct (an outline of this
proof is in the parenthetical paragraph near the beginning of this
message starting "The reason").

Had these simple and easily derived formulas been known for the PD-plot
table, all this trouble would have been avoided.  There is *nothing* so
effective as simplicity to avoid bugs, whether in computers,
mathematics, or any other area where embarrassing and/or expensive
mistakes are possible.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/ABSTRACTS
				      http://boole.stanford.edu/boole.html


=================================================================
-------------------------PENTIUM REPORT # bug74----------------------
			Vaughan R. Pratt
		   Department of Computer Science
			Stanford University
			Stanford, CA 94304
		       pratt@cs.stanford.edu
======================================================================
Article 31190 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Pentium Math Error
Date: 21 Jan 1995 20:03:26 GMT
Organization: Computer Science Department,  Stanford University.
Lines: 34
Message-ID: <3frpae$ovk@Radon.Stanford.EDU>
References: <nobody-140195153420@via-annex2-3.cl.msu.edu> <Pine.OSF.3.91.950121122429.19455B-100000@alpha.dsu.edu>
NNTP-Posting-Host: sunburn.stanford.edu

In article <Pine.OSF.3.91.950121122429.19455B-100000@alpha.dsu.edu>,
Warren Rath  <rathw@alpha.dsu.edu> wrote:
>
>Now ask your self how many times does the calculation of 	        
>   4,195,835 / 3,145,727 actually come up in computing??

Once every seventeen comp.sys.intel posts, or once a minute. :)

I prefer 5/15 with a millionth knocked off top and bottom, as being
easier to remember than 4195835/3145727.  4.999999/14.999999 should
have 6 3's but the Pentium only produces 4 of them.

If you take the number 1 and randomly add, subtract, multiply, or
divide it by itself, and save the result, and keep doing this with a
mix of more 1's and these results, then in the course of an hour you
will see the bug more than a dozen times.

This is the dual experiment to Intel's.  Intel takes a program
consisting of a single instruction, FDIV, sends pairs of random numbers
to it, and gets an error rate of 1 in 9 billion random pairs of
numbers.  In contrast my experiment above takes a database starting
with a single number, 1, sends it to random programs, and gets an error
rate of 1 in five minutes of random computing.

This shows that the error rate depends on both the kind of programs you
use and the kind of data flowing through them.  If your program
consists of just the FDIV instruction and your data is uniformly
distributed then you should be as safe as Intel says.  If you have more
complicated programs and less complicated numbers, the above experiment
indicates that you could be at much greater risk than either Intel or
IBM says.
-- 
Vaughan Pratt			      FTP:  boole.stanford.edu:/pub/ABSTRACTS
				      http://boole.stanford.edu/boole.html