=================================================================
-------------------------PENTIUM REPORT # bug1----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: A natural scenario with high FDIV bug probability (was: In Intel's Defense...)
Date: 3 Dec 1994 15:20:17 GMT
Organization: Computer Science Department, Stanford University.
Lines: 194
Message-ID: <3bq2bh$ohc@Radon.Stanford.EDU>
References: <3bdieq$lqf@engnews2.Eng.Sun.COM> <3bf666$6k4@hpsystem1.informatik.tu-muenchen.de> <3bnrgo$4gf@hermes.synopsys.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15209 comp.sys.intel:20065

In this message I give a scenario in which FDIV bugs are encountered
once every three milliseconds or so, in contrast to Intel's default
scenario in which they are encountered every 27,000 years or so.
Furthermore none of the encountered bugs involve obscure numbers: all
of them take the form of small "bruised" rationals like 5/15, which
when bruised as 4.999999/14.999999 yields 0.333329 on the Pentium and
0.33333329 elsewhere.  Moreover quite a few of the errors encountered
are very large, as with this example.  We leave the plausibility of
our scenario for the reader to judge; my intent is that it come
across as something that could well happen.

By way of motivation:

In article <3bnrgo$4gf@hermes.synopsys.com>, Joe Buck wrote:
>I am really quite amazed at all of this fuss.  In one case out of nine
>billion, you get what in essence is a single-precision division instead of
>a double-precision division.  The vast majority of Pentium users never do
>anything that requires even full single-precision: they run spreadsheets,
>write documents, and play games.  Why should Intel pay about 1 billion
>(that's nine zeros) dollars to get all these lusers a new chip?

One in nine billion, NOT.  It is a common fallacy that real numbers
arising in a user's program are uniformly distributed.  Their actual
distribution depends critically on both where the program gets its
data and what it does with it, as with the scenario described below.

Correct to within single precision, NOT.  Unless of course you had in
mind a 16-bit word length; other examples of ~16-bit errors have
already been given, but the form of the ones we give here, such as
the above-cited 4.999999/14.999999, makes them particularly dramatic.

In this message I give a simple and plausible scenario, not involving
number theory, cryptography, differential equations, or matrix
inversions, but simply divisions by integers of one to three digits,
represented as reals, that have been slightly "bruised" by a process
all of us have seen many times.  In this scenario the FDIV bug
reveals itself, not once every 27,000 years, but rather once every
three milliseconds or so, the rate at which my Pentium was
encountering them.

The critical postulate here is that the integers encountered in this
scenario have been "bruised" very slightly due to other processing
and then truncated, for whatever reason (e.g. the numbers might have
been obtained automatically from a decimal calculator), to some
preordained number of decimal digits, the *precision*.  All integers
are subjected uniformly to this treatment, for a fixed precision.
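To make the postulate concrete, here is a minimal C sketch of the
bruising step (my illustration, not part of the original post).  It
assumes nothing beyond a correct FPU, on which it prints the
non-Pentium quotient 0.33333329 quoted above; a flawed Pentium
returns 0.33332922 for the same operands.

#include <stdio.h>

int main(void)
{
    double prec = 1e-6;           /* precision 6: bruise of one millionth */
    double num = 5.0 - prec;      /* 4.999999 */
    double den = 15.0 - prec;     /* 14.999999 */
    printf("%.8f\n", num / den);  /* 0.33333329 on a correct FPU;
                                     a flawed Pentium gives 0.33332922 */
    return 0;
}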
Thus if the precision is say 6, and we are dividing say 7 by 18, then
7 is actually 6.999999 and 18 is 17.999999; that is, the same
quantity, here 10^-6, is subtracted from both operands.

There are a million pairs of integers i,j with 1 <= i,j <= 1000.  The
following table shows, as a function of the decimal precision going
down, and the *tolerance* (how far off a division must be to count as
a wrong answer) going across, how many of the one million possible
divisions i/j are wrong.  For these quotients the IEEE-correct
answers can be expected to have a relative error of better than
10^-17.  The table below defines "wrong" as having a relative error
of at least 10^-15 for the first column, at least 10^-13 for the
second, etc.  (That is, going across the table we become less fussy
and hence recognize fewer quotients as wrong.)

Tolerance (10^-):   15    13    11     9     7     5     4
--------------------------------------------------------------
Decimal precision         Number wrong, out of a million
--------------------------------------------------------------
       15            4     4     0     0     0     0     0
       14          101   101    29     0     0     0     0
       13          871   864   252     0     0     0     0
       12          876   867   573     0     0     0     0
       11          871   868   822    52     0     0     0
       10          875   871   852   211     0     0     0
        9          886   885   875   527     6     0     0
        8          875   875   873   762    69     0     0
        7          833   833   833   818   258     0     0
        6          627   627   627   623   426    14     0
        5          233   233   233   232   205    10     0
        4           43    43    43    42    41     2     0
        3            0     0     0     0     0     0     0

With 6 digits of precision the truncation error is one millionth, an
easily remembered quantity.  For this case, the reader may wish to
memorize two or three of the worst offenders, if only for production
at cocktail parties.  My favorite is 5/15, that is,
4.999999/14.999999, appraised by the Pentium at 0.33332922 when
other appraisers would insist on
0.33333329 as a fairer answer.
      ^^
Another bad case is 7/48 (6.999999/47.999999), for which the Pentium
guesses 0.14583204 when a more inspired guess would be
0.14583332.
       ^^
One more candidate: 9/54 (8.999999/53.999999), where the Pentium
gambles on 0.16666439 and loses petulantly to
0.16666665.
       ^^

Let me emphasize the three essential features of examples such as
these.  First, the basic arithmetic is with quite small integers
(albeit slightly "bruised"), greatly increasing the likelihood that
you will encounter these exact errors, whether or not you ever
notice them.  Second, the bruising can be by any simple amount, like
a millionth or a ten-millionth; it does not have to be an obscure
quantity like 0.000000142883, greatly increasing the likelihood that
the kind of bruising arising in your situation will be bad for you.
Third, the error can be applied uniformly to both operands of FDIV;
one does not need to tweak the errors of the operands separately,
giving you an even better chance of encountering one of these
errors.

An odd feature of these three examples, which for want of time I
have not pursued and which may or may not have a uniform
explanation, is that in every case the error can be described
approximately by saying that the Pentium omitted two of the
repeating digits, underlined with ^^ in the above.  This yields a
simple heuristic for emulating a Pentium on these three examples:
calculate 4.999999/14.999999 or whatever on your pocket calculator,
then delete two of the repeating digits; the result is good to one
digit beyond the repeating string (actually two in the 5/15 case).

It should be noted that the relative error in these three examples
is considerably greater than the precision.  Only 26 pairs survive
at tolerance 10^-5.
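For readers who want to check the arithmetic behind these three
offenders, the following sketch (mine, not the original poster's)
computes their relative errors, taking the Pentium quotients above
as literal constants.  Since those constants are rounded to eight
places, the printed errors only approximate the exact figures given
in a later post below (1.2207e-05, 8.71931e-06, 1.35634e-05).

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* correct quotients, as computed by any sound FPU */
    double correct[3] = { 4.999999/14.999999, 6.999999/47.999999,
                          8.999999/53.999999 };
    /* the Pentium's answers, taken as constants from the text above */
    double pentium[3] = { 0.33332922, 0.14583204, 0.16666439 };
    int i;
    for (i = 0; i < 3; i++)
        printf("relative error %.3g\n",
               fabs(pentium[i] - correct[i]) / correct[i]);
    return 0;
}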
Here for the record are the 14 wrong'uns for precision 6: 5/15,
5/30, 5/60, 5/120, 9/54, 9/108, 9/216, 9/432, 9/864, 10/60, 10/120,
10/240, 10/480, and 10/960.  (7/48 is off by only 0.9*10^-5, just
barely missing out on tolerance 10^-5.)  For precision 5 the 10 bad
guys are 18/27, 20/30, 33/144, 36/108, 40/120, 44/192, 72/432,
72/864, 80/480, and 80/960.  And for precision 4 the two divisions
are 82/96 and 120/288; specifically, 81.9999/95.9999 yields 0.854156
when we expected 0.8541665, while 119.9999/287.9999 yields 0.416656
and we wanted 0.4166665.  (Hmm, the delete-two-repeats rule works
here too; very interesting...)

Although we have been assuming radix ten for our notion of
precision, it is not special, and any other radix should yield
similar results, albeit with some other assortment of small rational
erring divisions.  In particular if we had evolved thirteen fingers
instead of ten, the corresponding table going from 11 digits down to
3 digits at tolerance 10^-9 would have been 0, 31, 211, 547, 802,
784, 417, 109, 0.

While this scenario puts the FDIV bug in what looks at least to me
like a pretty bad light, I do not claim it is anything like the
worst case scenario.  The two essential factors for any scenario are
the rate at which that scenario triggers the FDIV bug, and how often
the essence of that scenario arises in the real world.  The "damage
index" of a scenario is the product of those two factors.  A
scenario that accounts for thousands of hours of actual Pentium time
but that only triggers the bug say every ten minutes may well have a
higher damage index than my scenario above, whose strength is that
it triggers the bug every few milliseconds but whose weakness is the
uncertainty as to how likely it really is in practice.  You must
admit however that it is not a completely implausible scenario.

-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o

I append here the code that prints the above table.  For those who
think, if not harmonize, in C, this is where to look for the exact
meanings of "precision" and "tolerance".

#include <stdio.h>
#include <math.h>

int main(argc, argv)
int argc;
char *argv[];
{
    int i, j, cnt, digits;
    double prec, tol, ibruis, jbruis, pent, true;

    for (digits = 15, prec = 1e-15; digits > 2; digits--, prec *= 10) {
        printf("%4d\t\t", digits);
        for (tol = 1e-15; tol < 1e-4; tol *= 100) {
            cnt = 0;
            for (i = 1; i <= 1000; i++)
                for (j = 1; j <= 1000; j++) {
                    ibruis = i - prec, jbruis = j - prec;
                    pent = ibruis/jbruis;
                    true = (1.11*ibruis)/(1.11*jbruis);
                    cnt += (fabs(pent - true) > tol*true);
                }
            printf("%4d ", cnt);
        }
        printf("\n");
    }
    return 0;
}

My heuristic for getting the true quotient, namely
true = (1.11*ibruis)/(1.11*jbruis), is the quick and dirty one of
randomizing the operands just enough to move the probabilities of
hitting the FDIV bug at least somewhat closer to Intel's estimate of
once in 27,000 years.  When I replaced 1.11 by 1.01 the results were
identical, which along with the Intel estimate seemed good enough
for a program not intended for building planes.
--
Vaughan Pratt             http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug1.ans----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15289 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!agate!howland.reston.ans.net!swrinde!sgiblab!chronos.synopsys.com!news.synopsys.com!jbuck
From: jbuck@synopsys.com (Joe Buck)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: A natural scenario with high FDIV bug probability (was: In Intel's Defense...)
Date: 5 Dec 1994 18:45:07 GMT
Organization: Synopsys Inc., Mountain View, CA 94043-4033
Lines: 26
Message-ID: <3bvn3j$nbf@hermes.synopsys.com>
References: <3bdieq$lqf@engnews2.Eng.Sun.COM> <3bnrgo$4gf@hermes.synopsys.com> <3bq2bh$ohc@radon.stanford.edu>
NNTP-Posting-Host: deerslayer.synopsys.com
Xref: Radon.Stanford.EDU comp.arch:15289 comp.sys.intel:20638

pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes:
>In this message I give a scenario in which FDIV bugs are encountered
>once every three milliseconds or so, in contrast to Intel's default
>scenario in which they are encountered every 27,000 years or so.
>Furthermore none of the encountered bugs involve obscure numbers: all
>of them take the form of small "bruised" rationals like 5/15, which
>when bruised as 4.999999/14.999999 yields 0.333329 on the Pentium and
>0.33333329 elsewhere.  Moreover quite a few of the errors encountered
>are very large, as with this example.

From Vaughan Pratt's analysis, it appears that the Pentium bug is
far more serious than Prof. Nicely's analysis, or Intel's remarks,
would indicate, because the quotients that cause the bugs are far
more likely to occur than a random distribution would indicate.

I withdraw my previous remarks that the fuss has been overblown.  I
was basing them on Nicely's remarks rather than on Intel's, though
both simply computed the fraction of divisor pairs that yield wrong
answers and both underestimated the maximum possible error
considerably.  I think that the likelihood of hitting these values
is definitely going to be application-dependent, but it certainly
seems that these near-integer values are going to be far more common
than random.
--
-- Joe Buck  (not speaking for Synopsys, Inc)  Phone: +1 415 694 1729

=================================================================
-------------------------PENTIUM REPORT # bug2----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: A natural scenario with high FDIV bug probability (was: In Intel's Defense...)
Summary:
Expires:
References: <3bdieq$lqf@engnews2.Eng.Sun.COM> <3bnrgo$4gf@hermes.synopsys.com> <3bq2bh$ohc@Radon.Stanford.EDU>
Sender:
Followup-To:
Distribution:
Organization: Computer Science Department, Stanford University.
Keywords:
Cc:

In article <3bq2bh$ohc@Radon.Stanford.EDU>, Vaughan R. Pratt wrote:
>In this message I give a scenario in which FDIV bugs are encountered

In that message I showed that, by dividing "bruised" rationals with
numerator and denominator at most 1000, the probability that a
randomly chosen such rational would generate the FDIV bug was just
under one in a thousand.  It later occurred to me to ask for the
corresponding proportion when the limit was set to 100.  It turns
out to be just under one in a hundred.  However as the limit is
further decreased the rate does not increase further but stays
roughly constant.
When the limit is changed from 20 to 10 it drops to exactly zero.

Here is the table for the 10,000 bruised divisions i/j; see my
previous message under this heading for the meanings of the entries.

Tolerance     15    13    11     9     7     5
Precision
----------------------------------------------
   15          4     4     0     0     0     0
   14         86    86    24     0     0     0
   13         78    78    52     0     0     0
   12         78    78    76     0     0     0
   11         78    78    77    17     0     0
   10         78    78    78    52     0     0
    9         80    80    80    73     2     0
    8         78    78    78    78    23     0
    7         71    71    71    71    51     0
    6         54    54    54    54    51     5   <---
    5         10    10    10    10    10     2
    4          1     1     1     1     1     1
    3          0     0     0     0     0     0

To illustrate, the 5/15 example is one of the 5 items the arrow is
pointing to, because the Pentium computes 4.999999/14.999999
(precision 6) with a relative error worse than the tolerance for
that column of 10^-5.  It shows up in both this table and the
previous table because both the numerator 5 and the denominator 15
are below 100, and hence below 1000.

At precision 6, for which the typical division considered for this
table is 43.999999/87.999999 (that one is among the innocent
majority), 51 of the 10,000 divisions are off by worse than one part
in ten million, while 5 of them are off by worse than one part in a
hundred thousand.  The former statistic means that if you do a lot
of divisions involving very small (less than 100) bruised integers,
you have more than half a percent chance *at each division* of
getting an error worse than that expected from using single
precision arithmetic.  If you were doing mainly divisions, this
would mean that you would see the FDIV bug every few hundred
microseconds, a far cry from Intel's 27,000 years.  (With the limit
on numerator and denominator set at 1000, my previous message had
reported FDIV bugs coming at a rate of one every three milliseconds
for randomly chosen small bruised rationals.)

This bug is maximally insidious: it is about as large as it could
possibly be without actually triggering warning bells when people
review their columns of data.  In this way tiny errors of one part
in a hundred thousand can over a long period of time sneak into the
trillions of calculations done around the world, and there is no
practical way to spot them short of doing a massively detailed audit
of a kind that would have to be tailored to the Pentium FDIV bug and
would be entirely unnecessary for a reliable floating point unit.

Intel may argue that errors of one in a hundred thousand are too
small for the average business to care about.  I would argue on the
basis of the statistics of my last two messages that the Pentium has
the ability to do to the information superhighway what the Loma
Prieta earthquake did to San Francisco's highways.  That many of us
did not witness that damage at first hand does not mean that it was
not serious.

=================================================================
-------------------------PENTIUM REPORT # bug3----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 20292 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch
Subject: The Bad Scenario for FDIV--Simplified Account (was: Intel's own description)
Date: 4 Dec 1994 07:56:12 GMT
Organization: Computer Science Department, Stanford University.
Lines: 55
Message-ID: <3brsms$dae@Radon.Stanford.EDU>
References: <3bq8re$sck@nntp.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:20292 comp.arch:15232

In article , Tim Arheit AKA Fool Boy wrote:
>The problem is that my software doesn't use random numbers as data.  One
>of the posts in this group shows that simple division of ~integer numbers
>shows a significant error (something on the order of 100 times less accurate)
>4.99999999/14.999999 (5/15) (I think these are some of the example
>numbers given, it's in this group somewhere) generated significant error.
>The point is that integer constants in a floating point equation are far
>from random.

Let me correct a minor detail and at the same time repackage what I
said in two earlier messages in a hopefully more digestible and
memorizable form.

The exact example here subtracted one millionth (easy to remember)
from each of 5 and 15, i.e. 4.999999/14.999999.  The correct value
is 0.33333329; Pentium math makes it 0.333329, wrong in the fifth
decimal place, specifically a relative error of 1.2 in 10^5.

Other such rationals are 7/48 and 9/54.  For each of these, if you
subtract one millionth from both numerator and denominator before
dividing, you again obtain a relative error of about one in a
hundred thousand.

Among rationals with numerator and denominator both bounded by 1000,
each decremented by a millionth, there are 14 that behave as badly
as the three examples above, and an additional 412 that the Pentium
evaluates with a relative error between 10^-5 and 10^-7.  (For
comparison, using single precision instead of double yields a
maximum relative error of 6*10^-8, so all these errors are worse
than would result merely from using single precision instead of
double.)  Hence if you compute with randomly chosen approximate (in
the above sense) rationals from that population, the probability of
an error in the fifth place is one in 70,000, while the probability
of an error in the sixth or seventh place is one in 2,000.

The P90 can do better than two divisions per microsecond, at which
rate, assuming the above population of approximate rationals, the
small error (sixth or seventh place) occurs once a millisecond on
average while the large error (fifth place) occurs once every 35
milliseconds.

When the division operands are bounded by 100 instead of 1000, the
1/2,000 figure for the small error drops to 1/200 (a factor of 10
worse) while the 1/70,000 rate for the large error drops to 1/2,000
(a factor of 35 worse), i.e. the large errors now happen once a
millisecond.

Detailed tables of such errors, along with a program for verifying
my conclusions for yourself, can be found in my two earlier messages
on this subject on comp.arch,comp.sys.intel.  Search for the phrase
"ibruis" (a variable in my program) or the subject "A natural
scenario with high FDIV bug probability (was: In Intel's
Defense...)".
--
Vaughan Pratt             http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug4----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.ibm.pc.hardware.chips,comp.sys.ibm.pc.hardware.systems,comp.sys.ibm.pc.hardware.misc,comp.sys.intel,sci.engr,sci.engr.civil
Subject: Re: Intel Pentium Bug: Scary Business?
Summary:
Expires:
References: <3bfclk$q28@rc1.vub.ac.be> <3boqfn$crj@Radon.Stanford.EDU> <1994Dec04.044853.177619@zeus.aix.calpoly.edu>
Sender:
Followup-To:
Distribution:
Organization: Computer Science Department, Stanford University.
Keywords:
Cc:

In article , H. Peter Anvin wrote:
>Followup to: <1994Dec04.044853.177619@zeus.aix.calpoly.edu>
>By author: jcenter@tuba.aix.calpoly.edu (The Free Radical)
>In newsgroup: comp.sys.intel
>
>> Or is the case one of necessary precision.  I only use about 3 or 4 sig
>> figs in nearly all of my calculations.  If this is the case, can anyone
>> (in any branch of engineering) describe what they are doing and why they
>> need precision to 9 or so figures.
>
>You *use* 3-4 significant figures.  I would suggest taking a course in
>numerical methods which covers issues like stability, roundoff-error
>propagation etc, and you'll see why the number of significant figures
>matters.

The point was made earlier that lots of good science was done with
slide rules in the precalculator days.  If you only ever use the
first four digits of your *rational* calculations (those involving
only the four basic functions, +-*/) then the P90 will not let you
down, in the sense that there are no known 4th-place errors, and it
is very unlikely there are any.

There are *two* premises here: (i) the precision of the result and
(ii) how it is obtained.  Change *either* and you leave the safety
of the slide rule world and become vulnerable to the Pentium's FDIV
bug.

(i) If you ever pay attention to the fifth digit of your answer then
you are at risk.  How much at risk?  Intel says you will see their
bug once every 27,000 years.  This is preposterous.  In recent posts
on comp.sys.intel,comp.arch I showed that divisions involving
numbers of the form a small integer (<1000) minus 10^-6 (or any
other fixed quantity of that general size) were at extraordinary
risk.  For a program doing only divisions of randomly chosen such
numbers, 5TH PLACE ERRORS OCCUR EVERY 50 MILLISECONDS at least.  If
"small integer" means <100 then that error rate for this sort of
division jumps dramatically by more than an order of magnitude.  (So
if all Pentiums did this sort of work that would total more than a
billion 5th place errors a second!)

Are reals that are very close to integers any more likely than other
reals?  Absolutely: (a) integers (reals with fractional part
.000000) occur as the result of a computation far more often than
reals with any other given fixed fractional part such as .708342,
and (b) floating point results of computations represented in
floating point typically include small errors.  These two factors
conspire to make bruised integers, those a tad less than an integer,
far more likely than most other reals.

(ii) If instead of individual rational calculations, i.e.
plus-minus-times-divide, you get your 4-digit numbers from more
complex calculations, you again become vulnerable.  This time it's
because those functions may be using FDIV during those complicated
calculations.  Even though you only consume 4 digits, the 3rd and
4th places of your answer (and for sufficiently long or fragile
types of computation the second or even first place) can easily be
affected by 5th place errors in intermediate results.  With yet
longer or more fragile computations the first four places start to
depend on the 6th and 7th places.  With the sorts of numbers
considered above, a 6th or 7th place FDIV error (and hence a 4th or
even 1st place error later in the computation) is ten times more
likely than a 5th place error.
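As an illustration of this propagation effect (my sketch, not from
the original post; the constant c is hypothetical), subtracting a
nearby constant from a quotient carrying a 5th place error leaves a
difference whose second significant digit is already wrong:

#include <stdio.h>

int main(void)
{
    double q_true = 0.33333329;  /* correct bruised 5/15, from the text */
    double q_pent = 0.33332922;  /* the Pentium's answer, from the text */
    double c      = 0.3333;      /* hypothetical nearby constant */
    /* a ~1e-5 relative error in the quotient becomes a ~12% relative
       error in the difference 3.329e-05 vs 2.922e-05 */
    printf("true: %g   Pentium: %g\n", q_true - c, q_pent - c);
    return 0;
}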
For other values of both the error place and the amount by which
these small integers are decreased, see the tables in my earlier
posts for the dependency of the error rate on these two parameters.

I should emphasize that these statistics are for quantities a hair
below a small integer.  The probability of encountering the FDIV bug
on the Pentium when dividing two randomly chosen floating point
numbers is considerably lower (but nowhere near as low as Intel
tells us, see below).  However floating point numbers arising in
many kinds of real-world computation are not randomly distributed,
and some scenarios can be expected to greatly increase the
probability of dividing by these Pentium-bad numbers.  Do not let
anyone tell you such scenarios hardly ever happen.  Ask them how
they know this.

Intel's calculation of 27,000 years per error assumes that in all
computations, the numbers being fed to FDIV are uniformly
distributed, meaning that all floating point numbers are equally
likely.  There are few if any real-world situations where such
complete randomness is in fact the case.  On the other hand, as a
method of estimating the rate of FDIV bugs, this assumption should
nevertheless yield a good estimate of the expected rate in many
situations, with two caveats.

NOT 27,000 years.  My tables show that 7th place errors occur at a
rate of 40 per million for small integers bruised by 10^-4.  Hence
for uniformly distributed reals the rate cannot be less than 10^-8
times this, since 10^-4*10^-4 of such pairs of reals will be within
10^-4 below an integer.  This comes to approximately one such error
per month, or about an error a second for Pentiums collectively.
27,000 years is irresponsibly wishful thinking.

And NOT in all situations.  A crucial question that Intel has
ignored to date is: for how many situations does this assumption of
a uniform distribution grossly underestimate the error rate?  I
cannot answer this, BUT NEITHER CAN INTEL.  Intel's public
statements to date depend not only on dubious estimates for the
uniform case but also on the proportion of nonuniform cases being
negligibly small.  Intel has given no reason for the latter.  The
greater likelihood of bruised integers over arbitrary reals strongly
suggests that the error rate in such situations is several orders of
magnitude greater than for uniform distributions.  Until the likely
or even marginally likely occurrence of FDIV-hazardous situations
has been ruled out, Intel's public reassurances that the Pentium is
safe in practice are grossly irresponsible.

The responsible thing for Intel to do is to state the full range of
BOTH RATES AND MAGNITUDES of FDIV errors that may occur in code
performing some useful function.  Ideally Intel would also give
criteria for estimating approximately where along that range various
applications typically land, but there is unfortunately little hard
data today on which to base such an estimate.  Hence the only data
Intel can responsibly use at this time in any public assessment of
the Pentium FDIV bug is the FULL RANGE of both the rates and
magnitudes of errors possible in useful code.  Lacking further data,
any less than that constitutes gross wishful thinking and as such an
abdication of public responsibility.

=================================================================
-------------------------PENTIUM REPORT # bug5----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 20369 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: How to find FDIV bug example
Date: 4 Dec 1994 20:14:31 GMT
Organization: Computer Science Department, Stanford University.
Lines: 38
Message-ID: <3bt7v7$5t5@Radon.Stanford.EDU>
References:
NNTP-Posting-Host: sunburn.stanford.edu

In article , Sivasankar Chander wrote:
>Remember, the more the "nice" numbers you have, the better the chances
>of getting a good jury award in a class-action suit.

For errors in the 5th place, here is a collection of "nice numbers"
I haven't posted before.  [oops, just noticed this list buried in an
earlier post, oh well. -vp]

18/27   20/30   33/144   36/108   40/120
44/192  72/432  72/864   80/480   80/960

Actually the Pentium computes all these divisions correctly, but if
you subtract .00001 from all of the above integers then the
divisions when performed on a Pentium all have relative errors of
about .000012.  For example the Pentium evaluates 17.99999/26.99999
as 0.6666575 when it should have obtained 0.66666654.

If 18/27 is there, why isn't 18/54 on the list?  Because
17.99999/53.99999 is computed correctly on the Pentium; you have to
go to 17.999999/53.99998 to see an error.  Pretty nice, but not as
nice as when both operands end in a simple string of 9's, the case
for all ten examples above.

There are many more such "nice numbers" with varying degrees of
error, all worse than the error resulting from using single
precision.  See my previous posts, especially my table of error
rates as a function of magnitude of error and how much is knocked
off the numerator and denominator.
--
Vaughan Pratt             http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug6----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Whiplash potential of the Pentium (Was Re: FDIV probability issue)
Summary:
Expires:
References: <1994Dec4.175212@west.cscwc.pima.edu>
Sender:
Followup-To:
Distribution:
Organization: Computer Science Department, Stanford University.
Keywords:
Cc:

In article , Dik T. Winter wrote:
>In article <1994Dec4.175212@west.cscwc.pima.edu> 103t_english@west.cscwc.pima.edu writes:
> > In article , siva@bond.bocaraton.ibm.com (Sivasankar Chander) writes:
> > > Thanks, and I don't want to spoil your euphoria, but Tim Coe found all
> >
> > You've missed a significant point here: the bugs are easy to find if you go
>
>You are missing something.  Tim Coe did not come up with a "bug-generator-
>algorithm"; he came up with a Pentium FDIV emulator that predicted many
>bugs.

I hope no one will think I'm trying to pour crude oil on this storm
in a refined teacup if I describe the path from Tim Coe's analysis
to my particular "bruised integer" bug-generator, which hopefully
will clarify the respective contributions.  I'll then extend that
path into a new (I think) topic, how the FDIV bug impacts real-time
controllers---bottom line: worry (if only theoretically) about
whiplash.
All this of course is moot for the lucky few receiving clean
machines now and for all software incorporating the Moler et al.
workaround, but I rather expect this still leaves a substantial
population of users desperately seeking a better grasp of exactly
how concerned they should be.

After getting lost wandering around comp.sys.intel when I was first
pointed at this issue last week, about which I previously knew
nothing, I eventually stumbled on Tim's illuminating table of where
90% of the problem divisors fell, namely

         3.0 > divisor >=  3.0 - 36*(2^-22)
         9.0 > divisor >=  9.0 - 36*(2^-20)
        15.0 > divisor >= 15.0 - 36*(2^-20)
        21.0 > divisor >= 21.0 - 36*(2^-19)
        27.0 > divisor >= 27.0 - 36*(2^-19)

This was a breath of fresh air.  After staring at this table for a
while it became clear that if the Pentium were at any risk from the
bug, it would have to be where the divisor was a hair below an
integer (at least for divisors > 13.5, the largest noninteger
matching the above description).  Now integers tend to come up a lot
in comparison to other reals in many applications, even if
represented in floating point.  This got me to wondering what the
27,000 year figure turned into when the division operands were
integers.

Well, no bug there for integers below about 250K, since the table
says the divisor has to be *strictly below* these magic numbers in
an interval too narrow to hit any integers.  And for integers above
250K, although the interval widens to >1, the proportion of those
integers that are bad FDIV divisors becomes very small.

But if a small integer is slightly bruised by computation, then a
nontrivial risk emerges.  How big?  Well, down in the small integers
a sizable proportion of the integers fit Coe's table, i.e. are 3
times a power of 2 times 1, 3, 5, 7, or 9.  In fact 35 of them fit,
namely the 5 numbers 3,6,9,12,15 plus the 30 numbers
{18,21,24,27,30}*{1,2,4,8,16,32}.  Call such integers pentads
(including the ones beyond 1000 obtained by extending the above
powers of two beyond 32---each such power of two gives another 5
pentads that I'll call an *octave*).  35 out of 1000 gives us our
first hint that the <1000 region may be a sufficiently rich source
of FDIV bugs to cause some serious grief.

But merely being a hair under a pentad, while necessary, is not
sufficient for an FDIV error.  The Coe table tells where all the
risk is concentrated, but it gives no clue as to the *magnitude* of
the risk!  To measure this risk empirically I wrote a program that
knocked 10^-6 off all one million pairs of integers from 1 to 1000,
did the million divisions on a Pentium, and checked which were buggy
and by how much.  There were 627 bugs, of which 426 had a relative
error worse than 10^-7, and 14 worse than 10^-5.

So if you're unlucky enough to do *all* your computing with such
bruised integers, and you do lots of divisions (the Pentium can do
two million divisions a second), you'll see the big 10^-5 errors
14*2 = 28 times a second, or once every 35 milliseconds, and the
smaller errors, between 10^-5 and 10^-7, 627*2 = 1254 times a
second, or more than one a millisecond!  For more dilute
concentrations of these unlucky numbers and slower rates of
division, starting from this basic limiting case you can easily work
out the corresponding chances yourself, which are in direct
proportion to each of those quantities.
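As a minimal sketch of that proportionality rule (mine, not the
poster's): the two rates are the 627 and 14 per million measured
above, while the division rate and the bruised fraction are knobs to
be set for your own situation (the parameter names are my
invention).

#include <stdio.h>

int main(void)
{
    double divs_per_day = 1000.0;  /* Intel's official spreadsheet estimate */
    double bruised_frac = 1.0;     /* 1.0 = every operand a bruised integer */
    double small_rate = 627e-6;    /* errors worse than 1e-7, per division */
    double large_rate = 14e-6;     /* errors worse than 1e-5, per division */
    printf("small error every %.1f days\n",
           1.0 / (divs_per_day * bruised_frac * small_rate));
    printf("large error every %.1f days\n",
           1.0 / (divs_per_day * bruised_frac * large_rate));
    return 0;
}

As written this reproduces, to rounding, the 1.6-day and 70-day
figures used in the next paragraph; setting bruised_frac to 0.01
gives the 160-day and 20-year figures.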
For example if you are a spreadsheet user who does 1000 FDIV's a day
(Intel's official estimate), with all numbers small bruised
integers, you will see one small error every 1000/627 = 1.6 days,
and one 10^-5 error every 1000/14 = 70 days.  If only 1% of your
numbers are bruised integers you're now up to 160 days and 7000 days
(20 years) respectively.  That's IF the software you invoke only
does 1000 FDIV's a day, a question you might well ask yourself.

After getting these basic numbers it was natural to try knocking
other small amounts off, and to tabulate the magnitudes of the
resulting errors more systematically.  This led to the program I
posted on December 3 that generated my two-dimensional table, whose
vertical axis P (precision) says how much we knock off the two
operands before giving them to the Pentium, whose horizontal axis T
(tolerance) gives the amount the Pentium knocks off in response, and
whose entries give the number of relative errors greater than T over
the million operations.  I also looked at what happens when other
numbers besides powers of ten are knocked off: the answer is that
there is nothing magic about radix ten, and the statistics remain
about the same for any amount knocked off having the same general
magnitude as any given power of ten.

Sometimes the Pentium knocks off more than we do, as when we knock
10^-6 off the divisor and it replies by knocking 10^-5 off the
quotient (way beyond what it should have *added* to the quotient).
In this case Pentium division is not antimonotone when it should be;
that is, decreasing the denominator *decreases* the quotient when it
should *increase* it.  Consider for example the true arithmetic fact

        4.999999/15 < 4.999999/14.999999.

This inequality holds because the left-hand denominator is larger
than the right-hand denominator.  The Pentium however considers this
inequality to be false, because when 15 is decremented by 10^-6, the
quotient, instead of going *up* by ~10^-7, goes *down* by ~10^-5.
(The very nice question of whether such a thing could happen at all,
which I hadn't thought to consider before, was posed to me by Norman
Hirsch.)

Such nonmonotonicity of any rational (+-*/) operator provides a
delightfully simple example of how a 5th place error can become a
leading-bit error with a single Pentium instruction, namely
comparison---the above comparison yields truth-value 0 on the
Pentium when it should have yielded 1.  This is in striking contrast
to the dozens or hundreds of instructions one usually imagines being
required to eventually propagate a 5th place error up to the leading
bit.

REAL-TIME CONTROLLERS

This nonmonotonicity may sound like a minor issue, but it is at the
heart of a far more serious issue that has thus far received no
attention to my knowledge, namely the suitability of the Pentium for
real-time control applications in the light of the FDIV bug.

Fix an unlucky (for FDIV) constant c and consider variables x and y
where y depends on x according to the relationship y = c/x.  Now let
x increase smoothly and watch what happens to y.  Whenever x
approaches a pentad p from below, suddenly y goes haywire: instead
of continuing to approach c/p smoothly from above as predicted by
mathematics, y starts jumping around like crazy, making moves dy
thousands of times larger than one would infer from the
corresponding moves dx made by x.
Furthermore these moves are made at a terrific rate, because the
values where c/x is wrong in that bad region are somewhat
chaotically distributed (more on this distribution in a future
message).  Hence the derivative dy/dx goes sky high, potentially
inducing enormous forces.

If c/x is fed *directly* to a 16-bit DAC (CD players in fact use
18-bit DACs to control the analog audio channel going to the speaker
voice coil), at worst its lower order bit is affected and these
"sky-high derivatives" won't be noticed---they're only sky high
because the dx in dy/dx is *extremely* tiny when dy is large.  But
if c/x is instead fed into some formula that picks off a small
segment of the total curve and feeds *that* to the DAC, it magnifies
the FDIV noise in inverse proportion to the size of that segment.

A designer expecting no noise from FDIV might have done this as a
side effect of copying some formula and plugging in some values
whose numerical analysis aspects he didn't think too much about, but
which shouldn't matter because the arithmetic would work perfectly
smoothly for a correct double-precision FDIV and a 16-bit DAC.  But
with the Pentium, suddenly those sky-high derivatives from the buggy
FDIV that were just theoretical before are now magnified to the
point where the 4th or higher bit of the DAC may well be affected
and the forces become whatever the servomotor is capable of
applying.  If you're in a Pentium-controlled elevator, fasten your
seatbelt.

Over 99% (99.96% in the case of the top pentad of each octave) of
the bad divisors just below each pentad p are in the interval [kp,p)
where k is the number .9999975 (this is how the Coe table looks when
translated from bit-oriented computerese into simple math, with k
tweaked slightly to reconcile it with my own empirical measurements
of where the bad divisors live).  So when p is the pentad 960, the
interval is [959.9976,960).  Hence you only have to keep your
seatbelt fastened for a minuscule fraction of the distance between
pentads, about 1/40,000.  So if the ride takes you through one
pentad per hour, e.g. going from pentad 864 to pentad 960, the rough
spots only last a tenth of a second, just enough time to snap your
head back and forward again once.  One FDIV error per 27,000 years
doesn't sound so bad.  One whiplash per hour is a more daunting
prospect.

Let me emphasize that this is pure theory; there are no reports to
date of any Pentium-controlled machinery actually doing any such
thing.  And for events lasting only a tenth of a second and
occurring only once an hour, most likely at much smaller magnitudes
than in the high-magnification scenario hypothesized above, such
glitches could occur many times a day in each of thousands of
controllers around the world and yet pass unnoticed at the typical
noise levels of real-world controllers.  This is not to downplay the
above risks so much as to emphasize the extreme difficulty of
deciding whether they are at all significant.

=================================================================
-------------------------PENTIUM REPORT # bug7----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15442 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Why there is no worst FDIV bug---large vs. likely
Date: 8 Dec 1994 21:20:31 GMT
Organization: Computer Science Department, Stanford University.
Lines: 100
Message-ID: <3c7tav$g0m@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15442 comp.sys.intel:21785

This is a brief note to highlight an important point that is getting
lost in the volume of technical traffic (to say nothing of the
nontechnical) on the severity of the FDIV bug.

There are two fundamental criteria for judging an arithmetic error,
magnitude and frequency.  (Those who follow my work on the duality
of time and information, see my .sig http, will understand the sense
of "fundamental" I mean here; physicists should identify magnitude
with time and frequency with energy, forming a conjugate pair.)  Any
time you have two or more criteria for judging something, it becomes
possible to have no worst case.

Magnitude.  The worst FDIV bug with regard to magnitude is Tim Coe's
pair 4195835/3145727, for which the Pentium gets 1.333739 instead of
the correct 1.33382045.  The relative error here is .999 times
2^-14.  This is the largest error observed to date, and 2^-14 may
well be the maximum possible error for this bug.

Frequency.  The rate at which errors are encountered, which for this
bug is extremely dependent on the application, has to do with the
number and distribution of occurrences of the bug in operand space.
Thus one would not a priori expect that it made sense to talk about
any single bug as relevant to frequency.  Nevertheless the single
pair 4.999999/14.999999, more memorable as 5/15 with a millionth
shaved off each operand, yielding a 2^-16 error, does tell us
something.  It does not tell us about the total number of bugs---
after all, without further information it could be the only bug.
Rather, it tells us something about the likelihood of encountering
an FDIV bug.

Suppose the 5/15 pair and the Coe pair were the only two bugs.
While Tim's pair hurts four times as badly as mine, I think I can
safely leave it to the reader to dream up plausible applications
where my pair is encountered at least four times as often as Tim's,
e.g. data obtained from a data logging device that is using an
analog-to-digital converter and measuring numbers that for some
reason concentrate around integers, or obtaining data from a decimal
calculator that retains only six digits after the point.

An abstract way of putting this is to say that 4.999999/14.999999
has low *Kolmogorov complexity*.  The Kolmogorov complexity of any
finite bit pattern is the size in bits of the smallest Turing
machine that, started on a blank tape, writes down that pattern.
Even though 4.999999/14.999999 is as long as 4195835/3145727, it
should have lower Kolmogorov complexity on a non-Pentium.  (But
moving the Coe pair to a Pentium decreases its Kolmogorov complexity
in principle if not in practice, because the Pentium can describe it
as that pair of odd integers x,y maximizing x-(x/y)*y; Kolmogorov
complexity ignores running time.  In practice today's architectures
trade things off to represent pairs of numbers rather more compactly
than the above program, but this need not hold for all
architectures.)

This is the sense in which there is no worst pair.  Instead there
are two worst pairs, one unambiguously demonstrating how large the
error can get, the other somewhat smaller in magnitude but more
likely to be encountered in practice, depending heavily on the
application.  This "two" is "up to isomorphism" as they say in
algebra.
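Here is the magnitude computation for the Coe pair as a sketch of
mine (not from the post).  Because the quoted Pentium quotient
1.333739 is truncated to seven digits, the printed coefficient comes
out a hair above 1 rather than the .999 stated above.

#include <stdio.h>

int main(void)
{
    double correct = 4195835.0 / 3145727.0;  /* 1.33382044... correctly */
    double pentium = 1.333739;               /* Pentium's answer, truncated
                                                to 7 digits in the text */
    printf("relative error = %g = %g * 2^-14\n",
           (correct - pentium) / correct,
           (correct - pentium) / correct * 16384.0);
    return 0;
}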
The Coe pair has many siblings that do the same job: just scale
either operand by a power of two.  Likewise my pair has many
siblings, which however are explicitly *not* obtainable by scaling,
as I pointed out earlier; rather they are those bugs of similar
structural simplicity enumerated in my table posted earlier.  5/15
is simply the most appealing (to me) of the 800 or so small
(operands < 1000) fractions that are problematic in this sense,
making them at least cousins to 5/15.  Of these, 26 are siblings in
that they have relative errors of at least 10^-5.

Had all the large bugs been exceedingly unlikely, and had all the
likely bugs been of very low magnitude, I would not dispute so
strenuously the emerging (this week) industry consensus that the
FDIV bug is not serious other than politically for the computer
supply side.  This is however not the case: the 5/15 bug shows that
customers can experience fairly large errors, a quarter of the
largest possible, in fairly likely numbers.

It is important to bring this consideration to the attention of
industry leaders, who in this week's news have been downplaying the
significance of the bug.  Until it is brought to their attention,
one cannot accuse them of *deliberately* putting their own concerns
ahead of their customers' by burying their collective heads in the
sand in this way.  The strongest accusation possible is that they
are doing this but unintentionally and without malice.

There is a chain of responsibility here.  A whole industry does not
listen to one person; rather the business community looks to the
technical community as a whole for advice on the matter.  I
therefore call on the academic community to reach a meeting of the
minds on the severity of the problem so that it can present a united
front to industry on this matter.

One way *you* can help here is to forward this message, not
(necessarily) to management and the media, but to those of your
technical colleagues who are capable of assessing the technical
merits of the above arguments but who lack the time required to read
Usenet, particularly those newsgroups carrying the highly contagious
FDIV bug, which having infected the Pentium is now infecting many
news groups.
--
Vaughan Pratt             http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug7.5----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 21229 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Intel Bug Fix Delay for Good Reason
Date: 7 Dec 1994 05:59:09 GMT
Organization: Computer Science Department, Stanford University.
Lines: 34
Message-ID: <3c3ivd$q71@Radon.Stanford.EDU>
References: <3atfbj$rd5@jabba.cybernetics.net> <3bcm6oINN37ml@rs1.rrz.Uni-Koeln.DE>
NNTP-Posting-Host: sunburn.stanford.edu

In article , Ron Christian x5545 wrote:
>When I first heard about the bug, I thought it would be more a spin issue
>than a functional one.  But looking at the most common bug test posted here
>has convinced me otherwise.  Lots of trivial algorithms have a point where
>the value of A moves through the value of B in equations such as "C-(C/A)*B",
>and the Pentium introduces a significant disparity at the point where A equals
>B.  (256 does not equal 0.)
>You don't have to be doing intricate scientific
>research to notice such a thing.  The first thing that occurred to me was that
>this bug could throw my wife's freight claims off by a significant amount.

Sorry, but this logic won't wash.

First, without knowing more about C and B, the odds of your wife
hitting an FDIV error in this scenario are not worth worrying about.
The figure of 27,000 years is not applicable, because those people
limited to calculating only 1000 FDIV's per day put themselves at a
significant disadvantage to those of us earning more than 2 cents an
hour.  An error a month or a day might be more like it.  But even
so, with relative errors of size 10^-5 your wife will never know
what hit her (in the good sense :).

Second, in talking about A as the one variable in the above (in the
sense that you seem to be holding B and C constant), you have a
hyperbola that passes through the origin.  Now your argument that an
FDIV error in C/A will cause an enormous *relative* error at A=B is
irrelevant without further information, because we have no basis for
attaching significance to this relative error, which is calculated
by dividing the noise (error) found at points along your hyperbola
by the distance of those points from the origin.  In order to
justify this reasoning you first have to attach some significance to
distance from the origin.  An example of what I mean is 0
centigrade; that water freezes there is irrelevant when there's no
water nearby.
--
Vaughan Pratt             http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug8----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15446 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: Why there is no worst FDIV bug---large vs. likely
Date: 9 Dec 1994 02:28:07 GMT
Organization: Computer Science Department, Stanford University.
Lines: 29
Message-ID: <3c8fbn$mi0@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15446 comp.sys.intel:21849

====================================================================
FOR EASE OF REFERENCE I HAVE MADE MY FDIV POSTS AVAILABLE BY FTP AT
boole.stanford.edu:/pub/PENTIUM
====================================================================

In article <3c7tav$g0m@Radon.Stanford.EDU>, Vaughan R. Pratt wrote:
>4.999999/14.999999 is as long as 4195835/3145727, it should have lower
>Kolmogorov complexity on a non-Pentium.

Oops, off by one: at 15 digits the first pair is a digit longer than
the second. :)

This leads into a natural question: what's the shortest integer
manifestation of the bug, or is Coe's pair it?  The answer is that
there are shorter pairs.  I went looking for 13-digit pairs just now
and came up with several, including 41811115/49151, which is
850.6666 but on the Pentium is 850.6640, an error of 3 parts per
million.  I would be extremely interested in seeing a 12-digit pair.

This apparent difficulty of finding integer pairs with fewer than 13
digits is in contrast to the number of 3-digit integer pairs (5/15,
9/54) for the variant of division in which one first subtracts a
millionth from each operand.
Two such pairs might not seem like a lot, but it is a high
proportion of the fewer than 2000 candidates.
--
Vaughan Pratt             http://boole.stanford.edu/boole.html
"Eureka" -- Greek for "Oops, forgot my bathrobe again"

=================================================================
-------------------------PENTIUM REPORT # bug9----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: Why there is no worst FDIV bug---large vs. likely
Summary:
Expires:
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c8fbn$mi0@Radon.Stanford.EDU>
Sender:
Followup-To:
Distribution:
Organization: Computer Science Department, Stanford University.
Keywords:
Cc:

====================================================================
For ease of reference I have made my FDIV posts available by ftp as
boole.stanford.edu:/pub/FDIV/bug*
====================================================================

In article <3c8fbn$mi0@Radon.Stanford.EDU>, Vaughan R. Pratt wrote:
>This apparent difficulty of finding integer pairs with fewer than 13
>digits is in contrast to the number of 3-digit integer pairs (5/15,
>9/54) for the variant of division in which one first subtracts a
>millionth from each operand.  Two such pairs might not seem like a lot,
>but it is a high proportion of the fewer than 2000 candidates.

I'm starting to sound like Intel :), that should have been 6 out of
2000:

                relative error
        5/15    1.2207e-05
        5/30    1.2207e-05
        5/60    1.2207e-05
        7/48    8.71931e-06
        7/96    8.71931e-06
        9/54    1.35634e-05

This is a complete list.  All buggy three digit pairs have large
relative errors as the list indicates; there are none with relative
errors in the range 10^-6 to 10^-18, and none are of the form dd/d.

I'm now looking at the error rate when you perturb integers by
uniformly distributed small amounts, as opposed to by exactly
.000001.  This corresponds to a situation where the values are
random integers being sampled in a randomly noisy environment.  A
preliminary look indicates that varying the perturbation randomly
from one division to the next is not much different from the fixed
.000001 truncation.  I will shortly post a 3D table showing error
rate as a function of maximum size integer, maximum perturbation of
those integers, and tolerance (minimum resulting error needed in
order to register a bug occurrence), along with the program that
generated the table.

=================================================================
-------------------------PENTIUM REPORT # bug10----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.arch
Subject: Re: Try this one on the Pentium
Summary:
Expires:
References:
Sender:
Followup-To:
Distribution:
Organization: Computer Science Department, Stanford University.
Keywords:
Cc:

In article , reginald law wrote:
>1048958.75/2.9999991
>
>On the 486 you get 349653.021xxx
>
>On a FDIV Pentium you get 349631.688xx
>
>Looks like a huge error to me.

Yes, no one has seen a bigger error than this to date.  On the other
hand your particular example has been known for some time now.
If you multiply your numerator by 2^2 and denominator by 2^20 you
get essentially the Coe pair, 4195835/3145727, whose relative error
(the only error measure that counts here) is .9996*2^-14, aka 6e-5.

The thing about these FDIV bugs is, once you've found that x/y is a
bug, you know right away that 2x/y and x/2y are bugs as well, and
more generally when you multiply or divide either operand of a buggy
pair by any power of 2 you get another bug.  Note that this rule
does not apply to my bruised integers (integers where you subtract
exactly .000001), because the bruise does not scale with the
integer.

The Coe pair showed up in yesterday's San Jose Mercury News on page
2G, just below W.J. Sanders's statement that "To me, it's a
disclosure issue, not a technical issue."  Jerry Sanders is of
course CEO of a determined Intel competitor, AMD, but his bias
notwithstanding he's hit the nail on the head (as have 4195835
contributors to comp.sys.intel).

Actually I would say it was even more an educational issue than a
disclosure issue---it is apparently not at all obvious to a lot of
people that the error rate can be high; there is a continuing
head-in-the-sand belief that the rate is too low to seriously affect
anyone.  This could not possibly be further from the truth!!!  Both
the industry and Pentium customers need to be educated as to the
magnitude and rate of errors, and the dependence of both these
quantities on their particular application.  The tables I have been
publishing, including a new one I will post shortly, are intended to
better calibrate everyone involved in this mess on these issues.
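The scaling rule is easy to exercise.  This sketch (mine, not from
the post) recovers the quoted example pair from the Coe pair by
rescaling with powers of two, which changes only the binary
exponents and not the significands that FDIV examines; the last
printed digit of the divisor differs from the quoted 2.9999991 only
by rounding.

#include <stdio.h>

int main(void)
{
    double x = 4195835.0, y = 3145727.0;  /* the Coe pair */
    /* dividing x by 2^2 and y by 2^20 leaves the significands
       unchanged, so the buggy quotient persists on a flawed chip */
    printf("%.2f / %.7f\n", x / 4.0, y / 1048576.0);
    return 0;
}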
=================================================================
-------------------------PENTIUM REPORT # bug12----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.arch,comp.sys.ibm.pc.hardware.chips,comp.sys.ibm.pc.hardware.systems,comp.sys.ibm.pc.hardware.misc,comp.sys.intel,sci.engr,sci.engr.civil
Subject: TECHNICAL: Table of FDIV error rates, case of noisy integers
Summary:
Expires:
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c8fbn$mi0@Radon.Stanford.EDU>
Sender:
Followup-To:
Distribution:
Organization: Computer Science Department, Stanford University.
Keywords:
Cc:
====================================================================
For ease of reference I have made my FDIV posts available by ftp at boole.stanford.edu:/pub/FDIV/bug*
====================================================================

The following table gives empirically determined estimates of the expected number of errors per million FDIV operations over a wide range of operating conditions, for the case of *noisy integers*.

Executive summary: under moderately restrictive but plausible conditions one may encounter relative errors of magnitude 10^-5 at rates of one per 100,000 FDIV's, and errors of magnitude 10^-6 to 10^-7 at ten times that rate and for less restrictive conditions to boot.

On Dec. 3 I posted similar tables for "bruised" integers, defined as integers less some fixed amount, e.g. 0.001 or 0.000001, tabulating the actual expected rate however rather than an empirically determined estimate thereof. Whereas a "bruise" in my sense is of fixed size, noise is of constantly varying size, and a noisy integer therefore differs from a bruised integer in that the amount by which the integer is perturbed varies at random from one integer to the next. Hence generating noisy integers requires a random number generator, unlike bruised integers.

The following three parameters are varied.

1. Tolerance. This measures the tolerated relative error magnitude: an FDIV error is registered when its relative error exceeds the tolerance. The tabulated tolerances range from 10^-5 to 10^-11 inclusive, stepping by 10^-2 (so four tolerances, 5,7,9,11).

2. Noise. This measures, as a negative power of ten, the maximum amount by which the integer can be increased or decreased. Noise 3 for example means that each integer is incremented by a random (uniformly distributed) number in the interval [-0.001,0.001]. Noise 0 corresponds to completely noninteger data, since the perturbation can be anything in the intervals to the next integer on either side. Noise 1 uses only 1/5 of each interval between two consecutive integers (1/10 in from each endpoint), noise 2 only 1/50, and so on. The tabulated noise levels range from 10^0 = 1 to 10^-13.

3. Maxint. This gives the maximum permissible integer for FDIV operands. Integers are chosen with equal probability from the interval [1,maxint]. The tabulated values of maxint are the positive powers of ten up to 100,000 (five powers).

Here now are the number of FDIV errors encountered in a million random samples, for each of the above combinations of (tol,noise,maxint).
================
Tolerance 1e-05
-------
 Noise:      0    1    2    3    4    5    6    7    8    9   10   11   12   13
-------------------------------------------------------------------------------
max     10:  0    0    0    0    0    0    0    0    0    0    0    0    0    0
int    100:  0    0    0    0    8   79   43    0    0    0    0    0    0    0
      1000:  0    0    0    0    2    2    3    0    0    0    0    0    0    0
     10000:  0    0    0    1    0    0    0    0    0    0    0    0    0    0
    100000:  0    0    0    0    0    0    0    0    0    0    0    0    0    0

================
Tolerance 1e-07
-------
 Noise:      0    1    2    3    4    5    6    7    8    9   10   11   12   13
-------------------------------------------------------------------------------
max     10:  0    0    0    0    2   61  517 1169 1181    0    0    0    0    0
int    100:  0    0    0    0   24  450 1391 1440  393    3    0    0    0    0
      1000:  0    0    0    1   22   63  104   41    8    1    0    0    0    0
     10000:  0    0    0    0    1    3    3    2    0    0    0    0    0    0
    100000:  0    0    0    0    0    0    0    0    0    0    0    0    0    0

================
Tolerance 1e-09
-------
 Noise:      0    1    2    3    4    5    6    7    8    9   10   11   12   13
-------------------------------------------------------------------------------
max     10:  0    0    0    0    3   77  571 1379 1535 1520 1423  920    0    0
int    100:  0    0    0    1   28  450 1517 2131 2156 2024 1309  268    0    0
      1000:  0    0    0    0   24  118  201  275  223  111   57    1    0    0
     10000:  0    0    0    0    2    6   21   14    8    1    0    0    0    0
    100000:  0    0    0    0    0    1    0    0    0    0    0    0    0    0

================
Tolerance 1e-11
-------
 Noise:      0    1    2    3    4    5    6    7    8    9   10   11   12   13
-------------------------------------------------------------------------------
max     10:  0    0    0    0    4   71  587 1427 1748 1694 1614 1571 1556 1368
int    100:  0    0    0    1   36  506 1584 2167 2297 2277 2272 2162 2140 1333
      1000:  0    0    0    2   21  100  182  280  275  274  223  224  137   24
     10000:  0    0    0    1    8    8   18   34   19   29   22    8    5    0
    100000:  0    0    0    0    0    1    2    0    0    0    0    0    0    0

==========

REMARKS.

1. Noise. The noise must get down to .001 (noise 3) in order to notice anything at all in a million FDIV's. At this noise level, one in a thousand perturbations will be within .000001 of the integer, and these are the most likely causes of error reports. For large errors (tolerance 10^-5) the level of noise generating the most errors is around 10^-5 (noise 5). Very small noises do not trigger very large errors, but they trigger lots of small errors. Over a wide range of noise levels and tolerances, the Pentium is capable of returning errors one or two orders of magnitude greater than the injected noise.

2. Maxint. The highest rates happen for the smallest integers. As the typical integer size increases, the character of this increasing quantity gradually shades over from integer to real, so that at typical size 100,000 errors occur negligibly often. Being a large integer is not much different from having noise 0: in either case any bit pattern can happen.

CONCLUSIONS.

If for any reason your data contains a significant proportion of slightly perturbed integers, you are at extraordinary risk of a 10^-5 relative error by all reasonable computational standards. An extreme case in this respect is for tolerance 10^-5, noise 10^-5, maxint 100. The table entry of 79 here means that if you first pick two random integers in the range 1 to 100, then for each one add to it a random amount chosen from the range [-0.00001,0.00001], and finally divide the first perturbed integer by the second, you have a chance of just under one in 10,000 of getting an error larger than one in 100,000. (As it happens, the error if it occurs will almost certainly be almost exactly one in 65,000, with the occasional occurrence of one-in-16,000 errors.)

For three-digit operands at this tolerance and noise level the error rate drops off sharply, but is still dangerously high. At four digits things calm down considerably; an isolated 10^-5 error was seen at noise 10^-3, but this seems to have been just an unlucky chance.
Relatively few applications should be put seriously at risk by these much lower rates, with the caveat of course that any algorithm might just happen to be one of those unlucky few.

At tolerance 10^-7 (catering for all errors in the range from 10^-5 to 10^-7) the error rate rises sharply across the board, reaching a rate that does not increase significantly thereafter as the tolerance further decreases to 10^-11. Here for any size operand up to 1000, the error rate is extremely high for a range of noise levels over four orders of magnitude, from 10^-4 to 10^-8. While slide rule users will not be affected by these 10^-6 and 10^-7 errors, the tendency of today's complex algorithms to magnify errors as they propagate can easily blow up these ostensibly minor errors into large errors.

BOTTOM LINE. The Pentium may well give many or most users little trouble. However, our demonstration above shows that certain reasonably general types of application are seriously at risk, and the complexity of the governing conditions reveals the great difficulty of deciding whether any given application has hidden deep inside it circumstances capable of evoking the full strength of the bug. The decision whether or not to use the Pentium in any application that touches the floating point unit can then become as complex as the problem solved by the application itself.

==========

Here is the program that generated this table (less some of the headings):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

main(argc, argv)
char *argv[];
{
    int b, i, cnt, maxint, samplesize=1e6;
    double num, den, tol, noise, randnoisyint();

    for (tol = 1e-5; tol >= 1e-11; tol *= .01) {
        printf("================\nTolerance %g\n-------\n", tol);
        for (maxint = 10; maxint <= 1e5; maxint *= 10) {
            printf("%7d: ", maxint), fflush(stdout);
            for (noise = .999, b = 0; b < 14; noise *= .1, b++) {
                for (i = 0, cnt = 0; i < samplesize; i++) {
                    num = randnoisyint(maxint, noise);
                    den = randnoisyint(maxint, noise);
                    cnt += (((num-(num/den)*den)/num)>tol);
                }
                printf("%5d", cnt), fflush(stdout);
            }
            printf("\n");
        }
    }
}

/* Return random integer in [1,maxint], perturbed by [-1,1]*noise */
double randnoisyint(maxint, noise)
int maxint;
double noise;
{
    return 1.0 + ((rand()&0x7fffffff)%maxint) + (2*drand48()-1)*noise;
}

=================================================================
-------------------------PENTIUM REPORT # bug13----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15511 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: Why there is no worst FDIV bug---large vs. likely
Date: 10 Dec 1994 07:34:21 GMT
Organization: Computer Science Department, Stanford University.
Lines: 75
Message-ID: <3cbllt$nqi@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15511 comp.sys.intel:22160

In article , Sivasankar Chander wrote:
>Vaughan R. Pratt (pratt@Sunburn.Stanford.EDU) wrote:
>: ... Even though
>: 4.999999/14.999999 is as long as 4195835/3145727, it should have lower
>: Kolmogorow complexity on a non-Pentium.
>
> Actually, Tim Coe found many pairs of the form xx.999999/yy.9999999 also,
>but apparently not your 4.99../14.99.. case.
>However, the *Kolmogorow complexity* of most of Tim's divisors are much lower than yours. Just look at the hexadecimal representation of the mantissa.

Quite right, I should have gone into this. A common integer divisor for these bugs is one less than 3 times an odd number times a power of 2. Taking the odd number to be 1 presumably minimizes the Kolmogorov complexity *of the divisor*. However I was counting the numerator as part of the pair, and about half of it looks pretty random to me, driving up the total Kolmogorov complexity.

Our respective subjective estimates of the Kolmogorov complexity of these examples suffer from being just that: at these very small sizes Kolmogorov complexity is a pretty architecture-dependent measure, and one has to go out further along the asymptote before it becomes a meaningful basis for comparisons.

> Given that the error rate for uniformly distributed random operands is
>of the order of 1e-9, my *expectation* of the smallest decimally represented
>pair that fails is of the order of 9 digits (for both x and y). So I sense
>that at least one *nice* pair with as little as 5 digits in the dividend
>and 4 in the divisor (or (6,3) or (4,5) etc.) is eluding us. Note
>that it could be of the form .xxxxxx / y.yy etc.

There are getting to be various little languages here, in each of which one can ask for small and/or simple examples. The basic language is just decimal digits and /, in which the Coe pair 4195835/3145727 is stated. You are willing to allow the decimal point as well in your language. My bruised-integer examples assume a language in which 7/28 denotes 6.999999/27.999999. All these languages except the basic one add questionable artifacts that undermine to some extent the significance of their succinctness. This takes much of the fun out of playing the shortest-example game for all but the basic language.

Apropos of which, I recently pointed out that the Coe pair 4195835/3145727 had 14 digits, improved it to 13 digits with 41811115/49151, and asked for a buggy pair of integers with a total of 12 digits. James Shearer pointed out to me that Andreas Kaiser had already some time ago posted 1/3221224323, having only 11 digits. I hadn't thought to look in that region, but when I did I found it contained quite a few 10-digit examples, including

    7/402651871
    7/805303742
    57/50331637
    383/3145725
    383/6291450
    861/4718589
    861/9437178

A half-hour search turned up no 9-digit examples. With the greater expressiveness of your language, which allows a decimal point at an arbitrary location in each operand, your conjecture that there should be a 9-digit example becomes reasonably plausible. For what it's worth, I ran an exhaustive search for two four-digit operands each with a decimal point anywhere and turned up nothing.

Is 9 digits with two decimal points better than 10 digits with no point? Yet another of life's imponderable little mysteries. For this pointless exercise I think I prefer the pointless language.

--
Vaughan Pratt              http://boole.stanford.edu/boole.html
"Eureka" -- Greek for "Oops, forgot my bathrobe again"

=================================================================
-------------------------PENTIUM REPORT # bug14----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15522 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: Why there is no worst FDIV bug---large vs. likely
Date: 10 Dec 1994 20:48:18 GMT
Organization: Computer Science Department, Stanford University.
Lines: 28
Message-ID: <3cd46i$ara@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c9tii$1tr@duck.ftn.us-ny.citicorp.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15522 comp.sys.intel:22251

In article <3c9tii$1tr@duck.ftn.us-ny.citicorp.com>, Renato Ghica wrote:
>duh..I'm no math genius, but isn't an error where the result is
> 256 instead of 0 a *MUCH* larger magnitude than 2^-14 ? ;-)

Only as an absolute error. In situations like this absolute errors (simple differences) are a less meaningful measure of accuracy than relative errors, where one divides the absolute error by the correct amount.

Suppose for the sake of argument you weigh a flea and get 10 gms instead of 1 gm, and I weigh the puppy the flea is on and get 1100 gms instead of 1000 gms. Which of us would you say has taken more care to make an accurate measurement? Our respective absolute errors are 9 gms and 100 gms, suggesting you were more careful. Our respective relative errors however are 9 and 0.1. Examples like this demonstrate that relative error agrees with our intuitions about the notion of accuracy in measurement better than does absolute error.

I would not have thought to even mention any of this were it not for the remarkable number of posts based on an apparent belief in the superiority of absolute error over relative even in these applications.

--
Vaughan Pratt              http://boole.stanford.edu/boole.html
"Eureka" -- Greek for "Oops, forgot my bathrobe again"

=================================================================
-------------------------PENTIUM REPORT # bug15----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15535 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel,sci.math.num-analysis
Subject: TECHNICAL (Pentium): Distribution of problem divisors
Date: 11 Dec 1994 04:16:43 GMT
Organization: Computer Science Department, Stanford University.
Lines: 116
Distribution: inet
Message-ID: <3cdufb$r9u@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15535 comp.sys.intel:22326 sci.math.num-analysis:10047

In order to get a better grasp of the FDIV error rates, I sampled the distribution of the problem divisors, assuming a random numerator uniformly distributed in [0,1]. This assumption probably masks some patterns that might otherwise leap out and offer an easier way of estimating those probabilities. Even with this masking however, certain patterns were still visible.

The pentads as I've been calling them are the five integer multiples of 3 between 18 and 30 inclusive, namely 18,21,24,27,30, together with all reals obtainable from these by repeated doubling or halving. (What I was previously calling a pentad I'll now call an integer pentad, ruling out the pentad 4.5 = 18/4 for example.) The pentads are sufficiently uniformly distributed along the positive real line that every half-open interval [x,2x) contains exactly five pentads.
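The pentad structure is easy to play with. Here is a minimal C sketch (my own illustration, not from any of these posts; the sample points are arbitrary) that lists, for a few values of x, the pentads falling in [x,2x); each interval should contain exactly five.

#include <stdio.h>

/* The five basic pentads; every pentad is one of these times a
   (possibly negative) power of 2. */
static const double base[] = {18, 21, 24, 27, 30};

/* Print the pentads lying in the half-open interval [x, 2x) and
   return how many there are. */
int pentads_in(double x)
{
    int i, n = 0;
    for (i = 0; i < 5; i++) {
        double p = base[i];
        while (p >= 2*x) p /= 2;   /* scale down past the top of the interval */
        while (p < x)    p *= 2;   /* scale up into the interval */
        printf("%g ", p);
        n++;
    }
    return n;
}

int main(void)
{
    double x;
    for (x = 0.5; x < 5000; x *= 3.7)   /* arbitrary sample points */
        printf(" -> %d pentads in [%g,%g)\n", pentads_in(x), x, 2*x);
    return 0;   /* every line should report exactly 5 */
}

Since [x,2x) spans exactly one doubling, each of the five bases lands in it exactly once, which is all the uniformity claim amounts to.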
Tim Coe's table of problem FDIV divisors indicates that each lies in some interval [kp,p] where p is a pentad and k<1 is very close to 1. Certainly k>.99978, and for most cases k > .999995.

For the pentad p = 7680 = 30*2^8 (as a convenient representative of all pentads that are multiples of 30) I have looked at 10 million FDIV errors arising from randomly chosen divisions and used how they tended to cluster to divide up the interval [7679,7680] into 16 regions. The following table shows, as a function of the divisor d in the erroneous division, what proportion of those ten million errors lies in each range of possible values of p-d. For example if 12345/7679.997 is in error then p-d = 7680-7679.997 = .003, putting this error in the first interval .0000-.0069. For 7680-.0078 < d < 7680-.0069, errors go in the second interval, and so on. Note that the proportion is for the whole interval; dividing it by the width of the interval yields the average error *density* in the interval (left to the reader).

The numerators used to generate this table were generated at random, uniformly from the interval [0,1]. I have not attempted to determine the dependence of this distribution on the choice of numerator.

   p-d        proportion of errors in that interval
---------
.0000-.0069   .8989954
.0069-.0078   .0000454
.0078-.0084   .0026463
.0084-.0097   .0000260
.0097-.0102   .0001228
.0102-.0156   .0000846
.0156-.0189   .0949042
.0189-.0195   .0000010
.0195-.0201   .0027801
.0201-.0214   .0000120
.0214-.0218   .0001364
.0218-.0312   .0000082
.0312-.0321   .0001803
.0321-.0331   .0000002
.0331-.0335   .0000180
.0335-.1000   .0000391

Sorting this table by the proportion of errors yields the following ranking of intervals by danger level.

.0000-.0069   .8989954
.0156-.0189   .0949042
.0195-.0201   .0027801
.0078-.0084   .0026463
.0312-.0321   .0001803
.0214-.0218   .0001364
.0097-.0102   .0001228
.0102-.0156   .0000846
.0069-.0078   .0000454
.0335-.1000   .0000391
.0084-.0097   .0000260
.0331-.0335   .0000180
.0201-.0214   .0000120
.0218-.0312   .0000082
.0189-.0195   .0000010
.0321-.0331   .0000002

So nearly 90% of the errors occur in the interval 7679.9931 to 7680, while another 9.5% occur in the interval 7679.9811 to 7679.9844, 0.28% in 7679.9799 to 7679.9805, and 0.26% in 7679.9916 to 7679.9922. This accounts for all but 0.067% sprinkled over the other intervals.

Within the first interval the distribution is not flat but seems to be piecewise continuous (meaning an occasional discontinuity, i.e. sharp jumps, typically downward) to within the resolution I was using. At p-d=0 it starts out nonzero and rises steadily with small wobbles to reach twice the starting rate at p-d = .002, then drops abruptly to one third of that rate for some time thereafter. It then rolls off somewhat unsteadily down to 0 very briefly, goes up a bit to wobble around some more, then abruptly collapses to almost zero at .0069 where it stays until .0078. All this would be more visualizable if plotted.

The binary representation of the boundaries of these intervals is significant. The typical though not universal pattern as p-d increases is that the likelihood of an error increases in general, with sudden downward jumps whenever there is a carry, with bigger jumps for bigger carries, particularly when the carry yields a power of two.

I have not had time to repeat this in full for the other four basic pentads, which at the corresponding scaling are those shown in the left column of the following table.
However I have looked briefly at them, again with a random numerator in [0,1], and found the following *number* of errors encountered during a 15-second random search in the specified interval, for p-d as shown.

Pendant to the table: each count column covers one hundredth-wide range of p-d.

Pentad p            Number of errors in 15 seconds for p-d in the range:
                        .00 ---    .01 ---    .02 --- .03
3*256*6  = 4608           45         21          1
3*256*7  = 5376           11          9          0
3*256*8  = 6144           33         13          0
3*256*9  = 6912           13          5          0
3*256*10 = 7680           17         11          0

This shows that for all five of these pentads p, most of the errors happen for p-d in [.000,.020].

--
Vaughan Pratt              http://boole.stanford.edu/boole.html
"Eureka" -- Greek for "Oops, forgot my bathrobe again"

=================================================================
-------------------------PENTIUM REPORT # bug16----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15537 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel,sci.math.num-analysis
Subject: TECHNICAL (Pentium): Bad section of the noisy-integer table
Date: 11 Dec 1994 09:01:34 GMT
Organization: Computer Science Department, Stanford University.
Lines: 59
Distribution: inet
Message-ID: <3cef5e$4vl@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c8fbn$mi0@Radon.Stanford.EDU> <3c978o$r25@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15537 comp.sys.intel:22365 sci.math.num-analysis:10049

====================================================================
For ease of reference I have made my FDIV posts available by ftp at boole.stanford.edu:/pub/FDIV/bug*
====================================================================

Here is a particularly bad section of the noisy-integer table I recently posted, organized as a self-contained and reasonably short account for ease of forwarding and posting elsewhere. This section is so bad that it pretty much subsumes (and looks rather grimmer than) the previous tables I've posted.

============================================

Think of two real numbers x and y in the range 1 to 100, each at most 1/100,000 less than an integer. For example x might be 35.9999947 and y 53.9999982. The possible choices are assumed equally likely. Now calculate x/y on the Pentium. If the difference between the Pentium's answer and the correct value of x/y is more than 1/100,000 of the correct value, declare the Pentium to be wrong. For this situation the Pentium will be wrong with the following probabilities.

1. In the general case, the probability that the Pentium will be wrong is one in 3000.

2. In special cases, namely when the two integers that x and y are near appear on the following list, the Pentium will be wrong at least 300 times more often than for the general case. The probability it will be wrong for each such special case is shown to the right of the pair. In all of these cases the probability of being wrong is at least one in ten.

36/54  0.23    40/60  0.21    10/60  0.20
18/54  0.19    10/30  0.19     9/54  0.18
 9/27  0.16    20/60  0.15    18/27  0.14
15/36  0.14    36/27  0.12    20/30  0.12
15/72  0.11    10/15  0.10     5/60  0.10

3. In order for the Pentium to be wrong, x and y must be near one of the above 15 pairs or one of the following 34 pairs. For the latter, the probability that the Pentium will be wrong ranges from 0.096 down to 0.001.
40/30  5/30   20/15  5/15   33/72  15/18  40/15
44/96  72/54  33/36  22/96  22/48  80/60  15/9
44/48  22/24  72/27  11/48  11/96  80/30  44/24
11/24  33/18  44/12  22/6   22/12  11/6   80/15
33/9   44/6   11/12  11/3   44/3   22/3

4. If the Pentium is judged wrong when x and y are each at most 1/1,000,000 less than an integer instead of 1/100,000, and the error is at least 1/1,000,000 instead of at least 1/100,000, then the Pentium will now be wrong with probability one in 730, and an additional 181 pairs besides the above can give the wrong answer. In this case 60/72 is the most likely source of a wrong answer, with associated probability 0.34.

This brings to 230 the total number of possibly wrong pairs, out of the 10,000 pairs of integers from 1 to 100.

--
Vaughan Pratt              http://boole.stanford.edu/boole.html
"Eureka" -- Greek for "Oops, forgot my bathrobe again"

=================================================================
-------------------------PENTIUM REPORT # bug17----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15547 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel,sci.math.num-analysis
Subject: Re: TECHNICAL (Pentium): Bad section of the noisy-integer table
Date: 11 Dec 1994 17:14:20 GMT
Organization: Computer Science Department, Stanford University.
Lines: 33
Distribution: inet
Message-ID: <3cfc1c$ihs@Radon.Stanford.EDU>
References: <3c7tav$g0m@Radon.Stanford.EDU> <3c8fbn$mi0@Radon.Stanford.EDU> <3c978o$r25@Radon.Stanford.EDU> <3cef5e$4vl@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15547 comp.sys.intel:22413 sci.math.num-analysis:10053

In article <3cef5e$4vl@Radon.Stanford.EDU>, Vaughan R. Pratt wrote:
>
>4. If the Pentium is judged wrong when the error is at least
                                        ^
>1/1,000,000 instead of at least 1/100,000, then the Pentium will now be
>wrong with probability one in 730, and an additional 181 pairs besides

Insert: "x and y are at most 1/1,000,000 less than an integer and"

(I forgot that my program changes these simultaneously, as a heuristic for keeping down the number of parameters in this simplification of the more comprehensive picture given in my first post, /pub/FDIV/bug1 on boole.stanford.edu.)

Apropos of this, it is well worth bearing in mind the following useful empirically determined rule of thumb about truncated integers, which can be read off the table in my first post (bug1).

RULE: For a truncated integer divisor, the Pentium's relative error is at most about two orders of magnitude greater than the amount of the truncation.

For example, with truncation by 10^-6 the rule predicts relative errors of at most about 10^-4; the worst three-digit case listed in bug9 above, 9/54 with relative error 1.35634e-05, is comfortably within this bound. Please let me know of any counterexamples.

The significance of this is that for extremely small truncations, the associated errors will be negligibly small for all but extremely sensitive applications. In particular truncations down near the last two or three bits of double precision significance should affect no one at all, taking into account that these effects are further diluted by the probability of any error at all being much less than 1.

--
Vaughan Pratt              http://boole.stanford.edu/boole.html
"Eureka" -- Greek for "Oops, forgot my bathrobe again"

=================================================================
-------------------------PENTIUM REPORT # bug18----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Re: Pentium bug and small numbers?
Summary:
Expires:
References: <3cdqa4$rce@csc.mc.edu>
Sender:
Followup-To:
Distribution:
Organization: Computer Science Department, Stanford University.
Keywords:
Cc:

In article <3cdqa4$rce@csc.mc.edu>, Stephen Wilhelm wrote:
>I have seen plenty of examples of numbers that causes the Pentium to
>give wrong results. Usually, they are 4195835.0 and 3145727.0, or
>some multiple of them. Both are pretty big numbers. I recently heard
>a rumor that the FDIV bug also affects some small numbers, but I am
>very suspicious of this claim because I have neither seen nor heard of
>any examples of this claim.

Close. The so-called "small" numbers are divisions like 5/15, which of themselves cause no problem. But if you truncate them for any of various reasons, e.g. limited decimal precision of externally obtained data, then the bug hits you. For 5/15, truncating as 4.999999/14.999999 yields an error in the 16th significant bit, i.e. > 10^-5.

About 800 out of the 10,000 possible two-digit divisions are at risk here, of which 230 are at risk of a 10^-6 error, and 49 at risk of a 10^-5 error. For a quick tour of this situation see my post yesterday on this, also available as boole.stanford.edu:/pub/FDIV/bug16. For mind-numbingly more detailed tables see bug1, treating truncated integers of the above kind, and bug12, treating "noisy" integers with zero-mean rectangularly distributed noise.

The shortest "honest" examples I know of, no truncation tricks or decimal points anywhere, are given in bug13, namely the 10-digit examples 7/402651871, 7/805303742, 57/50331637, 383/3145725, 383/6291450, 861/4718589, and 861/9437178. It is open whether there are any 9-digit examples, and still open even if one allows decimal points (not counting the points themselves). Truncation (what I was previously calling "bruising," a term that, though catchy, is less clear and standard than truncation) is the only way proposed so far for getting to the really small examples.

=================================================================
-------------------------PENTIUM REPORT # bug19----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15585 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel,sci.math.num-analysis
Subject: TECHNICAL (Pentium): Exact number of FDIV error-pair FLOATs (was: Bad section of the noisy-integer table)
Date: 12 Dec 1994 18:27:59 GMT
Organization: Computer Science Department, Stanford University.
Lines: 142
Distribution: inet
Message-ID: <3ci4nf$ar2@Radon.Stanford.EDU>
References: <3cef5e$4vl@Radon.Stanford.EDU> <872@moene.indiv.nluug.nl>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15585 comp.sys.intel:22652 sci.math.num-analysis:10072

Summary: A search of the 2^46 pairs of single-precision mantissas turned up 1281 erroneous FDIV pairs x/y, under the error criterion that x - (x/y)*y is at least 3 for an exponent that assigns unit weight to the least significant bit.
The search was such that there should be fewer than half a dozen error pairs remaining unobserved, which a longer search will produce tomorrow. For Toon Moene's weather forecasting application this corresponds to an error rate of one every four days for uniformly distributed FDIV operands, and far more for operands with even an imperceptible tendency to cluster around integers.

In article <872@moene.indiv.nluug.nl>, Toon Moene wrote:
>In article <3cef5e$4vl@Radon.Stanford.EDU> pratt@Sunburn.Stanford.EDU
>(Vaughan R. Pratt) writes:
>> Here is a particularly bad section of the noisy-integer table
>
>This is very interesting stuff, but for my application it would be far
>more useful to simply have a list of dividend / divisor / Pentium-quotient
>triples of mantissa's for the cases where the Pentium is wrong. As my
>application uses almost exclusively 32-bit floating point arithmetic - and
>is stable at that - I'm currently only left with the 8 (!) 32-bit dividend
>/ divisor pairs in Tim Coe's text (The Pentium Papers entry at
>http://www.mathworks.com). Now 8 error pairs in a sea of (2^24)^2 possible
>mantissa pairs is a vanishingly small stick in a haystack.

This haystack contains only (2^23)^2 pairs if we stick to positive floats, or 70 trillion. Even so, although the Pentium is a fast machine, brute-force search (perebor as Russian computer scientists call it) of that haystack would take 2.25 years at 100 MHz, though a labfull of 100 Pentiums could do it in 8 days.

However Tim Coe has shown that the bad divisors lie in at most 5/4096 of the possible single-precision mantissas. By only examining those mantissas the search time on one Pentium is reduced to 24.0 hours (coincidence). Coe claims that over 90% of the bad divisors lie in the top 36/2048 of *that* already small space, and my own observation shows (for pentad 30 at least) that over 99.9% of them lie in the top 64/2048. Limiting the search to that space saves a further factor of 32, bringing it down to 45 minutes.

I did this. For that region the search program below found 1281 bad pairs. I've started up the 24-hour search and unless the other four pentads are very different I expect fewer than half a dozen more pairs. So this 1281 number is not only a lower bound but I expect will also turn out to be a sufficiently good approximation to the exact number for all practical purposes.

>Earlier I wrote
>that our Numerical Weather Forecasting code does on the order of 3.5 *
>10^9 divisions per forecast (or 4 * 365 * 3.5 * 10^9 ~ 5 * 10^12 divisions
>per year), which looked like _something_ against an error occurrence rate
>of 1 in 9 * 10^9 - but now it seems that for 32-bit operations the error
>occurrence rate is only 8 * 2^-48 ~ 3 * 10^-14. That's quite something
                                ^^
>else than 1 in 9 * 10^9 ...

This should be 46 (this is for positives only, I'm assuming that the negatives behave essentially the same), giving an error rate of 1.14 * 10^-13 assuming only 8 erroneous pairs. With 1281 erroneous pairs the rate jumps by a factor of 160 to 1281 * 2^-46 ~ 1.82 * 10^-11.

IF your FDIV operands do not cluster around integers *at all* (and let me emphasize the "at all" here so you do not come back a month later and include me with Intel in a suit alleging failure to disclose, due to some tendency towards such clustering throwing these numbers *way* off), then your forecast has a probability of 0.064 of containing an error. Thus at 4 forecasts per day you will see an error every four days on average.
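For anyone who wants to check this arithmetic, here is a short C sketch of the calculation (my own, not part of the original exchange), using only the figures quoted above: 1281 bad pairs among the 2^46 positive single-precision mantissa pairs, 3.5*10^9 FDIV's per forecast, and 4 forecasts a day.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double pairs    = pow(2, 46);       /* positive single-precision mantissa pairs */
    double bad      = 1281;             /* erroneous pairs found by the 45-minute search */
    double per_div  = bad / pairs;      /* error probability per uniformly distributed FDIV */
    double per_run  = 3.5e9 * per_div;  /* expected errors per forecast run */
    double runs_day = 4;                /* forecasts per day */

    printf("rate per FDIV:       %.3g\n", per_div);               /* ~1.82e-11 */
    printf("errors per forecast: %.3g\n", per_run);               /* ~0.064 */
    printf("days between errors: %.3g\n", 1/(runs_day*per_run));  /* ~4 */
    return 0;
}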
Not exactly 27,000 years, but then Intel's hypothetical FDIV-averse "average spreadsheet user" is happy to let you tell her whether she'll need an umbrella tomorrow. An error every four days is a butterfly wing that is very unlikely to affect your estimate of the chance of showers.

Toon, how do you know you do that many FDIV's? Did you instrument your code to count them? If so, could you conveniently instrument it to count errors? You should certainly see a couple of errors every week, and any more than that would give some idea of how the nonuniformity of your data was affecting things. Of course it is conceivable (but highly unlikely in my opinion) that your data *stays away from* the problem areas, giving a lower error rate. It would be very nice to know for sure. Not that I speak for Intel, but I would be somewhat surprised if they were unwilling to underwrite the associated expenses if that were an obstacle to doing this experiment.

Here's the program that produced the above statistic. As a curiosity, the only thing tipping one off to the fact that all the arithmetic is done in floating point is the float declaration itself. I hope the "> 2" is close enough to being the appropriate test here; if not, please spend 45 minutes running it again with whatever you believe to be the right test. To see what proportion of 15th-bit errors occur for example, run it with "> 2" replaced by ">= 512". Bear in mind that the "first bit" is the sign bit in IEEE arithmetic, and the least significant bit is thus the 24th bit. In general an n-th bit error is of size 2^(24-n) for this program, e.g. a 15th bit error is of size 2^(24-15) = 512, confirming my ">= 512" above.

#include <stdio.h>

main()
{
    int wrong = 0;
    float p, x, y;
    for (x = 1<<23; x < 1<<24; x++)
        for (p = 18*(1<<19); p <= 30*(1<<19); p += 3*(1<<19))
            for (y = p-1; y >= p-64; y--)
                wrong += (x - (x/y)*y > 2);
    printf("%d\n", wrong);
}

I will do the all-day run next (simply increase the -64 to -2048) and report the exact answer tomorrow, not that anyone is likely to care unless it uncovers an unexpected difference in one of the other four pentads. I have previously posted the distribution for pentad 7680 for the case of double precision, available as /pub/FDIV/bug15 on boole.stanford.edu, which has less than .1% of the errors outside the region in which the 1281 errors were found. This article will be bug19.

This program is "embarrassingly parallelizable" as they say, and someone with a dozen Pentiums in their lab could easily beat me to the punch by parallelizing this program, thereby getting the exact answer in two hours instead of a day.

Let me close with two caveats. First, the single-precision domain is relevant for people like Toon Moene who for whatever reason work there. (But I would not be enormously surprised if the single-precision error rate turned out to be a good predictor of the double-precision Pentium error rate.) Second, let me emphasize yet again that clustering around integers can magnify these low error rates considerably. In order to go from a rate of one error per four days in your forecast to one per four minutes requires only a very tiny bulge around integers in the distribution of your FDIV operands, a bulge you would never notice if plotted on a graph. By the time the bulge became noticeable errors would be happening more than once a second (back of the envelope, your envelope may vary:-).
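As a footnote to that last caveat, the arithmetic behind the "tiny bulge" is easy to tabulate. The following sketch is my own illustration, taking the 10^-5 danger width used in the bruised-integer discussion above as the region a clustered operand falls into; the fractions tried are arbitrary. Since both FDIV operands must land in the danger region, the amplification over the uniform case goes as the square of the clustering.

#include <stdio.h>

int main(void)
{
    double w = 1e-5;  /* width of the near-integer danger region */
    double f;         /* fraction of operands clustered in that region */

    /* Uniformly distributed operands already land in the danger region
       with probability w, so clustering a fraction f of them there
       multiplies the chance that BOTH operands land there by (f/w)^2. */
    for (f = 1e-5; f < 0.2; f *= 10)
        printf("clustered fraction %-7g -> error rate x %g\n", f, (f/w)*(f/w));
    return 0;
}

A one-percent bulge (f = 0.01) comes out as the factor of a million that reappears in bug21 below.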
--
Vaughan Pratt              http://boole.stanford.edu/boole.html
"Eureka" -- Greek for "Oops, forgot my bathrobe again"

=================================================================
-------------------------PENTIUM REPORT # bug20----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15647 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,sci.math.num-analysis
Subject: TECHNICAL (Pentium) Distribution of error MAGNITUDES
Date: 14 Dec 1994 01:42:18 GMT
Organization: Computer Science Department, Stanford University.
Lines: 155
Sender: pratt@cs.stanford.edu
Distribution: inet
Message-ID: <3clihq$1kt@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Keywords: Pentium, FDIV, error, magnitude
Xref: Radon.Stanford.EDU comp.sys.intel:23313 comp.arch:15647 sci.math.num-analysis:10114

Thus far I've been writing about the distribution of RATES of FDIV errors on the Pentium, as a function of the distribution of operand *values*, in particular how near an integer they are. In this post I'll address the distribution of MAGNITUDES of those errors, as a function of operand *precisions*, in particular whether the operands are double-precision or something less, meaning that the mantissa is zero in the last n bits for some fixed n; n=29 is for IEEE single-precision. For this note I will fix n at 29 and hopefully will have time to explore smaller (and maybe larger) n later on.

As has been pointed out and acted on long ago by both Tim Coe and Mike Carlton (and my face is red for not being aware of this yesterday when I did this), there are sufficiently few single-precision mantissas, namely a mere 8 million, that it becomes feasible to explicitly examine every single-precision mantissa pair whose divisor is in one of the 5 danger regions. There are 64 trillion pairs, 5/4096 of which have their divisors in one of those critical regions. This yields a space that can be exhaustively searched by a single Pentium in 24 hours (exactly!), and in exactly 24/p hours on p Pentiums (what the parallel computing community calls an "embarrassingly parallelizable" computation). A further decrease to 1/32 of this space, taking only 45 minutes, is made possible at the cost of overlooking only about 2.5% of error pairs. I pointed this out in my previous two posts, where I reported a total of 1312 errors off by 3 or more for numbers scaled to assign unit weight to the least significant bit of a single-precision mantissa, of which 1281 were found by the 45-minute search.

The following both subsumes and improves on the 1281 number, while greatly clarifying (at least in my mind) a number of issues about density that I had not previously appreciated.

Running the 45-minute search, I collected all errors at *all* magnitudes, even those that were up to 11 bits more precise than even double precision! (Unless I've misunderstood something, the Pentium appears to retain 64 bits of precision throughout the evaluation of the code resulting from the C expression (x/y)*y. I'd love to know more about what's going on here; can someone more knowledgeable in this area please explain?) The quick search found 9236 such errors, which should be within 2.5% of what the 24-hour search would produce.

I then classified these errors simply according to their exponents.
Each such class therefore represents one "octave" of precision, and the full range of possible errors is then 63 octaves, one per bit of precision in the mantissa plus the extra 11 bits mentioned above. The biggest known FDIV error is in the 15th bit (I'm calling the leading bit of the true mantissa the 1st bit, so that an answer of 4 = 100 instead of 5 = 101 would be wrong in the 3rd bit), whence the first 14 octaves are all empty. Starting at octave 15, the 9236 errors found by the quick search are distributed as follows.

15  67    22 130    29 256    36 137    43 231    50 139    57 236
16 105    23 272    30 138    37 226    44 153    51 238    58 146
17 167    24 146    31 262    38 155    45 254    52 142    59 214
18 148    25 242    32 148    39 236    46 143    53 222    60 160
19 273    26 157    33 252    40 139    47 246    54 153    61 236
20 131    27 272    34 148    41 223    48 142    55 239    62 139
21 251    28 131    35 249    42 141    49 233    56 145    63 223

To within 2.5%, this tells us that, of the 2^46 single-precision pairs of mantissas, 67 of them will produce an error in the 15th bit. The most-frequently cited Coe pair, 4195835/3145727, is among these because both operands are representable exactly in single precision (4195835 < 16777216 = 2^24; don't forget the hidden leading bit, the true mantissa has 24 bits). In the next octave, consisting of errors of half that magnitude, there are 105 errors. And so on down the line.

A very striking feature of this distribution is that, except for the ramp-up at the first three entries, it is so beautifully uniform. The one nonuniformity is a "high-frequency oscillation" in which each odd octave has around 240 errors while each even octave has only 140 or so errors. Had we been collecting octaves two at a time (or four at a time had we used the same method on a 1960s IBM 360) we would have seen only a nearly perfectly flat distribution; the 2x or 4x slower sampling would have filtered out the high frequency (with no aliasing or beats since the "buzz" period is exactly 2 octaves).

This table is extremely useful for predicting error rates. The likelihood of a worst-case error is 67/2^46 or 1 in a trillion. The likelihood of a 2^-n error is obtained by adding up the number of errors in the octaves up to n+1 inclusive. (This is to within a factor of 2, or certainly at most 4; I haven't yet analyzed the range resulting from variation within an octave and variation of the dividend within its octave.) Thus a 10^-9 error is a 2^-30 error, so we add up the octaves up to 31, getting 3148, and the rate of errors at least this size is then 3148/2^46.

This table actually goes a little further:

64 329602833
65        46

I presume that octaves 64 and 65 together count the number of single-precision operands for which (x/y)*y is not equal to x, for the region searched. Although that region closely reflects the populations of octaves 63 and below, octave 64 would be much larger had we examined all 2^46 pairs. As it was the quick search examined 2^23*5*64 = 2684354560 pairs, almost exactly 8 times the population of octave 64, which I can well believe is the theoretically correct number of (x/y)*y != x cases. In which case, well done, Intel, too bad about the missing entries.

I would love to know how other products stack up against this, and to this end I append my code generating the above table so people can try it out on their favorite machine to see just what it does at these very high precisions. Note that octaves 51 and below had better be empty if it isn't a Pentium; you are looking to see in which octave the number 2^23*5*64/8 = 335544320 (or so) shows up.
This program takes 45 minutes on a Pentium. I haven't tried this on any other machine yet, so there may be glitches. For example a big-endian machine like a Sparc will need [0] instead of [3] in extracting the exponent from err. To mount a 24-hour search on the Pentium that catches the remaining 2.5% of error pairs, change "-64" to "-2048", covering the whole of the five Coe intervals.

#include <stdio.h>
#include <math.h>

int bit[1<<11];

main()
{
    int i;
    float x, y, p;
    double err;
    for (x = 1<<23; x < 1<<24; x++)
        for (p = 18*(1<<19); p <= 30*(1<<19); p += 3*(1<<19))
            for (y = p-1; y >= p-64; y--)
                if (err = fabs(x - (x/y)*y))
                    bit[((short*)(&err))[3]/16]++;
    for (i = (1<<11)-1; i >= 0; i--)
        if (bit[i])
            printf("%d %d\n", 1047-i, bit[i]);
}

What is needed next is to generalize this program by making the fixed single-precision assumption (implicit in all the non-unity constants of the three for-loop controls) into a parameter. This way one can study the effect of precision in the operands on the above table. The octave populations must grow monotonically with increasing precision since every 37-bit-precision mantissa is automatically a 38-bit-precision one, and so on. One question this would help one answer is how many double-precision errors there are. The catch is that each additional bit of precision in the operands quadruples the search space. However I believe this can be handled by suitable heuristics.

--
Vaughan Pratt              http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug21----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 23479 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Pratt to Intel: prove it's safe (Was: Who do you believe?)
Date: 14 Dec 1994 09:57:00 GMT
Organization: Computer Science Department, Stanford University.
Lines: 184
Message-ID: <3cmfhc$ies@Radon.Stanford.EDU>
References: <3clu4g$dpe$1@garnet.msen.com>
NNTP-Posting-Host: sunburn.stanford.edu
Keywords: Intel Pentium Detroit Free Press IBM

In article <3clu4g$dpe$1@garnet.msen.com>, Hiawatha Bray wrote:
>I'm a columnist for the Detroit Free Press. A couple of weeks ago, I
>wrote that the Pentium problem seemed like a tempest in a teacup to me,
>though Intel should have come clean a lot sooner. Then the Monday
>announcement by IBM got me thinking I'd taken the matter too lightly.
>
>Now I'm having third thoughts. If the Pentium is as bad as Big Blue
>says, thousands of users worldwide would surely have gone howling to
>their computer dealers, complaining of bizarre arithmetic errors.

For a change of pace from my gruelling postings of tables of numbers, let me wax political for a little bit.

The media has asked repeatedly why Intel is being beaten up, and has speculated at length on the phenomenon, quoting AMD's Jerry Sanders: "The Internet is democracy at its ugliest." Intel for its part asks plaintively why it is being made the guinea-pig "in the new era of crisis management," and protests that it has learned nothing about the bug from the Internet except how emotional people can get (San Jose Mercury News 12/12, page 2D).

There is an answer, one that is like a short proof of a hard problem.
The sort of proof that you don't think of right away, but once you see it you slap your forehead and say, of course, it's obvious, that's what it has to be.

First let me clarify my adversarial role in all this. Intel is not my enemy, my only enemies are ignorance and slowness. "What you don't know won't hurt you" only works in a society with no information highway and whose regular highway has a 4 mph speed limit for horseless carriages. Speed is great, and more of it is where we're all headed. But the faster things go, the more alert you have to be to risks. It's that way with Ferraris, and it's that way with Pentiums. If you hear a little tapping sound coming from the engine at 130 mph, it's heads up or take your chances. Intel is an effective player in technology and I very much hope it continues to be. Intel along with many other companies around Silicon Valley is family to me.

This said, I must side with IBM (I have no IBM stock or other material incentive to take sides here) and state without qualification that I believe the Pentium to be for all practical purposes exactly as bad as IBM says it is.

Now in saying it is bad, IBM is not also saying that thousands of users are currently reporting bizarre arithmetic errors, as implied in your question. What IBM has pointed out are neither anecdotes nor case studies but possible risks. For this they used simple examples that the financial community can easily relate to, and that it can use as a basis for deciding for itself (perhaps not so easily) whether these are risks they are willing to take.

There are two basic technical aspects to this bug that are insidious in the extreme. One is the astronomical range of error rates, from one in every 27,000 years if your coffee breaks leave you time for only a thousand divisions a day *and* your data is perfectly uniformly distributed, to one every millisecond while frantically dividing hordes of small bruised integers. The other is the uncertainty of where the world's applications lie along this spectrum.

Neither end of the spectrum is a likely candidate for any ordinary application. No Pentium user limits herself to 1000 divisions a day, and no one's data is perfectly random or they could use it for a random number generator. At the other end, no one does a division every third instruction, and no one's data consists exclusively of small bruised integers.

Ok, so since no application operates at the extreme low-rate end of the range, where Intel seems to think almost everyone lives, and since no application operates at the worst-case high-rate end either, where the doomsayers like to congregate, where *does* the typical application work and play?

No one knows. Not me, not you, not Intel.

If Intel can show that people's data are distributed so that their division operands are sufficiently random as to achieve close to the one-in-9-billion errors-per-division rate, let Intel produce its argument. To date it has not come close to doing so. Meanwhile various qualified parties (and not-so-qualified parties put there by an invisible force to make interpretation of all these claims more of a challenge) have pointed out that there exist other points on the error-rate spectrum besides the one Intel continues to cling to grimly. And, moreover, that these other points are not just miles but light-years up this spectrum from Intel's cherished low-rate end.

Let me convey some idea of just how hard it is to make any plausible argument about randomness stick.
You might look carefully at the actual operands to the division instruction arising in practice, observe that they seem to be spread extremely uniformly across the whole range, and deduce complete randomness. But now suppose that a mere one percent of those operands turn out to be crowded around integers, to within 0.00001 of the nearest integer, the "bruised integer" danger region for the Pentium's division instruction. Without microscopic examination of the distribution, this little blip so near the integers could pass completely unnoticed. Yet now the proportion of bruised integers has gone up from one in a hundred thousand (for random data) to one in a hundred, this being the percentage now crowding around integers. This increase of a factor of a thousand in bruised integers, in both the divisor and the dividend, will increase the error rate by a thousand squared, that is, a million. Yet this staggering increase has resulted from a one percent change to the distribution, a change almost as invisible as the bug itself. Such extreme sensitivity to the ambient conditions verges on chaos. It is *very* hard to prove what Intel wants us all to take unquestioningly for granted.

Now, another automotive analogy. A question has arisen inside the Duesenberg Motor Car Company as to the safety of the brakes of this year's model. The question involves a very peculiar problem brought to light by an engineer. No car in that line has yet had a brake failure attributable to this problem; moreover it is quite on the cards that none ever will. Unless, that is, some people brake in a certain way that though peculiar is not really all that difficult to do and which you could conceivably imagine your kid doing while out for a spin, for whatever strange reason, even if you've never seen him do it. The sort of car the GOP might once have wished Rose Mary Woods drove.

Now what would you recommend DMC do under these circumstances? Should they hush up this discovery and cross their fingers that this problem will quite possibly never happen even once, and that if by very bad luck it should ever happen, no one will correlate it with the brake design? Or should they make every effort to warn every owner of this problem? Or should they mount a tremendously expensive recall of every vehicle?

If you think they should insist on recalling the whole line, well, automotive safety being the issue it is in today's litigious society, perhaps this is really their only financially viable option, not to mention the decent thing to do and of course great for their image.

In a less litigious world one can imagine DMC simply disclosing their understanding of the situation in full, and pointing out the very obvious fact that there are many other things you can easily do with the pedals that are far more likely to run you into a tree at 90 mph than this obscure brake problem. Here is what we know, here is what we don't know, we have hidden nothing bearing on this risk from you, decide for yourself.

But if you think DMC's optimal strategy is simply to deny that the problem could present any real risk, then Intel's handling of this inconsequential flaw that will bite you only on your 27,000th birthday will likewise seem entirely appropriate to you.
As long as Intel doesn't know for sure how badly and how often its customers can be tickled, injured (see bug6 of my FDIV papers boole.stanford.edu:/pub/FDIV/bugs for how the bug could cause whiplash), or killed by the bug, it has no business reassuring its customers that they'll never notice a thing. In the spirit of full disclosure Intel should state the uncertainty of the situation. It should point out, or at least allow for the fact, that the pessimists naturally drift to one end of the spectrum and the optimists to the other. And it should strongly resist the understandable temptation to join the most extreme optimists, and instead stick to the facts, in particular to what is actually known about just where along this very very long spectrum applications really do fall.

Exactly the same problem comes up with publicly held companies when they communicate with industry analysts. Companies are naturally wildly optimistic about their prospects, it is hard to keep going when you're not. But to communicate that optimism to the stock market through the analysts when it is sufficiently misplaced is to leave yourself wide open to actions by disgruntled stockholders, who will bring your grandiose but naive promises before the jury as witness to your malicious lies. The marketplace does not suffer cockeyed optimists gladly. My great-grandfather, member of Parliament for Collingwood and one of Victoria's most successful promoters of land speculation late last century, was one, and paid for it with six months in jail when the land boom bubble burst in 1892.

Intel should conduct its optimistic speculations in private; in public it needs to stick to the facts. Nothing less will mollify this net. Agreed, netters?

--
Vaughan Pratt              http://boole.stanford.edu/boole.html
My every word is copyright, especially those not in the dictionary.

=================================================================
-------------------------PENTIUM REPORT # bug22----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 23742 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: sci.math,sci.math.num-analysis,comp.sys.intel
Subject: Re: Pentium Chip Calulation Bug
Date: 14 Dec 1994 22:08:09 GMT
Organization: Computer Science Department, Stanford University.
Lines: 51
Distribution: inet
Message-ID: <3cnqc9$i60@Radon.Stanford.EDU>
References:
NNTP-Posting-Host: sunburn.stanford.edu
Keywords: FDIV
Xref: Radon.Stanford.EDU sci.math:47431 sci.math.num-analysis:10139 comp.sys.intel:23742

In article , Daniel A. Asimov wrote:
>Does anyone have precise information about just what class of
>arithmetic problems results in a buggy calculation?

The nature of the bug is such that divisors a tad under an integer are at far greater risk. The risk is maximized when that integer is 3 times an odd digit times a power of 2 (quantities that are 2^-n times a risky integer are equally at risk), with "tad" no more than 10^-4 and becoming more significant at lower values. The risk is yet further increased when the dividend is near an integer, but the conditions are more complex. See tables governing this phenomenon in boole.stanford.edu:/pub/FDIV/bugs, or individual.bugs/bug1 for the shorter file; see also the README.
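To make that characterization concrete, here is a minimal C sketch (my own; the form 3 * odd digit * 2^k and the 10^-4 "tad" are taken from the paragraph above, while the bound of 100 is an arbitrary choice) listing the risky integers up to 100, i.e. the integers that dangerous divisors sit just under.

#include <stdio.h>

#define LIMIT 100   /* arbitrary bound for this illustration */

int main(void)
{
    /* Risky integers have the form 3 * (odd digit) * 2^k; a divisor is
       most at risk when it falls within about 1e-4 below one of them. */
    static const int odd[] = {1, 3, 5, 7, 9};
    int risky[LIMIT + 1] = {0};
    int i, q;

    for (i = 0; i < 5; i++)
        for (q = 3 * odd[i]; q <= LIMIT; q *= 2)
            risky[q] = 1;

    for (q = 1; q <= LIMIT; q++)
        if (risky[q])
            printf("divisors just under %d are at risk\n", q);
    return 0;
}

The divisors 15, 54 and 48 of the earlier small examples (5/15, 9/54, 7/48) all show up on this list.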
The above will give an excellent idea of error rates, which can exceed one in a thousand for randomly chosen "small bruised integers," defined as numbers of the form n-k where k is some fixed negative power of 10 (or any other radix, 10 is not special here), e.g. 10^-6.

While it does not bear on your specific question about rates, the following example I contrived just now has a certain appeal.

A useful pair of statistics to keep on a township composed of sorts A and B is the size s=a+b of the total population and the proportion p=a/(a+b) of members that are of sort A. From this, other questions such as the proportion of sort B, the ratio of the two sorts, the number of people of sort A, and the number of sort B, can all be determined. In particular we can reckon a as sp.

Now suppose this useful technique is applied in a large city having a=5,505,021 and b=3,932,159. For our records we compute s=9,437,180 and p=0.583333263, sufficient precision for the expected questions about this population. We may later recover a as the product of these two, namely sp=5505020.9746, which we round to the nearest integer, which unsurprisingly is our original a.

But if we do our record keeping on a Pentium, when we come to compute p it turns out to be p=0.583328176. Later when we go to retrieve a as sp we find it to be 5504972.996, clearly the integer 5,504,973. But now 48 people have disappeared! (A short replay of this round trip in C appears below.)

The Pentium is a blazingly fast computer, and can easily make 20 million people a second disappear. It is also priced right for quantity purchases, permitting even greater rates. As supplies of the buggy Pentium dry up next year, expect their price to rise sharply as certain countries anxious to please the UN about their human rights compliance snap them up for their bookkeeping.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
My every word is copyright, especially those not in the dictionary.

=================================================================
-------------------------PENTIUM REPORT # bug23----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 23755 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Ne wYork Times does better than before covering Pentium problem
Date: 14 Dec 1994 22:41:30 GMT
Organization: Computer Science Department, Stanford University.
Lines: 37
Message-ID: <3cnsaq$mjs@Radon.Stanford.EDU>
References:
NNTP-Posting-Host: sunburn.stanford.edu

In article , Carl Oppedahl wrote:
>
>corporations buying machines for use within the company.  It identifies a
>Vaughan Pratt, a Stanford computer scientist who has conducted his own
>analysis of the frequency of errors occurring in Pentium calculations and who
>has found it to be "significantly higher" than what Intel has reported.

I hope the "be" is not what the article actually said. I have never made a stronger statement than "could be." Intel is the one making the strong statement when they say it can't be, for average users. Not enough is known to support Intel's position; in fact the strange nature of the bug tends ironically to put average users *more* at risk than at least some scientific users. *Quite* enough is known to support my position, which is in full agreement with IBM's statement.
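Before leaving bug22's vanishing-citizens example above, here is the promised minimal replay of the round trip in C. The code is my own; the value 0.583328176 is the flawed quotient quoted in the post and must be hard-coded, since a correct FPU cannot reproduce it.

    /* Sketch (mine) of the bug22 round trip: store (s, p), later
       recover a as round(s*p).  A correct FPU recovers a exactly; the
       hard-coded Pentium quotient quoted in the post loses 48. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double a = 5505021.0, b = 3932159.0;
        double s = a + b;                /* 9437180 */
        double p = a / s;                /* correct quotient */
        double pentium_p = 0.583328176;  /* flawed quotient, from the post */

        printf("correct: %.4f -> %.0f\n", s * p, floor(s * p + 0.5));
        printf("flawed:  %.4f -> %.0f\n", s * pentium_p,
               floor(s * pentium_p + 0.5));
        return 0;
    }

The first line recovers 5505021; the second prints approximately 5504972.996 and rounds to 5504973, the reported loss of 48 souls.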
In case the media are oversimplifying things for their readers, it should be made clear that there are two different (in fact dual) kinds of noise involved in these statistics: the distribution of operands in a given class of application, and our uncertainty about those distributions, which one might imagine formalizing as a distribution itself, namely of distributions. This is like disjunctive normal form: each disjunct, as a conjunction of atoms, corresponds to a distribution of values, and the disjunction of those conjunctions corresponds to a distribution of distributions. This is the sense in which the two kinds of noise are dual: it is the De Morgan duality of conjunction and disjunction, but for distributions in place of simple logical connectives. These ideas about nested distributions permeate both quantum mechanics (mixed states consisting of distributions of pure states which are complex-fuzzy conjunctions of possibly contradictory assertions such as Schrödinger's_cat_is_dead and Schrödinger's_cat_is_alive) and game theory, important in both economics and computer science.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
My every word is copyright, especially those not in the dictionary.

=================================================================
-------------------------PENTIUM REPORT # bug24----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 24025 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Dataquest Report: The Great Pentium Fire Drill
Date: 15 Dec 1994 10:37:59 GMT
Organization: Computer Science Department, Stanford University.
Lines: 228
Message-ID: <3cp6a7$q5n@Radon.Stanford.EDU>
References: <3cnceb$fms@newsbf01.news.aol.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article , Duncan Murdoch wrote:
>That was an interesting article, but it contained one very substantial error:

I found this article inaccurate, biased rubbish from start to finish. The one thing I did not find offensive in this regard was the article's insistence on the economic havoc a total recall would wreak. This may well be the case, though it is far beyond my competence to judge. What I *have* been recommending is not full recall but full disclosure, which I take to include the acknowledgement that the range of error rates is far wider than Intel, Dataquest, and some other wishful thinkers are allowing. A total recall sounds horrendously expensive and is not the sort of thing that one can productively encourage without a lot better understanding of the tradeoffs than I could muster. Full disclosure is not at all expensive, quite the opposite.

First, full disclosure would restore the confidence of us Intel customers in Intel's forthrightness in its dealings with us. Second, it might save some unfortunate user from a watery grave or a repeated experiment. Both are very important.

How Intel imagines it is benefitting by refusing to even look in the direction of the high error rates spelled out in the IBM report and my posts is totally beyond my comprehension. Is Intel concerned that the possibility of such rates will turn meekly accepting customers into angry users all demanding new chips?
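(An aside connecting bug23's two kinds of noise to concrete rates. This is my own formalization, not anything in the Intel or IBM studies: if applications of class i arise with weight w_i, and class i produces an FDIV error on a random division with probability r_i, then the population-wide rate is the mixture

    overall error rate = SUM_i w_i * r_i,    with SUM_i w_i = 1.

A class with r_i around 10^-3, weighted at a mere w_i = 10^-5, still contributes a hundred times more than "random data" at w_i near 1 with r_i around 10^-10. This is why the high-rate end of the spectrum cannot simply be waved away.)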
How is hiding its head in the sand about the very possibility of high error rates helping Intel when it is obvious to anyone with half an eye that they are perfectly possible? (Whether they are likely in say Lotus-123 is another matter.) Who does Intel think it's kidding?

Here are my detailed comments bearing out my above analysis of this unfortunate article's merits, or lack thereof.

    The errors are extremely small--in many cases, insignificant--and occur in approximately one out of every billion calculations involving a reciprocal.

Wrong. 1/16,000 is an extremely small error for a bathroom scale; it is an extremely large error for even a $3 calculator, let alone a $3,000 computer.

    For the set of all divisors and dividends, we accept Intel's claim of one failure in every nine billion pairs.

Qualification: "All" here refers to double precision. For the set of all single-precision divisors and dividends, which the authors seem to be assuming a little later, the failure rate is less than one in forty billion pairs.

    The most famous calculation, now known as Coe's Ratio, and undoubtedly the most common single division performed on the Pentium, is 4195835/3145727.

Strong competition: 1/1, 0/1.

    Internet correspondence indicates that this is about the largest error obtainable, and that there are approximately 1,738 unique pairs that can generate an error this big.

Wrong. This is the approximate number, depending only on how you judge "maximum error," not on any other uncertainty, of single precision pairs generating an error of any magnitude greater than allowed by the IEEE standard for single-precision results. The number of error pairs generating the maximal error depends on the precision of the operands. For single precision operands it is 67 (depending on exactly how you define "maximal error"); for double precision operands it is on the order of 2 * 10^19.

    It is extremely difficult to contrive a situation in which the divide error could manifest itself in a way that is material to an end result.

Wrong. It took me 10 minutes of looking at Coe's table to come up with a simple and plausible situation producing 10^-5 errors at very high rates. See bug1, my first post on this subject.

    This is because the error is normally very small,

Define "normally."

    occurs at a rate that is extremely rare,

Wrong: it occurs at a rate that depends on the application, and can vary from not at all (which is even less than extremely rare) to as fast as a thousand times a second. This is a fundamental point that Intel refuses to acknowledge. This is unconscionable.

    and should have no effect on the huge majority of Pentium users.

Agreed, provided only that "huge" is deleted, which without more data is just more of the wishful thinking pervading this whole miserable article.

    The only class of user likely to be affected is that of mathematicians working in the field of number theory, the class of user from which the bug was originally reported.

This inference is based on a population of one. Number theorists using only integer arithmetic, the arithmetic of numbers, will be unaffected altogether.

    IBM makes the point that financial users are at some risk.

The facts of the IBM report are unquestionable, leaving the interpretation of those facts as the only possible point of rebuttal. Interpreting those facts as saying that financial users are at *some* risk cannot possibly be considered controversial.
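Coe's Ratio above deserves its one-line check. This is the classic residual test, given here as my own sketch; the 256 is the widely reported flawed-Pentium residual, which a correct FPU replaces by an exact 0.

    /* The classic check on Coe's pair: x - (x/y)*y is exactly 0 on a
       correct IEEE double FPU; a flawed Pentium famously returns 256,
       a quotient error near 2^-14, i.e. the 1/16,000 cited above. */
    #include <stdio.h>

    int main(void)
    {
        double x = 4195835.0, y = 3145727.0;
        printf("%.17g\n", x / y);           /* correct: 1.33382044913624... */
        printf("%g\n", x - (x / y) * y);    /* 0 if correct, 256 if flawed */
        return 0;
    }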
The debate is over the extent of that risk, but IBM has chosen not to address that issue, leaving no grounds on which they can be faulted.

    Clearly, users of a Pentium-based system designing devices where a divide error could conceivably be life-threatening--for example, a bridge, a ship, or a plane--should consider an upgraded part even though the error is unlikely to occur and, if it does, is effectively certain to be insignificant. Much engineering design work involves iterative calculations that work toward a solution. In these cases, the likelihood of the bug occurring is increased, but its effect is totally negated by the subsequent iterative calculation.

"Totally" is not appropriate here. The presumed model is that every individual error must decay after further computation. I would expect that even in this iterative situation a single error *could* ruin your whole day, although I am also comfortable believing it may very likely decay as claimed. I would be even more comfortable hearing all this from an expert.

    If the flaw were to somehow cause a nonconvergence, the process would be started over with a new set of initial conditions, and the error would be eliminated.

While I find this quite plausible, I would again defer to an expert in this area. There is a great variety of types of iterative calculation, and moreover some iterative processes can be sped up with noniterative algorithms, making arguments about the advantages of iterative algorithms for buggy processors specious.

    Financial users could conceivably be affected by the bug. However, the size of the error is again likely to be immaterial, representing hundreds of dollars in a million-dollar transaction.

Ingenuous in the extreme. Transactions depend critically on maintaining sufficient accuracy to avoid diluting tight profit margins. Is Dataquest offering to make up the difference? And why this insistence on using buggy computers for transactions involving millions of dollars? What exactly is the drawback to using reliable computers? Does reliability command that big a premium?

    Although larger transactions could be affected, no financial institution would allow a transaction to take place without some level of cross-checking. The cross-check process would inevitably surface the error, and the rarity of the bug would prevent compensating errors from occurring or multiple errors from compounding.

Upping the ante merely amplifies my previous remarks. What accountant who knows the meaning of the term "job security" would allow a buggy computer remotely near any transaction of this magnitude?

    In summary, the low level of the error relative to real-world situations, coupled with the extreme rarity of the bug, guarantees that even calculations involving lives and money are safe on a Pentium.

As opposed to on what? This is a quotable quote for Newsweek. Earlier Dataquest stated that doing this sort of thing was "clearly" not on; now they seem to be taking that back. In so doing Dataquest is going out on *really* thin ice.

    Numerologists and some extremely theoretical physicists, however, stand at risk of producing bad results and should request replacements.

Speak for yourself.

    intensive numerical calculations that fill memory with intermediate values could turn up an occasional incorrect answer once every few hundreds of machine years. We believe that this class of error is as significant or more so than the Pentium divide bug, as it occurs more frequently and can result in a much larger change in the number.
What about applications with high error rates, like one a day or a minute or a second? Oh, right, these don't exist.

    Having established that the bug itself is immaterial (and we would be happy to talk about this at great length--it is extremely difficult to contrive a real-world calculation in which it matters),

Define real-world. If I contrive one, how do I know you won't disqualify it on the ground that contrived calculations are by definition not real-world? Provided there's no such catch-22, let me contrive the following, taken from bug22, which is as real-world as a contrived example can get.

Santa Clara county has 5,505,021 rock lovers and 3,932,159 jazz lovers. (Not the real numbers, as Dave Barry would say, but Santa Clara is a real place, I live there.) We record all this in our database in the form of the total population 9,437,180 of such music lovers, along with the proportion that love rock, namely 0.583333263, these two parameters being the most frequently requested. To retrieve the number of rock lovers from our database, a less frequent request but still of interest, we just multiply 9,437,180*0.583333263 = 5505020.9746, which we then round to the nearest integer. Hey, this works great.

All well and good had we done this on a true-math computer. On a Pentium the key difference is that when we compute the proportion that love rock, the Pentium pegs it at 0.583328176 (ouch!). When later on we go to look up the number of rock lovers we get 9,437,180*0.583328176 = 5504972.996, which when rounded yields an answer 48 below the correct number. 48 rock lovers have disappeared from Santa Clara. There is a plot afoot.

And you say this calculation doesn't matter. Define "matter."

...

    Why a Total Recall Is Dangerous to the Industry

    Several writers have suggested the idea of a total recall of the part. We at Dataquest believe that this is an imprudent and irresponsible suggestion;

Probably right, but this is getting outside my sphere of competence so I'll pass on the rest of this amateurishly written article.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
My every word is copyright, especially those not in the dictionary.

=================================================================
-------------------------PENTIUM REPORT # bug25----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Re: Ne wYork Times does better than before covering Pentium problem
Summary:
Expires:
References: <3cnsaq$mjs@Radon.Stanford.EDU> <3cprkd$idc@newstand.syr.edu>
Sender:
Followup-To:
Distribution:
Organization: Computer Science Department, Stanford University.
Keywords:
Cc:

Readers of comp.sys.intel might be so good as to forward this message to those who should see it but don't have the time to wade through the incredible volume. It corrects a misattribution to me in the New York Times, which is read by many people in a position to understand the technicalities of the error but who are in no position to read comp.sys.intel.

In article <3cprkd$idc@newstand.syr.edu>, Mark Levitt wrote:
>Vaughan R. Pratt (pratt@Sunburn.Stanford.EDU) wrote:
>>I hope the "be" is not what the article actually said.  I have never
>>made a stronger statement than "could be."
>
> Sorry, the times are says "and found it to be significantly higher..."

I took this up with the reporter, John Markoff, who had interviewed me for this article.
He understood exactly the problem, acknowledged the error, and apologized profusely. I *really* wish he had said "could be" instead of "to be" here. While this distinction may be lost on the average reader, more technically qualified readers will see it and wonder.

This error was greatly compounded by an unfortunate "transition," as editors call it, that was added to Markoff's article by an editor and which Markoff did not subsequently catch. The objectionable phrase was "Yet some other scientists said that it was difficult to predict average error rates," in a context that implied that both IBM (in their Pentium study) and I were saying something else. Yet my whole interview with Markoff had dwelt on just how difficult any such prediction was. The existence of high error rate applications that Intel to date has refused point blank to acknowledge makes such predictions much harder than Intel is claiming. Intel is going out on a limb in its strenuous efforts to persuade customers that the range of possible error rates never reaches a point that would cause anyone any concern, allowing them to make a concrete prediction that there is no serious problem.

If the complexities of this situation are this far beyond Intel's management to grasp, think what it must be like for the poor customer trying to make sense of the situation. Intel says there's no risk, IBM says there's some risk. Who do you believe?

I have not made the sort of prediction that Intel is making, and I see nowhere in IBM's Pentium study where they do either. All that I claim is that there is an astronomically wide range of error rates, and that perfectly plausible applications exist at all points along this spectrum, from an error rate of one a second (hundreds a second for slightly less plausible ones) to the extreme low of no errors at all. Oddly, Intel neglects *both* these extremes. It seems to me that it is very hard to defend Intel's position that all real-world applications must be at the low-rate end of the range. And it is completely impossible to defend their failure to even *mention* that the high-rate end of the spectrum exists, other than in terms of running to stand under a meteor as Andy Grove puts it (where the error rate is of course *millions* a second, just keep dividing with Coe's pair). This shoe is on the other foot: Intel is running very hard to stand under the Pentium division meteor, and might even get there if they don't turn around soon.

> However, I agree that this is the most accurate and balanced coverage I
>have seen to date.

Indeed. Markoff says he and the editor worked for an hour to get the article as a whole right; they've done a super job with this, and he deeply regrets that there were still two fairly basic errors left in that little part of the article when it was done.

The article does unfortunately leave the impression that there may be three points of view: Intel's, IBM's and mine, and "other scientists" (whom the article leaves unnamed). I hope this posting corrects this by establishing that there are only two points of view: Intel's, that the average user need have no concern about high error rates, and the opposing view, that it is very hard to make such a prediction in the light of the very plausible high rate scenarios that can arise even in simple financial transactions of the kind detailed by the IBM study.
Hopefully future interviews will average out this sort of noise, which as Markoff pointed out is inevitable for any operation under time pressure, such as getting fast microprocessors to market. In this case NYT's short burst of noise here is not to be compared to the ongoing noise being generated today, both inside Pentiums and in the media, by the missing five entries in the FPU's P-D plot!

> Just my $0.199935454 (Kudos to the originator of this joke...)

On Intel's behalf I would like to state that, whatever its other flaws, the Pentium does not move the decimal point one place to the left. :)

I would also like to state on Intel's behalf that, except for the bug, the Pentium FPU does a gorgeous job on the low order 11 bits in the internal 80-bit (64 bits of mantissa) representation of the FPU, much better than on some other FPU's I've seen. I am very much looking forward to extraordinarily accurate computations with the replacement chip they promised early last week to send me. I've never had this much floating point accuracy before, and I am confident that the occasional glitches I used to get at single-precision with my circle-tracking software for my font-capture program, whose glitch rate was greatly reduced by passing to double-precision (back in 1987, long before the Pentium existed), will be essentially eliminated altogether on the repaired Pentium and any other processors that handle with kid gloves those 11 low-order bits hiding out there in the FPU below the 52nd bit of the visible mantissa.

=================================================================
-------------------------PENTIUM REPORT # bug26----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 24514 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: FDIV Myths (with additions, re-posted by popular demand)
Date: 16 Dec 1994 09:36:31 GMT
Organization: Computer Science Department, Stanford University.
Lines: 134
Message-ID: <3crn2v$i75@Radon.Stanford.EDU>
References:
NNTP-Posting-Host: sunburn.stanford.edu

In article , Peter McGavin wrote:

This is excellent work. Just a few small cavils:

>Myth: The flaw affects the precision of the result only after the 9th
>decimal place.
>
>Answer: No, operands can be scaled so that the error occurs in the
>tens place or the millions place or any other place.  It is much
>better to consider significant digits.  The largest possible errors
>occur in about the 4th or 5th significant digit, but most errors are
>much smaller than that.
 ^^^^^^^^^^^^

The "much smaller" in this last clause is open to misinterpretation. As Intel has pointed out in their white paper, released to their WWW site on Tuesday, "Statistical measurements using over a trillion test points indicate that the inaccuracy is equally likely to manifest itself in bit positions 12 through 52 to the right of the binary point." Another way of saying this is that the errors are uniformly distributed by their exponents. This would imply that 1/41 of the errors have their leading bit in the 12th position (13th position if you are using 1-origin indexing for the true mantissa, as I did in my table), 1/41 in the 13th position, and so on down the line.
I undertook to investigate this phenomenon myself before the white paper was released (I saw it for the first time yesterday, Wednesday). I posted my preliminary conclusions late Tuesday night, available as boole.stanford.edu:/pub/FDIV/individual.bugs/bug20. I observed the same basic uniformity in error magnitude distribution noted by Intel. However I have some minor concerns with their results, ranked here in decreasing order of importance in my estimation.

(The numbering system followed below numbers the bits of the mantissa from 1 (most significant) to either 23 or 52 (least significant) depending on whether the operand is single or double precision.)

(i) Since the largest observed error is known to be 2^-14, for Intel to be observing errors in bit 12 suggests that their definition of where the first bit is in error is one that would make 1000000 different from 0111111 in bit 1. Readers may draw their own conclusions about the usefulness of that measure. Instead I *subtracted* the two quantities and used the position of the leading bit of that difference. (Normalization after the subtraction moves that bit up, but one can easily read off the old bit position from the now-decreased exponent.) Applying this criterion to the above example, one gets the difference at bit 7 instead of bit 1, a more faithful reflection of the actual difference. In this way Intel would appear to have overstated the mean magnitude of the error by a factor of perhaps 2, i.e. one bit position (would someone please do the exact statistics here?), and greatly increased its variance (modulo the "buzz" observed below).

(ii) Instead of performing statistical measurements as an indicator, I used exhaustive search to count the exact number of single precision operands giving rise to each magnitude of error as defined above. Thus my numbers are not statistical but exact counts. As such they are exact predictors of what a statistical search, in the space of single-precision pairs, would come up with in the limit assuming a perfect random number generator.

Here are these numbers for the first 10 nonzero bit positions, namely positions 14 through 23 inclusive (the possible error positions in the single-precision mantissa). (The numbers I posted on Tuesday were from a quick search; these are from an exhaustive search that ended Thursday morning, which turned up an additional 7%, not 2.5% as I had predicted. This complete search also recorded in a file (all1precerrs on boole.stanford.edu as above) the set of actual erroneous pairs themselves, so one does not need to repeat this search ever again, at least for single-precision operands.)

67 108 170 150 276 138 259 135 283 152

This starts out a little low and takes about three bit positions to ramp up to a steadily alternating ("buzzing") value, plus a little white noise. These numbers continue with the 29 bits numbered 24 through 52 as follows.

262 163 300 140 279 147 280 159 274 158 272 145 250 166 257 148 250 149 256 163 278 152 268 152 256 147 259 150 245

This shows that the uniform distribution of log magnitudes of errors continues down into double precision. Intel did not investigate whether the distribution remained uniform for the 11 bits of extended precision beyond bit 52. Here is my data for positions 53 to 62 inclusive. (Position 64, the LSB, is meaningless, being where normal truncation errors accumulate.)

161 265 153 259 154 236 169 258 149 248

This shows that the bug extends into the 11-bit region, with no essential difference from the rest of the mantissa.
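For concreteness, here is my own reconstruction in C of the error measure just described: the position, counted in x's stored mantissa (bit 1 being the first bit below the leading, hidden bit), of the leading bit of the residual x - (x/y)*y. On a correct FPU the residual for Coe's pair is 0; feeding in the flawed residual of 256 by hand registers at position 14, the largest error in the tables above.

    /* Sketch (my reconstruction) of the residual-based error measure:
       position of the leading bit of r = x - (x/y)*y, with bit 1 the
       first mantissa bit below x's leading bit. */
    #include <stdio.h>
    #include <math.h>

    static int bitpos(double x, double r)
    {
        int ex, er;
        if (r == 0.0) return 0;       /* no error at all */
        frexp(fabs(x), &ex);          /* x = mx * 2^ex, mx in [0.5,1) */
        frexp(fabs(r), &er);
        return ex - er;
    }

    int main(void)
    {
        double x = 4195835.0, y = 3145727.0;
        printf("%d\n", bitpos(x, x - (x / y) * y)); /* 0: correct FPU     */
        printf("%d\n", bitpos(x, 256.0));           /* 14: flawed residual */
        return 0;
    }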
>Myth: Approximately 1800 different operand pairs to FDIV produce the
>wrong answer.
>
>Answer: No, that is only the number of unique pairs of
>single-precision mantissas.

To be more precise, 1738 such pairs yield single-precision errors, defined as errors down to bit 23, the least significant single-precision bit, while including bit 24 makes this 2000 exactly. This is obtained by adding up bit positions 14 through 23 inclusive. For various definitions of single-precision error, the number of errors meeting this criterion will lie between 1738 and 2000. Adding up positions 14 through 52, namely 7863, gives the number of double-precision errors for single-precision operands, again modulo the exact definition of double-precision error.

>The total number of double-precision
>pairs is vastly larger and is probably impossible to count exactly.

We'll see (about getting the exact count).

>Myth: IBM says a typical spreadsheet user will see an error every 24
>days.
>
>Answer: No, the press release says "as often as once every 24 days".

A subtle but very important distinction. I was similarly misquoted in Wednesday's New York Times as saying the error *was* high when I said it *could* be high. For these distinctions to be lost on the average Pentium customer makes computer-buying a relatively risky enterprise.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
My every word is copyright, especially those not in the dictionary.

=================================================================
-------------------------PENTIUM REPORT # bug27----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 24522 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: sci.math.num-analysis,comp.arch.arithmetic,comp.sys.intel
Subject: Re: Variance of FDIV error
Date: 16 Dec 1994 10:00:09 GMT
Organization: Computer Science Department, Stanford University.
Lines: 34
Distribution: inet
Message-ID: <3crof9$ior@Radon.Stanford.EDU>
References:
NNTP-Posting-Host: sunburn.stanford.edu
Keywords: bug
Xref: Radon.Stanford.EDU sci.math.num-analysis:10190 comp.arch.arithmetic:819 comp.sys.intel:24522

In article , Joe Keane wrote:
>About the Pentium FDIV bug, there is of course a tradeoff between frequency
>and magnitude of errors.  There's a certain chance that any error occurs, and
>a smaller chance that an error of some given magnitude occurs.  We can't
>immediately say which is worse, a bigger error or a more frequent one.
>
>I figure that a reasonable way to combine these characteristics is to compute
>the variance of the error, and i was wondering if anyone has tried to do this.

This was already answered implicitly on this list when I posted a within-7% estimate of the actual distribution on Tuesday night. Here is the exact number of errors of each magnitude (leading bit position of x-(x/y)*y) for the 2^46 pairs of single-precision mantissas.

bits 14 to 23 (single precision)
67 108 170 150 276 138 259 135 283 152

bits 24 to 52 (double precision)
262 163 300 140 279 147 280 159 274 158 272 145 250 166 257 148 250 149 256 163 278 152 268 152 256 147 259 150 245

bits 53 to 62 (extended precision)
161 265 153 259 154 236 169 258 149 248

Except for a period-2 "buzz" which won't affect central moments below the 25th or so, this is effectively a rectangular distribution.
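The totals quoted in bug26 above (1738 through bit 23, 2000 through bit 24, 7863 through bit 52) can be recovered mechanically from this table. A throwaway summing program, mine and included only as a cross-check of the transcription:

    /* Throwaway cross-check (mine): sum the published per-position
       counts to reproduce the totals 1738, 2000 and 7863. */
    #include <stdio.h>

    int main(void)
    {
        const int counts[] = {
            /* bits 14-23 */ 67, 108, 170, 150, 276, 138, 259, 135, 283, 152,
            /* bits 24-52 */ 262, 163, 300, 140, 279, 147, 280, 159, 274, 158,
                             272, 145, 250, 166, 257, 148, 250, 149, 256, 163,
                             278, 152, 268, 152, 256, 147, 259, 150, 245
        };
        int sum = 0;
        for (int bit = 14; bit <= 52; bit++) {
            sum += counts[bit - 14];
            if (bit == 23 || bit == 24 || bit == 52)
                printf("through bit %d: %d\n", bit, sum);
        }
        return 0;
    }

This prints 1738, 2000 and 7863 respectively.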
The variance of this density as a function of ln e, the *log* of the error e, is that of a rectangular distribution, namely w^2/12 for a rectangle of width w (w/sqrt(12) if you want the standard deviation). The density as a function of the error e itself is then a hyperbola proportional to 1/e on the range [a,b] of possible errors, whose second moment works out to (b^2 - a^2)/(2 ln(b/a)), so the variance is dominated by the largest errors. Someone might check this more carefully than I can this late at night.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
My every word is copyright, especially those not in the dictionary.

=================================================================
-------------------------PENTIUM REPORT # bug28----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15743 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch
Subject: Re: TECHNICAL (Pentium): Exact number of FDIV error-pair FLOATs (was: Bad section of the noisy-integer table)
Date: 16 Dec 1994 10:11:16 GMT
Organization: Computer Science Department, Stanford University.
Lines: 18
Message-ID: <3crp44$j1p@Radon.Stanford.EDU>
References: <874@moene.indiv.nluug.nl>
NNTP-Posting-Host: sunburn.stanford.edu

In article <874@moene.indiv.nluug.nl>, Toon Moene wrote:
>Besides, to be really able to assess the effects of this particular bug, I
>have to have the 1281 triples of (dividend, divisor, Pentium-quotient) in
>a look-up table in my code, to do the right :-) thing.  I haven't seen this
>list up till now.

This is now available (expanded to catch the few additional pairs overlooked by the quick search, and extended down to all precisions of errors) in the file all1precerrs in /pub/FDIV on boole.stanford.edu. The errors are collected under headings, one for each magnitude (position of leading bit) of error. Only operands in [2^23,2^24) are listed; all other buggy pairs of single-precision operands are these with arbitrary exponents.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
My every word is copyright, especially those not in the dictionary.

=================================================================
-------------------------PENTIUM REPORT # bug29----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 24965 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: SAS Institu's Position
Date: 17 Dec 1994 15:41:17 GMT
Organization: Computer Science Department, Stanford University.
Lines: 41
Message-ID: <3cv0qt$pca@Radon.Stanford.EDU>
References: <3cs3ug$b8d@news.cloud9.net>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3cs3ug$b8d@news.cloud9.net>, Andrew Beveridge wrote:
>According to Pratt's table, if the amount of
>bruising is less than 1E-12, the result is accurate to at least
>10 significant digits.

This is only true for integer operands less than 1000, the range covered by my table. Dividing the integer 4194687 by 3 bruised in the least significant bit, an amount of bruising less than 1E-15, yields an error of 1E-5. Ramesh Agarwal has given two simple ways that such a very slightly bruised 3 can arise: as .3/.1, and as 4.1-1.1.
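Both of Agarwal's productions are easy to watch in action on any IEEE machine, Pentium or not; the bruising is real even where the subsequent division is done correctly. A minimal demonstration, my own sketch:

    /* Sketch (mine): two innocent-looking ways to manufacture a
       slightly bruised 3, printed to 17 digits.  Any IEEE double FPU
       shows the bruise; only a flawed Pentium then turns the division
       by it into a 1E-5 error. */
    #include <stdio.h>

    int main(void)
    {
        double b1 = 0.3 / 0.1;   /* 2.9999999999999996 */
        double b2 = 4.1 - 1.1;   /* 2.9999999999999996 */
        printf("%.17g\n%.17g\n", b1, b2);
        printf("%.17g\n", 4194687.0 / b1);  /* correct FPU: ~1398229.0000000002 */
        return 0;
    }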
In light of item 2 below it should be pointed out that these are not the only ways.

>If you have a DATA step division
>involving a value that you know should be a small integer in
>exact arithmetic but that may have been bruised, you can use the
>ROUND function to make the value an exact integer.

1. "Small" can be omitted from this rule, which is sound for all positive integers and for bruises < 0.5.

2. This would be just one rule of thousands for the production of "Pentium-safe" software. For example if you know the value should have been a multiple of 1/2 but may be slightly bruised, then you should round to the nearest 1/2; all the same reasoning applies.

3. This method only works if you know *for sure* that it is an integer; if it is merely an integer most of the time then the method is inapplicable, yet you still remain just as vulnerable to the bruised integer problem.

The right thing to do for *any* FDIV, whether or not you know the divisor should be an integer, is to invoke the Mathworks Pentium-safe division routine. Don't waste huge amounts of time messing with huge amounts of code trying to catch huge numbers of obscure special cases like the above.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
My every word is copyright, especially those not in the dictionary.

=================================================================
-------------------------PENTIUM REPORT # bug30----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 25106 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,sci.math.num-analysis
Subject: News media deja vu: IBM misquoted again
Date: 18 Dec 1994 05:57:49 GMT
Organization: Computer Science Department, Stanford University.
Lines: 172
Distribution: inet
Message-ID: <3d0j0t$n7r@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:25106 comp.arch:15785 sci.math.num-analysis:10228

====================================================================
Readers of comp.sys.intel might be so good as to forward this message to those who should see it but lack the time to cope with the extraordinary volume of correspondence. It corrects a seemingly slight yet significant misrepresentation of the IBM study in the San Jose Mercury News, which is read by many people in a position to understand the technicalities of the error but who are in no position to read comp.sys.intel.
====================================================================

In the business section of Wednesday's New York Times I was quoted as having found the frequency of errors in the Pentium chip *to be* significantly higher than what Intel has reported. I corrected this misattribution at the time in this forum: *could be* was the strongest form of this statement I would stand by. See bug25 on boole.stanford.edu:/pub/FDIV/individual.bugs for further details.

Today's San Jose Mercury News (12/17, page 13C) repeats this error, this time for IBM, reporting that "IBM said that the typical spreadsheet users *would* see an error every 24 days" (my italics). Had IBM indeed said such a thing, one could understand the contrast drawn here with Prof.
William Kahan's complaint in the same article, "I wish people would stop trying to estimate probabilities based on incomplete information about what is a typical user." I am in full agreement with the sentiment of this wish.

As with the NYT article, I am quoted in the SJMN article as fully agreeing with IBM's conclusion, as I certainly do. I would have no problem with being linked to IBM in this way were it not in the context of these repeated misrepresentations of IBM's conclusion.

By way of clarifying how it could be that I find myself in any agreement at all with both IBM and Prof. Kahan when the papers appear to be reporting them at odds with each other, I would like to draw attention once again to what it was that IBM actually said in their Pentium study, *especially* their conclusion, which clearly refutes the Mercury's interpretation. As I've said a number of times, both in this forum and to reporters, we do not know where the great bulk of applications lie on the rate spectrum. An excellent quote in this context is Hennessy and Patterson's "Fallacy: there is such a thing as a typical program" ("Computer Architecture", p. 183).

The low-rate end of the Pentium's error-rate spectrum has been estimated by Intel at one error in 9 billion divisions for random double-precision data, an estimate not in dispute. (The yet lower rate of no errors at all is an even more welcome possibility for applications doing divisions with suitably constrained data, e.g. small integers.) There is no controversy here; the controversy arises when Intel makes the considerably stronger statement that this is where one can expect to find the typical applications.

IBM's study differs from Intel's in the following two key respects. First, IBM demonstrates that the error-rate spectrum is a lot wider than contemplated by Intel, by showing the existence of high-rate scenarios of a rather more plausible kind than repeatedly performing 4195835/3145727. Intel in contrast does not raise the question of whether anywhere near such a high rate might be achieved in any practical situation. Second, IBM makes no claim as to where along this spectrum any given application might happen to reside. Intel in contrast takes a definite stand on the low probability of the typical spreadsheet user encountering the bug once, not just in his own lifetime but in several hundred lifetimes.

The closest statement IBM makes about actually experiencing any given rate is their statement that a user *could* make a mistake every 24 days. Both this statement and its context make clear that it is a purely hypothetical statement based on clearly stated assumptions which, when met, fully and rigorously justify the claimed rate. No suggestion is made that these assumptions are actually satisfied in any existing situations. Readers must draw their own conclusions about actual likelihoods of encountering such scenarios. Evidently the Mercury article has drawn its own conclusion, whether or not warranted, and then drawn the additional unwarranted conclusion that IBM must have done the same.

Even had IBM's reference to 24 days inadvertently led some readers astray, the conclusion of the study should dispel all doubt as to the basic message of the report. The final two sentences of the brief conclusion read as follows. "In reality, a user could face either significantly more errors or no errors at all. If an error occurs, it will first appear in the fifth or higher significant digit and it may have no effect or it may have catastrophic effects."
Nowhere in its study does IBM make a claim substantially exceeding this clear summary.

==========

A possible small objection to IBM's study is that it does not explore very far the range of dependencies on some of the assumptions made in their hypothetical scenarios, such as the assumption in some examples that operands have two decimal digits. While this is a reasonable assumption for monetary amounts represented in floating point, one would like some sense of whether this is an extreme case or whether other precisions are equally afflicted.

A possible objection to my own studies of small bruised integers is that it is not clear to the typical spreadsheet user what inference to draw for typical calculations. When one knows the incidence of both integers and bruising in one's spreadsheet, it is straightforward to take my raw error rates for a steady stream of small bruised integers and dilute them according to the proportion of small bruised integers encountered in one's own daily use. But thanks to roundoff when displaying data to only 6 or 8 decimal digits, bruising tends to disappear like Mr. Snuffleupagus or an electron when you try to look directly at it. Hence most users remain blissfully unaware of the extent to which their data is subject to bruising. The typical user's reaction to the observation that the difference between 4.1 and 1.1 is not exactly 3 but a slightly bruised 3 is one of incredulity; it took me half an hour to convince a Byte reporter of the reality of this phenomenon (we tracked every bit). This makes the *practical* significance of my tables quite unclear to most spreadsheet users.

To address these objections I plan to write a short program based on relatively unobjectionable features of these studies to permit people to determine for themselves the rate, seriousness, and cumulative effect of errors for the evaluation of simple expressions involving division. Unlike my study, there is no subtraction of small quantities to force integers to be bruised, and unlike the IBM study the experimenter gets to choose for herself the distribution of operands by both size and precision. Bruising still happens, but now only to the same extent as in routine spreadsheet calculations, as a result of arithmetic done with operands prior to their arrival at the division operation. My program limits this arithmetic to a single addition of nonnegative reals, a particularly common operation performed throughout typical spreadsheets. The omission of both subtraction and negative operands removes any concern about the possibility of mistaking the implicit numerical instability of x/(ya-yb) for a large Pentium error (and besides, what substance is there in negative numbers? :).

I will release the C program as source; it will be quite short and hopefully reasonably readable, which should leave no doubt as to what is going on behind the scenes when running experiments. The limitation to a single addition for each operand of division implies that the expression to be analyzed is of the form (xa+xb)/(ya+yb), with xb and yb optional. This severe restriction of the language greatly increases the chances of sampling a fair cross-section of the available expressions; when one complicates the language with more constructs, the resulting combinatorial explosion in variety of expressions makes it far harder to tell whether any given set of expressions is a fair cross-section, and the results of any feasible number of experiments then have a less clear significance.
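While waiting for the real program, here is roughly the shape such an experiment takes, in sketch form. Nothing here is the promised program itself, and the operand sizes and precisions are arbitrary assumptions of mine: it forms denominators ya+yb from random nonnegative two-decimal-digit amounts and counts how often the sum lands in the bruised-integer danger region within 10^-5 below an integer. Detecting actual FDIV errors would of course additionally require a flawed Pentium, or a software model of one.

    /* Rough sketch (mine, not the promised program) of the experiment
       shape: denominators ya+yb with ya, yb random 2-decimal-digit
       amounts in [0,100), counting sums bruised to just under an
       integer.  Sizes and precisions are arbitrary assumptions. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* random decimal in [0, size) with `prec` digits after the point */
    static double decimal(double size, int prec)
    {
        double scale = pow(10.0, prec);
        return floor(size * scale * (rand() / (RAND_MAX + 1.0))) / scale;
    }

    int main(void)
    {
        long trials = 1000000, hits = 0;
        for (long i = 0; i < trials; i++) {
            double y = decimal(100.0, 2) + decimal(100.0, 2);  /* ya + yb */
            double d = ceil(y) - y;        /* distance up to next integer */
            if (d > 0.0 && d < 1e-5)
                hits++;
        }
        printf("%ld of %ld denominators bruised\n", hits, trials);
        return 0;
    }

Even with such bland inputs a noticeable fraction of the sums land a hair under an integer, purely through the binary representation of decimal data; no subtraction is needed to force it.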
I will include with the program the results of a few experiments illustrating the full range of error rates, from no errors at all (an even lower rate than Intel's estimate) to rates higher than Intel's, all obtainable merely by changing sizes and precisions independently for the numerator operands and the denominator operands. The rate claimed by Intel for typical spreadsheet users is achieved for data with generous limits on both size and precision of input data.

Besides the overall error rates and cumulative and average magnitudes, the program will also report, for the much smaller population consisting of those divisions having any nontrivial error (e.g. at least one cent in a financial transaction), the minimum, mean, and maximum error magnitudes, together with their standard deviation. This data tells the complaints office whether to expect a large number of trivial complaints or a much smaller number of serious injuries. In the former case a complaints office is neither a worthwhile return on the investment nor urgently needed; in the latter however it would be market suicide not to set up an efficient complaints office geared towards the most likely complaints.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
My every word is copyright, especially those not in the dictionary.

=================================================================
-------------------------PENTIUM REPORT # bug31----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15798 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic
Subject: Re: seriousness of intel FDIV bug
Date: 18 Dec 1994 18:52:39 GMT
Organization: Computer Science Department, Stanford University.
Lines: 69
Message-ID: <3d20dn$e80@Radon.Stanford.EDU>
References: <3csbasINNahs@life.ai.mit.edu> <1994Dec17.222208.11858@mole-end.matawan.nj.us> <3d1cta$q33@b.stat.purdue.edu>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:25188 comp.arch:15798 comp.arch.arithmetic:835

In article <3d1cta$q33@b.stat.purdue.edu>, Herman Rubin wrote:
>Before a bug is known, there should only be ordinary
>liability.  But when a bug becomes known to a manufacturer or vendor,
>it becomes incumbent on the INDIVIDUALS responsible to take all
>reasonable steps to see that the users are notified.

This is not at all clear-cut.

PAST. You have to understand the social hierarchy of bugs; there's a nontrivial entomological pecking order here. Showstoppers mean war: they are hunted down mercilessly and shot on sight, no questions asked. Felons putting users at serious risk generate an all-points bulletin. Pickpockets are everywhere, and when there's a rash of them people are simply warned to watch their wallet. Industry practice today is not to report every known pickpocket to every tourist entering town. Too expensive and no one cares ("You have pickpockets? Oh dear, now how are your restaurants?").

The really hard question in the case of the Pentium bug is, when did the seriousness of this bug for at least number crunching and high finance become clear to Intel? Up to a certain point, industry standards permit Intel to simply log the bug and deal with it in due course without notifying users.
Just as it is impossible today to quantify how typical applications are distributed along the error-rate spectrum, so is it impossible today to fault Intel for tardiness in reporting the bug: Intel is innocent until proven guilty. Neither issue can be resolved without a lot more work.

PRESENT. Why has Intel not modified its stand on the likelihood of errors? My guess: they placed their bet when their horse, Low Rate, looked to them like an odds-on favorite, back when Intel was the only one making book on the event, for the simple reason that Low Rate seemed then to be the only horse in the race. Then up rolls a slew of owners and bettors, whom the media can't tell apart, the owners put their horses in the race, the bettors plunge heavily on the newcomers, and pretty soon Low Rate's odds start stretching out. What do you do? Do you take back your bet? If you really did believe in Low Rate all along and she wins, you'd look a complete ninny if you had in the meantime taken your bet back merely under peer pressure.

Intel is both owner of and heaviest bettor on Low Rate. IBM and I own Two Digits and Bruiser respectively, but have not joined the crowd at the bookie's. (IBM pokes two digits in the divisor's eye, I give it my regulation bruising.) I can't speak for IBM when I say that I would be very happy for all the bettors on Low Rate if she wins, and very sorry for Intel if not---in fact it would be a major tragedy. That it *would* be a tragedy however is no excuse for sticking your head in the sand during the race, like the Dataquest "fire drill" report which with Monty Pythonesque logic solemnly advises us all to bet on Low Rate or face industry havoc.

FUTURE. Should Intel recall? With the numerical literacy test toned down or eliminated (how's that doing, btw?), Intel's present course seems fine: put users who want a replacement in the queue. Here Intel can bet that many users will assess the bug's overall impact on both their work and eventual resale value as down in the noise compared to the hassle and downtime of unplugging and shipping the chip and reinstalling the new one. For Intel not to place this bet confidently is to betray a lack of confidence in their bet on Low Rate.

The numerical intricacy of the bug, the guaranteed transience of this phenomenon (a more satisfactory escape in this regard than inventing TeX), and the apparent discrepancy between these two bets, are combining to maintain my intense and hopefully brief interest in the bug.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug32----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 10234 of sci.math.num-analysis:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,sci.math.num-analysis
Subject: Re: Dataquest Report: The Great Pentium Fire Drill
Date: 18 Dec 1994 22:11:57 GMT
Organization: Computer Science Department, Stanford University.
Lines: 61
Distribution: inet
Message-ID: <3d2c3d$na8@Radon.Stanford.EDU>
References: <3cnceb$fms@newsbf01.news.aol.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:25240 comp.arch:15805 sci.math.num-analysis:10234

In article , Sivasankar Chander wrote:
> There are 1738 single-precision pairs that generate errors, not necessarily
>this big.  And there are only 2^46 single precision mantissa pairs, which
>is only ~ 7 * 10e13.  *Big difference*.

That's the number generating *single-precision* errors. Although these are the natural kind to associate with single-precision operands, dividing two floats yields a double which is only truncated to a float after being transferred from the FPU to a 32-bit general-purpose register. Thus 1/3 is correct to 52 bits internally in the FPU, and that's what you get if you unload it into 64 bits. Here's a table for all combinations of input and output precision.

            NUMBER OF PENTIUM ERRORS
     by precision of operands and quotient

                        Quotient Precision
                   --------------------------------
                   |    Single    |    Double     |
                   |------------------------------|
 Operand    Single |     1738     |     9915      |
 Precision  Double |  ~4.0*10^20  |  ~23*10^20    |
                   --------------------------------

The figures for single precision operands are exact to within choice of definition of error. The figures for double-in-double-out are obtained by dividing the 2^104 mantissa pairs by 9 billion, Intel's figure for the number of divisions per error. I obtained double-in-single-out by scaling in proportion to the line above.

A naive model of the passage from single to double precision operands is to simply assume that x/y is in error just when (float)x/(float)y is in error. This is not true, but the hope is that what you lose on the swings you win back on the merry-go-round. With that model the lower line becomes 5.0*10^20 and 28*10^20. I don't understand what diluting mechanism leads to a figure 20% lower than this simplistic estimate, but Intel arrived at its figure by two independent means whereas so far I've only used one. There may well be a good explanation for the Intel figure, so that's what I've gone with.

There is no unique or even obviously best definition of single-precision errors. I take a single-precision error to be one for which x - (x/y)*y before normalization differs from x in the single-precision part of the mantissa (first 23 bits not counting the hidden bit). This differs very slightly from some other definitions, and by a potentially very large amount from the definition Intel uses, namely the first bit position where the quotient is in error---this is how Intel finds the largest possible FDIV error to be in the 12th bit when my definition locates it in the 14th. The error in 10000000 vs. 01111111 is in bit 1 or bit 8 according to whether you use Intel's definition or mine respectively.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
4.999999/14.999999 has 6 3's, not 4

=================================================================
-------------------------PENTIUM REPORT # bug33----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 25253 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Dataquest Report: The Great Pentium Fire Drill
Date: 18 Dec 1994 23:05:31 GMT
Organization: Computer Science Department, Stanford University.
Lines: 41
Message-ID: <3d2f7r$p0n@Radon.Stanford.EDU>
References: <1994Dec14.203004.655@news.media.mit.edu> <3d1np2$m4o@newsbf01.news.aol.com> <3d230j$fom@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu

In article , Duncan Murdoch wrote:
>In article <3d230j$fom@Radon.Stanford.EDU> pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes:
>>
>>If someone you've never heard of claims that 2+2=5, would you hold your
>>tongue on the off-chance that they might just have the necessary
>>authority to back it up?  Your logic is no better than Dataquest's.
>
>Dataquest's article was consistent and reasonable; it had no logical
                                                           ^^^^^^^

(Linear logic's inventor Jean-Yves Girard would be very pleased with this identification of logic and arithmetic. A Good Thing too.)

>inconsistency like 2+2=5.

To the contrary, Dataquest endorsed much *larger* arithmetic errors than 2+2=5, and I quote, "Financial users could conceivably be affected by the bug. However, the size of the error is again likely to be immaterial, representing hundreds of dollars in a million-dollar transaction." (Granted that the *relative* error of hundreds out of a million is smaller than 1 out of 4, but I hope I will be permitted one brief use of absolute error to make my point...)

Ross Perot is entitled to claim 2+2=5 if he's willing to make up the difference with his own money. Not that it would last very long. If the Pentiums in the US are collectively chugging along at 10^14 instructions per second, with as few as one FDIV per million instructions, that's still 9*10^9*10^6/10^14 = 90 seconds between errors even for random data. But that rate of divisions might be way off, and it's anyone's guess what proportion of the daily fluctuations in the deficit is attributable to undetected division errors. At what odds would you feel safe betting it's less than $1000?

To Keith Sullivan ("Yes, that you're windy"): if I told him to go fly a kite he'd do it to make his point. I do worry about it though, just not enough to do any good.

--
Vaughan Pratt    http://boole.stanford.edu/boole.html
4.999999/14.999999 has 6 3's, not 4

=================================================================
-------------------------PENTIUM REPORT # bug34----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15810 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic
Subject: Re: seriousness of intel FDIV bug
Date: 19 Dec 1994 00:39:32 GMT
Organization: Computer Science Department, Stanford University.
Lines: 42
Message-ID: <3d2ko4$rkc@Radon.Stanford.EDU>
References: <3d20dn$e80@Radon.Stanford.EDU> <3d2fv3$q64@Sequoia.picosof.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:25276 comp.arch:15810 comp.arch.arithmetic:838

In article <3d2fv3$q64@Sequoia.picosof.com>, Bob Richardson wrote:
> His contention about industry standards is quite correct.  I have

Thanks, Bob. Quite apart from saving my bacon, this was a wonderfully informative post.
I'll resist the strong temptation to reply to it all, and take up just one very interesting point that others have already addressed, but not nearly as often and vociferously as it deserves.

> However, this brings me to an interesting point vis-a-vis Professor
>Pratt's previous post regarding the error rate, and the difficulty
>of determining a valid error rate since we can't define a "typical
>user".  Is the cat in the box dead or alive?  Well, I'd say we can
>infer a rather high error rate given the rather short time which
>elapsed since the i486 price cuts and the conversion of the marketing
>channel to Pentium based machine and the discovery of the error(s).

Like any crime, the pecking order for numerical errors is: undetected, detected, corrected.  Your estimate of the *detected* error rate (if I've interpreted you correctly) is probably not too different from anyone else's, and Mathworks is addressing the corrections issue.  Trouble is, no one has so far convincingly substantiated their estimate of the *undetected* rate.  If the Pentium's current contribution to the deficit measured in dollars per second can be shown to be either more or less than the cost of aerosol-induced global warming I would love to see the calculations.

These errors are insidiously small---just small enough to get in under the radar, just large enough that in the enormous quantities implied by millions of Pentiums, Andy's meteors have the potential to bomb the place to hell in a handbasket in a matter of hours.  I cannot imagine a more effective way of screwing up the numerical computing world than omitting a few little entries from the partial divisor table.  Andy Grove nothing, this is interplanetary sabotage.  Aliens stole Intel's brain, and it's anyone's guess what they are now planning to do with it.

-- Vaughan Pratt http://boole.stanford.edu/boole.html 4.999999/14.999999 has 6 3's, not 4 ================================================================= -------------------------PENTIUM REPORT # bug35---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 25326 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.sys.intel Subject: Re: Pentium 90 for a Pentium 100? Date: 19 Dec 1994 04:18:20 GMT Organization: Computer Science Department, Stanford University. Lines: 37 Message-ID: <3d31ic$4j4@Radon.Stanford.EDU> References: <1994Dec18.182746.80926@kuhub.cc.ukans.edu> NNTP-Posting-Host: sunburn.stanford.edu

In article <1994Dec18.182746.80926@kuhub.cc.ukans.edu>, Jim Wilson wrote:
>Has anyone tried when they got there replacement chip asked if
>you can give them some extra dough and just get a P100?  Just curious
>cause that is what I would like to do if at all possible, since I have
>to get a new cpu anyway might as well get the best they got :)

Disclaimer: the following is intended purely as a facetious recommendation; you follow it at your own risk.

I estimate the probability of encountering an error running a randomly selected P90 at 100 MHz to be < one millionth of the chance of encountering the Pentium bug.  This on the basis of having seen the exact same million bugs on two P90's running at respectively 90 and 100 MHz, and never having had any other problem.
You could be unlucky: due to manufacturing variations my P90 might be especially good at adapting to 100 MHz (making all readers of this message unlucky), or your P90 might be especially bad at it (just you are then unlucky).  Unlike the division error, this is a crap shoot.  The division error guarantees you the same bugs on every processor.

So put your faith in those of us who understand statistics better than you, and your sped-up P90 will be good for the next 27 billion years.  Leaves Japanese quality in the dust, not to mention you.  But you will have been able to leave that extra dough to your kids.

These remarks are to be read in the context of table 5-1 on page 13 of the Intel White Paper, which gives mean times between failure for 8Mb of DRAM without ECC, particle defects in the Pentium chip, 8Mb of DRAM with ECC, and Pentium FDIV on a typical spreadsheet, as respectively 7, 200-250, 700, and 27,000.  Units are years.

Reread the disclaimer at the top.

-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug36---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 25328 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: alt.sys.pc-clone.gateway2000,comp.sys.intel Subject: Re: Why does Intel want the bad ones back? Date: 19 Dec 1994 04:30:15 GMT Organization: Computer Science Department, Stanford University. Lines: 34 Message-ID: <3d328n$4sc@Radon.Stanford.EDU> References: <3d2vk9$kv4@sundog.tiac.net> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU alt.sys.pc-clone.gateway2000:25531 comp.sys.intel:25328

In article <3d2vk9$kv4@sundog.tiac.net>, wrote:
>In , jwalper@bbs.sd68.nanaimo.bc.ca (John Walper) writes:
>|In article <3csq40$f4p@lynx.dac.neu.edu> csolomon@lynx.dac.neu.edu (Chad Solomon) writes:
>|>Subject: Re: Why does Intel want the bad ones back?
>|>From: csolomon@lynx.dac.neu.edu (Chad Solomon)
>|>Date: 16 Dec 1994 14:34:24 -0500
>|
>|>> Does anyone know why on earth Intel would want to pay Fed Ex to return
>|>> the bad chips?  Since when can they be repaired?  Just curious ...
>|
>|> Isn't it obvious, they are probably going to turn around and reuse
>|>them. ;-)
>|...
>
>It's the only way Intel has of confirming they are replacing a
>chip rather than just sending them to whoever claims to have one.
>I don't think this is an issue.

If the system is still under warranty the same procedure should apply as for any system under warranty: you get a replacement part and keep the old part.  This point was made in a letter in today's SJ Mercury News.

This particular fault in the warranted merchandise is a little out of the ordinary however.  Usually something either works or it doesn't, more or less.  Here it works for everyone, except for very occasional and barely noticeable changes to one's data.  It seems to me that the enormous temptation to keep replaced units in service or to sell them as barely flawed merchandise makes a whole new ballgame out of the old warranty rules.  High school debating teams would have a field day with this one.
-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug37---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 25703 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: alt.sys.pc-clone.gateway2000,comp.sys.intel Subject: Re: Why does Intel want the bad ones back? Date: 20 Dec 1994 05:26:26 GMT Organization: Computer Science Department, Stanford University. Lines: 46 Message-ID: <3d5pu2$ove@Radon.Stanford.EDU> References: <3d2vk9$kv4@sundog.tiac.net> <3d328n$4sc@radon.stanford.edu> <1994Dec20.013555.14346@mercury.ncat.edu> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU alt.sys.pc-clone.gateway2000:25652 comp.sys.intel:25703 In article <1994Dec20.013555.14346@mercury.ncat.edu>, wrote: >In article <3d328n$4sc@radon.stanford.edu> pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes: >>If the system is still under warranty the same procedure should apply >>as for any system under warranty: you get a replacement part and keep >>the old part. This point was made in a letter in today's SJ Mercury >>News. > >I just got a note from intel saying the reason that the chips were being >collected and destroyed was to avoid the possiblity that someone would buy >a buggy chip and ask for a new chip again. :-) Sounds like to me that would >happen. I could see it know! The following are only my impressions, I'll gladly defer to the facts. Apples and oranges. Intel is not replacing chips under warranty. If you tell Intel you have a chip they will mail you one. Anyone in the US can do this. You must pay for what they send you within 30 days, either on your credit card or by sending them a chip. A warranty replacement of the chip would follow a procedure that included an audit trail of where and when you acquired the chip, including showing receipts when requested. Unlike Intel's procedure, it should not be possible to get two warranty replacements on the one receipt. There is nothing wrong with Intel's procedure, it simply is a different animal from a replacement under warranty. My take on the mini-exam is that Intel was trying to sort these phone orders for good Pentiums by urgency of need. Given the limited supply, the presumed goal of giving existing customers priority, and the ease with which anyone can call up to order a good Pentium, one can understand Intel trying to come up with *some* workable solution. Too bad they came up with the exam solution; with 20/20 hindsight it's hard to imagine a worse one PR-wise, just like it's hard to imagine a worse way to mess up the FPU than with radar-invisible mosquitoes delivering malaria most people won't catch, and which took doctors more than a year to notice. It is conceivable that this incident will spawn individual component warranties for expensive components in the future. Warranties give manufacturers a great competitive edge provided they can afford them. 
-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug38---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 25776 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.sys.intel Subject: Re: REPOST: The Hard-Wired Example (confirming IBM's findings) Date: 20 Dec 1994 12:19:01 GMT Organization: Computer Science Department, Stanford University. Lines: 172 Message-ID: <3d6i3l$6j0@Radon.Stanford.EDU> References: <3d4i5t$t6@tuba.cit.cornell.edu> NNTP-Posting-Host: sunburn.stanford.edu

In article <3d4i5t$t6@tuba.cit.cornell.edu>, Allen Kim wrote:
>
>        A                B
>1  ("alpha")      (Number of months)
>   = 1/12 = 0.083333333      22.5
>
>A2, B1, and B2 are hard-wired, and the rest are calculated values.
>
>Notice that in this implementation, I got no error!...Wondering if IBM lied, I
>decided to change A1 from a calculation into a hard-wired number, so
>I typed 0.083333333 exactly into the cell, and then here are the
>results: ....
>This case gives the error that IBM publishes.  The only difference was
>that in the first example, "alpha" was calculated from 1/12, and the
>second example, "alpha" was typed straight in.
>...
>The key words in IBM's example are "hard wired."  In other words, you
>must do IBM's example *exactly* as explained in order to hit the
>Pentium bug.
>
>Now what does this say about IBM and its text on the Pentium bug?
>After all, IBM did say that their example was a case where
>"hard-wired" constants can be put into a program or spreadsheet, and
>that in this case, the hard-wired constant can lead to an error.
>
>But this example seems too artificial for me.

The same thing bothered me too, and I asked IBM's Fred Gustavson about this.  He said that many organizations mandate, as specific truncated decimals, constants that you and I would find more natural to write as rationals.  I challenged him to produce examples.  He complained he was too busy with management stuff, but a day later he sent me the IBM federal credit union's quoted daily rates for secured, realty, signature (?) and auto loans.  Expressed as yearly rates these are respectively 9%, 9.5%, 13.5% and 10%.  The credit union however gives you actual daily rates that *they* calculate, not you (who knows, maybe IBM has a system where they call up Fred to ask him what number of decimal digits will put the Pentium in the worst light :).  The current daily rates are specified in the right hand column as

     9/365   ~ .024658
     9.5/365 ~ .026027
    13.5/365 ~ .036986
    10.0/365 ~ .027397

Ok, so these are real examples culled off the wall (or wherever they're posted) of a working office.  Let's see whether any of them leads to an error.

Fred's suggestion is to convert them back into yearly rates (these will be more official than the original annual rates they were derived from if the daily rates happen to be the official ones; I don't know whether they are in this case, but say they were).  So we multiply them all by 365.  Now we want to know if any of these four numbers are bad divisors.  (Why are we dividing by them?  Who knows, maybe to get a ratio with some competing rate, my accountant says such comparisons are done this way a lot.)
To do this it helps to understand the five *Coe intervals*, as I'll call them, found in every octave [x,2x).  Within octave [16,32) they consist of all half-open intervals of width 1/256 ending at an integer multiple of three, of which there are five, namely [18-1/256,18), [21-1/256,21), [24-1/256,24), [27-1/256,27), and [30-1/256,30).  To within a power of two, all *known* bad divisors (this is not yet a theorem!) lie in one of these intervals, with the riskiest divisors concentrated close to the top of each interval.  It is easier to work with unit intervals, so we scale everything by 256 to the octave [4096,8192), where these intervals become [4607,4608), [5375,5376), [6143,6144), [6911,6912), and [7679,7680).  Doing all this for our four rates, we get:

                                                   Nearest pentad
    .024658*365*512 =  9.000170*512 = 4608.087040       4608
    .026027*365*512 =  9.499855*512 = 4863.925760       4608
    .036986*365*512 = 13.49989*512  = 6911.943680       6912
    .027397*365*512 =  9.999905*512 = 5119.951360       5376

The second and fourth rates are very safe, nowhere near any pentad.  The first rate just *barely* misses the first Coe interval (whew); if the credit union had just rounded it down, giving 4607.900160, they'd have landed in the top of that interval.  The third rate however lands right in the fourth Coe interval, very close to the top.  Given that these intervals constitute only 5/4096 of this octave, it is scary that half of these "bruised annual rates" land in or almost in one, and at the high risk end at that.

Ok, so 13.49989 is risky as a divisor, but how risky?  To test this I divided it into random reals in [0,1], generated by drand48().  The error rate was one in every four million divisions.  The errors were uniformly distributed (by exponent) starting from 8*10^-6 and going down all the way to 10^-18 from there.  A mildly bad divisor.  An example bad dividend for this divisor is 124.31147, which gives a relative error of 6*10^-6.  Here are some more high-error dividends, with their corresponding relative errors.

    32.189679   2.4*10^-07
    45.689569   1.7*10^-07
    62.155735   7.9*10^-06   this one is off by 1 in 130,000
    64.379358   2.4*10^-07
    91.379138   1.7*10^-07

You might say that this one in 4 million rate was so much faster than one in 9 billion because we stood under the meteor 13.49989.  But this meteor came from one of four mandated loan rates Fred got from a credit union and a perfectly natural multiplication.  Not that I know what a signature loan is.

==========

If the interval-checking arithmetic seems like too much work, you can take the easy way out and do what I really did (I worked out the above table after the fact so people could see behind the scenes).  I just applied the following bit-hacking algorithm that checks whether its argument lies in a Coe interval.  I use this procedure to find the 5/4096 bad-divisor cases when collecting statistics by a search involving many divisors.

int risky_divisor(double x)
{
    /* mantissa bits 5-12 all 1's: x lies in the top 1/256 of its
       sixteenth-of-octave slab (assumes a little-endian double) */
    return ((char*)(&x))[5] == -1
        /* top 4 mantissa bits pick the slab; mask 0x2492 selects the
           five slabs whose tops are pentads */
        && ((0x2492 >> (((short*)(&x))[3] & 0xf)) & 1);
}

The first half of the test checks that bits 5-12 are all 1's (for safety this might better be 5-10).

==========

>First of all, people
>using spreadsheets would not convert from months to years by
>hand-typing a decimal factor that goes to the ninth digit.  After all,
>it's much more convenient to type in X*(1/12) than X*0.083333333, plus
>it's more readable, too.  People looking at the spreadsheet can easily
>understand what's going on.

Indeed.  When will the people who dictate these constants discover rationals?
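(A concrete replay of the pentad table above; this sketch is mine, not part of the original post.  Multiplying by 512 changes only the exponent, never the mantissa, so the yearly rates can be fed to risky_divisor() directly.  Like the test itself it assumes little-endian IEEE-754 doubles and a signed char; it should flag only 13.49989.)

#include <stdio.h>

/* the Coe-interval test from above */
int risky_divisor(double x)
{
    return ((char*)(&x))[5] == -1
        && ((0x2492 >> (((short*)(&x))[3] & 0xf)) & 1);
}

int main(void)
{
    /* the four quoted daily rates, converted back to yearly rates */
    double daily[4] = { .024658, .026027, .036986, .027397 };
    int i;
    for (i = 0; i < 4; i++) {
        double yearly = daily[i] * 365;
        printf("%9.6f  %s\n", yearly,
               risky_divisor(yearly) ? "lands in a Coe interval" : "safe");
    }
    return 0;
}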
>Andy Grove stated that IBM is "finding exactly when and where a meteor
>will strike and going there themselves."  Somehow, Mr. Grove's
>statement doesn't sound so ridiculous anymore.

Your arguments point up the extreme reasonableness of Andy Grove's position.  Gustavson's arguments point up the ease with which probabilities can be grossly misestimated using such perfectly reasonable reasoning.  (Marilyn vos Savant (Ask Marilyn) gets a lot of mileage out of this sort of thing; she's not so hot on physics but she's got good statistics intuition.)

>Then there are also people (like me)
>who think the issue is exaggerated and will blow over in about two
>months, depending on how fast Intel can produce fixed chips.

I very much hope it does.  It might blow over (or at least defuse itself a little) if only Intel would enter into more open dialogue about all these probabilities and possibilities everyone is bandying about.  Unfortunately the litigation aspects may be putting a big crimp in this.  Earlier I suggested aliens stole Intel's brain, but the truth may well be that it was lawyers.  What *they're* going to do with it is plain as day: make scads of money off it.

>The point is that IBM has succeeded in creating more fear over the
>Pentium.

Years of practice, and don't forget the uncertainty and doubt.  With the astronomical range of error rates and the general lack of information about most applications' profiles, it's not half surprising that there's so much uncertainty and doubt about this issue, with or without IBM.

-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug39---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 10271 of sci.math.num-analysis: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.sys.intel,comp.arch,sci.math.num-analysis Subject: Three precisions if you count extended (was Fire Drill) Date: 20 Dec 1994 13:09:46 GMT Organization: Computer Science Department, Stanford University. Lines: 75 Distribution: inet Message-ID: <3d6l2q$7v1@Radon.Stanford.EDU> References: <3cnceb$fms@newsbf01.news.aol.com> <3d2c3d$na8@Radon.Stanford.EDU> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU comp.sys.intel:25783 comp.arch:15861 sci.math.num-analysis:10271

In article <3d2c3d$na8@Radon.Stanford.EDU>, Vaughan R. Pratt wrote:
>Here's a table for all combinations of input and output precision.
>
>                  NUMBER OF PENTIUM ERRORS
>            by precision of operands and quotient
>
>                          Quotient
>                          Precision
>
>                 --------------------------------
>                 |   Single    |    Double      |
>                 |------------------------------|
>         Single  |    1738     |     9915       |
> Operand         |             |                |
>Precision        |             |                |
>         Double  | ~4.0*10^20  |  ~23*10^20     |
>                 --------------------------------

Well, I blew this big time.  I just realized that the 9915 figure I measured counts *all* Pentium bug errors, *including* those that happen in the 12-bit extension extending the 52-bit mantissa to 64 bits.  I'd checked the number itself several times and couldn't find anything wrong with any of the arithmetic, so I finally went with it.
This led to a 20% discrepancy between Intel's estimate of the error rate and the naive 2^58 scaling of 9915, which I remarked on as unexplained in my post.  The explanation turned out to be my stupidity.

Here's a corrected table, which moves the 4 entries in the above table apart to make room for 5 more entries in a 3x3 table, which makes extended precision a possible precision for both operands and result independently.  I obtained line 1 by an exhaustive search, see /pub/FDIV/individual.bugs/bug26 on boole.stanford.edu.

                        NUMBER OF PENTIUM ERRORS
                  by precision of operands and quotient

                               Quotient Precision
                    ____________________/\__________________
                   /                                         \
           ----------------------------------------------------------
           |   Single    |   Double    |  Extended   |    Total     |
           |--------------------------------------------------------|
    Single |    1738     |    7863     |    9915     | 7.036*10^13  |
 Operand   |             |             |             |              |
Precision  |             |             |             |              |
    Double | 5.009*10^20 | 2.266*10^21 | 2.858*10^21 | 2.028*10^31  |
           |             |             |             |              |
  Extended | 8.404*10^27 | 3.802*10^28 | 4.795*10^28 | 3.403*10^38  |
           ----------------------------------------------------------

The heuristic for getting lines 2 and 3 from line 1 is simply to multiply by 2^58 and 2^82 respectively (the ratios 2^104/2^46 and 2^128/2^46 of the operand-pair counts), on the win-some lose-some theory that makes "buggy(x/y) iff buggy((float)x/(float)y)" a good indicator despite not being a fact.  For any row, dividing the Total column by the Double column yields the infamous 9 billion number, 8.95 to be precise.  This is exact for single precision operands with double precision results, but scaling by 2^58 is merely a heuristic for extending this to the next row, and as such should be taken with a (hopefully small, possibly very small) grain of salt.

Modulo any fine tuning of lines 2 and 3 as more data comes to hand, and any picky complaints about my definition of quotient precision (other definitions should have only a tiny effect), I am optimistic that this table has converged at last.

-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug40---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 25797 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.sys.intel Subject: Re: A simple example for Pentium FDIV flaw Date: 20 Dec 1994 14:08:21 GMT Organization: Computer Science Department, Stanford University. Lines: 94 Message-ID: <3d6ogl$a2t@Radon.Stanford.EDU> References: <19941219.151430.414@almaden.ibm.com> NNTP-Posting-Host: sunburn.stanford.edu Keywords: Pentium FDIV flaw simple example

In article <19941219.151430.414@almaden.ibm.com>, wrote:
>
>Here is another simple example which demonstrates the FDIV bug.
>
>   Q = (2148959.8 - 1100000.8) / (2.3 - 0.8)
>     = 699263.33   - Pentium Answer
>       699306      - correct answer is an exact integer

When I saw Agarwal's first example of this kind, 4.1-1.1 as bruised 3, I was both startled and impressed.  I have since been looking at this general class of examples to get a sense of how serious they really are.  I have measured the following densities of "bad divisors," defined as reals lying in one of the Coe intervals.
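(For anyone wanting to reproduce this kind of density measurement, here is a sketch of mine, not the actual search script, for the d.d-d.d line of the table below.  It also illustrates the trap mentioned below: the differences must be bruised the same way the application would bruise them, or the search manufactures its own noise, so the exact count is sensitive to how the candidates are generated.  risky_divisor() is the Coe-interval test from bug38; little-endian IEEE-754 doubles assumed.)

#include <stdio.h>

/* Coe-interval test from bug38 */
int risky_divisor(double x)
{
    return ((char*)(&x))[5] == -1
        && ((0x2492 >> (((short*)(&x))[3] & 0xf)) & 1);
}

int main(void)
{
    int a, b, risky = 0, total = 0;
    /* all positive differences of the form d.d - d.d */
    for (a = 1; a <= 99; a++)
        for (b = 0; b < a; b++) {
            double diff = a / 10.0 - b / 10.0;  /* naturally bruised */
            total++;
            risky += risky_divisor(diff);
        }
    printf("%d risky divisors out of %d candidates\n", risky, total);
    return 0;
}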
The first line of the following table gives the probability that a randomly chosen pair of the form d.d-d.d > 0 (e.g. 2.3-0.8 as above) will be a bad divisor.  There are roughly 5000 candidates, and 31 are in the region of risk.  (The problem of counting these turned out to be riddled with traps for young players: it is easy to generate other kinds of noise besides the one you're studying, as a side effect of the search process itself.)

    d.d   - d.d       1 in  160    (31/5000)
    dd.d  - dd.d      1 in  360    (1392/500000)
    d.dd  - d.dd      1 in  750    (666/500000)
    dd.dd - dd.dd     1 in 2300    (21936/5*10^7)
    Random            1 in  800    (5/4096)

When a digit is added to the left (numbers up to 99.9) the probability drops by more than a half.  When added to the right however, corresponding to dollar amounts up to $9.99, it drops more than fourfold.  Dollar amounts up to $100 are at such low risk that they are lower than random, *assuming a uniform risk in each risky interval*.

This is a bad assumption.  The reason is that each risky region concentrates most of the risk right at its high end.  Thus a random number in the risky region is not automatically put at *high* risk.  In contrast, when d.d-d.d enters a risky region, it will be because it is bruised in the least significant bit.  I'd like to say this puts it at significantly higher risk, but anything *that* close to a pentad (the right end of the risky region) is, I suspect, somewhat or even a lot lower risk than just a tiny amount lower down.  This needs to be measured more carefully.

My conclusion pending such measurements is that d.d-d.d may well be reasonably serious, because of its relatively high probability and because this size of data may arise often enough to present a problem.  The three-digit operands to subtraction are on somewhat weaker ground, and dollars up to $99.99 would need to be investigated more closely before I would accept any claim about their high probability.

I would like to contrast isolated examples such as 4.1-1.1 and 2.3-0.8 with my various tables, starting with those in bug1 and bug2 (on boole.stanford.edu in /pub/FDIV/individual.bugs).  There I simply *defined* three parameters without setting them to specific values: number of digits to the left of the decimal point (<= 3 in bug1, <= 2 and <=1 in bug2, >3 still needs to be explored but I am sure the probabilities drop off very fast), amount of bruising (ranging from 10^-3 to 10^-15), and tolerance for resulting error (similar very wide range), and asked what was the *exact* probability of an error as a function of those parameters.  Since the sample space is feasibly small the probability can be measured directly by exhaustive search instead of via Monte Carlo methods.  One then obtains an interesting shape from which it becomes clear where the highest risks are.  This shape can be put to various uses; for example one can integrate over regions of this shape when one knows that this region is where one's program remains.

In the light of Toon Moene's just-posted observations about his numerical weather prediction program and its low rate of buggy divides, which he attributes to his data being sufficiently random (exactly as Dik Winter predicted for Toon a few days ago), it would seem that operands with enough decimal digits, perhaps four, perhaps six, with these being reasonably uniformly distributed, should be subject to the basic 1 in 9 billion rate (see my recently posted 3x3 table for how this depends on input and output precision).  For *clean* (i.e.
integer or unbruised) data with only three decimal digits, there is no risk at all, but with any bruising the probabilities go up, and with bruising at around 10^-7 they go up very sharply.  Driving the number of digits down to two digits increases it further yet.

It seems highly plausible to a number of people that these small integers should arise often, and in a context where they may be subject to bruising.  However there have not to date been any sightings of ominous clouds of such mosquitoes gathering for a strike.  If it turns out that these are rarely encountered by most users if at all, then this potentially major source of concern may in practice not be so big a deal.  This was a distinct possibility that I discussed carefully at the end of bug1.  Ruling this sort of possibility out to any given user's satisfaction however may be very hard.

-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug41---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Newsgroups: comp.sys.intel,comp.arch.arithmetic Subject: Re: Pentium FDIV bug details and detection with OBDDs Summary: Expires: References: <3d6pjo$3i@cantaloupe.srv.cs.cmu.edu> Sender: Followup-To: Distribution: Organization: Computer Science Department, Stanford University. Keywords: Cc:

In article <3d6pjo$3i@cantaloupe.srv.cs.cmu.edu>, Randy Bryant wrote:
>In article <3d6lk2$87d@Radon.Stanford.EDU>
>>Vaughan Pratt responded to my earlier post:
>
>Like any bug, once you've identified it, there's no need for a fancy
>tool to detect its presence.  What I'm trying to establish is whether
>the Intel engineers could have used OBDDs to detect a design flaw
>prior to fabrication.  The advantage of OBDDs, vs. other tools such as
>theorem provers, is that we can generate an OBDD representation of the
>circuit functions directly from a gate-level representation of the
>circuit.

Oh, *a* design flaw, not this one.  OBDDs should be very good for finding design flaws.  I thought you were trying to get some mileage out of The Bug.

Btw, BDD's may be state of the art for CAV, but I predict a backlash within a year or so from the deduction community, for whom BDD's are a nice heuristic but nevertheless a simplistic bandaid.  BDD's need logic as an equal partner, not as a discarded aging spouse.  (Not only am I not acting in the service of Intel or IBM, I don't do logic and I don't do category theory, I do set theory and menu (antiset) theory, a potent pair.  See my url/ftp below.)

>As has been stated by others: this bug is maximally
>insidious---it doesn't show up on enough cases and doesn't create
>enough error to be caught by even extensive simulation, but it's bad
>enough to be unacceptable for at a large class of computations.
>
>I'd like to demonstrate the feasibility of my approach by generating
>a model for one iteration of the division algorithm.  That's why I'm
>trying to learn more details.

Your last paragraph has no logical connection to your second last.  To make a connection you need a bug that resulted from missing a case in designing the table.  BDD's are overkill for verifying that P = T (copying the table) was executed successfully; you can do that with verification techniques known to Alan Turing half a century ago.
Unless, that is, you were proposing, say, to develop a version of the division algorithm that could survive any 5 random hits on its table.  Now *that* would *really* be something that BDD's could go to town on!

================================================================= -------------------------PENTIUM REPORT # bug42---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Newsgroups: comp.sys.intel Subject: Re: New documents available on http://www.intel.com/ Summary: Expires: References: <1994Dec20.153214.3161@news.media.mit.edu> Sender: Followup-To: Distribution: Organization: Computer Science Department, Stanford University. Keywords: Cc:

In article <1994Dec20.153214.3161@news.media.mit.edu>, Ron Newman wrote:
>
>   ANALYSIS OF FLOATING POINT DIVIDE CALCULATIONS IN SPREADSHEET
>     APPLICATIONS IN THE COMMERCIAL PC MARKETPLACE
>
>   by M. L. Barton, Ph.D., Staff Computational Scientist, and
>      R. A. Passov, Senior Treasury Manager, Intel Corporation
>
>This paper explicitly disputes both IBM's and Vaughan Pratt's analyses
>of the "bruised integer" divisor problem.

Not true in my case: they only explicitly disputed IBM's analysis.  Had they disputed mine I would be interested to know the particular sentence I wrote that they found objectionable, and would be happy to focus on it.  As it stands however this argument is strictly between Intel and IBM; my analysis simply does not enter into their paper.

Except possibly at one point, where they may be inferring from my bug1 paper that because no small bruised integer generates errors more than 100 times the bruise, therefore *no* bruised integers do so.  This is a serious fallacy: the 4.1-1.1 example can generate errors in the 14th bit, as Intel could and should have easily verified for themselves, and would have were there as much incentive for them to consider this particular case carefully as there evidently is for IBM.
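(That 4.1-1.1 really lands in a Coe interval is easy to confirm with the bit test from bug38; this check is mine, not part of the original post.  In IEEE double arithmetic 4.1-1.1 comes out one unit in the last place below 3, and 3 is the top of a Coe interval, so the test fires.  Little-endian doubles assumed.)

#include <stdio.h>

/* Coe-interval test from bug38 */
int risky_divisor(double x)
{
    return ((char*)(&x))[5] == -1
        && ((0x2492 >> (((short*)(&x))[3] & 0xf)) & 1);
}

int main(void)
{
    double d = 4.1 - 1.1;  /* a bruised 3: prints 2.9999999999999996 */
    printf("4.1 - 1.1 = %.17g, risky = %d\n", d, risky_divisor(d));
    return 0;
}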
================================================================= -------------------------PENTIUM REPORT # bug43---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 15884 of comp.arch: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.arch,comp.sys.intel Subject: Re: THE Pentium WAR IS WON! (Intel now replaces Pentiums on demand) Date: 21 Dec 1994 00:32:14 GMT Organization: Computer Science Department, Stanford University. Lines: 57 Message-ID: <3d7t2e$2p8@Radon.Stanford.EDU> References: NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU comp.arch:15884 comp.sys.intel:26083

In article , Bo Nygaard Bai wrote:
>
>BTW., it's customary to leave the battlefield after a battle :-)

This was just a small ~zero-sum skirmish, and is what my PAST-PRESENT-FUTURE message called the future: eventually Intel would do this.  It wasn't the battle some of us were fighting (they promised me my replacement two weeks ago), and it seemed obvious from the beginning that as supplies improved Intel would soften.  The outcry simply softened Intel faster than it had planned.  The hardline-in-the-beginning strategy backfired horribly and in retrospect was a PR blunder of the first magnitude.  (You may see, but I can't, a reason to presume that they ever seriously intended to hold their hard line forever.)

PAST.  Some feel that Intel should have notified users as soon as it found out about this bug.  This one's probably going to find its way into court and is way outside my ability to judge.  As I've said before, Intel's initial perception of the bug *might* have been that it was just another of a hundred annoying little bugs a month; proving the opposite may turn out harder than some people are estimating.

PRESENT.  Intel wants its customers to believe that a buggy Pentium is safe.  Motivation: minimize the number of people going to the bother of actually replacing their chip.  I think this is an excellent strategy for Intel, with one caveat: no fair wilfully misleading the customer in order to get this effect.  The two facts, that the temptation to do so is enormous, and that Intel has not issued what most of us would regard as a reasonable statement discussing the risks, put Intel's atypical customers, whatever the notion of atypical eventually turns out to be, at unreasonable risk, and at the same time make it much harder to be sure that one really is a typical Intel customer.  How do you know Intel is being forthright with you when it will not enter into any two-way public dialogue on the risks?

The best-case scenario all round would be if everyone turned out to be a typical Intel customer.  However we have heard quite enough by now from Intel about what a low risk many or most customers are at, and Intel is starting to sound like a cracked CD on this aspect of the situation, as are the rest of us (did-didn't-did-didn't).  When are we going to get some unprejudiced discussion from Intel of the danger regions, who might be at risk from them, and when people *should* worry?  They have to be the authorities on what their chip might or might not do to people and their bank balances and their planes.  They can't leave it to IBM and random windy netters like myself churning out messages that seem in direct contradiction to Intel.  To do so is to put at risk all those customers not covered by the loudly implied "Intel warranty," that if you are typical you are safe.

On this present point the battle has only just begun.  It will be interesting to see whether it turns out in a way that leaves any winners at all.  Like all the important battles of this world, this battle is not a zero-sum game.

-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug44---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 15896 of comp.arch: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.arch,comp.sys.intel,rec.crafts.jewelry Subject: Re: THE Pentium WAR IS WON! (Intel now replaces Pentiums on demand) Date: 21 Dec 1994 04:56:41 GMT Organization: Computer Science Department, Stanford University. Lines: 41 Message-ID: <3d8ci9$ab1@Radon.Stanford.EDU> References: <3d7t2e$2p8@Radon.Stanford.EDU> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU comp.arch:15896 comp.sys.intel:26202 rec.crafts.jewelry:2013 In article <3d7t2e$2p8@Radon.Stanford.EDU>, Vaughan R.
Pratt wrote:
>This was just a small ~zero-sum skirmish

It occurred to me after thinking about this a bit more that if everyone had stopped screaming bloody murder about nonreplacement during the past seven days, Intel might have taken heart and not repented.  If that is indeed the case (how will we ever find out?), then this *was* a real battle, and well fought, and I have no business downplaying it as a "skirmish" merely because I had different fish to fry.

My expectation for the past week or more had been that the battle was already over and that Intel would be meeting all requests as soon as supplies loosened up.  What gave me pause was hearing Intel say they would take a Q4 hit for repenting.  My first reaction was, so what else is new?  I'd figured that any Intel shareholder who hadn't discounted this a week ago should not be in the stockmarket.  But if Intel honestly expected that this announcement opened flood gates that would otherwise have been secure, then the "skirmish" really was a battle.  This could only be the case if Intel were more out of touch with reality the whole past week than a smaller company could afford to be for more than ten minutes at a time.

The jewelry is incredibly beautiful, btw.  I spent several shares of Sun stock on a bunch of it for Chrissy presents.  These are the chips themselves, no carrier, but the chip is mounted like a triplex opal, i.e. a layer of quartz or more likely acrylic or something protecting the chip.  It has more flash than an opal (just trying to get a rise out of rec.crafts.jewelry :), with the submicron lines forming a diffraction grating reflecting rainbows in each of a dozen on-chip units (would someone please tell me where the FPU is, especially the P-D table, so I can point at people's ears and say that's my research area).  I got some earrings, bracelets, and cufflinks (cufflinks, who wears cufflinks? well, they were too pretty to resist).  No commissions yet but I can always hope. :)

-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug45---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 10312 of sci.math.num-analysis: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.arch.arithmetic,sci.math.num-analysis,comp.sys.intel Subject: Re: Why didn't Intel test the Pentium FDIV *tables*? Date: 21 Dec 1994 16:08:45 GMT Organization: Computer Science Department, Stanford University. Lines: 29 Distribution: inet Message-ID: <3d9jud$sm5@Radon.Stanford.EDU> References: <1994Dec14.200603.28449@wdl.loral.com> <3d84mj$44v@news1.deltanet.com> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU comp.arch.arithmetic:881 sci.math.num-analysis:10312 comp.sys.intel:26367

In article <3d84mj$44v@news1.deltanet.com>, David K. Every wrote:
>As I understand it... it is not testing the entries that causes
>problems... but you have to test various SEQUENCES through the tables.
>Which gets you back to the first problem... lots of potential
>sequences.

A simple but effective check on one's program is a test suite that exercises every branch of the program.
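(The table-lookup analogue of that idea, in a hypothetical sketch of mine with a made-up table standing in for the hardware: one probe per entry is enough to expose entries that were simply left out, which is exactly the form the FDIV flaw took.)

#include <stdio.h>

#define ENTRIES 2048

static int spec(int i) { return (3 * i) >> 2; }  /* made-up reference function */
static int table[ENTRIES];                       /* the "hardware" under test */

int main(void)
{
    int i, bad = 0;
    for (i = 0; i < ENTRIES; i++)      /* build the table from the spec... */
        table[i] = spec(i);
    table[100] = table[500] = table[900] = 0;   /* ...then leave out five */
    table[1300] = table[1700] = 0;              /* entries, FDIV-style    */
    for (i = 0; i < ENTRIES; i++)      /* one probe per entry finds them all */
        if (table[i] != spec(i))
            bad++;
    printf("%d bad entries out of %d\n", bad, ENTRIES);
    return 0;
}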
This does not constitute complete verification by any means, yet it catches a remarkable number of bugs, and is among the most cost-effective software tests.

For hardware the simplest failure mode is for the power cord to be left unplugged.  The second simplest is for wires to be stuck at 0 or 1.  Any manufacturer planning to sell more than a billion dollars' worth of a chip, that does not include among its test suite for that chip a test that exercises every wire to verify that it is stuck at neither 0 nor 1, does a gross disservice to its stockholders.  In the case of the Pentium division flaw, such a test would have immediately found all five missing table entries.

This is in hindsight.  I am amazed that it takes a problem of the magnitude of the Pentium division error to wake us all up to such simple realities.

-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug46---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 15914 of comp.arch: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic Subject: Re: seriousness of intel FDIV bug Date: 21 Dec 1994 16:49:05 GMT Organization: Computer Science Department, Stanford University. Lines: 26 Message-ID: <3d9ma1$n7@Radon.Stanford.EDU> References: <3d2fv3$q64@Sequoia.picosof.com> <3d4h0o$17qv@locutus.rchland.ibm.com> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU comp.sys.intel:26381 comp.arch:15914 comp.arch.arithmetic:882

In article <3d4h0o$17qv@locutus.rchland.ibm.com>, Del Cecchi wrote:
> Problems which can cause data integrity or undetected errors are taken very
>seriously by some companies.  Much more seriously than bugs which could cause a
>program to crash once in a rare while.

Time is money, and a crash is downtime and lost money.  This is liveness.  Information is also money, and damaged data is therefore also lost money.  This is safety.  In comparing safety failures to *rare* liveness failures you are comparing apples and oranges.

What if the data damage or undetected error happens only exceedingly rarely, and moreover the undetected error is so tiny that it is equivalent to the error from using single precision?  Intel continues to insist that both of these are the case, and may well be able to persuade a jury that it was reasonable to believe these before December, if not after.

Competitive companies take both safety and liveness very seriously.  I conjecture that the most competitive companies are those that balance these fundamental concerns the best.

-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html In public, for the sake of decency I clothe my speculation in facts. Privately I speculate occasionally, but you can go blind doing it too often. ================================================================= -------------------------PENTIUM REPORT # bug47---------------------- Vaughan R.
Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 26421 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.sys.intel Subject: Re: Intel shouldn't give in! Date: 21 Dec 1994 18:03:07 GMT Organization: Computer Science Department, Stanford University. Lines: 24 Message-ID: <3d9qkr$2t1@Radon.Stanford.EDU> References: <3d9dk1$5hs@newsbf02.news.aol.com> NNTP-Posting-Host: sunburn.stanford.edu In article <3d9dk1$5hs@newsbf02.news.aol.com>, JohnL25490 wrote: > All this is gonna do is to force the manufacturers and software >companies to put extra effort into longer, redundant testings and thus >drives up the cost and delays the product release. What it might do is encourage them to test every product with a suite that exercises every component at least once. This is a relatively simple and short test for whether the component is present and whether it is doing the right thing on at least one case. Such a simple test, which has been strongly recommended for software for years, but evidently not for chips, would have caught the missing entries in the Pentium FPU's partial divisor table. This only shows the continuing enormous gap between the worlds of hardware and software. Closing this gap will be hard, but any companies that do not make a strenuous effort to do so from now on put themselves at great risk of being judged negligent by juries investigating snafus like Intel's. -- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html In public, for the sake of decency I clothe my speculation in facts. Privately I speculate occasionally, but you can go blind doing it too often. ================================================================= -------------------------PENTIUM REPORT # bug48---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 26409 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.arch,comp.sys.intel Subject: Re: THE Pentium WAR IS WON! (Intel now replaces Pentiums on demand) Date: 21 Dec 1994 17:36:35 GMT Organization: Computer Science Department, Stanford University. Lines: 29 Message-ID: <3d9p33$276@Radon.Stanford.EDU> References: <3d7t2e$2p8@Radon.Stanford.EDU> <3d8ci9$ab1@radon.stanford.edu> <3d9c0v$gep@news.iastate.edu> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU comp.arch:15916 comp.sys.intel:26409 In article <3d9c0v$gep@news.iastate.edu>, John Hascall wrote: >Vaughan R. Pratt wrote: >}What gave me pause was hearing Intel say they would take a Q4 hit for >}repenting. My first reaction was, so what else is new? I'd figured >}that any Intel shareholder who hadn't discounted this a week ago should >}not be in the stockmarket. > > Intel stock actually gained back 3pts after announcing > that they would `do the right thing'. The stock market was invented in response to the requests of people who have no business being in it but just can't resist a flutter. "More fools they" was the response, and it was born. Some people have the good sense to use good money managers. 
The three dollars are easily accounted for by myopic money managers and knee-jerk speculators.

Someone asked me a while back if I was going to sell Intel short.  I said that would be insane.  Selling a company short is tying one leg to the ground and the other leg to the company to win a bet that the company will back up before it pulls you apart.  Would you do that with a juggernaut like Intel?

-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html In public, for the sake of decency I clothe my speculation in facts. Privately I speculate occasionally, but you can go blind doing it too often. ================================================================= -------------------------PENTIUM REPORT # bug50---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 26963 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.arch,comp.sys.intel Subject: Re: THE Pentium WAR IS WON! (Intel now replaces Pentiums on demand) Date: 23 Dec 1994 04:58:02 GMT Organization: Computer Science Department, Stanford University. Lines: 23 Message-ID: <3ddlcq$hmi@Radon.Stanford.EDU> References: <3d7t2e$2p8@Radon.Stanford.EDU> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU comp.arch:15963 comp.sys.intel:26963

(Hi, Mandy.)

In article , Joe Ragosta wrote:
>
>In case you missed it, Intel is still holding their hard line.  If you
>re-read their announcement, they offer to replace chips for the lifetime
>of the computer only for users WHO CAN SHOW THAT THEY HAVE BEEN AFFECTED.
>
>Clever wording, and it apparently got past a lot of people.

Sheesh, the noise level on this newsgroup.  How about saying where we can find this announcement?
The word "affected" does not appear in Intel's Dec. 20 press release on http://web.jf.intel.com/about-intel/press/replcrel.html, which is very unambiguous: Intel offers to exchange a processor for a concerned user. This is a fair exchange, we urgently need the processors and presumably Intel can make a small profit turning the concerned users into jewelry. Volunteers please take one step forwards. -- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html In public, for the sake of decency I clothe my speculation in facts. Privately I speculate occasionally, but you can go blind doing it too often. ================================================================= -------------------------PENTIUM REPORT # bug51---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 15974 of comp.arch: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.arch,comp.arch.arithmetic,comp.sys.intel Subject: Re: Pentium Errors and SPEC Date: 23 Dec 1994 21:35:57 GMT Organization: Computer Science Department, Stanford University. Lines: 40 Message-ID: <3dffrt$mq8@Radon.Stanford.EDU> References: <3dab5t$gr9@nntp.Stanford.EDU> <879@moene.indiv.nluug.nl> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU comp.arch:15974 comp.arch.arithmetic:920 comp.sys.intel:27183 In article <879@moene.indiv.nluug.nl>, Toon Moene wrote: >In article <3dab5t$gr9@nntp.Stanford.EDU> oberman@misd.stanford.edu >(Stuart Oberman) writes: >> The following data resulted from executing 10 of the SPECfp92 benchmarks >> with their reference input data sets: >> >> Errors Total Divides Percent SPECmark >> ------ ------------- ------- -------- >> 34260 21187229 0.162 % 015.doduc >> > >This is veeeerrrrry interesting. I have been running our numerical weather >forecasting model* (instrumented to analyse the most frequently occurring >division operations) for a test case based on input data for a 36-hour >forecast starting on December 9'th, 0 GMT. Stuart's more recent post on comp.arch.arithmetic and comp.sys.intel observes that his simulator is counting some false positives. He was kind enough to let me compare his simulator to the real thing, making it possible to quantify the false positives. With the divisor fixed at the Coe divisor 3145727, I compared the simulator's error rate to that of the Pentium, obtaining a result which when applied to the above rate for 015.doduc reduces the 0.162% figure to .0000255%. Whether this extrapolation changes for other divisors would take a rather longer run to investigate and it seemed more productive to let Stuart work on it further before doing any more testing. As a couple of posters have pointed out, only counting hits on missing table entries may overestimate the error rate. The self-correcting nature of the algorithm makes this particularly true, in that it tends to neutralize some of the errors resulting from hits on the missing entries. If this turns out to be main reason for this overreporting of errors, then this tendency must be remarkably strong. Properly interpreted, Stuart's measurements have the potential to yield very useful data about the relationship between errors in the P-D plot and errors in the quotient. 
-- Vaughan Pratt FTP: boole.stanford.edu:/pub/FDIV/README 4.999999/14.999999 has 6 3's, not 4 http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug52---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 27202 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.sys.powerpc,comp.sys.intel,comp.os.misc,comp.unix.bsd,comp.unix.pc-clone.32bit,comp.unix.sys5.r4,comp.unix.misc,comp.os.linux.development,comp.os.linux.misc,comp.os.linux.misc,comp.os.386bsd.development,comp.os.386bsd.misc Subject: Re: Interested in PowerPC for Linux / FreeBSD / NetBSD? Date: 23 Dec 1994 23:31:38 GMT Organization: Computer Science Department, Stanford University. Lines: 164 Message-ID: <3dfmkq$39@Radon.Stanford.EDU> References: <3cphs0$l6e@ddi2.digital.net> NNTP-Posting-Host: sunburn.stanford.edu Xref: Radon.Stanford.EDU comp.sys.powerpc:35706 comp.sys.intel:27202 comp.os.misc:1718 comp.unix.bsd:4594 comp.unix.pc-clone.32bit:6697 comp.unix.sys5.r4:7989 comp.unix.misc:9703 comp.os.linux.development:24526 comp.os.linux.misc:36614 comp.os.386bsd.development:2815 comp.os.386bsd.misc:4859

In article , Richard Tobin wrote:
>In article john@tyrell.net (John A. Matzen) writes:
>>If a bank is using the FPU to do numerical calculations, they could be loosing
>>hundreds of pennies every day due to rounding errors.  It doesn't matter
>>how many digits of precision the FPU is using.
>
>Sorry to be rude, but this is just drivel.  IEEE floating point is
>guaranteed (when correctly implemented!) to be exact for operations
>such as addition on integers up to 53 (?) bits.  Floating point isn't
>just any old approximation; it's a well-defined approximation that
>can be safely used by people who understand it (except on Pentiums).

The IBM examples show you to be guilty of a non sequitur at the top of your voice.  For example if you have $4.10 and you take away $1.10 then you have strictly less than $3.

Your observation *is* relevant in the case of programs that take 1.0 to denote a penny, and that restrict their divisors to powers of 2.  However my accountant tells me that most financial spreadsheet software takes 1.0 to be a dollar, the case the IBM examples are for.  Normally this is no great loss; on other than a Pentium it would take years of double precision rounding errors before you ever lost a penny.  This is forestalled in practice by the device of keeping relatively short those calculations that do not round their results to the nearest penny; I don't know how good spreadsheets are at observing this commonsense rule.

Binary arithmetic inevitably does injustice to decimals, in fact poetic injustice:

	What you see when you add
	Isn't what you've got.
	Four quarters make a dollar,
	Ten dimes do not.

Whether you add .25+.25+.25+.25 or .10+.10+.10+.10+.10+.10+.10+.10+.10+.10, the amount *displayed* is 1.00.  Internally however things aren't what they appear.  The former sum is, in binary, 1.000...000, no matter how you compute it, because 0.25 has an exact finite binary representation, namely .01.  A dime however is .0001100110011..., and when ten of them are added up at double precision, the result is not 1.000...000 as one might expect but instead 0.111...111.
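(The ten-dime sum, and the five-dime paradox discussed just below, can be checked directly; this little verifier is mine, not from the post.  It assumes the sums are actually kept in IEEE-754 double precision; on x87 hardware with extended-precision evaluation the trailing bits differ, as described below.)

#include <stdio.h>

int main(void)
{
    double dime = 0.1, ten = 0.0, five1 = 0.0, five2 = 0.0;
    int i;
    for (i = 0; i < 10; i++) ten += dime;    /* ten dimes, one at a time */
    for (i = 0; i < 5; i++)  five1 += dime;  /* five dimes... */
    for (i = 0; i < 5; i++)  five2 += dime;  /* ...and five more */
    printf("ten dimes:        %.17g (== 1.0? %s)\n",
           ten, ten == 1.0 ? "yes" : "no");
    printf("two half dollars: %.17g (== 1.0? %s)\n",
           five1 + five2, five1 + five2 == 1.0 ? "yes" : "no");
    return 0;
}

On IEEE doubles the first line prints 0.99999999999999989 and the second prints exactly 1.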
At both single precision and extended precision (if I've interpreted things correctly) ten dimes come to 1.000...001. Paradoxically, five dimes *do* make half a dollar. This provides a graphic illustration of the nonassociativity of IEEE addition: adding ten dimes one at a time is not exact, while adding together two sums of five dimes each is exact. The difference between adding a dime to nothing and adding it to an exact half dollar is in the alignment: dime 6 has to be shifted three places further to the right than dime 1. Dimes 7-10 are shifted 2 or 1 places further right than dimes 2-5 respectively, bruising them also. The dollar emerges from this ordeal thoroughly cowed by its beating; its feeble protest to the FPU is greeted by a stern "We don't do things by halves in decimal arithmetic."

Here in hex are the low order 16 bits of the 10 single precision sums of one to ten dimes. To see why each bit is just so, it helps to understand the full details of alignment, normalization, and rounding; see David Goldberg's beautifully done Appendix A (A.4 treats addition) in Hennessy and Patterson's "Computer Architecture."

	cccd cccd 999a cccd 0000 999a 3334 ccce 6668 0001

Here are the corresponding trailing bits for double precision:

	999a 999a 3334 999a 0000 3333 6666 9999 cccc ffff

I tried doing this for extended precision. Unfortunately I don't have a Pentium manual and my 486 manual isn't clear about all the details of 80-bit extended. The main problem is to suck out the 12 hidden extension bits without falling foul of the rounding rules. The trick I've used here is to subtract 819/8192. (or 1638/16384., or 3276/32768., or 6553/65536., or 13107/131072., get the pattern?) from .1, which turns out to create a number that normalizes by shifting up by 12, 12, 12, 15, or 16 bits respectively (this is not an arithmetic progression because bits 14 and 15 of unnormalized .1 are 0). The following illustrates this for the last of these, for a 16-bit shift. The data shown is hex for the trailing end of .1, K, .1-K, etc. The same 16-bit shift is realized for .2-2*K, .3-3*K, etc., which is very convenient. If you read C, see the code below for a more detailed explanation; normalization is faked in the printout as a corollary of printf printing numbers left justified.

	Extended precision:
	    0.1 = 1999999
	      K = 1999800 (16 bits of 0.1)
	  0.1-K = 199 (normalize = shift left 16)
	    0.5 = 8000000
	0.5-5*K = 800
	    0.7 = b333333
	0.7-7*K = b33
	    0.9 = e666666
	0.9-9*K = e66
	    1.0 = 10000000
	1.-10*K = 1000

This 16-bit normalization pulls the hidden 12 bits out into the open, where they can then be stored, printed, etc. Here is what the 12 extension bits look like for dimes 1 to 10; they are the top 12 bits of the following ten 16-bit hex quantities.

	99a0 99a0 3340 99a0 0000 3340 6680 99c0 cd00 0020

Assuming the extension really is 12 bits, I calculated the third entry to be 3330 instead of 3340; why the discrepancy? Are there really only 11 bits? This would make sense if the purported 64-bit internal mantissa includes the hidden leading bit, which in turn would make sense because this would then make it just a normal 64-bit integer register.

Here's the program that produced this. No comments, I don't distribute my jigsaw puzzles assembled.
#include <stdio.h>

#define K (((1<<17)/10)/((1<<17)+0.0))	/* first 16 bits of 0.1 */

main()
{
	int i;
	float x;
	double t, y, z[11];
	t = 10;
	printf("Single precision:\n");
	for (x = 1./10; x < 1.01; x += 1./10)
		printf("%.4x ", ((unsigned*)&x)[0]&0xffff);
	printf("\nDouble precision:\n");
	for (y = 1./10; y <= 1; y += 1./10)
		printf("%.4x ", ((unsigned*)&y)[0]&0xffff);
	printf("\nExtended precision:\n");
	printf("    0.1 = %x\n", (int)(.1*(1<<4)*(1<<16)*(1<<8)));
	printf("      K = %x (16 bits of 0.1)\n", (int)(K*(1<<4)*(1<<16)*(1<<8)));
	printf("  0.1-K = %x (normalize = shift left 16)\n",
		(int)((.1-K)*(1<<4)*(1<<16)*(1<<8)));
	printf("    0.5 = %x\n", (int)(.5*(1<<4)*(1<<16)*(1<<8)));
	printf("0.5-5*K = %x\n", (int)((.5-5*K)*(1<<4)*(1<<16)*(1<<8)));
	printf("    0.7 = %x\n", (int)(.7*(1<<4)*(1<<16)*(1<<8)));
	printf("0.7-7*K = %x\n", (int)((.7-7*K)*(1<<4)*(1<<16)*(1<<8)));
	printf("    0.9 = %x\n", (int)(.9*(1<<4)*(1<<16)*(1<<8)));
	printf("0.9-9*K = %x\n", (int)((.9-9*K)*(1<<4)*(1<<16)*(1<<8)));
	printf("    1.0 = %x\n", (int)(1.0*(1<<4)*(1<<16)*(1<<8)));
	printf("1.-10*K = %x\n", (int)((1.-10*K)*(1<<4)*(1<<16)*(1<<8)));
	z[1] = 1./t-1*K;
	z[2] = 1./t+1./t-2*K;
	z[3] = 1./t+1./t+1./t-3*K;
	z[4] = 1./t+1./t+1./t+1./t-4*K;
	z[5] = 1./t+1./t+1./t+1./t+1./t-5*K;
	z[6] = 1./t+1./t+1./t+1./t+1./t+1./t-6*K;
	z[7] = 1./t+1./t+1./t+1./t+1./t+1./t+1./t-7*K;
	z[8] = 1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t-8*K;
	z[9] = 1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t-9*K;
	z[10] = 1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t+1./t-10*K;
	for (i = 1; i <= 10; i++)
		printf("%.4x ", ((unsigned*)&z)[2*i]&0xffff);
	printf("\n");
}

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug53----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15978 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: What is 'Virtually full double precision accuracy'?
Date: 24 Dec 1994 00:49:01 GMT
Organization: Computer Science Department, Stanford University.
Lines: 44
Message-ID: <3dfr5t$291@Radon.Stanford.EDU>
References:
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.arch:15978 comp.sys.intel:27225

In article , M. C. Ertem wrote:
>I was just going through intel's www site and I ran across the
>following quote [1]:
>
>  'But if 3.0 is approximated to 16 digits as 2.999999999999999, then
>  the accuracy expected in the quotient is 13 to 14 significant digits,
>  virtually full double precision accuracy even in the presence of a
>  flawed result.'
><...>
>So, what in the world is 'virtually full double precision accuracy'?

This is a meaningful notion in context, where Mike Barton was addressing the question of whether very slight bruising can yield much larger errors. Mike claimed a negative answer as a corollary of my bug1 results (boole.stanford.edu:/pub/FDIV/README). By "virtually full double precision" Mike meant that superlight bruising as with IBM's 4.1-1.1 divisor can yield only superlight errors. However the domain of bug1 is restricted to small bruised integer *operands* (both dividend and divisor).
This domain is of interest because there the error rate approaches one 10^-7-or-larger error in 2,500 divisions at 10^-6 bruising, along with one 10^-5-or-larger error in 70,000 divisions. Unfortunately this is not the only domain that users have to worry about when 4.1-1.1 shows up as a bad divisor. When the dividend is purely random (equivalently, a very large integer), the error rate is only one in several hundred thousand, but offsetting this lower rate is that superlight bruising can then yield much larger errors than are possible for small bruised integer dividends, where the errors that do occur come at higher rates. With the superlightly bruised divisor 4.1-1.1, errors on the order of 10^-5 can occur at a rate of one every three or four million divisions. This is the situation that the IBM observations are most relevant to.

Heisenberg's principle is at work in heavy disguise here (the duality of safety/information and liveness/time; see the papers on boole.stanford.edu on this topic, particularly those on Chu spaces). With very light bruising one can get either relatively large errors or relatively high error rates; to get both together one must increase the bruising. The table in bug1 shows that bruising at 10^-6 is about the level of irritation that drives the Pentium bug into a frenzy. Much more and you knock it unconscious.

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug54----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 15987 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch
Subject: Whose Pentium bug is this? Intel's, gnu's, or the programmer's?
Date: 24 Dec 1994 09:07:19 GMT
Organization: Computer Science Department, Stanford University.
Lines: 65
Message-ID: <3dgoc7$i3a@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27283 comp.arch:15987

I'm running Linux with gcc 2.5.8 on my Pentium. The following program came up while I was investigating the Pentium FDIV bug. When I compile this program with -O (optimize) and then run it, it outputs "here today, gone tomorrow." If you follow the logic of the main loop you will see that it can only execute once, when a==6, since adding one to 6 yields exactly 7. Hence K-(K/b)*b should not change from one invocation to the next. Hence depending on whether it is zero or not, the program should print either "here today, here today" or "gone tomorrow.\ngone tomorrow." It certainly should not print something different when called with the same value each time.

#include <stdio.h>

#define K (10-1e-13)

f(double x) { printf("%s", x? "here today, ": "gone tomorrow.\n"); }

main()
{
	double a, b;
	for (a = 6; a < 7; a++) {
		b = 8.1 - a/10.;
		f(K-(K/b)*b);
	}
	f(K-(K/b)*b);
}

I presume what is happening is that the first (K-(K/b)*b) is calculated on the stack using the value of b left on the stack by the previous subtraction, which will differ from 7.5 in the hidden extension bits, and which is a buggy division.
However the second (K-(K/b)*b) presumably no longer has b on the stack and so has to get it from variable b, where it has been truncated to double precision, becoming exactly 7.5, where the division is not buggy. When this program is compiled without -O it prints "gone tomorrow.\ngone tomorrow." as one would expect.

The question for people who keep their buggy Pentiums then becomes, how do you reason about your program given that the optimizer can change the program's behavior in this way?

What I'd like to know is, whose fault is this strange dependence on the -O flag? Intel's, Gnu's, or mine for writing such a silly program? I am rather annoyed at this quirk, since I wasted several hours trying to figure out whether I was going crazy or the Pentium was more broken than I had realized. The above is the distillation of the bug down to a small program, starting from a much larger one whose seemingly almost nondeterministic behavior had me totally mystified. The large one had no constants in it except 0, 1, and 10, and was far from silly when I wrote it. The constants in this distillation are for the sake of having a simple concrete example.

This incident points up a hazard of having an FPU that runs at a different precision (80 bits) from the precision of the operands being loaded and unloaded (64 bits). RISC machines that have 32-bit, 64-bit, and 128-bit floating point should not encounter the above problem when arithmetic is done at the same precision as the operands. Anyone care to comment?

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug55----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 27286 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: TECHNICAL: What is the simplest example ?
Date: 24 Dec 1994 10:00:38 GMT
Organization: Computer Science Department, Stanford University.
Lines: 29
Message-ID: <3dgrg6$jjn@Radon.Stanford.EDU>
References: <3dasr6$7ad@news.primenet.com> <3dd9sc$dn@nntp.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3dd9sc$dn@nntp.Stanford.EDU>, wrote:
>Lawson English writes
>> What we need is the simplest example of the FDIV bug.
>>
>> The example posted earlier had was a bit to large to make the point.
>
>I don't know if it fits your constraints, but this is the minimal FDIV
>test I've whittled my way down to based on postings to this group:
>
>void main() {
>    float x = 3.0 - (4.1 - 1.1);
>    puts(((((5 - x) / (15 - x)) * (3 + x)) >= 1) ? "good" : "bad");
>}

Try 1/(8.3-8.2)/(8.2-.7). Theoretically this equals 1/.1/7.5 = 4/3 = 1.3333333333333. The Pentium gets 1.3333333330850. Not a spectacular relative error at only 1.8*10^-9, but clearly wrong.

When testing this in C, don't write it as a constant expression, since C will try to compute the whole constant very accurately at compile time. Prevent this by assigning a subexpression to a variable, as follows.
double x;
printf("%.12f\n", (x = 1/(8.3 - 8.2))/(8.2 - 0.7));

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html

Followup: Incidentally, of the 25 million or so expressions of the form 1/(d.d-d.d)/(d.d-d.d) where both differences are positive, 814 are in error. This constitutes an error rate of one in thirty thousand, or thirty a second if all you are doing is evaluating random expressions of this form.

=================================================================
-------------------------PENTIUM REPORT # bug56----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 27346 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch
Subject: Re: Whose Pentium bug is this? Intel's, gnu's, or the programmer's?
Date: 24 Dec 1994 21:52:51 GMT
Organization: Computer Science Department, Stanford University.
Lines: 96
Message-ID: <3di57j$9mb@Radon.Stanford.EDU>
References: <3dgoc7$i3a@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27346 comp.arch:15993

In article , Anil Das wrote:
>In article <3dgoc7$i3a@Radon.Stanford.EDU>, pratt@Sunburn.Stanford.EDU
>(Vaughan R. Pratt) writes:
>> [gcc gives too much precision with -O switch].
>
>> What I'd like to know is, whose fault is this strange dependence on the
>> -O flag? Intel's, Gnu's, or mine for writing such a silly program?
>
>	I think the fault is yours. The C standard doesn't say
>there shouldn't be too much precision. (Maybe the Fortran standard
>does?)

This is naive. If people are going to take a complex-integral viewpoint of the C standard as the sum of its loopholes, then the C standard can go jump in a lake. It is C's place to prescribe some things and preserve others. Precision falls into the latter category; C should preserve the precision of the platform at hand. If the C standard does not say this then the C standard is at fault, not me.

So are you saying that it wouldn't bother you if a program were able to find the inverse of some matrix when not optimized, but after optimization, still on the same platform, it changed its mind and decided the matrix was singular? I've never before encountered a situation where this sort of thing actually happened (as opposed to being a contrived hypothetical case) and the blame was not placed on the optimizer.

IEEE arithmetic obeys certain laws, including commutativity of addition and multiplication, but not associativity. I would argue that an optimizer should not assume *as a default* any falsifiable law. I have no objection to the programmer lying about such laws to the compiler, e.g. as a compiler option, in those rare instances when there is a detectable performance gain. It should however be possible for the user of the binary to interrogate it at any time about all liberties taken with the truth in its compilation.

Incidentally I should clarify the nature of the dependency. As I said, I was able to change the behavior of the program just by turning the -O flag on and off. But I was also able to change the behavior of the program without touching the -O flag (leaving it on), simply by adding a certain statement that came *several instructions after* the point where the behavior changed!
It looked for all the world as though the computer could sense whether this new statement was going to happen. Hopefully this makes it clearer why I was tearing my hair. Now tell me *that's* my fault.

So why didn't I just stop typing -O? Well, I hadn't even noticed it. I was compiling the program, hack.c, with "make hack", but without a Makefile (the "make" command defaults to an ordinary compilation when there is no Makefile). Unfortunately there *was* a makefile in the current directory, and although it did not contain any reference to hack.c, it did set CFLAGS to -O, which I had not thought to pay any attention to. Needless to say I removed it immediately once I found that it was the cause of the problem.

>	You could have an initialization routine called to
>set the control word of the FPU to `double precision'. That should
>solve the problem, since it will hold the precision down to 64
>bits (shouldn't we say 54 bits, seeing as how the mantissa is
>actually 54 bits including the hidden bit?) even when the operands
>are on the FPU stack.

(53 bits as Ron Newman points out.) So when do I apply this solution? Five hours after tearing my hair out? Five minutes after hearing from my client's lawyer that my program encountered an unexpected singular matrix and shot down a helicopter? Or permanently? None of these is appealing. With regard to the last, why should I have to give up the additional precision possible within an arithmetic expression, a benefit derived at every instruction, just because of a million-to-one chance of running into an extremely freaky effect?

Having slept on this question, I now firmly believe the optimizer is at fault here. The optimizer's crime is *failure to preserve precision casts.* For the case at hand, b = 8.1 - a/10.; f(K-(K/b)*b);, the unoptimized code casts b to double precision. The optimized code however is schizophrenic about b's precision. After the assignment, 8.1 - a/10. is stored in two locations, the FPU stack and variable b, where it has respective precisions extended and double. The optimizer pays no attention to this difference and simply takes b from whichever works best. This is insidious. That the choice can be affected by downstream instructions makes it insidiously insidious. I could go bald.

In order to preserve the semantics implied by the unoptimized code, the optimizer should either take b from the variable or round the FPU entry to double precision. The half microsecond of computer time saved by capitalizing on schizophrenia is nothing compared to hours of lost programmer time tracking down the mind-bogglingly inscrutable behavior possible when this is not done.
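For reference, here is a sketch (not part of the original post) of the initialization routine suggested above, written with the macro names of later Linux/x86 glibc, which are an assumption for the gcc 2.5.8 era; it sets the x87 precision-control field so every result is rounded to 53 bits. Note that the exponent range remains that of extended precision, so this approximates rather than exactly emulates pure 64-bit arithmetic.

#include <fpu_control.h>

void fpu_round_to_double(void)
{
	fpu_control_t cw;
	_FPU_GETCW(cw);				/* read the control word */
	cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;	/* precision field = 53 bits */
	_FPU_SETCW(cw);				/* write it back */
}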
--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug57----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
From pratt@Sunburn.Stanford.EDU Sat Dec 24 14:51:41 PST 1994
Article: 15994 of comp.arch
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch
Subject: Re: Whose Pentium bug is this? Intel's, gnu's, or the programmer's?
Date: 24 Dec 1994 22:42:38 GMT
Organization: Computer Science Department, Stanford University.
Lines: 22
Message-ID: <3di84u$b4e@Radon.Stanford.EDU>
References: <3dgoc7$i3a@Radon.Stanford.EDU> <3di57j$9mb@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27358 comp.arch:15994

In article <3di57j$9mb@Radon.Stanford.EDU>, Vaughan R. Pratt wrote:
>
>Having slept on this question, I now firmly believe the optimizer is at
>fault here. The optimizer's crime is *failure to preserve precision
>casts.* For the case at hand, b = 8.1 - a/10.; f(K-(K/b)*b);, the
>unoptimized code casts b to double precision. The optimized code
>however is schizophrenic about b's precision.

At Ron Newman's suggestion I tried declaring b as long double (should have tried this earlier). This cleared up the apparent randomness by removing the precision casts and associated schizophrenia. Now with -O either on or off it prints "here today, here today, " as one would predict.

When b was double, the unoptimized code printed "gone tomorrow\ngone tomorrow\n" as expected, while the optimized code printed either that or "here today, gone tomorrow\n" depending on the phase of the moon.

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html
What you see when you add isn't what you've got
Four quarters make a dollar, ten dimes do not

=================================================================
-------------------------PENTIUM REPORT # bug58----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 16000 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic
Subject: Re: TECHNICAL: FDIV bug according to Intel, IBM and PC Week...
Date: 25 Dec 1994 06:47:56 GMT
Organization: Computer Science Department, Stanford University.
Lines: 77
Message-ID: <3dj4is$nhh@Radon.Stanford.EDU>
References: <3diutq$ing@news.primenet.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27404 comp.arch:16000 comp.arch.arithmetic:927

In article <3diutq$ing@news.primenet.com>, Lawson English wrote:
>
>A comparison of the analyses of the FDIV bug by Intel, IBM and PC Week
>leads to one conclusion:
>
>None of these addresses the most likely problem scenarios:
>
>"bruised-integer divisions" whose numerator and denominator both lie
>between 10 and 100.

It might be appropriate here to repeat the concluding paragraph of my bug1 post (from boole.stanford.edu:/pub/FDIV).

+While this scenario puts the FDIV bug in what looks at least to me like
+a pretty bad light, I do not claim it is anything like the worst case
+scenario. The two essential factors for any scenario are the rate at
+which that scenario triggers the FDIV bug, and how often the essence of
+that scenario arises in the real world. The "damage index" of a
+scenario is the product of those two factors. A scenario that accounts
+for thousands of hours of actual Pentium time but that only triggers
+the bug say every ten minutes may well have a higher damage index than
+my scenario above, whose strength is that it triggers the bug every few
+milliseconds but whose weakness is the uncertainty as to how likely it
+really is in practice. You must admit however that it is not a
+completely implausible scenario.
The point is that *neither* of the extremes represented by random data and SBI's (small bruised integers) need be the worst case. You have to weigh (at least) three things: the error rate, the error size, and the likelihood of encountering that scenario. What Lawson English calls "the most likely problem scenarios" are what I would prefer to call "very bad case scenarios should they ever arise." By the same token pure random data is "a very good case scenario should it ever arise." Neither the Intel scenario nor my SBI scenario is *the* worst or best case. To get the worst case, repeatedly calculate 4195835/3145727; for the best case, simply change this to 4195836/3145727, whose error rate is zero (and back up to one again at 4195836/3145727.25).

I would prefer to rephrase "most likely problem scenario" to "scenario with the highest damage index." It is easy to classify scenarios by how bad they would be *if* encountered. The difficulty in assessing the damage index of a scenario is in determining how commonly that scenario is encountered (one hundredth vs. one millionth of all computer usage), and how serious are the consequences of errors (a little bank or a big one, a four-seater plane or a jumbo jet, etc.).

>the above
>suggests that we have yet to get even a rough estimate for the potential
>"real world" problems that can arise from the FDIV bug.

This is exactly right.

>My impression is that the potential problem for financial spreadsheet
>users is probably at least an order of magnitude worse than what PC Week
>has suggested.

I have *no* idea. I neither wish to play Ralph Nader nor encourage an era of casual disrespect for the consequences of small arithmetic errors, whether by virtue of their accumulation or their unfortunate location. The potential magnitude of the problem combined with the enormous uncertainty makes the question of its significance a most urgent one. Thus far Intel's position that it leaves most if not all users none the worse for wear has been neither refuted nor proved in a convincing way. Whether these small division errors present a serious risk to some or are no worse than a piddling nuisance to everyone still remains to be determined.

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html
What you see when you add isn't what you've got
Four quarters make a dollar, ten dimes do not

=================================================================
-------------------------PENTIUM REPORT # bug59----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 27472 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Intel - The coverup continues
Date: 26 Dec 1994 01:31:00 GMT
Organization: Computer Science Department, Stanford University.
Lines: 20
Message-ID: <3dl6ck$md4@Radon.Stanford.EDU>
References: <3dg3ne$h52@also.hooked.net>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3dg3ne$h52@also.hooked.net>, Keith McDonald wrote:
>What independent observers? I have yet to see
>one person not connected with IBM, AMD, Motorolla,
>or Apple (like yourself) back their "study."
Let me encourage you to read comp.sys.intel more closely; I have posted to it some 50 messages (averaging 100 lines) reporting on my independent study of the problem. These are also available by anonymous FTP from boole.stanford.edu, collected in one file as /pub/FDIV/bugs, and from mathworks.com's pentium directory as Pratt.txt.

(i) You should find my research backs IBM's study quite well.
(ii) I have no connection with IBM.

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html
What you see when you add isn't what you've got
Four quarters make a dollar, ten dimes do not

=================================================================
-------------------------PENTIUM REPORT # bug60----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 27504 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,comp.arch,comp.arch.arithmetic
Subject: Re: TECHNICAL: FDIV bug according to Intel, IBM and PC Week...
Date: 26 Dec 1994 07:17:06 GMT
Organization: Computer Science Department, Stanford University.
Lines: 117
Message-ID: <3dlqli$1ur@Radon.Stanford.EDU>
References: <3diutq$ing@news.primenet.com> <3dj4is$nhh@Radon.Stanford.EDU> <3dje1n$7om@news.primenet.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:27504 comp.arch:16007 comp.arch.arithmetic:933
Summary: PC Week may well not have focused on the most serious manifestations of the Pentium bug. However the goal is not so much to come up with a single number characterizing how serious the bug is, but rather to give users enough insight into its nature to permit them to assess its seriousness for their own applications. It is possible to alert people to the need to pay attention to the problem, a service that PC Week seems to have rendered its readers, without depending on examples that "push the badness envelope."

In article <3dje1n$7om@news.primenet.com>, Lawson English wrote:
>
>I didn't say what I meant to say, or at least not terribly well.
>
>My point was that PC Week took you at your word with the phrase "small
>bruised integers," and evaluated the risk for the smallest integers
>possible. (single digit).

I should double-check here: you mean the PC Week white paper, http://www.ziff.com/~pcweek/reviews/dec_1994/wh_paper.html, the one posted here by Allen Kim on Dec. 20? That paper talks about calculations involving numbers of the form a.b for single digits a and b. Is that what you mean by "single digit integers"?

Numbers of the form a.b are a "natural" source of superlight bruising by virtue of (i) being decimal and hence not all exactly representable in binary, and (ii) having a better than one in ten chance of producing an integer or a "small" binary rational, meaning m/2^n where m and n are small integers, when added or subtracted. When three or more are added, or two or more subtracted, any resulting integer or binary rational has a chance of about one in two of not being exact, in which case it will be superlightly bruised.
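This kind of bruising is easy to exhibit on any IEEE machine. A minimal C sketch (not part of the original post) that scans the differences a.b - c.d which ought to be exact integers and counts how many miss; 4.1 - 1.1 is the now-famous instance:

#include <stdio.h>

int main(void)
{
	int a, c, bruised = 0, integral = 0;

	for (a = 10; a <= 99; a++)	/* a.b scanned as a/10. */
		for (c = 10; c < a; c++) {
			double d = a/10. - c/10.;	/* exact value is (a-c)/10 */
			if ((a - c) % 10 == 0) {	/* difference should be an integer */
				integral++;
				if (d != (double)((a - c)/10))
					bruised++;	/* a superlightly bruised integer */
			}
		}
	printf("%d of %d integer-valued differences are bruised\n",
		bruised, integral);
	printf("4.1 - 1.1 = %.17g\n", 4.1 - 1.1);	/* 2.9999999999999996 */
	return 0;
}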
Now the populations I talked about on December 3 in bug1 and bug2 are a bit different. Instead of starting with unbruised decimals a.b and letting bruising arise naturally by addition and subtraction, I started with already bruised small integers, with no particular model of what caused the bruising.

Bruising can happen in various ways and at various rates and magnitudes. To understand this aspect of bruising one would need to explore a lot of scenarios that might lead to bruising, not just the one that PC Week focused on in their December 16 article. It seemed to me that the most useful information would be obtained by directly studying the impact of bruising, *independently of* its cause, on error rates and magnitudes, as a function of the magnitude of the bruising. One could then extrapolate from this raw data to any particular scenario involving bruising at a given rate and level.

My experiments yielded the following basic properties of bruised integers.

1. The level of bruising that irritated the bug the most is around 10^-6. Higher than 10^-4 and the bug no longer recognized it as bruising, since it was interfering with the requisite pattern of 4 special bits followed by 8 1's.

2. For integers in the range 1 to 1000, with 10^-6 bruising, an error greater than 10^-5 happens every 70,000 divisions, while one greater than 10^-7 happens every 2,500 divisions.

3. Reducing the limit from 1000 to 100 reduces the above 1/70,000 rate to 1/2,000 and the above 1/2,500 rate to 1/200.

I agree this is a lot higher than anything PC Week found. I also agree with those posters who have questioned PC Week's interpretation of their own results as placing them *between* IBM and Intel. If anything it makes PC Week more extreme than IBM (but not as extreme as the above data on pure sources of small bruised integers).

It is natural to ask whether small bruised integers can arise in practice, and in what quantities and at what levels of bruising. One way that they can arise naturally is from any data source that does its own decimal truncation. Such a source might well produce 10^-6 bruising at a very high rate. And one can come up with other scenarios for bruising at these levels. One such is the Pentium itself, whose errors are at just the right level to really arouse the bug: the Pentium will get much more angry with its own 10^-5 to 10^-7 errors than with those arising via normal truncation on any machine (superlight bruising) at around 10^-18.

But in view of the enormous variability of both applications and data encountered by those applications, I don't see that a terribly useful purpose can be served by speculating on what *might* happen. Far better for each user to examine their own application to determine whether bruising is likely to constitute a serious problem for them. Until a user has determined such statistics for the typical ranges of data encountered in their own application (as opposed to what Intel says they encounter), that user cannot rule out with certainty error rates and magnitudes as high as those indicated in item 3 above. Another point to bear in mind is that there may lurk yet higher rates via some other mechanism than small bruised integers. Users need to be on the lookout for such situations.

So yes, PC Week may have understated the seriousness of the problem. But I think even the picture they have painted is serious enough that users should be very concerned as a result of the PC Week article. PC Week's main purpose seems to be the same as mine: to send a warning signal alerting unsuspecting users to pay close attention to their applications.
This should include learning more about the bug than PC Week has to offer, and hopefully people will install signposts to indicate where to find such additional information. The IBM report is one such place, mathworks.com's ftp/www site is another, my /pub/FDIV directory on Boole.Stanford.EDU is yet another, and hopefully other helpful information on the bug will come to light as more people join in its investigation and more is learned about it. There is a lot more to this business than pure technology, a point Intel made in its December 20 repentium.

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html
What you see when you add isn't what you've got
Four quarters make a dollar, ten dimes do not

=================================================================
-------------------------PENTIUM REPORT # bug61----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 27510 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: TECHNICAL: What is the simplest example ?
Date: 26 Dec 1994 07:48:57 GMT
Organization: Computer Science Department, Stanford University.
Lines: 15
Message-ID: <3dlsh9$2ms@Radon.Stanford.EDU>
References: <3dasr6$7ad@news.primenet.com> <3dgrg6$jjn@Radon.Stanford.EDU> <3di9lh$bkp@radon.stanford.edu> <3dif6o$pv3@news.jf.intel.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3dif6o$pv3@news.jf.intel.com>, Kumar Anshumali wrote:
>So, what is the bottomline ? It is that you *need* a statistical measure
>of user-space of numbers and computations. The only published studies on
>this so far are Intel, IBM and PC Week. Do you have some new usage data ?

I was unaware of this. Could you be more specific as to what "usage data" is currently in print from those sources, particularly from IBM and PC Week? I'm aware of Intel's argument that typical users use random numbers. What other usage data were you referring to?

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html
What you see when you add isn't what you've got
Four quarters make a dollar, ten dimes do not

=================================================================
-------------------------PENTIUM REPORT # bug62----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 27591 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: "The Official Truth" from Intel appears to be winning
Date: 26 Dec 1994 20:52:33 GMT
Organization: Computer Science Department, Stanford University.
Lines: 80
Message-ID: <3dnaeh$svf@Radon.Stanford.EDU>
References:
NNTP-Posting-Host: sunburn.stanford.edu

In article , Greg Fife wrote:
>
>Another part of the problem is that this bug has no immediate or direct
>impact on the average citizen. It is a lot harder to tell a good story
>about an inaccurate Computational Fluid Dynamics result than it is to
>tell a good story about a broken word processor.
>The typical home user doesn't even own a Pentium, and they don't see this
>particular problem as a threat even if they were to own one. Thus, they
>don't care.

The typical home Pentium owner probably *doesn't* need to care. The Pentium is like a car whose exhaust emission has a 10% level of carbon monoxide. California legislation today limits it to 1%, but imagine the response you would get from the auto trade journals in 1935 if you tried making a fuss about a car that you felt had an unduly high level of CO emissions. Everyone would be asking who's ever been injured by such a small amount of CO injected into the great outdoors. Sure, people have died, intentionally or not, by venting it into their garage, but that's not the typical situation; that's asking for trouble by standing under a meteor.

Many Pentium users don't give a hoot and probably don't need to. Provided they keep their Pentium pollution in their own closed-system incinerator there's no reason they should. It would cost Intel a bundle to replace all those processors, and I see no point in making life hard for Intel by getting users worked up unnecessarily about a problem they're unlikely to be seriously affected by.

Of much greater concern is Pentium pollution which is vented into the "great outdoors," namely by those systems that don't live in a closed little world of their own. The industry has correctly intuited that there's a great deal of capacity out there to absorb a steady stream of numerical pollution, especially given how small that stream is, just like a 10% level of CO streaming out of a car exhaust. What the industry has *not* done however is to compare that capacity to the likely pollution rate and to establish with some reasonable certainty that the several million polluting Pentiums out there will not present a serious hazard over their expected lifetime. The nearest it has come to this is to perform some naive back-of-the-envelope calculations that we have all seen can vary by many orders of magnitude depending on whose envelope you read. The truth of the matter is that no one has a clue, and most simply have their head in the sand and refuse to admit that there could be any problem.

We don't yet know what the full range of Pentium hazards is. The small-bruised-integer hazard is new as of this month. A hazard that had not occurred to anyone until a couple of days ago is that the Pentium error rate can feed on itself, as follows. The worst level of bruising for small integers, around 10^-5 to 10^-7, is also the level of the Pentium pollution. Thus if you have a million Pentiums sending data to each other, whether by bus, serial line, ethernet, or mag tape, then in addition to normal truncation-induced bruising at the 10^-18 level, there will also be Pentium-induced bruising at the 10^-6 level. At *that* level of bruising, the error rates for bruised integers less than 100 go through the roof, hitting a rate of one >10^-7 error every 200 divisions by bruised integers and one >10^-5 error every 2,000 such divisions. A network of Pentiums is like a saloonful of hooligans ready for a barroom brawl.

Other hazards that have been pointed out this month include any processes that can be understood as either differentiation, where small changes in x or y can lead to large changes in x/y (the whiplash example), or as integration, where a steady stream of small errors can accumulate to form a large error. We really have not come to grips with the full range of hazards presented by the Pentium.
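The differentiation hazard is easy to demonstrate on correct hardware by injecting a Pentium-sized perturbation by hand. A C sketch (not part of the original post; the 10^-6 relative bruise below stands in for a worst-case division error in one of the function values):

#include <stdio.h>
#include <math.h>

int main(void)
{
	double t = 1.0, h = 1e-6;
	double f0 = sin(t), f1 = sin(t + h);
	double good = (f1 - f0) / h;			/* ~cos(1) = 0.5403... */
	double bad  = (f1 * (1 + 1e-6) - f0) / h;	/* one value bruised by 1e-6 */

	printf("clean derivative:  %.6f\n", good);
	printf("with 1e-6 bruise:  %.6f\n", bad);
	printf("relative error blew up from 1e-6 to %.2f\n",
		fabs(bad - good) / fabs(good));
	return 0;
}

The small difference f1 - f0 magnifies the 10^-6 bruise into an error of order 1 in the derivative estimate, the whiplash in question.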
In view of how bad the errors *can* get under certain not entirely unnatural circumstances, anyone who insists that a buggy Pentium is safe for use in applications where it can affect either people or other computers has a responsibility to demonstrate that none of the known hazardous modes of the Pentium presents a serious risk for that application. Whether they also have to rule out as-yet-undiscovered hazards is a good question.

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug63----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 27633 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: clari.net.talk,comp.sys.intel
Subject: Re: "Intel Boo-Boo Worries Some" -- AP/ClariNet article
Date: 27 Dec 1994 01:23:33 GMT
Organization: Computer Science Department, Stanford University.
Lines: 88
Message-ID: <3dnqal$78s@Radon.Stanford.EDU>
References: <1994Dec26.143453.10150@news.media.mit.edu>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU clari.net.talk:879 comp.sys.intel:27633

In article <1994Dec26.143453.10150@news.media.mit.edu>, Ron Newman wrote:
>From an un-bylined Associated Press article that was posted to the
>newsgroups clari.biz.features, clari.tw.computers, and
>clari.living.consumer with the Message-ID
>. The article appeared in this
>morning's (December 26) Boston Globe, and presumably in many other
>newspapers today.
>
>> For each of the 1,700 combinations that could go wrong, 9
>>billion others can be divided accurately.
>
>This is surely inaccurate. If the bug occurs for a particular
>pair of floating-point numbers, then it will also occur if either or
>both numbers are scaled by any power of 2. That's because it depends
>only on the base-2 mantissas of the numbers involved, not on their
>base-2 exponents.

For each of the 1,700 reporters capable of getting this statistic exactly right, 9 billion others will find some way to screw it up. Another problem with the above is that the number 1738 is for single-precision in, single-precision out, where the rate is not one in 9 billion but one in 40 billion. The 1 in 9 billion rate is for double-precision out (and any precision in).

Not that it is going to solve this reporting problem, but it occurred to me after posting my 3x3 matrix of error counts that this matrix was what is sometimes called an outer or exterior product, by which I mean one representable as uv^T where u is a column vector and v^T is a row vector. This means that one does not need to know all 9 numbers to recreate the table, only 6: three in the vector u, three in v. The column vector u is that of total population, i.e. the number of possible operand pairs, with vector entries for single precision, double precision, and extended precision. The row vector v is that of error rates, with entries 1 in 40 billion for single precision, 1 in 9 billion for double, and 1 in 7 billion for extended. (You did pay for extended precision when you bought your Pentium, so don't let Intel get away with the unqualified statement that the error rate is one in 9 billion for random operands; that's only for double precision errors.)
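A C sketch (not part of the original post) that regenerates the table below as this outer product, with u holding the populations 4^23, 4^52, 4^63 and v the three error rates:

#include <stdio.h>
#include <math.h>

int main(void)
{
	const char *prec[3] = { "Single", "Double", "Extended" };
	double u[3] = { pow(4, 23), pow(4, 52), pow(4, 63) };	/* operand-pair populations */
	double v[3] = { 2.470e-11, 1.117e-10, 1.409e-10 };	/* per-quotient error rates */
	int i, j;

	printf("%-10s %13s %13s %13s %13s\n",
		"", "Single", "Double", "Extended", "TOTAL");
	for (i = 0; i < 3; i++) {
		printf("%-10s", prec[i]);
		for (j = 0; j < 3; j++)
			printf(" %13.4g", u[i] * v[j]);	/* entry = population * rate */
		printf(" %13.4g\n", u[i]);
	}
	return 0;
}

The Single/Single entry comes out as 2.470*10^-11 * 7.037*10^13 = 1738, and the rest of the matrix follows the same recipe.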
To remember the population sizes it suffices to know that each operand has 23, 52, or 63 bits of information. (The table I posted previously was based on the last being 64, but I did not then know that 64 included the constantly-1 leading bit.) The total population of all pairs is then 4 to the number of bits per operand, e.g. 4^23 = 7.037*10^13. Remembering the double precision error rate is easy: everyone knows the magic number 1 in 9 billion, which understates the rate by only half a percent. Giving the single precision rate as 1 in 40 billion overstates it by 1%, as does giving the extended precision rate as 1 in 7 billion.

Now you have all 6 numbers, which form respectively the right hand column ("TOTAL") and bottom row ("ERROR RATE") of the following table. To get the entry on row i, column j, multiply the total for row i by the rate for column j, e.g. 2.470*10^-11 * 7.037*10^13 = 1738.

                        NUMBER OF PENTIUM ERRORS
                 by precision of operands and quotient

                                Quotient precision
 Operand      ----------------------------------------------
 precision       Single         Double         Extended        TOTAL
 ---------------------------------------------------------------------
 Single          1738           7863           9915            7.037*10^13
 Double          5.009*10^20    2.266*10^21    2.858*10^21     2.028*10^31
 Extended        2.101*10^27    9.502*10^27    1.199*10^28     8.507*10^37
 ---------------------------------------------------------------------
 ERROR RATE      2.470*10^-11   1.117*10^-10   1.409*10^-10

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug64----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 27796 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Santa brought me a GOOD PENTIUM!
Date: 28 Dec 1994 01:52:27 GMT
Organization: Computer Science Department, Stanford University.
Lines: 53
Distribution: usa
Message-ID: <3dqgcr$h0n@Radon.Stanford.EDU>
References: <1994Dec25.020010.4514@mercury.ncat.edu> <3dltcv$36s@Radon.Stanford.EDU>
NNTP-Posting-Host: sunburn.stanford.edu

In article , Anil Das wrote:
>In article <3dltcv$36s@Radon.Stanford.EDU>, pratt@Sunburn.Stanford.EDU
>(Vaughan R. Pratt) writes:
>> Same here (December 5). How many who joined the queue on or around
>> that date who have/haven't got their replacement yet? (I'm not
>> complaining, I get to keep finding out interesting new things about the
>> bug while I'm waiting.)
>
>	Looks like once Prof. Pratt and other technical people
>get there replacement Pentiums, studies about the bug will stop,
>except maybe inside Intel. I guess that is not very important
>now that Intel is replacing the chips on demand.

I was about ready to call Intel to ask how things were progressing when, as luck would have it, a Fed Ex truck showed up this afternoon (Dec. 27) with my replacement. It bore a ship date of Dec. 23 (from Oregon) and said "Priority Overnight." I guess Christmas must be that holiday which outlaws such miracles.
It took about one minute to replace, and bingo! Reliable arithmetic. The flaky chip was fun for a short fling (and none of my data seems to have been contaminated), but for the long haul I like my FPU to be boringly reliable.

But the flaky one at least had a quiet fan. The fan that came already mounted on my Repentium has a whine that completely drowns out its predecessor. Also its plug was the large 3/4" size, not the little .2" one I'd been plugging into the two-pin header right beside the CPU on my Intel Plato motherboard. After I've run it for two days (infant mortality check), I'll kill two birds with the one stone by switching fans. The heatsink seems to be *very* solidly epoxied to the lid, but the fan should unscrew easily from it.

I am very grateful to Intel for replacing this at no charge at all, not even the return shipment. I've written off the time invested in investigating the bug against the "curiosity driven" side of my research, hopefully also with a possible benefit to at least some people hoping to make a more informed decision about whether to buy, use, or replace a buggy Pentium. And I was brought up to date somewhat on the state of the art in FPU design, IEEE arithmetic, and the proper management of mixed precision.

We still have some buggy Pentiums around the department, in fact I have an X window open on one right now, so at some point when I'm in need of excitement I may yield to the temptation of another fling with fickle floating point, whose table was so cruelly deep-fived. Meanwhile back to writing about duality theory and its applications.

This post is bug64, btw. Very appropriate.

--
Vaughan Pratt              FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4    http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug65----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 28153 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,sci.math.num-analysis,comp.arch
Subject: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Contaminated Data Persists)
Date: 30 Dec 1994 06:26:58 GMT
Organization: Computer Science Department, Stanford University.
Lines: 381
Distribution: inet
Message-ID: <3e097i$952@Radon.Stanford.EDU>
References: <3dn52g$ohe@newsbf02.news.aol.com> <3dp96r$h7r@mhaal.inhouse.compuserve.com>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:28153 sci.math.num-analysis:10442 comp.arch:16045
Summary: We give an empirically determined upper bound on the time it would take a chain reaction in a critical mass of Pentiums to melt down a database, together with the program used to determine it.

In article , Henry Baker wrote:
>In article <3dp96r$h7r@mhaal.inhouse.compuserve.com>, "M. L." wrote:
>> > No, the fallout from the Pentium disaster is likely to be worse than
>> > Chernobyl (in the U.S., at least), since most of the contaminated data
>> > is here.
>> This is without a doubt the most idiotic statement found in the
>> sea of stupidity flowing through this news group.
>Perhaps you've never attempted to do any serious floating point work.
>I once worked at a research hospital where we did multiple regression
>analyses that went on for days.
>We were attempting to separate out
>the influences from different variables in very noisy data.
>A problem like the Pentium bug is the worst possible problem. It isn't
>large enough to be 'obviously' wrong, but you have to re-do _all_ of
>your calculations.

In the physical sense it is absurd to compare the Pentium with Chernobyl. I've been sitting beside mine for weeks; had it been remotely comparable to Chernobyl I wouldn't be able to lift a finger to type. Of course that's not what Henry had in mind; he meant a suitable analogy in the data processing world. What I found myself wondering was, just how broad does this analogy have to be in order to be apt? On investigating this I found that the analogy can be taken far more literally than even M. Lane presumably had in mind.

Chernobyl happened when its operators lost control of the reactor's chain reaction while conducting an ill-considered experiment. When the world's Pentiums ship each other flawed data, a natural question to ask is whether the data processing equivalent of a sustainable and uncontrolled chain reaction is possible. Continuing the analogy: is the Pentium's equivalent of a nuclear half-life (the time between errors) sufficiently short, and are the decay products (the error magnitudes) sufficiently strong, that a chain reaction in a sufficiently dense mass of Pentiums (relative to the number of participating non-Pentiums) could both start in a reasonable time and sustain itself?

Now we have seen many examples that trigger the bug with a probability greater than the official one in nine billion. The most questionable examples are those based on specific quotients such as 4195835/3145727 and 4.999999/14.999999, or somewhat more generic computations such as x/(d.d-d.d); many have wondered just how often these examples are really encountered in everyday computation, while many others are convinced that they are quite likely enough to present a real risk. Also questionable are the examples that embed some understanding of the nonuniformities in the distribution of the bad operands, e.g. my small-bruised-integer measurements. Just how at-risk really is a typical program that is ignorant of those nonuniformities?

With these concerns in mind, I designed the following "lab experiment," with a Pentium as the lab. The design of the experiment is not at all specific to the Pentium, and is simple enough both to implement in a reasonable time and to at least partially understand its nature. Its purpose is to allow one to observe the rate at which errors appear, and to study their propagation. The upshot was that errors appeared about a thousand times sooner than the random model predicts, and they propagated energetically enough to sustain a chain reaction essentially analogous to those taking place in atomic bombs and nuclear reactors.

The experiment simulated a computational world of indeterminate scope, one that could correspond either to a distributed database run by many computers or a centralized one run by one computer. This database's knowledge is obtained partly from "axioms" or "sensory input," i.e. clean data from outside the scope of the experiment, and partly by "deduction," i.e. computation performed within its scope. The latter is where the Pentium-induced damage arises. I had originally considered taking the "sensory input" to consist of financial amounts such as $9.98. However the more complicated the source the more suspect it becomes in terms of reflecting a possible bias towards Pentium-bad numbers.
So in the end I settled for a constant incoming stream of 1's, on the theory that if the number 1 all by itself is Pentium-bad, things are much worse than we thought! "Deduction" was simply the inference of new numbers from previously obtained ones via arithmetic operations chosen at random. All divisions were by flawed Pentium, corresponding to all computers participating in the maintenance of this database being flawed Pentiums. I did not investigate in this experiment what lesser density of Pentiums might still constitute a critical mass for a chain reaction.

In more detail, I first set up a numeric database consisting of a single number, namely 1. I then set a flawed Pentium (not mine, it's been fixed) to work adding new numbers to this database. These numbers came from two sources. The first source was the database itself; two numbers were selected at random from the database, combined by an operation randomly selected from among the four operations of addition, subtraction, multiplication, and division, and the result stored in the database. The second source was a stream of 1's, constituting "fresh data" moderating any chain reaction that might start up, like a graphite rod in a reactor core. These sources were weighted equally for the groups plus-minus, times-divide, and the ones, one third each. This was achieved simply by viewing these as six sources: four for the arithmetic operations and two for the stream of 1's treated as a pair of such streams. These six sources were sampled randomly, more precisely pseudorandomly with a uniform distribution using the Unix library integer rand() function.

I did not want the database to fill up with improbably large or tiny numbers, whose absurd sizes might raise eyebrows; on the other hand I also did not want an artificial cliff in the distribution, whose precise location might raise questions. I therefore wrote a macro ablg(x) to compute (int)(fabs(log_2(x)/32)) fast and used it in

	((rnd=(rnd%15)+1)%(ablg(cl)+ablg(di)+1) == 0)

as a fast stochastic filter of high-magnitude-exponent numbers, whose effect is first felt on positive numbers lying outside [2^-31,2^32] (and symmetrically for negative numbers). This filter rolls off gently, with probability 1/3 of retaining a number in [2^32,2^96] (and symmetrically in [2^-95,2^-31]), down to 1/15 for [2^416,2^480] (in general 1/(2n+1) for [2^{32(2n-1)},2^{32(2n+1)}] for n=1,...,7), and probability 0 thereafter (via the %15). After the database fills up, the program creates space for each new number by overwriting a randomly selected old number. Although I have 32Mb, I set the storage limit to 750,000 double-precision numbers (.75*8*2 = 12 Mbytes, there being two copies of the database) so that the program would run on 16Mb machines without swapping (the accesses are highly nonlocal and would cause thrashing if swapping happened).

To measure errors I ran two such databases, clean and dirty, in lockstep, performing the same operations on both using the same operand addresses and making all decisions identically for both (e.g. the above stochastic filter resolves disagreement between the clean and dirty numbers by in effect taking their geometric mean). The only difference is that clean division uses the Mathworks' Matlab solution. (For this sort of work a dirty Pentium is much faster than a clean one, because it is much faster to emulate clean division on a dirty Pentium than vice versa.)
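The following much-simplified C sketch (not the author's program) conveys the shape of the reactor loop just described: clean and dirty databases run in lockstep, fed by random arithmetic on random entries plus a moderating stream of 1's. The magnitude filter is reduced to a crude guard, the store is scaled down, and safe_div() is merely a placeholder for the Matlab-style software workaround; on correct hardware the two databases never diverge.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 10000			/* scaled down from 750,000 */

double cl[N], di[N];		/* clean and dirty databases */

double safe_div(double x, double y) { return x / y; }	/* placeholder */

int main(void)
{
	long n, filled = 1, melted = 0;
	cl[0] = di[0] = 1.0;

	for (n = 0; n < 100 * N; n++) {
		int op = rand() % 6;	/* +,-,*,/ plus two streams of 1's */
		long a = rand() % filled, b = rand() % filled;
		long dst = (filled < N) ? filled : rand() % N;
		double c, d;

		switch (op) {
		case 0: c = cl[a] + cl[b]; d = di[a] + di[b]; break;
		case 1: c = cl[a] - cl[b]; d = di[a] - di[b]; break;
		case 2: c = cl[a] * cl[b]; d = di[a] * di[b]; break;
		case 3: if (cl[b] == 0 || di[b] == 0) continue;
			c = safe_div(cl[a], cl[b]); d = di[a] / di[b]; break;
		default: c = d = 1.0; break;	/* the moderating 1's */
		}
		if (!isfinite(c) || !isfinite(d)) continue;	/* crude size filter */
		cl[dst] = c; di[dst] = d;
		if (filled < N) filled++;
	}
	/* survey the damage: entries whose relative error exceeds 0.1 */
	for (n = 0; n < filled; n++)
		if (fabs(di[n] - cl[n]) > 0.1 * fabs(cl[n])) melted++;
	printf("%ld of %ld entries melted down\n", melted, filled);
	return 0;
}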
Because no effort was made to produce Pentium-sensitive numbers of the kind familiar from the IBM report and my small-bruised-integer response functions (bug1 and bug2 in /pub/FDIV/individual.bugs on boole.stanford.edu), one might expect that only one out of every 9 billion divisions, and hence only one out of every 6*9 = 54 billion numbers produced, would be in error. One would therefore expect to wait about four days for the first error, and only then would the interesting question arise as to whether that error would have time to clone itself sufficiently before it and its first few clones were all overwritten by the steady inbound flow of clean data. Even if the error survived, its expected size would be very small: only one in every 40 or so errors is in the 5th decimal digit, and hence such an error occurs only every 160 days at full speed (or every 40*27,000 ~ one million years at the 1000-divisions-per-day rate).

In fact what happened was that my database effectively "melted down" altogether in 280 "cycles" taking 32 minutes, with 20% of its 750,000 numbers having a relative error of more than 0.1 and more than 6% having a relative error of more than 1 (i.e. meaningless). Although a few more cycles would have seen yet further deterioration, my experiment was programmed to stop at that point: enough is enough!

A "cycle" is defined here as the arrival of just enough numbers to fill the database, less a few filtered-out oversize numbers. On an isolated Pentium a cycle thus consists of 750,000 consecutive operations (including 1's), most of which put a number into the database. My P90 gets through a cycle in 7 seconds; were it not for other issues like locality and propagation delays, a network of a thousand Pentiums would accomplish a cycle in 7 milliseconds. Meltdown took 32 minutes on my machine, but if the same operations were to be performed in parallel it could occur in seconds in a sufficiently large and fast network of Pentiums. (Ordinarily a sequential machine can create long sequential dependencies that would invalidate this reasoning, by creating long error propagation paths that parallel computers would not have time to develop; but the random access pattern of this reactor creates dependency chains of an expected length sufficiently low as to be consistent with those arising in much more parallel computation. Short dependency chains are usually an essential component of being "embarrassingly parallelizable.")

Here is the printout detailing the meltdown. The vertical axis is time, increasing downwards linearly; each line is printed after 20 cycles (so line 9 comes at the end of cycle 180). The horizontal axis is space: each of the 50 or so dots on a line denotes 2% of the database, the whole space being ordered left to right by decreasing relative error, with hexadecimal digits denoting precision boundaries: 6 for 10^-6, c for 10^-12, etc.

 1 0123456789a.b.c.d................................................
 2 0123456789.a.b.c.d................................................
 3 012345.6.7.8.9.a.b.c.d...............................................
 4 012345.6.7.8.9.a.b...c...d..........................................
 5 012345.6.7.8.9..a....b.......c....d.................................
 6 0123.4.5.6.7.8...9.....a......b.....c..d............................
 7 012.3.4.5.6.7...8........9......a....b..c.d...........................
 8 012.3.4.5..6....7.......8.......9...a..b.c.d...........................
 9 .0.1.2.3.4..5......6.......7.....8...9.a.b.c.d..........................
10 .0.1.2.3..4.....5.........6....7..8.9.a.b.c.d.........................
11 .0.1.2...3......4.......5.....6..7.8.9.a.b.c.d.........................
12 .0.1...2........3......4....5..6.7.8.9.a.b.c.d........................
13 .0....1......2........3...4..5.6.7.8.9.a.b.c.d........................
14 ....0........1......2....3..4.5.6.7.8.9.a.b.c.d........................

The dots immediately following a hex digit or at the start of the line actually represent anywhere from a single entry to 2% of the database (whence the slight line-to-line variations in number of dots); the other dots represent exactly 2%. On line 6, the three dots between 8 and 9 indicate that between 4% and 6% of the database has relative errors between 10^-8 and 10^-9, while the single dot between 3 and 4 on line 6 means that there is at least one error in the range 10^-3 down to 10^-4. Absence of a dot, e.g. between 2 and 3 on line 6, means no errors at all in the range 10^-2 to 10^-3. Dots to the right of d denote errors less than 10^-13, including correct results. Because 1/3 of the incoming data consists of 1's, which are automatically correct, this region does not shrink below 1/3 of the whole database, which is why d slows down instead of going all the way across the page.

The boundaries start out squished together in the upper left. Each boundary then takes its turn rushing across the no-man's land in the middle, to join its comrades huddling together in the lower right. In terms of error populations this means that the correctness of the database deteriorates explosively; errors blow up by an order of magnitude every twenty cycles or so. The experiment stops when more than 2% of the data has unit relative error (is meaningless); here it had reached 6-8% by the time this stopping condition was examined, at which point at least 20% of the database had relative errors above 0.1. Had it not been stopped, the error explosion would have continued at the same speed, and the database would in a few dozen more cycles have become a mixture of 1's and completely meaningless numbers: total meltdown.

Caveats. There are several important caveats in interpreting these results. The most basic caveat is that any chain reaction, whether physical or computational, is highly sensitive to the prevailing conditions. Thus small changes to various parameters may increase or decrease both the speed of onset of the reaction and its ability to sustain itself. In particular I would not expect it to take very much of a dilution, in the form of either non-Pentiums or clean Pentiums contributing to the process, or a different mix of incoming data and operations performed on the data, to moderate this reaction to below the level at which it can sustain itself. Thus few if any real-world computing environments would be likely to provide the right conditions for this chain reaction. With the large number and variety of today's environments, however, the possibility that some might be at some risk of "going critical" should be kept in mind.

Offsetting these reassuring considerations is the less reassuring possibility that the incoming data may include either many small dollar amounts, shown by the IBM work to be at well above average risk on the Pentium, or my explicitly bruised integers, e.g. rationals as decimals truncated by mandate, or data read from a decimal calculator showing 4.999999, or from an analog-to-digital device that is positioned almost on an integer.
When the stream of 1's is replaced by such Pentium-sensitive data, the onset of a chain reaction could conceivably be much faster, and/or the critical mass of Pentiums (as measured by the rate of flawed divisions) required to sustain it could conceivably be much lower. All these variations almost certainly will have some effect. How much awaits further experimental investigation, as does the question of which variations have any likelihood of arising in real computing environments.

As an independent application, since it avoids Pentium-specific knowledge, the experiment might make a useful general-purpose FPU test when left to run for a week, provided its "clean" source is obtained from a reference machine.

Here is the program, which may be freely distributed and modified without my permission. It runs under Linux but should require little or no modification for other OS's (on a Pentium of course). It would ease my mind a bit if when you hack it up you put your name, date, and hack at the top, in reverse chronological order, so I get neither the credit nor the blame due you.

==========clip here=== reactor.c ======
/* Reactor simulator; generates chain reaction using Pentium
 * division errors as decay products.
 * Please log changes/author/date here in reverse chronological order
 * Author: V. Pratt, 12/29/94, pratt@cs.stanford.edu
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MAXW 750000 /* steady-state population (should fit in 16Mb) */
#define MAXE 14     /* report error distribution down to 10^-14 */
#define PR 20       /* # cycles between error distribution reports */

#define ablg(x) abs(16-(0x7c&(2+(((char*)&(x))[7])))/4)
                    /* fast approx. to (int)(fabs(log_2(x)/32)) */

int at_risk();      /* does the divisor have a risky leading nibble? */

int tick, cycle, W, grp[MAXE+1];
double clean[MAXW], dirty[MAXW];

main(argc, argv) int argc; char *argv[]; {
    int opd1, opd2, opr, ans, rnd, relcnt, watch;
    int i, j, e;
    double cl, di, rel, tem;

    watch = rnd = relcnt = 0;
    if (argc == 2)  /* any argument -> watch errors that */
        watch = 1;  /* are large relative to recent examples */
    rel = 1e-15;    /* initial setting for "relatively large" */

    /* initialize world with one element, 1.0 */
    clean[0] = dirty[0] = 1.0, W = 1;

    for (cycle = 1; grp[0] < MAXW/50; cycle++) {
        for (tick = 0; tick < MAXW; tick++) {
            /* Choose random expression "opd1 opr opd2" */
            opd1 = rand()%W;
            opd2 = rand()%W;
            opr = rand()%6;
            /* Evaluate expression */
            switch (opr) {
            case 0:     /* add operands */
                cl = clean[opd1] + clean[opd2];
                di = dirty[opd1] + dirty[opd2];
                break;
            case 1:     /* subtract operands */
                cl = clean[opd1] - clean[opd2];
                di = dirty[opd1] - dirty[opd2];
                break;
            case 2:     /* multiply operands */
                cl = clean[opd1] * clean[opd2];
                di = dirty[opd1] * dirty[opd2];
                break;
            case 3:     /* divide operands */
                cl = at_risk(clean[opd2])?
                        ((15./16)*clean[opd1]) / ((15./16)*clean[opd2]):
                        clean[opd1] / clean[opd2];
                di = dirty[opd1] / dirty[opd2];
                break;
            default:    /* output 1.0 */
                cl = di = 1.0;
                break;
            }
            /* reduce probability of values far from 1 */
            if ((rnd=(rnd%15)+1)%(ablg(cl)+ablg(di)+1) == 0) {
                /* result passed, store it (overwriting a randomly
                 * selected old number once the database is full) */
                ans = W < MAXW? W++: rand()%W;
                clean[ans] = cl, dirty[ans] = di;
                if (watch && fabs(cl-di) > rel*fabs(cl)) {
                    printf("%4d:%-8d%18.14g %18.14g %7.2g\n",
                           cycle, tick, cl, di, (cl-di)/cl);
                    if (((relcnt++)%10) == 0) rel *= 2;
                }
            }
        }
        /* Every PR cycles, print distribution of errors */
        if ((cycle%PR) == 0) {
            printf("%4d ", cycle/PR);
            for (i = 0; i <= MAXE; i++) grp[i] = 0;
            for (i = 0; i < W; i++) {
                tem = (clean[i]-dirty[i])/clean[i];
                e = .301*(1026-(((short*)&tem)[3]&0x7ff0)/16);
                e = e < 0? 0: e > MAXE? MAXE: e;
                grp[e]++;
            }
            for (i = 0; i <= MAXE; i++) {
                for (j = 0; j < grp[i]; j += W/48+1)
                    printf(".");
                printf(i < MAXE? "%x": "\n", i);
            }
        }
    }
}

/* 1 iff the top four mantissa bits of x (a little-endian double)
 * form one of the five risky patterns 1.0001, 1.0100, 1.0111,
 * 1.1010, 1.1101, i.e. nibble 1, 4, 7, 10 or 13; mask 0x2492 has
 * exactly those bits set */
int at_risk(x) double x; {
    return ((0x2492 >> (((short*)(&x))[3]&0xf))&1);
}
===== end of reactor.c ==========
--
Vaughan Pratt                        FTP: boole.stanford.edu:/pub/FDIV/README
4.999999/14.999999 has 6 3's, not 4  http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug66----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 10467 of sci.math.num-analysis:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,sci.math.num-analysis,comp.arch
Subject: Re: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Contaminated Data Persists)
Date: 1 Jan 1995 09:59:24 GMT
Organization: Computer Science Department, Stanford University.
Lines: 343
Distribution: inet
Message-ID: <3e5uds$9v5@Radon.Stanford.EDU>
References: <3dn52g$ohe@newsbf02.news.aol.com> <3e097i$952@radon.stanford.edu>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:28381 sci.math.num-analysis:10467 comp.arch:16078

In article , Paul Rubin wrote:
>In article <3e097i$952@radon.stanford.edu>,
>Vaughan R. Pratt wrote:
>>In fact what happened was that my database effectively "melted down"
>>altogether in 280 "cycles" taking 32 minutes, with 20% of its 750,000
>
>Is this an interesting result? It sounds like your program computes
>the evolution of what must be a chaotic system (your database) (is
>*that* an interesting result?). You've shown that after a certain
>amount of time the dirty-Pentium results differ substantially
>from correct IEEE results, but the IEEE results might *also*
>be totally numerically meaningless.

Oops. Indeed. It's not terribly interesting to know that A took n cycles to melt down relative to B if both had long before then melted down relative to the truth.

>What happens if you do the same calculation with extended precision
>(i.e. storing all numbers as 80-bit floats) and comparing the results
>with "clean" double (64-bit) precision?

Sigh. It turns out that if you take this hit on every instruction (in contrast to the hit from the FDIV bug, which though much bigger in magnitude is much smaller in rate, about one in every few million instructions in this setup as it turns out), then the meltdown of double precision relative to extended precision takes 160 cycles, in contrast to 280 cycles for the Pentium error relative to double precision. I had put off doing this experiment because I had been expecting this difference to be so small as to take far longer to propagate, but in practice it is not. In hindsight this was a big mistake.

Databases don't seem to melt down relative to the world, not even on account of using single precision arithmetic. Why they don't is an excellent question. I thought about this for a bit and decided that a basic difference between my database and a real one is that mine is untyped: it lets any two things be combined in any way. Although we aren't counting apples and oranges, one could imagine that each of the 1's streaming in counted one kumquat, the type Brian Reid would use here. Now it is fine to add 1 kumquat plus 1 kumquat to get 2 kumquats. And it is also fine to multiply them, in which case you get one square kumquat.
What is bad is to add kumquats to square kumquats. But it is OK to multiply them, getting cubic kumquats. And you can divide a kumquat by a kumquat to get a pure number; call this type *unit*. If you now divide 3 cubic kumquats into 12 units, you get 4 units per cubic kumquat. And so on.

So I converted my untyped reactor into a typed one, somewhat arbitrarily set limits of plus and minus 3 (cubic and cubic^-1) for the types, and only permitted correctly typed operations. This seemed to bring the *size* of numbers under some control; they didn't seem to want to fluctuate so much. It also postponed the meltdown *a little*, getting to 200 cycles instead of 160. But this was still much sooner than with the Pentium numbers, which got to 280 cycles before reaching the corresponding point.

So what other differences might there be between my database and a real one? Well, a number that has been obtained by a long chain of calculations is at great risk of accumulating a lot of errors along the way. I therefore decided not to trust such numbers and to omit them from the database. I defined the depth of each incoming 1 as 0, and the depth of each calculated number as 1 plus the maximum depth of its operands, and arbitrarily set 100 as the maximum permissible depth. This had the following dramatic effect (see my original message on chain reactions for the meaning of this printout).

 1 012345.6.7.8.9.a.b.c.d................................................
 2 012345.6.7.8.9.a.b..c...d............................................
 3 0123456.7.8.9.a.b.c.d................................................
 4 01234567.8.9.a.b.c.d................................................
 5 0123456.7.8.9.a.b.c.d................................................
 6 01234567.8.9.a.b.c.d................................................
 7 0123456789.a.b.c.d................................................
 8 01234567.8.9.a.b.c.d................................................
 9 012345678.9.a.b.c.d................................................
10 012345678.9.a.b.c.d................................................
11 01234567.8.9.a.b.c.d................................................
12 0123456789.a.b.c.d................................................
13 01234567.89.a.b.c.d................................................
14 0123456789a.b.c.d................................................
15 0123456789a.b.c.d................................................
16 012345678.9.a.b.c.d................................................
17 012345678.9.a.b.c.d................................................
18 012345678.9.a.b.c.d................................................
19 01234567.89.a.b.c.d................................................

For the first 20 cycles (line 1) the database started to melt down as before, with errors reaching above 10^-6. But at cycle 40 (line 2) things had not gotten any worse, and after that the errors that had initially rushed down to 10^-5 started to get cleaned out by the incoming clean 1's. By cycle 140 (line 7) all errors greater than 10^-9 were gone. Thereafter occasional errors would creep in, all less than 10^-7, but sometimes there would be no errors bigger than 10^-10, as at cycles 280 and 300.

So avoiding long dependency chains seems like an excellent strategy for avoiding database meltdown. Types, in contrast, merely slowed down the onset of meltdown by about 20%.
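To make the type discipline concrete, here is a minimal sketch in C (my own illustration, not part of reactor.c; the function name combine and the ILLEGAL sentinel are made up). Addition and subtraction require operands of the same type, multiplication adds type exponents, division subtracts them, and anything outside plus or minus TYPLIM is rejected:

#include <stdio.h>

#define TYPLIM 3     /* no types beyond cubic or per-cubic kumquats */
#define ILLEGAL -99  /* sentinel for a rejected combination */

/* opr: 0 add, 1 subtract, 2 multiply, 3 divide; t1, t2 are the
 * kumquat exponents of the operands (0 = pure number). */
int combine(int opr, int t1, int t2)
{
    int ty;
    if (opr <= 1)                    /* add/subtract: types must agree */
        ty = (t1 == t2) ? t1 : ILLEGAL;
    else                             /* multiply adds, divide subtracts */
        ty = (opr == 2) ? t1 + t2 : t1 - t2;
    return (ty < -TYPLIM || ty > TYPLIM) ? ILLEGAL : ty;
}

int main(void)
{
    printf("%d\n", combine(0, 1, 1));  /* kq + kq -> kq:            1 */
    printf("%d\n", combine(2, 1, 2));  /* kq * sq kq -> cubic kq:   3 */
    printf("%d\n", combine(3, 0, 3));  /* unit / cubic kq:         -3 */
    printf("%d\n", combine(0, 1, 2));  /* kq + sq kq: ILLEGAL     -99 */
    return 0;
}

The typed reactor.c listing below enforces these same rules inline while selecting its second operand.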
I've always been a bit atheistic about types (10 years at typeless MIT can do that to one :), and I've also long had a deep mistrust of long proofs (I've never met a long proof I understood). These quantitative observations about meltdown provide some kind of confirmation of both of my intuitions.

Now, back to the Pentium bug. With the reference database now acting more stably, it becomes a much more meaningful exercise to compare clean and Pentium-flawed division with respect to the rate at which the latter can melt things down. I chose to run both databases at double precision, assuming that the outcome would not be very different from running both at extended precision (= long double) (I don't know how true this is).

Well, the outcome was mixed. On the one hand, an error appeared on average once every million divisions, at least for the first 200 cycles. Judging by their behavior (they tended to come in bunches within a few cycles, with gaps of dozens of error-free cycles in between), I would estimate that a third of these were FDIV errors and the other two thirds inherited those errors by arithmetic before being wiped out by the stream of 1's. This corresponds to 1 FDIV error in 3 million divisions, which is 3,000 times the 1 in 9 billion rate assuming random data. This is significant given that, although the incoming data was not at all random, being all 1's, the operands and operations used in obtaining new values from old were completely random. So even though no Pentium-specific knowledge at all was built into this model, erroneous operands were found by the model at a relatively high rate. The errors covered the spectrum fairly uniformly (by exponent) from 10^-6 down to 10^-13.

But then I ran the experiment to 400 cycles and no further errors appeared, diluting all the numbers of the preceding paragraph by a factor of 2. But then from cycles 402 to 420, 160 errors appeared! So the number of errors can fluctuate wildly. But this particular big bunch of errors all stayed within the range 10^-10 to 10^-14, and as such was pretty inconsequential.

On the other hand, all the above FDIV-induced errors were swept up by the replacement process much more quickly than they were generated. In contrast, the constant stream of truncation errors at double precision, as measured relative to extended precision, came faster than they could be swept up, maintaining a population of errors that extended down to 10^-10 with occasional excursions down to 10^-8 after stabilizing, as we saw above.

I then did another experiment, changing the source of numbers from constant 1 to random numbers of the form d.d, e.g. 6.7 or 0.2. These seem particularly bad in the context x/(d.d-d.d), and so might be expected to give more errors. And in fact cycles 11-36 showed 120 errors; had we not seen the 160 errors in cycles 402-420 with 1's as inputs, we might conclude that d.d was much worse than 1's. But then in cycles 37-200, the only errors were a batch of 9 errors in cycles 74-78 and one error in cycle 139. The enormous variation within experiments thus seems to dominate any possible variation one might suspect between these two experiments. Hence for this reactor program at least, one cannot infer much of an important difference between the sources 1 and d.d.
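For reference, the d.d source just compared against the 1's is a one-liner; a minimal sketch (my own, not from the posting, using rand() as reactor.c does):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i;
    /* random numbers of the form d.d, e.g. 6.7 or 0.2: two random
     * decimal digits on either side of the point */
    for (i = 0; i < 5; i++)
        printf("%.1f\n", (rand() % 100) / 10.0);
    return 0;
}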
My interpretation of all this is that, from a very pragmatic point of view, Intel could afford to weaken its 1 in 9 billion claim to 1 in a few million, while at the same time pointing to the distribution of errors, which are known to occur uniformly across the spectrum from leading bit 14 down to leading bit 52, as greatly diluting the apparent seriousness of this thousand times higher rate. In particular, for at least this random model of data replenishment, such errors propagate only a little, averaging few offspring in their brief lifetimes, with however occasional big bursts.

The obvious caveat here is that my reactor program doesn't claim to represent faithfully *any* real world application, which could perform either better or worse than reactor.c with regard to its response to the Pentium FDIV error. For example a nonlinear least-squares fitting program could reasonably be expected to be far more sensitive to such errors than this naive model of database maintenance. Nevertheless it seems to me a very interesting data point serving to calibrate "high" error rates such as one in a few million (as opposed to 1 in 9 billion) as to the seriousness of their ultimate impact on relatively insensitive programs such as this one: namely, not very serious.

Despite its artificial and simplistic nature, my program may be a sufficiently indicative model of what to expect from a "typical" application, for those times when you *have* to have some expectation and you have no a priori reason to expect circumstances sufficiently different as to indicate a substantially different outcome. I cannot recommend such an expectation for anything that puts either people or large sums of money at any risk. But for many relatively low risk applications this sort of reasoning may be a perfectly adequate justification for continuing to use a buggy Pentium.

Again I enclose the program. Please bring any bugs in it to my attention, in case they are responsible for any erroneous conclusions. Remove the comments around the two instances of "long," and simplify clean division by eliminating the at_risk clause, if you want to do the double-vs-extended experiment on any system (not necessarily an x86) that distinguishes long double from double. (On a big-endian system definitely remove at_risk since it assumes little-endian order.)

=======clip here====reactor.c========
/* Reactor simulator; generates chain reaction using Pentium
 * division errors as decay products.
 * Author: V. Pratt, 12/29/94, pratt@cs.stanford.edu
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MAXW 750000  /* steady-state population (should fit in 16Mb) */
#define MAXE 14      /* report error distribution down to 10^-14 */
#define PR 20        /* # cycles between error distribution reports */
#define TYPLIM 3     /* no types higher than cubic kq or per cubic kq */
#define MAXDEP 100   /* no dependency chains longer than 100 */

#define ablg(x) abs(16-(0x7c&(2+(((char*)&(x))[7])))/4)
                     /* fast approx. to (int)(fabs(log_2(x)/32)) */

int at_risk();       /* does the divisor have a risky leading nibble? */

int tick, cycle, W, grp[MAXE+1];
/* long */ double clean[MAXW];
double dirty[MAXW];
char typ[MAXW];
int depth[MAXW];

main(argc, argv) int argc; char *argv[]; {
    int opd1, opd2, opr, ans, rnd, relcnt, watch;
    int i, j, e, ty, de;
    /* long */ double cl;
    double di, cld, rel, tem;

    watch = rnd = relcnt = 0;
    if (argc == 2)   /* any argument -> watch errors that */
        watch = 1;   /* are large relative to recent examples */
    rel = 1e-16;     /* initial setting for "relatively large" */

    /* initialize world with one element, 1.0, a kumquat of depth 0 */
    clean[0] = dirty[0] = typ[0] = W = 1;
    depth[0] = 0;

    for (cycle = 1; grp[0] < MAXW/50; cycle++) {
        for (tick = 0; tick < MAXW; tick++) {
            /* Choose random expression "opd1 opr opd2" */
            opd1 = rand()%W;
            while (depth[opd1] >= MAXDEP && ++opd1 < W);
            if (opd1 == W) continue;
            opd2 = rand()%W;
            opr = rand()%6;
            ty = typ[opd1]; i = -1;
            /* require correct types and bounded depth */
            switch (opr) {
            case 0: case 1:
                while ((typ[opd2] != ty || depth[opd2] >= MAXDEP)
                       && ++opd2 < W);
                break;
            case 2: i = 1;
            case 3:
                do {
                    ty = typ[opd1] + i*typ[opd2];
                } while ((ty < -TYPLIM || TYPLIM < ty
                          || depth[opd2] >= MAXDEP) && ++opd2 < W);
                break;
            }
            if (opd2 == W) continue;
            de = 1 + (depth[opd1] > depth[opd2]? depth[opd1]: depth[opd2]);
            /* Evaluate expression */
            cl = di = 0;
            switch (opr) {
            case 0:     /* add operands */
                cl = clean[opd1] + clean[opd2];
                di = dirty[opd1] + dirty[opd2];
                break;
            case 1:     /* subtract operands */
                cl = clean[opd1] - clean[opd2];
                di = dirty[opd1] - dirty[opd2];
                break;
            case 2:     /* multiply operands */
                cl = clean[opd1] * clean[opd2];
                di = dirty[opd1] * dirty[opd2];
                break;
            case 3:     /* divide operands */
                cl = at_risk(clean[opd2])?
                        ((15./16)*clean[opd1]) / ((15./16)*clean[opd2]):
                        clean[opd1] / clean[opd2];
                di = dirty[opd1] / dirty[opd2];
                break;
            default:    /* output 1.0, a kumquat of depth 0 */
                cl = di = 1.0;
                ty = 1; de = 0;
                break;
            }
            cld = cl;
            /* reduce probability of values far from 1
            if ((rnd=(rnd%15)+1)%(ablg(cld)+ablg(di)+1) == 0) { */
            if (!(ablg(cld) || ablg(di))) {
                /* result passed, store it */
                ans = W < MAXW? W++: rand()%W;
                clean[ans] = cld, dirty[ans] = di;
                typ[ans] = ty; depth[ans] = de;
                if (watch && fabs(cld-di) > rel*fabs(cld)) {
                    printf("%4d:%-8d%2d %18.14g %18.14g %7.2g%12d\n",
                           cycle, tick, ty, cld, di, (cld-di)/cld,
                           depth[ans]);
                    fflush(stdout);
                    if (((relcnt++)%10) == 0) rel *= 2;
                }
            }
        }
        /* Every PR cycles, print distribution of errors */
        if ((cycle%PR) == 0) {
            printf("%4d ", cycle/PR);
            for (i = 0; i <= MAXE; i++) grp[i] = 0;
            for (i = 0; i < W; i++) {
                tem = (clean[i]-dirty[i])/clean[i];
                e = .301*(1026-(((short*)&tem)[3]&0x7ff0)/16);
                e = e < 0? 0: e > MAXE? MAXE: e;
                grp[e]++;
            }
            for (i = 0; i <= MAXE; i++) {
                for (j = 0; j < grp[i]; j += W/48+1)
                    printf(".");
                printf(i < MAXE? "%x": "\n", i);
            }
        }
    }
}

/* 1 iff the top four mantissa bits of x (a little-endian double)
 * form one of the five risky patterns 1.0001, 1.0100, 1.0111,
 * 1.1010, 1.1101, i.e. nibble 1, 4, 7, 10 or 13; mask 0x2492 has
 * exactly those bits set */
int at_risk(x) double x; {
    return ((0x2492 >> (((short*)(&x))[3]&0xf))&1);
}
=========end of reactor.c==========
--
Vaughan Pratt                      FTP: boole.stanford.edu:/pub/ABSTRACTS
http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug67----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 16081 of comp.arch:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,sci.math.num-analysis,comp.arch
Subject: Re: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Contaminated Data Persists)
Date: 1 Jan 1995 20:58:06 GMT
Organization: Computer Science Department, Stanford University.
Lines: 95
Distribution: inet
Message-ID: <3e750u$rv1@Radon.Stanford.EDU>
References: <3dn52g$ohe@newsbf02.news.aol.com> <3e097i$952@radon.stanford.edu>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:28427 sci.math.num-analysis:10476 comp.arch:16081

In article , Anil Das wrote:
>
>	With (correct) double precision math the only way
>21E7 operations can give relative errors more than 1
>(or even 1E-8) is for a number of subtractions of nearly equal
>quantities to occur. Maybe the experiment should be rerun after
>removing the subtract operation?

While ten million is a theoretical upper bound on chain lengths, the observed upper bound was roughly 200,000 times less (idle conjecture: 750,000/e less?). This was still enough to let the considerable heating effect of subtraction do its work, to my great surprise, not to mention mortification (*so* embarrassing...).

I'm not sure how meaningful the results are without subtraction, but I tried it anyway for both situations, double vs. extended precision, and Pentium divide vs. correct. For the former, the (many) errors got no worse than 5*10^-14 at 280 cycles, i.e. not much error propagation. For the latter, in 240 cycles there were 459 errors, i.e. not many, and they kept being swept up within a few cycles. Most were under 10^-10, but there were five relatively large ones: 1.5e-08, 3.4e-06, 6e-08, 4.3e-08, and 6e-08.

Other experiments suggest themselves here, such as greatly diluting the 1's to see what happens when their mopping-up or moderating influence is removed; those few but large errors mentioned just above might make a much bigger nuisance of themselves then. However a more meaningful experiment than simply removing subtraction might be to allow it but put some limit on how small the difference can get relative to the operands, e.g. .001 or .5. This could be taken as a crude model of the improved stability obtained for ill-conditioned data by careful pivoting, Householder triangularization, etc. I would guess that the .001 limit would not make a big difference, but that .5 would not be terribly different from no subtraction at all with regard to rate of meltdown. Too many fun experiments possible, too little time.

>	a) Has any attempt been made to run the experiment
>with a different random number generator or the same random
>number generator with a different seed? That will show how
>sensitive the results are to initial conditions. The more
>sensitive, the more chaotic the whole thing is.

I ran quite a few experiments with other parameters being varied, which achieves the same end. In general the setup is quite chaotic, as I indicated under "Caveats" in my original message. I did in fact try varying only the seed early on, just to be sure, and saw the same chaotic variations as when varying other parameters.

>	b) Is a count of the actual number of erroneous divisions
>that occured available?
>
>	di = dirty[opd1] / dirty[opd2];
>	if (at_risk(clean[opd2])) {
>	    cl = ((15./16)*clean[opd1]) / ((15./16)*clean[opd2]);
>	    if (cl != di && dirty[opd1] == clean[opd1]
>		&& dirty[opd2] == clean[opd2])
>		bad_division_count++;

No; something like this would be a good addition, to distinguish original from inherited FDIV errors. But I would prefer to count all FDIV bugs, not just those on clean data as per the above, since it is possible for the bug to "feed on itself."
Tim Coe showed me a very nice example of this, which I've expressed here in C:

main() {
    int i;
    double x = 9.2, y = 19.4;
    x -= 4.2, y -= 4.4;
    for (i = 0; i < 4; i++) {
        printf("want %20.15f\n got %20.15f %g\n",
               (1.01*x)/(1.01*y), x/y, (x - (x/y)*y)/x);
        x = (x/y)*y;
    }
}

This starts out with x = 9.2 - 4.2, y = 19.4 - 4.4, and repeatedly divides x by y and then multiplies the y back in. Here's how the quotient and relative error look after doing this up to 4 times---it stabilizes at the 4th step at a relative error of 3*10^-6.

want    0.333333333333333
 got    0.333333333332363  2.91038e-12
want    0.333333333332363
 got    0.333333329358720  1.19209e-08
want    0.333333329358720
 got    0.333332312106116  3.05176e-06
want    0.333332312106116
 got    0.333332312106116  0

Your term "cyberentomologist" has a catchy ring to it.
--
Vaughan Pratt                      FTP: boole.stanford.edu:/pub/ABSTRACTS
http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug68----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 28507 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: No one seems to understand FP arithmetic :
Date: 2 Jan 1995 22:26:32 GMT
Organization: Computer Science Department, Stanford University.
Lines: 206
Message-ID: <3e9uio$cf2@Radon.Stanford.EDU>
References: <1995Jan1.112135.2002@zcon.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article <1995Jan1.112135.2002@zcon.com>, Syed Zaeem Hosain wrote:
>
>>Perhaps if someone can communicate in 1000 words or less what
>>the shouting is about, we can more easily avoid embarrassments
>>like this in the future.
>
>Sure thing. The flaw is that there are a set of numbers for which the

Beautifully written and way under budget (500 words). Some picky points, and some not so picky ones in the second half. I think I can get these close to the 1000 word limit.

>division instructions (FDIV, FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR,
>FPREM, FPREM1) and the instructions that use the division algorithms
>(FPTAN, FPATAN, FYL2X and FYL2XP1) in the Pentium processor will not
>provide correct results starting in the 9th bit (approx 4th decimal
>digit) onwards

No, the 14th bit (counting the hidden mantissa bit as bit 0), 5th decimal (a relative error of 6*10^-5). For x normalized (i.e. in [1,2)), when an absolute error e is added to x, the best definition of "bit position of the error e" is the distance of e's leading 1 from the binary point, with e=1/2 being an error in bit 1. When the bit position of the error is defined as the first position where x differs from x+e, one gets meaningless results, as in the case x = 1.0111111, e = 0.0000001. Comparing x+e = 1.1000000 bit by bit with x locates the error in bit 1 when it is obviously in bit 7.

>- with about equal probability that the error will be in
>any one of the bits after that position.

>Specifically numbers where the
>divisor has the following bit patterns in the most significant bits:
>1.0001, 1.0100, 1.0111, 1.1010 and 1.1101.

...and the next six bits must be 1, and another eight or more 1's greatly increases the probability of error.

>and incorrect numerical calculations under certain conditions. If
>looked at in the space of all potential floating point calculations
>(randomly distributed), about 1 in 9 billion divisors provides an
>incorrect result.

No, 1 in 50,000 divisors provides an incorrect result (159 single precision divisors out of 2^23 yield single precision errors, 168 single precision divisors yield double or extended precision errors). Your figure is for operand pairs (divisor and dividend together), and furthermore is for the case of double precision results (see the table in boole.stanford.edu:/pub/FDIV/individual.bugs/bug63, Internet posting of Dec. 27). For single and extended precision results, 9 billion is replaced by respectively 40 billion and 7 billion. (These rates are unexpectedly more accurate than their rounding to integers would suggest, being off by only +.5%, -1.2%, and -1.4% respectively.)

>However, this does *not* mean that the probability of
>occurrence of the flaw is random at that rate - if you happen to be
>using one of the faulty divisors for calculations, it will occur
>consistently for *every* division done with that number. Thus the
>error rate can be a lot higher - depending on the situation. And
>unknowingly so!

Any nontrivial distribution has the property you cite. What matters is whether either distribution, faulty pairs or faulty divisors, has a positive correlation with distributions of operands that could arise in practice. Some distributions are quite harmless; in fact the irony is that *most* (>99.99%) of the possible distributions of faulty operands would cause no more of a problem than Intel has claimed based on the 1 in 9 billion figure, in fact even less when one takes into account that the significance of better than 70% of those errors is at best theological, given the combination of their low rate and size.

It is therefore extremely unfortunate for the Pentium, for Intel, and for Intel's customers, that the bad divisors have a particularly deadly distribution that lands them hard up against certain small integers. When computing with random expressions applied iteratively starting from data having relatively few digits (the constant 1 will do all by itself!), error rates are a thousand times higher than they would have been had the problem pairs been randomly distributed. And when the operand pairs are "small bruised integers," for which plausible scenarios exist that produce these in large quantities, the rates jump to a million times higher!

In inexplicably sticking to its widely advertised "one in 27,000 years" figure, Intel commits not one but two fundamental "off by a million" exaggerations, namely in two rates: the rate of divisions, and the rate of errors in those divisions.

The former exaggeration is the more blatant. A spreadsheet user doing only 1000 divides per day does not need a Pentium except to keep up with the Joneses. If Pentiums are to be taken seriously as workstations as opposed to mere calculators, that rate must be increased a thousand-fold to a million-fold, the latter being about the rate at which divisions were done in Toon Moene's numerical weather prediction program, on which he has been studying the impact of the Pentium bug.

For the latter exaggeration, Intel assumes randomly distributed inputs, for which their analysis is indeed valid, and furthermore is almost certainly applicable to many applications. This assumption is grossly invalid, however, for applications of the types mentioned above, which can give rise to thousand-fold or even million-fold increases in rate.
Intel's two exaggerations combine multiplicatively to give a net exaggeration of anywhere between a thousand and a trillion, depending on circumstances. But how could a trillion even be possible? Surely not even the scenario mentioned above in which one repeatedly performs the same flawed division over and over could go that high. Indeed it could; in fact here Intel's exaggeration can be easily calculated to be a *thousand* trillion, and even worse (taking distribution of magnitudes into account) if you pick a pair that produces the maximum error. But that's a scenario we can all agree with Intel president Andy Grove would be like running to stand under a meteorite.

>Intel has finally decided to do this replacement, no questions asked,
>while [apparently still] continually grumbling about how they are being
>forced to do so. This "we are put-upon", "the Internet forced us"
>martyr act is getting a little thin.

No large company to my knowledge has worked as hard as Intel to draw the wrath of the Internet in just a few months. Consider.

(1) Intel kept the flaw a secret from its customers for four months.

(2) When discovered by Intel customer Thomas Nicely, the flaw was minimized by Intel in terms of 27,000 years and 1 in 9 billion pairs.

(3) Intel initially declined replacement processors for customers who could not pass both a numerical literacy test (what exactly is your problem, exactly?) and a need test (has this problem affected you and if so how?).

In boole.stanford.edu:/pub/FDIV/individual.bugs/bug31, Internet posting of Dec. 18, under the respective headings Past, Present, Future, I argued that (1) might be hard to make stick; in any event it is water under the bridge except for those bent on teaching Intel and/or the industry a lesson. And I argued that (3) was merely a PR problem, being the absolutely worst conceivable way of coping with an understandably limited supply of good chips. No one has to explain how a fuel injector works in order to get it replaced. And replacing items at risk of failure is preventive maintenance; any company turning down those of its customers cluey enough to request specific preventive maintenance is asking to lose its cluey customers. A more customer-aware solution to the supply problem would do wonders for Intel's PR while changing nothing of any significance given the limited supply.

Intel's "repenting" of (3) on Dec. 20 was therefore the perfect PR move, in three basic ways.

First, it acknowledged that there was a problem, confirming what had up to that point been only a vague growing suspicion in the public's mind. More significantly, it identified for the public what the problem must have been. This effectively drew attention away from (1) and (2), whose seriousness had previously been no more clearly defined in the public's mind than that of (3). And the acknowledgement defused a crowd situation that was threatening to get out of control.

Second, it satisfied those who had been complaining about (3), by acknowledging explicitly what customers had previously only been permitted to read between the lines, namely that they had to wait their turn for their chip and that there would be a long delay. (It seems highly unlikely that Intel could have successfully continued to stonewall requests for replacements once supplies had improved.)

Third, it was not an unreasonably expensive move. (i) Intel cannot replace chips any faster than it had been doing before, so that particular cost is not going to increase.
(ii) The "charge" mentioned in the announcement was one that could at best have been postponed under the circumstances. (iii) The loss in goodwill per denied customer far exceeds the gain in revenues per chip saved: Intel *makes* money with this move. Optimists are calling Intel's overnight change from Charles to Di a shrewd move. Pessimists question whether catching the ball once after three fumbles qualifies one for the Heisman. Either way it was a good move. This leaves just (2). My expertise is in neither the legal issues raised by (1) nor the PR issues raised by (3). Rather it is in numbers, which is the issue raised by (2). When *any* organization, be it Intel, IBM, or Boys Town, puts other people's money and lives at unknown but potentially great risk by quoting numbers that are off by multiple orders of magnitude, whether by a thousand or a trillion, in order to save its own bacon, this is gross irresponsibility. When it declines to enter into any public discussion of these risks this is further abdication of that responsibility. There has been much talk on the net about how errors of 10^-5 can grow by propagating. Intel is living proof of this: the tiny .00001 error in the Pentium has propagated around Intel and emerged in press releases as potentially a trillion-fold error. Now *that's* meltdown. Intel needs to address these issues, not by pronouncements issued from behind closed doors but in a public forum where the extent of agreement or disagreement of the affected parties can be judged by the public. As long as there are millions of these flawed Pentiums hard at work every day of the year, with no official recommendation for what exactly to put on their warning labels beyond "Intel Inside", *all* of us are potentially affected parties. -- Vaughan Pratt FTP: boole.stanford.edu:/pub/ABSTRACTS http://boole.stanford.edu/boole.html ================================================================= -------------------------PENTIUM REPORT # bug69---------------------- Vaughan R. Pratt Department of Computer Science Stanford University Stanford, CA 94304 pratt@cs.stanford.edu ====================================================================== Article 28525 of comp.sys.intel: Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) Newsgroups: comp.sys.intel Subject: Re: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Cont Date: 3 Jan 1995 00:24:29 GMT Organization: Computer Science Department, Stanford University. Lines: 101 Message-ID: <3ea5ft$g6f@Radon.Stanford.EDU> References: <3dn52g$ohe@newsbf02.news.aol.com> <3e5uds$9v5@Radon.Stanford.EDU> <5d5eE776NgB@me-tech.PFM-Mainz.DE> NNTP-Posting-Host: sunburn.stanford.edu In article <5d5eE776NgB@me-tech.PFM-Mainz.DE>, Michael Schmidt wrote: > In article <3e5uds$9v5@Radon.Stanford.EDU> > pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) >> >> This had the following dramatic effect (see my original message on >> chain reactions for the meaning of this printout). >> >> 1 012345.6.7.8.9.a.b.c.d...................................... >...some stuff deleted... > >I think it's important enough and would be fine (I can imagine >enough others would appreciate it too) if you could repost your >whole original message, where the meanings are explained, to the >comp.sys.intel newsgroup. Let me just do it from scratch. Anyone who wants the original message can get it by anonymous ftp as bug65, available in /pub/FDIV/individual.bugs on boole.stanford.edu. 
The above example comes from bug66. For starters let me just read it out loud: "At cycle 20 (= line 1) the database is only mildly corrupted, with a very small scattering of errors across much of the spectrum, none greater than 10^-5 though it contains at least one relative error greater than 10^-6. At least 98% of the database is intact in the sense of containing no relative errors larger than 10^-13."

Now here's how I got that out of the above. Each nonleading dot in a run of dots represents about 2% of the database entries (100/48 to be exact), arranged in order of increasing accuracy (decreasing relative error) from left to right. (2% of 750,000 entries is 15,000 entries.) The leading dot in each run denotes less, namely anywhere from a single entry to 2% of the database. Each hex digit (0-9a-d) marks an error size boundary; digit 8 is the boundary for relative errors of size about 10^-8. (Note that since digits take up space they displace dots, which therefore move around a little as an artifact of this method of display, depending on where the digits are concentrated.) Adjacent hex digits, e.g. 23, indicate complete absence of errors in that decade, namely between 10^-2 and 10^-3.

So in the line above, the most erroneous entry is between 10^-5 and 10^-6, and it is at least consistent with these rules, though very unlikely, that there are only 8 entries to the left of d (i.e. less accurate than 10^-13). The other extreme would seem to be that 16% of the database is left of d. This seems like pretty crummy notation, and it was really only meant for eyeballing the general effect, but there is a trick to reading more into it. There were 48 non-leading dots in the above line in my original message (you deleted ten), all to the right of d, and that accounts for the whole database to within a dot. Hence less than 2% of the entries can be left of the d.

The number at the start of the line is the cycle number divided by 20 (more recently I changed the program to not divide by 20). (A cycle is the number of steps required to refresh the database with as many numbers as fit in the database; Mercedes claims their climate control does this to the car's air in 20 seconds, but they certainly don't claim to have replaced every molecule in that time, or even in 20 minutes = 60 cycles.) So the example is a snapshot at the end of cycle 20.

At cycle 280 of the same run we see

14 0123456789a.b.c.d................................................

which indicates that all errors above 10^-10 have been swept out of the database, mostly by the incoming 1's constituting 1/3 of the incoming numbers (computing x/x will also do this, but with very low probability; x-x will not do it at all, because 0's are discarded along with other quantities below 2^-32).

The source of errors in that run was double-precision truncation, measured using extended precision as the standard, and the errors came from the arithmetic operations supplying 2/3 of the incoming numbers. Though tiny, they were enormous in quantity compared to FDIV errors, giving them enough "momentum" to sweep down to 10^-10 by being magnified by arithmetic before the diluting effect of the 1's was finally able to sweep them all away. But this is a chaotic process, and some errors would from time to time reach 10^-7, though after cycle 120 none got past that threshold (at least none were there at each 20-cycle snapshot).
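To make these reading rules concrete, here is a small self-contained C sketch of how one such line gets printed (my own distillation of the corresponding loop in the reactor.c listings above; grp[i] counts the entries whose relative error falls in decade i, decade 0 collecting errors of roughly 1 or more and decade 14 everything below 10^-13):

#include <stdio.h>

#define MAXE 14   /* boundary digits 0..d, then the accurate tail */

/* Print one snapshot line: for each decade i, one dot per 48th of
 * the database (a partial group still gets its one leading dot),
 * then the hex boundary digit i; the final group ends the line. */
void print_line(int lineno, int grp[], int W)
{
    int i, j;
    printf("%4d ", lineno);
    for (i = 0; i <= MAXE; i++) {
        for (j = 0; j < grp[i]; j += W/48 + 1)
            printf(".");
        printf(i < MAXE ? "%x" : "\n", i);
    }
}

int main(void)
{
    /* 750,000 entries: one error between 10^-5 and 10^-6, the rest
     * below 10^-13; prints a line of the form "012345.6789abcd"
     * followed by 48 dots */
    int grp[MAXE + 1] = {0};
    grp[6] = 1;
    grp[MAXE] = 750000 - 1;
    print_line(1, grp, 750000);
    return 0;
}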
I learned a lot about doing experiments from this experiment, but the most useful thing I learned about the Pentium itself was, I think, the fact that if you start from low precision data (and it doesn't seem to matter too much whether it is exclusively the number 1, or random decimal numbers of the form d.dd) and compute at random while respecting type (don't add kumquats to square kumquats) and keeping dependency chains short (< 100), then the kind of numbers you end up with, when selected among at random and fed into FDIV, trigger the bug every few million divisions. This is three orders of magnitude higher than for random data in Intel's sense, operands selected from the 2^23 mantissas with equal probability. I'd like to measure this more carefully and under a greater variety of conditions just to be sure, but I certainly did not have to wait 24 days for an FDIV error as the IBM report suggests; they popped up every couple of minutes or so! 27,000 years? Give me a break.
--
Vaughan Pratt                      FTP: boole.stanford.edu:/pub/ABSTRACTS
http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug70----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 28609 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel,sci.math.num-analysis,comp.arch
Subject: Re: TECHNICAL: Chain reaction in Pentiums (Was: The Flaw: Pentium-Contaminated Data Persists)
Date: 3 Jan 1995 18:15:14 GMT
Organization: Computer Science Department, Stanford University.
Lines: 39
Distribution: inet
Message-ID: <3ec47i$ki5@Radon.Stanford.EDU>
References: <3dn52g$ohe@newsbf02.news.aol.com> <3e750u$rv1@Radon.Stanford.EDU> <3ebjds$dkl@doc.armltd.co.uk>
NNTP-Posting-Host: sunburn.stanford.edu
Xref: Radon.Stanford.EDU comp.sys.intel:28609 sci.math.num-analysis:10504 comp.arch:16099

In article <3ebjds$dkl@doc.armltd.co.uk>, David Seal wrote:
>pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt) writes:
>
>>Other experiments suggest themselves here, such as greatly diluting the
>>...
>>Too many fun experiments possible, too little time.
>
>One that suggests itself to me: run three versions in parallel, one
>for buggy Pentium, one for correct double precision and one for
>correct extended precision. Prune a calculation out of the database if
>the inaccuracy of the correct double precision result relative to the
>correct extended precision result becomes too large.
>
>The rationale behind this: it is an attempt to model the (somewhat
>questionable) assumption that the programs people run have been
>designed to be well-behaved in the presence of double precision
>rounding errors: as long as double precision remains accurate relative
>to extended precision, the calculation is likely to be one that you
>could reasonably expect to come up in people's programs.

A roughly equivalent effect turned out to be achieved by keeping dependency chains below 100. Above 200, the double-precision-truncation errors were blown up to the "danger region" of relative errors above 10^-5. (Basis: this experiment is intended to factor in considerations of error propagation, and not too many engineering applications depend directly on the 6th decimal digit.)
Pruning only those computations actually creating these errors, rather than the more draconian measure of pruning all that are at any risk of them due to much arithmetic, may possibly give different results. I'm not sure which corresponds more closely to actual computational practice; I suppose one could interpret careful pivoting as being of the former kind, but other computations avoid this sort of magnification by avoiding excessive chaining of dependencies.
--
Vaughan Pratt                      FTP: boole.stanford.edu:/pub/ABSTRACTS
http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug71----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Article 28631 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Intel does not deliver has promised!
Date: 3 Jan 1995 20:59:53 GMT
Organization: Computer Science Department, Stanford University.
Lines: 36
Message-ID: <3ecds9$srd@Radon.Stanford.EDU>
References: <3ec0ou$r7b@goudron.gel.ulaval.ca>
NNTP-Posting-Host: sunburn.stanford.edu

In article <3ec0ou$r7b@goudron.gel.ulaval.ca>, Jean-Sebastien Dion wrote:
>
>I don't know how many of you received their new pentium chip, but I really
>think it's very rare.

I received two phone calls early in December, one to ask if I needed a replacement and establish my eligibility, the other to get my address and Visa number. That was it as far as phone conversations with Intel directly concerning the replacement were concerned.

On Dec. 27 a Fed-Ex truck arrived with a Priority Overnight package shipped from Oregon Dec. 23, containing a fixed chip (build date Dec. 21). Just five minutes ago (Jan. 3) the same Fed-Ex truck and driver arrived with a package labeled with the same product code, shipped Dec. 30 (build date Dec. 23).

Conclusions.

1. I've inadvertently signed up with the Pentium-of-the-week club. We'll see what Jan. 10th brings. :)

2. Fed-Ex seems reliable: "Priority Overnight" apparently means four days.

3. Build-to-ship went from 2 days to 7, but Xmas may be a factor. Does build-to-ship give any indication of volume?

I was planning to buy another Pentium anyway, my real research being concurrency. Intel closes another sale. At this rate I'll be up to 200 megaflops by Easter.
--
Vaughan Pratt                      FTP: boole.stanford.edu:/pub/ABSTRACTS
http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug72----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================
Newsgroups: comp.sys.intel
Subject: Re: How many clocks?
References: <013873XIAEGMJREVNGYB@abq-ros.com> <3er7tr$o8f@vkhdsu01.hda.hydro.com>
Organization: Computer Science Department, Stanford University.

In article <3er7tr$o8f@vkhdsu01.hda.hydro.com>, Terje Mathisen wrote:
>
>The clock counts are 19, 33 and 39 for single, double and extended prec
>...
>	cycles = ((bits_in_mantissa + 1) >> 1) + 6;

Assuming 24, 53, and 64 bits in the respective mantissas (counting the leading bit), this formula gives 18, 33, 38 cycles.
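A quick check of the quoted formula against the stated clock counts of 19, 33 and 39, as a small C sketch (my own, not part of the original exchange):

#include <stdio.h>

int main(void)
{
    /* mantissa widths including the leading bit, per the text */
    int bits[3] = {24, 53, 64}, i;
    for (i = 0; i < 3; i++)
        /* Terje's formula: ((bits_in_mantissa + 1) >> 1) + 6 */
        printf("%2d bits -> %2d cycles\n",
               bits[i], ((bits[i] + 1) >> 1) + 6);
    /* prints 18, 33, 38 -- one short of 19 and 39 at the two ends */
    return 0;
}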
The formula bits/2 + 7 gives the desired result.

>(The first iteration on the SRT radix-4 divider will only generate a single
>bit, while all subsequent cycles gives two more bits.)

I think I know what you mean, but it should be pointed out that the first iteration can generate two bits. Consider the case 1.1111/1.0000, for which P > 5/3 D and hence for which the plot on page 6 of the Intel white paper shows that digit +2 is mandatory. The example in Figure 4-2 indicates that at the first iteration the operands are aligned. (rp(0) in the figure seems to be a typo; I think it should be P(0).) Hence 1.1111/1.0000 (or any dividend above 1.10000) forces this case to produce the quotient 10, two bits, at the first iteration.

I think what you mean is that ultimately (possibly not until the very end, if the next 30 quotient digits are 0) that leading 1 is guaranteed to be wiped out by a subtraction. Hence it should not be counted as one of the developed quotient bits for the purpose of stopping when you have enough quotient bits. But if that's what you mean, then the other extreme 1.0000/1.1111 (more generally, whenever D

=================================================================
References: <3f9qlq$9ds@ornews.intel.com> <3fh47f$mgq@news.jf.intel.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article , Anil Das wrote:
>In article <3fh47f$mgq@news.jf.intel.com>, proussel@pdx144.intel.com
>(Patrice Roussel) writes:
>> All the entries of the guess PLA of the Pentium were exercized
>> (something you can do with few thousand (dividend,divisor) pairs).
>> Unfortunately the problem was that some entries were missing :-(.
>
>	What is meant by `missing' here? If dividend/divisor pairs
>that use those entries where tested, they will come out as 0 instead
>of 2, and the problem would have been found, right?

This depends on how they came to be 0. The Intel white paper attributes the missing entries to an error in a script written to download entries into a PLA, in which case such a test would presumably have caught this.
I think what you mean is that ultimately (possibly not until the very end,
if the next 30 quotient digits are 0) that leading 1 is guaranteed to be
wiped out by a subtraction.  Hence it should not be counted as one of the
developed quotient bits for the purpose of stopping when you have enough
quotient bits.  But if that's what you mean, then the other extreme
1.0000/1.1111 (more generally, whenever D [...]

--
Vaughan Pratt
FTP: boole.stanford.edu:/pub/ABSTRACTS
http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug73----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================

References: <3f9qlq$9ds@ornews.intel.com> <3fh47f$mgq@news.jf.intel.com>
NNTP-Posting-Host: sunburn.stanford.edu

In article , Anil Das wrote:
>In article <3fh47f$mgq@news.jf.intel.com>, proussel@pdx144.intel.com (Patrice
>Roussel) writes:
>> All the entries of the guess PLA of the Pentium were exercized (something
>> you can do with few thousand (dividend,divisor) pairs). Unfortunately the
>> problem was that some entries were missing :-(.
>
> What is meant by `missing' here? If dividend/divisor pairs
>that use those entries where tested, they will come out as 0 instead
>of 2, and the problem would have been found, right?

This depends on how they came to be 0.  The Intel white paper attributes
the missing entries to an error in a script written to download entries
into a PLA, in which case such a test would presumably have caught this.
However, at a lecture I gave on the Pentium bug a week ago at Stanford,
William Kahan, who was in the audience, said that the five entries were
missing because it had been proved (erroneously) that they were never
accessed by the division algorithm, like all the entries above them in the
table.  Obviously one cannot test any of the inaccessible entries by
running the algorithm, so if you believe these five are inaccessible then
you will of course not provide a test case for them.

Having heard such conflicting accounts from Intel and Kahan, neither of
whom I am competent to contradict, I am now thoroughly mystified as to the
real origin of the bug.

At this point the only apparently relevant fact about the table that I am
100% certain of is the following.  In the five erroneous columns of the
PD-plot (these being the ones we can measure, thanks to the bug), the six
thresholds (three each in the positive and negative halves) are set to the
values I have tabulated below.  The relevant fact is that these values
would be perfectly symmetric about -1 had the erroneous threshold (at the
top of the positive half) been set correctly, namely one cell higher.
This higher threshold would then not illegally penetrate the 8/3 D line
(shown in the Intel white paper), the root cause of the problem.

(The reason -1 is the appropriate midpoint is that sampling truncates the
partial remainder towards -oo in both halves, by an amount between 0 and 2
thanks to the redundant carry-save representation of the partial
remainder.  Had an irredundant representation been used, the truncation
would have been by an amount between 0 and 1 and the appropriate midpoint
would have been -0.5.  Had the sampled partial remainder been truncated
towards oo instead of -oo, the midpoint would have been positive instead
of negative.)

In order for the bug to have been caused by a faulty proof, both those who
wrote the proof and those who read it would need to believe that a correct
table could have the inner two threshold pairs (between absolute-value
quotient digits 0 and 1, and between 1 and 2) perfectly symmetric yet have
the outer pair (between 2 and "3") asymmetric.  I have been unable to come
up with any plausible failure mode for the proof which would reasonably
account for confidence in such a peculiarly isolated asymmetry.  Perhaps
none of the parties examining the table noticed that the inner two pairs
of thresholds were symmetric.  Or perhaps some reason was advanced for why
there should be a single exception to the symmetry that somehow everyone
involved found convincing.  I find neither of these scenarios terribly
plausible, and am therefore mystified as to what form a faulty proof could
have taken.

Here are the observable thresholds, originally told to me by Tim Coe,
which I have since confirmed independently by verifying that any other
choice of thresholds would be inconsistent with the observed behavior of
the buggy Pentium.  I've laid this out here as a transposed PD-plot, with
the five observable divisors (17,20,23,26,29) indexing five rows (oriented
as columns in the white paper), and with the entries (quotient digits) in
each region of the table indexing the columns.  The entries in my table
are the ranges of chopped (sampled) partial remainder which yield the
indicated quotient digits.  The six vertical bars in each row separate the
seven regions and constitute the thresholds between the regions.

(An unambiguous way to realize these as numeric thresholds might be as the
mean of the two adjoining chopped partial remainders, e.g. at D=17 the 1-2
threshold, between 11 and 12, could be considered to be set at 11.5.)

 \ q =     0("-3")      -2         -1        0       1        2     0("3")
D_5\---------------------------------------------------------------------
17=1.0001  ...-26 | -25..-14 | -13..-5 | -4..2 | 3..11 | 12..22 | 23...
20=1.0100  ...-30 | -29..-16 | -15..-6 | -5..3 | 4..13 | 14..26 | 27...
23=1.0111  ...-34 | -33..-18 | -17..-6 | -5..3 | 4..15 | 16..30 | 31...
26=1.1010  ...-38 | -37..-20 | -19..-7 | -6..4 | 5..17 | 18..34 | 35...
29=1.1101  ...-42 | -41..-22 | -21..-7 | -6..4 | 5..19 | 20..38 | 39...
                  ok         ok        ok      ok      ok       ouch

The first five thresholds are marked as ok.  The sixth, marked ouch, is
one too low.
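The claimed symmetry is mechanical to check.  The following sketch (mine,
not from the original post) encodes the six thresholds per observable
divisor, doubled so that half-integral boundaries such as 11.5 stay
integral; symmetry about -1 means each matched pair of thresholds should
sum to -2, i.e. to -4 doubled:

    #include <stdio.h>

    /* Six thresholds per observable divisor, doubled (11.5 -> 23).
       Order: the three negative-half thresholds, then the three
       positive-half ones, reading left to right across the table. */
    static const int D[5] = {17, 20, 23, 26, 29};
    static const int t[5][6] = {
        {-51, -27,  -9,  5, 23, 45},   /* D=17 */
        {-59, -31, -11,  7, 27, 53},   /* D=20 */
        {-67, -35, -11,  7, 31, 61},   /* D=23 */
        {-75, -39, -13,  9, 35, 69},   /* D=26 */
        {-83, -43, -13,  9, 39, 77},   /* D=29 */
    };

    int main(void)
    {
        for (int i = 0; i < 5; i++)
            for (int j = 0; j < 3; j++) {
                /* pair the j-th threshold out from the middle on each
                   side; symmetric pairs sum to -4 in doubled units */
                int sum = t[i][2-j] + t[i][3+j];
                printf("D=%d pair %d: %s\n", D[i], j + 1,
                       sum == -4 ? "ok" : "ouch, one cell low");
            }
        return 0;
    }

Both inner pairs sum to -4 at every divisor; the outer pair comes up short
by 2 (one cell) in every column, which is exactly the ouch above.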
Although the 11 other chopped divisors, namely D=16,18,19,21,22,24,25,
27,28,30,31, are not observable via the bug, Tim conjectures that their
thresholds are those of their "ceilings", defined as the smallest buggy
divisor above them; e.g. D=18 and 19 have the exact same thresholds as
D=20.  This leaves D=30 and 31 with unmeasured thresholds, but by
extending the evident arithmetic progressions these are reasonably
conjectured to be

31=1.1111  ...-46 | -45..-24 | -23..-8 | -7..5 | 6..21 | 22..42 | 43...

With that regularity the table would only require 6 "fat" columns (one per
three chopped divisors, less one at each end) instead of 16 (one per
chopped divisor), shrinking the table to 6/16 of its original size.
Sampling the divisor to determine which of the 6 columns to use would only
be done once, at setup.  In contrast, sampling the partial remainder is
done at every cycle of the algorithm, and it may well be preferable to
keep separate positive and negative halves of the table to avoid the extra
time required to reflect a negative sampled remainder into the positive
half.  That the actual table is asymmetric is very strong evidence that
the positive and negative halves are represented separately.  (But not
conclusive evidence---a simple logic circuit could in principle have been
used to lower the top threshold of the positive half by 1.  Such an ad hoc
addition to the logic is however almost surely excluded on the ground that
it would serve no purpose even if the lowered threshold *had* been
correct.)

Here is a slick and easily proved closed-form formula for the entries of
the positive half of a *correct* table.  If D (between 16 and 31) is the
chopped divisor, let d = D/3 + 1 (integer division, so that d runs from 6
to 11 and serves as an index to the 6 "fat" columns), and let p be the
chopped partial remainder, p >= 0.  Take the PD-plot table entry at row p
and fat column d to be given by the closed-form formula

    (d <= 2*p) + (2*d <= p)

where Booleans are numerically 0 if false and 1 if true (as in C).  Thus
at D = 17 we have d = 17/3 + 1 = 5 + 1 = 6, and we then get for
p = 2,3,11,12 the values

     p    d <= 2*p    2*d <= p    (d<=2*p) + (2*d<=p)
    --------------------------------------------------
     2        0           0                0
     3        1           0                1
    11        1           0                1
    12        1           1                2

These four examples show that this closed-form formula locates the two
inner thresholds exactly where the Pentium puts them, at least for D=17;
similar calculations confirm that the formula tracks the Pentium's inner
two thresholds perfectly at the other divisors.  One can imagine ways to
evaluate this formula sufficiently fast during each division cycle as to
obviate the need for a precomputed table of its values.
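As a sanity check, the formula is a one-liner in C; this little driver (a
sketch of mine, not Intel's logic) reproduces the four worked values for
D = 17:

    #include <stdio.h>

    /* Closed-form entry for the positive half of a correct PD-plot
       table: (d <= 2*p) + (2*d <= p), with d = D/3 + 1 (integer
       division).  C comparisons evaluate to 0 or 1. */
    static int entry(int p, int D)
    {
        int d = D / 3 + 1;               /* fat-column index, 6..11 */
        return (d <= 2*p) + (2*d <= p);
    }

    int main(void)
    {
        static const int ps[4] = {2, 3, 11, 12};
        for (int i = 0; i < 4; i++)
            printf("D=17  p=%2d  digit %d\n", ps[i], entry(ps[i], 17));
        return 0;
    }

Looping p over the sampled range for each D tabulates the entire positive
half.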
This formula makes the inaccessible entries 2.  If it is desired to reduce
power by setting these to something else, e.g. 0 or 3, this can be safely
accomplished using the easily proved fact that, with one exception, they
are precisely those entries satisfying 4*d <= p.  Note that this formula
does *not* produce the notorious bug, which corresponds to using the more
complicated formula 4*d <= p+1 (in the positive half), which compromises
the rightmost third of each fat column.  The one exception mentioned above
is d = 11, p = 43, where, thanks to the absence of column D=32, the more
complicated formula does not produce the bug; otherwise there would be 6
groups of problem divisors instead of 5.  However, I would imagine the
tiny economy of setting one additional entry to 0 is not worth the trouble
and attendant risk of error; stick to 4*d <= p uniformly and let d=11,
p=43 stay at 2.

The negative half of the table is obtained simply by reflecting the
positive half about -1, also easily proved correct (an outline of this
proof is in the parenthetical paragraph near the beginning of this message
starting "The reason").

Had these simple and easily derived formulas been known for the PD-plot
table, all this trouble would have been avoided.  There is *nothing* so
effective as simplicity for avoiding bugs, whether in computers,
mathematics, or any other area where embarrassing and/or expensive
mistakes are possible.

--
Vaughan Pratt
FTP: boole.stanford.edu:/pub/ABSTRACTS
http://boole.stanford.edu/boole.html

=================================================================
-------------------------PENTIUM REPORT # bug74----------------------
Vaughan R. Pratt
Department of Computer Science
Stanford University
Stanford, CA 94304
pratt@cs.stanford.edu
======================================================================

Article 31190 of comp.sys.intel:
Path: Radon.Stanford.EDU!news.Stanford.EDU!Sunburn.Stanford.EDU!pratt
From: pratt@Sunburn.Stanford.EDU (Vaughan R. Pratt)
Newsgroups: comp.sys.intel
Subject: Re: Pentium Math Error
Date: 21 Jan 1995 20:03:26 GMT
Organization: Computer Science Department, Stanford University.
Lines: 34
Message-ID: <3frpae$ovk@Radon.Stanford.EDU>
References:
NNTP-Posting-Host: sunburn.stanford.edu

In article , Warren Rath wrote:
>
>Now ask your self how many times does the calculation of
> 4,195,835 / 3,145,727 actually come up in computing??

Once every seventeen comp.sys.intel posts, or once a minute.  :)

I prefer 5/15 with a millionth knocked off top and bottom, as being easier
to remember than 4195835/3145727.  4.999999/14.999999 should have six 3's,
but the Pentium only produces four of them.
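For anyone who wants to try it, here is the one-liner (mine); on a correct
divider it prints the six 3's, while the flawed Pentium's FDIV produced
only four:

    #include <stdio.h>

    /* The bruised 5/15: a correct divider prints 0.33333329,
       with six 3's; the flawed Pentium produced only four. */
    int main(void)
    {
        printf("%.8f\n", 4.999999 / 14.999999);
        return 0;
    }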
If you take the number 1 and randomly add, subtract, multiply, or divide
it by itself, and save the result, and keep doing this with a mix of more
1's and these results, then in the course of an hour you will see the bug
more than a dozen times.

This is the dual experiment to Intel's.  Intel takes a program consisting
of a single instruction, FDIV, sends pairs of random numbers to it, and
gets an error rate of 1 in 9 billion random pairs of numbers.  In contrast
my experiment above takes a database starting with a single number, 1,
sends it to random programs, and gets an error rate of 1 in five minutes
of random computing.

This shows that the error rate depends on both the kind of programs you
use and the kind of data flowing through them.  If your program consists
of just the FDIV instruction and your data is uniformly distributed, then
you should be as safe as Intel says.  If you have more complicated
programs and less complicated numbers, the above experiment indicates that
you could be at much greater risk than either Intel or IBM says.

--
Vaughan Pratt
FTP: boole.stanford.edu:/pub/ABSTRACTS
http://boole.stanford.edu/boole.html