=======================================================================
'My Hairiest Bug' War Stories  [PART 1]
[early extended version of paper which appeared in Comm. ACM, 40 (4) April 1997]

Marc Eisenstadt
Knowledge Media Institute
The Open University
Milton Keynes MK7 6AA UK
M.Eisenstadt@open.ac.uk
=======================================================================

ASCII documents available from ftp://kmi-ftp.open.ac.uk/pub/bugtales:

=>bugtales.1o4 (Main part 1: Abstract, intro, data analysis)
  bugtales.2o4 (Main part 2: Relating the dimensions, legacy, ref's)
  bugtales.3o4 (Appendix A: Selected anecdotes - see also bugdata.txt)
  bugtales.4o4 (Appendix B: Condensed data tables)
  bugdata.txt  (ASCII raw data from 1st trawl)

PURE ASCII VERSION FOR EMAIL/BBOARD POSTINGS
Be sure to print in monospace font (e.g. Courier 10pt., 6 3/4" margin).

All of the above documents are available via FTP from kmi-ftp.open.ac.uk (login: anonymous, password: , directory /pub/bugtales) or by email from M.Eisenstadt@open.ac.uk.
=======================================================================

ABSTRACT

A worldwide trawl for debugging anecdotes elicited replies from 78 respondents, including a number of implementors of well-known commercial software. The stories included descriptions of bugs, bug-fixing strategies, discourses on the philosophy of programming, and several highly amusing and informative reminiscences. An analysis of the anecdotes reveals three primary dimensions of interest: why the bugs were difficult to find, how the bugs were found, and root causes of bugs. Half of the difficulties arose from just two sources: (i) large temporal or spatial chasms between the root cause and the symptom, and (ii) bugs that rendered debugging tools inapplicable. Techniques for bug-finding were dominated by reports of data-gathering (e.g. print statements) and hand-simulation, which together accounted for almost 80% of the reported techniques. The two biggest causes of bugs were (i) memory overwrites and (ii) vendor-supplied hardware or software faults, which together accounted for more than 40% of the reported bugs. The paper discusses the implications of these findings for the design of program debuggers, and explores the possible role of a large repository/data base of debugging anecdotes.

ACKNOWLEDGEMENTS

Apple Computer, Inc.'s Advanced Technology Group provided a supportive and productive environment during the summer of 1992, enabling me to undertake this analysis along with other concomitant activities. Parts of this research were also funded by the UK SERC/ESRC/MRC Joint Council Initiative on Cognitive Science and Human Computer Interaction, and by the Commission of the European Communities ESPRIT-II Project 5365 (VITAL). Mike Brayshaw and Blaine A. Price (Open University) provided valuable comments on several drafts of this paper.

INTRODUCTION

The work reported here attempts to expand the single-user-log-book approach of Knuth (1989) to investigate the phenomenology of debugging across a large population of users. The original aim of the study was to inform ongoing research in my laboratory on the development of program visualization and debugging tools (Eisenstadt et al., 1990; 1992; 1993), for which it seemed essential to have first-hand knowledge of the experiences of a large body of professional programmers. We had our own experiences to draw on, including extensive work with novice programmers, but it seemed important to look further afield.
In particular, we had been growing increasingly concerned about the problems of scalability: academic work on program visualization and debugging environments would remain vulnerable to criticisms of "toy examples only" unless an attempt was made to address the problems faced by professional programmers working on very large programming tasks. Toward this end, I conducted a survey of professional programmers, asking them to provide stories describing their most difficult bugs involving large pieces of software. The survey was conducted by electronic mail and conferencing/bulletin board facilities with worldwide access (BIX, CompuServe, Internet Usenet, AppleLink). My contribution is to gather, edit and annotate the stories, and to categorize them in a way which may help to shed some light on the nature of the debugging enterprise. In particular, I look at the lessons learned from the stories, and discuss what they tell us about what is needed in the design of future debugging tools. I also explore the possible role of a large repository/data base of debugging anecdotes. The discussion throughout concentrates on the relationships among three critical dimensions: (a) why the bugs were hard to find, (b) how the bugs were found, and (c) the root causes of the bugs.

THE TRAWL

Raw data
========

On 3 March, 1992, I posted the following request for debugging anecdotes on BIX, the "BYTE Information Exchange":

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Fig. 1--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
c.language/tools #2842, from meisenstadt
771 chars, Tue Mar 3 06:09:28 1992
Comment(s).

TITLE: Trawl for debugging anecdotes (w/emphasis on tools side)...

I'm looking for some (serious) anecdotes describing debugging experiences. In particular, I want to know about particularly thorny bugs in LARGE pieces of software which caused you lots of headaches. It would be handy if the large piece of software were written in C or C++, but this is not absolutely essential. I'd like to know how you cracked the problem-- what techniques/tools you used: did you 'home in' on the bug systematically, did the solution suddenly come to you in your sleep, etc. A VERY brief stream-of-consciousness reply (right now!) would be much much better than a carefully-worked-out story. I can then get back to you with further questions if necessary.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Figure 1. The original "trawl" request, posted on BIX. c.language is the name of the conference (major subject category), and "tools" is the name of the topic (minor interest sub-category). BIX has 60,000 members, hundreds of conferences and thousands of topics, each with thousands of messages and replies.

A second message, explaining my motivation, was also posted. During the ensuing months, similar messages were posted to AppleLink, CompuServe, various Usenet newsgroups on the Internet, and the Open University's own conferencing system (OU CoSy). The trawl request elicited replies from 78 "informants", mostly in the USA and UK. The group included implementors of very well-known commercial C compilers, members of the ANSI C++ definition group, and other known commercial software developers. Figure 2 shows a typical reply to the original request.
In that reply, the informant describes many important facets of his debugging experience: (a) the background context (MS-DOS 8086 compiler); (b) the symptom (wrong value on stack); (c) how he approached the problem (single-stepping using an assembly-level debugger); (d) why it was hard (the change happened more quickly at run-time than he could observe while single-stepping, plus he had the wrong model of which way the stacks grew, which made it harder to understand the behaviour); and (e) what the root cause of the problem was (the address below the stack pointer was being wiped out by the operating system interrupt handlers, and the pointer was being decremented too late in the compiled code).

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Fig. 2--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
"I had a bug in a compiler for 8086's running MSDOS once that stands out in my mind. The compiler returned function values on the stack and once in a while such a value would be wrong. When I looked at the assembly code, all seemed fine. The value was getting stored at the correct location of the stack. When I stepped thru it in the assembly-level debugger and got to that store, sure enough, the effective address was correct in the stack frame, and the right value was in the register to be stored. Here's the weird thing --- when I stepped through the store instruction the value on the stack didn't change. It seems obvious in retrospect, but it took some hours for me to figure out that the effective address was below the stack pointer (stacks grow down here), and the stored value was being wiped out by os interrupt handlers (that don't switch stacks) about 18 times a second. The stack pointer was being decremented too late in the compiled code."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Figure 2. A typical debugging anecdote.

A representative sample of the raw data is presented in Appendix A, and a condensed table is presented in Appendix B {enclosed as a subsequent file/message in this ASCII version}.

A total of 110 messages were generated by 78 different informants. Of those, 50 informants specifically told a story about a nasty bug. A few informants provided several anecdotes, and in all a total of 59 bug anecdotes were collected. A second collection of anecdotes, which will be used subsequently for a top-down analysis by comparing it against the first collection, yielded an additional 58 messages from 47 more informants, of whom 45 told specific debugging stories. A few representative raw anecdotes are presented in Appendix A. The next sections present a detailed summary of the raw data, followed by an analysis and discussion of the lessons learned.

ANALYSIS OF THE ANECDOTES

Dimensions of analysis: why difficult, how found, and root cause
================================================================

Although the "root cause" of reported bugs is of a priori interest, in order to fully characterise the phenomenology of the debugging experiences I needed to look at more than the causes of the bugs. In particular, it was necessary to say something about why a bug was hard to find (which might or might not be related to the underlying cause), and how it was found (which might or might not be related to the underlying cause and the reason for the difficulty). Thus, three dimensions became the critical focus for the ensuing analysis:

* Why difficult: This identifies the reason that the debugging experience itself was tricky or painful, e.g. perhaps the bug rendered the debugging tools unusable.
* How found: This identifies the diagnostic methodologies or techniques that were used in resolving the problem, e.g. single-stepping the code or inserting hand-tailored print statements.

* Underlying cause of bug: This identifies the root cause of the bug which, when fixed, means that the programmer has either totally solved the problem or else has gone far enough to regard the problem as being "in hand".

We know something about each of these dimensions from previous studies, notably those of Vesey (1989), Katz and Anderson (1988), Johnson et al. (1983), Spohrer et al. (1985), Knuth (1983), and du Boulay (1986). These are discussed in an expanded version of this paper (Eisenstadt, 1993b). I found it informative to evolve my own categories in a largely bottom-up fashion after extensive inspection of the data, and then compare them specifically with the ones provided by Knuth. The comparison is provided in Table B-1 (Appendix B) by means of a bracketed label such as "{A}" or "{L}" following relevant entries. In some cases, there is a straightforward mapping, but in others it is more subtle. For instance, there is sometimes a one-to-many mapping between my categories and those of Knuth. The reason is that Knuth knew his state of mind at the time he committed the errors reported in his log, whereas I have to rely on my interpretation of the programmer's anecdote.

Even when the programmer's state of mind is clearly assessable, the assignment of blame can be awkward, because a given cause may always have an even deeper root cause. The criterion I have adopted for identifying root causes is as follows: when the programmer is essentially satisfied that several hours or days of bewilderment have come to an end once a particular culprit is identified, then that culprit is the root cause, even when deeper causes can be found. I have adopted this approach (a) because it nips a possible infinite regress in the bud, (b) because it is consistent with my emphasis on the phenomenology of debugging, i.e. what is apparently taking place as far as the front-line programmer is concerned, and (c) because it enables me to concentrate on what the programmers reported, and not try to second-guess them.

The subsections which follow describe the three dimensions of analysis (why difficult; how found; root cause) in turn.

Dimension 1: Why difficult
==========================

Categories
----------

The reasons that the bug was hard to trap fell into five categories, as described below:

* cause/effect chasm: Often the symptom is far removed in space and/or time from the root cause, and this can make the cause hard to detect. Specific instances can involve timing or synchronization problems, bugs which are intermittent, inconsistent, or infrequent, and bugs which materialize "far away" (e.g. thousands of iterations) from the actual place they are spawned. In this general category I have also included debugging episodes in which there are too many degrees of freedom. As an example of such an episode, consider the case of a piece of software which works perfectly in one environment, yet fails to work in another environment. If many things have changed (e.g. different hardware, different compiler, different linker), then there are simply too many degrees of freedom to enable systematic testing under controlled conditions to isolate the bug. Such testing can be done, given enough time and resources, but it is difficult -- hence this category. (A minimal code sketch of such a chasm appears at the end of this subsection.)
* tools inapplicable or hampered: Most programmers have encountered so-called "Heisenbugs", named after the Heisenberg uncertainty principle in physics: the bug goes away when you switch on the debugging tools! Other variations within this category are: long run to replicate (i.e. the bug takes a long time to replicate on a fresh execution, so if switching on the debugging tool significantly slows down execution, then it cannot really be used); stealth bug (i.e. the error itself consumes the evidence that you need to find the bug, or even clobbers the debugging tool); context precludes (i.e. some configuration or memory constraints make it impractical or impossible to use the debugging tool).

* WYSIPIG (What you see is probably illusory, guv'nor): I have coined this expression to reflect the cases in which the programmer stares at something which simply is not there, or is dramatically different from what it appears to be. This can range from syntactic problems (e.g. hallucinating a key word in the code which is actually not there) to run-time observations (looking at a value on the screen which is displayed in a different way, e.g. "10" in an octal display being misinterpreted as meaning 7+3 rather than 7+1). Such observations lead the programmer on a wild-goose chase, and can be the reason why certain otherwise simple bugs take a long time to track down.

* faulty assumption/model or mis-directed blame: If you think that stacks grow up rather than down (as did the informant in Figure 2), then bugs which are related to this behaviour are going to be hard to detect. Equally, if you bet your life on the known correct behaviour of a certain function (which is actually faulty), you are going to spend a lot of time looking in the wrong place.

* spaghetti (unstructured) code: Informants sometimes reported that the code "was in a mess" when they were called in to deal with it. There is, unsurprisingly, a 100% correlation between complaints about "ugly" code and the assertions that "someone else" wrote the code.

Results
-------

The frequency of occurrence of the different reasons for having difficulty is shown in Table 1.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Table 1--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
CATEGORY                                               OCCURRENCES
cause/effect chasm.......................................15
tools inapplicable or hampered...........................12
WYSIPIG: What you see is probably illusory, guv'nor.......7
faulty assumption/model or mis-directed blame.............6
spaghetti (unstructured) code.............................3
??? (no information)......................................8
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Table 1. Why the bugs were difficult to track down.

Thus, 53% of the difficulties are attributable to just two sources: (i) large temporal or spatial chasms between the root cause and the symptom, and (ii) bugs that rendered debugging tools inapplicable. The high frequency of reports of cause/effect chasms accords well with the analyses of Vesey (1989) and Pennington (1987), which argue that the programmer must form a robust mental model of correct program behaviour in order to detect bugs -- the cause/effect chasm seriously undermines the programmer's efforts to construct a robust mental model. The relationship of this finding to the analysis of the other dimensions is reported below.
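To make the dominant "cause/effect chasm" category concrete, here is a minimal C sketch -- my own hypothetical illustration, not drawn from any informant's anecdote -- in which a single off-by-one write silently corrupts a neighbouring variable, while the visible symptom (every accumulated total ending up zero) only emerges many thousands of iterations later, in code that looks entirely innocent. It also happens to exemplify the "mem" root cause discussed under Dimension 3 below.

/* Hypothetical illustration of a "cause/effect chasm" (not from the
 * informants' code).  The faulty write happens once, early on; the
 * symptom appears thousands of iterations later, somewhere else.     */
#include <stdio.h>

#define N 10

struct state {
    int table[N];
    int scale;                 /* laid out right after 'table'        */
};

static struct state s = { {0}, 1 };

static void reset_table(void)
{
    int i;
    for (i = 0; i <= N; i++)   /* BUG: "<=" writes one slot too far   */
        s.table[i] = 0;        /* and, on typical layouts, silently   */
}                              /* clobbers s.scale (setting it to 0)  */

int main(void)
{
    long step;

    reset_table();                          /* root cause is here ... */

    for (step = 0; step < 100000; step++)
        s.table[step % N] += (int)(step % N) * s.scale;

    /* ... but the symptom (every total is zero) is only noticed here,
     * far away in time and space from the off-by-one loop above.     */
    printf("total for bin 3 = %d\n", s.table[3]);
    return 0;
}

The distance between the loop that does the damage and the printout that finally reveals it is exactly the kind of temporal and spatial gap the informants complained about.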
Dimension 2: How found
======================

Categories
----------

The informants reported four major bug-catching techniques, as follows:

* gather data: This category refers to cases in which the informant decided to "find out more", say by planting print statements or breakpoints. In fact, a variety of specific data collection techniques were reported. The important thing that distinguishes these data collection techniques from the "controlled experiments" category described below is their bottom-up nature. That is, the informants may have had a rough idea of what they were looking for, but were not explicitly testing any hypotheses in a systematic way. Notice that this category is different both from Katz & Anderson's (1988) causal reasoning and from their data-driven hand simulation: it is really a hybrid of the two, because it is bottom-up, on the one hand, yet can lead directly to a causal analysis once the data has been gathered. Here are the six sub-categories reported by the informants (a minimal sketch of the "wrap & profile" style appears at the end of this subsection, after Table 2):

  + step & study: the programmer single-steps through the code, and studies the behaviour, typically monitoring changes to data structures

  + wrap & profile: tailor-made performance, metric, or other profiling information is collected by "wrapping" (enclosing) a suspect function inside a one-off variant of that function which calls (say) a timer or data-structure printout both before and after the suspect function.

  + print & peruse: print statements are inserted at particular points in the code, and their output is observed during subsequent runs of the program

  + dump & diff: either a true core dump or else some variation (e.g. voluminous output of print statements) is saved to two text files corresponding to two different execution runs; the two files are then compared using a source-compare ("diff") utility, which highlights the differences between the two execution runs

  + conditional break & inspect: a breakpoint is inserted into the code, typically triggered by some specific behaviour; data values are then inspected to determine what is happening

  + specialist profile tool (MEM or Heap Scramble): there are several off-the-shelf tools which detect memory leaks and corrupt or illegal memory references, and the experts who relied on these also tended to rave about their value.

* "inspeculation": This name is meant to be a hybrid of "inspection" (code inspection), "simulation" (hand-simulation), and "speculation", which were among a wide variety of techniques mentioned explicitly or implicitly by informants: cogitation, meditation, observation, inspection, contemplation, hand-simulation, gestation, rumination, dedication, inspiration. In other words, they either go away and think about something else for a while, or else spend a lot of time reading through the code and thinking about it, possibly hand-simulating an execution run. "Articulation" (explaining to someone else how the code works) also fits here. The point is that this family of techniques does not involve any experimentation or data gathering, but rather involves "thinking about" the code.

* expert recognized cliche: These are cases where the programmer called upon a cohort, and the cohort was able to spot the bug relatively simply. This recognition corresponds to the heuristic 'mapping' observed by Katz & Anderson (1988).
  The very nature of my data-gathering exercise invariably requires a cohort (rather than the informant) to have detected the bug: the informant would not have been stumped, nor have bothered telling me about the bug, had he or she been able to identify the solution quickly in the first place!

* controlled experiments: Informants resorted to specific controlled experiments when they had a clear idea about what the root cause of the bug might be.

Results
-------

The frequency of occurrence of the different debugging techniques is shown in Table 2.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Table 2--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
CATEGORY                     OCCURRENCES
gather data...................27
inspeculation.................13
expert recognized cliche.......5
controlled experiments.........4
??? (no information)...........2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Table 2. Techniques used to track down the bugs.

Techniques for bug-finding are clearly dominated by reports of data-gathering (e.g. print statements) and hand-simulation, which together account for 78% of the reported techniques, and highlight the kind of "groping" that the programmer is reduced to in difficult debugging situations.
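As a concrete illustration of the dominant "gather data" family, here is a minimal C sketch of the "wrap & profile" style described above. It is my own hypothetical example, not taken from the informants' code: a suspect function is renamed and enclosed in a one-off wrapper, installed under the original name, which prints its arguments, its result, and its elapsed time on every call.

/* Hypothetical sketch of "wrap & profile": the suspect function is
 * renamed and enclosed in a one-off wrapper that gathers data
 * (arguments, result, elapsed time) on every call.                   */
#include <stdio.h>
#include <time.h>

/* The function under suspicion (renamed from its original name). */
static double suspect_real(double x, int n)
{
    double acc = 1.0;
    while (n-- > 0)
        acc *= x;
    return acc;
}

/* One-off wrapper, installed under the original name so that no
 * callers need to change; delete it once the bug is cornered.        */
static double suspect(double x, int n)
{
    clock_t t0 = clock();
    double  r  = suspect_real(x, n);
    clock_t t1 = clock();

    fprintf(stderr, "suspect(x=%g, n=%d) = %g  [%ld ticks]\n",
            x, n, r, (long)(t1 - t0));
    return r;
}

int main(void)
{
    /* Every call now leaves a trace on stderr for later perusal. */
    printf("%g\n", suspect(1.5, 10));
    printf("%g\n", suspect(2.0, 30));
    return 0;
}

A "dump & diff" session in the same spirit would simply redirect two such traces -- one from a working run and one from a failing run -- to text files and compare them with a source-compare utility.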
Let's now turn to an analysis of the root causes of the bugs before we go on to see how the different dimensions interrelate.

Dimension 3: Root cause
=======================

Categories
----------

The bug causes reported by the informants fell into the following nine categories:

* mem: Memory clobbered or used up. This cause has a variety of manifestations (e.g. overwriting a reserved portion of memory, and thereby causing the system to crash) and may even have deeper causes (e.g. array subscript out of bounds), yet is often singled out by the informants as being the source of the difficulty. Knuth has an analogous category, which he calls "D = Data structure debacle".

* vendor: Vendor's problem (hardware or software). Some informants report buggy compilers or faulty logic boards, for which they either need to develop a workaround or else wait for the vendor to provide corrective measures.

* des.logic: Unanticipated case (faulty design logic). In such cases, the algorithm itself has gone awry, because the programmer has not worked through all the cases correctly. This category encompasses both those which Knuth labels as "A = algorithm awry" and also those labelled as "S = surprise scenario". Knuth's A-vs.-S distinction can only be resolved by in-depth introspection, and is too fine-grained for the purposes of this study.

* init: Wrong initialization; wrong type; definition clash. A programmer will sometimes make an erroneous type declaration, or re-define the meaning of some system keyword, or incorrectly initialize a variable. I refer to all of these as "init" errors, since the program begins with its variables, data structures, or function definitions in an incorrect starting state.

* var: Wrong variable or operator. Somehow, the wrong term has been used. The informant may not provide enough information to deduce whether this was really due to faulty design logic (des.logic) or whether it was a trivial lexical error (lex), though in the latter case trivial typos are normally mentioned explicitly as the root cause.

* lex: Lexical problem, bad parse, or ambiguous syntax. These are meant to be trivial problems, not due to the algorithm itself, nor to faulty variables or declarations. This class of errors encompasses Knuth's "B = Blunder" and "T = Typo", which are hard to distinguish in informants' reports.

* unsolved: Unknown and still unsolved to this day. Some informants never solved their problem!

* lang: Language semantics ambiguous or misunderstood. In one case, an informant reports that he thought that 256K meant 256000, which is incorrect, and can be thought of as a semantic confusion. In another case, an informant reported a mismatch between the way a manual described some maximum value and the way in which it was actually dealt with by the compiler.

* behav: End-user's (or programmer's) subtle behaviour. For example, in one case the bug was caused by an end-user mysteriously depressing several keys on the keyboard at once, and in another case the bug involved some mischievous code inserted as a joke. These are really manifestations of behaviour external to the normal programming arena, but still warrant a category in their own right.

Results
-------

Table 3 displays the frequency of occurrence of the nine underlying causes.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Table 3--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
CATEGORY                                                    OCCURRENCES
mem: Memory clobbered or used up.............................13
vendor: Vendor's problem (hardware or software)...............9
des.logic: Unanticipated case (faulty design logic)...........7
init: Wrong initialization; wrong type; definition clash......6
lex: Lexical problem, bad parse, or ambiguous syntax..........4
var: Wrong variable or operator...............................3
unsolved: unknown and still unsolved to this day..............3
lang: language semantics ambiguous or misunderstood...........2
behav: end-user's (or programmer's) subtle behaviour..........2
??? (no information)..........................................2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Table 3. Underlying causes of the reported bugs.

Table 3 indicates that the biggest culprits were memory overwrites and vendor-supplied hardware/software problems. Even ignoring vendor-specific difficulties, one implication of Table 3 is that 37% of the nastiest bugs reported by professionals could be addressed by (a) memory-analysis tools and (b) smarter compilers which trapped initialization errors. But what about the interaction between the cause of the bug, the reason for the debugging difficulty, and the debugging technique? That is precisely the focus of the next section.

------------------------------------------------------------------------
[Next: bugtales.2o4 (Relating the dimensions, legacy, ref's)]