=======================================================================
'My Hairiest Bug' War Stories  [PART 1]
[early extended version of paper which appeared in Comm. ACM, 40 (4) April 1997]

Marc Eisenstadt
Knowledge Media Institute
The Open University
Milton Keynes MK7 6AA UK
M.Eisenstadt@open.ac.uk
=======================================================================

ASCII documents available from ftp://kmi-ftp.open.ac.uk/pub/bugtales:

=>bugtales.1o4 (Main part 1: Abstract, intro, data analysis)
  bugtales.2o4 (Main part 2: Relating the dimensions, legacy, ref's)
  bugtales.3o4 (Appendix A: Selected anecdotes - see also bugdata.txt)
  bugtales.4o4 (Appendix B: Condensed data tables)
  bugdata.txt  (ASCII raw data from 1st trawl)

PURE ASCII VERSION FOR EMAIL/BBOARD POSTINGS
Be sure to print in monospace font (e.g. Courier 10pt., 6 3/4" margin).

All of the above documents are available via FTP from kmi-ftp.open.ac.uk (login: anonymous, password: , directory /pub/bugtales) or by email from M.Eisenstadt@open.ac.uk.
=======================================================================

ABSTRACT

A worldwide trawl for debugging anecdotes elicited replies from 78 respondents, including a number of implementors of well-known commercial software. The stories included descriptions of bugs, bug-fixing strategies, discourses on the philosophy of programming, and several highly amusing and informative reminiscences. An analysis of the anecdotes reveals three primary dimensions of interest: why the bugs were difficult to find, how the bugs were found, and root causes of bugs. Half of the difficulties arose from just two sources: (i) large temporal or spatial chasms between the root cause and the symptom, and (ii) bugs that rendered debugging tools inapplicable. Techniques for bug-finding were dominated by reports of data-gathering (e.g. print statements) and hand-simulation, which together accounted for almost 80% of the reported techniques. The two biggest causes of bugs were (i) memory overwrites and (ii) vendor-supplied hardware or software faults, which together accounted for more than 40% of the reported bugs. The paper discusses the implications of these findings for the design of program debuggers, and explores the possible role of a large repository/data base of debugging anecdotes.

ACKNOWLEDGEMENTS

Apple Computer, Inc.'s Advanced Technology Group provided a supportive and productive environment during the summer of 1992, enabling me to undertake this analysis along with other concomitant activities. Parts of this research were also funded by the UK SERC/ESRC/MRC Joint Council Initiative on Cognitive Science and Human Computer Interaction, and by the Commission of the European Communities ESPRIT-II Project 5365 (VITAL). Mike Brayshaw and Blaine A. Price (Open University) provided valuable comments on several drafts of this paper.

INTRODUCTION

The work reported here attempts to expand the single-user-log-book approach of Knuth (1989) to investigate the phenomenology of debugging across a large population of users. The original aim of the study was to inform ongoing research in my laboratory on the development of program visualization and debugging tools (Eisenstadt et al., 1990; 1992; 1993), for which it seemed essential to have first-hand knowledge of the experiences of a large body of professional programmers. We had our own experiences to draw on, including extensive work with novice programmers, but it seemed important to look further afield.
In particular, we had been growing increasingly concerned about the problems of scalability: academic work on program visualization and debugging environments would remain vulnerable to criticisms of "toy examples only" unless an attempt was made to address the problems faced by professional programmers working on very large programming tasks. Toward this end, I conducted a survey of professional programmers, asking them to provide stories describing their most difficult bugs involving large pieces of software. The survey was conducted by electronic mail and conferencing/bulletin board facilities with worldwide access (BIX, CompuServe, Internet Usenet, AppleLink). My contribution is to gather, edit and annotate the stories, and to categorize them in a way which may help to shed some light on the nature of the debugging enterprise. In particular, I look at the lessons learned from the stories, and discuss what they tell us about what is needed in the design of future debugging tools. I also explore the possible role of a large repository/data base of debugging anecdotes. The discussion throughout concentrates on the relationships among three critical dimensions: (a) why the bugs were hard to find, (b) how the bugs were found, and (c) the root causes of the bugs.

THE TRAWL

Raw data
========

On 3 March, 1992, I posted the following request for debugging anecdotes on BIX, the "BYTE Information Exchange":

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Fig. 1--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
c.language/tools #2842, from meisenstadt
771 chars, Tue Mar 3 06:09:28 1992
Comment(s).

TITLE: Trawl for debugging anecdotes (w/emphasis on tools side)...

I'm looking for some (serious) anecdotes describing debugging experiences. In particular, I want to know about particularly thorny bugs in LARGE pieces of software which caused you lots of headaches. It would be handy if the large piece of software were written in C or C++, but this is not absolutely essential. I'd like to know how you cracked the problem-- what techniques/tools you used: did you 'home in' on the bug systematically, did the solution suddenly come to you in your sleep, etc. A VERY brief stream-of-consciousness reply (right now!) would be much much better than a carefully-worked-out story. I can then get back to you with further questions if necessary.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Figure 1. The original "trawl" request, posted on BIX. c.language is the name of the conference (major subject category), and "tools" is the name of the topic (minor interest sub-category). BIX has 60,000 members, hundreds of conferences and thousands of topics, each with thousands of messages and replies.

A second message, explaining my motivation, was also posted. During the ensuing months, similar messages were posted to AppleLink, CompuServe, various Usenet newsgroups on the Internet, and the Open University's own conferencing system (OU CoSy). The trawl request elicited replies from 78 "informants", mostly in the USA and UK. The group included implementors of very well-known commercial C compilers, members of the ANSI C++ definition group, and other known commercial software developers. Figure 2 shows a typical reply to the original request.
In that reply, the informant describes many important facets of his debugging experience: (a) the background context (MS-DOS 8086 compiler); (b) the symptom (wrong value on stack); (c) how he approached the problem (single-stepping using an assembly-level debugger); (d) why it was hard (the change happened more quickly at run-time than he could observe while single-stepping, plus he had the wrong model of which way the stacks grew, which made it harder to understand the behaviour); and (e) what the root cause of the problem was (the address below the stack pointer was being wiped out by the operating system interrupt handlers, and the pointer was being decremented too late in the compiled code).

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Fig. 2--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
"I had a bug in a compiler for 8086's running MSDOS once that stands out in my mind. The compiler returned function values on the stack and once in a while such a value would be wrong. When I looked at the assembly code, all seemed fine. The value was getting stored at the correct location of the stack. When I stepped thru it in the assembly-level debugger and got to that store, sure enough, the effective address was correct in the stack frame, and the right value was in the register to be stored. Here's the weird thing --- when I stepped through the store instruction the value on the stack didn't change. It seems obvious in retrospect, but it took some hours for me to figure out that the effective address was below the stack pointer (stacks grow down here), and the stored value was being wiped out by os interrupt handlers (that don't switch stacks) about 18 times a second. The stack pointer was being decremented too late in the compiled code."
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Figure 2. A typical debugging anecdote.

A representative sample of the raw data is presented in Appendix A, and a condensed table is presented in Appendix B {enclosed as a subsequent file/message in this ASCII version}.

A total of 110 messages were generated by 78 different informants. Of those, 50 informants specifically told a story about a nasty bug. A few informants provided several anecdotes, and in all a total of 59 bug anecdotes were collected. A second collection of anecdotes, which will be used subsequently for a top-down analysis by comparing it against the first collection, yielded an additional 58 messages from 47 more informants, of whom 45 told specific debugging stories. A few representative raw anecdotes are presented in Appendix A. The next sections present a detailed summary of the raw data, followed by an analysis and discussion of the lessons learned.

ANALYSIS OF THE ANECDOTES

Dimensions of analysis: why difficult, how found, and root cause
================================================================

Although the "root cause" of reported bugs is of a priori interest, in order to fully characterise the phenomenology of the debugging experiences I needed to look at more than the causes of the bugs. In particular, it was necessary to say something about why a bug was hard to find (which might or might not be related to the underlying cause), and how it was found (which might or might not be related to the underlying cause and the reason for the difficulty). Thus, three dimensions became the critical focus for the ensuing analysis:

* Why difficult: This identifies the reason that the debugging experience itself was tricky or painful, e.g. perhaps the bug rendered the debugging tools unusable.
* How found: This identifies the diagnostic methodologies or techniques that were used in resolving the problem, e.g. single-stepping the code or inserting hand-tailored print statements.

* Underlying cause of bug: This identifies the root cause of the bug which, when fixed, means that the programmer has either totally solved the problem or else has gone far enough to regard the problem as being "in hand".

We know something about each of these dimensions from previous studies, notably those of Vesey (1989), Katz and Anderson (1988), Johnson et al. (1983), Spohrer et al. (1985), Knuth (1983), and du Boulay (1986). These are discussed in an expanded version of this paper (Eisenstadt, 1993b). I found it informative to evolve my own categories in a largely bottom-up fashion after extensive inspection of the data, and then compare them specifically with the ones provided by Knuth. The comparison is provided in Table B-1 (Appendix B) by means of a bracketed label such as "{A}" or "{L}" following relevant entries. In some cases, there is a straightforward mapping, but in others it is more subtle. For instance, there is sometimes a one-to-many mapping between my categories and those of Knuth. The reason is that Knuth knew his state of mind at the time he committed the errors reported in his log, whereas I have to rely on my interpretation of the programmer's anecdote.

Even when the programmer's state of mind is clearly assessable, the assignment of blame can be awkward, because a given cause may always have an even deeper root cause. The criterion I have adopted for identifying root causes is as follows: when the programmer is essentially satisfied that several hours or days of bewilderment have come to an end once a particular culprit is identified, then that culprit is the root cause, even when deeper causes can be found. I have adopted this approach (a) because it nips a possible infinite regress in the bud, (b) because it is consistent with my emphasis on the phenomenology of debugging, i.e. what is apparently taking place as far as the front-line programmer is concerned, and (c) because it enables me to concentrate on what the programmers reported, and not try to second-guess them.

The subsections which follow describe the three dimensions of analysis (why difficult; how found; root cause) in turn.

Dimension 1: Why difficult
==========================

Categories
----------

The reasons that the bug was hard to trap fell into five categories, as described below:

* cause/effect chasm: Often the symptom is far removed in space and/or time from the root cause, and this can make the cause hard to detect. Specific instances can involve timing or synchronization problems, bugs which are intermittent, inconsistent, or infrequent, and bugs which materialize "far away" (e.g. thousands of iterations) from the actual place they are spawned. In this general category I have also included debugging episodes in which there are too many degrees of freedom. As an example of such an episode, consider the case of a piece of software which works perfectly in one environment, yet fails to work in another environment. If many things have changed (e.g. different hardware, different compiler, different linker), then there are simply too many degrees of freedom to enable systematic testing under controlled conditions to isolate the bug. Such testing can be done, given enough time and resources, but it is difficult -- hence this category. (A minimal code sketch of such a chasm appears at the end of this subsection.)
* tools inapplicable or hampered: Most programmers have encountered so-called "Heisenbugs", named after the Heisenberg uncertainty principle in physics: the bug goes away when you switch on the debugging tools! Other variations within this category are: long run to replicate (i.e. the bug takes a long time to replicate on a fresh execution, so if switching on the debugging tool significantly slows down execution, then it cannot really be used); stealth bug (i.e. the error itself consumes the evidence that you need to find the bug, or even clobbers the debugging tool); context precludes (i.e. some configuration or memory constraints make it impractical or impossible to use the debugging tool).

* WYSIPIG (What you see is probably illusory, guv'nor): I have coined this expression to reflect the cases in which the programmer stares at something which simply is not there, or is dramatically different from what it appears to be. This can range from syntactic problems (e.g. hallucinating a key word in the code which is actually not there) to run-time observations (looking at a value on the screen which is displayed in a different way, e.g. "10" in an octal display being misinterpreted as meaning 7+3 rather than 7+1). Such observations lead the programmer on a wild-goose chase, and can be the reason why certain otherwise simple bugs take a long time to track down.

* faulty assumption/model or mis-directed blame: If you think that stacks grow up rather than down (as did the informant in Figure 2), then bugs which are related to this behaviour are going to be hard to detect. Equally, if you bet your life on the known correct behaviour of a certain function (which is actually faulty), you are going to spend a lot of time looking in the wrong place.

* spaghetti (unstructured) code: Informants sometimes reported that the code "was in a mess" when they were called in to deal with it. There is, unsurprisingly, a 100% correlation between complaints about "ugly" code and the assertions that "someone else" wrote the code.

Results
-------

The frequency of occurrence of the different reasons for having difficulty is shown in Table 1.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Table 1--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
CATEGORY                                               OCCURRENCES
cause/effect chasm.......................................15
tools inapplicable or hampered...........................12
WYSIPIG: What you see is probably illusory, guv'nor.......7
faulty assumption/model or mis-directed blame.............6
spaghetti (unstructured) code.............................3
??? (no information)......................................8
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Table 1. Why the bugs were difficult to track down.

Thus, 53% of the difficulties are attributable to just two sources: (i) large temporal or spatial chasms between the root cause and the symptom, and (ii) bugs that rendered debugging tools inapplicable. The high frequency of reports of cause/effect chasms accords well with the analyses of Vesey (1989) and Pennington (1987), which argue that the programmer must form a robust mental model of correct program behaviour in order to detect bugs -- the cause/effect chasm seriously undermines the programmer's efforts to construct a robust mental model. The relationship of this finding to the analysis of the other dimensions is reported below.
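To make the dominant "cause/effect chasm" category concrete, here is a minimal C sketch -- my own hypothetical illustration, not drawn from any informant's anecdote -- in which a single off-by-one write silently corrupts a neighbouring variable, while the visible symptom (every accumulated total ending up zero) only emerges many thousands of iterations later, in code that looks entirely innocent. It also happens to exemplify the "mem" root cause discussed under Dimension 3 below.

/* Hypothetical illustration of a "cause/effect chasm" (not from the
 * informants' code).  The faulty write happens once, early on; the
 * symptom appears thousands of iterations later, somewhere else.     */
#include <stdio.h>

#define N 10

struct state {
    int table[N];
    int scale;                 /* laid out right after 'table'        */
};

static struct state s = { {0}, 1 };

static void reset_table(void)
{
    int i;
    for (i = 0; i <= N; i++)   /* BUG: "<=" writes one slot too far   */
        s.table[i] = 0;        /* and, on typical layouts, silently   */
}                              /* clobbers s.scale (setting it to 0)  */

int main(void)
{
    long step;

    reset_table();                          /* root cause is here ... */

    for (step = 0; step < 100000; step++)
        s.table[step % N] += (int)(step % N) * s.scale;

    /* ... but the symptom (every total is zero) is only noticed here,
     * far away in time and space from the off-by-one loop above.     */
    printf("total for bin 3 = %d\n", s.table[3]);
    return 0;
}

The distance between the loop that does the damage and the printout that finally reveals it is exactly the kind of temporal and spatial gap the informants complained about.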
Dimension 2: How found
======================

Categories
----------

The informants reported four major bug-catching techniques, as follows:

* gather data: This category refers to cases in which the informant decided to "find out more", say by planting print statements or breakpoints. In fact, a variety of specific data collection techniques were reported. The important thing that distinguishes these data collection techniques from the "controlled experiments" category described below is their bottom-up nature. That is, the informants may have had a rough idea of what they were looking for, but were not explicitly testing any hypotheses in a systematic way. Notice that this category is different both from Katz & Anderson's (1988) causal reasoning and from their data-driven hand simulation: it is really a hybrid of the two, because it is bottom-up, on the one hand, yet can lead directly to a causal analysis once the data has been gathered. Here are the six sub-categories reported by the informants (a minimal sketch of the "wrap & profile" style appears at the end of this subsection, after Table 2):

  + step & study: the programmer single-steps through the code, and studies the behaviour, typically monitoring changes to data structures

  + wrap & profile: tailor-made performance, metric, or other profiling information is collected by "wrapping" (enclosing) a suspect function inside a one-off variant of that function which calls (say) a timer or data-structure printout both before and after the suspect function.

  + print & peruse: print statements are inserted at particular points in the code, and their output is observed during subsequent runs of the program

  + dump & diff: either a true core dump or else some variation (e.g. voluminous output of print statements) is saved to two text files corresponding to two different execution runs; the two files are then compared using a source-compare ("diff") utility, which highlights the differences between the two execution runs

  + conditional break & inspect: a breakpoint is inserted into the code, typically triggered by some specific behaviour; data values are then inspected to determine what is happening

  + specialist profile tool (MEM or Heap Scramble): there are several off-the-shelf tools which detect memory leaks and corrupt or illegal memory references, and the experts who relied on these also tended to rave about their value.

* "inspeculation": This name is meant to be a hybrid of "inspection" (code inspection), "simulation" (hand-simulation), and "speculation", which were among a wide variety of techniques mentioned explicitly or implicitly by informants: cogitation, meditation, observation, inspection, contemplation, hand-simulation, gestation, rumination, dedication, inspiration. In other words, they either go away and think about something else for a while, or else spend a lot of time reading through the code and thinking about it, possibly hand-simulating an execution run. "Articulation" (explaining to someone else how the code works) also fits here. The point is that this family of techniques does not involve any experimentation or data gathering, but rather involves "thinking about" the code.

* expert recognized cliche: These are cases where the programmer called upon a cohort, and the cohort was able to spot the bug relatively simply. This recognition corresponds to the heuristic 'mapping' observed by Katz & Anderson (1988).
  The very nature of my data-gathering exercise invariably requires a cohort (rather than the informant) to have detected the bug: the informant would not have been stumped, nor have bothered telling me about the bug, had he or she been able to identify the solution quickly in the first place!

* controlled experiments: Informants resorted to specific controlled experiments when they had a clear idea about what the root cause of the bug might be.

Results
-------

The frequency of occurrence of the different debugging techniques is shown in Table 2.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Table 2--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
CATEGORY                     OCCURRENCES
gather data...................27
inspeculation.................13
expert recognized cliche.......5
controlled experiments.........4
??? (no information)...........2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Table 2. Techniques used to track down the bugs.

Techniques for bug-finding are clearly dominated by reports of data-gathering (e.g. print statements) and hand-simulation, which together account for 78% of the reported techniques, and highlight the kind of "groping" that the programmer is reduced to in difficult debugging situations.
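As a concrete illustration of the dominant "gather data" family, here is a minimal C sketch of the "wrap & profile" style described above. It is my own hypothetical example, not taken from the informants' code: a suspect function is renamed and enclosed in a one-off wrapper, installed under the original name, which prints its arguments, its result, and its elapsed time on every call.

/* Hypothetical sketch of "wrap & profile": the suspect function is
 * renamed and enclosed in a one-off wrapper that gathers data
 * (arguments, result, elapsed time) on every call.                   */
#include <stdio.h>
#include <time.h>

/* The function under suspicion (renamed from its original name). */
static double suspect_real(double x, int n)
{
    double acc = 1.0;
    while (n-- > 0)
        acc *= x;
    return acc;
}

/* One-off wrapper, installed under the original name so that no
 * callers need to change; delete it once the bug is cornered.        */
static double suspect(double x, int n)
{
    clock_t t0 = clock();
    double  r  = suspect_real(x, n);
    clock_t t1 = clock();

    fprintf(stderr, "suspect(x=%g, n=%d) = %g  [%ld ticks]\n",
            x, n, r, (long)(t1 - t0));
    return r;
}

int main(void)
{
    /* Every call now leaves a trace on stderr for later perusal. */
    printf("%g\n", suspect(1.5, 10));
    printf("%g\n", suspect(2.0, 30));
    return 0;
}

A "dump & diff" session in the same spirit would simply redirect two such traces -- one from a working run and one from a failing run -- to text files and compare them with a source-compare utility.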
Let's now turn to an analysis of the root causes of the bugs before we go on to see how the different dimensions interrelate.

Dimension 3: Root cause
=======================

Categories
----------

The bug causes reported by the informants fell into the following nine categories:

* mem: Memory clobbered or used up. This cause has a variety of manifestations (e.g. overwriting a reserved portion of memory, and thereby causing the system to crash) and may even have deeper causes (e.g. array subscript out of bounds), yet is often singled out by the informants as being the source of the difficulty. Knuth has an analogous category, which he calls "D = Data structure debacle".

* vendor: Vendor's problem (hardware or software). Some informants report buggy compilers or faulty logic boards, for which they either need to develop a workaround or else wait for the vendor to provide corrective measures.

* des.logic: Unanticipated case (faulty design logic). In such cases, the algorithm itself has gone awry, because the programmer has not worked through all the cases correctly. This category encompasses both those which Knuth labels as "A = algorithm awry" and also those labelled as "S = surprise scenario". Knuth's A-vs.-S distinction can only be resolved by in-depth introspection, and is too fine-grained for the purposes of this study.

* init: Wrong initialization; wrong type; definition clash. A programmer will sometimes make an erroneous type declaration, or re-define the meaning of some system keyword, or incorrectly initialize a variable. I refer to all of these as "init" errors, since the program begins with its variables, data structures, or function definitions in an incorrect starting state.

* var: Wrong variable or operator. Somehow, the wrong term has been used. The informant may not provide enough information to deduce whether this was really due to faulty design logic (des.logic) or whether it was a trivial lexical error (lex), though in the latter case trivial typos are normally mentioned explicitly as the root cause.

* lex: Lexical problem, bad parse, or ambiguous syntax. These are meant to be trivial problems, not due to the algorithm itself, nor to faulty variables or declarations. This class of errors encompasses Knuth's "B = Blunder" and "T = Typo", which are hard to distinguish in informants' reports.

* unsolved: Unknown and still unsolved to this day. Some informants never solved their problem!

* lang: Language semantics ambiguous or misunderstood. In one case, an informant reports that he thought that 256K meant 256000, which is incorrect, and can be thought of as a semantic confusion. In another case, an informant reported a mismatch between the way a manual described some maximum value and the way in which it was actually dealt with by the compiler.

* behav: End-user's (or programmer's) subtle behaviour. For example, in one case the bug was caused by an end-user mysteriously depressing several keys on the keyboard at once, and in another case the bug involved some mischievous code inserted as a joke. These are really manifestations of behaviour external to the normal programming arena, but still warrant a category in their own right.

Results
-------

Table 3 displays the frequency of occurrence of the nine underlying causes.

vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Table 3--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
CATEGORY                                                    OCCURRENCES
mem: Memory clobbered or used up.............................13
vendor: Vendor's problem (hardware or software)...............9
des.logic: Unanticipated case (faulty design logic)...........7
init: Wrong initialization; wrong type; definition clash......6
lex: Lexical problem, bad parse, or ambiguous syntax..........4
var: Wrong variable or operator...............................3
unsolved: unknown and still unsolved to this day..............3
lang: language semantics ambiguous or misunderstood...........2
behav: end-user's (or programmer's) subtle behaviour..........2
??? (no information)..........................................2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Table 3. Underlying causes of the reported bugs.

Table 3 indicates that the biggest culprits were memory overwrites and vendor-supplied hardware/software problems. Even ignoring vendor-specific difficulties, one implication of Table 3 is that 37% of the nastiest bugs reported by professionals could be addressed by (a) memory-analysis tools and (b) smarter compilers which trapped initialization errors. But what about the interaction between the cause of the bug, the reason for the debugging difficulty, and the debugging technique? That is precisely the focus of the next section.

------------------------------------------------------------------------
[Next: bugtales.2o4 (Relating the dimensions, legacy, ref's)]