======================================================================= 'My Hairiest Bug' War Stories [PART 2] [early extended version of paper which appeared in Comm. ACM, 40 (4) April 1997] Marc Eisenstadt Knowledge Media Institute The Open University Milton Keynes MK7 6AA UK M.Eisenstadt@open.ac.uk ======================================================================= ASCII documents available from ftp://kmi-ftp.open.ac.uk/pub/bugtales: bugtales.1o4 (Main part 1: Abstract, intro, data analysis) =>bugtales.2o4 (Main part 2: Relating the dimensions, legacy, ref's) bugtales.3o4 (Appendix A: Selected anecdotes - see also bugdata.txt) bugtales.4o4 (Appendix B: Condensed data tables) bugdata.txt (ASCII raw data from 1st trawl) PURE ASCII VERSION FOR EMAIL/BBOARD POSTINGS Be sure to print in monospace font (e.g. Courier 10pt., 6 3/4" margin). All of the above documents are available via FTP from kmi-ftp.open.ac.uk (login: anonymous, password: , directory /pub/bugtales) or by email from M.Eisenstadt@open.ac.uk. ======================================================================= RELATING THE DIMENSIONS To understand the ways in which the three dimensions of analysis interrelate, we can place every anecdote precisely in our three- dimensional space. For expository purposes (and because multi- dimensional diagrams are hard to discuss) let's consider just the following two-dimensional comparisons: (a) root cause vs. how found, and (b) how found vs. why difficult. Table 4 compares root causes (row labels) against bug-finding techniques (column labels). Each cell entry shows the number of anecdotes with the given attributes. In cases where an anecdote reveals multiple attributes (say, it belonged partly to column 3 and partly to column 4), it is simply split evenly across the appropriate cells, hence the fractional entries. Tables 1, 2, and 3, incidentally, did not use this splitting technique, and correspond to tallies of the primary (first- reported) categories only. vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Table 4--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv CAUSE vs. HOW gather inspecu- expert contr. ??? data lation clichŽ expts (no info) TOTALS mem 8.50 4.00 1.00 13.50 vendor 4.00 3.00 1.00 3.00 11.00 des.logic 4.00 3.00 7.00 init 5.00 1.00 6.00 lex 1.00 2.00 1.00 4.00 var 2.00 .50 1.00 3.50 unsolved 3.00 .50 .50 4.00 lang 1.00 1.00 2.00 behav 1.00 1.00 2.00 ??? (no info) 2.00 2.00 TOTALS 29.50 13.00 6.00 4.50 2.00 55.00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Table 4. Tally of root causes of bugs (rows) vs. how found (columns). Each cell entry (e.g. 8.50) is a tally of the number of anecdotes reporting that cell's row label (i.e. root cause) and column label (i.e. how found). Fractional entries reflect anecdotes which have been divided into multiple categories, so that an anecdote reporting both a "controlled experiment" and "expert recognized cliche" scores .50 in each cell. Note that tables 1, 2, and 3 only tallied the primary (first-reported) category for each dimension. Of most interest is the relative density of anecdotes in the upper left- hand corner of the table, suggesting particularly that memory-clobbering errors could usefully be dealt with by better data-gathering tools. The density of the cell entries is not greater than that predictable by chance from the row and column totals alone (X^2 ,df:36, = 45.30, ns), suggesting no reliable relationship between root cause and how found, though the cell densities are nevertheless of a priori interest to tool developers, as discussed below. Table 5 compares reasons for difficulty (rows lables) against bug- finding techniques (column labels). Once again, the need for data- gathering tools is highlighted, this time for dealing with cause/effect chasms and cases in which other debugging tools are inapplicable. In this case, the density of certain cell entries is greater than that predictable by chance from the row and column totals alone (X^2 ,df:20, = 33.50, p. < .05), suggesting in particular that data-gathering activities are of special relevance when a cause/effect chasm is involved or when the built-in debugging tools are somehow rendered inapplicable. vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Table 5--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv WHY vs. HOW gather inspecu- expert contr. ??? data lation cliche expts (no info) TOTALS cause/effect chasm 9.83 3.00 1.50 2.50 16.83 tools hampered 9.83 2.00 2.00 13.83 WYSIPIG 2.00 2.00 1.50 2.00 7.50 faulty assumption 2.50 3.00 1.00 6.50 spaghetti 1.33 1.00 2.33 ??? (no info) 4.00 2.00 2.00 8.00 TOTALS 29.50 13.00 6.00 4.50 2.00 55.00 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Table 5. Tally of why bugs were difficult (rows) vs. how found (columns). Each cell entry (e.g. 9.83) is a tally of the number of anecdotes reporting that cell's row label (i.e. root cause) and column label (i.e. how found). Fractional entries reflect anecdotes which have been divided into multiple categories, so that an anecdote reporting three reasons for difficulty scores .33 in each of three relevant cells. A niche of potential interest (and profit) to tool vendors is highlighted by looking at the relationship among the three dimensions: the most heavily populated cells are those involving data-gathering, cause-effect chasms and memory or initialization errors. The implications of this finding are discussed in the next section. DISCUSSION: LESSONS LEARNED From boasting war stories to on-line repository =============================================== What intrigued me the most upon seeing the replies was the way in which complete strangers, with very little prompting and no incentive, were so articulate in their reminscences, and so forthcoming with details. These people clearly enjoyed relating their debugging experiences. Moreover, the depth of supplied details seemed to be independent of whether I had explicitly posted my motivation (as I did on BIX and AppleLink) or not (as was the case on Usenet and CompuServe). Clearly, this is a self-selecting audience of email users and conference browsers who enjoy electronic "chatting" anyway, and some may even have felt a "macho" need to tell a good (and hence boastful) war story-- so much the better! I have no reason to distrust the sources, and the details of each story certainly have their own self-consistency. It is already widely accepted that the international computer network community is a gold-mine of information (see, e.g. BYTE feature article on the Internet, 1992). This collection of anecdotes suggests that it may also be a rich repository of willing subjects ready to supply detailed knowledge in a fairly rigorous manner which may then serve as a resource for others. We are now exploring the idea of developing an on-line repository which could be used both to receive new anecdotes and to respond automatically to keyword enquiries. This could be of great benefit to those with an urgent need to solve complex debugging problems, but only after an appropriate indexing scheme has evolved. This paper is a first step toward such a scheme. Note that it is not necessary to develop a definitive taxonomy. On the contrary, the stories themselves, even when only tangentially related to a specific bug enquiry, could be sufficient to trigger an insight which leads to the solution of a debugging problem. Comparison with fine-grained study ================================== As part of a series of investigations on the nature of programming and debugging environments, I have also looked in detail at what it's like to work with an apparently "modern" and "friendly" program development environment: HyperCard. I kept a detailed diary of several lengthy debugging sessions, and then analysed the problems and difficulties I experienced (Eisenstadt, 1993). In particular, the "why difficult" dimension was analysed at a more a fine-grained level of detail, revealing eight fundamental problems. Table 6 below lists each of these eight fundamental problems, shows which coarse-grained "why difficult" category they correspond to, and identifies a possible solution. vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv--Table 6--vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv {Column labels duplicated for each row in this ASCII version: "Fine" = Fine-grained Problem "Coarse" = Coarse-grained "Why difficult" Category "Sol'n" = Proposed Solution -1---------------------------------------------------------------------- FINE: The link between an error message and the offending source code line has to be deduced by the user (whereas it could be provided for free). COARSE: Cause/effect chasm; Tools inapplicable or hampered (long run to replicate); SOL'N: S1: Computable relations should be computed on request, rather than be deduced by the user. -2---------------------------------------------------------------------- FINE: Performing a routine debugging/inspecting action involves dealing with many disruptive subgoals (e.g. switching modes to enable specific machine states). COARSE: Faulty assumption (typically about what "mode" the debugging tool is in) SOL'N: S2: Atomic user-goals should be mapped onto atomic actions. -3---------------------------------------------------------------------- FINE: Access to the interpreter (and other features) in the middle of a break is disallowed. COARSE: Tools inapplicable or hampered (context precludes using debugger) SOL'N: S3: Allow full functionality at all times -4---------------------------------------------------------------------- FINE: The behaviour of built-in functions can not be monitored easily. COARSE: Tools inapplicable or hampered (context precludes using debugger) SOL'N: S4: Viewers should be provided for "players" (any evaluable expression) rather than just "variables". -5---------------------------------------------------------------------- FINE: There is no meaningful coarse-grained view of execution. COARSE: Tools inapplicable or hampered SOL'N: S5: Provide a variety of navigation tools at different levels of granularity. -6---------------------------------------------------------------------- FINE: Traversal of indirect data influences (data flow) requires too much detective work. COARSE: Cause/effect chasm SOL'N: S6=S1: Computable relations should be computed on request, rather than be deduced by the user. -7---------------------------------------------------------------------- FINE: Understanding the precise conditions under which a particular event happens (control flow logic) requires too much detective work. COARSE: Cause/effect chasm SOL'N: S7=S1: Computable relations should be computed on request, rather than be deduced by the user. -8---------------------------------------------------------------------- FINE: The "inner state" of an object can be deceptively different from its apparent state, requiring extra detective work to uncover. COARSE: WYSYIPIG: what you see is probably illusory, guv'nor SOL'N: S8: Displayable states should be displayed on request, rather than be deduced by the user. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Table 6. Relationship between fine-grained problems identified using self-report and coarse-grained categories of the current broad survey. Prospective solutions to each of the eight fine-grained problems are also identified. Although the scope of the two studies was rather different, it is nevertheless gratifying that most of the problems found in the fine- grained study fell into the two most popular studies reported in this coarse-grained study of the current paper (namely, tools hampered and cause/effect chasm). Solutions ========= What programmers really need, of course, are smarter compilers and debuggers. But the analyses presented throughout this paper suggest that we can be more precise than simply demanding "smartness" from tool developers. For one thing, we have identified a niche that really needs attention: the most heavily populated cell in our three dimensional analysis suggests that a winning tool would be one which employed some data-gathering or traversal method to resolved large cause/effect chasms in the case of memory-clobbering errors (indeed Purify, described below, does precisely this). Secondly, we can propose solutions to the "why difficult" problems by considering the specific cases brought to light by the fine-grained study described above. One way or another, all of the problems mentioned in Table 6 are connected with "directness" and "navigation". For example, the need to go through indirect steps, intermediate subgoals or obtuse lines of reasoning plagues the user encountering problems 1, 2, 3, 6, 7, and 8, and each of these problems can be addressed specifically. The proposed solutions presented in Table 6 are not necessarily easy to implement, but there are an increasing number of tools appearing both in the research community and in the marketplace which illustrate aspects of these solutions. For example, consider the idea of computing and displaying important relations and states on request, rather than relying on the programmer's deductive skills (solutions S1=S6=S7 and S8). The software tool Purify (Hastings & Joyce, 1992) analyses run- time memory leaks in C programs on Sun workstations by patching the object code at link time, and pinpoints the root cause of the leak by traversing many indirect dataflow links back to the offending source code. Thus, it already solves a much harder dataflow traversal problem than that required to deal with indirect pointer traversing such as that reported by several informants. Solution S3 (allowing full functionality at all times) is effectively provided in many modern Lisp implementations. Solutions S4 and S5 (viewers and granularity) have been explored at length in (Brayshaw & Eisenstadt, 1991) and (Eisenstadt et. al., 1993). That leaves S2 (atomic user-goals should be mapped onto atomic actions), which is increasingly addressed in commercial debugging tools, but still requires significant research input. SUMMARY AND CONCLUSIONS An analysis of the debugging anecdotes collected from a worldwide email trawl revealed three primary dimensions of interest: why the bugs were difficult to find, how the bugs were found, and root causes of bugs. Half of the difficulties arose from just two sources: (i) large temporal or spatial chasms between the root cause and the symptom, and (ii) bugs that rendered debugging tools inapplicable. Techniques for bug-finding were dominated by reports of data-gathering (e.g. print statements) and hand-simulation, which together accounted for almost 80% of the reported techniques. The two biggest causes of bugs were (i) memory overwrites and (ii) vendor-supplied hardware or software faults, which together accounted for more than 40% of the reported bugs. The analysis pinpoints a winning niche for future tools: data-gathering or traversal methods to resolved large cause/effect chasms in the case of memory- clobbering errors. Other specific solutions, all of which emphasize issues of "directness" and "navigation" were developed by comparing the current study with a fine-grained self-report study. The investigation highlights a potential wealth of information available by worldwide email, and indicates that it may well be possible to establish an on- line repository for perusal by those with an urgent need to solve complex debugging problems. REFERENCES Brayshaw, M. & Eisenstadt, M. (1991). A Practical Graphical Tracer for Prolog. International Journal of Man-Machine Studies, 35(5): 597-631. Brooks, R. E. (1980). Studying Programmer Behavior Experimentally: the problems of a proper methodology. Communications of the ACM, 23(4):207- 213. du Boulay, J. B. H. (1986). Some difficulties of learning to program. Journal of Educational Computing Research, 2(1):57-73. Eisenstadt, M. (1993a). Why HyperTalk Debugging is More Painful than it Ought To Be. In J. Alty, D. Diaper, and S.P. Guest (Eds.), People and Computers VIII. Cambridge, UK: Cambridge University Press. Also available as: Technical Report No. 105, Human Cognition Research Laboratory, The Open University, Milton Keynes, UK.(Available via anonymous FTP from hcrl.open.ac.uk). Eisenstadt, M. (1993b). Tales of debugging from the front lines. In J. Spohrer and C. Cook (Eds.) Empirical Studies of Programmers: Proceedings of the Fifth Workshop. Norwood, NJ: Ablex. Also available as Technical Report No. 106, Human Cognition Research Laboratory, The Open University, Milton Keynes, UK.(Available via anonymous FTP from hcrl.open.ac.uk). Eisenstadt, M., Domingue, J., Rajan, T., & Motta, E. (1990). Visual Knowledge Engineering. IEEE Transactions on Software Engineering, 16(10):1164-1177. Eisenstadt, M., Keane, M., & Rajan, T. (Ed.). (1992). Novice Programming Environments: explorations in human-computer interaction and artificial intelligence. East Sussex, UK: Lawrence Erlbaum Associates. Eisenstadt, M., Price, B. A., & Domingue, J. (1993). Software Visualization As A Pedagogical Tool. Instructional Science, 21: 335-365. Gould, J.D., & Drongowski, P. (1974). An exploratory study of computer program debugging. Human Factors, 16 (3): 258-277. Hastings, R. & Joyce, B. Purify: fast detection of memory leaks and access errors. Proceedings of the Winter Usenix Conference, January 1992. Johnson, W. L. (1983). An Effective Bug Classification Scheme Must Take the Programmer into Account. In Proceedings of The Workshop on High- Level Debugging, . Palo Alto, CA: Kahney, H. & Eisenstadt, M. (1982). Programmers' Mental Models of their Programming Tasks: The Interaction of Real World Knowledge and Programming Knowledge. In Proceedings of The Fourth Annual Conference of the Cognitive Science Society, (pp. 143-145). Katz, I. R. & Anderson, J. R. (1988). Debugging: An analysis of bug- location strategies. Human Computer Interaction, 3(4):351-399. Knuth, D. E. (1989). The Errors of TeX. SoftwareÑPractice and Experience, 19(7):607-685. McCullough, P. L. (1983). Implementing the Smalltalk-80 System: The Tektronix Experience. In G. Krasner (Eds.), Smalltalk-80: Bits of History, Words of Advice (pp. 59-78). Reading, MA., USA: Addison-Wesley. Pennington, N. (1987). Stimulus structures and mental representations in expert comprehension of computer programs. Cognitive Psychology, 19: 295-341. Shneiderman, B. (1980). Software Psychology. Cambridge, MA: Winthrop. Sime, M. E., Green, T. R. G., & Guest, D. J. (1973). Psychological evaluation of two conditional constructions in computer languages. International Journal of Man-Machine Studies, 5:123-143. Soloway, E. & Iyengar, S. (Ed.). (1986). Empirical Studies of Programmers. Norwood, NJ: Ablex. Spohrer, J. C., Soloway, E., & Pope, E. (1985). A Goal/Plan Analysis of Buggy Pascal Porgrams. Human-Computer Interaction, 1(2):163-207. Vesey, I. (1989). Toward a theory of computer progream bugs: an empirical test. International Journal of Man-Machine Studies, 30:123-46. ------------------------------------------------------------------------ [Next: bugtales.3o4 (Appendix A: Selected raw anecdotes)]