=======================================================================
'My Hairiest Bug' War Stories [FULL RAW DATA/ANECDOTES]
[early extended version of paper which appeared in Comm. ACM, 40 (4) April 1997]

Marc Eisenstadt
Knowledge Media Institute
The Open University
Milton Keynes MK7 6AA UK
M.Eisenstadt@open.ac.uk
=======================================================================

ASCII documents available from ftp://kmi-ftp.open.ac.uk/pub/bugtales:

bugtales.1o4 (Main part 1: Abstract, intro, data analysis)
bugtales.2o4 (Main part 2: Relating the dimensions, legacy, ref's)
bugtales.3o4 (Appendix A: Selected anecdotes - see also bugdata.txt)
bugtales.4o4 (Appendix B: Condensed data tables)
=>bugdata.txt (ASCII raw data from 1st trawl)

All of the above documents are available via FTP from kmi-ftp.open.ac.uk (login: anonymous, password: , directory /pub/bugtales) or by email from M.Eisenstadt@open.ac.uk.

PURE ASCII VERSION
Be sure to print in monospace font (e.g. Courier 10pt., 6 3/4" margin).
=======================================================================

This document accompanies 'My hairiest bug' war stories [Comm. ACM 40 (4), April 1997], which describes the background motivation, and provides a summary and analysis of related data. The text has been trivially edited in order (a) to preserve anonymity and (b) to remove opening and closing meta-comments such as "I'm not sure this is what you wanted...". Anonymity would not normally be an issue for items posted to communally-accessible electronic bulletin boards, but I have not obtained explicit permission from everyone to publish these stories outside of the original bulletin board/email context. I am therefore acting cautiously, out of respect for the privacy of the individuals concerned.

The document is divided into five sections:

1. Original messages
2. Usenet replies
3. Bix replies
4. Open University replies
5. AppleLink replies

=======================================================================

1. Original trawl request message
---------------------------------

(posted to BIX, AppleLink, and the Open University's CoSy):

I'm looking for some (serious) anecdotes describing debugging experiences. In particular, I want to know about particularly thorny bugs in LARGE pieces of software which caused you lots of headaches. It would be handy if the large piece of software were written in C or C++, but this is not absolutely essential. I'd like to know what the symptoms were, and how you cracked the problem-- what techniques/tools you used: did you 'home in' on the bug systematically, did the solution suddenly come to you in your sleep, etc. A VERY brief stream-of-consciousness reply (right now!) would be much much better than a carefully-worked-out story. I can then get back to you with further questions if necessary. Thanks!

Usenet variant of trawl request message:
----------------------------------------

I'm looking for some serious anecdotes about particularly nasty or bizarre debugging experiences that you have come across in large pieces of software (language doesn't matter). I'd like to know what the symptoms were, and what you eventually used to home in on the bug (e.g. some systematic method, a particular debugging tool, did it come to you in your sleep, etc.). If you have a quick stream of consciousness reply, that would be great since I can get back to you if I need more detail.
Motivation details (posted on BIX and AppleLink):
-------------------------------------------------

Here are some comments regarding my motivation for collecting 'my hairiest bug' tales:

1) I've done a lot of work on program visualization ("PV"). A big concern of my group is how to move away from toy problems (typical of PV work) and to ensure that our systems scale up to *BIG* problems... otherwise, to put it bluntly, why should anyone care about PV? I'm therefore interested in hearing (a) about the kinds of hairy debugging problems that cause people to lose sleep, (b) about the techniques that people used to solve their problems, to help shed light on what sort of semi-automated tools might be useful, (c) about the tools that people use, (d) about the tools that people would like to see. Quick perusal of AppleLink bulletin boards suggested to me that the Debugging area in 'Developer Talk' would be the most useful place to post my request.

2) We've made good progress addressing 'scalability' for certain classes of language, notably logic programming. For Prolog, it is certainly possible to show a nice visual coarse-grained view of the execution space (with suitable 'compressions' and 'abstractions' for very large spaces), and allow the user to home in quickly on the trouble spot via a variety of fine-grained views. We've got a graphical Prolog debugger for the Mac which does precisely this (available via anonymous ftp over Internet on cs.toronto.edu in the directory pub/dgp or ask me to post a copy on AppleLink... also, the stuff is written up in various journal articles, of which the most accessible is: Brayshaw, M., and Eisenstadt, M. A practical graphical tracer for Prolog. /International Journal of Man-Machine Studies/, 1991, 35 (5), pp. 597-631.). But generalising these ideas to other languages is hard: the 'proof tree' metaphor for logic programming is a natural for visualization tools, but other kinds of execution metaphors are less suitable for visualization... I want to investigate which problems in other programming languages (i.e. besides Prolog) would be particularly amenable to a novel "PV" treatment.

3) I've done a lot of work on automatic program debugging. This is an exciting area of AI and software engineering, but (to cut a *VERY* long story short), PV systems appear to be a much better long-term bet for being able to scale up to handle arbitrary programs. Automatic debuggers need a huge amount of advance knowledge to catch bugs, and in general they won't have the advance knowledge needed to catch those novel hairy bugs that cause you to lose sleep. Nevertheless, there *ARE* elements of human debugging expertise that can still be (semi)automated, so my collection of 'my hairiest bug' tales is also partially aimed at scratching the surface of some of this expertise. (Call it a first pass knowledge elicitation exercise).

4) As a Cognitive Psychologist, I'm very interested in human problem solving. Debugging is a fascinating and important human endeavour, and I want to hear what the pros have to say about it.

I hope that helps clarify things!

{I have sent this 'trawl' request out on USENET and BIX, and received a large number of replies... people seem to enjoy sending in their anecdotes... at some point in the near future I hope to circulate an edited and annotated collection.}*

*The last sentence applies only to the AppleLink version

=======================================================================
2. Usenet
---------

The replies below were collected between 3rd March 1992 and 10th March 1992 in response to a request which was posted to the following Usenet newsgroups on Internet:

alt.folklore.computers
comp.lang.misc
comp.bugs.misc
comp.programming

------------------------------------------
U1
Not too exciting, but I'll bet it's awfully typical. I had a program (roughly 15,000 lines) in C, running on PCs and Unix. It does screen writes using the curses library. After a long period of development (mostly on the PCs) I started to see occasional odd characters popping up on the screen. The problems were not easily reproducible, but they gave me a queasy feeling. I started cursing the (public domain) curses library I was using on the PC. I started setting breakpoints and tracing, but any time I got a reproducible glitch, setting a breakpoint or inserting a debugging statement "cured" the glitch.

Of course, by now you've probably guessed the problem. I eventually wrote some routines that put a debugging wrapper around the standard malloc() and free() calls. The routines do the following: log every malloc() and free() to a disk file by module and line number; record the number of bytes requested; insert checking signatures at the beginning and end of every allocated chunk; and check those signatures for overwrites at every free() or when explicitly requested to do so.

I found all sorts of intriguing things (all in my own code, by the way, none in the curses library). I sometimes free()ed memory twice (just trying to make sure, I guess). I sometimes overran malloc()ed buffers (usually by the infamous single '\0' at the end of a string). All in all, I think I found about 10 memory allocation/usage errors. I'm not sure exactly which were responsible for my glitches, but almost any of them had bad potential. The glitches are gone, now. I can concentrate on other problems...
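[Ed. note: a minimal sketch of the kind of checking wrapper around malloc() and free() described in U1 above. The contributor's routines also logged every call to a disk file by module and line number; the names here (dbg_malloc, dbg_free, MAGIC) are illustrative and not taken from the original code.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAGIC 0xDEADBEEFUL   /* signature planted before and after each block */

/* Block layout: [size_t size][unsigned long sig][n user bytes][unsigned long sig] */

void *dbg_malloc(size_t n, const char *file, int line)
{
    size_t total = sizeof(size_t) + sizeof(unsigned long) + n + sizeof(unsigned long);
    unsigned char *p = malloc(total);
    unsigned long sig = MAGIC;

    if (p == NULL)
        return NULL;
    memcpy(p, &n, sizeof n);                                  /* requested size  */
    memcpy(p + sizeof n, &sig, sizeof sig);                   /* front signature */
    memcpy(p + sizeof n + sizeof sig + n, &sig, sizeof sig);  /* rear signature  */
    fprintf(stderr, "malloc %lu bytes at %s:%d\n", (unsigned long)n, file, line);
    return p + sizeof n + sizeof sig;                         /* hand back the user area */
}

void dbg_free(void *user, const char *file, int line)
{
    unsigned char *p = (unsigned char *)user - sizeof(unsigned long) - sizeof(size_t);
    size_t n;
    unsigned long front, rear;

    memcpy(&n, p, sizeof n);
    memcpy(&front, p + sizeof n, sizeof front);
    memcpy(&rear, p + sizeof n + sizeof front + n, sizeof rear);
    if (front != MAGIC || rear != MAGIC)       /* e.g. the classic '\0' one past the end */
        fprintf(stderr, "OVERWRITE detected in block freed at %s:%d\n", file, line);
    fprintf(stderr, "free %lu bytes at %s:%d\n", (unsigned long)n, file, line);
    free(p);
}

/* Wrapping the calls in macros records the caller automatically: */
#define MALLOC(n) dbg_malloc((n), __FILE__, __LINE__)
#define FREE(p)   dbg_free((p), __FILE__, __LINE__)

[A real version would also zap the signatures on free, so that double frees of the kind U1 mentions show up as well.]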
------------------------------------------
U2
Not quite debugging, but I was trying to analyse the code generated by a compiler for an interactive language (bit like ML, but before ML). It generated code into garbage-collected memory, and I couldn't patch the compiler or adb it, or stop it before it was likely to have done a garbage collection. So I had to find the new piece of code, which could be anywhere in the heap (post-GC). I ended up writing a heap analyser which told me the location, size and runtime-type of every object on the heap, ran it on before- and after-images, did a suitable filter through sed, sort, uniq, and diff, giving me the location of the new object, which I could then extract and run through a disassembler.

------------------------------------------
U3
Well this is -my- most nasty bug; I am sure others have had worse, but I am young, give me time :-)

I was doing maintenance and upgrades on what amounted to a fileserver for a networked device w/o its own disk. The language was C; the source code was used to create an executable to run on VAX/VMS, VAX/ULTRIX and DECstations/ULTRIX. As a side note, the system was a pain to debug because the device's code was under development too: if you single stepped through the fileserver once it got a request from the device, the device got impatient and sent lots of messages and timed out, which generally meant you had to reboot it, which took several minutes (and a walk across the building).

As a result you could only easily look at one network transaction from the inside (they were all logged so you could see what happened; you only found out what happened inside your code for the last reply). The fileserver had an array to hold active file descriptors (which had to be mapped because of the different ways VMS and ULTRIX handle files). The device would request a file to be opened, then the server would send back a file handle, then the device would use the file handle to refer to the file in future operations.

We would see the message to open a file come in, and the handle go back, and then the printer would make a read request with the correct handle and the fileserver would claim that it had never opened that file. So I looked inside, and I saw the file server open the file, and put the fp in the right place, and send back the correct handle. So I looked inside during the read request, and sure enough the data structure for file handles was wrong. Now there were only about 3 places where the file handle structure was modified, so I put debug printfs around them, and sure enough, it was only changed once, when the file was opened.

Luckily the VMS debugger (which is linked in as a linker/compiler option) has this option where you can set a break point at a specified memory address, so I set it on the file handle data structure and ran the fileserver, and bingo, right before the project engineering meeting I found the problem. There was a buffer (array of char) that was used to construct outgoing network packets, which had a maximum length of 1024 or so. The buffer was only 128 long, so it would get overrun and munge the file pointer structure.

Elapsed time 1 1/2 days. [Of course if this had been Pascal it would have been a no brainer] [Then of course there was the time it took me 1 1/2 hours to find out that I forgot #include]

------------------------------------------
U4
Just as a general class of bugs, memory management bugs can be some of the trickiest to find since they usually don't show themselves where the bug occurred but some other place distant in time and space. (Hence my interest in automatic garbage collection schemes.) I experienced a difficult case of a memory management bug when I was building a "real-time" GC system. The bug had all the appearances of looking like some other kind of bug except that it vanished when I looked too closely at where I thought it was supposed to be. Noticing this kind of behavior is the first step towards finding the real cause of the bug. Concurrent and/or real-time systems can also have nasty bugs that are, moreover, difficult to repeat.

------------------------------------------
U5
I had a large COMMON statement in a Fortran program. When I changed the name of one of the variables in it, the line became longer than 72 characters, which Fortran does not warn you about. So, the last variable was always coming in as zero, because it was being read as TOTO instead of TOTAL (or some such). Of course, in Fortran, TOTO is a legal variable, which was assigned the value zero...

------------------------------------------
U6
I had a bug in a compiler for 8086's running MSDOS once that stands out in my mind. The compiler returned function values on the stack and once in a while such a value would be wrong. When I looked at the assembly code, all seemed fine. The value was getting stored at the correct location of the stack.
When I stepped thru it in the assembly-level debugger and got to that store, sure enough, the effective address was correct in the stack frame, and the right value was in the register to be stored. Here's the weird thing --- when I stepped through the store instruction the value on the stack didn't change. It seems obvious in retrospect, but it took some hours for me to figure out that the effective address was below the stack pointer (stacks grow down here), and the stored value was being wiped out by OS interrupt handlers (that don't switch stacks) about 18 times a second. The stack pointer was being decremented too late in the compiled code.

------------------------------------------
U7
We had one very nasty problem that did not manifest itself on our development system. It turned out to be a "C" pointer to a floating point number that was in an (otherwise unused) portion of code that was being dereferenced and thrown away. On our development system (Vax 8600) there was no apparent problem. On our customer's system (Vax 11/780) it blew to high heaven. I figured out the problem when we were stepping through the code in the Vax debugger.

Or maybe the time I was debugging some graphics software that exercised the kernel panic() routine due to a bug in Sun OS? Or maybe the time I was debugging some innocent looking graphics software that duplicated a (hidden) identifier out of one of the SunView libraries, and caused very weird crashes (lint solved this one!)?

I don't believe I've ever solved a bug in my sleep, but I am a firm believer in the Sherlock Holmesism, "Once you have eliminated the impossible, whatever remains, however improbable, must be the truth". Works for me. Hope this helps.

------------------------------------------
U8
Symptom: core-dump in malloc.

Interesting item: remove earlier popen("lpr") call and core-dump goes away.

What really happened: The developers had increased the number of open files per process to 100. The stdio package had NFILES=60. The stdio package uses the UNIX file descriptor to index into an array of buffer pointers. This resulted in a memory overwrite of whatever followed the stdio array of buffer pointers, which happened to be the malloc control variables.

Debug process: The popen item suggested that the problem was in stdio. The developers did tell me about increasing # of open files. I just happened to remember the UNIX fd as an array index in stdio. I then used nm to get a map of the executable to find out what followed the stdio array ... bingo!

Overall, the entire process from being given the problem to solving it took 12 hours. Debugging tools were useless because the cause of the problem happened much earlier than when something finally went wrong.

------------------------------------------
U9
Well, I spent one miserable summer of my grad-school career untangling a particularly bad example of Fortran spaghetti code to solve set covering problems. When I started, I had no reason to suspect a problem, and I don't even remember exactly what tipped me off (i.e. what I was testing for when the first anomalous result showed up; I'm sure I must have had an incorrect solution to a problem whose solution I knew, or an infeasible solution found by the code). The only applicable technique was a combination of hand-tracing (until I could identify the correspondence between subroutines, interfaces and actions), and debugger tracing of the offending problem (which was large!).
I used a "wolf-fence" technique: Run the code up to a fixed point, and check for correct results. If OK then run the code to the next point and check again. If not OK then run the code from the start up to the last OK point, and insert a new breakpoint between the current point and the known bad point. If the interval between breakpoints is small, step one instruction at a time.

I finally found the problem--an integer array which held a value (if the value was positive) or a status (if the value was negative). You can guess the rest: 0 was both a legal value (but rare) and a legal status. A status checker took a zero-valued cell and operated on it as if it were a zero-status cell, changing its status. The code was so badly organized, and the assumption about how this array was used was buried so deep, that I advised that the program be rewritten from scratch (by my successor 8^).

------------------------------------------
U10
I think the worst bug I ever had to track was a bounds overrun bug in a non-protected system. This was on an IBM PC clone, and I was debugging a complex TSR+program setup (the TSR was more like a 'resident runtime library' than a real TSR). The program needed to run for about 20 minutes before it would fail (with copious debugging output), and would *not* fail when run under the debugger. The cause was eventually traced to the TSR code writing just a bit above its top of memory, which ran into the bottom of the program. In the debugger, this didn't hit anything important; presumably it was all initialization code. I forget how we found this one, unfortunately; maybe just dedication.

------------------------------------------
U11
This was a long time ago, and may no longer be a problem. I had a huge Fortran program that solved mechanical engineering stresses. There was a function like (sorry, it's been a long time):

FUNCTION THING(A,B)
INTEGER A,B
:
A = 2
:
END

At one point in the program, the function was called like this:

X = THING(1,C)

where the '1' is the numeral one. What I realized (after much, much, much time and thought) was that the following was going on:

o the procedure where "X = THING(1,C)" existed had placed the value "one" in a (global!) literal table and instructed the loader to initialize it to "1".
o this procedure passed the address of "1" in the literal table.
o procedure "THING" placed the value two into the memory location pointed at by "A", which, in this one instance, was in the literal table.
o all subsequent references to an integer "one" in the program used the value two.

I had no guide in this. I felt that I had thought of, and tried, everything. Finally I narrowed it down, through "diagnostic printing", to locate the first faulty expression. This expression had a one in it (although I've forgotten the details, an example would be):

J = 3*(K + 1)

I placed mucho WRITES before and after this, such as:

WRITE (*) J,K
J = 3*(K + 1)
WRITE (*) J,K

with the result looking like:

4 8
30 8

Which got me started. It was like a flash (from hell!): "... it thinks that 1 is 2 ..." After this, all it took was an hour or so with TECO to locate the source of my problems. I bet I spent three or four days on this. I don't know how fast others would have solved it though---I was 17 at the time, which is kind of young, but I've been programming professionally since I was 12, so maybe that makes up for it.

------------------------------------------
U12
Debugging war stories? My favourite kind!
The worst bug I've had to pin down comes from an artificial life model I've been working with. I "inherited" the code - really awful K&R C code with absolutely no structured programming. Functions are scattered throughout C files, lots of global variables, no comments, typical bad code. The whole system is rather small, actually - 4000 lines, so it is possible for me to understand the whole thing. But at the time of the bug, I hadn't really grokked the whole mess.

The program only crashed after running about 45000 iterations of the main simulation loop. Running it this long takes about 2 hours and 8 megabytes of core. The crash was a segmentation fault. Somewhere, somehow, someone was walking over memory. But that somewhere could have been *anywhere* - writing in one of the many global arrays, for example.

The bug turned out to be a case of an array of shorts (max value 32k) that was having certain elements incremented every time they were "used", the fastest use being about every 1.5 iterations of the simulator. So an element of the array would be incremented past 32k, back down to -32k. This value was then used as an array index. It points out several ways in which C can really shoot you in the foot. No overflow errors on integer operations, so 32767+1 really is -32768. No bounds checking on array operations - a[-32768] = 0; is a perfectly legal operation with really negative effects.

The actual bit of memory being written into eventually hit one of the malloc() chain data structures (lots of 4k data structs being malloced and freed), causing stupid Ultrix free() to do the Wrong Thing and trash the heap. But of course the actual seg fault was happening several iterations after the error - the bogus write into memory. It took 3 hours for the program to crash, so creating test cases took forever. I couldn't use any of the heavier powered debugging malloc()s, or use watchpoints, because those slow a program down at least 10 fold, resulting in 30 hours to track a bug. No good.

The way I found it was to first use GNU malloc(), which has some very simple range checking features built in. That let me catch on to what was actually generating the SIGSEGV - heap trashing. I then just sort of zenned the bug, printing out data structures in the program and looking to see if they looked right. I finally found the negative number somewhere, then squashed the bug. It took me 3 days to find.
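[Ed. note: a minimal sketch of the failure mode in U12: a short counter silently wrapping past 32767 to a large negative value which is then used as an array index. The names (use_count, table) are illustrative only, and the wrap-around shown is what happens on the usual two's-complement machines rather than something the C language guarantees.]

#include <stdio.h>
#include <limits.h>

int main(void)
{
    static short use_count[10];      /* bumped every time a cell is "used"       */
    static int   table[100];

    use_count[3] = SHRT_MAX;         /* after roughly 32767 uses ...             */
    use_count[3]++;                  /* ... silently wraps to SHRT_MIN (-32768)  */
    printf("count is now %d\n", use_count[3]);

    /* There is no bounds checking either, so the next line is accepted by the
       compiler and would quietly write far in front of table[0] - in U12 it was
       the malloc() chain that got hit.  (Left commented out: it is undefined
       behaviour and will crash or corrupt memory if run.)                       */
    /* table[use_count[3]] = 0; */

    (void)table;
    return 0;
}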
------------------------------------------
U13
Here're three to add to the list:

I was working on some simulation code. We have a program (SD/Fast) which, given a description of a physical system, writes C code for the equations of motion governing it (rigid-body dynamics). I had a planar simulation of a 3-legged galloping robot which I was trying to get working for a class project. The symptoms were that when it hit the ground with one foot, it wouldn't tip over. Turned out that when I took the variables the control system was using and put them into the array passed to SD/Fast's routines, I swapped yaw and pitch, but didn't when I got them out of the array. And, because it was a planar model, after retrieving the array I set yaw and roll to zero, so I couldn't see where the problem was coming from.

When I was working at Touch Technologies, Inc. (a small software company in San Diego), we had a sales tracking system I was working on. Once in a while, a date in a certain field would get the year and day swapped. It turned out that someone else wrote a hunk of code which at one point got a date which was returned in the wrong format (it was received in MMDDYYYY, but stored in YYYYMMDD so it could be sorted). Rather than put it in a temporary string, she wrote it out to the database in the wrong format and a few lines later corrected it. Normally, there was no problem even though this was a multi-user system, because of the record locking. However, there are a few situations where the record can become unlocked, and one of them could occur in that screenful of lines before the date was corrected. So, _very_ infrequently, the record would get unlocked and someone else would access it, getting the invalid date. Even more infrequently, the original user would continue and the date would be corrected, only to be re-written incorrectly when the second person kept going. AACK! I found that one by sheer accident.

We have an IBM RS/6000 here which periodically crashes. All the processes keep running, but if you do anything which tries to start a new one you get wedged. This happens about weekly. We're still working on it (as is IBM).

------------------------------------------
U14
It was the spring of 1991 and I was at IBM in the kernel development group for AIXv3. The RS/6000 had just been announced and we were headed into system test when ptrace() broke. Unfortunately, ptrace() was my baby and I had to fix it. I don't know what you know about ptrace(), but it's used to trace user level programs (in UNIX). It communicates to the traced process using SIGTRAP. There's special case code in the signal handling code to do the right thing if the process is being traced. Anyway, the problem was that every 15 or 20 thousand times, the SIGTRAP would kill the traced process instead of stopping it. How do you trace something that fails that infrequently? But it was failing the automated test suite, so it had to be fixed.

I worked on that one for two weeks. I would start up a tracing session and single step through both the ptrace code and the signal code using the kernel debugger. That week made me rich on overtime, but produced no results -- it never once failed single stepping. It had to be a timing problem, but where? There were only two processes involved and it was a uni-processor... Anyway, that Friday (as on every Friday that spring) a new build of AIX was distributed and the problem disappeared. I even compiled in the old signal and ptrace code, but it never re-appeared. I would love to know what caused it, but with the hundreds of thousands of lines of kernel code in AIX, I never will.

Speaking of my time at IBM, we had all sorts of bugs related to the compiler. We were working with the brand-new RIOS processor and the compiler boys were working right along with us. I can think of at least a dozen problems in my code that turned out to be erroneous instruction sequences generated by the compiler. When you work on a new architecture, I suppose that's one of the risks.....

------------------------------------------
U15
This is the one that jumped into my mind.

Background: this was a port of a large financial planning package, IBM PC to Mac. In general, the compiler and solver code ported flawlessly to the Mac. Most of what we did was build a Mac interface around it. We'd never had any trouble with anything in the compiler or solver, so no one on the team was particularly familiar with these parts.

Symptoms: wrong answers in the final generated reports, seemingly at random.
In particular the answers were wrong only for very large financial models which took a fairly long time to run. The same models worked perfectly on the PC version.

Tracking: the team leader tended to look down his nose at me because I was not proficient at the assembly based debugger. While he grovelled about with the debugger, I produced reams of printouts with WRITELN (yes, this was Pascal). The runs took on the order of an hour each. Then I would dig through huge printouts trying to spot where things were going wrong. Finally I spotted a variable that suddenly acquired an unreasonable value. Only a couple more debug cycles and I spotted the culprit, an uninitialized variable. On the PC, Pascal forced uninitialized variables to zero (the proper value in this case). The Mac just gave you whatever happened to be there before, which was almost (but not quite) always zero.

This was all about 7 years ago, so details have become fairly fuzzy. It sticks out in my mind as the best example of print statements beating an interactive debugger. I doubt if anyone's stamina or attention would have lasted through the hundreds (maybe thousands) of iterations it took to reach the one point when an uninitialized variable didn't contain zero. And since it had been zero every other time, who would have thought to check it? With the advent of reasonable symbolic debuggers for smaller platforms I now use both methods about equally.

------------------------------------------
U16
(1) A friend of mine was trying to write a general purpose binary i/o package for something deep in the innards of a C application. The idea was that struct declarations and i/o code would be derived from the same set of macro calls (by using two sets of definitions). A favourite technique. Well, everything worked except strings (which were hopeless gibberish), and he was at a loss to figure out why. I said, Rene, does it ALSO fail for pointers to strings? He tried it and it did. I said, ok, the compiler (MSC in this case) is deriving alignment constraints from the basetype instead of the full type. And it was.

Technique (works well): spend five years debugging code generators, and some of these things are obvious.

(2) I was debugging a Lisp garbage collector back when generationalism had just been invented. Ours was a compacting mark/sweep collector, but I'd installed a system of shallow and deep garbage collections to try and get some of the same payoff that generational schemes were showing (this eventually worked quite well). Anyway, there was one programme that, when we ran it on certain data, would bomb the system after something like 8 cpu hours, with a bad pointer. Eventually it turned out that the bug was as follows:

When an object is byte-alignable
AND it is misaligned w.r.t. gc cells
AND the garbage collector tries to fix this by moving it UP in memory (!)
AND it moves it from a position in which the header is in a different age-group from the body
THEN its header gets replicated,
WHICH causes a SECOND object to be built, containing some drek.

This only manifested with cold strings, which are not very common, and these were at low addresses where garbage collection only reached them once in 32 cycles, and then it only happened for one alignment of the age-group boundary, which is why it took so long to manifest.

Technique (works poorly): breakpoint the garbage collection before manifestation, and dump (a) the graph surrounding the problem, and (b) the relocate tables. Compute the OLD address of the offending object.
Repeat on this new address, for gc n-1. When this results in a LEGAL address being generated, you're done. Elapsed time: about a month. Learned a lot from that one, though. Like, don't code bugs :-).

(3) There were a lot of exciting bugs in the NetHack overlay system, caused by the fact that the compiler wasn't up to inserting trampolines for functions reached through pointers. These were cracked by having four people independently comb the source code for likely regexps (!). How's that for brute force?

The debugging tool I most often want is watchpoints - breakpoints on data access. All the bugs I've caught with those are non-memorable because the process was so painless (if sometimes slow).

------------------------------------------
U17
Ok, the worst bug in my experience was one where a job running disconnected from any i/o was terminating because of a console interrupt. When I first encountered it, I couldn't quite grasp what was going on (the job had just vanished). After a few occurrences, I started getting worried -- I had stuff in there to trap errors of various sorts and leave a memory dump which I could peruse at my leisure, but no dumps were showing up, nor was a debugging file (showing the messages from signon through to error).

[Background: our system is rather busy during the day, so I'd moved a number of cpu intensive programs to run automatically at night. This particular one did a lot of text analysis on recent congressional bills and printed a dump of probable citations.]

Anyways, I finally set things up so that the job _did_ have input. Basically, I had some commands there which would always save a memory image if anything happened to allow user input. [Of course, normal exit went through a procedure to quietly ignore these commands. Since the job itself did no I/O I could have these commands queued up without interfering with normal processing.] Anyways, it took a couple live tries to get this dump mechanism working properly, and by that time I'd alerted the systems people.

From this point on, I wasn't involved in the analysis of the problem, so I'll just summarize: Our operating system had been hacked to allow us to run background tasks (we're using VM/CMS [*ick*] which normally requires every active session to be on a different account -- we've got mods in to allow any user to have an arbitrary number of background tasks running under his/her own id). Our fileserver has a potential deadlock condition which it resolves by sending an unavoidable interrupt to kill the offending job. The code which detects this deadlock can be triggered if the main account is logged on, and idle, while one of these background tasks has a file locked.

Actually, I didn't have to do much to track it down -- it just took several months for it to sink in that there was a problem and this wasn't just a one-time fluke.

------------------------------------------
U18
I don't know how nasty or bizarre this one is, but it sure puzzled me for quite a few days (I got to the point of thinking the compiler -- VAX-11 Fortran, a very stable product -- was at fault). The "bug" was that the VAX-11 compiler allowed an extension whereby you could use a TAB instead of 6 spaces at the start of any statement line, but it would only count as ONE character toward the 72 character limit. The code I was working on had been written by different people at different times, upgraded, etc., and in some places still used spaces, while in most of the parts I was working on, it used tabs.
I had become accustomed to the (implicit) rule that as long as the statement fitted on the terminal, it was okay. (Since Fortran was not my "native" language, I was not used to worrying about such strict source code conventions, except when unavoidable, such as with continuation characters.) So, one day I added something to a line that happened to start with 8 spaces (to match the other lines around it with tabs, but presumably written by someone used to always putting spaces) and it would give strange results when executed.

I tried printing the results of variables, etc. (but didn't know how to use the symbolic debugger if there was one). Eventually, I asked the system manager who, after some thought, simply put the cursor at the start of the line, typed "72 ", and the cursor ended up in the middle of an identifier. The truncated identifier either happened to be a valid identifier, or was implicitly declared, I don't remember which, but the effect was the same: no compile-time or run-time errors, but mysterious behaviour of apparently straightforward code!

------------------------------------------
U19
The following is a true story that happened to me about 8 years ago. I was working on a small team developing an Ada compiler in an academic setting. I was responsible for the code generator. One day I got a bug report from another member of the group that a certain Ada program of his crashed whenever he compiled it with our compiler, and it looked like the problem was a stack underflow. (Our target machine was a Perq Systems PERQ, running a microcoded stack-oriented instruction set similar to P-Code.) Examination of the disassembled object code revealed that, indeed, the compiler was generating (subtly) bad code. There were, however, other binaries that were purportedly built from the same sources by other members of the group, and they compiled the program just fine.

At first, we suspected a version control problem, that somehow the version of the compiler that I had built was constructed using different sources than the others. We then suspected differences in release levels of the compiler and linker used on the various machines. After a few quick investigations, it became clear that something really fishy was going on, so we began a more systematic investigation. We took a common set of sources, and built a binary in which we tried every combination of { compile compiler, link compiler, compile test program, link test program, run test program } on each of two machines. It turned out that the problem appeared if and only if the compiler had been linked on my machine.

We reported the problem to the hardware maintenance staff, a little reluctant to blame the hardware, but fairly confident that we had controlled for every other variable. The hardware people did not seem too terribly put off by our diagnosis that a problem that might seem so clearly to be a compiler bug was in fact a hardware problem. A technician swapped out the CPU card of my workstation, I relinked the compiler, and the problem vanished.

------------------------------------------
U20
I don't know if this is appropriate, but we found an elusive bug with the editor that we use (it's called EC, not written by us). The symptom was that the program always crashed, but only on a 486. When run under Turbo Debugger, the location of the crash could be found. It was executing a software interrupt, but seemed to be executing the wrong interrupt number. This was within a C library routine (probably int86()).
However, when we set a breakpoint at the start of the routine and single-stepped it, everything was fine. Finally I had an inspiration. The problem was caused by the 486's instruction pipeline. The int86() routine was storing the interrupt number to be executed (self-modifying code), loading the registers, then executing the int instruction, but the int instruction (and the int number) had already been read by the 486 (instruction pipeline), so the wrong interrupt was being generated.

------------------------------------------
U21
I once fixed 5 major bugs in a large program simply by reformatting the code (by hand). The interesting thing is, I didn't know the bugs were there when I started. I was modifying a "borrowed" copy of it for a particular application and no one told me there were 5 outstanding STRs against it, some of them 18 months old. The reformatting process revealed flaws in the logical flow that I fixed while I was working on my problem (removing 1000 lines of dead code helped, too. This was the cruftiest bowl of spaghetti I've ever seen). After that, the program was taken away from its original authors (who'd effectively refused to work on it anyway) and I've owned it ever since.

------------------------------------------
U22
I have two:

First, programming an embedded system using PL/M-86. I do not recall the exact syntax there, but in C my error would have been:

char msg[4]={'h', 'e', 'l', 'l', 'o'};

That is, allocating one BYTE less than needed for the simultaneous initialization. The compiler issued a warning and stopped code generation but produced dummy linking information. The compilation/linking was a large batch job on a VAX: the warning did not halt the job (an error would have) and linking succeeded because of the dummy info. No sign of error then: but one of my code modules was actually missing! Of course the thing crashed, as my 'module' contained random bytes. Tracing it with a hex debugger - remember, this is *low-level 8086* stuff so I had to calculate addresses by hand, etc. - I saw that my process eventually jumped to the stack segment of another process! This made it clear that something was wrong with the whole executable image.

Second, doing a simple programming exercise for a C language course: recursive quicksort of strings, represented by pointers. My program ran out of stack space, so naturally I thought my recursion did not halt properly. So I added some printf's to trace its behaviour; this was MSC 4.0 - no fancy integrated debuggers here! Surprise: it was the printf's that did endless recursion - my original sorting went well, but crashed when printing the result! But printf isn't recursive! Well.. the routine I used for printing the result was named "write", and the run-time system used a routine with the same name to write out arbitrary strings, including printf's, after dealing with the conversion characters. The linker let me "redefine" the name without even a warning. Nowadays I am paranoid about linkers ;-)

------------------------------------------
U23
I designed some software that had to run on a variety of machines. The code was written mainly in C with machine-specific assembler routines for certain speed-critical regions of code. The code ran correctly on most of the machines in our department apart from a MicroVAX running Ultrix. What distinguished the Ultrix machine from the others was that I had confidence in the assembler code because it had been thoroughly tested on a similar MicroVAX running VMS.
I had used the VMS machine to profile the code and performed quite rigorous testing. I assumed, somewhat naively, that if the assembler routine was correct then something else in the program, perhaps the O/S specific I/O routines, had corrupted the data being fed to the speed critical region. I performed dry runs on paper, writing down every state change within the program, and could find nothing wrong. I then added code to log all of the internal state and ran this on both the VMS and Ultrix machines and found that up to the start of the assembler code everything was identical in either case - it had to be a bug in my well-tested assembler code.

After single-stepping through the code I discovered that a particular machine instruction produced a different result on one MicroVAX from the other; at first I thought that this indicated a hardware fault. I tested the code on a different Ultrix machine and again it failed with the same symptoms. It turned out that the particular instruction, although it was present in older VAX machines, is emulated on the MicroVAX-II. When the instruction is encountered by a program the processor executes an unimplemented instruction trap and the operating system emulates the expected behaviour of that instruction. The emulation code within Ultrix had a bug in it and so the instruction didn't work correctly; the code in VMS was fine. Why the Ultrix and VMS groups chose to have two different implementations of something that ought to be identical I don't know. Although I rewrote my code so that it didn't contain the offending instruction, I reran my old code under a subsequent release of Ultrix and they appear to have fixed their emulation code.

------------------------------------------
U24
How timely. I spent the majority of last night hunting this one down, which, all things considered, is probably common enough (with a twist). Not wanting to be sloppy I carefully free() all malloc()ed blocks of memory, but it turns out I released one I was still referencing. It showed up on the screen looking like I had forgotten to terminate an ASCIIZ string, but I couldn't find anywhere I had done that.

The curious thing was, in order to test the value I expected to be output, I added a printf() to display it. Not only was the value correct but the routine suddenly worked properly and there was no screen garbage. So I pulled it out, recompiled and it was broken again. Add a printf() and it worked, remove it and it didn't. Right then I knew that it wasn't just some ordinary "bug" because nothing I could legitimately write in C would be affected by the presence or absence of a printf() statement.

As to how I found it, I think I kept backing up from the point where the error occurred, stubbing things and noting if it continued to occur or not, and finally I was staring the free() in the face. Not too terribly bad a lesson to learn, and the next time something equally odd occurs I'll go check the free() calls first.
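[Ed. note: a minimal sketch of the premature free() described in U24 above. Because the freed block usually still holds its old contents until the allocator reuses it, the bug can seem to come and go as unrelated calls (a printf(), say) disturb the heap. Names are illustrative only.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *msg = malloc(32);

    if (msg == NULL)
        return 1;
    strcpy(msg, "hello");

    free(msg);            /* block handed back to the allocator ...               */

    /* ... but still referenced.  This is undefined behaviour: it may print
       "hello", print garbage, or crash, depending on whether the allocator
       (or something as innocent as an intervening printf) has reused or
       overwritten the freed block in the meantime.                              */
    printf("%s\n", msg);

    return 0;
}

[A checking wrapper of the U1 variety, or a debugging malloc that poisons freed blocks, turns this class of intermittent bug into an immediate, repeatable failure.]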
------------------------------------------
U25
Somewhere around (1978?) I made a trivial change to a shorthand-to-English translation program. I was investigating the work carried over from pass to pass, so I had a data file which caused several hundred very short passes. Suddenly it died on pass 90, with an error we had been getting for the past year or so. Sometimes the disk system (which we had written) would return the wrong sector. (This was on a Datapoint 5500: 48K, 2 x 2.5MB disk cartridges.)

We had looked pretty seriously for this bug, but could never repeat it, and eventually added code to detect it, and retry where possible. Interesting (getting the bug on pass 90), but I had code to test, so I ran it again. Same error - pass 40! Continued attempts got the same bug on always-different passes. Paranoia convinced me I had done something terrible (I had my share of fingers in the os, the disk driver, and the translator). I added code to the translator to see what was going on. It seemed the os was returning the previous sector (again) rather than the newly requested sector. Panic! What have I done to the os?!? Move the debugging code into the os. Wow! The os is getting the same sector twice from the disk driver - it's not in the disk cache or anything. But I haven't touched the os or driver for several days!

Well, eventually I found that if I read sector A, then later, while sector A was just starting, if I read sector X, I got sector A instead, with no errors or anything. I wrote a little standalone program which tested for this. By fine tuning it for particular machines, I could get the wrong sector 1 time out of 3 or 4. I had spent about 3 days pulling my hair out up to now. Our boss took the cassette (yes!) with the program on it to a Datapoint convention, and the Datapoint people basically laughed at this long-haired hippy. But a few days later, their head of repair (or at least very high up) flew out with only a new disk controller board for luggage. Fixed the problem too.

Seems there were two sector registers - one counted the slots (hard sectored) in the disk, the other was the sector to read/write (set by the user). A flip-flop was set whenever the slot counter incremented to a match with the user register, and reset whenever it incremented to a mismatch. Any time up to the end of the 75 microsecond sector preamble, a read command would begin if the flip-flop was set. (Write commands could not start during the preamble, since they wrote the preamble.) So - you've read A, it crosses into A some following revolution, the flip-flop gets set, you change your register to X, which DOES NOT RESET the flip-flop (its slot counter hasn't changed), you tell it to read, and bingo! wrong sector. The fix was a lifted leg and a new wire.

As a nice side effect, Datapoint never again laughed at any of our complaints. Another side benefit: they had respect for our ability to write tight fast code - their os was so lame, it could never have started a read within the 75 microsecond window.

------------------------------------------
U26
Well, doing Mac hacking is one thing... I finally took this great tool pair "Heap Scramble" and "Mr Bus Error" for the Mac and nailed some bugs in Mac NetHack. If you do Mac stuff, you know what I mean; else a short explanation: The Mac memory manager (instead of malloc) returns pointers to pointers to memory, known as handles. The "middle" pointer is "owned" by the Mac OS, and when the OS moves the block of memory to de-fragment your heap, it updates that pointer. All references to handles must be made double-indirect, like:

(*fooHandle)[ix];

This makes for inefficient code, but you can instead do:

bar = *fooHandle;
...
bar[ix];

However, any call to the OS means the "copy" bar of the master pointer fooHandle points to may be made invalid. Thus you can lock blocks, marking them as non-movable:

state = HGetState(fooHandle);
HLock(fooHandle);
bar = *fooHandle;
...
bar[ix];
HSetState(fooHandle, state);

However, there are times when you forget to lock the handle, or don't call the OS but then later add a call and you use an invalid handle... And memory doesn't move ALWAYS, just when it has to, so these bugs aren't very repeatable. Heap Scramble and Mr Bus Error are nice tools, and help you find this, and NULL dereferences, and other stuff. Heap Scramble scrambles the heap at EVERY system call to provoke unlocked-handle bugs, and Mr Bus Error checks for writes to NIL or dereferences from NIL, among other things.

------------------------------------------
U27
Here's a debugging tale from the old days. In 1970, the same semester when campuses all over the US were erupting due to Kent State, I took my first computer course. It was conducted on an extremely primitive mainframe: an IBM 1401 with 8K (sic) of memory, no magnetic media, punched cards only, at Roosevelt University in Chicago. The math professor who taught the class was good. He introduced computer science in a "genetic" way, showing us first the miseries of machine language, then the slight lessening of misery encountered on the switch to assembler. But when he wanted to proceed to FORTRAN and the glories of high-level language he was stymied, for the compiler failed. He ended up directing us to an IBM service bureau to do FORTRAN.

A year or so later, I was a programmer in the university computer center. I chanced upon the FORTRAN compiler. It was a deck of cards in a drawer. When I keyed in a simple FORTRAN program, of course the compiler failed in exactly the same way. However, I happened to have a copy, at that time, of John A. N. Lee's book, The Anatomy of a Compiler. This had some details on 1401 FORTRAN in an appendix and so served as an introduction to the mysteries of this program. I discovered a subroutine in the compiler deck...to do multiplication and division. This was puzzling, for although hardware multiply and divide were extra cost on the 1401, I knew from actual experience that we had these features. It dawned on me, however, that this was very possibly why the compiler failed, for Lee's book specified that the 1401 Fortran compiler required at least 8K of memory. I removed the subroutine and replaced calls by hardware opcodes, and the compiler successfully ran my code.

Whenever a FORTRAN program compiled successfully, it printed out the object code (actually code for an interpreted FORTRAN machine) on our line printer in a peculiarly compelling rhythmic pattern. The operators and I had a little dance to go with this pattern, which became known as the "successful compiler dance." Since I was a long-haired punk kid (in 1970) and since this computer was in a glass box visible to the public, our compiler-dance was strange, to say the least. A debugging story from the ancient history of computing!

------------------------------------------
U28
Probably my nastiest and most bizarre bug was when I was guiding a student to modify the SOS editor under TOPS10: there was an unexplained crash (occasionally) when exiting intra-line alter mode. Well, close observation finally led to the conclusion that the culprit was the way we exited alter mode: if it was exited one way everything was fine, but exiting the other way crashed it. The issue was why, and we finally nailed it with the following procedure: we traced (and captured in a file) both executions of SOS, and then ran SRCCOM (TOPS10 version of diff) through them: voila, the bug was PRECISELY the first different instruction!
Another weird bug was when I reversed the operands of an instruction which converted from alphanumeric to numeric on an _OLD_ machine (IBM 1620): this was an easy bug to commit because this was the _ONLY_ instruction where the operands were in reverse from the normal sequence. I caught the bug because the instruction took so long to execute that the opcode actually caught my eye while executing (the flashing lights always displayed what was going on): and yes, that opcode was the culprit.

------------------------------------------
U29
Here is my favourite debugging anecdote. I once had a program that only worked properly on Wednesdays. I had a devil of a time finding what the problem was. At the end of one cycle it would ask you if you wanted to continue, and unless you typed a "y" it would quit. (OK, OK, you caught me, it was indeed a game program.) This program would always end the game even if you typed "y", unless you were playing on a Wednesday. On Wednesdays it would work correctly. The code for testing if a user has typed "y" or not is not very complex and I was unable to see what the problem could be. Re-arranging the code made the problem change symptoms but not go away.

In the end, the problem turned out to be that the program fetched the time and date from the system and used it to compute a seed for a random number generator. The system routine returned the day of the week along with the date. The documentation claimed that the day of the week was returned in a doubleword, 8 bytes. In actual fact, Wednesday is 9 characters long, and the system routine actually expected 12 bytes of space to put the day of the week. Since I was supplying only 8 bytes, it was writing 4 bytes on top of a storage area intended for another purpose. As it turned out, that space was where a "y" was supposed to be stored to compare to the user's answer. Six days a week the system would wipe out the "y" with blanks, but on Wednesdays a "y" would be stored in its correct place.
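[Ed. note: a sketch of the mechanism behind U29. get_day_name() is a stand-in for the contributor's system routine, which the documentation said needed an 8-byte buffer but which actually blank-padded the day name to 12 bytes; the names and the struct layout (which makes the two fields adjacent so the overrun is visible) are hypothetical, invented only for illustration.]

#include <stdio.h>
#include <string.h>

/* Stand-in for the OS routine: blank-pads the day name to 12 characters,
   regardless of how much room the caller actually supplied.               */
static void get_day_name(const char *today, char *buf)
{
    memset(buf, ' ', 12);
    memcpy(buf, today, strlen(today));
}

int main(void)
{
    struct {
        char day[8];      /* the "doubleword" the documentation promised      */
        char expected;    /* the 'y' the program later compares input against */
        char pad[3];      /* absorbs the rest of the 12-byte write            */
    } s;

    s.expected = 'y';
    get_day_name("Tuesday", s.day);     /* overruns day[] - this IS the bug   */
    printf("after Tuesday:   expected = '%c'\n", s.expected);   /* a blank    */

    s.expected = 'y';
    get_day_name("Wednesday", s.day);   /* the 9th character of "Wednesday" is 'y' */
    printf("after Wednesday: expected = '%c'\n", s.expected);   /* 'y' again  */

    return 0;
}

[Six days a week the blank wipes out the 'y'; on Wednesdays the overrun happens to put a 'y' right back where it destroyed one, which is why the program only behaved on Wednesdays.]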
------------------------------------------
U30
My worst nightmare occurred when I was working on adding an interrupt service routine in a real-time system. I had the code mostly working, except it would crash on rare occasions, frequency about 1/day. I couldn't even figure out what the symptoms were. I had a hunch it was in the ISR because it failed randomly. I had all sorts of theories, most of them centering on a race condition. Because I didn't know where to look I started analyzing the ISR code: no race conditions. Everything looked OK.

I got a lucky break about a week into it: I found a set of conditions which caused a crash within minutes. This was pure happenstance; I just hit on the right set of circumstances. Using an in-circuit emulator and a complicated breakpoint I broke on the error condition and found that the error was indeed a race condition in my interrupt. Not believing that this could be so, I spent another day breaking at various points and discovered to my horror that the processor (Intel 80186) had the race condition. When the integrated interrupt controller was set to memory mapped mode (as opposed to I/O mode) and you wrote to the mask register using a write-after-read instruction (in this case an AND), the processor would mask the interrupt, but if the interrupt was already pending, waiting for the end of the AND instruction that actually set the interrupt mask, then the interrupt would be entered with the interrupt masked. This was a condition that should be impossible. An interrupt should not occur if its mask is set.

When I exited the interrupt I set the interrupt mask to clear and returned. I would get the next interrupt immediately, during a critical section where the interrupt should have been disabled, and the system would crash. The fix was to never, ever muck with the interrupt controller with processor interrupts enabled. Even though the data book implied that it would work. To this day I am amazed that I found out what this bug was. I attribute it mostly to the good fortune of finding a repeatable set of conditions that would cause the bug to appear, and a very good in-circuit emulator.

------------------------------------------
U31
The author requested copyright on the original, as it appeared as a trade journal editorial; I can make this available once I get permission from the author. --Ed.

------------------------------------------
U32
Generally I find that sitting at a machine typing things often does not help much. If I get away from the m/c and do something unrelated, I sometimes find that the answer to the problem comes to me. I'm afraid this is not a reliable technique, though!

------------------------------------------
U33
I don't know if you are looking for funny stories ... We had a lot of trouble with our machine based on the Z8000. The guys who wrote the Operating System didn't care about vararg functions, and the Z8000 stores the first function args in registers. Lots of utilities crashed because they were not ported but only recompiled from V7 UNIX. My friend Harti spent a lot of time using adb to re-C (rewrite in C) the executables and to fix those bugs (xargs, /etc/init, ...).

There was also a funny bug in the malloc() routine. The routine ran stably; the bug only occurred in special situations, leaving the program in an endless loop. Reassembling and rewriting was the only possibility. Another problem, with our hardware, caused the character A written on disk to be read as ^@. My friend Harti fixed this "bug" after a time, changing the value of a capacitor in the disk controller.

------------------------------------------
U34
Hmmm.... code that interacts with fast things like other code or things in the real world is always interesting because bugs go away when you slow the code down with a debugger. But that's mundane.

The best story I know involved a large office management system. There was a site that couldn't do any printing from this package. Word Perfect and everything else could print things perfectly well. This program would say it was printing, then do nothing. Ok - this couldn't be the printer driver, because another package with the same routines worked perfectly well on the same printer. The troublesome package also worked fine on all of the other printers on that system....

Much looking through the code later, it wasn't obvious there was a nasty bug trashing things, so they get a version with a debugging printer driver that logs everything sent to a printer. The log looks fine but nothing comes out of the printer. Ok, your printer's not working, fix it. The fixit shop says there's nothing wrong with the printer and blames the software. (*This* is when the users tell us that the other packages using the same driver all print fine.) At this stage, we take the thing off their hands and loan them a system. We take the thing to examine it in detail. Sure enough, it doesn't print. Switch the printers and everything works fine; the same printer works fine with every other machine in the place. Things are getting confusing.
Next a breakout box goes in between the printer and the confuser - TA-DA! it all starts working.... Is there a problem with the cable? No - it works with any cable plus a breakout box, and no cable without it... Someone did work out that with more than a meg of code running, that printer driver, that printer and that machine wouldn't work without a breakout box between them. No one ever worked out why. I think the printer was swapped for one of ours and everyone was happy again. I have no idea why it all worked with small code running and not when big code ran without the aid of a breakout box. It never was debugged.
------------------------------------------
U35
When I was working at Data General, several years ago, I was porting some graphics code to a new DG machine. The code was relatively stable, so I was surprised when it seemed to go into an infinite loop. After debugging for the better part of two days, I eventually realized that somewhere, a jump was being executed to location zero. Under DG's AOS/VS operating system, location zero can be read, and its contents are usually zero. In the DG MV architecture, a zero instruction is a jump to zero. Hence, once you jump to zero, you loop there for ever. I then needed to track down where the jump was originating. This took a while, but I eventually traced it to a particular arctangent instruction (hardware floating point). I set a breakpoint on that instruction and the program stopped looping. After some more experimenting, I came to the definite conclusion that when this particular arctangent instruction was executed normally, the program would jump to location zero and loop for ever, but when a breakpoint was set at the same place, the program would run just fine. Other arctangent instructions did not exhibit this behavior. After about three or four days of this, I called the person responsible for the microcode for the new machine and explained the problem. "Oh yeah," he told me, "we know about that one." It turns out that when an arctangent instruction was on a page boundary, a microcode defect caused the jump to zero.
------------------------------------------
U36
... a recent example was debugging a problem in a TCP/IP network kernel I am writing for MS-DOS. The symptoms were: A TELNET remote login session could be started to a host, which worked well. You could log in, start doing things, and after a random amount of time (sometimes 30 seconds, sometimes minutes, sometimes never) the TCP protocol would get confused and just stop transmitting or receiving characters. The machine was not crashed. The problem only ever occurred when using a particular terminal emulation product. It seemed to be speed-dependent - fast screen output worked well, but anything which took characters out of the receive-buffer at a fairly slow rate was prone to crashing. The kernel had to run as a Terminate and Stay Resident utility, so no debugger could be used. Even single-stepping through the assembler was fruitless, as the protocol would time out while I was verifying that the right assembler code was being executed. On top of that, we knew of many bugs in the implementations on the machines at the other end, so the problem could be with them, except that other machines had no trouble talking to the same remote hosts. Getting out of the halted condition frequently left hung jobs and occupied network ports on the remote host, which could often be cleared only by shutting the UNIX host down.
I had a packet monitor for the LAN, but it didn't show all the information from the network headers; I had a newer version which showed the information that the old one lacked, but it had bugs which prevented it showing the header values the old version displayed correctly. After three weeks of hand-tracing the protocols, based on the known parts of the network header and guesses about the rest, scanning the code with a fine-toothed comb by eye looking for logic flaws (about 70K of dense C), trying to single-step the assembler of the driver while not stepping on any other interrupt processing, hand-converting jump addresses to routines so I could try not to debug the routines from the compiler's library, and eventually placing the standard trace-writes [things like output("here 1"); ] all through the code to find out what the %^$%$#% it thought was happening -- which produced output on the screen too fast to read, and couldn't be redirected to a file -- I found the problem. I had managed to find that the protocol was getting confused within a particular 10-line code section, which I had been staring at, re-coding, removing bits, etc. for 3 days when inspiration struck. Despite all the warnings and errors the compiler COULD have notified me about, it wouldn't warn about testing whether an UNSIGNED number was less-than-or-equal-to-zero. If a packet came in that I had already received (which only happened if the output buffer was being read at a particular rate which triggered time-outs and re-transmissions at the remote end), I was testing to see if the last byte of the packet was already received (which made the difference between the current place in the byte stream and its place negative), and if so discarding the packet. The difference was being placed in an unsigned int, which was subsequently never <=0, so the old packet was being used to update the protocol's idea of where in the data stream it was up to. The line in question had been on the screen for a fraction of a second as I was paging through the source, and had stuck in my mind one night, and triggered the AHAH! light bulb after I had already gone 4 pages beyond that section... The code now works, I'm glad to say. Sometimes, with all the wonderful debugging tools around, you have no choice but to sprinkle the code liberally with things like:
    printf("here");
    printf("there");
    printf("here2");
    printf("called XXXX, value of this_ptr = %04x\n",this_ptr);
------------------------------------------
U37
I ported the game Omega from Unix to the Atari ST. The Atari is an M68K machine. The Omega sources are 1.2 Mb, my machine has 1 Mb RAM, and the binary is 400 Kb. In these circumstances, source-level debugging is a pain, since a source-level debuggable binary is >450 Kb, the debugger takes 200 Kb, and I need memory for the data too. Bug: the program deleted a list that contained pointers to other objects. The other objects were not deleted. They were referred to by other parts, however, and also pointed *back* to the list that was occasionally and partially deleted. To complicate this further, memory allocations and delete operations happen in bunches, so once every so often a lot gets done and then for quite some time nothing is done. Symptom: sometimes, the objects behaved a bit weirdly... Found: Keeping records of all allocations and deletes, and then reading all of the source code. On the faulty code in question, I nearly choked on my coffee.
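The unsigned trap at the heart of U36 above is easy to reproduce; here is a minimal C sketch (variable names and values are invented for illustration, not taken from the original TCP code):

    #include <stdio.h>

    int main(void)
    {
        unsigned int expected = 1000;  /* next byte we expect in the stream     */
        unsigned int last     = 900;   /* last byte of an old, duplicate packet */

        /* Intended test: "is this packet entirely old?"  The difference is
           negative, but stored in an unsigned int it wraps around to a huge
           positive value, so the test never fires and the stale packet is
           accepted as new data. */
        unsigned int diff = last - expected;   /* wraps, e.g. to 4294967196 with
                                                  32-bit unsigned ints */
        if (diff <= 0)                         /* true only when diff == 0      */
            printf("old packet, discard\n");   /* never reached                 */
        else
            printf("accepted as new (diff = %u)\n", diff);
        return 0;
    }

Many present-day compilers can warn that comparing an unsigned value against <= 0 is suspicious, which is exactly the warning the author wished for.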
Repaired: after searching the code for >4 weeks, and finding the error, it took less than 5 minutes to repair this code. ------------------------------------------ U38 Attempting to debug some software that was being targetted to a unique piece of h/w. The h/w had shared memory between 2 Zilog Z80's, 2 Zilog DMA units and a Z8000. I was programming part of the Z8000 s/w. We wrote something to memory and then when we retrieved it later it was incorrect. Since this was a device driver we put an ICE into the system to see what was happening....Voila, on a single screen we saw a write and later a read from a location with different bytes. (I lead the h/w guy over and told him it wasn't my problem anymore ]-). It turned out that an obscure bug in the DMA units was corrupting this memory area. (It only took 2 weeks to find that it wasn't our code...) ------------------------------------------ U39 Several years back (before the workstation revolution), a colleague and I went to track what we knew to be a simple bug in a large server program running on a timeshared Vax 11/780. (By "large" I mean about as large an executable as the bsd Unix of the time could comfortably handle.) We knew in advance about what we expected to find -- the symptom was clear and well-defined, and could obviously have arisen due to a change we'd just made. We fired up dbx. And waited. And waited. And waited. (The dbx of the time was even buggier than it is today, and had even fewer optimizations for reading in large symbol tables.) Finally I gave up and turned to another terminal and fired up trusty old adb. I found the bug and fixed it, and had started the recompile, and dbx had still not even presented its first prompt yet. (I don't remember if it ever did, or if we killed it.) This illustrates an all-too-common paradox with debugging tools. The older a tool is, the fewer features it has, but (unless progressive software degradation has set in) the more reliable it is. Newer tools with fancy features are inevitably full of bugs or size limitations, or do not scale well to large problems. Yet it is the large problems which are the hardest to debug, and which presumably could use the best tools. (The problem is worst for brand-new, "revolutionary" tools, which sometimes only seem to work well, or work at all, on the toy problems you can imagine the implementor used for design and testing.) ======================================================================= 3. BIX ------ The replies below were collected in between 3rd March 1992 and 8th March 1992 in response to a request which was posted to the following conferences on BIX, the BYTE Information Exchange: c.language/tools c.plus.plus/tools One reply also came directly to my personal email address. ------------------------------------------ c.language/tools #2843, from 'B1', 478 chars, Tue Mar 3 09:11:00 1992 Comment to 2842. Comment(s). More refs to 2842. ---------- The bottom line for me is all the above. I have "cracked" bugs systematically, in my sleep, while driving my car, while not even thinking of it (!), by sheer intuition, by being familiar with the problem, by being unfamiliar with the problem, by choosing the most probable out of a list, by choosing the least probably out of a list, by throwing the list out of the window sight unseen, by using past experiences, by using heuristics, by simply doing the unobvious, and so on. ------------------------------------------ c.language/tools #2846, from 'B6', 47 chars, Tue Mar 3 12:12:53 1992 Comment to 2843. 
Comment(s).
----------
Debugging is a search process. So is testing.
------------------------------------------
c.language/tools #2850, from fboness, 151 chars, Wed Mar 4 00:49:54 1992 Comment to 2846.
----------
Debugging is one of the few remaining areas of life where the hunter-gatherer skills/talents honed over so many millennia are still valuable.
------------------------------------------
c.language/tools #2844, from 'B1', 3995 chars, Tue Mar 3 09:58:17 1992 Comment to 2842. Comment(s). More refs to 2842.
----------
Oops, just realized you asked about tools and anecdotes specifically. Tool-wise, I rarely use a debugger (except the one in my head!), and when I do, it is typically in the development phase, and only then to get a stack backtrace (UNIX debuggers support this; I really don't know if the others do too). "My" best debugging story is as follows (all names have been changed to protect the guilty): Company C cut a new version of their software and sent it up to QA. When X (the developer in charge) found out, they were disappointed, as this cut was intended to fix some other problem, and they needed it yesterday. They were able to verify QA's crash by going through the same sequence. They spent a few days and finally narrowed down the proximity of the source. By the end of the week they even had the line of code. By darn, something was stomping on their code and that's why it was crashing. They spent some more days. Nothing. Suffice it to say that at this point overtime was in order. Nothing. They started soliciting some of the better people in the company from other projects. They were either too busy, didn't care, or, if they looked at it, didn't spend enough time. One or two of them checked in a few times a day, though, to keep up with the story and see if "a view from outside" could perhaps be the key. Still nothing. It's something like 2 weeks now that they have been trying to find out what is stomping on their code (there is supposed to be a call to the OS at that address and there isn't). I got dribs and drabs and started feeling sympathetic: Say X, I could lick that problem in a few minutes (ok, maybe not that sympathetic), why not let me come in and take a look at it? Naw, besides we'd get into trouble with an outsider being there. It looked bad. I then insisted that I was going to their company the next day. They finally agreed that I should come in on Saturday when nobody was there. Got there at something like 10:30. By the time they had set up, chatted, and explained the thing abstractly and more intimately, I had suggested a number of things. They had either tried them, or they didn't work. Nothing. Let's get out of here and go to lunch, says I. I found out more, and suggested tons of things. Some they had tried; some they hadn't. We tried those when we got back. Worthless. Hmmm. Time to switch gears. Tell me: how do you know that the code is getting stomped? Because it's not the call. Ok, show me. Well, it wasn't the call. But the generated code looked so beautifully templatized (a sequence that could have been generated by the compiler) and appeared to have no garbage gaps in it. Well, perhaps that's just the way the thing stomped. But a long shot was what was needed now. Ok, duplicate that line of code so I can see what it was supposed to look like. Now this is interesting: both lines of generated code look the same. Er, guys: nothing's stomped anywhere? It got a bit tense as they called me crazy (I'm the crazy one, but they are so smart, right?). No, no. Listen. Get out of the debugger.
Run the debugger. Load your code. DON'T EXECUTE IT. Yep, there it was. The stomp with no execution. That lousy compiler, they screamed. I have this feeling that that is not the case. They started to argue with me. (Like I needed this: here I am on my Saturday, solving their problem, and now getting flak.) NO! Show me the source code. Nothing obvious. Ok, well, so long as we are here, why not code the call and its argument differently. Still a crash. Then re-edit the source. There it was. I knew it. Yo, bring back the original source code. You go get the language manual. I did not tell them why. I was crazy again. Just do it! They could not find their copy. Another few minutes rummaging through everybody's desk in the company. Ok, we had one. Index.... ok.... temp. There it was. Plain and simple: temp was a keyword, and when not declared, did some builtin thing. We put in the declaration and went home.
------------------------------------------
c.language/tools #2847, from 'B2', 970 chars, Tue Mar 3 19:58:54 1992 Comment to 2844. Comment(s).
----------
That reminds me of the time (back in the days of hardcopy source listings and punched-card decks) when I spent 3 days debugging a COBOL problem that just couldn't be. All the symptoms showed that it was making it to a line within an IF/THEN/ELSE block which it should never have made it to. It was obvious that it could never make it to that line, because the prior line clearly terminated the block, because it ended in a period -- you could see the period right there on the page, there was no question about it. The period was in column 72. Thus it was taken as a comment, and didn't end the statement. You sure couldn't tell that by reading the hardcopy, though. The tool that solved this problem was low-tech indeed: a 15-inch steel ruler that was marked in 10ths of an inch (i.e., in characters), which, when laid against the line, showed the period was in column 72. That was the one and only time it took me 3 days to think of using a ruler on the listing.
------------------------------------------
c.language/tools #2848, from 'B1', 285 chars, Tue Mar 3 20:03:13 1992 Comment to 2847.
----------
Ugh. That war story brings back bad memories. Actually, I'm going out the door right now, but I probably have a slew of ones like that. Plus, I've just remembered a bug-that-wasn't that happened to another buddy of mine. I'll see if I can remember to mention some of it when I return.
------------------------------------------
c.language/tools #2845, from 'B3', 294 chars, Tue Mar 3 11:55:25 1992 Comment to 2842. More refs to 2842.
----------
Last bad bug (i.e., random crashes on some machines with some configurations) ultimately was solved when not thinking about the problem. The program was restructuring an array of pointers such that the one that needed to get free'd was no longer the first one (under some circumstances).
================================
------------------------------------------
c.language/tools #2849, from 'B4', 710 chars, Wed Mar 4 00:20:52 1992 Comment to 2842. Comment(s). More refs to 2842.
----------
>Trawl for debugging anecdotes (w/emphasis on tools side)...
This may not be exactly what you're looking for. Years ago I was working on a fairly long BASIC program, a business game simulation used in a marketing class for MBA degree candidates. I must have spent a few days on one bug, planting print statements everywhere and poring over reams of printouts.
I was to the point of snapping pencils when the guy in the next booth looked over the partition and said "Oh, you didn't mean to do that, did you?". I had used the same index variable for two FOR loops, one inside the other. I must have looked at that segment of code hundreds of times. He spotted it in a few seconds.
------------------------------------------
c.language/tools #2851, from 'B1', 1058 chars, Wed Mar 4 01:35:51 1992 Comment to 2849. More refs to 2849.
----------
>two FOR loops, one inside the other
My friend's debugging mishap, which unfortunately involved me again, came down to something like this. We were doing a bunch of stuff on an IBM Series/1. He came to me one day to ask if I could help him out. His program was crashing the machine. But not exactly. That is, the console would print "IEW1234 IMMINENT SYSTEM FAILURE" or something like that. We spent 1/2 hour doing all the obvious stuff. At that point I suggested that we find out what an IEWblah was. It wasn't listed. He continued to track it down while I went to get the system programmer. Turns out he was trying to impress some girl, and after waiting for him three times I ended up having an argument with him. We spent another 1/2 hour. Things were almost sporadic. I told him: either the OS or the hardware is going bonkers, or your program is printing this out. At which point he gasped and started laughing hysterically. His program was printing it out and he had forgotten. It was the end of the day, so instead of slapping him around, I just went home...
------------------------------------------
c.language/tools #2852, from 'B5', 739 chars, Wed Mar 4 02:15:54 1992 Comment to 2849. Comment(s).
----------
This is a common occurrence -- we tend to see what we _wanted_ to code, instead of the truth. (Gee, could this happen elsewhere in life, too?) I frequently will explain a problem to my girlfriend - she's willing to listen as I babble on about pointers, and ISRs, and DOS, and other TLAs. Even her "dumb" questions often serve to clarify in my own mind that what I wrote is what I want. Often, it will spark the thought about where the bug is -- or at least how to go about attacking it. (To be fair, I listen to her babbling about her job; she works in an ICU and talks about IABP (Intra-Aortic Balloon Pumps) and Swans (some kind of cardiac catheter) and tachycardia and other wonderfully unpronounceable words...) :)
------------------------------------------
c.language/tools #2853, from 'B6', 176 chars, Wed Mar 4 11:41:45 1992 Comment to 2852. More refs to 2852.
----------
Code analysis tools help. Even linting C code helps. RSN we'll have tools that use AI technology to help (NS btw, this stuff is really being worked on by some AI researchers)
------------------------------------------
c.language/tools #2854, from 'B1', 285 chars, Wed Mar 4 11:49:32 1992 Comment to 2852.
----------
>Even... "dumb" questions... serve to clarify
This is helpful too. In fact, it is a skill. One must be able to switch gears into being a less knowledgeable bystander. I have solved countless situations this way. Ditto goes for coming at it on an "obviously" wrong tangent on purpose.
------------------------------------------
c.language/tools #2855, from 'B2', 1057 chars, Wed Mar 4 13:43:50 1992 Comment to 2842. Comment(s).
----------
Since your question involves debugging tools, I should mention that I've found an execution profiler is sometimes more valuable than a debugger in tracking down some problems.
It can be especially handy for bugs like "The program runs fine for about 5 minutes, then seems to go into a coma." It could take an hour or more with a debugger to step your way through thousands of loop iterations to the point where the bug is happening. The worst part is early in the debugging, when you don't have the slightest idea where the problem is, so you can't use any of the debugger's advanced features for breaking on a condition or loop count. There's been several times where a profiling run shows that after 10 minutes of running, the program has spent 90% of it's time in one function. That gives you a good place to starting poking around with the debugger. Sometimes I wish for a combined profiler/debugger in a single package. It'd be nice to set a breakpoint such as "stop when any routine has aquired more than 50% of the accumulated runtime." ------------------------------------------ c.language/tools #2856, from 'B1', 171 chars, Wed Mar 4 14:04:03 1992 Comment to 2855. Comment(s). ---------- >It could take an hour or more with a debugger to step you way through >thousands of loop iterations to the point where the bug is happening. What pray tell do you mean? ------------------------------------------ c.language/tools #2857, from 'B2', 704 chars, Wed Mar 4 14:25:36 1992 Comment to 2856. Comment(s). ---------- If your symptom is just that the program locks up (ie, seems to be hitting a "do forever" loop somewhere), and you haven't a clue as to where it is happening, a debugger isn't always the best tool to start with. You can use a debugger to step through the code, but without any clue as to where to set a breakpoint, or what data item to set a watch on, you could end up stepping through a LOT of code a line at a time, or even a function at a time, just trying to find out whereabouts the do-forever loop is happening. Once the profiler has shown where the do-forever loop is, you can go back to the debugger and start setting breakpoints and data watches to try to zoom in on the actual problem. ------------------------------------------ c.language/tools #2858, from 'B1', 554 chars, Wed Mar 4 17:19:47 1992 Comment to 2857. ---------- >but w/o any clue as to where to set a break*.... you could end up >stepping through a LOT of code... just trying to find.. the >do-forever loop I've just never found the necessity for that kind of involvement, so I'm a bit taken by this. This is not to say that I've never been presented with or coded an infinite loop. Under UNIX, a keyboard generated core dump and stack back trace is quite easy. Failing that, I've used the profiler trick on non-UNIX boxes. Failing that, I've never found the resort of watching a debugger movie of this nature. ------------------------------------------ c.language/tools #2859, from 'B7', 361 chars, Wed Mar 4 23:09:57 1992 Comment to 2853. ---------- >Code analysis tools help. Even linting C code helps. I've found that with the work I've been doing with BC++ 3.0 that the compiler in C++ mode is a very helpful "linter". Since all functions have to be declared, it finds all the cases where I have arguement type mis-matches and such. I've had much fewer hangs than in years past working with C compilers. ------------------------------------------ c.language/tools #2860, from 'B4', 712 chars, Thu Mar 5 00:20:50 1992 Comment to 2852. ---------- [reply to 'B5'] Notice how all the stories so far have been about some incredibly simple but overlooked problem? 
Does this mean that the really complicated bugs are just work and not that interesting? Or perhaps the debugging tools available today are so good that only the simple stupid mistakes that no debugger could save us from are hard. The one thing I find that makes debugging non-stupid errors the hardest is an intermittant bug, one that only appears after I do a sequence of things in the program that I almost never do. Trying to isolate the bug and make it reproducable may be extremely frustrating. One trick I've resorted to is to add code to the beta version that stores every keystroke. {My reply to some of the BIX CoSy messages copied to this conference above} [2861... from meisenstadt] [comment on 'B5's #2852] > we tend to see what we _wanted_ to > code, instead of the truth. (gee, could this happen elsewhere in > life, too?) Amen. Many (twenty!!) years ago a colleague and I did some studies of board- game players in which their 'lookahead' moves were made explicit (using various tricks involving light-pens and 'ghost' pieces). We found that they spent much more time during lookahead thinking about their own moves rather than their opponent's moves, leading them to make naively optimistic projections about what might happen next. One sees the same naive optimism in debugging behaviour. > I frequently will explain a problem to my girlfriend - she's > willing to listen as I babble on ... Articulating the behaviour of the program helps to put things in perspective, possibly by encouraging (forcing?) you to work at a higher level of abstraction, thereby giving you a better chance of seeing the forest rather than the trees (or rather being able to see enough of the forest that you can then home in usefully on the troublesome tree!!!). Hardened software engineers might argue that such clear articulation and abstraction, if applied early in the design process, would help to avoid some of those mind-numbing bugs in the first place. The real evidence for this is unclear (and besides, it's a question of taste and style: many people don't seem to enjoy working in that way). Is it even conceivable that a software tool could help you (a) articulate the behaviour of your program or (b) see what it's doing at this higher level of abstraction? I'm convinced that the answer is 'yes'! Anyone out there interested in providing some (difficult-to-do) detailed introspection on exactly HOW they cracked those difficult bugs??? [2862... from meisenstadt] Regarding 'B2's comments in #2857, i.e. >but w/o any clue as to where to set a break*.... you could end up >stepping through a LOT of code... just trying to find.. the >do-forever loop and your response in #2858 that >I've just never found the necessity for that kind of involvement some of this boils down to questions of taste & style (cf. my immediately preceding message). For some people, stepping through the code is cognitively simpler, because they can then map the stepping behaviour directly on to their own mental model of how the interpreter or compiler works. That is, the 'work-through' is itself important for some types of users (and some types of bugs!). But there's a nasty penalty: frequently, you can't see the forest for the trees. This is why 'B2' likes the profilers, and indeed says (#2855): > Sometimes I wish for a combined profiler/debugger in a single package. My lab has spent a lot of time worrying about forest-vs.-trees issues, and we've made a *LOT* of progress in that area for certain classes of language, notably logic programming. 
For Prolog, it is certainly possible to show a nice visual coarse-grained view of the execution space (with suitable 'compressions' and 'abstractions' for very large spaces), and allow the user to home in quickly on the trouble spot via a variety of fine-grained views. We've got a graphical Prolog debugger for the Mac which does precisely this (available via anonymous ftp over the Internet from cs.toronto.edu in the directory pub/dgp -- damn, I must get it mounted on BIX... also, the stuff is written up, if anyone's interested, in the following article: Brayshaw, M., and Eisenstadt, M. A practical graphical tracer for Prolog. /International Journal of Man-Machine Studies/, 1991, 35 (5), pp. 597-631.). But generalising these ideas to other languages is hard: the 'proof tree' metaphor for logic programming is a natural for visualization tools... other kinds of execution metaphors are less suitable for visualization... but we're workin' on it!
------------------------------------------
c.language/tools #2863, from 'B5', 571 chars, Thu Mar 5 07:29:19 1992 Comment to 2860. More refs to 2860.
----------
I've been tempted to write up my latest bug (a race condition - that's been in there for months and just didn't happen often enough to detect it even existed!) but it's a dull and boring story... maybe I can liven it up with some gratuitous sex and violence! :) I was feverishly coding away while she was removing her outer clothing... Suddenly my screen went blank, as the power failed. Power tends to do that when the wires are ripped out of the walls by a Terminator. I was face to face with death! Fortunately, I had my stand-alone debugger...
------------------------------------------
c.language/tools #2864, from 'B8', 512 chars, Thu Mar 5 12:43:08 1992 Comment to 2860. Comment(s).
----------
A complicated bug exhibits such a rich variety of behavior that it reveals much of what you need to know to eliminate the problem. I think we can all agree that the hardest bug to fix is one that appears to be intermittent. I say "appears" because it's nearly impossible for a program to run differently twice under precisely the same conditions. Depending on what we ignore in the examination, it can seem to, but only because we aren't careful enough in discarding conditions as relevant factors. I think.
------------------------------------------
c.language/tools #2865, from 'B1', 189 chars, Thu Mar 5 13:21:32 1992 Comment to 2864.
----------
>under precisely the same conditions
That's the easy side. The fun is when it's under many conditions and you need to decipher it based upon "something that happened while you were on vacation".
------------------------------------------
c.language/tools #2868, from 'B9', 1968 chars, Fri Mar 6 12:07:18 1992 Comment to 2847.
----------
[comment to 'B2' Tue Mar 3 19:58:54 1992]
Several years back, I was given the job of programming a pretty big piece of code in VAX Pascal. Now on day two I started writing the code that would read and write a file filled with complex records. OK. I like to go step by step and do a lot of testing, so I first write the routines for writing the file, so that I may have something to test the reading routines on. I code, I test, it runs OK. Then I take the writing routines and change them around so they read the record from the file and return the values in parameters. I get garbage. Actually it's not random garbage, it's some kind of old data that shouldn't be anywhere near those parameters. Right. I fire up the compiler.
I walk through the offending routine with the debugger. The debugger says the routine assigns those parameters just fine. Obviously (I thought) the debugger's presence moves something, so the code is all right. I put in write statements. They show that the routine assigns the parameters OK, but when I come out of the routine, it's the same old stuff - garbage. So now I stare at the code. And I stare. And then a guy passes by. "Hi, how 'r you doin?" "I've got this bug.." "Can I help?" "I'll bet a can of Coke you can't" (after all, what have I got to lose?) So I explain the problem to him, and I show him the code. He looks at it, and says, "Why didn't you declare the routine parameters as VAR?" At which point I buy him a Coke to shut him up... (For non-Pascalers: Pascal normally passes by value. A VAR declaration means that the procedure is given a pointer to the variable instead of its value.) Since the procedure was first used for writing I didn't declare the parameters VAR - I wasn't changing them. And for reading I just reshuffled the code, without touching the declarations. What amazes me to this day is how quick I was to assume that there was an error in the compiler or in the debugger or in the operating system... :-)
------------------------------------------
c.language/tools #2869, from 'B17', 905 chars, Fri Mar 6 20:31:32 1992 Comment to 2868. More refs to 2868.
----------
>What amazes me to this day is how quick I was to assume that there
>was an error in the compiler or in the debugger or in
>the operating system...
What amazes me is how much offense people take when you point out that they, like everyone around them, are human -- and as humans are prone to making mistakes. All too many people assume that "I'm right, therefore *you* must be wrong if you disagree with me." I still recall the time my manager sent me out of town for 10 days. One of the dim-bulbs of the department was having a really rough time trying to find some bugs which were holding up a two million dollar project. The reason: he was afraid I would step in and "help", with me working on the assumption that pointing out the area in error would speed things up. The guy found it on day 9. Otherwise I might still be (a) working for this company and (b) still in Atlanta!
------------------------------------------
c.language/tools #2870, from 'B10', 1383 chars, Sat Mar 7 11:57:45 1992 Comment to 2868.
----------
All of which underlines the basic fact that, many times, the best debugger in the world is a fresh perspective -- be it someone else's, or your own after walking away for a while. At a company I used to work at, I got a reputation as "the human debugger" in our group. Somebody would be having tremendous trouble with a mysteriously hanging, crashing, just-plain-wrong, etc. section of code. Everybody would take turns trying to find it, all reaching the same "just shouldn't happen" conclusion... I'd walk in, look around for a minute, and usually point to the offending line or two. Everybody attributed this to my being (a) the principal designer of the project in question or (b) some sort of really annoying whiz kid (I was 19 or 20, they were 25 & up)... I think the _real_ difference was that I tended to ignore everybody else's advice on the suspected nature of the bug, and just paid attention to the code and the nature of the bug.
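The missing-VAR slip in #2868 has a direct analogue in C, where everything is passed by value unless you hand over a pointer; a minimal sketch (invented names, not the original Pascal):

    #include <stdio.h>

    /* Looks as if it returns results through its parameters, but the
       assignments only change local copies; the caller never sees them. */
    void read_record_wrong(int id, int count)
    {
        id = 42;
        count = 7;
    }

    /* The fix: pass pointers -- the moral equivalent of Pascal's VAR. */
    void read_record_right(int *id, int *count)
    {
        *id = 42;
        *count = 7;
    }

    int main(void)
    {
        int id = 0, count = 0;
        read_record_wrong(id, count);
        printf("without pointers: %d %d\n", id, count);  /* still 0 0: the caller keeps its old values */
        read_record_right(&id, &count);
        printf("with pointers:    %d %d\n", id, count);  /* 42 7 */
        return 0;
    }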
You know, the programmer who writes the code knows what he MEANT to say, explains it to the first person who helps, who passes it on to the next, and everybody's working on the same (wrong) assumptions about where the problem is. Maybe if you'd told your friend about all the debugging steps you'd taken, your theories about the addresses getting stomped, whatever, he wouldn't have spotted the missing VARs so quickly.
------------------------------------------
c.language/tools #2871, from 'B8', 1228 chars, Sat Mar 7 13:15:26 1992 Comment to 2870. More refs to 2870.
----------
Perhaps a "good debugging tool" would be a utility that translates your source code into a different language (from C to English, pseudocode, Pascal, Ada...) that the programmer understands. He could scan the code with a fresh perspective (in many ways he'd be seeing it for the first time) and this would help reveal problems in the code. Perhaps a utility that converted C into an arbitrary internal language and then converted it back into C would work. I know a problem in early human language translators was the way they dealt with idiomatic expressions. If you translated English to Russian and back to English, it would often be grammatically correct but would often make no sense. This feature is something that would actually work in a programmer's favor. The idea is, really, to show the programmer what the compiler THOUGHT he meant. The most straightforward way to do this now is often examining a lower-level dump of the program (often assembly or machine code), which shows you what the computer is doing, but doesn't necessarily tell you much about how what you instructed the computer to do mapped to what was done. If anybody implements this and makes a zillion dollars, I'd gladly accept a royalty. :)
------------------------------------------
c.language/tools #2872, from 'B24', 190 chars, Sat Mar 7 13:30:14 1992 Comment to 2870.
----------
> walk in ... a minute ... usually point to the offending
> line or two
I've had many similar experiences. I enjoy being able to do this; it's just an exercise in clear thinking.
------------------------------------------
c.language/tools #2873, from 'B11', 774 chars, Sat Mar 7 23:43:24 1992 Comment to 2855.
----------
> Sometimes I wish for a combined profiler/debugger in a single package.
Most debuggers *are* fairly good profilers. Just run the program and hit the break-key or break-out button a few times, and look at the call stack where it stopped. In almost no time, you'll have a very good idea where it is spending most of its time. This technique works well for getting the first, "easy" half of the fine-tuning for performance done. For the last bit, you need more detailed information, because you have to work on chunks of code that may only be taking 5-10% of the CPU time (each), and hitting the break key a few times in the debugger won't tell you which those are. A debugger which won't break you out of an infinite loop is a pretty sorry excuse for a debugger, IMO.
{same request was posted on c.plus.plus/tools #865}
From: 'B12'
Date: Wed, 4 Mar 92 07:08:13 EST
To: meisenstadt
Message-Id:
Subject: Debugging anecdote.
Hmm, there are so many. I program in a variety of languages, including C, C++, Pascal, Basic, Cobol, Databus, and all the Unix "text" languages like sh, perl, awk, sed, and so forth. My debugging approach is both systematic and intuitive :-). Many times I can look at suspect output and derive the general location of a bug. Sometimes they just come out of the blue.
Other times I single step the application through a debugger with a known bad dataset. (My favorite :- ). An example, at my place of employment we use MCBA cobol packages, which are generally large, cumbersome. The MRP in particular was plagued by bugs. Several of the bugs were caused by sloppy coding in patches. It got to the point where I would get the problem reports, and zoom right to the program and look for an extra or missing period in an IF statement. Another bug: This one took 2 days. 1. Software prints year to date sales reports fine for all fiscal months except 11 and 12. No obvious logic errors in the code, yet the output is full of garbage. 2. Solution, A SORT record data area was declared too small. Normally the compiler is supposed to catch junk like this, but in this case it did not. You guessed it, the SORT area was right next to the print variables. Boom. ------------------------------------------ c.plus.plus/tools #866, from 'B16', 238 chars, Tue Mar 3 17:08:01 1992 Comment to 865. More refs to 865. ---------- we have had a number of character-building debugging experiences, associated with a moderate sized c++ application (about 175 KLOC...to me that's not very big), using Sun's shared libraries, in the presence of static member objects. ------------------------------------------ c.plus.plus/tools #867, from 'B13', 267 chars, Tue Mar 3 18:11:05 1992 Comment to 865. Comment(s). ---------- I don't know if this is the sort of thing you are looking for, but I had a bug that took me forever to figure out: Basically it boiled down to a difference of opinion between the computer and me: it thought 256k == 262144, whereas I though 256k == 256000:) ------------------------------------------ c.plus.plus/tools #868, from 'B1', 124 chars, Tue Mar 3 19:56:31 1992 Comment to 867. Comment(s). More refs to 867. ---------- That's a typical (and interesting) "mental block" case. 262144 is the correct number, but what clued you into what you did? ------------------------------------------ c.plus.plus/tools #869, from 'B2', 259 chars, Tue Mar 3 20:02:10 1992 Comment to 867. Comment(s). ---------- Did you code 256000 as a constant in your code, leading to the problem? I've gotten in the habit (in C, of course, not C++) of defining powers-of- two constants using things like #define KBYTES_256 (256*1024) since it'll get folded at compile time anyway. ------------------------------------------ c.plus.plus/tools #870, from 'B13', 313 chars, Tue Mar 3 23:28:10 1992 Comment to 868. ---------- It was sort of a dual clue; I watched the memory get clobbered a dozen times in the debugger but it still didn't hit me, then I was looking at the code one time, and I saw all of those 0's after the 256, and thought, 'you know, you hardly ever see that many zeros in a computer- type of number.' Duh! ------------------------------------------ c.plus.plus/tools #871, from 'B13', 234 chars, Tue Mar 3 23:30:14 1992 Comment to 869. ---------- Yeah, the 256000 was a constant. I have taken an approach similar to yours now; I only use hex constants for the large numbers, they seem to be easier to remember: 0x40000 == 262144:) (goodness knows what 256000 is in hex) ------------------------------------------ c.plus.plus/tools #872, from 'B14', 947 chars, Wed Mar 4 22:56:44 1992 Comment to 865. Comment(s). ---------- I wrote the [... well-known ...] compiler. As part of its test suite, it must successfully bootstrap itself. 
Some of the most aggravating problems are when the bootstrapped compiler fails the test suite, but the compiler compiled with the previous version works! (Did anyone follow that? :-)) The solution is an arduous process of finding the bug in the bootstrapped compiler (it is usually a code-generation bug), and then tracking that error back through the previous compiler. Ugh. I used to have terrible problems tracking down pointer bugs. That was the genesis of MEM, a heap debugging package (now shipped with the compiler package). MEM has practically eliminated 85% of those types of problems. I cannot recommend using MEM (I've seen equivalents advertised in DDJ) highly enough. Everyone I have browbeaten into using it has been sold on it after it quickly tracked down several "intractable" bugs. The only way to find the bad bugs is persistence!
------------------------------------------
c.plus.plus/tools #873, from 'B2', 391 chars, Thu Mar 5 00:49:08 1992 Comment to 872.
----------
I have the same problem sometimes with a public domain C compiler I take care of. I was talking to 'B13' one night about the bootstrapping process sometimes causing bugs which don't show up until a later generation of the compiler is made with itself. He termed this type of bug a "genetic defect", and I kind of like that analogy -- I always think of it in those terms now.
------------------------------------------
c.plus.plus/tools #875, from 'B25', 411 chars, Thu Mar 5 09:28:21 1992 Comment to 872.
----------
> I wrote the [well-known] compiler. As part of its test suite, it must
> successfully bootstrap itself. Some of the most aggravating problems are
> when the bootstrapped compiler fails the test suite, but the
> compiler compiled with the previous version works! (Did anyone follow
> that? :-))
Yes. THAT problem is one every compiler writer has nightmares about years afterwards. And we have all been through it.
------------------------------------------
c.plus.plus/tools #876, from 'B17', 465 chars, Thu Mar 5 10:44:13 1992 Comment to 874. Comment(s).
----------
The ruler thing was a common debugging tool in IBM System/360 shops. JCL in particular was sticky about where you put things, although the assembler syntax had its problems. In both JCL and ASM the hassle was "continuation cards", where you had to get the indicator in JUST THE RIGHT PLACE and the continuation in JUST THE RIGHT PLACE. Ditto the old FORTRAN. In some respects the abandonment of card images (except in RPG) has made programming a LOT easier.
------------------------------------------
c.plus.plus/tools #877, from 'B8', 1195 chars, Thu Mar 5 12:16:31 1992 Comment to 876. Comment(s).
----------
Well, see, when the 360 came out, the notion of having dozens of terminals connected to a mainframe was yet to emerge. Everything was entered via 80-column cards (which is why we still have 80-column terminals prevailing today). When you punched a card, you usually put a punched card on a little drum on the keypunch. Holes in that card determined what kind of information you could enter into each and every column of the card. A blank column would send the card you were punching on to the next nonblank column. Striking a tab would send you, in this case, to the continuation column. This worked GREAT. It seems arcane and all of that, but it really did work well. The problem came when the interface changed from punched cards to terminals but the environment stayed the same.
Programmers began to view program text as less structured while the compilers were pretty much oblivious to the fact that the input was no longer coming from the card reader. All I'm trying to say is that the column-dependent design was a very, very good thing when it was first used. It simply lost its relevance. Anybody that wants to complain about JCL and ASM shouldn't leave out RPG, either. :) ------------------------------------------ c.plus.plus/tools #878, from 'B1', 424 chars, Thu Mar 5 13:29:47 1992 Comment to 877. ---------- >This wored GREAT. Yeah, until you got a freaking zigamorph or friend, and then it's lot of fun. When I give my [... well known ...] seminars, I present a cliche sequence early on as follows: "What you see is what you get" -- Anonymous "What you see is all you get" -- Brian W. Kernighan "What you see is what you see. What you get is what you get." -- 'B1' I think my version is the truest and is broadly applicable. ------------------------------------------ c.plus.plus/tools #883, from 'B1', 429 chars, Fri Mar 6 11:44:29 1992 Comment to 882. ---------- >3)As a Cognitive Psychologist, I'm very interested in human problem solving. >Debugging is a fascinating and important human endeavour, and I want to hear >what the pros have to say about it. Interesting comments. My minor in college was psychology and part of my initial and continued interest in computers comes from the gist of what you say, hence why I love talking with people about some of these issues all the time. ------------------------------------------ c.plus.plus/tools #884, from 'B15', 939 chars, Fri Mar 6 22:20:57 1992 Comment to 883. Comment(s). ---------- < debugging & cognitive psych One of my key 'weapons' for debugging is Karma. Karma gets me past the initial decision of whether a bug is mine or whether it's a Compiler/Linker bug. Before I developed a sixth sense about this it was taking me forever to track down the latter; in my former life as a mainframe programmer I took the (useful) view that ALL bugs were in my own code, and that there was no such thing as a hardware/OS/Compiler bug. This view is not practical on the PC. The apogee of my career as a Karma debugger came about a year ago when a friend called me on the phone wailing that she had spent three days trying to debug a block of MSC code that was giving a wrong result. After five minutes I began to suspect for no articulable reason that she was being victimized by Microsoft's optimizer. I told her to recompile the program with optimization disabled (/Od). This she did while I waited on the phone; BINGO. ------------------------------------------ c.plus.plus/tools #885, from 'B18', 83 chars, Fri Mar 6 22:30:36 1992 Comment to 884. Comment(s). More refs to 884. ---------- Hmm. This anti-mystic finds the Karmic approach very interesting. ------------------------------------------ c.plus.plus/tools #886, from 'B7', 767 chars, Sat Mar 7 00:41:20 1992 Comment to 881. Comment(s). ---------- >RPG I still remember when I took my Junior level college Comp Sci sequence... There was one class which had a quick overview of different programming languages. We normally used Fortran on our IBM 1130, with occasional recourse to assembler. They gave us quick tastes of RPG, APL (we had a special boot disk and ball for the teletypewriter), COBOL, and other such languages. So here I am, stuck punching these RPG statements on an IBM 026 card punch (if I was lucky, I could get access to the 029) with a weak ribbon. 
You can imagine how many times I got indicator selectors in the wrong column... I think the best way to enter RPG II _now_ would be with an editor which understood the syntax of RPG. But I'm not going near one if I have any say about it...
------------------------------------------
c.plus.plus/tools #887, from 'B15', 146 chars, Sat Mar 7 07:57:54 1992 Comment to 885.
----------
< This anti-mystic...
Hell, nobody is more anti-mystic than I am. Karmic, I guess, is a euphemism for "I dunno how it's done, but I can do it".
------------------------------------------
c.plus.plus/tools #888, from 'B1', 487 chars, Sat Mar 7 11:15:05 1992 Comment to 884.
----------
>Karma
Karma does have its place.
>no such thing as a hardware/OS/Compiler bug.... view not practical on the PC
It is not a practical view for an IBM mainframe either. I've had slews of bad experiences from my previous life, and they are not that much different as a whole from my life with the PC.
>recompile with optimization disabled... BINGO
I have no doubt that it is possible that she got hit with something like that. However, such things are often serious illusions.
------------------------------------------
c.plus.plus/tools #889, from 'B1', 77 chars, Sat Mar 7 11:16:09 1992 Comment to 886.
----------
>the best way to enter RPG II
The best way to enter RPG is to not enter it.
------------------------------------------
c.plus.plus/tools #891, from 'B19', 513 chars, Sat Mar 7 13:11:17 1992 Comment to 888. More refs to 888.
----------
> [optimizer bugs] such things are often serious illusions.
Don't know what version of MSC she was using, but 5.1 at least had the *documented* bug (call it a deficiency if you prefer) that it only supported the _syntax_ of 'volatile', not the _semantics_. I.e. you could use volatile w/o generating an error, but MSC did not do anything with it. Specifically, it did *not* prevent the optimizer from hoisting a volatile variable out of a loop. The only way around this was to disable loop optimizations.
------------------------------------------
c.plus.plus/tools #892, from 'B17', 873 chars, Sat Mar 7 17:37:37 1992 Comment to 888.
----------
When I was reviewing compilers for Infoworld a few years back, I took some code which I had developed under Lattice C and ported it to the remaining compilers. I first compiled the code with all optimization off in order to be sure the code was doing what it was supposed to do. I then turned on every optimization in the book, and tried to measure the difference. Microsoft C worked fine unoptimized, but failed when optimizations were turned on. After many calls to Microsoft (not to mention visiting them at Spring Comdex and then trying the stuff in the room at night -- what deadline pressure!) I was able to work around the compiler's mis-optimization. If you want to know what version, I'd have to go back and check. The *same* code worked under Lattice, Watcom, Meta-ware (almost), Borland Turbo, and one other which slips my mind at the moment.
------------------------------------------
c.plus.plus/tools #893, from 'B20', 530 chars, Sat Mar 7 18:38:13 1992 Comment to 887. Comment(s).
----------
I'm not sure why anybody would be *anti*-mystic. But, in any case, I believe you're using the wrong analogy. Karma has nothing to do with ESP or the like. Karma usually refers to the consequence of one's actions. The Sanskrit word "karman" literally means "an act." Karma is commonly used in the same way as the word sin.
However, it differs in that it is possible to have "good" karma--a state that is supposed to yield a just reward in the religious sense. I think the word you needed was magic. ------------------------------------------ c.plus.plus/tools #894, from 'B21', 464 chars, Sat Mar 7 20:58:51 1992 Comment to 893. ---------- RE: Karma It's been a long time since I read any of that stuff, but I believe that in the "mystic" sense of the word, karma refers to "old debts" that need to be paid. In other words, in a previous lifetime, one may have been proficient in some areas while deficient in others. Your current "karma" determines what your current challenges are in life, in order to even out those deficiencies from a previous existence. Anyway, that's how I remember it. ------------------------------------------ c.plus.plus/tools #895, from 'B22', 386 chars, Sat Mar 7 21:53:28 1992 Comment to 890. ---------- [Reply to meisenstadt #890] I suspect that some of the "karma" related debugging instances might actually be mysteriously filed memories. I sometimes have similar experiences but I have traced my intuition back to an earlier similar experience in some instances. I have lots of trash filed away in my memory, some of which is useful, even if it isn't stored in a logical fashion. ------------------------------------------ c.plus.plus/tools #896, from 'B15', 170 chars, Sun Mar 8 03:23:28 1992 Comment to 891. ---------- < MSC optimizer bug re 'volatile' No, that wasn't it. It was something pretty routine, if I recall, perhaps involving some very pedestrian floating-point computations. ------------------------------------------ c.plus.plus/tools #897, from 'B15', 547 chars, Sun Mar 8 04:52:59 1992 Comment to 888. ---------- < It's not practical to exclude the possibility of hardware/OS bugs < on a mainframe either Strictly speaking, this is true. However, I did find the mainframe hardware, OS etc more reliable than their PC counterparts. What I'm really saying is that there were plenty of wimpish bellyachers in the mainframe world who would all too readily attribute a bug to the hardware, OS, etc. as an excuse to avoid digging assiduously into their own code, wherein most of the bugs really lay. < ...such things are often serious illusions Whaddaya mean? ------------------------------------------ c.plus.plus/tools #898, from 'B15', 2537 chars, Sun Mar 8 04:53:30 1992 Comment to 890. Comment(s). ---------- < tell us what you heard during those 5 minutes and what < triggered your suspicions... Though it seems like there < is no reason, the reason is there!! OK, I guess in non-mystical terms that by 'karma' I mean simply that I have 25 yrs experience and a pretty good memory for my own mishaps. As far as the specifics of the case in question: I have spent several years developing C/C++ software for sale in the low-price retail PC software market. Since this is a very competitive market one is tempted to take risks in order to improve the performance of the software one is developing. These risks include the use of optimizers that may be very powerful but not entirely reliable (read: Microsoft C /Ox). Being unwilling to step away from the MSC optimizer I learned to co-exist placidly with its risks, which meant assembling a mental library of cases where I got burned and the workarounds I came up with. After a while some patterns began to emerge where I would be testing a program and something would go haywire and I would say to myself "Hell, that's ridiculous. 
I don't really believe that I screwed up this little function. Let's check out the generated object code." And I would often enough discover that the object code contained some kind of mistake. I would confirm by recompiling with the optimizer off, and if that fixed it, I would start messing with workarounds until I could turn the optimizer back on again without blowing everything up. (Incidentally, this is one of the merits of "printf" as opposed to "CodeView" debugging -- you can set the optimizer to your heart's content without making debugging impossible.) So, when my phone-caller summarized the code, and read little pieces of it verbatim, the "that's ridiculous" bell rang in my head because something resembled one of the patterns I had seen somewhere before. Incidentally, some of my critics have said to me [reasonably enough] that if I am willing to use untrustworthy tools, how can I ever be sure that my programs really are correct? That is a fair question. My answer is that if I were writing a program that keeps accounts straight at a bank or that guards people's lives by manipulating railway signalling lights, I wouldn't use tools like this. But there are other programs that can do very little damage even if they misfire slightly. For this category of program, I test it as well as I can, then put it on the market and let my customers tell me what goes wrong. If I'm conscientious about tech support, people seem to be happy...
------------------------------------------
c.plus.plus/tools #899, from 'B15', 404 chars, Sun Mar 8 04:53:49 1992 Comment to 893.
----------
< I think the word you needed was 'magic' [not 'karma']
As a certified anti-mystic, I consider karma to be a fraud. Therefore I was using it in an ironic, self-deprecatory way to represent some kind of thinking process that I have not taken the trouble to analyze properly.
< Karma is commonly used in the same way as 'sin'
Yes, some people consider my whole approach to programming to be sinful :-)
------------------------------------------
c.plus.plus/tools #900, from 'B23', 389 chars, Sun Mar 8 15:11:55 1992 Comment to 884.
----------
The *real* question is why anyone would ever trust optimizations so much (okay, on a PC)? Over the years of my experience with these things, I've learned that it's much, much better to develop all my code without any sort of optimization! If, afterwards, the code isn't fast enough, or if there is time, I'll go back and slowly turn on the options one by one and see where it breaks.
------------------------------------------
c.plus.plus/tools #901, from 'B14', 208 chars, Sun Mar 8 17:17:25 1992 Comment to 898.
----------
I propose that if you are writing mission-critical software, you *always* assume that the tools are pathologically untrustworthy. The only way is test, test, test... Never depend on your tools being correct.
=======================================================================
4. Open University
------------------
The replies below were collected between 3rd March 1992 and 8th March 1992 in response to a request which was posted to a personal mailing list and also to the following conference on CoSy, the Open University's in-house conferencing system:
debugging/anecdotes
Some replies also came directly to my personal email address.
------------------------------------------
From: 'OU1' 3-MAR-1992 11:42:12.12
To: M_EISENSTADT
CC:
Subj: bugs

The problem: given the ground-plan of a single-storey house, with the positions and sizes of each wall, window or door known, how to decide what is a room - so as to check that it has legal access to other rooms. The solution was a bizarre algorithm which could find top left-hand corners and then chase round contiguous wall-stretches.

I don't know if you'd be interested in how to find timing bugs in a concurrent language which has no debugging tools at all? No, I thought not! (PRINT statements are not much help because they drastically alter the timing, which may not be the same on each run anyway.)
------------------------------------------
From: ACSVAX:: 'OU2' 3-MAR-1992 12:28:28.89
To: ACSVAX::M_EISENSTADT
CC: 'OU2'
Subj: RE: Trawl for debugging anecdotes...

Can't recall any of the really big ones. We used to use Lattice C, and it didn't have a debugger we could use with our GEM applications. There were a few times when I had to add pauses, or get it to type values to the screen before I could trace something. But usually I got it by thinking, and looking. My normal routine was:
1) run the software & note the bugs.
2) fix the obvious ones.
3) often there would be just one recalcitrant bug. I'd have it written on a piece of paper on my desk & puzzle over it for a bit.
4) I'd go for a coffee & put it entirely out of my mind.
5) The moment I looked at the paper, on returning to my room, the solution would come to me.

The feeling was of standing back and disengaging from the problem. Whilst engaged with it my mind was tied to too narrow a range of ideas - to preconceptions about what it was likely to be. It was as if standing back gave me access to a much bigger pool of ideas, and as if the unconscious had gone on working on the problem in the absence of conscious thought.

When I was first using C I got caught for a while by having a subroutine return a pointer to one of its automatic variables. I hadn't thought about what was going on in terms of what was stored where. Although I'd worked out where the problem was, it was discussing it with a friend that clarified why it didn't work. Other slips which sometimes creep in, & which I always check for, are:
1) = instead of == in if/while conditions
2) assigning to too small a string
3) semicolon after the condition of an if statement, or before the body of a for loop
4) comment close missing
5) using = instead of strcpy.

We now use Borland C. The compiler picks up more potential problems than the Lattice one did - e.g. code that will never be reached; functions which are supposed to return a value, but only do so from conditional parts of the code.
------------------------------------------
From: 'OU1' 3-MAR-1992 15:15:17.86
To: M_EISENSTADT
CC:
Subj: bugs

I've just found one (timing bug)! Our language has a SEQUENCE command to force such things as PRINT statements to occur when you want them to; but given that each process within any routine would normally just go off and do its thing in its own good time, extra SEQUENCEs can slow parts of the program down and so distort the results. I'm reduced to commenting out suspect lines of code to see if the error goes away, or trying to hold in my head the possible execution-paths of all these multiply-spawning processes.
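[Illustrative aside: the probe effect OU1 describes is not specific to Joshu. The sketch below is plain C with POSIX threads (an assumption - no Joshu code is to hand - and all the names are made up). It contains a genuine timing bug, an unprotected shared counter; putting a print call inside the loop usually slows and serialises the threads enough that the lost updates stop showing up, which is exactly why PRINT-style instrumentation is such a poor probe for this class of bug.]

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                  /* shared and unprotected: the bug */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        counter++;                        /* racy read-modify-write          */
        /* printf("tick\n"); */           /* un-commenting this tends to hide
                                             the race by altering the timing */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (should be 200000)\n", counter);
    return 0;
}

(Compile with cc -pthread; on a multiprocessor the final count usually falls short of 200000 until the inner print statement is restored.)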
The main difficulty with timing errors is that they don't always happen, and that the machine can't tell you "this is a timing error": all it knows in its usual way is that "argument 2 is illegal" or something. If you look at the code it's clear that argument 2 /isn't/ illegal, so what must have happened is that an object didn't exist at the time argument 2 was sent to it, or that the object did exist but was locked by some other process.

You'll have gathered that the Joshu language is an object system - it's based on Scheme. When a method runs, the instance whose class the method belongs to is locked against all interference from outside (in particular, the values in its instance variables can't be changed). The programmer decides whereabouts in the method the lock can come off, and again one is trading speed against comprehensibility if one refuses to unlock any object until none of its methods is running at all. Conditional forms have implicit SEQUENCing, as do the variable-assignments in LET*; and there is an explicit DELAY command which allows you to say "wait until this circumstance arises before doing that". Otherwise, scheduling is handled by "mu-values". I think of these as bistable variables: they can be created with no value at all, and once given a value can never be changed again. But you can write (I have written) quite large programs without knowingly using mu-values, because they are implicit in many circumstances. For example, you could say

  (let ((obj (ask class: new)))
    (ask obj class-name))

and the ASK would automatically create a mu-value which had to be set by the creation of OBJ before it would send OBJ the message. Anyway, I didn't mean this to turn into a tutorial: I just thought you might be interested.

The trick for debugging seems to be to try to imagine worst-case scenarios: "Suppose this particular object were locked out when this particular method ran; what would be likely to happen?". It seems clear to me that the larger the system the more likely it is that obscure timing bugs will still be in it even after considerable efforts to get them all out. (Whether process A completes before process B is ultimately down to Heisenberg! - even with our simulated parallelism, you can see on screen processes completing in a different order from the commands in the code which set them off.)

I've been trying to figure out what would make a useful debugging tool for timing errors. Maybe an after-the-event map of what was locked and unlocked at what points - a map which one could play via video controls in good HCRL fashion. It wouldn't hold much promise of finding all the bugs, because no two runs of the program can be guaranteed to be the same, but I think I could really use the gestalt picture, if you see what I mean.
------------------------------------------
From: 'OU3' 3-MAR-1992 18:15:02.72
To: ACSVAX::M_EISENSTADT
CC:
Subj: RE: Trawl for debugging anecdotes...

I remember some cases. [The] first is linked with omitting brackets in [a] declaration, such as int *pntdt[10] instead of int (*pntdt)[10], so I had [an] array of pointers instead of [a] pointer to [an] array. [The] second is [an] error which might be called [an] unnecessary extra step in [a] loop. As Turbo C doesn't check [the] limits of the loop, I had such [an] error.
------------------------------------------
From: 'OU4' 4-MAR-1992 18:01:39.44
To: ACSVAX::M_EISENSTADT
CC:
Subj: RE: Trawl for debugging anecdotes...

Now you're asking.
My easiest example is a nasty interaction between JCL and Cobol on an ICL1903. The bug was very hardware-dependent: solved by reversing the punch cards (??!) Other than this, no simply described bug stories occur to me. More recent experience indicates that I might remember a bug or two thousand in the following kinds of endeavour:
* Hacking a buggy WP document (Microsoft of course) so that flow-around is the same on screen and printer
* Optimizing, e.g. memoization/"caching" of instantiated variables in a custom-built rule interpreter
* "Apparently recursive" processing of descriptions, nested without limit, where the property-structure of the "outer thing" and "inner thing" are interdependent (Remember the old LOOPS @instvar kludge?)
* Flow control in a "REP loop", where the Reader interacts with external events or with internal (editing) events, etc.
* Problems of tokenizing, where supported by someone's compiler which performs ill-specified "in-line" expansions of certain symbols
* Modifying the McCarthy (boolean) conditional to simulate a 3-truth-value conditional (??!)
* Flushing the screen in DEC20 Prolog by translation of code from Poplog's Prolog (???!)
------------------------------------------
debugging/anecdotes #2, 'OU5', 4258 chars, 3-Mar-92 22:22
This is a comment to message 1
There is/are comment(s) on this message.
--------------------------
> stream-of-consciousness reply .....

This is probably of no interest to anyone but me, but as it accounts for most of the last month of my life, here goes:

For the last decade only one high-energy physics group (in Moscow) has had a functioning code (called MINCER, written in SCHOONSCHIP, now upgraded to FORM) for the most demanding Feynman diagrams of quantum chromodynamics (QCD). It has proved to have bugs, now hopefully corrected. At the AI-92 conference I was encouraged to write a new program (I call it SLICER) from scratch, in REDUCE, to check MINCER (inter alia). The reason no-one has checked MINCER is that the algorithm is almost impossible to describe, let alone implement. It took me 2 weeks to think how to construct SLICER, 2 days to write it, and 2 weeks (i.e. 140 hours) to debug it, which is what interests/puzzles me.

The reason it was so difficult to debug, as far as I can tell, is that in writing it I was working very fast, in order to keep the ideas alive. (They would have died before I finished had I been slow and careful!) It thus contained an incoherent mixture of (a) simple typos, (b) untested intuitive logic, (c) almost unreconstructable ingenuity, and (d) flowchart-defying recursiveness. So in scanning a typical single-statement procedure (example appended) it was impossible to wear a single hat (typo checker, logic chopper, if/then/else flow analyst...). Wearing just one, the code to be debugged is unintelligible; wearing all at once, one is in the same mindsplit as at writing. Nor are the (algebraic!) results from chaining together a score of such recursive procedures of much diagnostic utility in debugging, even when the result on the bottom line is manifestly absurd. (That is, I guess, why MINCER was wrong for 8 years and unchecked for 10.)

What then to do? (Sorry for the lengthy preamble, but the problem is specific to computer *algebra*.) I finally found (hopefully!)
all the typos, dud logic, unforeseen paths, etc., by *reciting* the program aloud (I kid you not), all the while *trying* to think about only *one* type of thing (plus versus minus, order of arguments, lettering of functions and arguments, if/then/elsery, etc.) per singsong. Obvious, you may say. But the way it worked was far from obvious, because I practically never found what I was looking for. Rather, I stumbled in one singsong on something I was not supposed to be checking for and had merrily sung through on the appointed pass. From this I tentatively conclude that finding errors in such monstrous code is a highly irrational process, best done (in my case) by the subconscious, allowed maximum freedom from conscious intent! If this makes me look crazy, I'm not surprised. It felt pretty crazy. I'm less interested in people telling me how I should have done it than in hearing if others have to fall back on such craziness.
-------------------------
typical single(sic)-statement procedure:

procedure lmsli(p2,pk,pl,q2,qk,ql,k2,l2,kl,pq,dk,dp)$
if min(ifsli(p2),pk,pl,q2,qk,ql)<1
 then lmr1sli(p2,pk,pl,q2,qk,ql,k2,l2,kl,pq,dk,dp)
else if max(ifsli(k2),ifsli(l2))<1 or (ifsli(kl)<1 and min(ifsli(k2),ifsli(l2))<1)
 then lmr2sli(p2,pk,pl,q2,qk,ql,k2,l2,kl,pq,dk,dp)
else if l2<1 and dk=0
 then lmsli(p2,pl,pk,q2,ql,qk,l2,k2,kl,pq,dk,dp)
else if k2<0
 then lmsli(p2,pk,pl,q2,qk,ql,k2+1,l2,kl-1,pq,dk,dp)
      +lmsli(p2,pk,pl,q2,qk,ql,k2+1,l2-1,kl,pq,dk,dp)
      +lmsli(p2,pk,pl,q2,qk,ql,k2+1,l2,kl,pq,dk+1,dp)
else if k2=0 and dp>0
 then lmsli(p2-1,pk,pl,q2,qk,ql,k2,l2,kl,pq,dk,dp-1)
      +lmsli(p2,pk,pl,q2-1,qk,ql,k2,l2,kl,pq,dk,dp-1)
      -lmsli(p2,pk,pl,q2,qk,ql,k2,l2,kl,pq-1,dk,dp-1)
else if k2=0 and dp=0
 then rpsli(d+dk-2*l2-p2-q2)*
      (lmsli(p2+1,pk,pl,q2,qk,ql,k2,l2-1,kl,pq,dk,dp)*p2
      -lmsli(p2+1,pk,pl-1,q2,qk,ql,k2,l2,kl,pq,dk,dp)*p2
      +lmsli(p2,pk,pl,q2+1,qk,ql,k2,l2-1,kl,pq,dk,dp)*q2
      -lmsli(p2,pk,pl,q2+1,qk,ql-1,k2,l2,kl,pq,dk,dp)*q2)
else if pq=0
 then rpsli(d+dp-2*q2-qk-ql)*
      (lmsli(p2,pk,pl,q2-1,qk+1,ql,k2,l2,kl,pq,dk,dp)*qk
      -lmsli(p2,pk,pl,q2,qk+1,ql,k2-1,l2,kl,pq,dk,dp)*qk
      +lmsli(p2,pk,pl,q2-1,qk,ql+1,k2,l2,kl,pq,dk,dp)*ql
      -lmsli(p2,pk,pl,q2,qk,ql+1,k2,l2-1,kl,pq,dk,dp)*ql)
else flsli(p2,pk,pl,q2,qk,ql,k2,l2,kl,pq,dk,dp)$
------------------------------------------
debugging/anecdotes #3, 'OU6', 749 chars, 4-Mar-92 11:31
This is a comment to message 2
--------------------------
... I /never/ worked out an algorithm for debugging Reduce. If there is a method at all, I think it's much as 'OU5' described - simply checking that the source-code correctly represents your intentions (that you've coded the algorithm correctly, with all the signs and powers of i in the right place). If you've done all that and it's still wrong, in-line diagnostics are a desperate last measure. Come to think of it, Lisp programmers must have the same problems. How do they debug programs?

Off the point: the worst, most singularly unhelpful error message I have /ever/ been given was from a Reduce program, and was of the form `identifier 3bc0839c0fa20 ... (several HUNDRED blocks of hex) ... 34890c2f0e is not a valid operator'.
------------------------------------------
debugging/anecdotes #4, 'OU6', 1018 chars, 4-Mar-92 11:39
--------------------------
The only recent debugging mountain I can remember climbing was in a large (and not particularly well-written (I wrote it)) C program. Part of the thing was to do a confusing, but not particularly complicated, 3D transformation on some data points.
I'm afraid the way I found the bug wasn't particularly illuminating - after I'd patiently confirmed that the problem wasn't being introduced at a later stage, I finally had to resort to systematically going through just about every possible form of the transformation, even the ones I was sure I had tried and found wrong, until I found one which gave the correct result. I did get it eventually, and it /was/ one of the ones I was sure I'd tried already, but it wasn't a satisfying way of doing it. This whole operation took about two weeks, partly because it was confusing, but mostly because the program was reeeeeally boring. I managed to stumble across a couple of other minor bugs, and add a few bells, while I was doing it, though.
=======================================================================
5. AppleLink
-------------
The replies below were collected between 21st July 1992 and 29th July 1992 in response to a request which was posted to a personal mailing list and also to the following conference on AppleLink, the worldwide email and bulletin board facility of Apple Computer, Inc.:

    Discussions/Developer Talk/Debugger Discussion/'My Hairiest Bug' Tales

Some replies also came directly to my personal email address.
------------------------------------------
Item 6620829  21-July-92  16:11PDT
From: 'A1'

I once spent 3 or 4 days trying to track down a bug which occurred because there was one stray call to printf() (or a variant) left in a bunch of code which I was converting into several XCMDs. I think that I knew that printf()s wouldn't work in XCMDs, but the error message from the linker didn't mention the actual name of the offending routine. I believe that I finally found it by reading the code straight through...
------------------------------------------
Item 1757424  21-July-92  17:03
From: 'A2'

I had a nasty problem in a small numerical C program, which led me to some really wrong conclusions for a while when I didn't know I had a bug. The crux of it was an index of 1 where it should have been i. I found it with my usual tools: staring at the source and pondering the output; the tools worked about as well as usual, which isn't very.
------------------------------------------
Item 7150400  22-July-92  09:53
From: 'A3'

I'm not quite sure why, but it seems that no matter what kind of debugging tools my environment has (I've used MPW assembler, THINK C & Pascal, and Macintosh Common Lisp primarily) I always 'home in' on a bug by inserting loads of beeps or print-statements in my code. I start with them just at the tops of suspect functions, then fill them in at a finer granularity as I narrow the suspect code.
------------------------------------------
Item 8420771  22-July-92  23:09
From: DISCUSSIONS/Developer Talk
Sub: One bug and some thoughts
Author: 'A4'
Path: Debugger Discussion/'My hairiest bug' tales/'Hairiest bug' motivation

In addition to being a developer on the Mac since '84, I teach Mac programming for Developer University and Software Development Training, so I've seen my share of bugs. The bugs that are most memorable are ones that better compilers would simply catch nowadays. For instance, this line of code took me 2 days to find:

    SetPort(&oldgp);

nestled in the middle of a 400k program. Modern C compilers catch this immediately (the bug is that the & shouldn't be there). Many of my students have problems that could easily be solved by smarter compilers. They use THINK, which doesn't warn if you use uninitialized variables.
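[Illustrative aside: SetPort expects a GrafPtr, which is already a pointer to the port record, so the stray & hands QuickDraw a pointer-to-a-pointer and some random stack contents become the "current port". A minimal stand-alone sketch of the slip, with stub declarations standing in for the real Toolbox ones purely so the type mismatch is visible outside a Mac toolchain:]

/* Stubs that mimic the shape of the QuickDraw calls. */
typedef struct GrafPort { int portBits; } GrafPort, *GrafPtr;

static GrafPtr thePort;                          /* the "current port"      */

void GetPort(GrafPtr *port) { *port = thePort; } /* save: needs an address  */
void SetPort(GrafPtr port)  { thePort = port; }  /* restore: takes the port */

void drawSomewhereElse(void)
{
    GrafPtr oldgp;
    GetPort(&oldgp);        /* correct: pass the address to write into      */
    /* ... switch ports and draw ... */
    SetPort(&oldgp);        /* BUG: passes a GrafPtr*, not a GrafPtr; an    */
                            /* old compiler let it through silently, while  */
                            /* a modern one flags the incompatible type.    */
}

(Old toolchains without prototypes did little or no argument type-checking, so the call went through; with prototypes in scope the mismatch is reported at the exact line, which is how a two-day hunt becomes a one-second compiler diagnostic.)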
C in particular has inherent problems for Mac programmers, particularly because of the lack of distinction between reference variables (VAR in Pascal) and pointers.

The types of problems that are thorniest are:
1. timing problems in a single process
2. timing problems with multiple processes
3. problems with standalone and system software (i.e., CDEFs, etc.)
4. sporadic problems, or ones with involved configuration

The solutions that are currently available to help debugging are rather varied. I personally use Jasik's Debugger for the thorniest problems, because of its extensive discipline, its mixed source/assembler view, and its ability to look easily at both my own and the system's data structures and at system code. As long as system software is buggy, this is a necessity.

It seems that there are lots of areas where automation would be useful, particularly regression testing, stress testing and coverage checking. VU requires too much to be scripted; most small developers would rather not spend the time.

Basically, expert debuggers soak in data and make conjectures about internal (mis)behavior from their own experience. So data, data and more data is really the most useful thing in most cases. What that data needs to be may vary (timing sometimes, data structures other times), and almost all debugging tools for production coding fall far short on effective presentation of the data. Graphical presentation of things like program flow could be helpful at times, but isn't really as important as other things. But graphical display is useful. One of the handiest things in both MacsBug and Jasik is a way to display the shape of a region on the screen while you're stopped. And to collect the data, expert debuggers conduct experiments. Any tools which make these experiments easier to set up, and data collection easier, are welcome additions.
------------------------------------------
Item 0597453  23-July-92  08:49
From: 'A5'

I've done a lot of debugging of low-level software (including microcode and hardware) in several languages. By far the most time-effective debugging technique I know is to guess correctly what the problem is and then prove that the guess is correct by using debugging tools. Of course I can't tell you how to make accurate guesses. Non-trivial bugs yield much better to this "synthesis" attack than to an "analysis" attack using tools to slowly home in on the problem.

Another anecdote for you. This has happened to me a couple of times, although I don't remember enough specific details to make it a good story. It has happened both with "crash" bugs and with performance problems where the program executes correctly but too slowly or in too much space. I spend a lot of time figuring out what the problem must be and coming up with a fix. I test the fix and, lo and behold!, the problem goes away. Much later, I discover that what actually fixed the problem was recompiling the program to install the fix, and none of the changes I made actually had any effect. In one case of a performance problem, I discovered this because the problem came back a few weeks later, and the real source of the problem was interference in a direct-mapped cache; recompiling the thing had moved it to a different address and thus eliminated the interference.
------------------------------------------
Item 8483386  24-July-92  19:58PDT
From: 'A6'

Well, I had this 3-day bug that nearly killed me. In the end we don't know the exact cause but have some good ideas. Essentially I was writing some serial code which would be called from an XFCN.
So I built a piece of code in THINK C which would open a serial port, configure it, ask for a record from a Polhemus, parse the record and close the port. I put this code in a WHILE loop, with the test being whether the mouse button was down. In this way I could simulate the action of an XFCN. The code buzzed along returning records, about 3000 or so, then it crashed. So I thought it might be a malloc/free kind of problem. Checked that, then ran it again. Crashed again, on a different iteration. Now, MacsBug was not being invoked, and using the THINK C debugger was even less useful. What the [$#@%], I thought. Being a novice Mac programmer (2 months to be exact), I started to freak out.

I had heard that ANSI code was trouble. I removed it all and replaced it with Toolbox code. Same thing. It would run anywhere from a few hundred to a few thousand iterations, then crash. But occasionally I would get a MacsBug error: error number 28, stack overflow. OK, so I checked all my optimization parameters, put prototypes on, and made sure I was returning something from a routine when I was supposed to. Everything looked fine. I was on my 2nd day. I put in lots of MemError calls, checked every single return value. I made sure there was no garbage in any allocated buffers. Shut off all weird INITs. Increased the memory size of my app so it would have enough. I was on my third day. The stack overflow would come from the Mac routine that runs during the VBL which checks for stack/heap collision. But my routine that was called right before this was always different. It looked hopeless.

Now I did what everyone debugging should always do: I called in someone else. In my case, it was the big guns: the Mac programming gurus of the group. It was me and two down-and-dirty assembly hackers giving it a whirl. They showed me all sorts of MacsBug secrets, and we thought we should write some assembly to catch that runaway stack pointer. Just then, one of the aforementioned Mac gurus started laughing and said "You know, I tried to do exactly what you are trying to do last year. I wanted to rapidly open up and close serial ports for my sound app. The program would run for about 20 minutes and crash. Try this. Write your code so you open the port once, pass state back to HyperCard, read, then close when done." OK, so I took the 15 minutes and did this. Lo and behold, we all watched in amazement as the code ran for tens of thousands of iterations. Though the Device Manager should be robust enough to deal with the rapid opening and resetting and closing of serial ports, it just can't deal. So we worked around it. We are going to leave it to someone else to find the real cause of the bug.
------------------------------------------
Item 3976801  29-July-92  04:12
From: 'A7'

My hairiest bug was fixed while I was at Ampex Corporation in 1980 (or thereabouts). We had a version 6.0 UNIX system, which we had upgraded to have "special" real-time and terminal-handling capabilities. The system had been stable and operational for over 6 months. We received the version 7.0 UNIX upgrade (I might have referenced the wrong version numbers - it's been a long time). I installed the "special" modifications and we tried to run. However, the system would crash after about 2 hours of operation. Depending on load, it might crash after 30 minutes or stay up for about 4 hours. So, figuring that our modifications were hazardous, we went back to the original source and tried to run. Same crashing problems occurred, indicating our changes were not the source of the problem.
Calls to others running Version 7.0 indicated that it was not a general problem with this release and/or our processor (a PDP11/55, I believe). Crashes were so severe that the crash dumps never occurred. Lots of printf's were inserted, but they gave no hint of what was causing the problem. Some of our hardware types decided to attach a bus analyzer to the backplane, to monitor the exact sequence causing the crash. They then localized the jump causing the problem, and discovered the instruction was jumping to an incorrect location. Extensive monitoring (a difficult thing to do, since it could only be done at night and took several hours to trigger a crash) found the problem.

It turns out that the processor-local memory boards were custom-tuned to the processor board. Many months previously, these had been swapped (to test for an unrelated failure). Since they were not properly tuned, the data was being returned in time for the parity check to be stable, but the data was unstable when loaded into the register. Thus, bad data was fetched from memory.

Why did this problem not occur in the previous running of Version 6.0? Our best guess was that the bad memory location was infrequently used in Version 6.0, and therefore the problems did not appear. However, the code upgrade changed the memory-reference pattern, and the marginal memory address was accessed more frequently.

Lesson from the experience: your software "bug" may actually be a hardware error. Hardware may be flaky, even though previous versions of the software have run (and continue to run) without failures.
------------------------------------------
Item 8343740  29-July-92  20:58PDT
From: 'A8'

Here are some stories:

For more than a year a weird phenomenon has occurred using HyperCard 2.0. Cards, as if by magic, disappear. By lucky chance I noticed a child in the act of destroying a card. I was ecstatic, he was upset - it was a great juxtaposition. I am screaming that he is doing great work - he has just solved a year-old mystery - and he is worried about his lost work. (I subsequently retyped all of his lost info.)

Anyway, many of the kids have learned to hold the Command key and click the mouse to create a button or field. Many have also learned to hold Option to create a copy of a button or a field. Many also know to use Option-Shift to create a copy of a button or field and keep it aligned. In this case the child creates a button or field with the Command key held down and, in his/her confusion about which keys to hold, is also holding down the Shift key. Alas, the button or field is not right (oftentimes just not the right size). With the Command and Shift keys still held down, the decision to get rid of the newly created button or field is made and the Delete key is pressed. Voila - the card is gone. I guess it is a shortcut to delete cards, and it seems innocent enough - it requires holding down 3 keys and the use of two hands. Just goes to show you that the world of kids is different from the world of adults.