CS259 Summary Notes
© 2012 Chris McGinlay CC BY 3.0
Licence: Creative Commons Attribution 3.0 Unported
Idea of debugging by conducting a systematic experiment to establish cause(s) of failure.
Failed Program ->
Compare expected and actual outputs for given inputs ->
Find defect causing failure.
Illustrated with the ‘fruit fly’ model - an HTML stripper implemented via a finite state machine.
Defect - An error in code. Preferable to the term ‘bug’ which implies that something crawled in when you weren’t looking, whereas in fact the programmer put the defect there in the first place.
Failure - an externally visible error.
Error - a deviation from correct behaviour.
Infection - error in program state, propagating forward to successive states.
Program - succession of states represented by a state vector (of variables, objects etc).
A defective statement causes a change of state, such that a sane input state becomes an insane state. Marked by an infection with no precursor infection.
Possible: defect -> infection -> failure
Sometimes: defect -> infection -> no failure
However a failure can always be traced back to a defect via cause-effect chain. This can be HARD; tens of thousands of variables, tens of thousands of intermediate states.
Posit a hypothesis. (Something you expect to happen, or think may be happening to program state). Devise an experiment to test this hypothesis. E.g. use ‘assert False’ to mark code which should never be reached (such as an undefined branch of if..elif..else).
Hypotheses -> Prediction -> Experiment -> Observation.
Observation will result in either confirmation or rejection of hypothesis. So either you refine or reject your hypothesis.
Finally, once you have established the cause of the failure, correct the defect and verify your fix.
Keep a written log of what you’ve tried, as opposed to keeping the whole situation in your head. No need to memorise; easy to interrupt work and resume later. The act of writing things down forces structured thinking and can reveal the solution. Similarly, talk to someone about the problem: the act of explaining forces you to understand the situation and structure your thoughts before opening your mouth.
Exercise to illustrate is SPYDER - Simple PYthon DEbuggeR. Using sys.settrace(traceit).
Arguments to traceit are frame (the stack frame, giving access to locals), event (a string such as 'line', 'call', 'return'), and arg (event-dependent, e.g. the return value on 'return', None on 'line').
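A minimal sketch of such a trace function in Python; the names and the reconstructed remove_html_markup (the course’s ‘fruit fly’) are illustrative:

```python
import sys

log = []    # (function name, line number, locals) for every executed line

def traceit(frame, event, arg):
    # SPYDER-style trace function. arg is event-dependent
    # (the return value on "return", None on "line").
    if event == "line":
        log.append((frame.f_code.co_name, frame.f_lineno,
                    dict(frame.f_locals)))
    return traceit      # keep tracing within this scope

def remove_html_markup(s):
    # The course's "fruit fly": a finite-state HTML stripper.
    tag = False
    out = ""
    for c in s:
        if c == "<":
            tag = True
        elif c == ">":
            tag = False
        elif not tag:
            out += c
    return out

sys.settrace(traceit)
result = remove_html_markup("<b>x</b>")   # -> "x"
sys.settrace(None)
```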
Assertions provide automatic runtime error checking. Provide key to automation. Topic looks at writing assertions, manually and automatic inference of appropriate assertions.
assert condition -> true: OK, proceed with program
-> false: Interrupt execution, raise exception.
Are asserts available in your language? If not, roll your own. Built-in asserts can typically be switched off globally (in Python, via the -O option).
Assertions do not yield location of defect.
Assertions can be used in tests (such as unit tests) or used in code as specification (for function args and return values).
Asserts as a specification. Precondition: checking arguments to functions at start of execution of function. Postcondition: checking return value just prior to returning it.
In cause-effect chain, asserts automatically check large portions of states, aiding in identifying defect. Earlier detection results in shorter cause-effect chain.
You can wrap function call for function being tested in try:except block to catch exceptions and provide diagnostic output. (Example discussed of square_root postcondition)
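A sketch of the square_root idea; the tolerance and the wrapper’s diagnostic output are assumptions, not the course’s exact code:

```python
import math

def square_root(x):
    assert x >= 0                             # precondition on the argument
    result = math.sqrt(x)
    assert abs(result * result - x) < 1e-8    # postcondition on the return value
    return result

# Wrap the call under test to turn assertion failures into diagnostics:
try:
    print(square_root(4.0))                   # -> 2.0
except AssertionError:
    print("assertion failed in square_root")
```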
Catching errors. Testing (pre and post conditions, try:except diagnostics). Documentation (precondition=spec).
Precondition and postcondition assertions trade off strictness: if too lax, they will miss bugs; if too strict, valid inputs are excluded.
Use the __repr__() special method to give objects a readable representation for diagnostics.
Example: wrong input can induce a ValueError exception much later on, a TIMEBOMB in Python. Assertions can be used to mandate valid data when objects are instantiated or functions are called, e.g. assert 1 <= day <= 31. Place such assertions in __init__(), possibly in a checkRep() helper function.
Invariant is a condition that always holds for an object as viewed by the user. (Invariance not guaranteed during internal operations).
Develop a checkRep() helper function, and call it in the object constructor and just before returning from any method that modifies internal state.
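A sketch with a hypothetical Date class (names invented for illustration):

```python
class Date:
    """Hypothetical class whose data invariant lives in checkRep()."""
    def __init__(self, day, month):
        self.day = day
        self.month = month
        self.checkRep()          # invariant must hold after construction

    def checkRep(self):
        # The data invariant, checked in one place.
        assert 1 <= self.day <= 31
        assert 1 <= self.month <= 12

    def advance(self, days):
        # Modify internal state (the invariant may be broken mid-method)...
        self.day = (self.day - 1 + days) % 31 + 1
        # ...and re-check it just before returning.
        self.checkRep()
```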
Example of usefulness is the Red-Black tree structure. This structure is good for search, insert, delete (log time). Insertion and removal of nodes DIFFICULT: many cases.
If implementing such a structure (don’t: use a library instead), first thing you write is an invariant checker.
In the Eiffel language, invariants are a builtin language feature, defined in class definitions and checked in pre and post conditions in public methods.
Assertions may be turned off, and hence must not change the internal behaviour of the program (no side effects in assert conditions).
Why would you turn ‘em off? Assertions can take time, hit performance. Test this, maybe turn off expensive asserts in production code. Leave the cheap ones.
In particular don’t use asserts for checking public preconditions on 3rd party data input: a SECURITY hazard, as asserts can be turned off. Use non-optional input validation and raise your own exceptions, or pop up error boxes in a webapp etc.
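For instance, a hypothetical set_day that validates unconditionally:

```python
def set_day(day):
    # Public input: validate unconditionally. Do NOT use assert here,
    # since assertions vanish under `python -O`.
    if not 1 <= day <= 31:
        raise ValueError("day must be in 1..31, got %r" % day)
    return day
```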
+better to fail with an assertion than continue with bad data.
+helps prevent defect ever reaching the field
+helps reveal info about the failure if defect does get to the field
-performance hit, selectively disable expensive asserts
?aborted execution not user friendly. However alternative is bad data and arbitrary behaviour.
These are properties holding throughout execution, such as flags or maybe buffer allocations. Can use sys.settrace on each line of code execution to see violation of this invariant. Could call a checkRep() helper function.
Memory buffer overflow. System invariant would be checking C style array boundaries, protecting memory. Packages such as Electric Fence or valgrind.
1&2 are easy to implement, so do them first. 3 is more worthwhile once 1 and 2 are done since we know data input should be sane!
DAIKON tool can infer invariants from repeated program executions. (DAIKON, Michael Ernst et al).
“After removing some part of a failure inducing input, the failure no longer occurs”
-> the bit you just removed includes the failing input
Idea is to reduce and simplify large failure inducing inputs.
In a bug report, what is relevant? Condense to a test case in which every part is necessary to induce the failure.
This requires a strategy and an automatic test. The simplest strategy is binary search (recursive split + test). The automatic test must check whether the reduced input passes or fails (does or does not trigger the failure).
Binary search will often leave non-essential parts in the reduced input. Introduce delta debugging instead: ddmin(). Algorithm: repeatedly try removing chunks of the input (keeping the complement); whenever a reduced input still fails, continue from it; otherwise double the granularity, until no single chunk can be removed (a 1-minimal input).
It is of course useful if your implementation of ddmin can take a generic test function as an argument.
It is also efficient if your test function can exploit domain knowledge of the input structure, splitting the input into a list of elements. The test function then needs to recombine the list into an entity for testing (in the example given, via Python’s str.join()).
Generate random input data. Good way to test your software. Used as an attack vector by the bad guys to generate failures.
Describing the “yesterday it worked” problem. Someone has updated a library, or applied a patch. Now your code fails. Question is which patch or sub-patch or library? DDMIN can help, except the testing algorithm is intensive: apply diffs, build, execute. Requires automated build process and CPU intensive. Popular tool is git bisect which can do this.
DDMIN returns a failure cause. Using a counterfactual definition of causation.
Cause. Suppose there are events A, B with A preceding B. A causes B if, had A not occurred, B would not have occurred.
When fixing a defect, you want to be sure that:
Some errors are not causes of a failure. (May be unrelated)
Some causes are not errors. (Cause may be legal input).
You want to find errors which are causes.
There can be many causes. Frivolous ones, such as your existence is a cause.
Actual cause A - assumes and changes as little as possible, yet A still changes the event B.
Python sys module. settrace specifies system trace function. “The local trace function should return a reference to itself (or to another function for further tracing in that scope), or None to turn off tracing in that scope.” - http://docs.python.org/2/library/sys.html#sys.settrace
frame.f_code.co_name - function name
frame.f_code.co_filename - source code filename
frame.f_code.co_varnames - local variable names in tuple
frame.f_locals - dictionary of local variables
Possible events: call, return, line, exception, c_call, c_return, c_exception.
arg is None, a return value, an exception tuple, or a C function object, depending on the event.
Goal is to obtain a cause-effect chain. Perhaps you can observe it happening. If you can’t observe it directly, you can reconstruct it from deductive reasoning (as in the Sherlock Holmes examples).
Consider propagation of errors, trace back dependencies. There are automated deduction tools such as PEX (try it online at www.pexforfun.com). This finds inputs that cause assertions to fail. Code can be C#.
Any proof of a fix is only as good as the specification. Other, novel inputs may cause different failures to the one you’ve fixed. (Good assertions help here).
At least two ways to fix the example, either
Generally, follow data and/or control dependencies to assist with deductive reasoning.
Statement B is data dependent on statement A if statement A writes a variable which B later accesses, without another statement writing that variable in between.
NB: B has no data dependency on anything if B reads no variables!
B is control dependent on A if statement A determines whether statement B gets executed or not.
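A four-line illustration of both kinds of dependency:

```python
x = 14            # A: writes x
if x > 10:        # C: controls whether the next statement executes
    y = x * 2     # B: data dependent on A (reads x; nothing wrote x in between)
                  #    and control dependent on C
```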
This is the set of statements that S transitively depends on: its direct dependencies, their dependencies, and so on.
NB: Anything not in the backward slice cannot influence S itself or influence whether S is reached. In this way, backward slice acts as a filter on what code you need to look at.
This is the set of all statements that transitively depend on S.
There exist slicing tools that can find forward and backward slices automatically.
These apply to actual executions rather than to the programs themselves. They describe what has happened as opposed to what could happen. They are based on program traces:
Execute program => obtain trace => dynamic slice
They contain only executed lines, and of those only the statements that influence the output. Typical coverage tools reveal around 7% of code lines covered in a typical execution, whereas dynamic slicing reveals that only around 2.8% of lines can actually influence the output.
In practical terms, in trace function, look for event==”line”. Can then read control and data dependency from these covered lines. (This is the slice). Hopefully will reveal cause-effect chain.
Having obtained a set of possible causes (origins) for a failure, use scientific method or deductive reasoning to test for actual origin, or eliminate non-relevant origins respectively.
Where to start if there are a large number of possible origins? In order of best places to start:
Essentially, we use ddmin() to establish minimum failing state. If you can change variables such that failure goes away, then you have found failure cause.
Use values from a successful run to make changes during a failing run (oft underused feature of debugging tools). ddmin() will find the minimum set of such changes required.
Only thing we have manually supplied at this point is the location and iteration to test. See later.
Need to be able to reproduce failing runs - in order to observe behaviour and check that a fix actually works.
Bohr Bug: reliably repeatable under well defined conditions.
Heisenbug: disappears or alters behaviour as you explore it.
Mandelbug: appears chaotic or non-deterministic.
Schrödinbug: does not manifest until someone realises the program should never have worked in the first place.
Real world program inputs include:
Implement via a ‘capture-replay’ tool: a layer interposed between the user and the program, intercepting the library or system calls that handle user input. For GUI recording, don’t record mouse movements etc., as these are highly variable from host to host. Instead record function calls and parameters on GUI objects. The Selenium tool provides capture/replay for web projects. Risks with recorded ‘scripts’ are sensitive data exposure, and changes to the GUI environment rendering recorded scripts obsolete.
Implementation of a simple tool in Python: two modes, ‘capture’ and ‘replay’. In capture mode, the trace function intercepts call events, storing function name and arguments in a list. Replay mode uses Python’s eval function to trigger the reconstructed function calls.
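A minimal sketch of the two modes (all names assumed; a real tool would also persist the list between runs):

```python
import sys

calls = []    # captured calls as evaluable strings, e.g. "greet('world')"

def capture(frame, event, arg):
    # Capture mode: record each function call with its argument values.
    if event == "call":
        argnames = frame.f_code.co_varnames[:frame.f_code.co_argcount]
        args = [repr(frame.f_locals[name]) for name in argnames]
        calls.append("%s(%s)" % (frame.f_code.co_name, ", ".join(args)))
    return capture

def greet(name):
    return "Hello, " + name

sys.settrace(capture)
greet("world")
sys.settrace(None)

# Replay mode: re-trigger the recorded calls with eval().
replayed = [eval(call) for call in calls]   # -> ["Hello, world"]
```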
Find features of execution statistically correlating with failure. For example, it could be that function f is always called when the failure occurs and never called in a passing run, or that a particular line is always executed, or that a function gives a particular return value, etc.
Idea is to collect whatever execution features may be of interest and calculate correlation co-efficient with failing or passing runs. There are various ways of visualising the coverage/correlation results.
The Pearson correlation coefficient is suitable for linearly dependent variables.
For binary variables, use the phi coefficient, also developed by Pearson.
Line coverage example: a line is either covered or not covered; how does this correlate with PASS or FAIL? Compute phi ϕ for each line in the program or function.
2×2 contingency table: first index = covered (1) / not covered (0), second index = failed (1) / passed (0).
n is the total number of executions: n = n11 + n10 + n01 + n00.
ϕ = (n11·n00 − n10·n01) / sqrt(n1x·n0x·nx1·nx0)
where an x index sums over that position: n1x = n11 + n10 (covered), nx1 = n11 + n01 (failed), etc.
+1 = maximum positive correlation with failure, 0 = no correlation, −1 = maximum correlation with passing.
Can compute phi for function coverage and return value. Usually helpful to categorise return values, in example given as [-1,0,1] with suitable definitions for non-integer return values and exceptions. Correlate each category and its complement (remember this is a binary correlation!)
(-1 or not -1) with (pass/fail),
(0 or not 0) with (pass/fail),
(+1 or not +1) with (pass/fail).
I.e. 3 phi values for each function.
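The ϕ computation above can be sketched directly from the four counts (function name assumed):

```python
import math

def phi(n11, n10, n01, n00):
    """Phi coefficient from a 2x2 table.
    First index: line covered (1) / not covered (0).
    Second index: run failed (1) / passed (0)."""
    n1x = n11 + n10     # runs covering the line
    n0x = n01 + n00     # runs not covering it
    nx1 = n11 + n01     # failing runs
    nx0 = n10 + n00     # passing runs
    denom = math.sqrt(n1x * n0x * nx1 * nx0)
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Covered in every failing run, never in a passing run: perfect correlation.
print(phi(5, 0, 0, 5))   # -> 1.0
```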
Track and organise debugging efforts; this aids in preventing recurrence and in investigating where and why bugs appear in the software.
Bug reports vary in quality. Facilities can be programmed in to gather vital information automatically during a crash. Knowing how to reproduce the failure is the most useful item.
Store bugs. Ensure each bug is handled in finite time. Problem database can be huge such as bugzilla.
Fixed, WONTFIX etc.
Severity - impact on user. Blocker, Critical, ...
Notification - who is to be told about fixes etc?
Unconfirmed -> New -> Assigned -> Resolved* -> Verified -> Closed
Resolved can be Fixed, Duplicate, Invalid, WONTFIX, worksforme etc.
Closed -> reopened -> assigned is possible.
Resolved -> reopened -> assigned is also possible.
Duplicate bugs need management. Mark obsolete problems to maintain morale and reduce clutter.
Given that software changes are stored in a database (e.g. a git repo or some other version control), it is a good idea to link patches to bug ids in the bug database (“Checkin closes bug #567, lines #45,300-320”).
With this, can map out problems in the codebase and generate defect counts for modules. Can identify insecure packages.
20% of modules contain 80% of bugs.
Some observations and interesting results of where bugs come from.
Study your own project defect map as these generalities won’t capture all correlations with bug density.
Automate and simplify (test asserts, ddmin)
Find possible infection origins
Focus on most likely origin (statistical debugging)
Isolate infection chain - develop hypotheses
Correct defect, then verify fix.
Get initial requirements right. Spec differentiates failure from success. Use automation to find bugs. Develop and automate testing.
Reduce program complexity.
Set up assertions. Test early, test often.
Review your code, analyse problem history.
Thanks Andreas, I did. Great course :-)