Vivek Haldar

Bug finding and static analysis in the real world

If you want to get a good idea of how industry currently uses static analysis and bug-finding tools, you should read these two papers:

What both these papers have in common is that they document the experience of computer science academics applying their tools to large, real-world code bases. The first one talks about Bill Pugh’s sabbatical at Google, which he spent applying his FindBugs tool to the massive mountain of Java code there. The second describes Dawson Engler’s experience commercializing his group’s research on extensible static analysis for bug-finding.

If you are curious about the details, you should read both papers in full, but here are some of the things that stood out to me:

  • FindBugs did find a large number of bugs, but only a very few of them were serious or caused production problems. The authors attribute this to the numerous other “filters” that code has to pass through at Google: code review, unit tests, integration tests, canary deployments, etc. The major value of regularly using such a tool would be to filter out some of these bugs earlier, as opposed to the usual sales pitch for static analysis tools, which is that they help you avoid catastrophic bugs.
  • Both tools are unsound: they make a best effort to find genuine problems, but they can miss real bugs, and in practice they also report false positives. This is a major departure from most academic research, which looks down on unsoundness.
  • In the case of Coverity, I found it very amusing to read how and why their tool had to be essentially dumbed down, because complex bug reports that were genuine but not understood by developers were simply dismissed as false positives. The paper also has a number of juicy quotes from clueless programmers.
  • There are two broad kinds of properties of code that tools can try to infer. Liveness properties say that the code is good: it does what it’s supposed to, makes progress, etc. Safety properties speak about the absence of bad things: no null pointer dereferences, no out-of-bounds reads, etc. Both these tools, as well as every other one I know of, completely give up on liveness and focus on safety. This makes sense, because, in general, liveness properties are much harder to prove, and they also require programmers to provide a rigorous description of what the code is supposed to do.
  • These tools are context-insensitive: they examine a piece of code locally, without taking into consideration all the paths that could lead there. Context-sensitive analysis is too hard and explodes combinatorially for large codebases. (The sketch after this list illustrates both the kind of safety bug these tools flag and what context-sensitivity would buy you.)
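
To make the last two points concrete, here is a hypothetical Java fragment (my own illustration, not taken from either paper). A purely local, safety-oriented check can flag the possible null dereference in printOwner without looking at any callers; only a context-sensitive analysis, tracking every path into the method, could tell which call sites can actually trigger it.

    // Hypothetical example: the kind of local, safety-oriented check these tools run.
    class AccountPrinter {

        // A checker can flag this method on its own: if account is null,
        // the call below dereferences it and crashes.
        static void printOwner(Account account) {
            // Possible null dereference, reported without looking at callers.
            System.out.println(account.getOwner().toUpperCase());
        }

        static void report(Account account) {
            if (account != null) {
                // A context-sensitive analysis would see that this call site
                // is guarded and can never pass null...
                printOwner(account);
            }
        }

        static void reportAll(java.util.List<Account> accounts) {
            for (Account a : accounts) {
                // ...while this one can, if the list contains nulls. Tracking
                // every such path through a large codebase is what blows up
                // combinatorially.
                printOwner(a);
            }
        }
    }

    class Account {
        private final String owner;
        Account(String owner) { this.owner = owner; }
        String getOwner() { return owner; }
    }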

The cost of a bug

In this context, it is useful to ask: what is the cost of a bug? Sure, we would like to eliminate every bug if possible, but what are the tradeoffs?

Let’s take a serious class of bugs: those that crash your application. Not quite as serious as something that destroys data, but somewhere in the middle of the severity range, and something that happens frequently.

The question then becomes: what is the cost of a crash? This is where the economics become interesting.

For client applications, the cost of a crash is pretty high. A crash makes your users unhappy; they complain and leave, and justifiably so, because the application they were using just crashed on them! On top of that, you do not control the environment in which your application runs, and you have to bend over backwards to get the crash information that would help you debug.

Things are very different for server-side applications served over the web. They are typically request-oriented and served by lots of instances. If one of them crashes, and the application stack is smart enough about retries, there is no user impact at all. What’s more, you control the environment the crash occurred in, and have all the debugging and trace information you need. And you can push a fixed binary at your leisure without anybody even noticing.
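
As a minimal sketch of why that is (my own illustration, with placeholder Instance, Request, and Response types), the calling layer simply retries the request against another instance, so a single crash never surfaces to the user:

    import java.util.List;

    // Toy retry layer: the failure of one backend instance is absorbed here.
    class RetryingClient {
        private final List<Instance> instances;

        RetryingClient(List<Instance> instances) {
            this.instances = instances;
        }

        Response call(Request request) {
            RuntimeException lastFailure = null;
            // Try each backend in turn; if one instance fails, the next
            // attempt simply goes to a different one.
            for (Instance instance : instances) {
                try {
                    return instance.handle(request);
                } catch (RuntimeException e) {
                    lastFailure = e; // log it; the user never sees it
                }
            }
            // Only if every instance fails does the error surface.
            if (lastFailure == null) {
                throw new IllegalStateException("no instances configured");
            }
            throw lastFailure;
        }
    }

    interface Instance { Response handle(Request request); }
    class Request {}
    class Response {}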

This means that the cost of a crash is much, much higher for client software than for server software.

The future of bug-finding

The major roadblock for no-holds-barred, context-sensitive analysis of code is the exponential blowup in the space being analyzed. But this is only a hurdle if you are holding on to the single-machine model of building and analyzing code, i.e., the assumption that one machine (albeit a monster) builds your code, tests it, and analyzes it.

But what if you could throw a cluster of a few hundred machines at the problem, and have them running continuously? How much deeper and fine-grained could your analysis be?
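
As a toy sketch of that idea (my own, with hypothetical CompilationUnit and Finding placeholders, and a local thread pool standing in for the cluster): shard the codebase into units and let each worker spend as much time on its piece as the analysis needs.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class DistributedAnalysis {
        static List<Finding> run(List<CompilationUnit> units, int workers)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            try {
                // In a real cluster each task would be shipped to a separate
                // machine; here the thread pool models the fan-out.
                List<Future<List<Finding>>> futures = new ArrayList<>();
                for (CompilationUnit unit : units) {
                    futures.add(pool.submit(() -> analyze(unit)));
                }
                List<Finding> all = new ArrayList<>();
                for (Future<List<Finding>> f : futures) {
                    all.addAll(f.get());
                }
                return all;
            } finally {
                pool.shutdown();
            }
        }

        // Placeholder: run whatever deeper, more path-sensitive checks you can
        // afford when each unit gets its own machine and time budget.
        static List<Finding> analyze(CompilationUnit unit) {
            return new ArrayList<>();
        }
    }

    class CompilationUnit {}
    class Finding {}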