Some further details on how Darum works¶
(There’s also a less technical writeup in the form of a blog post.)
Terminology¶
Dafny Assertions (DA): Dafny works with both explicit assertions (those written in the code) and implicit ones (generated internally from the code). For clarity, we’ll refer to both as DAs.
Assertion Batch (AB): The way in which DAs are grouped internally by Dafny for verification. ABs are the minimal unit of observable cost in Dafny’s results.
Brittleness: A verification result is said to be brittle when its cost isn’t stable: an unchanging Dafny file may verify correctly every time, yet with wildly different Resource Usage. [1]
Isolate Assertions mode (IA): In standard verification mode, Dafny groups all the DAs generated by a member into a single Assertion Batch. A radical alternative is the --isolate-assertions argument, which causes Dafny to place each DA into its own Assertion Batch. This is the finest verification granularity that Dafny offers.
Member: Dafny’s methods, functions or fields.
Resource Unit (RU): The cost of a verification run, as reported by Z3. RU costing is more deterministic than timing-based costing.
Out of Resources (OoR): The result when Dafny/Boogie/Z3 are given a resource limit and the verification takes more RUs than this limit. Equivalent to, but more precise than, a timeout.
How does Darum work?¶
dafny measure-complexity allows the user to run multiple verification iterations with changing random seeds. One can then analyze the RU costs for each run and for each member.
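For example, such a run might look like the following sketch (we assume the --iterations and --log-format options available in recent Dafny versions; check your Dafny release for the exact flags):

```shell
# Verify MyFile.dfy 10 times, each with a different random seed,
# logging per-batch Resource Usage to CSV for later analysis.
dafny measure-complexity --iterations 10 --log-format csv MyFile.dfy
```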
Dafny’s official docs and tools use statistical measures like stddev and RMS% to derive verification brittleness from the RU costs. However, we argue that it’s more useful to think in terms of simple min/max values. For example, consider running 10 iterations of the verification process, in which 9 of the results are closely clustered but a single result deviates far away, being either much cheaper or much costlier than the rest. Taking the stddev or RMS of these 10 cases would smooth out the extremes, while we argue that they are precious hints that should be highlighted instead. Indeed, on any given verification run, it is these rare but extreme values that make things unexpectedly slow or fast, and so they point towards problems or fixes.
Furthermore, DA variability seems to compound worse than linearly when DAs are batched together. This all suggests that, for reliability, it’s necessary to minimize the span of Resource Usage costs.
It’s worth noting that, while we focus on RU variability to combat brittleness, these tools are also useful for plain optimization, since they help account for raw RU usage and rank where verification time is being spent.
In a nutshell, the key insights that Darum exploits are:
Dafny’s RU cost to verify code is a sample from a probability distribution whose shape evolves with the code.
As the Dafny code grows, the verification cost distribution tends to become wide and multimodal, as opposed to a cost that varies across a smooth range.
The spans of the cost distributions of individual DAs compound worse than linearly into much wider distributions for their containing batches (see the worked example below).
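As a toy illustration of this compounding (a simplified model of our own, assuming independent DAs, which real batches don’t guarantee): if each of $n$ DAs in a batch takes a slow solver path with probability $p$, the batch avoids every slow path only with probability $(1-p)^n$, so

$$P(\text{batch hits at least one slow path}) = 1 - (1-p)^n$$

With $p = 0.05$ and $n = 20$, that’s $1 - 0.95^{20} \approx 0.64$: a batch of individually well-behaved DAs misbehaves on most runs.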
An evolving probability distribution?¶
Consider the evolution of a piece of Dafny code and its verification cost:
When the code is simple, the distribution is close to a spike: every verification run has a predictable cost.
As the code grows and becomes more complicated, the cost distribution starts to widen, and so the time/RU cost of verification starts to vary. Perversely, this is hard to notice at first, because the worst case is probably still fast enough that the user doesn’t stop to think about it: growing verification time is, in principle, to be expected from growing code. Worse, even when individual members start to misbehave in a bigger codebase, the total cost distribution (both per file and per project) naturally tends to smooth out and statistically compensate for the individual variation.
At some point, some AB’s verification becomes complex enough that its distribution turns multimodal: sometimes verification runs fast, sometimes much more slowly. The randomness of this AB will keep affecting the total distribution even while the user works on other ABs. At this point the user starts wondering why errors or timeouts appear and disappear.
As work progresses, other ABs’ distributions will keep widening and also turn multimodal, each of them affecting the total distribution with new modes.
The result is that one starts editing member X of a Dafny file and suddenly verification fails or times out in a surprising, incomprehensible way, because of the varying cost of member Y, which is untouched. Making some small secondary change seems to restore the good behavior of X, but this is an illusion caused by the multimodality of the distributions. In fact, undoing the secondary change now seems to let the code verify after all! One shrugs the problem off and pushes forward… but two lines later the failure reappears.
Unfortunately, the situation is amusingly similar to that of Skinner’s superstitious pigeon experiments: random actions seem to trigger unexplained positive outcomes, causing the subject to develop superstition-like behaviors in the hope of triggering further positive outcomes.
In our experience, this situation turned into enough of a time and morale sink that we started looking for alternatives to Dafny. Indeed, this kind of problem motivated Z3’s original developer Leo de Moura to create the interactive theorem prover Lean as an alternative.
Why would a cheap verification turn expensive?¶
Dafny code needs to provide enough information to Z3 for it to find proofs. As the search space grows, Z3 will randomly [3] take different paths (with different costs) to find a proof. If there’s unneeded, “distracting” information, Z3 might even follow it into unproductive rabbitholes. This unneeded information comes from the context that previous DAs in the containing member build up. Hence, reducing the unneeded information introduced into an AB by previous DAs is key to stopping Z3 from getting lost.
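As a sketch of one way to trim such context (a toy example of our own, using Dafny’s {:opaque} attribute and reveal statement):

```dafny
// Opaqueness keeps a definition's body out of the solver's context,
// reducing "distracting" information; `reveal` reintroduces it only
// where it's actually needed.
function {:opaque} Big(x: int): int {
  x * x + 3 * x + 2
}

method Uses()
{
  reveal Big();        // without this, Z3 only sees Big's signature
  assert Big(0) == 2;  // provable once the body is revealed
}
```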
Meanwhile, Dafny offers an “isolated assertions” (IA) verification mode, which breaks down a member’s default single AB into many ABs, each containing a single, isolated DA. E.g., for a member M containing DA1 and DA2:
Default verification would generate a single AB including both DA1 and DA2.
In IA mode, 2 ABs will be generated: AB1 would only contain assert DA1, while AB2 would be equivalent to assume DA1; assert DA2.
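As a concrete sketch (a toy member of our own; the two asserts stand in for DA1 and DA2):

```dafny
method M(x: int)
  requires x > 0
{
  var y := x + 1;
  assert y > 1;   // DA1
  assert y >= 2;  // DA2
}
// Default mode: one AB that proves both asserts together.
// IA mode: AB1 proves `assert y > 1`;
//          AB2 is equivalent to `assume y > 1; assert y >= 2`.
```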
Hence, IA mode offers an easy way to partially isolate costs. Intriguingly, a typical result is that default verification of a member is usually much cheaper (~10x) than the total cost of its isolated DAs, but also much less stable. This suggests two opposite strategies:
Stabilization by pessimization: break down every member into smaller members [2][4].
Conversely, grouping for mutual support: sometimes isolated DAs require more RU than the whole containing member, or even fail to verify. This likely means that previous DAs in the member built up context that helped, or was even necessary for, the current DA to verify. Importantly, the context built up in the AB possibly includes facts that the solver “discovered” while proving previous DAs; that discovered context is missing when those DAs are assumed instead. This is a case where DAs grouped into an AB help each other.
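A sketch of such mutual support (a toy example of our own): DA1 needs a nonlinear argument, while DA2 follows from DA1 by purely linear reasoning, so DA2 stays cheap as long as DA1 precedes it in the batch, whether asserted or assumed.

```dafny
method Supports(x: int)
  requires x >= 2
{
  assert x * x >= 2 * x;  // DA1: needs a nonlinear step
  assert x * x >= x + 2;  // DA2: linear once DA1 is in the context,
                          // since 2 * x == x + x >= x + 2 when x >= 2
}
```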
Remember that, as code evolves, DAs change, creating different possible paths for the solver. An AB whose DAs support each other without unduly widening the search horizon will thus offer a robust path, resilient to small changes. In contrast, a member whose DAs only tangentially build on each other while widening the horizon unnecessarily will be vulnerable, or even prone, to brittleness.
While the first impulse might be to limit the length of members, and this might help somewhat indirectly, note that the key consideration is whether the DAs in the AB build on each other robustly.
So how does Darum help?¶
Darum can be used to:
Analyze verification cost and brittleness at any AB granularity, e.g.:
Default verification mode: discover the costs and their stability for default, member-level ABs.
IA mode: discover what DAs are most likely to contribute to brittleness, and what “stabilization by pessimization” would look like.
Anything in between: various Dafny features allow limiting the size of batches (split_here, isolate, etc.; see the sketch after this list).
Compare verification cost and brittleness across AB granularities.
Knowing how the distribution of verification costs varies across different AB granularities can point to stabilization strategies.
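For instance, the {:split_here} attribute lets one cut a member’s single AB at a chosen assertion (a toy example of our own):

```dafny
method N(x: int)
  requires x >= 0
{
  assert x + 1 >= 1;                // verified in the first batch
  assert {:split_here} x + 2 >= 2;  // starts a new assertion batch here
}
```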
When is a potential speedup a sign of trouble?¶
Empirically, we have observed:
In IA mode, individual AB speedups seem to either stay lower than 1.02 or grow abruptly.
In default mode, this seems to happen with individual speedups over 1.1.
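(The precise definition of speedup isn’t restated here; in what follows we assume it denotes the ratio between the slowest and fastest observed cost of the same AB across iterations,

$$\mathrm{speedup} = \frac{\max_i \mathrm{RU}_i}{\min_i \mathrm{RU}_i}$$

so a value of 1.0 means a perfectly stable cost.)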
Further research pending, this probably means that when a cost’s unimodal random distribution widens enough to produce such speedup values, it turns into a multimodal distribution, AKA brittleness.
Hence, speedup values above these thresholds should probably be considered a brittleness warning.