Some proponents of a competing, proprietary version control system have suggested, in a usenix paper, that the use of a cryptographic hash function such as sha1 as an identifier for a version is unacceptably unsafe. This section addresses the argument presented in that paper and describes monotone's additional precautions.
To summarize our position:
The paper displays a fundamental lack of understanding about what a cryptographic hash function is, and how it differs from a normal hash function. Furthermore it confuses accidental collision with attack scenarios, and mixes up its analysis of the risk involved in each. We will try to untangle these issues here.
A cryptographic hash function such as sha1 is more than just a uniform spread of inputs to an output range. Rather, it must be designed to withstand attempts at:
Collision is the problem the paper is concerned with. Formally, an n-bit cryptographic hash should cost 2^n work units to collide against a given value, and sqrt(2^n) tries to find a random pair of colliding values. This latter probability is sometimes called the hash's “birthday paradox probability”.
One way of measuring these bounds is by measuring how single-bit changes in the input affect bits in the hash output. The sha1 hash has a strong avalanche property, which means that flipping any one bit in the input will cause on average half the 160 bits in the output code to change. The fanciful val1 hash presented in the paper does not have such a property — flipping its first bit when all the rest are zero causes no change to any of the 160 output bits — and is completely unsuited for use as a cryptographic hash, regardless of the general shape of its probability distribution.
The paper also suggests that birthday paradox probability cannot be used to measure the chance of accidental sha1 collision on “real inputs”, because birthday paradox probability assumes a uniformly random sample and “real inputs” are not uniformly random. The paper is wrong: the inputs to sha1 are not what is being measured (and in any case can be arbitrarily long); the collision probability being measured is of output space. On output space, the hash is designed to produce uniformly random spread, even given nearly identical inputs. In other words, it is a primary design criterion of such a hash that a birthday paradox probability is a valid approximation of its collision probability.
The paper's characterization of risk when hashing “non-random inputs” is similarly deceptive. It presents a fanciful case of a program which is storing every possible 2kb block in a file system addressed by sha1 (the program is trying to find a sha1 collision). While this scenario will very likely encounter a collision somewhere in the course of storing all such blocks, the paper neglects to mention that we only expect it to collide after storing about 2^80 of the 2^16384 possible such blocks (not to mention the requirements for compute time to search, or disk space to store 2^80 2kb blocks).
Noting that monotone can only store 2^41 bytes in a database, and thus probably some lower number (say 2^32 or so) active rows, we consider such birthday paradox probability well out of practical sight. Perhaps it will be a serious concern when multi-yottabyte hard disks are common.
Setting aside accidental collisions, then, the paper's underlying theme of vulnerability rests on the assertion that someone will break sha1. Breaking a cryptographic hash usually means finding a way to collide it trivially. While we note that sha1 has in fact resisted attempts at breaking for 8 years already, we cannot say that it will last forever. Someone might break it. We can say, however, that finding a way to trivially collide it only changes the resistance to active attack, rather than the behavior of the hash on benign inputs.
Therefore the vulnerability is not that the hash might suddenly cease to address benign blocks well, but merely that additional security precautions might become a requirement to ensure that blocks are benign, rather than malicious. The paper fails to make this distinction, suggesting that a hash becomes “unusable” when it is broken. This is plainly not true, as a number of systems continue to get useful low collision hashing behavior — just not good security behavior — out of “broken” cryptographic hashes such as MD4.
Perhaps our arguments above are unconvincing, or perhaps you are the sort of person who thinks that practice never lines up with theory. Fair enough. Below we present practical procedures you can follow to compensate for the supposed threats presented in the paper.
A successful collision attack on sha1, as mentioned, does not disrupt the probability features of sha1 on benign blocks. So if, at any time, you believe sha1 is “broken”, it does not mean that you cannot use it for your work with monotone. It means, rather, that you cannot base your trust on sha1 values anymore. You must trust who you communicate with.
The way around this is reasonably simple: if you do not trust sha1 to prevent malicious blocks from slipping into your communications, you can always augment it by enclosing your communications in more security, such as tunnels or additional signatures on your email posts. If you choose to do this, you will still have the benefit of self-identifying blocks, you will simply cease to trust such blocks unless they come with additional authentication information.
If in the future sha1 (or, indeed, rsa) becomes accepted as broken we will naturally upgrade monotone to a newer hash or public key scheme, and provide migration commands to recalculate existing databases based on the new algorithm.
Similarly, if you do not trust our vigilance in keeping up to date
with cryptography literature, you can modify monotone to use any
stronger hash you like, at the cost of isolating your own
communications to a group using the modified version. Monotone is free
software, and runs atop
botan, so it is both legal and
relatively simple to change it to use some other algorithm.