probability

Hash collisions

At least in the game industry, hash functions are often used to generate unique identifiers for assets such as textures, models. This is a little dangerous in theory, as collisions are always possible if unlikely, so the correct approach is to use the hashes for fast lookup and then do a full equality comparison on the returned values. But what's the actual probability of a collision?

It's more likely than you might think, as this is an example of the birthday paradox. The approximate formula for the probability of a collision is:

p = 1 - e^[-n^2/(2*N)]

where N is the size of the hash space, and n is the number of values used.

With 10,000 values in a 32-bit space, plugging the numbers into the formula gives a collision probability of 1.1%. So fairly unlikely, but it wouldn't be too surprising.

With 100,000 values in a 32-bit space, the probability jumps to 68%. Much higher than you might intuitively expect, given the space of 4 billion possible values.

Changing to a 64-bit hash and 100,000 values, probability of a collision is 2.7e-8%, much more comfortable.

Even with 10,000,000 values in the 64-bit space the probability is still only 0.00027%.

With 1,000,000,000 values in the 64-bit space, the probability is 2.6%... starting to become a more serious concern, but if you have a billion assets then you have bigger problems than hash collisions.

With a 128-bit hash (e.g. md5) and 1,000,000,000 values, the probability is 1.4e-19%... cosmic rays flipping bits in your memory chips should be a bigger concern at this point.

Syndicate content