In-band deduplication for Btrfs

Posted Mar 10, 2016 23:23 UTC (Thu) by nybble41 (subscriber, #55106)
In reply to: In-band deduplication for Btrfs by martin.langhoff
Parent article: In-band deduplication for Btrfs

Even if you generated a new 256-bit hash every picosecond (10^12 hashes per second), it would be more than 10^19 years before the probability of a collision reached 50%, taking into account the Birthday Paradox. That is over 700 million times the current age of the universe, with a 50% probability of *one* collision. The probability of finding any collisions is still only about 10^-9 after 500 trillion (5*10^14) years.
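For anyone who wants to check the arithmetic, here is a minimal sketch using the standard birthday-bound approximation p ≈ 1 - exp(-n^2 / 2^(b+1)) for n uniformly random b-bit hashes. The function names and the 365.25-day year are my own assumptions for illustration, not anything taken from the Btrfs patches.

```python
import math

HASH_BITS = 256
RATE = 10**12                          # hashes per second (one per picosecond)
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed length of a year


def collision_probability(n, bits=HASH_BITS):
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / 2^(bits+1))."""
    exponent = -(n * n) / 2.0 ** (bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(exponent)


def hashes_for_probability(p, bits=HASH_BITS):
    """Invert the approximation: n = sqrt(2^(bits+1) * ln(1 / (1 - p)))."""
    return math.sqrt(2.0 ** (bits + 1) * -math.log1p(-p))


def years_until(p, rate=RATE, bits=HASH_BITS):
    """Years of hashing at `rate` before collision probability reaches p."""
    return hashes_for_probability(p, bits) / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    print("years to 50%%: %.3g" % years_until(0.5))
    n = 5e14 * SECONDS_PER_YEAR * RATE
    print("p after 5*10^14 years: %.3g" % collision_probability(n))
```

Under these assumptions the 50% point lands a little above 10^19 years, and the probability after 5*10^14 years is on the order of 10^-9.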

Even filesystems don't "roll the dice" often enough to make 256-bit hash collisions a serious consideration.


In-band deduplication for Btrfs

Posted Mar 11, 2016 14:05 UTC (Fri) by bgoglin (subscriber, #7800) [Link] (6 responses)

Unlikely doesn't mean it won't ever happen.

In-band deduplication for Btrfs

Posted Mar 11, 2016 15:24 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (5 responses)

For all practical purposes, in this case it does mean exactly that. It is far, far more likely that the filesystem will suffer catastrophic failure from some other cause (for example, a freak surge in cosmic gamma radiation sufficient to wipe out human civilization) than that you will ever see a SHA-256 hash collision occur by random chance in the context of a single filesystem. It is so far down the list of things to worry about that developers would be more productively employed working on almost anything else compared to implementing bit-for-bit block comparison to supplement the SHA-256 match.

In-band deduplication for Btrfs

Posted Mar 11, 2016 15:30 UTC (Fri) by micka (subscriber, #38720) [Link] (3 responses)

So, nobody does cryptanalysis on SHA-256 and tries to create attacks on it?
OK, that's not by random chance anymore, but...

In-band deduplication for Btrfs

Posted Mar 11, 2016 16:25 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (2 responses)

Sure, cryptanalysis uncovering weaknesses in the SHA-256 algorithm is a possibility. I was replying only to the "it's not mathematically impossible, ergo it must be treated as a realistic possibility" argument. However, if attackers can arrange for a SHA-256 hash collision on demand then I think we'll have bigger problems to worry about than some corrupted filesystem data. There are also various well-known methods to thwart attacks dependent on predictable hash results, like seeding the hash function with a hidden per-filesystem salt, and in the event that such an attack is discovered the workaround is simple: just disable online deduplication until the hash function can be updated.
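As a rough illustration of the salting idea, here is a sketch that keys the block hash with a secret generated once per filesystem. HMAC-SHA-256 is just one way to do this and is my assumption, as are the names `new_filesystem_salt` and `block_key`; nothing here describes what the Btrfs patches actually do.

```python
import hashlib
import hmac
import os


def new_filesystem_salt():
    """Generate a hidden per-filesystem salt (hypothetical: done once at mkfs time)."""
    return os.urandom(32)


def block_key(salt, block):
    """Deduplication key for a block: HMAC-SHA-256 keyed with the secret salt."""
    return hmac.new(salt, block, hashlib.sha256).digest()


if __name__ == "__main__":
    salt_a = new_filesystem_salt()
    salt_b = new_filesystem_salt()
    block = b"\x00" * 4096
    # Same filesystem, same data: identical keys, so dedup still works.
    print(block_key(salt_a, block) == block_key(salt_a, block))
    # Different filesystem: the key for the same data is unrelated.
    print(block_key(salt_a, block) == block_key(salt_b, block))
```

An attacker who does not know the salt cannot compute in advance which key a given block will get on a particular filesystem, so a collision prepared offline against plain SHA-256 does not carry over directly.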

In-band deduplication for Btrfs

Posted Mar 15, 2016 16:10 UTC (Tue) by intgr (subscriber, #39733) [Link] (1 responses)

> However, if attackers can arrange for a SHA-256 hash collision on demand then I think we'll have bigger problems to worry about

I disagree, that actually is a very scary failure mode for a file system. If an attacker is allowed to influence what gets stored in a file system, then a preimage attack or possibly a clever application of a collision attack would allow poisoning the file system.

For instance, an attacker knows that some user wants to store document A on the system. The attacker can prepare a colliding document B and upload it before the user gets the chance to upload A. When document A is written later, the file system will throw away A and keep the tampered document B instead.

Consider that document A can be, for example, a system package update that the system administrator installs. Lulz ensues.
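The scenario can be sketched with a toy store. To make a collision cheap to find, this uses a deliberately crippled hash (SHA-256 truncated to 8 bits); that truncation, the `DedupStore` class, and `find_collision` are all hypothetical stand-ins for illustration, not a real attack on SHA-256 and not the Btrfs implementation.

```python
import hashlib


def weak_hash(data):
    """SHA-256 truncated to 8 bits, so that collisions are trivial to find."""
    return hashlib.sha256(data).digest()[:1]


class DedupStore:
    """Toy block store: hash-only dedup, or hash plus byte-for-byte verify."""

    def __init__(self, hash_fn, verify=False):
        self.hash_fn = hash_fn
        self.verify = verify
        self.blocks = {}  # hash -> list of distinct blocks stored under it

    def write(self, data):
        """Store data (or dedup against an existing block); return a reference."""
        key = self.hash_fn(data)
        bucket = self.blocks.setdefault(key, [])
        for i, existing in enumerate(bucket):
            # Without verify, any hash match is treated as "same data".
            if not self.verify or existing == data:
                return (key, i)
        bucket.append(data)
        return (key, len(bucket) - 1)

    def read(self, ref):
        key, i = ref
        return self.blocks[key][i]


def find_collision(target, hash_fn):
    """Brute-force a different input with the same (weak) hash as target."""
    want = hash_fn(target)
    i = 0
    while True:
        candidate = b"tampered-%d" % i
        if candidate != target and hash_fn(candidate) == want:
            return candidate
        i += 1


def demo(verify):
    """Attacker uploads B first, then the user writes A; return what A reads back as."""
    doc_a = b"genuine package update"
    doc_b = find_collision(doc_a, weak_hash)
    store = DedupStore(weak_hash, verify=verify)
    store.write(doc_b)
    return store.read(store.write(doc_a))


if __name__ == "__main__":
    print(demo(verify=False))  # hash-only dedup hands back the attacker's document
    print(demo(verify=True))   # byte-for-byte comparison keeps the genuine one
```

With hash-only matching the user's write of A is silently replaced by the attacker's B; with the byte-for-byte check both blocks are kept and A reads back intact.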

In-band deduplication for Btrfs

Posted Mar 15, 2016 17:26 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> > However, if attackers can arrange for a SHA-256 hash collision on demand then I think we'll have bigger problems to worry about
> I disagree, that actually is a very scary failure mode for a file system.

I wasn't trying to downplay the problems it would create for a filesystem, and I agree with everything else you said. However, SHA-256 hashes are used for more than just identifying blocks within filesystems for deduplication. The ability to create SHA-256 hash collisions would undermine the entire digital signature system, for example. Implementing a workaround is also easier in the context of deduplication—in the short term you can just turn it off while you reindex the drive with a different hash function. Unless the content of your filesystem is *really* important, the odds of anyone wasting a SHA-256 collision 0-day attack on it are vanishingly small, and even a major issue with the algorithm which cut the effective bit length in half would not represent an immediate practical threat.

In-band deduplication for Btrfs

Posted Mar 24, 2016 16:18 UTC (Thu) by nye (subscriber, #51576) [Link]

>It is far, far more likely that the filesystem will suffer catastrophic failure from some other cause (for example, a freak surge in cosmic gamma radiation sufficient to wipe out human civilization) than that you will ever see a SHA-256 hash collision occur by random chance in the context of a single filesystem

Somewhat more to the point, if you have a function which checks for duplicate data - whether by comparing the hashes, or comparing the data itself - the chance of a hash collision is less than the chance that random memory errors will happen to cause that function to return the wrong result. That is to say, the reliability of the byte-for-byte function is no greater than the compare-the-hashes function, at least when it comes to order of magnitude.

In practical terms, if you have a pipeline which relies on every step operating correctly, there's not much point in paying an ongoing cost to improve the reliability of one given component if there are others that are orders of magnitude more likely to fail. Sure, technically it makes a difference, but only in the sense that you can make the ocean bigger by tipping a bucket of water into it.
