zevv / duc

Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage
GNU Lesser General Public License v3.0

[Feature Request] Handle btrfs transparent compression #225

Open traverseda opened 5 years ago

traverseda commented 5 years ago

btrfs reports file size, not size on disk. The tool compsize can tell you the real size on disk, how much is deduplicated (which is different from hardlinks because of copy-on-write?), and how much is compressed.

It would be nice if duc supported these btrfs-specific features.

zevv commented 5 years ago

Quoting Alex Davies (2019-08-17 22:28:22)

btrfs reports file size, not size on disk. The tool compsize can tell you the real size on disk, how much is deduplicated (which is different from hardlinks because of copy-on-write?), and how much is compressed.

It would be nice if duc supported these btrfs-specific features.

I see no technical problems with this, although I guess it would make sense not to make the feature btrfs-specific, but to have it map onto any kind of compressing file system instead. At scan time Duc should be able to figure out the proper way to acquire the numbers for the specific fs type.

The only downside is that each file entry in the database would need an additional field to store the new size. This is probably the right time to add an optimization I have wanted to implement for a long time: if Duc stores the real file size as a (var)int, we could store the block size and compressed size as numbers relative to the real size. That should shrink the DB a lot, since the relative sizes are usually much smaller numbers and therefore take fewer bytes in the varint encoding.
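To make the relative-size idea concrete, here is a minimal sketch in Python. It is not Duc's actual database format: the LEB128-style varint, the zigzag mapping for negative deltas, and the function names `encode_entry_absolute` and `encode_entry_relative` are all assumptions made for illustration. Zigzag is used because the on-disk size can be either larger than the real size (block rounding) or smaller (compression, sparse files).

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128-style varint: 7 payload bits per byte, high bit = more."""
    if n < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, high bit clear
            return bytes(out)


def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def encode_entry_absolute(real: int, allocated: int, compressed: int) -> bytes:
    """Hypothetical layout: all three sizes stored as full varints."""
    return encode_varint(real) + encode_varint(allocated) + encode_varint(compressed)


def encode_entry_relative(real: int, allocated: int, compressed: int) -> bytes:
    """Hypothetical layout: real size in full, the other two as signed deltas."""
    return (
        encode_varint(real)
        + encode_varint(zigzag(allocated - real))
        + encode_varint(zigzag(compressed - real))
    )


if __name__ == "__main__":
    # A 5 GiB file that is neither compressed nor padded: both deltas are zero.
    size = 5 * 2**30
    print(len(encode_entry_absolute(size, size, size)))
    print(len(encode_entry_relative(size, size, size)))
```

The saving is largest when the three sizes are close together, which is the common case for uncompressed files. For a file that compresses very well the delta is nearly as large as the real size, so the relative form gains little there.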


zevv commented 5 years ago

I got some advice on the #btrfs irc channel today, and while this should be technically feasible, it is far from trivial. There is a single ioctl that is used to get this info (BTRFS_IOC_TREE_SEARCH_V2), but the resulting data needs to be properly cooked to get the required numbers. This will also require a lot of bookkeeping, similar to hard-link accounting, since btrfs might share the same extents between multiple files.
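As a rough illustration of that bookkeeping, the sketch below counts each physical extent only once, the same way hard links are usually charged to the first path that reaches the inode. It assumes the extent list for each file has already been extracted from the tree search results as `(disk_bytenr, disk_num_bytes)` pairs (the field names follow btrfs file extent items); the function name `account_extents` and the input shape are hypothetical, and inline extents are ignored here.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (disk_bytenr, disk_num_bytes): physical start and on-disk (compressed) length.
Extent = Tuple[int, int]


def account_extents(files: Iterable[Tuple[str, List[Extent]]]) -> Dict[str, int]:
    """Charge every physical extent to the first file that references it.

    Later files that share the extent (reflinks, dedup, snapshots) are
    charged nothing for it, similar to how hard links are accounted.
    """
    seen: Set[int] = set()
    usage: Dict[str, int] = {}
    for path, extents in files:
        total = 0
        for disk_bytenr, disk_num_bytes in extents:
            if disk_bytenr == 0:
                continue  # assumption: a zero address marks a hole, nothing on disk
            if disk_bytenr in seen:
                continue  # already charged to an earlier file
            seen.add(disk_bytenr)
            total += disk_num_bytes
        usage[path] = total
    return usage


if __name__ == "__main__":
    files = [
        ("a.img", [(4096, 8192), (65536, 4096)]),
        ("a-copy.img", [(4096, 8192), (131072, 4096)]),  # shares its first extent
    ]
    for path, size in account_extents(files).items():
        print(path, size)
```

Note that the result depends on scan order: whichever file is visited first gets charged for the shared extent, so the per-directory totals can shift between scans even though the grand total stays the same.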

I'll leave this issue open; I might one day feel very bored and brave and pick this up.

traverseda commented 5 years ago

That's more or less what I was expecting. Thanks for taking the time to look into it.