With the proliferation of deduplication in storage, it would be desirable to have a compression algorithm that is friendly to deduplication.
We’re using the excellent tarsnap, which provides both deduplication and compression (as well as strong encryption).
When using both, it is important to deduplicate first and then compress. If you compress first, changing one byte early in the source file results in a significantly different compressed file, so two compressed files with similar source data may not deduplicate effectively.
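The effect is easy to reproduce. Here is a minimal sketch (standard library only) that builds two fake "monthly snapshots" of a SQL dump, splits each into chunks with a toy content-defined chunker (a crude stand-in for the chunking a deduplicating store like tarsnap performs; the real algorithm differs), and measures how many chunks the two versions share, both uncompressed and after zlib compression:

```python
import random
import zlib

def chunk_set(data, mask=0x3F):
    """Split data at content-defined boundaries chosen by a rolling hash
    of the last 16 bytes, returning the set of chunks. This is a toy
    stand-in for the chunking a deduplicating store performs."""
    chunks, start, h = set(), 0, 0
    for i, byte in enumerate(data):
        h = (h * 2 + byte) & 0xFFFF  # only the last 16 bytes influence h
        if h & mask == mask:         # roughly 1-in-64 bytes ends a chunk
            chunks.add(data[start:i + 1])
            start = i + 1
    chunks.add(data[start:])
    return chunks

def shared_fraction(a, b):
    """Fraction of a's chunks that also occur in b (the dedup-able part)."""
    ca, cb = chunk_set(a), chunk_set(b)
    return len(ca & cb) / len(ca)

# Two fake snapshots: the second edits one row near the start and
# appends some new rows, much like a database a month later.
random.seed(0)
rows = [f"INSERT INTO t VALUES ({i}, '{random.randrange(10**9)}');\n"
        for i in range(5000)]
original = "".join(rows).encode()
rows[10] = "INSERT INTO t VALUES (10, 'edited');\n"
rows += [f"INSERT INTO t VALUES ({i}, 'new');\n" for i in range(5000, 5100)]
modified = "".join(rows).encode()

raw = shared_fraction(original, modified)
gz = shared_fraction(zlib.compress(original), zlib.compress(modified))
print(f"dedup-able chunks, uncompressed: {raw:.0%}")
print(f"dedup-able chunks, compressed:   {gz:.0%}")
```

Uncompressed, almost every chunk of the original reappears in the modified snapshot; once each file is compressed, the streams diverge shortly after the first edit and the shared fraction collapses.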
However, due to a variety of business reasons, I would like to efficiently tarsnap historic database snapshots. These snapshots are stored as gzip’d SQL files. We do not have the space to store them uncompressed, and we are not willing to use tarsnap for storage (only for backup). Therefore, we have similar files that are compressed, then deduplicated, then compressed again. The result is very inefficient.
One alternative is to use a compressed filesystem. However, in the general case, it would be nice if users and administrators could compress files ad hoc while still enabling efficient storage.
To put some real world numbers behind this, I’ve taken 5 production database snapshots, each about a month apart. I’ll use tarsnap to back up these 5 database files first gzipped, and then once again as uncompressed SQL files. Each tarsnap run will be from scratch (so there’s no deduplication against existing data).
Here is my test data in compressed form. Note that ls -h and du -h use powers of 2, while tarsnap --humanize-numbers uses powers of 10.
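The units distinction matters when comparing the listings; a quick check of how the same hypothetical byte count renders under both conventions:

```python
size = 4_388_823_040  # an example byte count, not one of the actual snapshots

# Powers of 2, as reported by ls -h and du -h:
print(f"{size / 2**30:.1f} GiB")
# Powers of 10, as reported by tarsnap --humanize-numbers:
print(f"{size / 10**9:.1f} GB")
```

The same file shows up as 4.1 GiB in one listing and 4.4 GB in the other, so small discrepancies between the tables below are expected.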
And here is my test data uncompressed.
These are the tarsnap commands (anonymized) that I will run against each directory.
Here are the results:
After deduplication, the compressed files are reduced only marginally, while the uncompressed source files take heavy advantage of deduplication. If we reran the test with more files, the difference would be even more dramatic.
Given that it is a requirement for users and administrators to be able to compress files, what can be done about this inefficiency? Are there any compression algorithms that address this dilemma?