20 October 2010
With the proliferation of deduplication in storage, it would be desirable to have a compression algorithm that is friendly to deduplication.
We’re using the excellent tarsnap, which provides both deduplication and compression (as well as strong encryption).
When using both, it is important to deduplicate first and then compress. If you compress first, changing one byte early in the source file results in a significantly different compressed file. Therefore, two compressed files of similar source data may not dedupe effectively.
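You can see the effect with a quick sketch (the file names here are made up, and any reasonably compressible data will do):

# Create two compressible files that differ by a single byte near the start.
seq 1 1000000 > a.txt
cp a.txt b.txt
printf 'X' | dd of=b.txt bs=1 seek=10 conv=notrunc

# Compress both copies and count how many byte positions differ in each pair.
gzip -c a.txt > a.txt.gz
gzip -c b.txt > b.txt.gz
cmp -l a.txt b.txt | wc -l         # uncompressed: exactly 1 differing byte
cmp -l a.txt.gz b.txt.gz | wc -l   # compressed: the streams diverge almost entirely

A block-based deduplicator sees the two uncompressed files as nearly identical, but sees the two gzip files as almost entirely different data.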
However, for a variety of business reasons, I would like to efficiently tarsnap historic database snapshots. These snapshots are stored as gzip’d SQL files. We do not have the space to store them uncompressed, and we are not willing to use tarsnap for storage (only for backup). Therefore, we have similar files that are compressed, then deduped, then compressed again. The result is very inefficient.
One alternative is to use a compressed filesystem. However, in the general case, it would be nice if users and administrators could compress files ad hoc while still enabling efficient storage.
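As a rough sketch of that alternative (the pool and dataset names are assumed), ZFS can compress transparently per dataset, so backup tools still see the uncompressed bytes:

# Hypothetical dataset; compression happens at the block level,
# so tarsnap reads the same uncompressed SQL that dedupes well.
zfs set compression=gzip tank/db-snapshots
zfs get compressratio tank/db-snapshots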
To put some real-world numbers behind this, I’ve taken 5 production database snapshots, each about a month apart. I’ll use tarsnap to back up these 5 database files while they’re gzipped, and then once again as uncompressed SQL files. Each tarsnap run will be from scratch (so there’s no deduplication against existing data).
Here is my test data in compressed form. Note that ls -h and du -h use powers of 2, while tarsnap --humanize-numbers uses powers of 10.
drue@nest:~/test_dedup$ ls -lh compressed/
total 1617856
-rw-r----- 1 drue drue 145M Mar 17 2010 2010-03-18-00:34:19.sql.gz
-rw-r----- 1 drue drue 146M Apr 9 2010 2010-04-09-19:31:39.sql.gz
-rw-r----- 1 drue drue 162M May 11 20:28 2010-05-12-01:23:16.sql.gz
-rw-r----- 1 drue drue 163M Jun 24 20:34 2010-06-25-01:30:09.sql.gz
-rw-r----- 1 drue drue 173M Jul 15 20:17 2010-07-16-01:13:04.sql.gz
drue@nest:~/test_dedup$ du -hs compressed/
790M compressed/
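(For reference, du’s 790M is 790 MiB, i.e. 790 × 2^20 ≈ 828 × 10^6 bytes, which lines up with the 828 MB tarsnap reports for this data below.)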
And here is my test data uncompressed.
drue@nest:~/test_dedup$ ls -lh uncompressed/
total 28344832
-rw-r----- 1 drue drue 2.4G Oct 20 10:49 2010-03-18-00:34:19.sql
-rw-r----- 1 drue drue 2.4G Oct 20 10:49 2010-04-09-19:31:39.sql
-rw-r----- 1 drue drue 2.8G Oct 20 10:48 2010-05-12-01:23:16.sql
-rw-r----- 1 drue drue 2.8G Oct 20 10:48 2010-06-25-01:30:09.sql
-rw-r----- 1 drue drue 3.0G Oct 20 10:48 2010-07-16-01:13:04.sql
drue@nest:~/test_dedup$ du -hs uncompressed/
14G uncompressed/
These are the tarsnap commands (anonymized) that I will run against each directory.
# register a fresh key, back up the gzip'd snapshots, then delete all archives
tarsnap-keygen --keyfile key --user [email protected] --machine test
tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test compressed/
tarsnap --keyfile key --nuke

# repeat from scratch for the uncompressed snapshots
tarsnap-keygen --keyfile key --user [email protected] --machine test
tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test uncompressed/
tarsnap --keyfile key --nuke
Here are the results:
drue@nest:~/test_dedup$ tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test compressed/
Directory cache created for "--cachedir cache"
                                       Total size  Compressed size
All archives                               828 MB           796 MB
  (unique data)                            828 MB           796 MB
This archive                               828 MB           796 MB
New data                                   828 MB           796 MB
drue@nest:~/test_dedup$ tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test uncompressed/
                                       Total size  Compressed size
All archives                                14 GB           1.0 GB
  (unique data)                            4.8 GB           358 MB
This archive                                14 GB           1.0 GB
New data                                   4.8 GB           358 MB
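For reference, in tarsnap’s stats “Total size” is the data before deduplication and compression, “(unique data)” is what remains after deduplication, and the compressed size of the unique data is roughly what actually ends up stored (and paid for).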
After deduplication, the compressed files shrink only marginally (828 MB of archives still requires 796 MB of storage), while the uncompressed source files take heavy advantage of deduplication: 14 GB of archives becomes 4.8 GB of unique data and just 358 MB of actual storage. If we reran the test with more snapshots, the gap would be even more dramatic.
Given that users and administrators need to be able to compress files, what can be done about this inefficiency? Are there any compression algorithms that address this dilemma?