20 October 2010
With the proliferation of deduplication in storage, it would be desirable to have a compression algorithm that is friendly to deduplication.
We’re using the excellent tarsnap, which provides both deduplication and compression (as well as strong encryption).
When using both, it is important to deduplicate first and then compress. If you compress first, changing one byte early in the source file results in a significantly different compressed file. Therefore, two compressed files of similar source data may not dedupe effectively.
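You can see the effect with a quick sketch (the file names here are made up, and any reasonably compressible data will do):

# Create two compressible files that differ by a single byte near the start.
seq 1 1000000 > a.txt
cp a.txt b.txt
printf 'X' | dd of=b.txt bs=1 seek=10 conv=notrunc

# Compress both copies and count how many byte positions differ in each pair.
gzip -c a.txt > a.txt.gz
gzip -c b.txt > b.txt.gz
cmp -l a.txt b.txt | wc -l         # uncompressed: exactly 1 differing byte
cmp -l a.txt.gz b.txt.gz | wc -l   # compressed: the streams diverge almost entirely

A block-based deduplicator sees the two uncompressed files as nearly identical, but sees the two gzip files as almost entirely different data.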
However, for a variety of business reasons, I would like to efficiently tarsnap historic database snapshots. These snapshots are stored as gzip’d SQL files. We do not have the space to store them uncompressed, and we are not willing to use tarsnap for storage (only for backup). Therefore, we have similar files that are compressed, then deduped, then compressed again. The result is very inefficient.
One alternative is to use a compressed filesystem. However, in the general case, it would be nice if users and administrators could compress files ad hoc while still enabling efficient storage.
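As a rough sketch of that alternative (the pool and dataset names are assumed), ZFS can compress transparently per dataset, so backup tools still see the uncompressed bytes:

# Hypothetical dataset; compression happens at the block level,
# so tarsnap reads the same uncompressed SQL that dedupes well.
zfs set compression=gzip tank/db-snapshots
zfs get compressratio tank/db-snapshots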
To put some real-world numbers behind this, I’ve taken 5 production database snapshots, each about a month apart. I’ll use tarsnap to back up these 5 database files while they’re gzipped, and then once again as uncompressed SQL files. Each tarsnap run will be from scratch (so there’s no deduplication against existing data).
Here is my test data in compressed form. Note that ls -h and du -h use powers of 2, while tarsnap --humanize-numbers uses powers of 10.
drue@nest:~/test_dedup$ ls -lh compressed/
total 1617856
-rw-r----- 1 drue drue 145M Mar 17 2010 2010-03-18-00:34:19.sql.gz
-rw-r----- 1 drue drue 146M Apr 9 2010 2010-04-09-19:31:39.sql.gz
-rw-r----- 1 drue drue 162M May 11 20:28 2010-05-12-01:23:16.sql.gz
-rw-r----- 1 drue drue 163M Jun 24 20:34 2010-06-25-01:30:09.sql.gz
-rw-r----- 1 drue drue 173M Jul 15 20:17 2010-07-16-01:13:04.sql.gz
drue@nest:~/test_dedup$ du -hs compressed/
790M compressed/
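(For reference, du’s 790M is 790 MiB, i.e. 790 × 2^20 ≈ 828 × 10^6 bytes, which lines up with the 828 MB tarsnap reports for this data below.)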
And here is my test data uncompressed.
drue@nest:~/test_dedup$ ls -lh uncompressed/
total 28344832
-rw-r----- 1 drue drue 2.4G Oct 20 10:49 2010-03-18-00:34:19.sql
-rw-r----- 1 drue drue 2.4G Oct 20 10:49 2010-04-09-19:31:39.sql
-rw-r----- 1 drue drue 2.8G Oct 20 10:48 2010-05-12-01:23:16.sql
-rw-r----- 1 drue drue 2.8G Oct 20 10:48 2010-06-25-01:30:09.sql
-rw-r----- 1 drue drue 3.0G Oct 20 10:48 2010-07-16-01:13:04.sql
drue@nest:~/test_dedup$ du -hs uncompressed/
14G uncompressed/
These are the tarsnap commands (anonymized) that I will run against each directory.
# register a fresh key, back up the gzip'd snapshots, then delete all archives
tarsnap-keygen --keyfile key --user [email protected] --machine test
tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test compressed/
tarsnap --keyfile key --nuke

# repeat from scratch for the uncompressed snapshots
tarsnap-keygen --keyfile key --user [email protected] --machine test
tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test uncompressed/
tarsnap --keyfile key --nuke
Here are the results:
drue@nest:~/test_dedup$ tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test compressed/
Directory cache created for "--cachedir cache"
                                       Total size  Compressed size
All archives                               828 MB           796 MB
  (unique data)                            828 MB           796 MB
This archive                               828 MB           796 MB
New data                                   828 MB           796 MB
drue@nest:~/test_dedup$ tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test uncompressed/
                                       Total size  Compressed size
All archives                                14 GB           1.0 GB
  (unique data)                            4.8 GB           358 MB
This archive                                14 GB           1.0 GB
New data                                   4.8 GB           358 MB
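For reference, in tarsnap’s stats “Total size” is the data before deduplication and compression, “(unique data)” is what remains after deduplication, and the compressed size of the unique data is roughly what actually ends up stored (and paid for).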
After deduplication, the compressed files shrink only marginally (828 MB of archives still requires 796 MB of storage), while the uncompressed source files take heavy advantage of deduplication: 14 GB of archives becomes 4.8 GB of unique data and just 358 MB of actual storage. If we reran the test with more snapshots, the gap would be even more dramatic.
Given that users and administrators need to be able to compress files, what can be done about this inefficiency? Are there any compression algorithms that address this dilemma?