The Rub

Automatically Simple Since 2002

Compression Can Play Nice With Deduplication (and rsync)

08 November 2010

Previously I wrote about the problems encountered when combining compression with deduplication.

A little-known patch to GNU gzip, included in Debian-based Linux distributions, resolves the compression problem at a small cost to compression ratios. With the help of Josh Paetzel and Gabor Kovesdan, the patch was recently committed to the FreeBSD port archivers/gzip.

The “--rsyncable” option was created with the same problem in mind. Rsync uses a binary diffing algorithm to transfer only the parts of a file that have changed, and this patch was designed so that rsync can efficiently transfer gzip files built from similar source data.
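To make the idea concrete, here is a minimal Python sketch of content-defined compressor resets. It is not the gzip patch itself: the window size, the modulus, and the use of zlib's Z_FULL_FLUSH as the reset are all assumptions chosen for illustration, and the function names are hypothetical. The point is only that when reset points are chosen from the data rather than from the file offset, a small edit near the start of a file leaves most of the compressed output byte-for-byte identical.

```python
import random
import zlib

WINDOW = 4096   # rolling-window size in bytes (assumed value, not taken from gzip)
MODULUS = 4096  # a sync point fires when the window sum is divisible by this (assumed)


def sync_points(data):
    """Yield offsets just past each byte where the rolling window sum is a multiple of MODULUS."""
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte that just left the window
            if rolling % MODULUS == 0:
                yield i + 1


def compress_plain(data):
    """Ordinary raw deflate, for comparison."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def compress_resettable(data):
    """Raw deflate with a full flush at every content-defined sync point."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    pieces = []
    start = 0
    for cut in sync_points(data):
        pieces.append(comp.compress(data[start:cut]))
        # Z_FULL_FLUSH byte-aligns the output and discards the match history,
        # so what follows depends only on the input that follows.
        pieces.append(comp.flush(zlib.Z_FULL_FLUSH))
        start = cut
    pieces.append(comp.compress(data[start:]))
    pieces.append(comp.flush())
    return b"".join(pieces)


def common_suffix_len(a, b):
    """Number of trailing bytes that a and b have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def demo():
    """Compress a synthetic file and a lightly edited copy, then compare the outputs."""
    rng = random.Random(0)
    # Synthetic, repetitive test data: 3000 records drawn from 64 random byte strings.
    records = [bytes(rng.randrange(256) for _ in range(rng.randint(32, 200)))
               for _ in range(64)]
    original = b"".join(rng.choice(records) for _ in range(3000))
    edited = original[:1000] + b"a few inserted bytes" + original[1000:]

    plain_a, plain_b = compress_plain(original), compress_plain(edited)
    reset_a, reset_b = compress_resettable(original), compress_resettable(edited)
    return {
        "plain_size": len(plain_a),
        "plain_suffix": common_suffix_len(plain_a, plain_b),
        "resettable_size": len(reset_a),
        "resettable_suffix": common_suffix_len(reset_a, reset_b),
        "roundtrip_ok": zlib.decompress(reset_a, -15) == original,
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(name, value)
```

In this sketch the shared suffix of the two resettable outputs should cover most of the file, which is what lets rsync or a deduplicating backup tool skip it. A full flush is a blunt instrument that throws away the compressor's history at every sync point, so the size penalty here is much larger than the few percent measured with gzip in this post; it is used only because it makes the independence easy to see.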

Taking the 5 sample files from my previous example and compressing them with gzip --rsyncable:

[email protected]:~/test_dedup/rsyncable$ ls -lh
total 1746272
-rw-r-----  1 drue  drue   157M Nov  8 13:25 ordering-2010-03-18-00:34:19.sql.gz
-rw-r-----  1 drue  drue   157M Nov  8 13:23 ordering-2010-04-09-19:31:39.sql.gz
-rw-r-----  1 drue  drue   175M Nov  8 13:20 ordering-2010-05-12-01:23:16.sql.gz
-rw-r-----  1 drue  drue   176M Nov  8 13:17 ordering-2010-06-25-01:30:09.sql.gz
-rw-r-----  1 drue  drue   187M Nov  8 13:15 ordering-2010-07-16-01:13:04.sql.gz

On average, gzip --rsyncable created files that are 7.9% larger than standard gzip for my example SQL files. Backing up the “rsyncable” directory with tarsnap:

# tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test rsyncable/
                                       Total size  Compressed size
All archives                               894 MB           872 MB
  (unique data)                            446 MB           435 MB
This archive                               894 MB           872 MB
New data                                   446 MB           435 MB

* Tarsnap numbers converted to base 2
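The base 2 conversion and the deduplication figure can be checked with a few lines of Python. This is a small sketch that assumes tarsnap's --humanize-numbers output uses SI (base 10) megabytes while ls -lh reports base 2 units; the variable and function names are only illustrative.

```python
MIB = 2 ** 20  # bytes in one base 2 megabyte


def si_mb_to_mib(megabytes):
    """Convert SI megabytes (10**6 bytes) to base 2 megabytes (2**20 bytes)."""
    return megabytes * 10 ** 6 / MIB


# Figures copied from the tarsnap output above (assumed to be SI megabytes).
total_mb = 894
unique_mb = 446

# File sizes copied from the ls -lh listing above (base 2 megabytes).
ls_sizes_mib = [157, 157, 175, 176, 187]

total_mib = si_mb_to_mib(total_mb)
unique_mib = si_mb_to_mib(unique_mb)
unique_fraction = unique_mb / total_mb

print(f"ls total:       {sum(ls_sizes_mib)} MiB")       # 852 MiB
print(f"tarsnap total:  {total_mib:.1f} MiB")           # 852.6 MiB
print(f"tarsnap unique: {unique_mib:.1f} MiB")          # 425.3 MiB
print(f"unique share:   {unique_fraction:.1%}")         # 49.9%
```

After conversion the tarsnap total lines up with the sum of the file sizes from ls, and roughly half of the archived data is unique, so deduplication removed about half of what would otherwise be stored.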

The source data is 5 uncompressed SQL files pulled from a production system. The source files range from 2.4GB to 3.0GB, and span 4 months of changes.

The table shows what we would expect. The best compression ratios come from uncompressed source files. Gzip --rsyncable deduplicates slightly worse, but it is still much more efficient than standard gzip.

Gzip --rsyncable provides a nice compromise between efficient system storage and efficient backups.