Previously I wrote about the problems encountered when combining compression with deduplication.
A little-known patch to GNU gzip, included in Debian-based Linux distributions, resolves the compression problem at a small cost in compression ratio. Recently, with the help of Josh Paetzel and Gabor Kovesdan, the patch was committed to the FreeBSD port archivers/gzip.
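The “--rsyncable” option was created with the same problem in mind. Rsync uses a binary diffing algorithm to transfer only the parts of a file that have changed, and this patch was designed to let rsync efficiently transfer gzip files whose uncompressed sources are similar. It does this by resetting the compressor at points chosen from the input content, so a local change in the source only disturbs the nearby compressed output.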
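To make the idea concrete, here is a minimal Python sketch of content-defined compressor resets. It is not the patch's code: the rolling-sum rule, the `window` and `modulus` values, and all function names are my own illustrative assumptions, and it writes raw deflate rather than the gzip container so that no checksum trailer gets in the way of the comparison.

```python
import random
import zlib


def boundaries(data: bytes, window: int, modulus: int):
    """Yield end offsets where the sum of the last `window` bytes is 0 mod `modulus`."""
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        if i >= window - 1 and rolling % modulus == 0:
            yield i + 1


def compress_rsyncable_like(data: bytes, window: int = 256, modulus: int = 512) -> bytes:
    """Raw deflate with a full flush (state reset) at each content-defined boundary."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw deflate
    out, start = [], 0
    for end in boundaries(data, window, modulus):
        out.append(comp.compress(data[start:end]))
        out.append(comp.flush(zlib.Z_FULL_FLUSH))  # forget history, byte-align output
        start = end
    out.append(comp.compress(data[start:]))
    out.append(comp.flush())
    return b"".join(out)


def compress_plain(data: bytes) -> bytes:
    """Raw deflate with no resets, for comparison."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def common_suffix_len(a: bytes, b: bytes) -> int:
    """Number of identical bytes at the end of both inputs."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def sample_dump(lines: int, seed: int = 0) -> bytes:
    """Hypothetical stand-in for an SQL dump: repetitive, compressible text."""
    rng = random.Random(seed)
    words = ["INSERT", "INTO", "users", "VALUES", "orders", "NULL", "alpha", "beta"]
    return "\n".join(
        " ".join(rng.choice(words) for _ in range(8)) + f" {rng.randrange(10**6)}"
        for _ in range(lines)
    ).encode()


old = sample_dump(4000)
new = old[:1000] + b"-- one extra row inserted here\n" + old[1000:]
for name, fn in (("plain", compress_plain), ("rsyncable-like", compress_rsyncable_like)):
    a, b = fn(old), fn(new)
    print(f"{name}: {len(a)} bytes, identical tail {common_suffix_len(a, b)} bytes")
```

Because the boundaries depend only on the last `window` bytes of input, both versions of the file pick the same reset points once the edit has passed out of the window, and everything compressed after the next reset comes out byte for byte identical. The extra flushes are also where the size penalty comes from.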
Taking the 5 sample files from my previous example and compressing them with gzip --rsyncable:
On average, gzip --rsyncable produced files 7.9% larger than standard gzip for my example SQL files. Backing up the “rsyncable” directory with tarsnap:
* Tarsnap numbers converted to base 2
The source data is 5 uncompressed SQL files pulled from a production system. The source files range from 2.4GB to 3.0GB, and span 4 months of changes.
The table shows what we would expect. The best compression ratios come from uncompressed source files. Gzip --rsyncable gives us slightly worse deduplication, but it is still much more efficient than standard gzip.
Gzip --rsyncable provides a nice compromise between efficient system storage and efficient backups.
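For readers who want to estimate deduplication on their own files before uploading anything, the following rough Python sketch splits each backup into content-defined chunks and counts how many bytes would be stored if every distinct chunk were kept only once. This is not tarsnap's chunking algorithm; the rolling-sum rule, the parameters, and the function names are assumptions made for illustration, and tarsnap's own compression of the stored chunks is ignored.

```python
import hashlib
import random


def split_chunks(data: bytes, window: int = 48, modulus: int = 1024):
    """Split where the sum of the last `window` bytes is 0 mod `modulus`."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        if i >= window - 1 and rolling % modulus == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def dedup_stats(backups):
    """Return (total_bytes, stored_bytes) when each distinct chunk is stored once."""
    seen, total, stored = set(), 0, 0
    for data in backups:
        for chunk in split_chunks(data):
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return total, stored


# Two hypothetical backups: the second is the first with a small insertion.
rng = random.Random(1)
first = bytes(rng.randrange(256) for _ in range(100_000))
second = first[:500] + b"a small change" + first[500:]
total, stored = dedup_stats([first, second])
print(f"total {total} bytes, stored {stored} bytes, ratio {total / stored:.2f}")
```

Feeding it successive plain gzip files should show almost no shared chunks, while uncompressed or rsyncable-style inputs should share most of theirs.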