Automatically Simple Since 2002
08 November 2010
Previously I wrote about the problems encountered when combining compression with deduplication.
There is a little-known patch to GNU gzip, included in Debian-based Linux distributions, that resolves the compression problem at a small cost in compression ratio. Recently, with the help of Josh Paetzel and Gabor Kovesdan, the patch was committed to the FreeBSD port archivers/gzip.
The “--rsyncable” option was created with the same problem in mind. Rsync uses a binary diffing algorithm to transfer only the parts of a file that have changed, and this patch was designed to let rsync efficiently transfer gzip files built from similar source data.
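As a rough illustration of the idea (not the patch's actual code), the sketch below keeps a rolling sum over the most recent bytes of input and resets the compressor whenever that sum hits a boundary value. Because the reset points depend only on nearby content, a small change early in the input should only disturb the compressed output around it. The 4096-byte window and modulus, and the name rsyncable_compress, are assumptions made for this example.

```python
import zlib

WINDOW = 4096  # assumed window size and modulus, chosen for illustration


def rsyncable_compress(data: bytes, window: int = WINDOW) -> bytes:
    """Gzip-compress data, resetting the compressor at content-defined points."""
    # wbits=31 asks zlib to wrap the stream in a gzip header and trailer.
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)
    out = []
    rolling = 0  # sum of the last `window` bytes seen so far
    start = 0    # start of the chunk not yet handed to the compressor

    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        # Boundary: a full window has been seen and its sum is a multiple
        # of `window`. This depends only on the local content.
        if i + 1 >= window and rolling % window == 0:
            out.append(comp.compress(data[start:i + 1]))
            # A full flush byte-aligns the output and forgets the history,
            # so later output no longer depends on earlier input.
            out.append(comp.flush(zlib.Z_FULL_FLUSH))
            start = i + 1

    out.append(comp.compress(data[start:]))
    out.append(comp.flush())
    return b"".join(out)
```

The result is still an ordinary gzip stream, so gunzip or Python's gzip.decompress can read it. The extra flushes are where the small loss in compression ratio comes from.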
Taking the 5 sample files from my previous example and compressing them with gzip --rsyncable:
drue@nest:~/test_dedup/rsyncable$ ls -lh
total 1746272
-rw-r----- 1 drue drue 157M Nov 8 13:25 ordering-2010-03-18-00:34:19.sql.gz
-rw-r----- 1 drue drue 157M Nov 8 13:23 ordering-2010-04-09-19:31:39.sql.gz
-rw-r----- 1 drue drue 175M Nov 8 13:20 ordering-2010-05-12-01:23:16.sql.gz
-rw-r----- 1 drue drue 176M Nov 8 13:17 ordering-2010-06-25-01:30:09.sql.gz
-rw-r----- 1 drue drue 187M Nov 8 13:15 ordering-2010-07-16-01:13:04.sql.gz
On average, gzip --rsyncable created files that are 7.9% larger using my example SQL files. Backing up the “rsyncable” directory with tarsnap:
# tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test rsyncable/
                                       Total size  Compressed size
All archives                               894 MB           872 MB
  (unique data)                            446 MB           435 MB
This archive                               894 MB           872 MB
New data                                   446 MB           435 MB
* Tarsnap numbers converted to base 2
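To put the tarsnap output in perspective, the share of data that deduplication avoided storing can be worked out from the "All archives" and "(unique data)" rows. This is a small sketch using the figures printed above; the helper name dedup_savings is made up for this example.

```python
def dedup_savings(total_mb: float, unique_mb: float) -> float:
    """Fraction of the total that did not need to be stored."""
    return 1.0 - unique_mb / total_mb


# Figures taken from the tarsnap --print-stats output above (in MB).
raw = dedup_savings(894, 446)         # "Total size" column
compressed = dedup_savings(872, 435)  # "Compressed size" column

print(f"total size:      {raw:.1%} saved by deduplication")
print(f"compressed size: {compressed:.1%} saved by deduplication")
```

Both columns work out to roughly 50%, meaning about half of the gzip --rsyncable data was recognised as duplicate blocks.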
The source data is 5 uncompressed SQL files pulled from a production system. The source files range from 2.4GB to 3.0GB, and span 4 months of changes.
The table shows what we would expect. The best compression ratios come from uncompressed source files. Gzip --rsyncable gives us slightly worse deduplication, but it is still much more efficient than standard gzip.
Gzip --rsyncable provides a nice compromise between efficient system storage and efficient backups.