Many years ago I wrote a simple PHP script to calculate transfer times or bandwidth requirements for file transfers. I still haven’t found a better bandwidth calculator on the internet that is as simple or as flexible.

Today I updated the script to support terabyte file sizes and “days” as a transfer time value, to fit modern usage. I also published the source on GitHub in case someone else might find it useful.

http://therub.org/pub/calc
http://github.com/danrue/bandcalc
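
For readers who just want the arithmetic, here is a minimal Python sketch of the two calculations. It is not the PHP script's code; the function names are made up for illustration, sizes are assumed to be in decimal units (1 TB = 10**12 bytes), and protocol overhead is ignored.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Seconds needed to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / link_bits_per_second


def required_bits_per_second(size_bytes, seconds):
    """Link speed needed to move size_bytes within the given time."""
    return size_bytes * 8 / seconds


# Example: 1 TB over a 100 Mbit/s link.
seconds = transfer_seconds(10**12, 100 * 10**6)
print("%.0f seconds (%.2f days)" % (seconds, seconds / 86400))

# Example: link speed needed to move 1 TB in one day.
print("%.1f Mbit/s" % (required_bits_per_second(10**12, 86400) / 10**6))
```

Real transfers will be somewhat slower than this, since the sketch assumes the link is fully used by payload data.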

 
pw useradd gimpy -d /usr/local/gimpy -s /bin/sh
passwd gimpy # enter password
mkdir /usr/local/gimpy
chown root:gimpy /usr/local/gimpy # root must own the chroot directory
mkdir /usr/local/gimpy/bin
cp /rescue/sh /usr/local/gimpy/bin/
# at the bottom of /etc/ssh/sshd_config:
        Match User gimpy
                ChrootDirectory /usr/local/gimpy
                X11Forwarding no
                AllowTcpForwarding no
/etc/rc.d/sshd reload
 

The following snippet will allow you to specify your preferred PCBSD mirror without using KDE or PCBSD tools. This works in PCBSD 8.1:

### Setup local mirror path
echo 'SoftwareManager\currentMirror=http://my.happy.mirror/pub/pcbsd/' >> /root/.config/PCBSD.conf
echo 'SoftwareManager\mirrorType=2' >> /root/.config/PCBSD.conf
 

I wrote these to do a simple system clone using a USB disk. It’s pretty simple using dump and restore, but it does take a bit to get all of the commands and arguments correct.

To use, install the two scripts on a USB drive. Mount the USB drive on the source system, and run “dump_all.sh”. Be sure to change the labels to match your configuration.

To restore, boot a FreeBSD DVD into Fixit mode, mount the USB drive to /mnt, and run restore_all.sh. Again, be sure to change the device labels to match your system. If your disks are different sizes, you’ll also have to modify the bsdlabel file that is generated by dump_all.sh.

Warning: Do not blindly run these scripts. They’re posted here for reference only, and will break your system if you do not understand how to use them.

dump_all.sh:

#Filesystem       Size    Used   Avail Capacity  Mounted on
#/dev/mfid0s1a    496M    267M    190M    58%    /
#devfs            1.0K    1.0K      0B   100%    /dev
#/dev/mfid0s1e    9.7G    1.7G    7.2G    19%    /usr
#/dev/mfid0s1d    1.9G    102M    1.7G     6%    /var
#/dev/da0          45G    865M     41G     2%    /mnt

dump -0Lauf - /dev/mfid0s1a | gzip > slash.dump.gz
dump -0Lauf - /dev/mfid0s1d | gzip > slash.var.dump.gz
dump -0Lauf - /dev/mfid0s1e | gzip > slash.usr.dump.gz
bsdlabel mfid0s1 > mfid0s1.bsdlabel

restore_all.sh:

#!/bin/sh
set -o errexit

dd if=/dev/zero of=/dev/mfid0 count=4
fdisk -BI /dev/mfid0
bsdlabel -B -w mfid0s1
bsdlabel -R mfid0s1 mfid0s1.bsdlabel
newfs /dev/mfid0s1a
newfs -U /dev/mfid0s1d
newfs -U /dev/mfid0s1e

if [ ! -d tmp ]
then
    mkdir tmp
fi
TMPDIR=/mnt/tmp
export TMPDIR

if [ ! -d root ]
then
    mkdir root
fi
mount /dev/mfid0s1a root
gzcat slash.dump.gz | (cd root && restore -rf -)

if [ ! -d root/var ]
then
    mkdir root/var
fi
mount /dev/mfid0s1d root/var
gzcat slash.var.dump.gz | (cd root/var && restore -rf -)

if [ ! -d root/usr ]
then
    mkdir root/usr
fi
mount /dev/mfid0s1e root/usr
gzcat slash.usr.dump.gz | (cd root/usr && restore -rf -)
 

Commonly I create temporary files using dd like this:

dd if=/dev/random of=blah bs=1m count=16

That will generate a random 16MB file named blah.

Doing the same thing in python looks something like this:

import os

# Write 16 MiB of random data in 512-byte chunks (binary mode).
with open("blah", "wb") as f:
    for _ in range((16 * 2**20) // 512):
        f.write(os.urandom(512))

Posting here because it took me about 15 minutes longer than it should have to find that function.

 

Previously I wrote about the problems encountered when combining compression with deduplication.

There is a little-known patch to GNU gzip, included in Debian-based Linux distributions, that resolves the compression problem at a small cost to compression ratios. Recently, with the help of Josh Paetzel and Gabor Kovesdan, the patch was committed to the FreeBSD port archivers/gzip.

The “--rsyncable” option was created with the same problem in mind. Rsync uses a binary diffing algorithm to transfer partial files, and this patch was designed to allow rsync to efficiently transfer gzip files with similar source.
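
To illustrate the general idea (this is a simplified sketch, not the patch's actual algorithm), the Python below picks cut points from the content itself: a cut happens wherever a checksum of the last WINDOW bytes matches a pattern. Each chunk is then compressed on its own, so a one-byte change can only disturb the chunks near it. The names, the window size, the use of crc32 and the generated sample data are all assumptions made for the example.

```python
import hashlib
import zlib

WINDOW = 32   # bytes of context that decide where a chunk ends
MASK = 0x3FF  # a cut roughly every 1 KiB on average


def split_chunks(data):
    """Cut after any byte where a checksum of the last WINDOW bytes matches."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[i - WINDOW:i]) & MASK == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def compressed_chunks(data):
    """Compress each chunk separately; identical chunks give identical output."""
    return [zlib.compress(chunk) for chunk in split_chunks(data)]


def sample_dump(n_rows):
    """Made-up stand-in for a SQL dump: one INSERT per row."""
    rows = []
    for i in range(n_rows):
        digest = hashlib.sha256(str(i).encode()).hexdigest()
        rows.append("INSERT INTO t VALUES (%d, '%s');\n" % (i, digest))
    return "".join(rows).encode()


old = sample_dump(2000)
new = old[:100] + b"X" + old[101:]  # change a single byte near the start

old_set = set(compressed_chunks(old))
new_parts = compressed_chunks(new)
changed = sum(1 for part in new_parts if part not in old_set)
print("%d of %d compressed chunks changed" % (changed, len(new_parts)))
```

Because a cut depends only on the preceding WINDOW bytes, the cut points line up again shortly after the modified byte, and every later chunk compresses to exactly the same bytes as before. That is what lets a tool that compares blocks (rsync, or a deduplicating backup) recognise the unchanged parts.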

Taking my 5 sample files from my previous example, using gzip --rsyncable:

drue@nest:~/test_dedup/rsyncable$ ls -lh
total 1746272
-rw-r----- 1 drue drue 157M Nov 8 13:25 ordering-2010-03-18-00:34:19.sql.gz
-rw-r----- 1 drue drue 157M Nov 8 13:23 ordering-2010-04-09-19:31:39.sql.gz
-rw-r----- 1 drue drue 175M Nov 8 13:20 ordering-2010-05-12-01:23:16.sql.gz
-rw-r----- 1 drue drue 176M Nov 8 13:17 ordering-2010-06-25-01:30:09.sql.gz
-rw-r----- 1 drue drue 187M Nov 8 13:15 ordering-2010-07-16-01:13:04.sql.gz

On average, gzip --rsyncable created files that are 7.9% larger using my example SQL files. Backing up the “rsyncable” directory with tarsnap:


# tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test rsyncable/
                                       Total size  Compressed size
All archives                               894 MB           872 MB
  (unique data)                            446 MB           435 MB
This archive                               894 MB           872 MB
New data                                   446 MB           435 MB


* Tarsnap numbers converted to base 2

The source data is 5 uncompressed SQL files pulled from a production system. The source files range from 2.4GB to 3.0GB, and span 4 months of changes.

The table shows what we would expect. The best compression ratios come from uncompressed source files. Gzip --rsyncable gives us slightly worse deduplication, but it is still much more efficient than standard gzip.

Gzip --rsyncable provides a nice compromise between efficient system storage and efficient backups.

 

See part two of this post

With the proliferation of deduplication in storage, it would be desirable to have a compression algorithm that was friendly with deduplication.

We’re using the excellent tarsnap which provides both deduplication and compression (as well as strong encryption).

When using both, it is important to deduplicate first, and then compress. If you compress first, changing one byte early in the source file will result in a significantly different compressed file. Therefore two compressed files of similar source data may not dedup effectively.

However, for a variety of business reasons, I would like to efficiently tarsnap historic database snapshots. These snapshots are stored as gzip’d SQL files. We do not have the space to store them uncompressed, and we are not willing to use tarsnap for storage (only for backup). Therefore, we have similar files that are compressed, then deduplicated, then compressed again. The result is very inefficient.

One alternative is to use a compressed filesystem. However, in the general case, it would be nice if users and administrators were able to compress files ad-hoc, while still enabling efficient storage.

To put some real-world numbers behind this, I’ve taken 5 production database snapshots, each about a month apart. I’ll use tarsnap to back up these 5 database files when they’re gzipped, and then once again as uncompressed SQL files. Each tarsnap run will be from scratch (so there’s no deduplication against existing data).

Here is my test data in compressed form. Note that ls -h and du -h use powers of 2 while tarsnap --humanize-numbers uses powers of 10.

drue@nest:~/test_dedup$ ls -lh compressed/
total 1617856
-rw-r----- 1 drue drue 145M Mar 17 2010 2010-03-18-00:34:19.sql.gz
-rw-r----- 1 drue drue 146M Apr 9 2010 2010-04-09-19:31:39.sql.gz
-rw-r----- 1 drue drue 162M May 11 20:28 2010-05-12-01:23:16.sql.gz
-rw-r----- 1 drue drue 163M Jun 24 20:34 2010-06-25-01:30:09.sql.gz
-rw-r----- 1 drue drue 173M Jul 15 20:17 2010-07-16-01:13:04.sql.gz
drue@nest:~/test_dedup$ du -hs compressed/
790M compressed/

And here is my test data uncompressed.

drue@nest:~/test_dedup$ ls -lh uncompressed/
total 28344832
-rw-r----- 1 drue drue 2.4G Oct 20 10:49 2010-03-18-00:34:19.sql
-rw-r----- 1 drue drue 2.4G Oct 20 10:49 2010-04-09-19:31:39.sql
-rw-r----- 1 drue drue 2.8G Oct 20 10:48 2010-05-12-01:23:16.sql
-rw-r----- 1 drue drue 2.8G Oct 20 10:48 2010-06-25-01:30:09.sql
-rw-r----- 1 drue drue 3.0G Oct 20 10:48 2010-07-16-01:13:04.sql
drue@nest:~/test_dedup$ du -hs uncompressed/
14G uncompressed/

These are the tarsnap commands (anonymized) that I will run against each directory.

tarsnap-keygen --keyfile key --user me@example.com --machine test
tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test compressed/
tarsnap --keyfile key --nuke

tarsnap-keygen --keyfile key --user me@example.com --machine test
tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test uncompressed/
tarsnap --keyfile key --nuke

Here are the results:

drue@nest:~/test_dedup$ tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test compressed/
Directory cache created for "--cachedir cache"
                                       Total size  Compressed size
All archives                               828 MB           796 MB
  (unique data)                            828 MB           796 MB
This archive                               828 MB           796 MB
New data                                   828 MB           796 MB


drue@nest:~/test_dedup$ tarsnap --cachedir cache --keyfile key --print-stats --humanize-numbers -c -f test uncompressed/
                                       Total size  Compressed size
All archives                                14 GB           1.0 GB
  (unique data)                            4.8 GB           358 MB
This archive                                14 GB           1.0 GB
New data                                   4.8 GB           358 MB

After deduplication the compressed files are only reduced marginally, while the uncompressed source files take heavy advantage of deduplication. If we reran the test with more files, the results would be even more dramatic.
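
Since du reports powers of 2 and tarsnap reports powers of 10, a quick conversion helps when comparing the listings above. The short Python sketch below (the function name is just for illustration) converts tarsnap's 828 MB total for the gzipped files to MiB, and compares the stored sizes of the two runs using the figures from the results.

```python
def decimal_mb_to_mib(megabytes):
    """Convert MB (10**6 bytes) to MiB (2**20 bytes)."""
    return megabytes * 10**6 / 2**20


# tarsnap's total for the gzipped files, expressed in du's units.
total_mib = decimal_mb_to_mib(828)
print("828 MB is about %d MiB" % round(total_mib))

# Stored size of the gzipped run (796 MB) versus the uncompressed run (358 MB).
ratio = 796 / 358
print("gzipped input stored %.1fx more data" % ratio)
```

The first figure comes out at roughly 790 MiB, which agrees with the 790M that du reports for the compressed directory.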

Given that it is a requirement for users and administrators to be able to compress files, what can be done about this inefficiency? Are there any compression algorithms that address this dilemma?

 

Due to a variety of circumstances out of my control, I found it necessary to control access to an OpenVPN server without depending on a certificate revocation list. After some effort, I discovered a way to execute a script that can check the common name of the client certificate and use the return code to authorize the connection.

First, create a whitelist file. One CN per line. For example, let’s say you have three openvpn clients:

# cat CN_whitelist
client1
client2
client3

Second, create a verify-cn script. Here’s mine:

#!/usr/local/bin/python
# original from http://svn.openvpn.net/projects/openvpn/trunk/openvpn/sample-scripts/verify-cn

# verify-cn -- whitelist based on common name of client certificates
#
# Return 0 if cn matches the common name component of
# X509_NAME_oneline, 1 otherwise.
#
# In OpenVPN, enable like this:
#
#   tls-verify "./verify-cn whitelist_file"
#
# This would cause the connection to be dropped unless
# the client common name is in whitelist_file.
#
# Format of whitelist_file is simply one CN per line.
#

import re, sys

if len(sys.argv) < 4:
    sys.exit("Usage: %s whitelist_file depth x509_subject" % sys.argv[0])

whitelist_file = sys.argv[1]
depth = int(sys.argv[2])
x509 = sys.argv[3]

## If depth is nonzero, tell OpenVPN to continue processing
## the certificate chain.
if depth!=0:
    sys.exit(0)

m = re.search(r'\/CN=([^\/]+)', x509)
if m and m.group(1):
    # read whitelist
    f = open(whitelist_file, 'r')
    for line in f:
        if (m.group(1) == line.rstrip()):
            sys.exit(0)

sys.exit(1) # reject

Then add the following to openvpn.conf:

tls-verify "./verify-cn CN_whitelist"

I'm not sure this should be a primary means of security. However, it could be useful in cases like mine where the CRL was not sufficient.

It is useful in addition to a CRL because a CRL is a blacklist, while this is a whitelist. If it's possible that some keys have been created that you may not be aware of, this might prevent them from slipping through.
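
As a quick check of the regular expression used in verify-cn above, the Python sketch below applies the same pattern to a sample subject. The subject string is made up for illustration and assumes the slash-separated form the script expects; other OpenVPN versions may format the subject differently, in which case the pattern would need adjusting.

```python
import re


def common_name(x509_subject):
    """Return the CN component of a slash-separated subject, or None."""
    m = re.search(r'/CN=([^/]+)', x509_subject)
    return m.group(1) if m else None


# Hypothetical subject string, not taken from a real certificate.
subject = "/C=US/ST=MN/O=Example/CN=client2/emailAddress=admin@example.org"
print(common_name(subject))            # client2
print(common_name("/C=US/O=Example"))  # None
```

A subject without a CN component yields None, which corresponds to the script falling through to its final reject.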

 

At my job we used to have to log every hour worked. I really hated it at first, but I got used to it. After a while, I took for granted the ability to look back and see what I did on any given day. Recently, the policy changed so that we no longer have to log all of our hours, and I immediately began to lose track of my work history.

My solution is very simple.  Since I always have many terminals and screens running, I figured it would be nice if I could just leave notes for myself.  Here’s what I did:

#!/usr/bin/env python3
from datetime import datetime

def log(msg):
    # Ignore empty or one-character entries (e.g. an accidental Enter).
    if len(msg) < 2:
        return

    now = datetime.now().strftime("%Y-%m-%d %H:%M")
    entry = "%s: %s" % (now, msg)
    # Append the entry to the log, then echo it back to the terminal.
    with open('work.log', 'a') as f:
        f.write(entry + "\n")
    print(entry)

while True:
    log(input('work? '))

I simply run it in a terminal screen and whenever I'm working on something or complete something I jot down a note to myself.

Likewise, I did the same thing but for general notes. I generally use a wiki for note taking. The trouble is, the model doesn't work very well for me for transient and petty things. This way, I can just paste/type whatever I'd like into my note file and not really worry about publishing or styling or anything.

The method has yet to prove itself, and it is extremely rudimentary. I prefer to start as simple as possible and grow organically instead of attempting to construct a large solution that may or may not be used as anticipated (even by myself).
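
Looking back at a given day is then just a matter of filtering on the timestamp prefix. Here is a minimal sketch, assuming the "YYYY-MM-DD HH:MM: message" line format written by the script above; the function name and the sample entries are made up for illustration.

```python
def entries_for_day(lines, day):
    """Return the log lines whose timestamp starts with day (YYYY-MM-DD)."""
    return [line.rstrip("\n") for line in lines if line.startswith(day + " ")]


# Made-up entries in the same format as work.log.
sample = [
    "2010-11-08 09:15: reviewed backup reports\n",
    "2010-11-08 13:40: tested gzip port\n",
    "2010-11-09 10:05: rebuilt test jail\n",
]
for entry in entries_for_day(sample, "2010-11-08"):
    print(entry)
```

Against the real file, the same function can be fed an open file object, for example entries_for_day(open('work.log'), '2010-11-08').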

 

gmirror is a simple-to-use RAID-1 implementation for FreeBSD. One common issue with RAID is that people may not actually know when a disk failure has occurred, in business and at home (RAID does little good if the disks are not replaced upon failure).

When I run gmirror at work, I hook it up to our centralized monitoring (Nagios), and all is well. At home, I need a quick way to be alerted to disk failures, without spending a lot of time setting up (and maintaining) monitoring.  Here is the cron job that I use:

*/5 * * * * (gmirror status | grep -q COMPLETE) || (gmirror status | mail -s "Array on pepin is not COMPLETE" drue@example.org)

I like this solution because it is effective, and it does not require any 3rd party packages or external scripts. Of course, if the box is down, it doesn’t help. But if the box is down, you have a bigger problem.

Let’s dissect the cron job.

*/5 * * * *

Run every 5 minutes.

(gmirror status | grep -q COMPLETE)

‘gmirror status’ will display the status of the array.  grep -q will look for the string “COMPLETE” in the output and set the exit code to 0 on success. The result of this entire command, then, is that it will return exit code 0 if the gmirror array is COMPLETE.

One caveat here is that if you have multiple gmirrors, this will happily return success if any of the arrays are complete. You probably need a cron line for each array, with an extra grep to isolate each array individually.

|| (gmirror status | mail -s "Array on hostname is not COMPLETE" drue@example.org)

|| is an “or”. If the previous part is successful (returns 0), then the second part is not executed.

In the failure case, simply, the output of gmirror status is mailed to the system administrator. I guarantee that an email every 5 minutes will be noticed.
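
For the multiple-array caveat mentioned above, one option is to check each mirror's status column separately. The Python sketch below does that on captured text. The sample output is an assumption about the layout of gmirror status (a header line, one line per mirror with its status, and indented continuation lines for extra components), and the device names are invented.

```python
def incomplete_mirrors(status_output):
    """Return {name: status} for every mirror whose status is not COMPLETE."""
    problems = {}
    for line in status_output.splitlines():
        fields = line.split()
        # Only lines that begin with a mirror name carry a status column;
        # the header and component continuation lines are skipped.
        if len(fields) >= 2 and fields[0].startswith("mirror/"):
            name, status = fields[0], fields[1]
            if status != "COMPLETE":
                problems[name] = status
    return problems


# Assumed sample of `gmirror status` output, for illustration only.
SAMPLE = """\
      Name    Status  Components
mirror/gm0  COMPLETE  ad4 (ACTIVE)
                      ad6 (ACTIVE)
mirror/gm1  DEGRADED  ad8 (ACTIVE)
"""

print(incomplete_mirrors(SAMPLE))
```

On a live system the text would come from running gmirror status, for instance through subprocess, and a non-empty result would trigger the mail. This trades the one-line cron job for a small script, so it only makes sense once there is more than one array to watch.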

© 2011 The Rub