An rdiff-backup love story

Sam Hart

2008-03-14 17:59:48

In my general browsing today I encountered the following very useful post from Carlo Wood concerning recovering deleted files from an EXT3 filesystem. I've never done an "rm -fr ~" on any of my machines, but I've certainly had enough filesystems die on me, and done enough stupid things to appreciate and fear the ramifications of lost data. Hell, not two weeks ago I had my desktop die on me in a spectacular way.

Anyway, I began to wonder what I'd do if I encountered a similar problem, since I use XFS instead of EXT3. (Several years ago, I used Reiserfs, but after a catastrophic Reiserfs-related meltdown I switched.)

Then I realized that I use rdiff-backup and have had incremental backups of all of my data since Summer 2007 (when I started using rdiff-backup :-) So I probably wouldn't need to go through the pain of restoring low-level XFS transactions.

I am very enamored with rdiff-backup. In fact, if rdiff-backup were a woman, it would be a no brainer to cheat on my wife with her (unless she already was my wife, of course). I also think I have a pretty clever system set up for my backups, so I'm going to share it with you all...

The Problem


I work from home, which means I have a little office in my house that houses all of the systems I use for work. Here is my workspace (a bit cluttered at the moment; the desktop you see to the right is the one that died two weeks ago, and I haven't replaced it yet):


This office used to be a bedroom. It had a small closet in it, which we converted into a home for my pseudo rack and storage for things like cables, unused cards, and miscellaneous electronic bits. Here is a picture of the portion of it that houses my pseudo rack:


Yes, that's just a metal shelving unit from Lowe's sitting in a closet. I have a cheap 4-port KVM connected to the 3 systems I'm currently using for work- and personal-related things. Each of these systems (including my laptop and my now-dead desktop) needs to be backed up. Additionally, the 3 servers (not shown here) that power my various websites need to be backed up as well. These systems run various Linuxes including Ubuntu, Debian, SUSE, and Fedora.

So I have a small, moderately heterogeneous array of systems that I need to back up. I'd like it to be done in an easy (read: "low ceremony on my part") and centralized way.

Traditionally, my non-easy (and "high ceremony") way of backing them up was to dump things to DVDs/CDs and archive them somewhere. This was problematic because I had to actively participate in the backups, as well as cull down large and unruly collections of files that wouldn't fit on a single DVD (or, heaven forbid, organize some sort of split across multiple ISOs). Obviously I didn't have incremental backups, and I had to somehow organize my DVD/CD archives in a fashion that wouldn't make locating one of them an impossible task. Because this method was so difficult and irritating, I really didn't do it often enough (every 3-6 months I'd feel guilty about not having backups and go on a backup frenzy for a day or two).

There are plenty of Free Software and Open Source backup tools out there, but the problem with most of them is that they are designed for larger and often more homogeneous setups than mine. A dozen or so quirky systems running a gamut of different distros (many of which get wiped and reinstalled as projects evolve and change) tend not to fit well into most of the larger backup suites out there.

The Solution: rdiff-backup


I'll be honest, I forget where I was first exposed to rdiff-backup. It was likely another "rdiff-backup love story" post someone else made that is similar to this one. The important part is that I found it, and discovered how incredible it is :-)

In a nutshell, rdiff-backup is a tool (based on the rsync delta algorithm) that backs up one directory to another (even over networks). The resultant directory is a copy of the original, but any changes made since the last backup are stored as reverse diffs inside a special subdirectory (rdiff-backup-data). This means that the backed-up directory is a mirror of the source as of the most recent rdiff-backup run, but every incremental change that has occurred since the backups first started is still available should you ever need to restore it (or simply roll the directory back in time).

rdiff-backup is amazingly simple to use. The basic format of an rdiff-backup command looks like:

rdiff-backup source destination


Here, source/destination can be directories mounted locally or ones available across a network.

For example, if I had my source and destination both mounted locally on my filesystem, say "/home/sam" as the source and "/mnt/usb/home-sam-backups" as the destination, my rdiff-backup line would look like:

rdiff-backup /home/sam /mnt/usb/home-sam-backups


If I had a remote directory ("/var/www/some-website") accessible over SSH on the server "somewebsite.net", and I wanted to back it up locally to "/mnt/usb/some-website-backups", my rdiff-backup line would look like:

rdiff-backup user@somewebsite.net::/var/www/some-website /mnt/usb/some-website-backups


Note that you can also go from server to server in a similar way:

rdiff-backup user@server1.net::/source/path user@server2.org::/backup/path
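
Restoring is just as painless. rdiff-backup's --restore-as-of (or -r) option pulls a file or directory back the way it looked at some point in the past. A quick sketch, reusing the example paths from above:

# Restore /home/sam as it looked 3 days ago into /tmp/sam-3-days-ago
rdiff-backup -r 3D /mnt/usb/home-sam-backups /tmp/sam-3-days-ago

# List the increments available in a backup directory
rdiff-backup --list-increments /mnt/usb/home-sam-backups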


USB Mounted HDDs for backups


As has been pointed out elsewhere many times, HDDs have become so inexpensive that it's actually more economical to use extra drives for your backups than to get some tape system. This has a number of big advantages, such as faster backups and restores (HDDs will always be faster than tapes) as well as large storage capacities. The big disadvantage of HDDs is that they traditionally had to be physically mounted inside the PC doing the backup, which isn't great when it comes time to archive them (e.g., remove them from the premises to protect them from things like fires, hurricanes, and gremlins).

Well, this disadvantage is nullified by the fact that you can now buy external USB HDD enclosures insanely cheap.

So, what I've done is buy a couple of 500GB HDDs (you could easily get larger) and a couple of cheap (usually ~$20) USB enclosures for them. You will have to set up some automounting via partition labels (which is beyond the scope of this document), but once you have, these backup drives essentially become "hot-swappable" (you know, as long as you swap them in between backups and restores :-)
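
One rough sketch of the label approach: label the backup partition (e2label works for ext2/ext3) and give it a fixed mount point in /etc/fstab, something like this (the "backup0" label and mount point are just placeholders):

# /etc/fstab entry: mount whichever drive is labeled "backup0" at a fixed spot
LABEL=backup0  /mnt/backup0  ext3  noauto,user  0  0

With noauto,user in there, the drive won't try to mount at boot, but anyone can bring it up with a plain "mount /mnt/backup0" after plugging it in (or your backup script can do it for you).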

If you look in the picture of my pseudo-rack you'll see a small shelf at the top which is just large enough for my external USB HDDs (as well as a switch).

Once you have these USB-mounted HDDs set up to automount (and to mount to the same spot when they do), you're ready for a bit of rdiff-backup scriptage to make this backup scheme truly "hands-off".

The rdiff-backup script(s)


Okay, so you have your USB HDDs mounted to some central system which will handle the backups. Now we need some scripts that automatically handle scheduled backups. For the scheduler, under Linux you'll already have cron, so that's what we'll use. You could technically just cron all your rdiff-backup lines (in one large crontab file), but I personally chose to fancy it up with some scriptage and logging.
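
If you did want the bare-bones route, a single crontab entry (reusing the example paths from earlier) might look something like this:

# Run one rdiff-backup job every night at 2:30 AM (edit with crontab -e)
30 2 * * * rdiff-backup /home/sam /mnt/usb/home-sam-backups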

My scripts are simple shell scripts. In order to add logging to them, they all have the following function:

trace () {
    # Prefix whatever we were given with a timestamp and append it to the log
    stamp=`date +%Y-%m-%d_%H:%M:%S`
    echo "$stamp: $*" >> /var/log/backup.log
}


This function will dump whatever is sent to it into a log file called "/var/log/backup.log" (you could call this whatever you like), with a date prefix. I prefer my date prefixes to be YYYY-MM-DD_HH:MM:SS, like "2008-03-14_05:30:44", as that makes the log easier for me to parse (and to script a parser for), but you could realistically use whatever stamp floats your boat here instead.

Then, when I actually run a backup, I usually start with a trace() call saying I've started the backup, as well as trace() calls for each path I am backing up. My rdiff-backup calls get their STDOUT and STDERR redirected to the backup.log file. Finally, I close the backup with a call to trace() saying I'm done. Thus, a typical backup session for a given server might look something like this:


trace "Backup for mysite.net started"
trace "Working /var/www"
rdiff-backup user@mysite.net::/var/www /mnt/backup0/mysite.net/var-www >> /var/log/backup.log 2>&1
trace "Working /usr/share/cms-sites/"
rdiff-backup user@mysite.net::/usr/share/cms-sites/ /mnt/backup0/mysite.net/cms-sites >> /var/log/backup.log 2>&1
trace "Working /etc"
rdiff-backup user@mysite.net::/etc /mnt/backup0/mysite.net/etc >> /var/log/backup.log 2>&1
trace "Backup for mysite.net complete"
trace "------------------------------"


Note: This assumes that the user has passwordless access to the remote system mysite.net. If you don't know how to do this, check the man pages or google for ssh-keygen and ssh-copy-id.
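
In short, the setup usually boils down to something like this (run it as whatever user does the backups; user@mysite.net is just the placeholder account from above):

# Generate an SSH key pair; leave the passphrase empty for fully unattended
# backups (or use ssh-agent if you'd rather keep a passphrase)
ssh-keygen -t rsa

# Install the public key on the remote account so logins stop prompting
ssh-copy-id user@mysite.net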

A typical backup script would then have multiple backup session sections like this. Putting it all together, the backup script may look a lot like this script. Sure, you could likely do something fancy with arrays and more functions, but I figured I was getting fancy enough with these scripts so I just stopped there :-)
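
In case that script isn't handy, here's a rough sketch of the general shape (the host, user, paths, and /mnt/backup0 mount point are just the placeholders from earlier; swap in your own):

#!/bin/sh
# Nightly backup script -- a sketch only; adjust hosts, users, and paths.

trace () {
    # Prefix whatever we were given with a timestamp and append it to the log
    stamp=`date +%Y-%m-%d_%H:%M:%S`
    echo "$stamp: $*" >> /var/log/backup.log
}

# --- mysite.net ---
trace "Backup for mysite.net started"
trace "Working /var/www"
rdiff-backup user@mysite.net::/var/www /mnt/backup0/mysite.net/var-www >> /var/log/backup.log 2>&1
trace "Working /etc"
rdiff-backup user@mysite.net::/etc /mnt/backup0/mysite.net/etc >> /var/log/backup.log 2>&1
trace "Backup for mysite.net complete"
trace "------------------------------"

# --- this box ---
trace "Backup for localhost started"
trace "Working /home/sam"
rdiff-backup /home/sam /mnt/backup0/localhost/home-sam >> /var/log/backup.log 2>&1
trace "Working /etc"
rdiff-backup /etc /mnt/backup0/localhost/etc >> /var/log/backup.log 2>&1
trace "Backup for localhost complete"
trace "------------------------------"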

Simply save this script, call it from cron (I usually have mine run at night when I'm sleeping), and when you look at your backup.log file, you'll see your rdiff-backup messages intermingled with an indication of what was going on when they happened:

.....
2008-03-14_00:32:54: Working /home/gecika
Warning: could not determine case sensitivity of source directory at
/home/gecika
because we can't find any files with letters in them.
It will be treated as case sensitive.
2008-03-14_00:32:58: Working /home/sam
2008-03-14_00:33:10: Working /home/www
.....



Now, you'll likely come up with multiple scripts for this; however, I'd just use the base script as your starting point. For example, while I do have scripts that run every day from given boxes, I have a special script just for my laptop that I fire off manually (the laptop usually won't be on overnight :-) But at any rate, what I have here is enough to get started on using rdiff-backup for wonderful and low-ceremony daily (or weekly, monthly, whatever) backups.