I don't think I have to convince you that creating regular backups of your valuable files is a good idea. I maintain complete backups of my hard drive both on-site and in the cloud.
Sometimes, though, I'd just like to periodically sync some specific directories to the cloud for easy retrieval. It'd also be nice to be able to archive large amounts of data without worrying about the cost. Both goals are made possible by the convenience of S3 and the inexpensiveness of Glacier.
In this tutorial, I'll teach an easy way of creating backups in S3 and Glacier. We'll use the `sync` command provided by the AWS CLI and versioned buckets to make incremental backups. If you don't have the AWS CLI installed, make sure to install it first.
Important Note: If your files are important to you, this shouldn't be your only backup. However, AWS S3 and Glacier make cloud backups very convenient, ridiculously cheap, and easy to recover, so they are a good addition to your backup options.
The first part of this tutorial involves creating an S3 bucket and syncing a local directory to it. Here are the steps:
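First, create the bucket itself. The name `my.backup.bucket` is just an example; bucket names are globally unique, so pick your own:

```
aws s3 mb s3://my.backup.bucket
```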
This is an important step: we want to enable versioning for our backup bucket. With versioning enabled, no data is ever truly deleted. We can recover old versions of files, and even files that have been deleted locally, which is kind of the point of having backups.
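Versioning is a single API call away; assuming the bucket we just created:

```
aws s3api put-bucket-versioning \
  --bucket my.backup.bucket \
  --versioning-configuration Status=Enabled
```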
Now that we have a bucket, we can store something in it. The easiest way of doing that is using the `sync` command in the AWS CLI.
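A minimal sketch, assuming we're backing up a local `~/Documents` directory to the bucket created earlier:

```
aws s3 sync ~/Documents s3://my.backup.bucket/Documents \
  --delete \
  --sse
```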
Note that we're using the `--delete` flag here. That doesn't mean that locally deleted files are actually deleted from the bucket as well. Instead, since we're using versioning, those files are merely marked as deleted. You can still recover them by downloading older versions of the same files from S3.
The `--sse` flag tells S3 to encrypt the files. The official documentation has lots of details on the different encryption options.
Backups are worthless unless you can recover them. Using `sync` against a versioned bucket simply returns the latest version of all files, excluding the ones that are marked as deleted.
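Recovery, then, is just a sync in the opposite direction; assuming the same example names as before:

```
aws s3 sync s3://my.backup.bucket/Documents ~/Documents
```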
S3 doesn't provide an API for point-in-time recovery. In other words, you can't tell S3 to return, say, the versions of objects that were current on May 1st last year.
It is possible, however, to list all versions and find the correct version by comparing the timestamps. s3-pit-recovery, a library I wrote, does exactly that:
Install via npm:

```
npm install -g s3-pit-recovery
```

Restore to a point in time. After running the command below, `pit.my.backup.bucket` will look exactly like `my.backup.bucket` did on May 15th, 2018:

```
s3-pit-restore \
  --bucket my.backup.bucket \
  --destinationBucket pit.my.backup.bucket \
  --time '2018-05-15T10:38:00.614Z'
```
If some of your object versions reside in Glacier, s3-pit-recovery gives you the option to restore them to S3.
Since nothing is ever deleted from the bucket with this method, the amount of data can grow pretty huge over time. A great way to save money on storage is to take advantage of the different storage classes in S3. The storage classes we'll use are:
STANDARD: This is the default S3 storage class where all objects start their life.
STANDARD_IA: IA stands for Infrequent Access. This storage class offers cheaper storage than the standard class, the trade-off being lower availability and a higher cost of retrieval. Hence, as the name suggests, you should use this storage class if you know you're going to access your data infrequently.
GLACIER: Glacier is a long-term storage service that trades speed of retrieval for low prices. Storage in Glacier is very cheap. At the time of writing, Amazon charges $0.004 per gigabyte per month. The trade-off is that retrieving records from Glacier can take hours. You can't simply download your files from Glacier. You have to first restore them to S3. Simply put, Glacier is an archival service. Don't use it for files you want to be able to restore quickly in an emergency.
Remember that the standard S3 storage class is already very cheap. If you don't have massive amounts of data, storage classes make very little difference. However, if you do have lots of data, transferring infrequently accessed files to STANDARD_IA can cut your S3 expenses in half.
In summary, when you're deciding which storage class you are going to use, the question you should ask yourself is "How often am I going to access this data?". If you have to access the data frequently, you should use STANDARD. If you almost never have to access the data, you should use GLACIER. STANDARD_IA sits between those two.
You can, of course, get scientific and analyze your data to figure out the optimal storage class for each object. AWS even has a tool for this. Here are some quick guidelines, though, that I think are quite sensible:
Use STANDARD_IA for files you are storing for emergency recovery. In other words, data you only need in the unlikely event that things go horribly wrong and you have to restore to a point in time when everything was still hunky-dory.
Use GLACIER for files you don't expect to ever need, but you still want to archive them for whatever reason.
Use STANDARD for everything else.
Since in this tutorial we are dealing with backups, STANDARD_IA is a pretty obvious choice. Files that are really old can be further transferred to Glacier. What "really old" means exactly, I leave up to you to decide. However, in the next section, I give an example setup where the time to wait before transferring objects to Glacier is one year.
To read more about the differences between storage classes, see the AWS documentation.
To transition objects to different storage classes, we take advantage of a feature of S3 buckets called lifecycle rules. Lifecycle rules are a way to tell S3 how long it should hold on to data before transferring it to a different storage class, or even deleting it (not recommended). All storage class transfers are thus automated and we don't have to worry about them beyond creating the lifecycle configuration and applying it to our bucket.
First, let's create a new file and name it `lifecycle.json`. We populate the file with the JSON below and save.
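Here's a minimal sketch of such a configuration in the `put-bucket-lifecycle-configuration` format; the rule ID is an arbitrary placeholder, and the day counts match the description below:

```
{
  "Rules": [
    {
      "ID": "backup-transitions",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ],
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 30, "StorageClass": "STANDARD_IA" },
        { "NoncurrentDays": 365, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```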
The configuration above moves the latest version of all files to STANDARD_IA 30 days after uploading (`Transitions`), and likewise moves old versions to STANDARD_IA 30 days after a newer version of the same file is inserted into the bucket. Old versions are kept in S3 for 365 days before transferring to Glacier (`NoncurrentVersionTransitions`).
If you want to get rid of old versions, you can add a `NoncurrentVersionExpiration` action to your rule. Expired versions are permanently deleted and cannot be recovered, so please be careful with this rule!
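As a sketch, adding something like this to the rule above would permanently delete versions after they've been noncurrent for five years (the 1825-day figure is purely illustrative):

```
"NoncurrentVersionExpiration": { "NoncurrentDays": 1825 }
```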
See the AWS documentation to learn more about lifecycle rules.
Finally, let's apply the lifecycle configuration to our S3 bucket.
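Assuming the bucket and the `lifecycle.json` file from above:

```
aws s3api put-bucket-lifecycle-configuration \
  --bucket my.backup.bucket \
  --lifecycle-configuration file://lifecycle.json
```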
Now you have a bucket that's all set up for backups. The only thing left to do is to schedule the backups.
If you're on a Mac or Linux, you can use cron. Open your crontab in an editor:
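```
crontab -e
```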
Choose a schedule for your backups. The most important thing here is to choose a time when your computer is likely to be switched on. The following entry will run the `sync` command once every five hours. You can, of course, choose a more frequent backup schedule to play it safe.
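Here's a sketch of what the crontab line could look like, assuming the same directory and bucket as before, and that the AWS CLI lives at `/usr/local/bin/aws` (see the note below about full paths):

```
# Run the backup sync at minute 0 of every fifth hour
0 */5 * * * /usr/local/bin/aws s3 sync $HOME/Documents s3://my.backup.bucket/Documents --delete --sse
```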
A nice thing about the `sync` command is that it doesn't copy any files to S3 unless they have changed. This means you can afford to have it run fairly frequently, as most invocations won't result in many file transfers.
Note that cron runs all scripts without the usual environment variables, such as `PATH`, which means that you have to use the full path to your `aws` command. You can find it by running `which aws` in your terminal.
Backing files up to S3 is so easy and cheap there's almost no reason not to do it. Please don't rely on one service though. I keep multiple backups of my important files, both locally and in the cloud, and I recommend you do the same.
That concludes the tutorial. Please leave a comment and let me know if you found it useful.