The differences between a backup and an archive couldn’t be more stark and are relatively easy to understand. Since both are copies of data that are stored and used to restore data, the difference comes down to why they will be restored (i.e., their purpose). Because their purposes are different, they are stored differently with different metadata.
I like to say that a backup is a secondary copy of primary data, and an archive is a primary copy of secondary data. This means that a backup is another copy of important, recent data that you use in case something happens to the primary copy. An archive is the main copy of data that has typically lost its primary reason for being, but that we’re keeping around for historical purposes. It’s therefore the primary copy of data of secondary value.
What is a Backup?
Definition: A backup is a copy of data stored separately from the original, used for the purposes of restoring that data to its former state, usually after the data has been deleted or damaged in some way.
A "copy" is a replica of the original data, including all content and metadata. True copies should retain all security and permission settings from the original. A copy also must not rely on the original in any way, as storage snapshots do, for example. Storage snapshots are created within the storage system and do not contain the original content but rather reference it. They are, therefore, not a copy; they are a virtual copy that needs the original to function. Ergo, storage snapshots (that have not been copied somewhere else) are not backups.
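To make the snapshot point concrete, here is a deliberately simplified Python sketch. It is a toy model, not how any real storage system implements snapshots: the `volume` dictionary, the `snapshot` structure, and both helper functions are hypothetical names invented for illustration. It only shows that something that references the original becomes unreadable when the original is lost, while an independent copy does not.

```python
import copy

# Hypothetical toy model: a "volume" maps block ids to block contents.
volume = {0: b"hello", 1: b"world"}

# The toy snapshot stores no data of its own, only references to the source.
snapshot = {"source": volume, "block_ids": list(volume)}

# A true copy holds its own data and does not depend on the original.
full_copy = copy.deepcopy(volume)


def read_snapshot(snap):
    """Read a snapshot by following its references into the source volume."""
    return b"".join(snap["source"][i] for i in snap["block_ids"])


def read_copy(blocks):
    """Read an independent copy directly from its own blocks."""
    return b"".join(blocks[i] for i in sorted(blocks))


# Simulate losing the original storage system.
volume.clear()

try:
    snapshot_data = read_snapshot(snapshot)
except KeyError:
    snapshot_data = None  # the snapshot cannot be read without its source

copy_data = read_copy(full_copy)
```

In this sketch, `snapshot_data` ends up as `None` while `copy_data` still holds the original bytes, which is the whole argument for why an uncopied snapshot is not a backup.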
The next part of the definition is “stored separately from the original.” If a copy is stored in the same file system, computer, or database as the original, I would call it a "convenience copy" rather than a backup. The core idea is that for something to qualify as a backup, it must be stored on separate media or in a different location.
An example of a copy that does not qualify as a backup is a version of a document you are working on that is stored on the same drive, or with the same SaaS vendor, as the original. A disaster could take out both copies. In other words, that would not be considered a backup.
To store backups, you can use something as simple as a removable USB drive, a backup tape sitting in a tape library or a storage vault, a separate storage system for storing backups (e.g., a deduplication system), or, more commonly, cloud storage.
The next part of the definition is “for the purposes of restoring that data”. If it’s for restoration (and meets the other parts of the definition), then it is a backup. If it’s for other reasons, it is probably an archive.
What does it mean when we say restore?
A restore is the process of using a backup to return the original data to a prior state. Restores are usually aimed at reverting a server, file system, or cloud resource to a relatively recent point in time. Backups are most often used to restore data from yesterday or even more recent timeframes. The objective is almost always to restore from the most recent backup available; however, there are exceptions, such as the need to restore data deleted a while ago or to counteract the effects of a ransomware infection. In such cases, older backups may be required. But typically, when we start talking about bringing data back from many years ago, hopefully it will be from an archive; otherwise, it will be very difficult.
To perform a restore, multiple pieces of information are necessary to identify the backup's source, including the server, VM, or resource name, the application name, the server or VM's credentials, and perhaps a subset name (e.g., file system, directory, or table). For example, we are going to restore the /stuff directory on the Apollo VM. Or perhaps we are restoring the user Curtis from within our company’s Microsoft 365 account.
Additionally, you must know the date on which the specific item to be restored was in the desired state. This date is crucial for selecting the correct point in time to which the data should be reverted. Even if you are restoring several different things, that doesn’t change the fundamental meaning of the word, because that’s still several restores, not a collective action (like a retrieve from an archive would be).
A restore returns a single resource to a single point in time.
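The ingredients of a restore described above (a named resource, a subset, and a date) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the interface of any real backup product: the `Backup` and `RestoreRequest` classes, the `select_backup` function, and the sample catalog are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Backup:
    resource: str        # e.g. the server or VM name
    subset: str          # e.g. a directory or table
    taken_at: datetime   # when this backup was made


@dataclass(frozen=True)
class RestoreRequest:
    resource: str
    subset: str
    point_in_time: datetime  # when the data was last in the desired state


def select_backup(catalog: list[Backup], request: RestoreRequest) -> Optional[Backup]:
    """Pick the newest backup of the named resource taken at or before the requested time."""
    candidates = [
        b for b in catalog
        if b.resource == request.resource
        and b.subset == request.subset
        and b.taken_at <= request.point_in_time
    ]
    return max(candidates, key=lambda b: b.taken_at, default=None)


# Hypothetical catalog of nightly backups.
catalog = [
    Backup("apollo", "/stuff", datetime(2024, 3, 1, 2, 0)),
    Backup("apollo", "/stuff", datetime(2024, 3, 2, 2, 0)),
    Backup("apollo", "/stuff", datetime(2024, 3, 3, 2, 0)),
    Backup("zeus", "/stuff", datetime(2024, 3, 2, 2, 0)),
]

request = RestoreRequest("apollo", "/stuff", datetime(2024, 3, 2, 12, 0))
chosen = select_backup(catalog, request)
```

Notice that the request cannot even be expressed without knowing the resource name and the date. That is fine for a restore, and it is exactly what you tend not to know when you go looking in an archive.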
What is a Data Archive?
Folks often make a real mess out of the term "archive" in the IT world. They toss it around casually, describing any old data as an archive. The reality is an archive is a very particular beast. It irks me when folks label old backups as "archives." You don't just stumble upon an archive; you've got to create it, intentionally.
Archiving isn't about slapping your old backups onto a cheaper storage medium designed for long-term storage. That's just migrating a backup. Don't kid yourself; there's no such thing as "archiving a backup." You're merely moving that backup to a long-term residence.
Let's be clear, backups and archives are as different as grape juice and fine wine. Old backups don't magically transform into archives any more than grape juice transforms into wine without a special process. If you want wine, you must plan for it from the start. Likewise, if you want an archive, you've got to go out and make one. It's high time we stop confusing old backups with archives, as they're typically lousy at performing the archive's duties. How long something sits around doesn't determine its role; it's how and why it was stored that counts.
So, what is an archive, then?
Definition: An archive is a copy of data stored in a separate location, made to serve as a reference copy, and stored with enough metadata to be able to find the data in question without knowing where it came from.
Let’s Break Down the Definition of an Archive
The first two parts of this definition will sound similar to a backup. You do need to keep an archive in a separate spot, and it must be a complete copy, no question. However, where an archive truly sets itself apart from a backup is in its purpose and how you'll use it. This purpose will also determine how the data and its associated metadata are stored.
Archives aren't there to resurrect a server or a file to its former glory. They exist to unearth data for purposes usually entirely different from the data’s original intent. It might be a related purpose, sure, but it's rarely the same. Take, for example, an archived CAD drawing for a satellite model. You're not pulling it up to build an identical satellite but maybe to create a similar one or investigate why the existing one crashed down to Earth. These are linked purposes but not the same.
Using the Example of an Archive in Terms of Email
Let’s also talk about email archives. They aren't for resurrecting your email server; that's what backups are for. Instead, email archives typically serve e-discovery purposes. They're about finding emails that fit specific patterns, phrases, or criteria, not about recovering every email from yesterday. That's what backups are for.
This is among the reasons why I don't consider Microsoft 365 Retention Policies or Google Archive as backups for Microsoft 365 or Google Workspace. They're archives of your email and data, not backups. They're there for your reference, mainly for e-discovery, not for restoring your entire database. So, the way you query these archives is entirely different from how you'd query a backup of the same data.
Data Archives Store Additional Metadata as Compared to a Backup
A crucial feature of archives is the additional metadata they contain. Sometimes, this metadata is already part of the archived content, like sender, recipient, subject, and date in an email archive. Other times, you add extra metadata when creating an archive, such as naming the archive after a project. This metadata might not be in the files themselves, but it's vital for searching by project name.
Some archive systems go the extra mile and extract plaintext content from the archived data, enabling full-text searches of the content. This is handy when you need to search based on content, not just metadata. You're looking for info inside files or emails, not just their surface-level details.
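Here is a rough Python sketch of those two ideas: metadata attached at archive time, and plaintext extracted into a simple word index for full-text search. It is a toy, assuming everything fits in memory; the `Archive` class, its method names, and the sample items are hypothetical and do not reflect any particular archive product.

```python
import re
from collections import defaultdict


class Archive:
    """Toy in-memory archive: stores metadata per item and indexes its text."""

    def __init__(self):
        self.items = {}                     # item id -> (metadata, text)
        self.text_index = defaultdict(set)  # lowercase word -> item ids

    def add(self, item_id, text, **metadata):
        """Store an item with whatever metadata the archivist supplies."""
        self.items[item_id] = (metadata, text)
        for word in re.findall(r"\w+", text.lower()):
            self.text_index[word].add(item_id)

    def by_metadata(self, **criteria):
        """Return ids of items whose metadata matches every given key and value."""
        return {
            item_id
            for item_id, (meta, _) in self.items.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        }

    def by_word(self, word):
        """Return ids of items whose extracted text contains the word."""
        return set(self.text_index.get(word.lower(), set()))


archive = Archive()
# "project" is metadata added at archive time; it is not inside the files themselves.
archive.add("msg-1", "Widget tolerances attached.", sender="alice", project="widget")
archive.add("doc-7", "Satellite bracket drawing notes.", author="bob", project="widget")
archive.add("msg-2", "Lunch on Friday?", sender="alice", project=None)

project_items = archive.by_metadata(project="widget")
word_hits = archive.by_word("satellite")
```

The project search works even though neither item mentions the project in a consistent place, because the project name was recorded as metadata when the items were archived.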
Why does all this matter so much?
Well, retrieval is a horse of a different color compared to restoration. Restoring data doesn't require meticulous metadata querying, but for retrieval, metadata is king. You often lack the specifics needed for restoration. When you're retrieving, you might be fishing for data from many servers, applications, and across a range of dates. It's the opposite of a straightforward restore operation.
Unlike a typical restore, a retrieve might find you searching for information years or even decades after it was created. You might have a vague notion of the servers that might have held it, but recalling server names, database names, and such is a long shot.
Try remembering the email server you used five years ago. Nearly impossible, right? If you can, it's either still in use or it gave you a world of trouble. I can’t remember any servers’ names from 30 years ago, except for one (Paris). That one almost ended my career before it began.
During retrieval, you're not on the hunt for a server or a file; you're after information, specifically content. You want all emails with "Apollo" in them, emails between Stephen Smith and Jane Collins over the last three years, regardless of their originating systems. You're digging for different versions of the code John Stevenson worked on five years ago.
Maybe you vaguely remember some projects from years past that could be pertinent to a current project. Your brain says, "This reminds me of the widget project from three years ago." So, you delve into the archive system and find that widget project from yesteryear. Click, and voila! All project-related files and emails are at your fingertips.
(Restoring this with a backup system, assuming you knew where everything used to be, would be a complex task, involving multiple restores and more information than you usually have in such a scenario.)
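By contrast, a retrieve can be sketched as a query over content and people, with no server name in sight. The Python below is a minimal illustration with made-up data; the `ArchivedEmail` class, the `retrieve` function, and its parameters are hypothetical, and a real archive system would use a proper search index rather than a linear scan.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ArchivedEmail:
    source_system: str  # recorded at archive time, but never needed to search
    sender: str
    recipient: str
    sent_on: date
    body: str


def retrieve(emails, *, participants=None, phrase=None, since=None):
    """Return emails matching every supplied criterion, whatever system they came from."""
    results = []
    for mail in emails:
        if participants is not None and {mail.sender, mail.recipient} != set(participants):
            continue
        if phrase is not None and phrase.lower() not in mail.body.lower():
            continue
        if since is not None and mail.sent_on < since:
            continue
        results.append(mail)
    return results


# Hypothetical archive contents drawn from two different mail systems.
emails = [
    ArchivedEmail("mail-old", "Stephen Smith", "Jane Collins", date(2023, 5, 1), "Apollo budget draft"),
    ArchivedEmail("mail-new", "Jane Collins", "Stephen Smith", date(2024, 1, 9), "Re: lunch"),
    ArchivedEmail("mail-new", "John Stevenson", "Jane Collins", date(2024, 2, 2), "Apollo code review"),
    ArchivedEmail("mail-old", "Stephen Smith", "Jane Collins", date(2019, 7, 4), "Early notes"),
]

# Everything mentioning "Apollo", regardless of originating system.
apollo_mail = retrieve(emails, phrase="apollo")

# Everything between two people since a given date.
pair_mail = retrieve(
    emails,
    participants=("Stephen Smith", "Jane Collins"),
    since=date(2022, 1, 1),
)
```

Both queries return results that span the old and new mail systems, and neither query names a system. Doing the same thing with backups would mean one restore per system per date, followed by searching the restored data yourself.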
Tying It All Together
Both backups and archives must be separate copies and stored separately from the original. Backups are for restoring a single resource (server, file, VM, application) to a single point in time, which is usually yesterday or a few hours ago. Archives are for finding a collective group of information created on a variety of resources over a period of time, usually one that ended a very long time ago. While there’s no reason you can’t accomplish both backup and archive purposes with a single copy of data, that rarely happens. Typically, backup software makes backups and archive software makes archives.