Backup, disaster recovery and how to avoid, control and reduce data loss
Database backups are often thought of as a checkbox item, to be ticked off without much thought. This article aims to clarify what a backup is and which defences are available against which threats, helping you build a strategy that stands up to the unexpected.
A backup is, in its simplest form, a copy of a database at a point in time, nothing more, nothing less. It's worth noting that a database backup can, but does not always, include all aspects of a database. On its own, a single backup is neither a backup strategy nor a disaster recovery strategy.
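As an illustration of a point-in-time copy, here is a minimal sketch using Python's built-in sqlite3 module and its online backup API; server databases offer their own equivalents (pg_dump for PostgreSQL, mysqldump for MySQL). The function name and paths are hypothetical.

```python
import sqlite3

def backup_database(source_path: str, dest_path: str) -> None:
    """Copy a live SQLite database to dest_path as a consistent snapshot."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(dest_path)
    try:
        with dst:
            src.backup(dst)  # online backup: safe even while src is in use
    finally:
        src.close()
        dst.close()
```

The result is exactly what the definition above says: a copy of the database as it stood at one moment, and nothing more.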
What is a database?
A database is more than an application such as PostgreSQL or MySQL: it usually consists of stored data that is entered, manipulated and presented via one or more applications. Uploaded data, document or image files are usually recorded in the database but stored on a file system. Each of these parts, including its configuration, links to the others to form a whole. Although the focus is mainly on the database application, because that's where the data lives, all the constituent parts need to be considered so that, if recovery is ever required, they are restored in sync, and that can occasionally involve processes carried out by end users.
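To make the "restored in sync" point concrete, here is a minimal sketch (assuming an SQLite database and a local uploads directory, both hypothetical) that snapshots the database and its file store together under one timestamped folder, so neither half can drift from the other:

```python
import shutil
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

def snapshot(db_path: str, files_dir: str, backup_root: str) -> Path:
    """Copy the database and its uploaded-files directory into one
    timestamped folder, so both halves can be restored in sync."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(backup_root) / stamp
    dest.mkdir(parents=True)
    # Consistent copy of the database itself.
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(str(dest / "data.db"))
    with dst:
        src.backup(dst)
    src.close()
    dst.close()
    # The files that database rows refer to.
    shutil.copytree(files_dir, dest / "files")
    return dest
```

Restoring from any one folder then brings back rows and files from the same moment, rather than a database that references uploads which no longer exist.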
What are the costs of downtime / loss of data?
Downtime is a loss of opportunity: you can't make a sale, carry out a task, or check something. Employees may be unable to work, and there may be claims against an SLA.
Data loss can be much more serious, for example an account balance being wrong, or tickets that have been paid for never being issued.
From reputational damage to reimbursing actual losses, and from lost earnings to the costs of investigation and repair, these costs can be significant.
If every second counts, then an online continual backup, also known as replication, may be a solution; if it's essential to have almost no downtime, then a hot standby with failover may be necessary.
If a data centre goes out of service, would it be necessary to restore operations at another location? This is perhaps less important for well provisioned cloud services, but essential for others.
How much can you afford to lose?
Each option to protect a system against loss and downtime has a cost, some eye-wateringly expensive. For example, losing a few transactions might be embarrassing and cost revenue, but missing a publishing deadline at the wrong time could dent credibility and future sales.
Time can be important. If an incident is discovered that happened some weeks or months earlier, would database transaction logs, application record changes or both be required?
Backups and logs take up space; online replication also uses processing power, and failover adds complication.
What are the main threats?
Users make mistakes all the time, but it's rare for these to require a data restore in a well designed app. Instead of deleting data, such an app merely marks it as deleted, and a scheduled task physically deletes it after a suitable time has elapsed. This works like a multi-level recycle bin. For example, when emails are deleted in dbmail they are first flagged as deleted so users no longer see them; a separate process, run daily, weekly or monthly as required, then escalates those records, removes the physical records and finally recovers the disk space.
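The pattern can be sketched with a hypothetical messages table; this illustrates the multi-level recycle bin idea, not dbmail's actual schema:

```python
import sqlite3
import time

def soft_delete(con: sqlite3.Connection, msg_id: int) -> None:
    """Hide a message from users without physically removing it."""
    con.execute("UPDATE messages SET deleted_at = ? WHERE id = ?",
                (time.time(), msg_id))
    con.commit()

def purge(con: sqlite3.Connection, older_than_seconds: float) -> int:
    """The scheduled task: physically remove rows that were soft-deleted
    long enough ago. Returns the number of rows removed."""
    cutoff = time.time() - older_than_seconds
    cur = con.execute(
        "DELETE FROM messages "
        "WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,))
    con.commit()
    return cur.rowcount
```

Until the purge task runs, an accidental deletion is recoverable with a single UPDATE rather than a full restore.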
Apps and the underlying data store are often under constant evolution. Although rare, it's not unknown for a whole database to be lost, forcing a resort to an earlier backup.
This is usually best avoided through a robust change mechanism using testing and staging environments. There are various techniques for ensuring an adequate division of responsibility, such as limiting database permissions for schema changes.
Although rare, either external or internal threats can adversely impact data. Limiting access and controlling what can be done is the best defence, but sometimes you need to identify when an attack happened and restore the database to a point in time before it.
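Restoring to a moment just before an attack can be sketched in miniature: start from a base backup and replay a change log only up to a cutoff. Real systems do this with transaction logs, for example PostgreSQL's WAL archiving; the log format below is invented purely for illustration.

```python
def restore_to_point_in_time(base_state, change_log, cutoff):
    """Rebuild state from a base backup plus logged changes up to cutoff.

    change_log is a time-ordered list of (timestamp, key, value) entries,
    where a value of None records a deletion.
    """
    state = dict(base_state)
    for ts, key, value in change_log:
        if ts > cutoff:
            break  # everything after the cutoff (the attack) is discarded
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state
```

The harder part in practice is the first argument to "cutoff": working out from logs exactly when the attack began.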
Some transactions are regulated and require a clear audit trail; most databases offer various levels of audit detail.
Performance and space limitations
As a database grows it takes up more disk space; if there are hard limits, this can cause the database to stop working or, in exceptional circumstances, become corrupt. Limited network throughput may cause excessive latency, with the potential to lose data. It's essential to monitor and identify trends before any issues impair or prevent a reliable backup.
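Trend monitoring can be as simple as projecting growth forward. This sketch (the sampling scheme is hypothetical) estimates how many days remain before a hard disk limit is reached, so an alarm can fire before backups start failing:

```python
def days_until_full(samples, capacity_bytes):
    """Estimate days until capacity is hit from (day, bytes_used) samples.

    Fits a straight line through the first and last samples; returns None
    if usage is flat or shrinking.
    """
    (d0, b0), (d1, b1) = samples[0], samples[-1]
    growth_per_day = (b1 - b0) / (d1 - d0)
    if growth_per_day <= 0:
        return None  # not growing; no projected exhaustion
    return (capacity_bytes - b1) / growth_per_day
```

For example, growing from 100 to 200 bytes over ten days against a 500-byte limit projects 30 days of headroom.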
Hardware faults and power outages
These are often the easiest to fix, by simply restoring the data and the application. It's worth noting that configuration matters: data locations are often an essential part of a database cluster.
Testing, testing, testing!
There can be no certainty until it has been tested. I've heard of too many instances where a backup failed silently and was ignored because no one had tested a restore.
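An automated restore test can be small. This sketch (the function name and the idea of comparing per-table row counts are illustrative, not a standard tool) opens a backup file alongside the source database and checks that the two agree:

```python
import sqlite3

def verify_restore(source_path, backup_path, tables):
    """Open the backup alongside the source and compare row counts per
    table; a silently failed backup shows up as a mismatch or an error."""
    source = sqlite3.connect(source_path)
    restored = sqlite3.connect(backup_path)
    try:
        for table in tables:
            # Table names come from a trusted, hard-coded list, not user input.
            query = f"SELECT COUNT(*) FROM {table}"
            if (source.execute(query).fetchone()[0]
                    != restored.execute(query).fetchone()[0]):
                return False
        return True
    finally:
        source.close()
        restored.close()
```

Row counts are a crude check; in practice you might compare checksums or key aggregates, and run the whole test on a schedule so a failed backup is noticed within a day rather than at the worst possible moment.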
Managing the process
Users and managers are anxious and want answers; the ability to state what is being done, and to quantify how long it is likely to take, is essential for any help desk and, notwithstanding the incident itself, offers comfort in times of stress. Testing provides both the numbers and the processes to ensure a restore happens with the minimum of stress should it ever be needed.
Following this article, you should be better able to identify, construct and deploy your chosen backup and disaster recovery strategy. It has offered insights and good practices for identifying what needs protecting and for weighing both quantifiable and ephemeral business risks, leading to avoiding, controlling and reducing downtime and data loss. This is an essential aspect of any robust and resilient ecosystem, and as a system grows up and out, scale and robustness become more about coordination than individual tasks.