Ensuring that correct database backups are taken at regular intervals is critical for disaster recovery. The recent GitLab database incident re-emphasizes this fact. GitLab was very transparent about the incident and documented approaches for preventing such failures.
The preventive measures include monitoring that:
- A backup file is created every x interval: catches backups not being uploaded due to a backup script error or a scheduling error
- The size of the latest backup file is at least y bytes: catches an erroneous backup file uploaded due to a script error
In our case, DB backups are uploaded to Azure Blob Storage (similar to AWS S3), and Prometheus is used for monitoring.
High level design
- Run an exporter that exposes metrics such as `latest_file_timestamp` and `latest_file_size` for each blob container where backup files are uploaded
- Alert if `current_time - latest_file_timestamp > backup_interval` or `latest_file_size < expected_backup_file_size`
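The exporter side of this design can be sketched roughly as follows. This is a minimal illustration, not the actual exporter code: the `BlobInfo` record, the `container` label, and the function names are assumptions made for the example, and a real exporter would fill the blob list from the storage API instead of taking it as an argument.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BlobInfo:
    """Hypothetical record for one blob in a container."""
    name: str
    last_modified: float  # Unix timestamp in seconds
    size: int             # size in bytes


def latest_blob(blobs: Iterable[BlobInfo]) -> Optional[BlobInfo]:
    """Return the most recently modified blob, or None if there are none."""
    return max(blobs, key=lambda b: b.last_modified, default=None)


def render_metrics(container: str, blobs: Iterable[BlobInfo]) -> str:
    """Render both metrics in the Prometheus text exposition format."""
    newest = latest_blob(blobs)
    if newest is None:
        # Emit no samples for an empty container.
        return ""
    return (
        f'latest_file_timestamp{{container="{container}"}} {newest.last_modified}\n'
        f'latest_file_size{{container="{container}"}} {newest.size}\n'
    )
```

Only the newest blob matters for both metrics, so the exporter never needs to download backup contents; listing blob metadata is enough.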
As we couldn't find an existing exporter, we wrote prometheus-azure-blob-exporter to capture these metrics.
Alerts are defined as:
- Check that a backup is created every day
- Check that the latest backup file has a minimum size of 1 MB
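The logic of these two checks can be sketched in Python as below. In practice the conditions would live in Prometheus alerting rules, not in application code; the alert names `BackupNotCreated` and `BackupTooSmall` and the function name are hypothetical, and 1 MB is interpreted here as 1024 * 1024 bytes.

```python
import time
from typing import List, Optional

BACKUP_INTERVAL_SECONDS = 24 * 60 * 60  # a backup is expected every day
MIN_BACKUP_SIZE_BYTES = 1024 * 1024     # minimum size of 1 MB (assumed to mean 1 MiB)


def firing_alerts(
    latest_file_timestamp: float,
    latest_file_size: int,
    now: Optional[float] = None,
) -> List[str]:
    """Return the names of the alerts that would fire for one container."""
    now = time.time() if now is None else now
    alerts = []
    # The latest backup is older than the expected interval.
    if now - latest_file_timestamp > BACKUP_INTERVAL_SECONDS:
        alerts.append("BackupNotCreated")
    # The latest backup is suspiciously small.
    if latest_file_size < MIN_BACKUP_SIZE_BYTES:
        alerts.append("BackupTooSmall")
    return alerts
```

Passing `now` explicitly keeps the checks deterministic, which makes it easy to try different thresholds against recorded metric values.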
Please check out the GitHub repo for more details.