A major disaster recently hit the hosting provider OVH. A fire destroyed one complete datacenter and part of another, leaving many websites and online games without service.
Seeing outages like these, possibly caused by the lack of a contingency plan, I couldn’t help but wonder how so many sites had nothing in place to survive a disaster like this. I lost this website too, because it was hosted in one of the affected data centers. Luckily I had considered the possibility of losing all my data, so I had designed a disaster recovery plan. Thanks to it, my website was down for less than 24 hours, and it took that long mainly because of work, the time of the incident (early morning), and the fact that it is not a critical website. Had I been available when the disaster struck, the website would have been back online in just over an hour and a half. That includes purchasing the new VPS, installing the services, recovering the data and verifying everything.
That is why it is very important that you design a good contingency plan. Here are a few guidelines to help you design your own:
- Always think of the worst that could happen
- Design a contingency plan
- Think about the business continuity plan
- Try to reduce the possibility of disaster
Always think of the worst that could happen
Whenever you design a contingency plan for your website, it is very important to think about the worst that can happen. In my case the worst that could happen was losing everything, so I designed my contingency plan around that.
Design a contingency plan
Once you have thought about the worst thing that can happen to you, it is time to think about the following points:
- What data is important? Think about what data in your application is important to keep. In my case it is the configurations, the web data and of course, the database.
- What is the best method to do the backup? In my case the website is small and the database is tiny, so a simple script has been enough. Think carefully about whether that would work for you, or whether you need more professional methods, such as snapshots, programs like Bareos …
- Always do backups offsite. If you assume the worst, the worst is always the loss of all your data. This can happen due to file system corruption, irrecoverable server damage, fire … Therefore, it is VERY IMPORTANT that you store your backups outside the server, in another availability zone, and if possible in another country and with another provider. For example, in my case the server was with OVH in France, so I decided to store the backups on AWS in Ireland. Because my website is light, the cost is very low (less than €1 for 40 days of backups).
- If you can, save more than one copy in multiple places. This reduces the chances that Murphy’s Law strikes and the backup data is also inaccessible when you need it.
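To make the points above more concrete, here is a minimal sketch of what a "simple script" for a small site could look like. It is not the script used for this website. The function names, the `site` prefix and the directory layout are assumptions made for illustration, and the 40-day retention simply mirrors the figure mentioned above. The offsite upload is deliberately left out, because it depends on the provider and tooling you choose.

```python
import tarfile
import time
from datetime import datetime, timezone
from pathlib import Path


def create_backup(sources, dest_dir, prefix="site"):
    """Pack the given files/directories into one timestamped .tar.gz archive."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest / f"{prefix}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for src in sources:
            src = Path(src)
            # Store each source under its own name, not its absolute path.
            tar.add(src, arcname=src.name)
    return archive


def prune_old_backups(dest_dir, keep_days=40, prefix="site", now=None):
    """Delete archives older than keep_days and return the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400
    removed = []
    for archive in sorted(Path(dest_dir).glob(f"{prefix}-*.tar.gz")):
        if archive.stat().st_mtime < cutoff:
            archive.unlink()
            removed.append(archive)
    return removed
```

A scheduled job could call `create_backup` with the configuration directory, the web root and a database dump, copy the resulting archive to storage at a different provider, and then call `prune_old_backups`. An archive that only lives on the same server does not count as a backup for the purposes of this plan.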
Think about the business continuity plan
Thinking about the continuity plan is just as important as thinking about the contingency plan, so take it into account while designing the latter. Once a disaster has occurred, downloading a file from a regional S3 bucket is not the same as requesting a retrieval from Glacier. The regional S3 download is immediate, while the retrieval from Glacier can take up to 12 hours. That is why it is good to keep the following in mind:
- Think about how you are going to recover the data after the disaster has occurred.
- Choose storage that helps reduce recovery time. For example, disk snapshots are quick to restore.
- Run recovery tests periodically to make sure the plan actually works.
- Automate as much as possible: infrastructure templates with tools such as Terraform, and/or server configuration with Salt or Ansible. This will help get the servers up and running as quickly as possible and with the correct settings. Thanks to Terraform and Salt I was able to reinstall an Elasticsearch server in less than 15 minutes.
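A periodic recovery test does not have to be elaborate. The sketch below restores an archive into a scratch directory and compares a few files against known checksums. It assumes the backup is a tar archive you created yourself, and that you keep a small manifest of SHA-256 digests. The function names and the manifest format are hypothetical, chosen only for this example.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(archive_path, expected):
    """Restore an archive into a scratch directory and check its contents.

    `expected` maps a path inside the archive to its SHA-256 digest.
    Returns a list of problems; an empty list means the test passed.
    """
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        # Only extract archives you created yourself: extraction trusts
        # the member paths stored in the archive.
        with tarfile.open(archive_path) as tar:
            tar.extractall(scratch)
        for rel_path, want in expected.items():
            restored = Path(scratch) / rel_path
            if not restored.is_file():
                problems.append(f"missing: {rel_path}")
            elif sha256_of(restored) != want:
                problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Running a check like this on a schedule, against an archive downloaded from the offsite location rather than a local copy, also gives you a realistic measurement of how long the download and restore take.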
Try to reduce the possibility of disaster. Try not to need the contingency plan
Of course, it is much better to avoid a disaster than to fix it. That is why, if you have the resources and the money, it is very important to build disaster-resistant infrastructure wherever possible:
- Use managed high-availability (HA) services deployed across multiple regions.
- If you have no choice but to run your services on IaaS, spread them across multiple Availability Zones, and across multiple regions if possible. That way, a disaster in one of the data centers is unlikely to affect the service.
- Go for large providers like GCP or AWS. They are more expensive, but the guarantees they give are also stronger. In my case, since this is a personal blog and I cannot invest a lot of money, I have had to lean towards low-cost hosting providers like OVH. This limits performance and leaves you exposed to disasters like this one. In addition, although the SLAs of these providers look similar on paper, in practice they cause many more problems (believe me, we have suffered this on other occasions).
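The reasoning behind spreading services across several locations can be sketched with a little arithmetic. The function below is only an illustration. It assumes failures at each location are independent, which real outages do not always respect, and the 1% figure is invented for the example rather than taken from any provider.

```python
def all_sites_down_probability(p_single, sites):
    """Chance that every one of `sites` independent locations is down at once."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # Independent events: multiply the individual probabilities.
    return p_single ** sites


if __name__ == "__main__":
    # Illustrative value only: a 1% chance that a given location is unavailable.
    for sites in (1, 2, 3):
        print(sites, all_sites_down_probability(0.01, sites))
```

Under that assumption, each additional location multiplies the chance of a total outage by the single-site failure probability. Two data centers on the same campus may fail together, so the independence assumption matters as much as the number of locations.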
I hope these tips help you, and don’t end up like Rust, losing your valuable data! All the best.