Operating Dev

Blurring the line between software development and operations

When a disaster hits – rising from the ashes

Image from http://www.eci.com/blog/77-hedge-fund-disaster-recovery-requirements-part-one.html

So far in the Disaster Recovery series I have discussed the importance of DR planning for building resilient organizations regardless of size and described the steps needed for building a basic DR plan using the cloud. This article will complete the series by looking at the process of recovery and the thoughts that go into proactive planning and recovery actions.

The most critical part of DR, and the hardest to implement, is the procedure to rebuild your infrastructure in a different location should a disaster hit. As you can imagine, the cost of a secondary infrastructure in a traditional data centre is prohibitive enough that many companies never even start thinking about DR.

Not only do they need to provision enough hardware to serve as a failover target in a disaster situation (it costs a fortune and takes a long time to acquire), they also need to keep the systems running on that hardware up to date with all of the latest system updates, software versions, etc. Not to mention the need to keep adding hardware to match your primary capacity as you grow and scale your system, if you choose to keep a secondary location ready at all times.

Choosing to implement your secondary location in the cloud addresses the upfront cost: cloud infrastructure vendors only charge for the resources you use (typically through hourly rates, storage allocated per month, etc.). Before you start thinking about all of the cloud servers and storage you will need to set up your DR location there, it is important to note one of the inherent values of the cloud that can significantly reduce the cost of keeping a secondary location on standby as part of your DR strategy: the ability to provision cloud resources like servers or storage in minutes instead of the weeks or months typical of the traditional hardware world.

This means that by focusing your efforts on building images (snapshots or templates) of your servers that can be quickly started when needed, instead of running cloud servers at all times, you can drastically reduce the cost of the secondary location: you only pay for the storage needed to keep those images (in the case of Amazon, these would be AMIs stored on S3).
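Building those images can itself be scripted as part of your regular operations. Here is a minimal sketch using boto3, the AWS SDK for Python; the instance ID and image name are hypothetical placeholders:

```python
# A minimal sketch of building a DR image with boto3; the instance ID and
# image name below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Register an AMI from a running production server. NoReboot=True skips the
# reboot AWS would otherwise perform, trading filesystem consistency for uptime.
response = ec2.create_image(
    InstanceId="i-0123456789abcdef0",   # hypothetical production server
    Name="web-frontend-dr-2024-01-15",  # hypothetical image name
    Description="DR image of the web frontend",
    NoReboot=True,
)
print("DR image registered:", response["ImageId"])
```

This leads me to the second most important question that should be answered before building a DR plan, after understanding your risk tolerance to data loss: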

How quickly do you need to restart your business after a disaster before you suffer consequences that would significantly impact your ability to meet your financial and customer satisfaction targets once you have recovered?

To give you an example of how knowing your risk tolerance to recovery time can influence your DR plan, let’s assume that you are OK if your systems come back online within 24 hours. In this case, assuming you are recovering your systems in the cloud, you will have ample time for your IT team to start all necessary servers from the images you kept for this purpose, recover the data from the backup already available in the cloud, and get everything going, as sketched below.
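At this recovery speed, the restart can be little more than a loop over the image IDs recorded in your DR runbook. A minimal sketch with boto3; the AMI IDs and instance type are hypothetical:

```python
# A minimal sketch of manual recovery from pre-built images; the AMI IDs and
# instance type are hypothetical and would come from your DR runbook.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

dr_images = ["ami-0aaaaaaaaaaaaaaaa", "ami-0bbbbbbbbbbbbbbbb"]  # hypothetical

instance_ids = []
for ami in dr_images:
    resp = ec2.run_instances(
        ImageId=ami,
        InstanceType="m5.large",  # match your primary sizing
        MinCount=1,
        MaxCount=1,
    )
    instance_ids.append(resp["Instances"][0]["InstanceId"])

# Block until the recovered servers are running before restoring data onto them.
ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)
print("Recovered instances:", instance_ids)
```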

If you’re one of the lucky few who can afford to spend a week or so recovering, you may even get away without creating images upfront. You will, however, need to keep a documented inventory of all servers in the primary location, with accurate information about software versions, updates, configuration, etc., so you can reproduce the same setup in the cloud. I would argue, though, that this is not a reasonable strategy, particularly if you are using the cloud as a DR target, and would strongly advise you to maintain images of your servers instead.

On the other hand, if your tolerance to recovery time is low and you need to be up and running in, say, under an hour or even in minutes, manual recovery from images may not work and you will need to automate the recovery process. This may involve running some or all of your servers in the DR location in parallel to the primary systems and applying any software or configuration change made to the primary servers to them as well, so that they remain an exact replica of the production systems. Keeping images in this case serves as an additional precaution, helping you quickly recover individual servers should any fail to work when you try to fail over your production to the DR location. The cutover itself is also worth automating, as in the sketch below.
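One common way to automate that cutover, assuming your users reach the systems by hostname, is to repoint DNS at the standby fleet. A minimal sketch with boto3 and Amazon Route 53; the hosted zone ID, record name, and standby address are hypothetical:

```python
# A minimal sketch of an automated DNS failover step; the hosted zone ID,
# record name, and standby IP address are hypothetical.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",  # hypothetical hosted zone
    ChangeBatch={
        "Comment": "Fail over to the DR location",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",
                "Type": "A",
                "TTL": 60,  # short TTL so clients pick up the change quickly
                "ResourceRecords": [{"Value": "203.0.113.10"}],  # DR entry point
            },
        }],
    },
)
```

A short TTL on the record keeps the switch fast, and Route 53 health checks can remove the manual trigger entirely by failing over automatically.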

This leads me to the final question that needs to be answered before creating a DR plan:

What is the minimum capacity you need upon initial recovery to run the critical functions of your business, and which sub-systems are needed for that?

Let’s imagine that you are currently running a legacy ERP system that is used by almost all departments in your company – e.g. 100 users in total. To support this, you’re running a terminal services solution with 5 TS host nodes serving the client app directly to the business users on their laptops via RDP or RemoteApp.

If many of those users are not critical, i.e. you mostly need the 5 business managers who provide reports to certain critical functions and must send their key reports every morning, then you can plan for a single TS host server in your DR location. Any additional users may have to wait for the primary location to come back online, if that is possible, or for extra recovery work to be done to add the capacity to handle them.

Now let’s imagine that you have to start letting more people onto the secondary system because the primary can’t be recovered or the recovery is slow. Again, the cloud offers value that is hard to match with traditional IT, as it makes it simple to scale by adding more server instances as needed. With a service like AWS Auto Scaling it is even possible to automatically add more host nodes when the load increases, improving the capacity of the DR location as needed, with no upfront capacity planning and no provisioning of spare hardware, as in the sketch below.
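A minimal sketch of this elastic capacity with boto3 and EC2 Auto Scaling; the group name, launch template, and subnet IDs are hypothetical:

```python
# A minimal sketch of elastic DR capacity; the group name, launch template,
# and subnet IDs are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Start small: one TS host for the critical users, with room to grow to five.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="dr-ts-hosts",
    LaunchTemplate={"LaunchTemplateName": "dr-ts-host", "Version": "$Latest"},
    MinSize=1,
    MaxSize=5,
    DesiredCapacity=1,
    VPCZoneIdentifier="subnet-0aaaaaaa,subnet-0bbbbbbb",  # hypothetical subnets
)

# Add hosts automatically when average CPU rises; remove them when it falls.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="dr-ts-hosts",
    PolicyName="scale-on-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```

Note that the same target-tracking policy scales the group back down when load drops, which ties into the cost point below.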

As you can see, by leveraging the cloud to implement your DR plan, you can significantly reduce the DR cost by not provisioning infrastructure until it is needed. You can do this by keeping images of all of your servers, or by keeping a smaller fleet of servers to handle the initial hit after a disaster and letting features like auto scaling handle further capacity increases as needed. (The cloud also offers the ability to scale down your fleet when the load decreases, reducing the overall cost of operations, which may entice some organizations to stay in their new home after the original one has turned to ashes.)

I hope you will take the time to think about the value of implementing a DR plan that uses the cloud for your organization. Before you leave this article, I would like to pose two more questions that are worth considering.

Would you ever need to go back after a disaster, and where?

If a disaster forces you to move into your cloud location, you may find that you prefer staying there and making the cloud your primary home. If you choose to move back to your original home, or to another non-cloud location, you need to consider how you will sync the data back.

Typically this is hard to plan, as you won’t know your target ahead of time, but luckily you will have more than a few minutes or hours to make the move. You may still want to test whether you can bring the data from the cloud into your current infrastructure, to confirm how easy or hard it is to do in general. A small sketch of such a test follows.
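Even a small scripted drill goes a long way here. A minimal sketch with boto3 that pulls a backup out of the cloud and verifies its integrity; the bucket, key, and recorded checksum are hypothetical:

```python
# A minimal sketch of a restore drill; the bucket, key, and recorded
# checksum are hypothetical.
import hashlib

import boto3

s3 = boto3.client("s3")
s3.download_file("my-dr-backups", "db/latest.dump", "/tmp/latest.dump")

# Compare against the digest recorded when the backup was taken.
expected = "9f86d081884c7d65..."  # hypothetical digest from your backup log
with open("/tmp/latest.dump", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("Restore test", "passed" if digest == expected else "FAILED")
```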

How much are you willing to pay to test your recovery procedure?

The worst thing that can happen to any DR plan is for the recovery to fail, leaving you with lost data and unable to restart your business unless you somehow get the primary systems online again. Many things can cause this outcome: from the mundane possibility of losing the key used to encrypt your data before sending it to the cloud, so you end up unable to recover it, to introducing subtle bugs into your recovered system by forgetting to apply the latest version of a software component, causing the data to be corrupted on use.

To prevent these and other failures, regular and thorough testing of the DR plan is recommended. Depending on your budget and appetite for risk, you may even go as far as performing planned failovers from your primary systems to your DR location and back, to ensure all systems are configured correctly and function properly.

Are you running DR in the cloud? Do you plan to implement it soon? I’d love to hear from you in the comments or at kima[at]operatingdev[dot]com.
