So NOW you want a Disaster Recovery Plan

Here at BuboWerks, we believe in risk-driven security, and when assessing risks, we attempt to look holistically at the risks businesses face. There is no point in spending big bucks on security controls to keep out nation state hackers when it’s more likely that some thug will use a brick to enter your office and walk off with your server with everything on it. Given that, the risk of natural disasters is always something we look at, and consequently one of our top recommendations is typically to develop a Business Continuity Plan (BCP) and Disaster Recovery Plan (DRP). Even in the wake of major disasters like 9/11, Katrina, and Sandy in the past couple of decades, organizations still drag their feet on developing such plans.

Now, granted, we have not called out a global pandemic as an example of the sort of disaster that organizations need to be prepared for, but here we are.

The nice thing about a good BCP/DRP is that the exact nature of the disaster does not matter. Ideally, you should have these plans before you find yourself in a disaster, but if you are suddenly realizing that you need to figure out your path forward this week, now is a great time to start! This post will explain what Business Continuity and Disaster Recovery Plans are, and how you can go about creating them. If you need any help either writing or executing your plan, let us know!

What is a Business Continuity Plan?

The Business Continuity Plan (BCP) is focused on what you need to do to keep your business running in the face of any event that would threaten to stop it. This event doesn’t just have to be a widespread disaster – say you have only one person who can unlock your office, or build your software – what happens when they are sick? What about your building losing power? We shouldn’t try to enumerate all the scenarios – the common theme is you cannot do work: No work, no revenue. Your BCP should enumerate the processes that are critical to keeping the business running, how you can keep them operating, and how quickly you need to react in the face of any interruption event to keep your business in business.

What is a Disaster Recovery Plan?

Your Disaster Recovery Plan (DRP), by contrast, is focused on your IT systems. The DRP should be based on your BCP: it establishes which systems need to stay up or fail over, and how quickly, according to the requirements set out in the BCP.

Bringing the two together

In a large organization, both the BCP and DRP will be significant documents, and it makes sense to maintain them separately, especially if they are maintained by separate departments (such as operations for the BCP and IT for the DRP). Smaller organizations do not have such distinctions, and do not gain anything from the overhead of maintaining two separate documents, so we recommend a unified Business Continuity and Disaster Recovery Plan.

A simple Business Continuity and Disaster Recovery Template

  1. Who can declare a disaster? You probably do not want any rank-and-file employee to make this call for you. Large enterprises tend to have a Business Continuity Officer who makes the call; in most organizations, it should be the COO or CEO. You should also consider line of succession in case that individual is not reachable.
  2. How is the declaration of a disaster communicated to all available staff? Keep in mind that they may or may not be in the office, and various methods of communication (email, phone, carrier pigeon, et cetera) may be impacted by the disaster.
  3. What are your critical business processes? Typically, these are things that directly drive revenue generation.
  4. How long an outage can the business withstand for each process? Say you manufacture widgets: you may have a contract to deliver widgets within two weeks, and it takes one day to produce and one day to deliver, so the maximum downtime you can withstand in the widget-making process is 12 days. When you start thinking about processes in a downtime context, you might realize that you need to develop a new widget every year, and that takes a full year; but you can stand to be up to three months late before your customers start looking for other widget suppliers, so your maximum recovery time for your widget development process is three months.
    Activities that do not sound so critical like Accounts Payable and Recruiting become much more critical when looking at them from the standpoint of extended downtime.
  5. What resources does each process need? A resource could be a person, a machine, a computer, a raw material, data, documents, facilities, etc. Essentially, what are all the inputs to the process, and what does your business need to convert those inputs into the desired outputs? Ideally, these resources can be described by role or capability, such as “a staff member trained in the operation of the Fujinomic 5000”, as opposed to “Mary Kopp”, because it might be much harder to find another Mary Kopp if need be.
  6. How will you reestablish all the resources that you need for each process should one, some, or all those resources become unavailable? Consider all the important combinations of missing resources; for example, the need to obtain raw materials from a different source may not care about what staff you have working on the process, but it will matter if you have to move your facility to the other side of the continent. You may need multiple options for fulfilling a resource, for instance if you need to staff an alternative facility, your first choice may be to put your staff on a plane and fly them there, but if the planes aren’t flying, you might need an alternate staffing plan. How long can you use any of your alternate resources? A valid alternate staffing strategy may be to ask staff to work 60-hour weeks, but this is not sustainable from either a burnout or overtime pay perspective.
  7. Do you have all the sources you may need to tap signed up and on board? If your continuity event is affecting a broad area, then there will be contention for those resources. You want to have a contract in place (likely some sort of contingent use contract with a retainer) so that you go to the front of the line. This includes staffing firms, commercial real estate firms, alternate material suppliers, alternate data suppliers, and alternate service providers. Consider whether you want to keep those sources purely as backups, or start using them as a source today so that you have built-in diversity. Also, perform any necessary due diligence on these suppliers, including Vendor Risk Management, and check for single points of failure (for instance the recent case of vinyl record producers depending on a single source for the lacquer used in masters).
  8. For each alternate resource of each business process, create subprocesses for transitioning to that resource. Make sure your resources have the resources they need: if you are putting your people on a plane, how will they eat, sleep, and keep clean clothes at the destination?
  9. For each of these transitioning subprocesses, calculate the lower and upper bounds for how long that process will take.
  10. When the subtotal of upper bounds for transitioning subprocesses exceeds the maximum downtime, establish intermediate processes to use until the backup resources are all in place. These are typically manual processes, such as taking orders by phone and writing them down on paper.
  11. Establish the conditions under which you will restore business using your regular resources. What will you do if your regular resources will not be available again?
  12. Establish the reverse procedures – how will you resume normal operations with your normal resources after operating off your alternate resources?
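The arithmetic behind items 4, 9, and 10 can be sketched in a few lines of Python. This is only an illustration: the widget numbers come from item 4, while the function names, the three transition steps, and their durations are hypothetical values made up for the example. It also assumes the transition steps run one after another, so their durations simply add up.

```python
from dataclasses import dataclass


def max_downtime(deadline_days: float, produce_days: float, deliver_days: float) -> float:
    """Longest outage that still lets you meet the delivery commitment."""
    return deadline_days - produce_days - deliver_days


@dataclass
class TransitionStep:
    name: str
    lower_days: float  # best-case duration
    upper_days: float  # worst-case duration


def needs_intermediate_process(steps: list[TransitionStep], allowed_days: float) -> bool:
    """True when the summed worst-case durations exceed the allowed downtime."""
    return sum(s.upper_days for s in steps) > allowed_days


# Widget example from item 4: two-week contract, one day to produce, one to deliver.
widget_limit = max_downtime(deadline_days=14, produce_days=1, deliver_days=1)

# Hypothetical transition steps for moving production to an alternate facility.
steps = [
    TransitionStep("activate alternate facility lease", 1, 3),
    TransitionStep("relocate trained staff", 2, 5),
    TransitionStep("source raw materials from backup supplier", 3, 7),
]

best_case = sum(s.lower_days for s in steps)
worst_case = sum(s.upper_days for s in steps)

print(f"Allowed downtime: {widget_limit} days")
print(f"Transition time: {best_case} to {worst_case} days")
print("Intermediate process needed:", needs_intermediate_process(steps, widget_limit))
```

With these made-up numbers the worst case (15 days) exceeds the 12 days allowed, so an intermediate manual process would be needed. If some steps can run in parallel, the real bound would be shorter than the simple sum used here.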

Now in a large organization, you would have an entirely separate Disaster Recovery Plan that would focus on how all your IT systems should be doing the above. A good metric as to whether your Disaster Recovery Plan merits a separate document is whether you have a CIO. Regardless of whether it is captured in the same document or a separate document, the Disaster Recovery Plan should look at each of the IT assets (hardware and software) noted in the various resources and detail how each one will meet the necessary recovery times detailed in the transitioning subprocesses. There are four principal means for transitioning IT systems to alternate resources:

  • Hot Backups: Hot backups are where the alternate system can take over with no downtime. Two different ways of doing this are active-active and active-passive systems. Active-active systems are where both the primary and secondary systems process data on a day-to-day basis and either can take over all processing when the other fails. Active-passive systems are where one system processes data on a day-to-day basis; but the other can take over with no downtime if the primary fails.
  • Warm Backups: Warm backups are where the alternate system is all installed and ready to go, but someone must switch processing over to it in the event of a failure of the primary system. This is much easier to build and can provide service with a maximum downtime on the order of minutes to hours (depending on how many staff can perform the switchover and how quickly they can do it).
  • Cold Backups: Cold backups are where the alternate system is installed but must be powered up before processing is switched over to it. This process typically results in hours of downtime but can extend to days if the staff are not sufficiently prepared and encounter problems.
  • New builds: In this case, hardware and sometimes software must be procured, the entire system must be built (including the operating system, applications, and services), and all processing must then be transitioned onto it. This typically takes on the order of days and can extend to weeks, especially if the procurements take a long time or staff are not sufficiently prepared.
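One way to read the four options above is as a lookup from the downtime a process can tolerate to the least demanding approach that still meets it. The sketch below is a rough illustration, not a sizing tool: the hour figures are assumptions chosen to match the orders of magnitude in the list (minutes to hours, hours to days, days to weeks), the ordering from most to least demanding is likewise an assumption, and the function name is hypothetical.

```python
# (approach, assumed worst-case downtime in hours), ordered from the most
# demanding to build and run to the least. All figures are illustrative.
APPROACHES = [
    ("hot backup", 0),
    ("warm backup", 4),       # assumption for "minutes to hours"
    ("cold backup", 48),      # assumption for "hours, can extend to days"
    ("new build", 24 * 21),   # assumption for "days, can extend to weeks"
]


def least_demanding_approach(max_downtime_hours: float) -> str:
    """Return the least demanding approach whose assumed worst case fits the limit."""
    for name, worst_case_hours in reversed(APPROACHES):
        if worst_case_hours <= max_downtime_hours:
            return name
    return "hot backup"  # only reached for a negative limit


# A process that tolerates 12 days of downtime, versus one that tolerates 1 hour.
print(least_demanding_approach(12 * 24))
print(least_demanding_approach(1))
```

In practice you would replace the assumed figures with the upper bounds you measured when testing your own transitioning subprocesses.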

It is important to note that High Availability (HA) is not a guarantee of disaster preparedness. In particular, HA systems are often collocated (distributed HA systems are notoriously difficult to architect), so most events that render the primary unavailable (such as data center flooding, power outage, fire, et cetera) will render the secondary unavailable as well. Therefore, DRPs frequently utilize warm and cold backups, and new builds. The transitioning subprocesses for technical resources also tend to be far more detailed and technical than most business users are accustomed to.
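A quick way to spot the collocation problem is to list where each replica of a system actually lives. A minimal sketch, assuming you keep some inventory of systems and sites; the system names, site names, and function name here are all hypothetical.

```python
# Hypothetical inventory: system name -> site hosting each of its replicas.
REPLICA_SITES = {
    "order-db": ["dc-east", "dc-east"],       # HA pair in one data center
    "web-frontend": ["dc-east", "dc-west"],   # replicas in two sites
}


def single_site_systems(inventory: dict[str, list[str]]) -> list[str]:
    """Systems whose replicas all share one site, so a site-wide event takes out all of them."""
    return sorted(name for name, sites in inventory.items() if len(set(sites)) < 2)


print(single_site_systems(REPLICA_SITES))
```

Any system this flags is highly available against a single server failure but not against a flood, fire, or power outage at that site, so it still needs a warm backup, cold backup, or new-build plan elsewhere.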

Considerations of a Pandemic Scale

Most disasters will simply make a geographic swath of resources unavailable. Pandemics are interesting because the primary mitigation is ensuring that our staffing resources are not in close proximity to each other, and some of those staffing resources may become unavailable as they or their loved ones succumb to illness. Simply telling everyone to work from home and connecting them with VPN and teleconferencing software sounds like an easy solution, but one must consider how your critical business processes are structured to understand how well that will work. Additionally, you need to consider the infrastructure and security ramifications of such widespread access.

Final Thoughts

It is always better to be prepared, but even when you are not, it is better to have a plan than rush in headfirst and hope for the best. Since business continuity is something you hopefully do not have to deal with every day, we here at BuboWerks can assist in several ways:

  • Preparing your Business Continuity and Disaster Recovery Plans
  • Performing Vendor Risk Assessments of the vendors you are utilizing for business continuity and disaster recovery – for example, every remote access vendor uses language like “military grade encryption”, but there are plenty of ciphers used by the military ten years ago that are considered weak today – we understand the strength of the underlying ciphers to make informed assessments of vendor claims
  • Designing security controls for your remote access systems; for example, you may wish to consider the use of remote desktop software over VPNs, limiting file transfers, limiting access to only corporate managed machines, and cloud access security brokers
  • Implementing those security controls, usually working in collaboration with your IT staff to ensure that your controls are installed and properly configured
  • Testing your plan – you do not want to find out in an actual disaster that part of your business continuity plan doesn’t work as designed; if your plan is properly structured, exercising it should not have a significant business impact

If we can help with your immediate needs, just reach out and one of us will get back to you shortly.
