Disaster Recovery and Business Continuity Management

A Business Continuity Plan (BCP) is the programme for maintaining normal operations as effectively as possible whilst an incident is being resolved. When an incident disrupts normal operations beyond reasonable or rapid repair, the BCP is superseded by Disaster Recovery (DR). The key difference between BCP and DR is one of scale: both the scale of disruption each addresses and the cost of implementing the plans invoked. Both, however, must be firmly based upon the risk assessment (RA) and the business impact analysis (BIA).

It is not necessary to plan for every possible event; what matters is the consequences of events. For example, several different events may lead to the unavailability of a building (e.g. fire, flood, bomb). In response to an actual event, those responsible must make the ‘invocation decision’: which aspects, if any, of the DR plan to implement. Implementation must be supported by a substantial awareness and training programme, along with well-maintained and readily available documentation.
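The consequence-based approach can be pictured as a two-step lookup: many events map to one consequence, and a plan is written per consequence rather than per event. The following is a minimal sketch only; the consequence label, the plan text, and the function name `plan_for` are hypothetical and chosen purely for illustration.

```python
# Hypothetical mapping from events to the consequence they lead to.
EVENT_CONSEQUENCE = {
    "fire": "building_unavailable",
    "flood": "building_unavailable",
    "bomb": "building_unavailable",
}

# One response plan per consequence, not per event (plan text is assumed).
RESPONSE_PLAN = {
    "building_unavailable": "relocate staff to the off-site facility",
}


def plan_for(event: str) -> str:
    """Find the response via the consequence the event leads to."""
    consequence = EVENT_CONSEQUENCE[event]
    return RESPONSE_PLAN[consequence]


for event in ("fire", "flood", "bomb"):
    print(event, "->", plan_for(event))
```

Three events are covered by a single plan, which is the point of planning around consequences.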

Resilience and Redundancy

Risk avoidance in the context of BCP/DR can be achieved through the related concepts of resilience and redundancy.

Resilience

Resilience involves reducing single points of failure in a system. In a microservices context this is usually called ‘replication’ and involves duplicating system units that are all in continual use. The duplication imposes a financial cost, which should be balanced against the impact of failure of the system as a whole.
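One possible way to frame the balance between replication cost and failure impact is to compare the yearly cost of replication with the expected yearly loss it avoids. The sketch below assumes that framing; the function name, its parameters, and all figures are hypothetical and would in practice come from the RA and BIA.

```python
def replication_worthwhile(annual_replication_cost: float,
                           failure_impact: float,
                           failures_per_year_without: float,
                           failures_per_year_with: float) -> bool:
    """Return True if the expected loss avoided exceeds the replication cost.

    All inputs are illustrative assumptions, not measured values.
    """
    expected_loss_without = failure_impact * failures_per_year_without
    expected_loss_with = failure_impact * failures_per_year_with
    loss_avoided = expected_loss_without - expected_loss_with
    return loss_avoided > annual_replication_cost


# Made-up figures: an impact of 500,000 per failure, with the expected
# failure rate falling from 0.1 to 0.01 per year once units are replicated.
print(replication_worthwhile(20_000, 500_000, 0.1, 0.01))
print(replication_worthwhile(60_000, 500_000, 0.1, 0.01))
```

With these assumed figures the expected loss avoided is 45,000 per year, so replication costing 20,000 per year is justified and replication costing 60,000 per year is not.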

Redundancy

Redundancy differs from resilience in that the entire system is duplicated and kept as an unused backup to the primary system. The standby system may be ‘cold’ (not configured and requiring time to bring up), ‘warm’ (partially configured and partially populated), or ‘hot’ (fully configured and populated). Hot standby systems in particular may be ‘highly available’, in that they can be brought into service almost instantaneously.

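Data population of redundant systems can be achieved asynchronously or synchronously. Asynchronous population offers better performance, since the primary does not wait for acknowledgement from the standby, but it reduces assurance that the data in the standby is complete and current. Synchronous population provides that assurance at the cost of waiting for each acknowledgement.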
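The trade-off between the two modes can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not a real replication protocol: the classes `Primary` and `Standby`, the `flush` step, and the record names are all hypothetical.

```python
from collections import deque


class Standby:
    """Stand-in for the data store of the redundant system."""

    def __init__(self):
        self.records = []

    def apply(self, record) -> bool:
        self.records.append(record)
        return True  # acknowledgement returned to the primary


class Primary:
    def __init__(self, standby: Standby, synchronous: bool):
        self.standby = standby
        self.synchronous = synchronous
        self.records = []
        self.pending = deque()  # writes not yet shipped (asynchronous mode)

    def write(self, record) -> None:
        self.records.append(record)
        if self.synchronous:
            # Wait for the standby to acknowledge before returning.
            if not self.standby.apply(record):
                raise RuntimeError("standby did not acknowledge the write")
        else:
            # Return immediately; the record is shipped later by flush().
            self.pending.append(record)

    def flush(self) -> None:
        """Ship queued records to the standby (asynchronous mode)."""
        while self.pending:
            self.standby.apply(self.pending.popleft())


def unreplicated_after_failure(synchronous: bool) -> int:
    """Count records missing from the standby if the primary fails
    after three writes and before any flush has run."""
    standby = Standby()
    primary = Primary(standby, synchronous)
    for i in range(3):
        primary.write(f"txn-{i}")
    return len(primary.records) - len(standby.records)


print(unreplicated_after_failure(synchronous=True))   # 0
print(unreplicated_after_failure(synchronous=False))  # 3
```

In the synchronous case every write reaches the standby before the caller continues, so nothing is lost. In the asynchronous case the writes return sooner, but any records still queued when the primary fails are absent from the standby.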

Documentation, Maintenance, and Testing

Documentation is vital to successful implementation. Loss of access (to a building or other resource) is often overlooked, so BCP/DR documentation may need to be held securely off-site. The choice of off-site location should account for possible cordons imposed by authorities in response to events. Business-critical assets may also need to be accessible off-site. Overall, documentation:

  • must provide clear understanding and expectation of the response to an event

  • must ensure all involved share the same understanding

  • must be available to those that need it, when they need it, and in a convenient location and format

  • must be well-maintained

  • must be supported by regular testing

Routines for testing and checking the details of plans must be comprehensive but must also stay within acceptable levels of disruption, as informed by the relevant findings of the BIA.

A typical first stage of testing is a ‘desk check’: key stakeholders conduct a roleplay of the plan to identify major issues.

The next stage may be a limited walkthrough, carried out in manageable chunks, which can identify implementation issues (e.g. what is a reasonable time-frame for evacuating all staff?). These walkthroughs also offer staff training opportunities.

The next stage may be a full-scale enactment of the plans. To be effective, the enactment must be coordinated by a central control organization, the ‘incident management team’ (IMT). The IMT would also control a real implementation of the BCP/DR. The central control facility may need to be established off-site at short notice (for example, if the primary site becomes unavailable).

One method of simulating possible emergency events is the ‘brown envelope’ technique: descriptions of events are revealed to staff at a time chosen by an outside controller, and staff must then respond to the described event. The exercise must be controlled by an outside, independent entity in order to elicit a realistic response. It is vital that the exercise properly assesses the appropriateness, effectiveness, and comprehensiveness of the response and planning. Safeguards should be in place to ensure that factors outside the control of the IMT (the weather, the response time of emergency services, etc.) are not ignored.

Involvement of Third Parties

In a DR invocation the involvement of suppliers is crucial, and so is their involvement in the DR planning and education process. Consequently, effective BCP/DR plans must consider third-party services and assets. Third-party contracts should detail provision in the event of an emergency (for example, the telephone company providing assurance of timely connection at the central control facility). The organization should check that third-party plans are effective and consistent with its own.