From Surf Wiki (app.surf) — the open knowledge base

IT disaster recovery

Maintaining or reestablishing vital information technology infrastructure

Summary

a sub-practice of business continuity planning (BCP)

IT disaster recovery (also simply disaster recovery, or DR) is the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or battle. DR employs policies, tools, and procedures with a focus on IT systems supporting critical business functions. This involves keeping all essential aspects of a business functioning despite significant disruptive events; it can therefore be considered a subset of business continuity (BC). DR assumes that the primary site is not immediately recoverable and restores data and services to a secondary site.

IT service continuity

IT service continuity (ITSC) is a subset of BCP.

Principles of backup sites

Main article: Backup site

Planning includes arranging for backup sites, whether they are "hot" (operating prior to a disaster), "warm" (ready to begin operating), or "cold" (requires substantial work to begin operating), and standby sites with hardware as needed for continuity.

In 2008, the British Standards Institution launched BS 25777, a standard supporting the business continuity standard BS 25999, specifically to align computer continuity with business continuity. It was withdrawn following the publication in March 2011 of ISO/IEC 27031, "Security techniques — Guidelines for information and communication technology readiness for business continuity."

ITIL has defined some of these terms.

Recovery Time Objective

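The Recovery Time Objective (RTO) is the targeted duration of time and service level within which a business process must be restored after a disruption in order to avoid a break in business continuity.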

According to business continuity planning methodology, the RTO is established during the business impact analysis (BIA) by the owner(s) of the process, including identifying time frames for alternate or manual workarounds.

RTO is a complement of RPO. The limits of acceptable or "tolerable" ITSC performance are measured by RTO and RPO in terms of time lost from normal business process functioning and data lost or not backed up during that period.

Recovery Time Actual

Recovery Time Actual (RTA) is the critical metric for business continuity and disaster recovery.

The business continuity group conducts timed rehearsals (or actuals), during which RTA is determined and refined as needed.
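As an illustration of how rehearsal results might be compared with the objective, the following Python sketch takes a list of measured recovery times and checks the slowest one against the RTO. The function name `check_rta_against_rto`, the sample durations, and the choice of the slowest rehearsal as the RTA are assumptions made for this example, not part of any standard.

```python
from datetime import timedelta


def check_rta_against_rto(rehearsal_times, rto):
    """Return (rta, within_objective) for a set of timed rehearsals.

    Assumption: the slowest rehearsal is taken as the RTA.
    """
    if not rehearsal_times:
        raise ValueError("at least one timed rehearsal is required")
    rta = max(rehearsal_times)
    return rta, rta <= rto


# Hypothetical rehearsal results and objective.
rehearsals = [
    timedelta(hours=3, minutes=40),
    timedelta(hours=4, minutes=15),
    timedelta(hours=2, minutes=55),
]
rto = timedelta(hours=4)

rta, within_objective = check_rta_against_rto(rehearsals, rto)
print(f"RTA: {rta}, within RTO: {within_objective}")
```

With these sample values the slowest rehearsal took 4 hours 15 minutes, so the 4-hour objective is not met and the plan would need refinement.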

Recovery Point Objective

A Recovery Point Objective (RPO) is the maximum acceptable interval during which transactional data is lost from an IT service.

For example, if RPO is measured in minutes, then in practice off-site mirrored backups must be continuously maintained, as a daily off-site backup will not suffice.

Relationship to RTO

A recovery that is not instantaneous restores transactional data over some interval without incurring significant risks or losses.

RPO measures the maximum period of time over which recent data might be permanently lost; it is not a direct measure of the quantity of data lost. For instance, if the BC plan is to restore up to the last available backup, then the RPO is the interval between such backups.

RPO is not determined by the existing backup regime. Instead BIA determines RPO for each service. When off-site data is required, the period during which data might be lost may start when backups are prepared, not when the backups are secured off-site.
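The relationship between a backup schedule and the RPO can be sketched in Python. This is a minimal sketch under stated assumptions: backups are prepared at a fixed interval, and when off-site data is required each backup takes a fixed additional time to be secured off-site. The function names and the sample schedules are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, offsite_lag=timedelta(0)):
    """Longest period of data that could be lost under a fixed schedule.

    If a disaster strikes just before the newest backup is secured
    off-site, the last usable copy was prepared one full interval
    plus the transfer lag earlier.
    """
    return backup_interval + offsite_lag


def satisfies_rpo(backup_interval, rpo, offsite_lag=timedelta(0)):
    """True if the schedule's worst-case loss stays within the RPO."""
    return worst_case_data_loss(backup_interval, offsite_lag) <= rpo


rpo = timedelta(minutes=15)  # hypothetical objective set by the BIA

# Daily backup shipped off-site six hours after it is prepared.
daily = satisfies_rpo(timedelta(days=1), rpo, offsite_lag=timedelta(hours=6))

# Replication every five minutes with a two-minute transfer lag.
frequent = satisfies_rpo(timedelta(minutes=5), rpo, offsite_lag=timedelta(minutes=2))

print(daily, frequent)
```

The daily schedule fails a 15-minute RPO by a wide margin, while the five-minute replication schedule satisfies it, which matches the point that an RPO measured in minutes calls for near-continuous off-site copies.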

Mean times

The recovery metrics can be used alongside, or related to, failure metrics. Common measurements include mean time between failures (MTBF), mean time to first failure (MTFF), mean time to repair (MTTR), and mean down time (MDT).
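One common way to relate these measurements is the steady-state availability ratio, MTBF / (MTBF + MTTR). The Python sketch below applies it; it assumes that MTBF is measured as the mean operating time between failures (excluding repair time), and the function name and sample figures are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # non-leap year, an assumption


def steady_state_availability(mtbf_hours, mttr_hours):
    """Fraction of time a system is expected to be operational."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_annual_downtime_hours(mtbf_hours, mttr_hours):
    """Expected hours of downtime per year at that availability."""
    unavailability = 1 - steady_state_availability(mtbf_hours, mttr_hours)
    return unavailability * HOURS_PER_YEAR


# Hypothetical system: fails on average every 1,000 operating hours
# and takes 2 hours to repair.
availability = steady_state_availability(1000, 2)
downtime = expected_annual_downtime_hours(1000, 2)
print(f"availability: {availability:.4%}, downtime: {downtime:.1f} h/year")
```

Shortening MTTR (for example through rehearsed recovery procedures) raises availability even when the failure rate is unchanged.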

Data synchronization points

A data synchronization point is the point in time at which a backup is completed.

System design

RTO and RPO must be balanced, taking business risk into account, along with other system design criteria.

RPO is tied to the times at which backups are secured offsite. Sending synchronous copies to an offsite mirror protects against most unforeseen events. The use of physical transportation for tapes (or other transportable media) is common. Recovery can be activated at a predetermined site. Shared offsite space and hardware complete the package.

For high volumes of high-value transaction data, hardware can be split across multiple sites.

History

Planning for disaster recovery and information technology (IT) developed in the mid to late 1970s as computer center managers began to recognize the dependence of their organizations on their computer systems.

At that time, most systems were batch-oriented mainframes. An offsite mainframe could be loaded from backup tapes pending recovery of the primary site; downtime was relatively less critical.

The disaster recovery industry developed to provide backup computer centers.

During the 1980s and 1990s, computing grew exponentially, including internal corporate timesharing, online data entry and real-time processing. Availability of IT systems became more important.

Regulatory agencies became involved; availability objectives of two, three, four, or five nines (99% to 99.999%) were often mandated, and high-availability solutions for hot-site facilities were sought.
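The "nines" shorthand can be translated into permitted downtime with simple arithmetic: n nines corresponds to an availability of 1 − 10⁻ⁿ. The following Python sketch performs the conversion; the function names are illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year, an assumption


def availability_from_nines(nines):
    """Availability fraction for a given number of nines (2 -> 0.99)."""
    return 1 - 10 ** (-nines)


def allowed_downtime_minutes(nines):
    """Minutes of downtime per year permitted at that availability."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


for n in (2, 3, 4, 5):
    print(
        f"{n} nines: {availability_from_nines(n):.3%} available, "
        f"{allowed_downtime_minutes(n):.2f} minutes of downtime per year"
    )
```

Two nines permit roughly 5,256 minutes (about 87.6 hours) of downtime per year, while five nines permit only about 5.3 minutes, which is why the higher targets generally require hot-site or high-availability designs.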

IT service continuity became essential as part of Business Continuity Management (BCM) and Information Security Management (ISM), as specified in ISO 22301 and ISO/IEC 27001 respectively.

The rise of cloud computing since 2010 created new opportunities for system resiliency. Service providers absorbed the responsibility for maintaining high service levels, including availability and reliability. They offered highly resilient network designs. Recovery as a Service (RaaS) is widely available and promoted by the Cloud Security Alliance.

Classification

Disasters can be the result of three broad categories of threats and hazards.

Natural hazards include acts of nature such as floods, hurricanes, tornadoes, earthquakes, and epidemics.
Technological hazards include accidents or the failures of systems and structures such as pipeline explosions, transportation accidents, utility disruptions, dam failures, and accidental hazardous material releases.
Human-caused threats include intentional acts such as active assailant attacks, chemical or biological attacks, cyber attacks against data or infrastructure, sabotage, and war.

Preparedness measures for all categories and types of disasters fall into the five mission areas of prevention, protection, mitigation, response, and recovery.

Planning

Research supports the idea that implementing a more holistic pre-disaster planning approach is more cost-effective. Every $1 spent on hazard mitigation (such as a disaster recovery plan) saves society $4 in response and recovery costs.

2015 disaster recovery statistics suggest that downtime lasting for one hour can cost

small companies $8,000,
mid-size organizations $74,000, and
large enterprises $700,000 or more.

As IT systems have become increasingly critical to the smooth operation of a company, and arguably the economy as a whole, the importance of ensuring the continued operation of those systems, and their rapid recovery, has increased.

Control measures

Control measures are steps or mechanisms that can reduce or eliminate threats. The choice of mechanisms is reflected in a disaster recovery plan (DRP).

Control measures can be classified as controls aimed at preventing an event from occurring, controls aimed at detecting or discovering unwanted events, and controls aimed at correcting or restoring the system after a disaster or an event.

These controls are documented and exercised regularly using so-called "DR tests".

Strategies

The disaster recovery strategy derives from the business continuity plan. Metrics for business processes are then mapped to systems and infrastructure. A cost-benefit analysis highlights which disaster recovery measures are appropriate. Different strategies make sense based on the cost of downtime compared to the cost of implementing a particular strategy.
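A cost-benefit comparison of this kind can be sketched in Python. The sketch below adds each strategy's yearly cost to its expected downtime loss and picks the cheapest total. The strategy names, their costs, and their expected downtime hours are hypothetical figures chosen for illustration; only the hourly downtime costs ($8,000 and $700,000) are taken from the statistics quoted in the Planning section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    annual_cost: float              # yearly cost of running the strategy
    expected_downtime_hours: float  # expected downtime per year


def total_annual_cost(strategy, downtime_cost_per_hour):
    """Implementation cost plus expected downtime loss for one year."""
    return (
        strategy.annual_cost
        + strategy.expected_downtime_hours * downtime_cost_per_hour
    )


def cheapest_strategy(strategies, downtime_cost_per_hour):
    """Strategy with the lowest combined yearly cost."""
    return min(
        strategies,
        key=lambda s: total_annual_cost(s, downtime_cost_per_hour),
    )


# Hypothetical options; all figures are illustrative assumptions.
options = [
    Strategy("tape off-site", annual_cost=10_000, expected_downtime_hours=24),
    Strategy("disk replication", annual_cost=60_000, expected_downtime_hours=4),
    Strategy("hot site", annual_cost=250_000, expected_downtime_hours=0.5),
]

small_company = cheapest_strategy(options, downtime_cost_per_hour=8_000)
large_enterprise = cheapest_strategy(options, downtime_cost_per_hour=700_000)
print(small_company.name, "|", large_enterprise.name)
```

Under these assumed figures, disk replication is the cheapest overall option for the small company, while the hot site is cheapest for the large enterprise, because the cost of each hour of downtime dominates as it grows.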

Common strategies include:

backups to tape and sent off-site
backups to disk on-site (copied to off-site disk) or off-site
replication of data off-site, so that only the systems need to be restored or synchronized, possibly via storage area network technology
private cloud solutions that replicate metadata (VMs, templates and disks) into the private cloud. Metadata are configured as an XML representation called Open Virtualization Format, and can be easily restored
hybrid cloud solutions that replicate both on-site and to off-site data centers. This provides instant fail-over to on-site hardware or to cloud data centers.
high availability systems which keep both the data and system replicated off-site, enabling continuous access to systems and data, even after a disaster (often associated with cloud storage).

Precautionary strategies may include:

local mirrors of systems and/or data and use of disk protection technology such as RAID
surge protectors to minimize the effect of power surges on delicate electronic equipment
use of an uninterruptible power supply (UPS) and/or backup generator to keep systems going in the event of a power failure
fire prevention/mitigation systems such as alarms and fire extinguishers
anti-virus software and other security measures.

Disaster recovery as a service

Main article: Recovery as a service

Disaster recovery as a service (DRaaS) is an arrangement with a third party vendor to perform some or all DR functions for scenarios such as power outages, equipment failures, cyber attacks, and natural disasters.

Disaster recovery for cloud systems

Following best practices can enhance disaster recovery strategy for cloud-hosted systems:

Flexibility: The disaster recovery strategy should be adaptable to support both partial failures (such as recovering specific files) and full environment failures.
Regular testing: Regular testing of the disaster recovery plan can verify its effectiveness and identify any weaknesses or gaps.
Clear roles and permissions: It should be clearly defined who is authorized to execute the disaster recovery plan, with separate access and permissions for these individuals. Implementing a clear separation of permissions between those who can execute the recovery and those who have access to backup data helps minimize the risk of unauthorized actions.
Documentation: The plan should be well-documented and easy-to-follow to ensure that operators can effectively follow it during stressful situations.

Wikipedia Source

This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page.
