Failover
What is Failover?
Failover is a Disaster Recovery (DR) capability that involves automatically and seamlessly shifting your applications and workloads from their normal production environment to a redundant secondary or “back-up” environment in the event of a service outage.
Modern enterprises host critical services, customer-facing applications, and databases on both on-prem servers and public cloud infrastructure. Unplanned interruptions to the availability of these resources can result in revenue loss, operational bottlenecks, and poor customer experience. A 2023 study estimated the average cost of unplanned downtime across industries at $9,000/minute.
Implementing failover processes allows enterprises to recover from cloud service interruptions or unexpected system failures and restore operations before revenue is lost or customers are negatively impacted.
The ultimate goal of implementing failover systems is to ensure high availability and reliability of critical applications and create a fault-tolerant system that can recover from unexpected crashes, outages, or failures with no interruptions to service.
Failover vs. Failback - What’s the Difference?
Both Failover and Failback processes can play an important role in your disaster recovery plan, so it’s important to understand how they’re different.
As we mentioned above, Failover is the process of switching your workloads to a redundant production environment at a back-up recovery facility when your primary environment fails.
In contrast, Failback is the process of switching your workloads back to the original production environment from your back-up environment after normal operations have been restored.
When a power outage or DDoS attack takes down your primary application server, the failover process helps you quickly get your application back online using infrastructure at your back-up recovery site. After you block the DDoS attack or restore power at your primary site, the failback process helps you restore your application back to the primary server and resume normal operations.
How Does Failover Work?
Most of the time, failover happens automatically; however, it can also be performed manually. The main drawback of manual failover is that any system that depends on human intervention to recover is inherently slower and less reliable.
Failover Requires a Redundant Back-up Server
The first step to implementing a failover system is to establish a redundant back-up or secondary server that can take over the functionality of the active primary server in case of an outage.
Failover servers can be hosted on-premises or in the cloud. On-premises failover is typically managed in-house, with the back-up server located in the same data center as the active server. In contrast, a back-up server for cloud-based failover might be hosted on the same cloud as the primary server, on a different public cloud, or at a DR site managed by a third party.
Failover Requires a Mechanism to Detect System Failure
For the failover process to be fully automated, there must be an automated way of detecting system failures. This is typically achieved by digitally monitoring the primary server and automatically communicating about its operational status and health to the back-up server.
Failover is Triggered Automatically
Automated monitoring systems detect deviations from normal operating conditions or the occurrence of predefined failure events, such as hardware failures, network issues, software crashes, etc. When a failure event is identified, failover proceeds based on the chosen failover configuration and processes.
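To make the detect-and-trigger loop above concrete, here is a minimal sketch in Python. The threshold, interval, and callback names are illustrative assumptions, not the API of any specific monitoring product; requiring several consecutive failed probes keeps a single transient blip from triggering a full failover.

```python
import time

FAILURE_THRESHOLD = 3   # consecutive failed probes before failing over
CHECK_INTERVAL = 5.0    # seconds between health probes

def monitor(probe, on_failover, interval=CHECK_INTERVAL, max_checks=None):
    """Poll the primary server via `probe()` (returns True when healthy).

    Calls `on_failover()` once FAILURE_THRESHOLD consecutive probes fail.
    Returns True if failover was triggered, False if monitoring stopped first.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0                # healthy again: reset the counter
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                on_failover()           # sustained failure: trigger failover
                return True
        time.sleep(interval)
    return False
```

In a real deployment, `probe` would hit the primary server's health endpoint, and `on_failover` would kick off the traffic rerouting described below.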
Two Failover Configurations You Should Know
There are two main configurations that enterprises can choose from when setting up their failover capabilities: Active-Active and Active-Passive.
Active-Active
An active-active failover configuration is one where the primary server and the secondary back-up server both actively and synchronously support the application workload, with a load balancer directing server requests between them according to a load-balancing algorithm.
If either of the active servers goes offline, the outage will be detected and the failover system will direct the load balancer to route traffic away from the disrupted node and to the server that remains operational.
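A minimal sketch of that active-active behavior: a round-robin balancer that simply drops an unhealthy node from rotation. The class and method names are hypothetical, not the API of a real load balancer.

```python
class LoadBalancer:
    """Round-robin over healthy nodes; failover removes a node from rotation."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._i = 0

    def mark_down(self, node):
        """Called by the failover system when a node's outage is detected."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """Return a recovered node to rotation."""
        self.healthy.add(node)

    def route(self):
        """Return the next healthy node to receive a request."""
        if not self.healthy:
            raise RuntimeError("no healthy nodes available")
        for _ in range(len(self.nodes)):
            node = self.nodes[self._i % len(self.nodes)]
            self._i += 1
            if node in self.healthy:
                return node
```

Before an outage, requests alternate between the two servers; after `mark_down("primary")`, every request lands on the surviving node.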
Active-Passive
An active-passive configuration is one where the primary server is actively supporting the application workload while the secondary back-up server is held in “standby” mode - not active, but ready and available to take over processing in case the active server becomes unavailable.
If the active server goes offline, the outage will be detected and the failover system will start routing network traffic to the standby server while the original active server is recovered to its normal operating state. From here, one of two things can happen:
- The standby server becomes the active server and the recovered server (previously the active server) becomes the standby server until the next failover.
- A failback process can be implemented to restore application workloads to the recovered server and restore the back-up server into standby mode until the next failover.
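The two post-recovery options above amount to a simple role swap, which can be sketched as follows. The class and node names are purely illustrative:

```python
class ActivePassivePair:
    """Active-passive pair: one node serves traffic, the other waits on standby."""

    def __init__(self, primary, backup):
        self.active = primary           # currently serving traffic
        self.standby = backup           # idle, ready to take over
        self._original_primary = primary

    def failover(self):
        """Promote the standby; the failed node rejoins as standby once recovered."""
        self.active, self.standby = self.standby, self.active

    def failback(self):
        """Option 2: restore the original primary to active after recovery."""
        if self.active != self._original_primary:
            self.active, self.standby = self.standby, self.active
```

Option 1 simply stops after `failover()`, leaving the roles swapped until the next outage; option 2 additionally runs `failback()` once the original primary is validated.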
Synchronizing Data and Rerouting Network Traffic
When a failover is triggered, the back-up server comes online and takes over the functionality of the failed primary server. Making this process seamless involves automatically synchronizing data between the primary and secondary systems, as well as rerouting network traffic away from the failed primary server and to the back-up server.
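A toy model of that sequence, with hypothetical names throughout (the `dns` dictionary stands in for a real DNS or load-balancer update, and the IPs are made up): writes are replicated synchronously to the backup, and traffic is rerouted only once the backup's data matches the primary's.

```python
dns = {"app.example.com": "10.0.0.1"}   # hostname -> IP of the active server
primary_data = {}                        # application state on the primary
secondary_data = {}                      # replicated state on the backup

def write(key, value):
    """Apply a write to the primary and synchronously replicate it."""
    primary_data[key] = value
    secondary_data[key] = value          # synchronous replication keeps RPO ~0

def fail_over(hostname, backup_ip):
    """Reroute traffic only once the backup holds the same data as the primary."""
    if secondary_data != primary_data:
        raise RuntimeError("backup out of sync; failing over would lose data")
    dns[hostname] = backup_ip            # clients now resolve to the backup
```

Real systems use database replication (synchronous or asynchronous) for the data step and DNS updates, virtual IPs, or load-balancer reconfiguration for the rerouting step.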
Validation and Testing
Once the back-up server is operational, automated validation and testing protocols are executed to ensure that it is properly configured and resourced to handle the necessary workloads and functions.
Monitoring, Recovery, and Failback
With the back-up server now handling processing and workloads, system administrators and disaster recovery teams can monitor the health of the failed primary server and work to restore its functionality. Once the primary server has been recovered, it can be placed on standby as a new back-up server or returned to service via failback processes.
Why is Failover Important?
With a proper failover, businesses can keep their downtime and data loss to a tolerable minimum and align with their Recovery Time Objective (RTO) and Recovery Point Objective (RPO). For every hour your organization experiences downtime, business continuity is at risk: you lose revenue, productivity, and brand trust that you may not be able to recover. The same goes for data loss. If systems go down at an inopportune moment, the consequences of losing that data could be irreversible.
Enabling Disaster Recovery and Business Continuity
Implementing failover processes allows enterprises to quickly and efficiently recover from both minor service interruptions (e.g. application overload, software failures, database errors, or network outages) and major disasters (e.g. security breaches, power outages, or environmental disasters) to ensure business continuity.
Avoiding Operational Downtime and Revenue Loss
Implementing automatic failover processes allows enterprises to minimize or eliminate the operational downtime and revenue loss that would normally result from an application crash or service outage.
Safeguarding Customer Experiences and Brand Value
Application, network, or server outages that result in unplanned operational downtime can negatively impact the customer experience, push your customers to search for more reliable alternatives, and damage the perception of your brand in the marketplace.
Implementing automatic failover systems to ensure high availability and reliability for customer-facing applications helps you safeguard the customer experience and preserve the value of your brand.
What is Failover Testing?
Failover Testing is a type of Disaster Recovery Testing that assesses how well a system can transition from its normal operational environment to the back-up server in the event of a service failure or disruption.
The goal of failover testing is to verify that critical services and application workloads can remain available and functional with minimal downtime in case of a service failure. Failover testing can include:
- Simulating failures to assess how the automated failover system responds,
- Validating back-up systems to verify that they can take over processing and operations when needed,
- Measuring how long it takes to recover operations to the back-up system in case of an outage,
- Verifying that data synchronization is executed completely and accurately during failover,
- Ensuring that back-up systems are sufficiently scalable when failover occurs during periods of high demand for the application or service, and/or
- Ensuring that network configuration changes designed to route traffic to the back-up server work as intended.
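One way to sketch such a test: inject a simulated primary failure, then measure whether the service recovers within the RTO budget. The `cluster` object here is a hypothetical stand-in; a real test would hit your application's actual endpoints.

```python
import time

class FakeCluster:
    """Toy stand-in for a real cluster: the service comes back up after a
    fixed number of health probes once the primary is killed."""

    def __init__(self, probes_until_up=3):
        self.remaining = probes_until_up

    def kill_primary(self):
        pass                                  # failure injection is a no-op here

    def service_is_up(self):
        self.remaining -= 1
        return self.remaining <= 0

def failover_test(cluster, max_rto_seconds=30.0):
    """Inject a primary failure and check recovery against the RTO budget."""
    cluster.kill_primary()                    # simulate the outage
    start = time.monotonic()
    while not cluster.service_is_up():
        if time.monotonic() - start > max_rto_seconds:
            return False                      # recovery blew the RTO budget
        time.sleep(0.01)
    return True
```

The same harness can be extended to cover the other checks listed above, such as verifying data synchronization after the switch or running the test under simulated peak load.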
Ensure Failover with Disaster Recovery Services
TierPoint offers managed cloud-to-cloud Disaster Recovery-as-a-Service (DRaaS), enabling rapid failover to hyperscale public cloud infrastructure (e.g. AWS or Azure), or to a TierPoint private or multitenant cloud environment.
Ready to learn more?
Book an intro call with TierPoint and discover how we can help you minimize the business impact of service outages and ensure resiliency for critical data, applications and infrastructure in the cloud.