High-Availability, Redundancy and Fail-Over

It’s never a matter of if a system goes down. It’s always a matter of when. (SysAdmin wisdom)

High Availability

Computational environments designed to function at near full-time availability are known as “high availability” systems. High availability is a characteristic of a system that aims to ensure an agreed-upon level of operational performance, usually uptime, for a higher-than-normal period. These systems are typically composed of redundant hardware and software that keep the system available even during failures. Well-designed high availability systems avoid single points of failure: every hardware or software component that can fail has an identical backup component.

When a failure occurs, the “failover” process redistributes the processing that had been performed by the failed component to its backup component. Failover remasters systemwide resources, recovers partial or failed transactions, and restores the system to its optimal state, preferably within microseconds of the onset of the failure. The more transparent failover is to users, the higher the system availability.

Learn more about high availability implementations in:

  • Class 5 Softswitch MOR
  • Class 4 Softswitch M4 SBC

Measuring Availability

Availability is generally presented as a percentage indicating how much uptime is expected from a particular system or component over a given period. A value of 100% would indicate that the system never fails. A system that guarantees 99% availability over a one-year period can have up to 3.65 days of downtime (1%). These values are calculated from multiple factors, including scheduled and unscheduled maintenance periods as well as the time needed to recover from a possible system failure. A frequently cited availability target in an SLA (Service Level Agreement) is 99.999% (“five nines”), which allows about 5.26 minutes of downtime per year.
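
The arithmetic behind these figures can be checked with a short Python sketch. The function name `allowed_downtime` and the 365-day year are assumptions made for illustration; an actual SLA defines its own measurement period and exclusions.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the maximum downtime an availability target permits over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = 1 - availability_percent / 100
    return period * unavailable_fraction


# Print the yearly downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime(target)
    print(f"{target}% -> {budget.total_seconds() / 60:.2f} minutes per year")
```

For a 365-day year, the 99% row corresponds to 3.65 days and the 99.999% row to roughly 5.26 minutes, matching the values quoted above.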

What is server redundancy?

Server redundancy refers to the presence of redundant backup servers that stand ready to take over if the primary server fails. Additional servers may also be brought online at runtime for backup, for load balancing, or to allow a primary server to be halted temporarily (e.g. for maintenance).

Server redundancy is implemented in an IT infrastructure where service availability is of critical importance (such as in telecommunications). To enable server redundancy, a replica server is designed with identical computing power, storage, applications, and other operational parameters.

Why is server redundancy important?

Issues arising from hardware failure, network problems, or application faults could cause your primary servers to stop performing correctly, leaving users unable to access services and posing a real threat to productivity.  Server redundancy also helps businesses by protecting crucial data, ensuring that data is backed up in multiple places. This allows the business to recover its data if the live server fails.

Downtime is the period of time when your system (or network) is not available for use or is unresponsive. Downtime can lead to significant losses for a company since all its services are put on hold when its systems are down.

What else should be redundant?

In addition to a redundant server, your infrastructure should be designed such that all major components are duplicated in case of emergencies, allowing maximum uptime.

Backups: Backups are deployed to ensure that data held locally is also stored elsewhere (on the cloud, or in another data bank in another physical location). This allows for quick restoration of data in the event of a disaster.

Disk drives: Hot spares should be available so that if a disk drive in a primary server fails, another drive can immediately run in its place. Using a RAID array should ensure that a server can keep running in the event of a single disk failure.

Power supplies: Redundant power supplies should be deployed on critical servers so that if the main power supply fails, the server still has a source of power and can continue to run.

Internet connectivity: If your server needs to have a connection to the Internet at all times, having a line from multiple service providers is important. If one line fails (e.g. if a workman severs a cable), traffic can shift over to an undamaged line.
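
To see why duplicating each component pays off, here is a rough sketch of the standard series/parallel availability formulas. It assumes that component failures are independent, which is an idealization (shared power or shared software bugs break it), and the 99% per-component figure is purely hypothetical.

```python
from math import prod


def series_availability(components):
    """All components are required: the system is up only if every one is up."""
    return prod(components)


def parallel_availability(component, copies):
    """Redundant copies: the system is down only if every copy is down."""
    return 1 - (1 - component) ** copies


# Hypothetical chain of server, power supply and Internet uplink, each 99% available.
single = series_availability([0.99, 0.99, 0.99])

# The same chain with every component duplicated.
redundant = series_availability([parallel_availability(0.99, 2)] * 3)

print(f"no redundancy:   {single:.6f}")
print(f"with duplicates: {redundant:.6f}")
```

Under these assumptions, a chain of single components is less available than any one of its parts, while duplicating each part pushes the whole chain above the availability of a single component.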

Geo-redundancy

Geo-redundancy means that redundant servers are located in a different data center that is connected to a different network.

Local server redundancy makes a service resilient to the failure of an individual server or of part of the local network. Geo-redundancy goes further and protects the service against the failure of an entire data center or of the network that connects it.

Failover

Failover describes the action of instantly switching to a backup server or network upon the failure of the primary server or network. The primary purpose of failover is to eliminate, or at least reduce, the impact on users when a system failure occurs.

Switchover

Failover and switchover are essentially the same operation, except that failover is automatic and usually happens without warning, while switchover requires human intervention. Switchover often occurs when an administrator wants to apply hardware or software updates or bug fixes, or to test features, on either the main or the backup system without interrupting connectivity for users.

Failback

Failback is the process of restoring a system, component, or service that had failed to its original working state, and returning the standby system from active duty to standby.
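
The relationship between failover, switchover and failback can be sketched as a small state model. The class `RedundantPair` and the node names are hypothetical and exist only for illustration; real cluster managers also handle data synchronization and health checks, which this sketch leaves out.

```python
from dataclasses import dataclass, field


@dataclass
class RedundantPair:
    primary: str
    standby: str
    active: str = field(init=False)
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Service starts on the primary node.
        self.active = self.primary

    def _activate(self, node, reason):
        self.history.append((reason, self.active, node))
        self.active = node

    def failover(self):
        """Automatic: triggered when the active primary is detected as failed."""
        if self.active == self.primary:
            self._activate(self.standby, "failover")

    def switchover(self):
        """Manual: an administrator deliberately moves service to the other node."""
        target = self.standby if self.active == self.primary else self.primary
        self._activate(target, "switchover")

    def failback(self):
        """Return service to the repaired primary; the standby resumes standby duty."""
        if self.active == self.standby:
            self._activate(self.primary, "failback")


pair = RedundantPair(primary="server-a", standby="server-b")
pair.failover()   # server-a fails, server-b takes over
pair.failback()   # server-a is repaired, service returns to it
print(pair.active, pair.history)
```

In this model the only difference between `failover` and `switchover` is who triggers the move, which mirrors the distinction drawn above.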

Heartbeat

At the server level, failover automation often incorporates a heartbeat system. This system connects two servers, either physically through a dedicated cable or remotely over a network, and the primary sends a regular pulse to the second server. As long as the pulse between the two servers continues uninterrupted, the second server stays on standby and does not go online.
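
A minimal sketch of the standby side of such a heartbeat might look like the following. The class name `HeartbeatMonitor`, the three-second timeout and the injectable clock are all assumptions for illustration; the transport that delivers the pulses is deliberately left out.

```python
import time


class HeartbeatMonitor:
    """Runs on the standby server and tracks heartbeats received from the primary."""

    def __init__(self, timeout_seconds=3.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_beat = clock()

    def record_heartbeat(self):
        """Call whenever a heartbeat message arrives from the primary."""
        self._last_beat = self._clock()

    def primary_is_alive(self):
        return self._clock() - self._last_beat <= self.timeout_seconds

    def should_take_over(self):
        """The standby goes online only once the heartbeat has been silent too long."""
        return not self.primary_is_alive()


# Demonstration with a fake clock so the example runs instantly.
now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=3.0, clock=lambda: now[0])

now[0] = 2.0
monitor.record_heartbeat()          # pulse received at t = 2 s
now[0] = 4.0
print(monitor.should_take_over())   # False: last pulse was 2 s ago
now[0] = 6.0
print(monitor.should_take_over())   # True: 4 s of silence exceeds the 3 s timeout
```

A timeout alone cannot tell a dead primary from a broken link between the two servers, so a real deployment also needs a safeguard against both servers becoming active at once.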

Is it worth the money?

A high availability architecture gives you more uptime, but it comes at a high cost, so you must decide whether it is justified from a financial point of view. Assess how damaging potential downtime would be for your company and how important your services are to running your business, then decide whether the extra uptime is worth the investment.
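
One rough way to frame that decision is to compare the expected yearly cost of downtime with the yearly cost of the high availability setup. The function names and every figure in the sketch below are hypothetical; substitute your own estimates.

```python
HOURS_PER_YEAR = 365 * 24


def expected_downtime_cost(availability_percent, cost_per_hour):
    """Expected yearly loss from downtime at a given availability level."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return downtime_hours * cost_per_hour


def ha_is_worth_it(current_percent, target_percent, cost_per_hour, yearly_ha_cost):
    """True if the downtime cost avoided exceeds what the HA setup costs per year."""
    savings = (expected_downtime_cost(current_percent, cost_per_hour)
               - expected_downtime_cost(target_percent, cost_per_hour))
    return savings > yearly_ha_cost


# Hypothetical example: moving from 99% to 99.9% availability,
# with downtime costing 1,000 per hour and the HA setup costing 50,000 per year.
print(ha_is_worth_it(99.0, 99.9, cost_per_hour=1000, yearly_ha_cost=50000))
```

This ignores harder-to-measure effects such as reputation damage or contractual penalties, so treat the result as a starting point for the assessment, not a verdict.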
