What is fault tolerance?

Fault tolerance is the property that enables a system to keep operating acceptably when some of its components fail. When a component fails, a fault-tolerant design lets the system continue to operate as intended, possibly at a reduced level, rather than fail completely*. A fault may be technical (a node stops responding, hardware fails, or a software bug surfaces), but it can also be the result of a malicious actor attacking the system.

While fault tolerance can be a property of individual machines, it can also characterize the way machines interact in a network. In distributed computing systems, for example, attacks and software errors are becoming more common and can cause nodes to behave incorrectly. This creates a need for fault-tolerant design, especially at financial institutions that operate many sensitive production systems.

Fault-tolerant design allows a system to maintain functionality, though sometimes at reduced quality**. In a fault-tolerant distributed computing system, for example, the failure of one component could result in reduced throughput or longer response times, but a partial failure does not stop the entire system.
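As a minimal sketch of this behavior, assume a hypothetical primary data source and a cached fallback (the names ServiceUnavailable, fetch_with_degradation, failing_primary, and cached_fallback are illustrative and do not come from any real library). The Python below keeps answering after the primary fails and flags the answer as degraded:

```python
from typing import Callable, Tuple


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) component that has failed."""


def fetch_with_degradation(
    primary: Callable[[], str],
    fallback: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (result, degraded); the fallback is used only if the primary fails."""
    try:
        return primary(), False
    except ServiceUnavailable:
        # The primary component failed: keep serving, at reduced quality.
        return fallback(), True


def failing_primary() -> str:
    # Assumption: simulates a node that has stopped working.
    raise ServiceUnavailable("primary node is down")


def cached_fallback() -> str:
    # Assumption: a stale but usable answer kept for emergencies.
    return "stale cached value"


result, degraded = fetch_with_degradation(failing_primary, cached_fallback)
print(result, degraded)
```

Returning the degraded flag alongside the result is one possible design choice: it lets callers decide whether a lower-quality answer is acceptable instead of hiding the failure.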

Recovery from failures in fault-tolerant systems can be classified as either “roll-forward” or “roll-back”, depending on the type of error. Roll-forward recovery takes the system state at the moment the error is detected and corrects it so that the system can advance, while roll-back recovery reverts the system to an earlier, correct state (for example, a saved checkpoint) so that it can then continue to advance.
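The two recovery styles can be illustrated with a toy example. This is only a sketch under stated assumptions: the Account class, its checkpoint mechanism, and the repair rule (clamping a negative balance to zero) are hypothetical and chosen purely to show the difference between reverting state and correcting it in place.

```python
import copy


class Account:
    """Toy state holder with checkpoint-based recovery (hypothetical)."""

    def __init__(self, balance: int) -> None:
        self.state = {"balance": balance}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self) -> None:
        # Save a known-good copy of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self) -> None:
        # Roll-back: discard the faulty state and return to the checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def roll_forward(self) -> None:
        # Roll-forward: repair the current state in place and carry on.
        # Assumption: a negative balance is the only error, fixed by clamping.
        if self.state["balance"] < 0:
            self.state["balance"] = 0


acct = Account(100)
acct.state["balance"] -= 30      # valid update: balance is now 70
acct.checkpoint()

acct.state["balance"] = -999     # simulated corruption
acct.roll_back()
rolled_back = acct.state["balance"]

acct.state["balance"] = -5       # another simulated error
acct.roll_forward()
repaired = acct.state["balance"]

print(rolled_back, repaired)
```

Roll-back needs stored checkpoints but no knowledge of the error, whereas roll-forward needs no stored history but must know how to correct the specific error it encounters.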

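There are specific types of fault tolerance that are aimed at particular failure scenarios. For example, Practical Byzantine Fault Tolerance (PBFT) targets the Byzantine Generals’ Problem, in which some participants in a distributed system may fail arbitrarily or act maliciously while the remaining participants must still reach agreement. This differs from the Two Generals’ Problem, which applies to any two-party communication where messages may be lost.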
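Practical Byzantine Fault Tolerance is commonly summarized by its replica arithmetic: tolerating f arbitrarily faulty replicas requires at least 3f + 1 replicas in total, with agreement reached on matching messages from 2f + 1 of them. The sketch below only encodes that arithmetic; the function names are illustrative and it is not an implementation of the protocol itself.

```python
def min_replicas(f: int) -> int:
    """Smallest cluster size that tolerates f Byzantine (arbitrary) faults."""
    return 3 * f + 1


def quorum_size(f: int) -> int:
    """Matching replies needed in a cluster of 3f + 1 replicas."""
    return 2 * f + 1


def max_faulty(n: int) -> int:
    """Largest number of Byzantine faults that n replicas can tolerate."""
    return (n - 1) // 3


for f in range(1, 4):
    print(f, min_replicas(f), quorum_size(f))
```

For example, a cluster of four replicas tolerates one faulty replica, and growing it to six does not help: a second fault is only tolerated once there are seven replicas.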

*For this reason, fault tolerance is particularly important in life-critical and high-availability systems, where the ability to maintain functionality is imperative.

**The ability of a system to maintain functionality when some of its components break down is referred to as “graceful degradation”.