11 West 42nd Street, 8th Floor New York, NY 10036
T: 1 646 630 7421
2 London Wall Place, London, EC2Y 5AU
T: 0207 113 1500
18 Robinson Road, Level 14-02, Singapore 048547
Av. Angélica, 2529 - Bela Vista, 6th Floor São Paulo - SP, 01227-200, Brazil
40-44 Bonham Strand, 7F Sheung Wan, Hong Kong
Fault tolerance is the characteristic that enables a system to continue to function and operate sufficiently in the presence of component failures. When a component of a system fails, fault-tolerant design enables a system to continue to operate as intended – though potentially at a reduced level – rather than fail completely*. A fault may present itself technically – a node stops working, there is a hardware failure, a software bug – but can also be the result of a malicious actor attacking a system.
While fault tolerance can be a property of individual machines, it can also characterize the way in which machines interact in a network. In distributed computing systems, for example, attacks and software errors are becoming more common and can cause faulty behavior of nodes, resulting in the need for fault-tolerant design, especially for financial institutions that operate many sensitive production systems.
This fault-tolerant design allows a system to maintain functionality, though sometimes at a reduced quality**. In a fault-tolerant distributed computing system, for example, a system-component failure could result in a reduced throughput or longer response time – the entire system, however, is not stopped after a partial failure.
Recovery from failures in fault-tolerant systems can be classified as either “roll-forward” or “roll-back” depending on the type of error in a system. Roll-forward recovery involves correcting the system state so that it may advance, while roll-back requires reverting the system to an earlier, correct version so that it can then continue to advance.
There are specific types of fault tolerance that are aimed at unique failure scenarios. For example, Practical Byzantine Fault Tolerance targets the Two Generals’ Problem, which applies to any type of two party communication where failures of communication are possible.
*For this reason, fault tolerance is particularly important in systems that are life-critical or high-availability, as the ability to maintain functionality in these systems is imperative.
**The ability of a system to maintain functionality when components of a system break down is referred to as “graceful degradation”.
Share this post
November 12, 2019
September 09, 2019
September 06, 2019
Blockchain for every business in every industryLearn More
Stay up to date on the latest news and articles related to R3.