July 18, 2017

What is fault tolerance?

Fault tolerance is the characteristic that enables a system to continue to function and operate sufficiently in the presence of component failures. When a component of a system fails, fault-tolerant design enables a system to continue to operate as intended – though potentially at a reduced level – rather than fail completely*. A fault may present itself technically – a node stops working, there is a hardware failure, a software bug – but can also be the result of a malicious actor attacking a system.

While fault tolerance can be a property of individual machines, it can also characterize the way in which machines interact in a network. In distributed computing systems, for example, attacks and software errors are becoming more common and can cause faulty behavior of nodes, resulting in the need for fault-tolerant design, especially for financial institutions that operate many sensitive production systems.

This fault-tolerant design allows a system to maintain functionality, though sometimes at a reduced quality**. In a fault-tolerant distributed computing system, for example, a system-component failure could result in a reduced throughput or longer response time – the entire system, however, is not stopped after a partial failure.

Recovery from failures in fault-tolerant systems can be classified as either “roll-forward” or “roll-back” depending on the type of error in a system. Roll-forward recovery involves correcting the system state so that it may advance, while roll-back requires reverting the system to an earlier, correct version so that it can then continue to advance.

There are specific types of fault tolerance that are aimed at unique failure scenarios. For example, Practical Byzantine Fault Tolerance targets the Two Generals’ Problem, which applies to any type of two party communication where failures of communication are possible.

*For this reason, fault tolerance is particularly important in systems that are life-critical or high-availability, as the ability to maintain functionality in these systems is imperative.

**The ability of a system to maintain functionality when components of a system break down is referred to as “graceful degradation”.

Sign Up and Stay Up on R3

Sign up for our flagship newsletter, the R3 Ledger, to receive the latest news, updates and content.

"*" indicates required fields

Cookie	Duration	Description
cookielawinfo-checbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Discover how applications built on Corda are solving real-world problems for institutions in regulated industries.

The leading distributed application platform that delivers digital trust between multiple parties.

Dedicated professional services and support teams to accelerate time to production.

See our story and latest news, what it’s like to work at R3, ways to partner with us, and how we give back.

What is fault tolerance?

Sign Up and Stay Up on R3

Sign up for our flagship newsletter, the R3 Ledger, to receive the latest news, updates and content.