Introduction to System Design: What is Reliability

Scott Cosentino
3 min read · Jun 19, 2023

Reliability means that a system continues to work correctly even when things go wrong. The things that can go wrong are called faults, and a system that can tolerate them is called fault-tolerant or resilient. How much fault tolerance a system needs depends on how critical it is. As you continue to learn system design, you will explore methods for building fault-tolerant systems that meet a product’s specifications. To better understand fault-tolerant design, let’s look at the types of faults that can occur in a system.

Hardware Faults

Every computer system runs on hardware. If that hardware fails, the applications running on it can start to fail too. We often add hardware redundancy so that the failure of a single component does not bring down the whole system. For example, administrators can set up storage disks in a redundant RAID (Redundant Array of Independent Disks) configuration, such as RAID 1 mirroring, so that the system tolerates a hard drive failure.
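
To make the idea of disk redundancy concrete, here is a minimal Python sketch of RAID 1 style mirroring. It is only an illustration, not how a real RAID controller works: the `Disk`, `MirroredVolume`, and `DiskFailure` names are hypothetical, the disks are plain in-memory dictionaries, and rebuilding a replaced disk is not modeled.

```python
class DiskFailure(Exception):
    """Raised when a simulated disk cannot serve a request."""


class Disk:
    """Hypothetical in-memory stand-in for a physical disk."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, block_id, data):
        if self.failed:
            raise DiskFailure(f"{self.name} is unavailable")
        self.blocks[block_id] = data

    def read(self, block_id):
        if self.failed:
            raise DiskFailure(f"{self.name} is unavailable")
        return self.blocks[block_id]


class MirroredVolume:
    """Writes each block to every disk and reads from the first healthy one."""

    def __init__(self, disks):
        self.disks = disks

    def write(self, block_id, data):
        copies = 0
        for disk in self.disks:
            try:
                disk.write(block_id, data)
                copies += 1
            except DiskFailure:
                continue  # tolerate the fault as long as one copy is stored
        if copies == 0:
            raise DiskFailure("all disks in the mirror have failed")

    def read(self, block_id):
        for disk in self.disks:
            try:
                return disk.read(block_id)
            except DiskFailure:
                continue  # try the next copy
        raise DiskFailure("all disks in the mirror have failed")


volume = MirroredVolume([Disk("disk-a"), Disk("disk-b")])
volume.write(0, b"hello")
volume.disks[0].failed = True  # simulate a hard drive failure
print(volume.read(0))          # b'hello', served from the surviving disk
```

The fault (one failed disk) never becomes a failure visible to the caller, because the second copy still answers the read. Only when every disk in the mirror has failed does the volume raise an error.
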
Hardware faults are not limited to a single component failing. For example, a power failure can make otherwise healthy hardware unavailable. To guard against this type of issue, data centers often have secondary power sources such as generators.
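
A rough way to see why a secondary power source helps is to estimate how often at least one source is available. The Python sketch below assumes the sources fail independently of each other and ignores the time a generator needs to start up. The function name `combined_availability` and the availability figures are hypothetical, chosen only for illustration.

```python
from math import prod


def combined_availability(availabilities):
    """Probability that at least one of several independent sources is up."""
    # The sources are all down only if each one is down at the same time.
    all_down = prod(1 - a for a in availabilities)
    return 1 - all_down


# Hypothetical figures: utility power is up 99.9% of the time,
# and the backup generator works 95% of the times it is needed.
utility_power = 0.999
generator = 0.95

print(f"{combined_availability([utility_power]):.5f}")             # 0.99900
print(f"{combined_availability([utility_power, generator]):.5f}")  # 0.99995
```

Under these assumptions, adding the generator cuts the expected time without power by a factor of 20. In practice the faults are not always independent (for example, a flood could disable both sources), so the real benefit can be smaller.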
