Reliability vs. other ilities

These words are used interchangeably when talking about cloud services, and they shouldn’t be. So here’s a little glossary:

  • Availability: the system uptime percentage, where “uptime” means the service is able to respond to requests. Common examples are 99.9% (“three nines”) and 99.99% (“four nines”).
  • Reliability: the probability that the system will work like it’s supposed to (as opposed to availability, which just means it’s “up” but not necessarily working as expected). Sometimes expressed as mean time between failures (MTBF), such as “1 failure every 79.3 hours”, or as a failure rate (failures per unit of time).
  • Durability: the ongoing existence of something, usually stored data. Durability doesn’t necessarily mean the thing can be accessed, only that it still exists. Regular backups and redundancy are the go-tos for improving it.
  • Serviceability: the speed and ease of manually repairing a system when something goes wrong. If serviceability is low, then it takes longer to fix a failure or outage, which means availability and reliability are worse.
  • Resiliency [the only non-ility in the list]: the ability to self-heal after something goes wrong, such as auto-restarts, active failover, or automatically removing broken things (nodes, hardware, DB replicas, whatever) from the system when a failure is detected.
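
To make the numbers above a bit more concrete, here is a small sketch in Python. It turns an availability percentage into a yearly downtime budget, and estimates availability from MTBF plus a repair time. The function names are made up for illustration, and the repair-time input (often called MTTR, mean time to repair) is not defined in the glossary above; it is an assumption brought in as a rough stand-in for serviceability. The MTBF / (MTBF + MTTR) formula is a common steady-state approximation, not the only way to measure availability.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime allowed in a period at a given availability percentage."""
    return period_hours * (1 - availability_pct / 100)


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate availability as a fraction: MTBF / (MTBF + MTTR).

    mttr_hours is a hypothetical input here (mean time to repair); shorter
    repairs, i.e. better serviceability, push the result closer to 1.0.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        print(f"{pct}% allows {downtime_budget_hours(pct):.3f} hours of downtime per year")
    # Using the 79.3-hour MTBF from the glossary and an assumed 30-minute repair time.
    print(f"availability ~ {steady_state_availability(79.3, 0.5):.4f}")
```

Three nines works out to roughly 8.76 hours of downtime per year, and four nines to roughly 52.6 minutes. The second function also shows why low serviceability hurts availability: with the same MTBF, a longer repair time lowers the ratio.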

