Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - Martin Kleppmann

CHAPTER 1 Reliable, Scalable, and Maintainable Applications

Applications today:

  • data-intensive
  • compute-intensive

A data-intensive application (DIA) is typically built from standard building blocks that provide commonly needed functionality:

  • Store data so that they, or another application, can find it again later (databases)
  • Remember the result of an expensive operation, to speed up reads (caches)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)

Thinking About Data Systems

Three concerns that are important in most software systems:

  • Reliability
    The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
  • Scalability
    As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
  • Maintainability
    Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.

Reliability

Working correctly typically means that the application performs the function the user expected, tolerates user mistakes, performs well enough under the expected load, and prevents unauthorized access and abuse. If all those things together mean working correctly, then we can understand reliability as meaning, roughly, continuing to work correctly, even when things go wrong.

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.

Note that a fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.

Hardware Faults

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years.
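
To put that in perspective: even with such a long per-disk MTTF, a large storage cluster sees disk failures all the time. A back-of-the-envelope sketch in Python (the 10,000-disk cluster size is an assumed example, and treating failures as independent and uniform over the MTTF is a simplification):

    # Rough expected disk failures per day, assuming independent failures
    # spread uniformly over the MTTF (real failure rates are not uniform).

    def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
        return num_disks / (mttf_years * 365)

    # On an assumed 10,000-disk cluster, roughly one disk dies per day:
    for mttf_years in (10, 30, 50):
        rate = expected_failures_per_day(10_000, mttf_years)
        print(f"MTTF {mttf_years:>2} years -> ~{rate:.2f} failures/day")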

Software Errors

Human Errors

For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages.

Scalability

Even if a system is working reliably today, that doesn't mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.

Scalability is the term we use to describe a system’s ability to cope with increased load.

Describing Load

First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers, which we call load parameters.
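
As an illustration, the book’s running example (revisited in the Summary below) is Twitter’s home timeline, where the key load parameter is not the raw rate of new tweets but the fan-out: each tweet must reach every follower. A minimal sketch with made-up numbers:

    # Hypothetical load parameters for a social feed. The raw post rate
    # understates the load: each post fans out to every follower's cached
    # home timeline, so the fan-out factor is the critical load parameter.

    posts_per_second = 5_000   # assumed average rate of new posts
    avg_followers = 75         # assumed average followers per user

    timeline_writes_per_second = posts_per_second * avg_followers
    print(f"{timeline_writes_per_second:,} timeline writes/sec")  # 375,000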

Describing Performance

In a batch processing system such as Hadoop, we usually care about throughput: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. In online systems, what’s usually more important is the service’s response time: the time between a client sending a request and receiving a response.

Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service.

Even if you only make the same request over and over again, you’ll get a slightly different response time on every try. In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.

Most requests are reasonably fast, but there are occasional outliers that take much longer.

It’s common to see the average response time of a service reported. (Strictly speaking, the term “average” doesn't refer to any particular formula, but in practice it is usually understood as the arithmetic mean: given n values, add up all the values, and divide by n.)

Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point.

The median is also known as the 50th percentile, and sometimes abbreviated as p50.

In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common (abbreviated p95, p99, and p999). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more.
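
A minimal sketch of the sort-and-pick approach just described, using the nearest-rank definition of a percentile (the sample data is invented; real monitoring systems compute these over a rolling window of recent requests):

    # Percentiles via "sort the response times and pick by rank".

    def percentile(values: list[float], p: float) -> float:
        """Nearest-rank percentile: the value below which roughly p% of
        the sorted values fall (libraries often interpolate instead)."""
        ordered = sorted(values)
        rank = round(p / 100 * len(ordered)) - 1
        return ordered[max(0, min(len(ordered) - 1, rank))]

    response_times_ms = [12, 15, 18, 22, 25, 31, 40, 55, 90, 1500]

    for p in (50, 95, 99):
        print(f"p{p}: {percentile(response_times_ms, p)} ms")

    # The single 1500 ms outlier barely moves the median (p50 = 25 ms)
    # but dominates p95 and p99, and it drags the arithmetic mean up to
    # about 180 ms, which is why the mean is a poor summary here.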

High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service.

For example, percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service.
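
An SLA might state, for instance, that the service counts as up if the median response time is below 200 ms and the 99th percentile is below 1 s. A hypothetical check along those lines (the thresholds are illustrative):

    import statistics

    # Hypothetical SLO check: median < 200 ms and p99 < 1000 ms.
    def meets_slo(response_times_ms: list[float]) -> bool:
        ordered = sorted(response_times_ms)
        p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
        return statistics.median(ordered) < 200 and p99 < 1000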

Queueing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests, an effect sometimes known as head-of-line blocking.
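
A toy model of head-of-line blocking, assuming a single worker and a FIFO queue with all requests already queued (the service times are invented):

    # One worker, FIFO queue: a single 1000 ms request delays every
    # request queued behind it, even though their own work is tiny.

    service_times_ms = [10, 10, 1000, 10, 10]

    clock = 0.0
    for i, service in enumerate(service_times_ms):
        clock += service  # response time = queueing delay + own service time
        print(f"request {i}: {service:>4} ms of work, responds at {clock:.0f} ms")

    # Requests 3 and 4 each need only 10 ms of work, yet their response
    # times exceed one second because they sat behind request 2.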

Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow (an effect known as tail latency amplification).
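
The arithmetic behind this is simple: if each backend call independently lands in the slow tail with probability p, a user request that must wait for n parallel calls is slow with probability 1 - (1 - p)^n. A quick illustration with assumed values:

    # Tail latency amplification: a user request waiting on n parallel
    # backend calls is slow if at least one of those calls is slow.

    p_slow = 0.01  # assume 1% of backend calls hit the slow tail

    for n_calls in (1, 10, 100):
        p_user_slow = 1 - (1 - p_slow) ** n_calls
        print(f"{n_calls:>3} backend calls -> {p_user_slow:.1%} of user requests slow")
    # 1 -> 1.0%, 10 -> ~9.6%, 100 -> ~63.4%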

Approaches for Coping with Load

People often talk of a dichotomy between scaling up (vertical scaling: moving to a more powerful machine) and scaling out (horizontal scaling: distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture.

Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system).

The architecture of systems that operate at large scale is usually highly specific to the application; there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce).

Maintainability

Unfortunately, many people working on software systems dislike maintenance of so-called legacy systems: perhaps it involves fixing other people’s mistakes, working with platforms that are now outdated, or dealing with systems that were forced to do things they were never intended for.

However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:

  • Operability
    Make it easy for operations teams to keep the system running smoothly.
  • Simplicity
    Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)
  • Evolvability
    Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.

Operability: Making Life Easy for Operations

Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more:

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state
  • Tracking down the cause of problems, such as system failures or degraded performance
  • Keeping software and platforms up to date, including security patches
  • Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
  • Anticipating future problems and solving them before they occur (e.g., capacity planning)
  • Establishing good practices and tools for deployment, configuration management, and more
  • Performing complex maintenance tasks, such as moving an application from one platform to another
  • Maintaining the security of the system as configuration changes are made
  • Defining processes that make operations predictable and help keep the production environment stable
  • Preserving the organization’s knowledge about the system, even as individual people come and go

Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:

  • Providing visibility into the runtime behavior and internals of the system, with good monitoring
  • Providing good support for automation and integration with standard tools
  • Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)
  • Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
  • Providing good default behavior, but also giving administrators the freedom to override defaults when needed
  • Self-healing where appropriate, but also giving administrators manual control over the system state when needed
  • Exhibiting predictable behavior, minimizing surprises

Simplicity: Managing Complexity

Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a big ball of mud.

Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.

One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand facade.

Evolvability: Making Change Easy

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability.

Summary

An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways) and nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability).

Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.

Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.

Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it.