When it comes to SRE, short for Site Reliability Engineering, the resources available online are largely limited to the books published by Google themselves. They share some useful case studies that help us understand what SRE is and the concepts behind it, but they do not clearly explain how to build an SRE team for your own organisation. The concept of SRE was cooked fresh within the walls of Google and later released to the general public as a practice for anyone to follow.
In this article I would like to give a brief introduction to SRE and why it is important to any Software Engineering organisation. This is based on my experience and learning from leading a Site Reliability Engineering team for one of my former clients.
In the early stages of any new software product, the user base is small and the primary focus is on delivering features as quickly as possible, to gain a stable footing in the market. During this period, you might get a few support tickets, which the developers can handle themselves, and the same goes for DevOps tasks. But as the system grows, the developers will have to focus primarily on development, and you will have to start hiring Support Engineers and SysAdmins to take care of the operational tasks. But what happens when it grows further, when the SysAdmins are no longer able to tackle it by themselves? You will have to hire more SysAdmins and Support Engineers to take care of the reliability of the system. As the system grows, the cost of operational tasks grows linearly with it. Where will it stop?
The day-to-day responsibilities of Software Engineers and Operations Engineers keep increasing, and growing organisations need to seek approaches that keep the system as stable as possible. You need your Site to become more and more reliable as it grows, in terms of scalability, availability and other aspects. If you fail to meet your customers' expectations, your product will fail in the industry and completely lose its traction. How can we tackle problems like this in the real world whilst ensuring the operational costs stay within our budgets? The burning question, asked a long time back, was:
It is a universally acknowledged truth that systems do not run themselves. How, then, should a system—particularly a complex computing system that operates at a large scale—run?
This is the basic need for Site Reliability Engineering: a specific set of engineers build their own set of practices to ensure that the Site is reliable at any given point in time. In any growing system, you need a set of engineers who look for new ways to improve the stability of production systems through proper monitoring and automation-first practices.
SRE is a framework introduced by Google in 2003 for operating large-scale production systems in a reliable manner. This may sound like an operations function, but it is not. According to the founder of Google's SRE team, Ben Treynor:
SRE is what happens when you ask a Software Engineer to design an operations function for your system.
It's a very versatile approach which allows you to reliably run mission-critical systems, no matter how big or small the system is.
SRE is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. A Site Reliability Engineer is expected to spend up to 50% of their time on operational work such as being on-call, handling manual workloads, and writing documentation. For the remaining 50% of the time, an SRE is expected to do actual development: new features, deployments, automation tasks and so on. A system managed by SREs is meant to be self-healing and very proactive. SRE owns the entire production environment and has to ensure that the Site is reliable, no matter what gets released to production.
In my opinion, an ideal SRE is a software engineer with a strong background in administering and operating production systems. From what I see, you can do Site Reliability Engineering without having a Site Reliability Engineer, and you may already have engineers playing the role of SRE without even having an SRE team. Site Reliability Engineering is a cross-functional role, assuming responsibilities traditionally siloed off to development, operations, and other IT groups. SREs will seek to automate everything that comes their way, to make room for actual engineering work rather than manual labour.
The usual question about SRE is whether it is suitable for small organisations. This is highly debatable, but my belief is that it is. Even in a small organisation, there is always someone taking care of the operations work from time to time. As I said earlier, you may already have SREs working with you without even knowing it. SRE has grown as a practice inside larger organisations, but small organisations are well suited to adopt its practices even without establishing a dedicated SRE team.
The mindset of an SRE is different from that of a Software Engineer or an Operations Engineer. SREs always think of ways to automate most of the operations work rather than doing it manually. This mindset is something that needs to grow from within: you constantly think of ways and tools to alert on, monitor, and automate the tasks at hand, in order to make the system more reliable.
As a Software Engineer, you gain in-depth knowledge of a single area. But as an SRE, you grow the breadth of your knowledge across a vast area by learning about the different technologies available in the industry.
The demand for Site Reliability Engineers has grown rapidly throughout the world, and as a result, the average salary of an SRE is higher than that of a Software Engineer. If you search for SRE positions on Glassdoor, you will find over 70k positions available worldwide. The demand keeps growing as organisations start to understand the value of SREs in keeping the Site reliable. If you really want to be a Site Reliability Engineer, ask yourself the following questions.
DevOps is a set of practices that combines software development (Dev) and IT operations (Ops), aiming to shorten the systems development life cycle and provide continuous delivery with high software quality. SRE and DevOps are strongly related to each other, because they work toward the same goals. But the way SRE sees the system is different from a traditional DevOps culture. There is a common saying in the industry:
SRE Implements DevOps
First, let's understand the five key pillars of DevOps success.
The role of a Site Reliability Engineer (SRE) is not formally defined anywhere. It's more of a culture and a set of norms that organisations build for tackling production-related matters on their own. Hence, the role of an SRE differs from organisation to organisation. But there is a common set of practices that SREs follow, and it is not limited to the following.
Most of the systems we see today are highly distributed, and we very rarely see non-distributed, monolithic architectures. Based on my understanding, the role of an SRE is not limited to just monitoring the distributed system, but extends to monitoring almost everything: the production applications, deployment servers, underlying infrastructure, code quality, and even the Mean Time to Deliver of a system.
The system you are maintaining might have agreed to comply with industry standards such as ISO 27001 for security and ISO 9001 for quality. In this case, there should be a way to monitor whether the system is in line with these standards or not.
SLOs are a key aspect of any system, describing the overall behaviour of a production system. I will explain more on this in a section below; for now, just assume that this is about measuring the uptime of a system. Have you seen systems stating that they are available 99 percent of the time? The more 9s added to this figure, the more stringent the target becomes. For example, 99.9% availability allows roughly 8.8 hours of downtime per year, while 99.99% allows less than an hour.
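To make the arithmetic behind the 9s concrete, here is a small sketch (in Python, purely as an illustration; the targets listed are just common examples) that converts an availability target into a yearly downtime budget:

```python
# Convert an availability target into a yearly downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years for simplicity

def allowed_downtime_seconds(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_seconds(target) / 3600
    print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

Each extra 9 cuts the budget by a factor of ten: 99% allows about 87.6 hours of downtime a year, while 99.999% allows barely five minutes.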
Many production applications have licensed or paying customers, and we need to provide them a reliable system. If the system is not reliable enough, customers will start to question why they should keep paying for the software. Even highly reliable systems can go down unexpectedly. Based on past metrics, every company defines its availability target, and a breach of this number will require the company to pay the customers back in cash, credit, or discounts.
But this is not something the customers themselves can monitor. It can only be monitored through SLOs, and this is where SRE comes into play. Have a look at how Google Cloud Platform penalises its own services if they fail to adhere to their SLAs; the Compute Engine SLA is a good example.
As an SRE, you should try to reduce the amount of manual tasks. You will have to build a lot of automation scripts to make sure you can just sit back and have a cup of coffee while your system runs smoothly. So, as an SRE, your first question should always be:
Can I automate this task as well?
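As a minimal illustration of that mindset, a recurring manual check, say, eyeballing disk usage on a host, can be replaced with a small script. The 80% threshold and the single path checked here are assumptions made for the sake of the example:

```python
import shutil

# Replace a recurring manual check (eyeballing disk usage) with a script.
def disk_usage_ok(path: str, threshold_percent: float = 80.0) -> bool:
    """Return True while usage on `path` stays below the threshold."""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100
    return used_percent < threshold_percent

if __name__ == "__main__":
    status = "OK" if disk_usage_ok("/") else "ALERT"
    print(f"/: {status}")
```

In a real setup this would run on a schedule and only page someone on ALERT, turning a daily manual chore into an exception-driven one.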
SRE is not a support-engineer position. The first call regarding complaints should come to the Support Engineers. Then, if it is a high-severity incident, the Support Team decides to wake up the SRE Team. In this case, the SRE is responsible for analysing the incident and waking up the others who are required to solve the crisis. I use the word crisis because this process should not happen unless the incident is declared an organisation-wide one. SRE takes care of the incident from top to bottom, and after it is resolved, SRE creates a postmortem document and runs a retrospective with the leads to ensure it does not happen again. These postmortems should be blameless and treated purely as a learning exercise.
SREs need to ensure they build a good rapport with the development teams, and that they provide the necessary details to management when needed. Management will depend on the metrics provided by the SREs to make many business decisions in the organisation.
Have you ever wondered how to measure the behaviour of a service? How could you actually measure whether a production application is running smoothly or not? We sometimes go with a gut feeling: if the users seem happy, we conclude that the service is running smoothly. These applications could be internal APIs or public applications used by the general public. Either way, the service should have proper metrics that we can investigate to measure its quality. Some applications might behave as intended for some users, and not for others. This is why we need to define levels of service to the user, so that they understand what to expect from an application when using it. This does not describe the actual features provided by the application, but defines how the application behaves in a live production environment.
This is where we need to introduce proper metrics and keep monitoring them, so that stakeholders are aware of the behaviour of the application over time. In Site Reliability Engineering, there are three main concepts around which metrics are collected.
These measurements describe the basic properties of the service that we measure (SLIs), the target values we want these metrics to have or maintain (SLOs), and how we should react if we are unable to provide the expected service (SLAs). Defining these metrics is very important for SREs to understand the behaviour of the application and to be confident about the production environment.
The term Service Level Agreement (SLA) is something we are all familiar with, but it has taken different forms in the software industry depending on the context. This section explains the terminology in depth, so that the definitions of these metrics are crystal clear.
Service Level Indicator - SLI
This is a carefully defined quantitative measure of some aspect of the level of service that is provided. Some values we would like to measure might not be directly available to our monitoring tools. For example, network delays on the client side might not be directly measurable. In such cases, those values cannot be used as metrics, and other measurable aspects come into the picture. The most commonly used SLIs are given below.
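To make this concrete, here is a hedged sketch of computing two common SLIs, availability and latency, from a batch of request records. The record format and the 300 ms latency threshold are assumptions for illustration; in practice these numbers come from your monitoring pipeline:

```python
# Compute two common SLIs from a batch of request records.

def availability_sli(requests):
    """Fraction of requests that succeeded (no server error)."""
    ok = sum(1 for r in requests if r["status"] < 500)
    return ok / len(requests)

def latency_sli(requests, threshold_ms=300):
    """Fraction of requests served faster than the threshold."""
    fast = sum(1 for r in requests if r["latency_ms"] < threshold_ms)
    return fast / len(requests)

requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 450},
    {"status": 503, "latency_ms": 30},
    {"status": 200, "latency_ms": 90},
]
print(availability_sli(requests))  # 3 of 4 succeeded -> 0.75
print(latency_sli(requests))       # 3 of 4 under 300 ms -> 0.75
```

Note that each SLI is deliberately a ratio of good events to total events, which makes it directly comparable to an SLO target like 99.9%.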
Service Level Objective - SLO
This is a target value, or a range of values, for a service level that is directly measured by an SLI. Deciding on a proper Service Level Objective is a bit tough and opinion-based. Monitoring is important here; SLOs you cannot monitor have no value at all. For example, measuring the network delay on the user's side is impossible unless you maintain a frontend client app to do so, hence having that as an SLO is not that relevant. Having a proper SLO defined for the application is critical not only for management but also for the users, as it sets everyone's expectations of how the application will perform. If someone complains that the application is running very slowly, we can correlate this with the metrics gathered for the SLOs to see whether the affected user's experience was captured as downtime against the SLO. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to over-reliance on the service, when users incorrectly believe that it will be more available than it actually is.
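One common way to operationalise an SLO is an error budget: the fraction of failures the SLO permits over a window. The sketch below (the 99.9% target is an assumed example) compares a measured SLI against the SLO and reports how much budget remains:

```python
def error_budget_remaining(slo: float, measured_sli: float) -> float:
    """Fraction of the error budget left; negative means the SLO is breached."""
    budget = 1.0 - slo            # failure fraction the SLO permits
    burned = 1.0 - measured_sli   # failure fraction actually observed
    return (budget - burned) / budget

slo = 0.999  # assumed target: 99.9% of requests succeed
print(round(error_budget_remaining(slo, 0.9995), 3))  # half the budget left: 0.5
print(round(error_budget_remaining(slo, 0.998), 3))   # budget overspent: -1.0
```

A team might use this number to gate releases: ship freely while the budget is positive, and freeze risky changes once it approaches zero.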
Service Level Agreements - SLA
This is the legally binding agreement that answers the question, "what happens if the SLOs are not met?" The agreement speaks directly to the customer and communicates the consequences of failing to maintain a defined SLO. If the SLOs are met, the customer is happy; if they are not, the service provider will have to pay a penalty (in money or some other form) to the customer. This mostly concerns applications involving licences and paid subscriptions.
The SRE Team does not get involved in deciding the SLAs, because SLAs are closely tied to business and product decisions. But SRE does get involved in taking action when the SLOs backing the SLA are not met.
Some organisations might not have a direct SLA with their customers, but an implicit one. For example, Google Search does not have an agreement with its users. But still, if the search results are generated slowly or incorrectly, the organisation ends up paying a penalty to its customers in the form of damaged reputation. Nevertheless, SLOs and SLIs remain important, and you can later decide how to implement an SLA for the service provided.
This section explains the concept of Toil in Site Reliability Engineering. As Site Reliability Engineers, we are required to perform a certain amount of operational activity in our day-to-day processes. That being said, if these operational activities turn into Toil, they should be eliminated by the SREs themselves. As SREs, we have many more crucial, long-running engineering tasks to carry out than to spend most of our time on Toil. So, this section will try to give a definition of Toil and explain how SREs should tackle it in their day-to-day processes.
We often see the word Toil misused in the engineering domain. Toil is not just work that we have to do regularly, or work we get bored with. Tasks like writing documentation, conducting meetings, and sending out emails cannot be considered Toil. These are merely administrative work and, in management terms, are simply called Overhead. So when it comes to understanding Toil, it's definitely not work that irritates or discomforts us; such feelings are highly subjective and can be interpreted in different ways.
Toil is work which, in general, possesses the following characteristics. It does not necessarily need to have all of the properties below, but at least a combination of them.
Executing or triggering a script by hand. If a person needs to manually trigger a script in order to execute the steps in it, that is manual work, and the time spent can be considered Toil time.
If you perform a task once or twice, it is not Toil. But if you have to do it continuously, it becomes Toil. For example, sending out a daily email to stakeholders is definitely Toil.
If the manual work you are doing can simply be converted into a script or an automated programme, then that work is definitely Toil. By automating it, you reduce the need for human effort to execute the task. But if it needs human judgement, like deciding whether something is a bug or a feature, then it is not Toil. Even this is arguable, since you could use sophisticated tools like Machine Learning to approximate the judgement; in that case, it can be called Toil as well.
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. For example, when an incident happens, we have to create a channel, pages, postmortems and so on, which involves a lot of work, and what we do differs from one incident to the next. It will be hard to fully eliminate this kind of work, but we should definitely work towards reducing it.
If an operational task does not change the state of the system, it is definitely Toil. If the work you did changed the performance of the application or added a new feature to the system, then it cannot be considered Toil.
If the work you do grows with the size of the system, requiring more resources and taking more time, then it is considered Toil. For example, if you are supposed to send a daily email about new incidents and you suddenly get around 50 incidents overnight, you will have to manually analyse all of them and summarise them in an email. This is something SREs should try to automate.
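As a sketch of what that automation could look like, the snippet below builds the daily summary from a list of incident records. The record fields and severity labels are assumptions; a real version would pull the incidents from a ticketing system and deliver the text via email, so the work no longer grows with the incident count:

```python
from collections import Counter

def summarise_incidents(incidents):
    """Build a plain-text daily summary, grouped by severity."""
    by_severity = Counter(i["severity"] for i in incidents)
    lines = [f"Incidents in the last 24h: {len(incidents)}"]
    for severity, count in sorted(by_severity.items()):
        lines.append(f"  {severity}: {count}")
    return "\n".join(lines)

# Hypothetical incident records, as might come from a ticketing system.
incidents = [
    {"id": 1, "severity": "high"},
    {"id": 2, "severity": "low"},
    {"id": 3, "severity": "low"},
]
print(summarise_incidents(incidents))
```

Whether there are 3 incidents or 300, the script's cost stays the same, which is exactly the property that distinguishes engineering work from Toil.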