DevOps is a common buzzword in software development, but Site Reliability Engineering (SRE) has been around even longer, and the two concepts have coexisted for quite a while. Some call them rivals, while others say SRE and DevOps are two sides of the same coin.
So what exactly is Site Reliability Engineering and what does it have to do with DevOps (if anything)? Let’s shed some light on the subject in this article.
SRE: a brief history and the definition
Site reliability engineering (or SRE for short) was first introduced in 2003 by Ben Treynor, Google’s VP of engineering. As Treynor put it, SRE is “what happens when a software engineer is tasked with what used to be called operations”. In other words, it is a software engineering approach to IT operations aimed at bridging the gap between the development and operations teams.
Now, the second part of the sentence sounds awfully similar to the definition of DevOps (we’ll get back to the comparison a bit later). So it’s important to state the following: SRE focuses on the reliability and availability of a system, not on the whole development and delivery process.
System reliability is a non-functional requirement, alongside others such as security and performance. It can be expressed as the percentage of requests or operations that complete without failure within a given time period. A reliable system delivers satisfactory performance and works as intended, or at least stays within an agreed error threshold.
The main responsibilities of SRE engineers
An SRE engineer needs experience in both software development and IT operations. This can be either a developer who knows their way around operations or an IT operations specialist who can code.
One of the main responsibilities of the SRE team is to measure system reliability using the four golden signals:
- Latency (time that a system takes to respond to a request);
- Traffic (user demand for your service);
- Errors (rate of failed requests);
- Saturation (how full the service is, i.e. how much of its available CPU, memory, or other constrained resources is in use).
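To make the four signals concrete, here is a minimal Python sketch that summarizes one observation window. The `Request` record, the `golden_signals` function, the choice of the 95th percentile for latency, and the use of CPU as the saturated resource are all assumptions made for illustration. In practice these numbers usually come from a monitoring system rather than hand-written code.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time the service took to respond
    ok: bool            # False if the request failed


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_s, cpu_used, cpu_total):
    """Summarize one observation window into the four golden signals."""
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Latency: how long the slowest 5% of requests took, at minimum
        "latency_p95_ms": percentile([r.duration_ms for r in requests], 95),
        # Traffic: demand on the service, in requests per second
        "traffic_rps": len(requests) / window_s,
        # Errors: share of requests that failed
        "error_rate": failed / len(requests) if requests else 0.0,
        # Saturation: share of the (assumed) CPU capacity in use
        "saturation": cpu_used / cpu_total,
    }
```

Calling `golden_signals(requests, window_s=60, cpu_used=3, cpu_total=4)` on a minute of request records returns a dictionary with one value per signal, which can then be compared against whatever thresholds the team has agreed on.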
Another area of responsibility of the SRE team is to determine when new features can be introduced to the product. The introduction of these features normally happens when the system is considered reliable enough and passes all reliability and availability criteria. In this way, the implementation of new features does not “break” the system and does not cause any malfunctions.
When do you need SRE?
In a perfect world, every medium-sized and large company would benefit from implementing SRE. In reality, adopting SRE without sufficient knowledge may create more issues and cause production delays (not to mention frustration and confusion for your team).
So when do you really need SRE, and when can you get by with lighter-weight methods? As harsh as it sounds, if your company is nowhere near Google in size and complexity, SRE may not be something you desperately need.
Let us explain. SRE was developed at Google because the company needed a new approach to problems it already had: frequent hardware failures within massive data centers, the need for instant recovery, the need for fast response times, and the need to move services dynamically from one data center to another. Site reliability engineering automated a large number of manual processes and increased the reliability of the system, but it also requires a team of SRE engineers to work on the issues and monitor the system constantly.
The introduction of an SRE team requires you to have all the needed resources and enough time to educate other team members on new processes. So if you do not have enough resources, if your current issues can be resolved with simpler tools, and if you don’t really understand what exactly you need, SRE may not be the right choice for you.
SRE and DevOps: are they the same or are they the opposite?
The most common misconception about Site Reliability Engineering is that it is the same as DevOps. Another big misconception is that these two concepts contradict each other.
Neither statement is true, though. It is more accurate to say that SRE and DevOps complement each other and that SRE helps bring the core DevOps principles to life (more on that below). SRE is not really a part of DevOps, but it can help set up the processes you will need if you implement DevOps.
To better understand both SRE and DevOps, let’s look at the key goals and focus of each. As already mentioned, SRE focuses on system reliability and availability. DevOps focuses on the continuity and speed of development and delivery. Both share the same goal, though: to eliminate the gap between operations and development.
How SRE helps implement DevOps principles
While SRE is a set of metrics and practices for achieving a tangible goal (system reliability), DevOps is more of a mindset and a culture. Every culture rests on certain pillars, and concrete practices are needed to form and support them. Below we’ll look at the core DevOps concepts and how SRE helps bring them to life.
Eliminate silos between development and operations teams
We’ve mentioned this goal several times already, but we haven’t yet explained how exactly it can be achieved. Here is what SRE offers:
- Use software engineering to solve operational problems (i.e. automate manual processes);
- Use the same tools for both operations and development;
- Document all procedures and keep the documentation up to date.
As you can see, these practices create transparency between the two teams, eliminate redundant tools and documentation, and help keep everyone on the same page.
Accept failures and allow a defined error budget
The error budget is a concept introduced by SRE, and its main idea is that 100% reliability is an unrealistic goal. Instead, a company defines a reliability objective, and the gap between that objective and 100% becomes the acceptable error budget. For example, a 99.9% availability target leaves an error budget of 0.1%. In this way, the system remains reliable enough while tolerating a certain percentage of errors, and new features can be introduced without waiting for every error to be fixed first. As for how often to define the error budget, you can either follow Google’s example and define it quarterly or set your own frequency, depending on the needs of the project.
Overall, DevOps promotes the acceptance of failures as something to learn from and move forward. That doesn’t mean you should have a carefree attitude towards errors but rather that you should not strive for the system to be 100% perfect and you should remain flexible to balance sufficient quality with an allowed percentage of failures.
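The arithmetic behind an error budget is simple enough to sketch in a few lines of Python. The function names, the request-based way of counting failures, and the rule "ship features only while some budget remains" are assumptions chosen for illustration. Teams may equally track the budget in minutes of downtime or apply a softer policy.

```python
def error_budget(slo, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures) for one budget period.

    slo is the target success ratio as a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed = (1 - slo) * total_requests
    return allowed, allowed - failed_requests


def can_ship_features(slo, total_requests, failed_requests):
    """Hypothetical release policy: True while some error budget is left."""
    _, remaining = error_budget(slo, total_requests, failed_requests)
    return remaining > 0
```

With a 99.9% target and one million requests in the period, roughly 1,000 failed requests are allowed. If 400 have already failed, about 600 remain, and under this policy feature releases can continue. Once failures exceed the allowance, the team would pause new features and focus on reliability until the next period.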
Implement changes gradually
If you implement lots of big changes at once, chances are high that they will harm your system. To retain system reliability while updating and expanding the system, introduce changes gradually. A few best practices from SRE are:
- Keep the balance between the needed level of reliability and fast and frequent updates;
- Perform canary releases for risk mitigation;
- Roll back early and often when a change misbehaves.
Avoid rolling out huge chunks of new features and updates at once. Instead, go little by little and constantly monitor how these changes affect the system and whether they lower its reliability or not.
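A canary release can be reduced to one decision: compare the small group of servers running the new version against the stable baseline, then promote, roll back, or keep waiting. The Python sketch below shows that decision under stated assumptions. The function name, the thresholds (`max_extra_error_rate`, `min_requests`), and the use of error rate as the only health signal are hypothetical. Real rollouts typically also look at latency and other metrics.

```python
def canary_verdict(baseline_failed, baseline_total,
                   canary_failed, canary_total,
                   max_extra_error_rate=0.01, min_requests=100):
    """Decide what to do with a canary based on its error rate.

    Returns "wait" until enough traffic has been observed, "rollback" if
    the canary fails noticeably more often than the baseline, and
    "promote" otherwise.
    """
    if canary_total < min_requests or baseline_total == 0:
        return "wait"  # not enough data to judge yet
    baseline_rate = baseline_failed / baseline_total
    canary_rate = canary_failed / canary_total
    if canary_rate - baseline_rate > max_extra_error_rate:
        return "rollback"
    return "promote"
```

The `min_requests` guard reflects the "monitor constantly" advice above: a canary that has served only a handful of requests says little about reliability, so the sketch refuses to promote it until more traffic has been observed.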
Automate as many processes as possible
One of the biggest bottlenecks in product development and delivery, and a threat to reliability as well, is toil, that is, repetitive manual work. It may include manual releases, regular password resets, or manual infrastructure scaling. These tasks usually consume a significant amount of time and may slow the whole delivery process down.
Both DevOps and SRE aim to automate as much as possible for the benefit of the business. A key SRE rule is to keep toil below 50% of each engineer’s workload. If it goes higher, you need to identify the primary source of toil and automate it.
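The 50% rule is easy to check once toil is tracked per task. Below is a minimal Python sketch, assuming engineers log the hours they spend on each toil task. The `toil_report` function, its parameters, and the task names in the usage note are hypothetical.

```python
def toil_report(toil_hours_by_task, total_hours, limit=0.5):
    """Report the toil share of a workload and its biggest source.

    toil_hours_by_task maps a task name to hours spent on it;
    total_hours is the engineer's total working time in the same period.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    toil_hours = sum(toil_hours_by_task.values())
    share = toil_hours / total_hours
    # The task with the most hours is the first candidate for automation
    biggest = max(toil_hours_by_task, key=toil_hours_by_task.get, default=None)
    return {
        "toil_share": share,
        "over_limit": share > limit,
        "biggest_source": biggest,
    }
```

For a 40-hour week with 12 hours of manual releases, 4 hours of password resets, and 6 hours of manual scaling, the toil share is 22 / 40 = 55%. That is above the limit, and manual releases would be the first thing to automate.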
Set metrics and measure them
Remember the metrics that we mentioned earlier? They are the key indicators of system availability: they show whether the system functions as intended and is available whenever it is needed.
DevOps promotes measuring everything so that you are always aware of the state of the system and can react promptly to any warning signals. SRE uses the latency, traffic, error rate, and saturation metrics to always have concrete data about the system at hand.
Summing up
Site Reliability Engineering is great when you have a complex system and wish to keep it up and running 24/7 with the help of automation and a coding-first approach. Keep in mind, though, that you can’t just implement SRE out of the blue: you’ll need to think about the implementation strategy, employee training, and the resources required to successfully introduce SRE processes to your organization. Therefore, we highly recommend first weighing the pros and cons that SRE will bring you and considering how its implementation will impact your business in the long run. If you decide that the pros outweigh the possible cons, we recommend assembling an SRE team to help you navigate the challenges and ensure everything is set up properly.