Performance Engineering and SRE (Site Reliability Engineering)
Performance engineering (formerly known as performance testing or load testing) and Site Reliability Engineering are two of the most commonly interchanged terms in the software industry today. Although the gap between them has narrowed considerably, on closer inspection they remain distinct fields in their own right.
Let’s look at each of them in some detail in this post.
This is an extension to my previous blog post and might sound a bit repetitive from a performance engineering standpoint, but to keep this post self-contained, some of that information is reused here.
With the dot-com revolution of the early 2000s, the need to ‘test’ websites became urgent, and a number of dedicated commercial / proprietary tools took the world by storm. There was a time when the names LoadRunner and Silk Performer were synonymous with load testing.
With dedicated teams in place, test engineers who had previously been limited to manual and automated functional QA shifted their focus to testing applications under load. Their task was to develop test scripts and run them against multiple simulated users. The outcome was a test report, which the respective product owners would then inspect and analyze further.
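The essence of those early load tests can be sketched in a few lines: fire a number of concurrent ‘virtual users’ at a target and summarize the response times into a report. The sketch below is illustrative, not any particular tool’s approach; it stands up a tiny local HTTP server in place of a real application, and the user and request counts are arbitrary.

```python
import statistics
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen

# Tiny local server standing in for the application under test (illustrative).
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", "2")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

latencies = []          # response times in ms, shared across virtual users
lock = threading.Lock()

def virtual_user(requests_per_user: int) -> None:
    # Each virtual user issues a fixed number of requests and records timings.
    for _ in range(requests_per_user):
        start = time.perf_counter()
        urlopen(url).read()
        elapsed_ms = (time.perf_counter() - start) * 1000
        with lock:
            latencies.append(elapsed_ms)

# Simulate 5 concurrent users issuing 10 requests each (numbers are illustrative).
users = [threading.Thread(target=virtual_user, args=(10,)) for _ in range(5)]
for u in users:
    u.start()
for u in users:
    u.join()
server.shutdown()

# The "test report": basic latency statistics for the product owner to review.
print(f"requests: {len(latencies)}")
print(f"avg ms:   {statistics.mean(latencies):.2f}")
print(f"p95 ms:   {sorted(latencies)[int(len(latencies) * 0.95)]:.2f}")
```

Real tools of the era added scripting, ramp-up schedules, and much richer reporting, but the core loop (concurrency, timing, aggregation) was this.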
With the growth of the internet and the evolution of web technologies, smaller websites started to grow in scale. The engineer who had been limited to just ‘testing’ now had to build a skill set beyond that, to understand what actually happens behind the scenes. This required knowledge of front-end technologies, back-end technologies, and server hardware. Further down the line, companies invested in hardware to set up test labs with configurations as close to production systems as possible. This gave test engineers a realistic environment in which to experiment and put their new skill set to the test.
This is what led the erstwhile ‘performance test’ to pave the way for ‘performance engineering’. Now the test engineer was not only responsible for designing, creating, and running test scripts, but was also required to analyze and explain why a specific test produced a certain set of results, and whether those were similar to or different from previous runs. This required deep dives into the different layers of the stack.
All was well so far. So, what changed?
Site Reliability Engineering
Fast forward a few years, when ‘online’ things became truly online with the advent of cloud technologies. Companies with ‘on-premise’ systems moved to ‘on-cloud’ systems, where they no longer had to manage much of the back-end infrastructure and could initially focus just on the software. Later, as cloud technologies evolved, many systems were rebuilt to work ‘for the cloud’ and ‘from the cloud’.
Development strategies that had focused on one or two releases a year moved to a monthly cadence, with a new version of the product released each month. Some teams adopted even more aggressive models, releasing new software every week or fortnight.
As more and more features were incorporated into applications, they grew in bulk and became more complex. Many new technologies were brought in to manage this complexity, along with a new buzzword: ‘microservices’. The familiar monoliths could not handle so much functionality running in a single piece, so they were split into manageable chunks. This paved the way for more development teams, often working independently of each other.
This created demand for more testing, and at this pace of evolution, performance tests that had been unhurried and focused on engineering aspects became focused mostly on the testing parts. A lot of automation was built around how the metrics were analyzed and presented.
Complex software in production deployments required continuous, round-the-clock monitoring to keep it alive and kicking and to deal with any issues. So, new roles emerged to manage the operational aspects, escalating issues back to development or test teams when they arose.
This worked well for the most part, but the continuous back-and-forth between teams made problem management tedious, and this gave rise to a new skill set.
The operations engineer was now also expected to know most aspects of performance, understand how things worked, and handle most issues directly. This finally paved the way for the SRE roles that are more common today.
The SRE roles in today’s companies deal not only with in-house testing but also with studying production sites in detail, deriving requirements from usage patterns, and feeding them back to test. They are also responsible for maintaining ‘four nines’ or, in extreme cases, ‘five nines’ availability. This translates to 99.99% or 99.999% availability of the application without any interruptions or downtime.
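It helps to see what those availability targets actually mean in terms of an annual downtime budget. The arithmetic is straightforward: the allowed downtime is the fraction of the year the service may be unavailable.

```python
# Allowed annual downtime implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (ignoring leap years)

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    budget = downtime_minutes_per_year(target)
    print(f"{target}% availability -> {budget:.2f} min/year downtime budget")
```

At 99.99% (‘four nines’) the budget is roughly 52.6 minutes per year; at 99.999% (‘five nines’) it shrinks to about 5.3 minutes, which is why five-nines targets are considered extreme.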
This drove the need for further automation in the form of CI/CD, where minimal time is spent on manual testing: tests run automatically, with equally automated ways of presenting test reports and generating alerts.
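A common shape for such automation is a gate in the pipeline: the load-test results are checked against a service-level threshold, and the build fails if the threshold is breached. The sketch below is a minimal, hypothetical example; the samples, the 200 ms SLO, and the nearest-rank percentile choice are all assumptions for illustration, and a real pipeline would read the samples from the load-test tool’s output and exit non-zero on failure.

```python
import math

SLO_P95_MS = 200.0  # illustrative service-level threshold for 95th percentile

# Hypothetical latency samples (ms) produced by an automated test run.
samples = [102.0, 98.0, 110.0, 95.0, 130.0, 105.0, 99.0, 101.0, 97.0, 250.0]

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value >= pct% of the samples."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

p95 = percentile(samples, 95)
gate_passed = p95 <= SLO_P95_MS

if gate_passed:
    print(f"PASS: p95 {p95:.1f} ms within SLO of {SLO_P95_MS} ms")
else:
    # In a real CI job this branch would call sys.exit(1) to fail the build.
    print(f"FAIL: p95 {p95:.1f} ms exceeds SLO of {SLO_P95_MS} ms")
```

With the sample data above the gate fails, because the single 250 ms outlier lands exactly on the p95 rank; this is the kind of automated verdict that replaces a human reading through a test report.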
In essence, the present-day performance engineer is expected to know aspects of:
- Performance testing / Load testing
- Performance Engineering
- Automation - CI/CD
So, what started as a performance test engineer role grew into a multi-faceted role in a matter of a decade. It is only expected to expand further into a development role, where ‘coding’ becomes a mandatory requirement and performance engineers are expected to write or modify code and fix issues themselves. Although such roles already exist in a few tech-savvy organizations, they may well become a common requirement in the future, further narrowing the distinction between ‘Performance Engineering’, ‘SRE’, and ‘Developer’.