Sunday, September 22, 2019

Starting an SRE (Site Reliability Engineering) Practice

SRE & Your Business 

When considering an SRE department, it's one thing to read Google's SRE book and implement the practice the way they did in their IT-centric organization. I've found the implementation to be quite different in organizations where IT is not the primary driver of the business. It's likely to be more of a challenge in an environment where IT is considered a utility to the business rather than the moneymaker - and frankly, the majority of businesses out there are like this. In my experience, an SRE team can encounter a fair bit of push-back on new 'priorities', 'automation ideas', and 'procedures' from other IT departments within the enterprise that have already established operational procedures and routines. Our team needed to evangelize and prove itself as an SRE department before other teams would give our agenda real consideration.

Monitoring 

Members of new SRE teams might be excited about all the opportunities for automation in an organization. As a fledgling SRE team, you may find that much of your early automation actually revolves around the monitoring and triage of production incidents. After all, the 'R' in SRE stands for Reliability. One of the first things our team did that created value beyond our own practice in the enterprise was build monitoring dashboards from tools like AppDynamics and Splunk. These dashboards displayed and alerted on pertinent inter-team SLOs and SLAs - many of which we derived from production incidents we tracked. Tracking inter-team SLOs was important because we discovered that inter-team incidents/problems fell through the cracks until our SRE team started owning them. Correlating each production incident to a monitor/alert we created (a rough sketch of this mapping follows the list below) helped us ensure: 

  • We could be proactive when similar circumstances pointed toward another, analogous incident occurring, and 
  • We could likely pinpoint how to resolve the new issue, because it had happened in the past and had an existing incident ticket and a documented resolution. 
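To make that mapping concrete, here's a minimal sketch in Python, with hypothetical names and data - our real alerts lived in AppDynamics and Splunk, and the tickets in our ITSM system. The idea is simply that every alert carries its own incident history:

    from dataclasses import dataclass, field

    @dataclass
    class Incident:
        ticket_id: str    # e.g. the ITSM ticket number
        summary: str
        resolution: str   # how it was fixed last time

    @dataclass
    class Monitor:
        alert_name: str        # name of the alert/dashboard panel
        slo_description: str   # the inter-team SLO/SLA it watches
        past_incidents: list = field(default_factory=list)

    # Every production incident gets tied to a monitor, so a firing alert
    # immediately surfaces prior occurrences and their resolutions.
    registry = {}

    def register_incident(alert_name, slo_description, incident):
        monitor = registry.setdefault(alert_name, Monitor(alert_name, slo_description))
        monitor.past_incidents.append(incident)

    def on_alert_fired(alert_name):
        monitor = registry.get(alert_name)
        if monitor is None or not monitor.past_incidents:
            print(f"{alert_name}: no prior incidents on record - triage from scratch.")
            return
        print(f"{alert_name} (SLO: {monitor.slo_description}) has prior history:")
        for inc in monitor.past_incidents:
            print(f"  {inc.ticket_id}: {inc.summary} -> {inc.resolution}")

    # Made-up example
    register_incident(
        "order-api-latency-p99",
        "99% of order API calls complete in under 2 seconds",
        Incident("INC-10423", "Order API p99 latency breach", "increased DB connection pool size"),
    )
    on_alert_fired("order-api-latency-p99")

The point isn't the data structure - it's that when an alert fires, the triage starts from the past resolution rather than from scratch.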
By keeping an automated eye on the enterprise's production systems, we uncovered further insights into reliability across the organization:
  • We began to anticipate and predict new production incidents. This gave us a leg up in preventing the incident, or at the very least, in speeding up its resolution. Managers and operations teams found this information valuable, which gave us a foot in the door for further discussions to socialize our agenda. 
  • We began to see correlated alerts across disparate departments associated with specific incidents, which allowed us to drive the problem to a root cause, and in some cases to several root causes with contributing factors spread across configuration, architecture, and code (see the sketch below). 
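As a rough illustration of that cross-department correlation (a sketch with made-up alert data, not our production tooling), simply grouping alerts that fire close together in time is often enough to surface a single underlying incident spanning several teams:

    from datetime import datetime, timedelta

    # Hypothetical alert stream: (timestamp, owning team, alert name)
    alerts = [
        (datetime(2019, 9, 20, 3, 12), "network", "core-switch-packet-loss"),
        (datetime(2019, 9, 20, 3, 14), "app-services", "order-api-error-rate"),
        (datetime(2019, 9, 20, 3, 15), "data", "warehouse-replication-lag"),
        (datetime(2019, 9, 20, 9, 40), "app-services", "order-api-error-rate"),
    ]

    def correlate(alerts, window=timedelta(minutes=10)):
        """Group alerts firing within `window` of each other into candidate incidents."""
        groups, current = [], []
        for ts, team, name in sorted(alerts):
            if current and ts - current[-1][0] > window:
                groups.append(current)
                current = []
            current.append((ts, team, name))
        if current:
            groups.append(current)
        return groups

    for group in correlate(alerts):
        teams = {team for _, team, _ in group}
        if len(teams) > 1:
            print("Possible cross-team incident involving:", ", ".join(sorted(teams)))
            for ts, team, name in group:
                print(f"  {ts:%H:%M} [{team}] {name}")

In practice the correlation keys were richer than a time window (shared hosts, transaction IDs, and so on), but even a naive grouping like this points several teams at one root cause instead of generating three separate tickets.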
Automation from an operations/monitoring perspective definitely trumped any automation we did to improve IT deployments.

Challenges to Reliability 

In an organization where the business is not IT-centric, Google's concept of the Error Budget sounded great, but was much more difficult to enforce, or even to get any kind of commitment on (the basic error budget arithmetic is sketched after the list below). The reason was quite simple - the business that made the money decided which features/projects went into production and when, no holds barred. The business held the trump card, and as long as we could keep things running, that paradigm didn't seem like it was going to change. Other challenges to reliability that we ran into early on included:

  • Non-existent or out-of-date documentation, which caused us problems on multiple fronts, including network upgrades and developing/consuming internal APIs for application services. 
  • Configuration management issues. 
  • Issues with data integrity across multiple systems. 
  • Lack of discipline and follow-up surrounding post-mortems. 
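For context on what we were trying to enforce, the error budget arithmetic itself is simple. Here's a sketch using a hypothetical 99.9% availability SLO over a 30-day window - the numbers are illustrative, not targets we actually had agreement on:

    # Hypothetical 99.9% availability SLO over a 30-day window
    slo = 0.999
    window_minutes = 30 * 24 * 60                 # 43,200 minutes in the window

    error_budget = (1 - slo) * window_minutes     # ~43.2 minutes of allowed downtime
    downtime_so_far = 30                          # minutes of downtime already incurred

    remaining = error_budget - downtime_so_far
    print(f"Error budget: {error_budget:.1f} min, remaining: {remaining:.1f} min")
    print(f"Budget consumed: {downtime_so_far / error_budget:.0%}")

    # In Google's model, once the budget is spent, feature launches pause until
    # reliability recovers. In our shop the business kept shipping regardless,
    # which is exactly the enforcement gap described above.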

Collaboration 

When high-priority incidents were discovered that weren't owned by any team or department, our team took ownership of them and drove them through to resolution. This generally meant communicating and working with many different teams across the enterprise, helping them understand the root cause, and enabling them to build the fix. This kind of cooperation earned us a good reputation within the organization and helped the teams responsible for the root cause feel like they also 'owned' and contributed to the solution in a positive, empowering way.

Expectations 

Getting positive results (ROI) from a new SRE team doesn't happen overnight. Unless you've handpicked seasoned developers from your existing development/operations teams, it will take time for your newbies (even if they have past SRE experience) to get a grasp on your business and technology domain. Their ramp-up time may depend on how you've structured the implementation of the team itself. Some questions that might have a bearing on this:

  • Is it an entity on its own, or embedded across the development/operations teams? Google and Facebook use embedded team models. We started with an autonomous team, but began to head towards an embedded model. 
  • Are other teams aware of what the SRE team will be doing and how this new team can potentially help them? 
  • How current is your documentation?
