Sunday, September 22, 2019

Starting an SRE (Site Reliability Engineering) Practice

SRE & Your Business 

When considering implementing an SRE department, it's one thing to read Google's SRE book and implement it the way they did in their IT-centric organization. I've found the implementation to be a bit different in organizations where IT is not the main driver of the business. It is likely to be more of a challenge in an environment where IT is treated as a utility to the business rather than as the moneymaker, and frankly, the majority of businesses out there are like this. In my experience, an SRE team can encounter a fair bit of push-back on new 'priorities', 'automation ideas', and 'procedures' from other IT departments within the enterprise that already have established operating procedures and routines. Our team needed to evangelize and prove itself as an SRE department before our agenda got any traction with other teams.

Monitoring 

Members of new SRE teams might be excited about all the opportunities for automating in an organization. As a fledgling SRE team, you might actually find that much of your early automation ends up revolving around the monitoring and triage of production incidents. After all, the 'R' in SRE stands for Reliability. One of the first things our team did that created value beyond our own practice was to build monitoring dashboards with tools like AppDynamics and Splunk. These dashboards displayed and alerted on pertinent inter-team SLOs and SLAs, many of which we derived from production incidents we tracked. Tracking inter-team SLOs was important because we discovered that inter-team incidents and problems fell through the cracks until our SRE team started owning them. Correlating each production incident to a monitor/alert we created helped us ensure: 

  • We could be proactive if similar circumstances aligned towards another analogous incident occurring, and 
  • We could likely pinpoint how to resolve the new issue because it had happened in the past and had an existing incident ticket and past resolution. 
Keeping an automated eye on the enterprise systems in production, we discovered more insights intrinsic to reliability in the organization:
  • We began to anticipate and predict new production incidents. This gave us a leg up on preventing the incident or, at the very least, on speeding up its resolution. Managers and operations teams found this information valuable, which gave us a foot in the door for further discussions to socialize our agenda. 
  • We began to see correlating alerts across disparate departments associated with specific incidents, which allowed us to drive the problem to a root cause, and in some cases, several root causes with contributing factors in disparate configuration, architecture, and code. 
Automation from an operations/monitoring perspective definitely trumped any automation we did to improve IT deployments.
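
One way to picture the incident-to-alert pairing described in this section is a small registry in which every monitor carries the ticket of the incident it was derived from. The sketch below is illustrative only: the Monitor class, the metric names, the thresholds, and the ticket IDs are all hypothetical, and in practice the metric values would come from whatever monitoring tool is in use rather than a hand-built dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Monitor:
    """A monitor tied to the past incident it was derived from (hypothetical model)."""
    name: str
    breached: Callable[[Metrics], bool]  # returns True when the alert should fire
    incident_ticket: str                 # ticket of the original incident
    past_resolution: str                 # how that incident was resolved


def evaluate(monitors: List[Monitor], metrics: Metrics) -> List[Monitor]:
    """Return every monitor whose condition is breached by the current metrics."""
    return [m for m in monitors if m.breached(metrics)]


# Hypothetical monitors; names, thresholds, and ticket IDs are made up.
MONITORS = [
    Monitor(
        name="checkout-latency",
        breached=lambda m: m.get("checkout_p95_ms", 0.0) > 2000.0,
        incident_ticket="INC-0001",
        past_resolution="Recycled the connection pool on the payment gateway.",
    ),
    Monitor(
        name="order-queue-backlog",
        breached=lambda m: m.get("order_queue_depth", 0.0) > 500.0,
        incident_ticket="INC-0002",
        past_resolution="Scaled out the queue consumers.",
    ),
]

if __name__ == "__main__":
    current = {"checkout_p95_ms": 2300.0, "order_queue_depth": 12.0}
    for monitor in evaluate(MONITORS, current):
        # Surface the earlier ticket and fix alongside the alert itself.
        print(f"{monitor.name}: see {monitor.incident_ticket} - {monitor.past_resolution}")
```

Keeping the ticket and resolution next to the alert condition is what makes both benefits listed above possible: the alert fires before the incident fully repeats, and whoever responds already has the previous fix in hand.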

Challenges to Reliability 

In an organization where the business is not IT-centric, Google's concept of the Error Budget sounded great, but it was much more difficult to enforce, or even to get any kind of commitment on. The main reason was really quite simple - the business that made the money drove what features/projects went into production and when, no holds barred. The business held the trump card - they made the money - and so long as we could keep things running, that paradigm didn't seem like it was going to change. Other challenges to reliability that we ran into early on included:

  • Either non-existent or out of date documentation which caused us problems on multiple fronts, including network upgrades and developing/consuming internal APIs for application services. 
  • Configuration management issues 
  • Issues with data integrity across multiple systems 
  • Lack of discipline and follow-up surrounding postmortems 
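
For readers who haven't met the Error Budget mentioned at the start of this section: in the usual formulation it is simply the amount of unreliability an SLO permits over a given window. The sketch below shows only that arithmetic. The function names, the 99.9% target, and the 30-day window are assumptions picked for illustration, not figures from our organization.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime the SLO tolerates over the window, e.g. 0.999 means 99.9% availability."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return window * (1.0 - slo_target)


def remaining_budget(slo_target: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after the downtime already incurred; negative means overspent."""
    return error_budget(slo_target, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)  # illustrative window
    print("budget:", error_budget(0.999, window))  # roughly 43 minutes
    # 50 minutes of downtime against that budget leaves a negative remainder.
    print("remaining:", remaining_budget(0.999, window, timedelta(minutes=50)))
```

In the model Google describes, a negative remainder is the signal to slow feature releases in favor of reliability work, which is exactly the commitment that was hard to get when the business side controlled the release schedule.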

Collaboration 

Once high priority incidents were discovered that weren't owned by teams or a department, our team took ownership of those incidents and drove them through to resolution. This generally meant communicating and working towards the resolution with many different teams across the enterprise, and helping those teams understand the root cause and enabling/empowering those teams to build the fix & resolution. This kind of cooperation allowed us to get a good reputation within the organization and helped the teams responsible for the root cause feel like they also 'owned' and contributed to the solution in a positive, empowering way.

Expectations 

Getting positive results (ROI) from a new SRE team doesn't happen overnight. Unless you've handpicked seasoned developers from your existing development/operations team, it will take time for your newbies (even if they have past experience as an SRE) to get a grasp on your business & technology domain. Their ramp-up time may depend on how you've structured the implementation of the team itself. Some questions that might have a bearing on this:

  • Is it an entity on its own, or embedded across the development/operations teams? Google and Facebook use embedded team models. We started with an autonomous team, but began to head towards an embedded model. 
  • Are other teams aware of what the SRE team will be doing and how this new team can potentially help them? 
  • How current is your documentation?

Tuesday, January 8, 2019

The Job That Wasn't There - A Lesson in Applying for Jobs

Hermione (made up name to protect her identity) was a student in my fast-track Web Developer class at SAIT in early 2017.  Like a lot of students I get in that course, she was anxious about obtaining work after receiving her diploma.

One day in between exercises, I showed the class several 'Careers' web pages of good, local web design companies.  One of those companies was Critical Mass, a company I had actually consulted with before.  I often recommend this company to students because they have a world-class client list, they do internships, and I have experience with them.  That particular day, they happened to have an opening for a Junior Web Designer, but no posted internship opportunities.  I encouraged the students to apply for the Junior Web Designer opportunity, and Hermione challenged me...

"How can we do that when we don't have all the qualifications in their list of requirements?"

I often get this question, and I had an answer. "You need to understand how a company creates a job description.  Many put it together as a list of qualifications for the perfect candidate.  Others will build the job description based on an existing successful employee in the company. They realize that most of the applicants won't match all of the qualifications - and this is particularly true in the IT industry."

Hermione digested my answer, and piped up again. "But we're still in school and we have several more weeks before we'd be available to start working!  Does it really make sense to apply now for a position like this?"

"Absolutely!" I replied. "You never know what might come out of an application.  The hiring process for many companies takes several weeks.  There's usually a bunch of interviews for them to schedule and have, and then some planning and logistics around actually bringing the successful applicant aboard.  You never know what will happen out of an application."

She still looked skeptical.  I moved the class onto another exercise and didn't think too much more about it.

Several weeks later, I received the following email from Hermione:

"I took your advice about applying for jobs and I applied at Critical Mass for a 
Junior Web Developer position knowing that I was NOT qualified and that they probably never call me back. Guess what? They called me back! They don't think I am ready for the Junior Web Developer position, but they want me to interview for their internship program. The interview is on . Which leads me to the crux of this email. Would you consider being a reference for me? And do you have any advice for this interview?"

I responded:

"Lol Excellent!  Good for you, Hermione.

Certainly I can be a reference (as a teacher) for you.
Probably the best advice I have for your interview is if you don't have the right answer, straight up tell them.  But then also tell them you'll have the answer (or know about whatever they're asking you) tomorrow.  In other words, when you get home, you'll investigate it and get the answers.  
Bring a notepad to the interview and make notes about anything like that (so you look like you mean business).  Come with a couple of questions as well.  Research in advance anything in the job description you don't know about so you feel prepared.  Research the company a bit - know where their office is, ensure you can make it there on time, who are their current clients, some of the history, etc. 
Smile!  I don't know if you read my blog post about that, but smiling is HUGE.  If you can, try and get an interview somewhere else first to practice and get the jitters out (and maybe get a competing offer). 
Hope that helps!  Good luck!"

She replied:

"Thank you! I appreciate the reference and the advice.
I've been panicking a little, I really thought they would never call. I'm scrambling to get my portfolio site updated for the interview, as well as just get prepared in general. I do have a practicum lined up though, so no pressure...sort of."

In the end, Hermione got the internship.  She was nervous going in because she didn't feel entirely qualified.  I told her not to worry and to ask LOTS of questions.  She ended up successfully completing her internship and came out feeling better about it than she expected to.  It was a great lesson for her (and for me, and for all the students I tell this story to) that there are opportunities in the job market that you don't see posted.