Sunday, September 22, 2019

Starting an SRE (Site Reliability Engineering) Practice

SRE & Your Business 

When considering implementing an SRE department, it's one thing to read Google's SRE book and implement it the way they did in their IT-centric organization.  I've found the implementation to be a bit different in organizations where IT is not the entire driver behind the business. It's likely to be more of a challenge in an environment where IT is considered a utility to the business rather than the business's moneymaker - and frankly, the majority of businesses out there are like this. In my experience, an SRE team can encounter a fair bit of push-back on new 'priorities', 'automation ideas', and 'procedures' from other IT departments within the enterprise that have already established operational procedures and routines. Our team needed to evangelize and prove itself as an SRE department before getting traction in having our agenda considered by other teams.


Members of new SRE teams might be excited about all the opportunities for automation in an organization. As a fledgling SRE team, you might find that much of your early automation ends up revolving around the monitoring and triage of production incidents. After all, the 'R' in SRE stands for Reliability. One of the first things our team did that created value beyond our practice was create monitoring dashboards with tools like AppDynamics and Splunk. These dashboards displayed and alerted on pertinent, inter-team SLOs and SLAs - many of which we derived from production incidents we tracked. Tracking inter-team SLOs was important because we discovered that inter-team incidents/problems fell through the cracks until our SRE team started owning them. Correlating each production incident to a monitor/alert we created helped us ensure: 

  • We could be proactive when similar circumstances pointed toward an analogous incident occurring, and 
  • We could likely pinpoint how to resolve a new issue because it had happened before and had an existing incident ticket with a past resolution. 
By keeping an automated eye on the enterprise's production systems, we discovered further insights into reliability across the organization:
  • We began to anticipate and predict new production incidents. This gave us a leg up on preventing the incident or, at the very least, speeding its resolution. Managers and operations teams found this information valuable, which gave us a foot in the door for further discussions to socialize our agenda. 
  • We began to see correlated alerts across disparate departments associated with specific incidents, which allowed us to drive each problem to a root cause - and in some cases several root causes, with contributing factors spread across configuration, architecture, and code. 
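The alert-to-incident correlation described above can be sketched in a few lines. This is a minimal illustration, not our actual tooling (which lived in AppDynamics, Splunk, and our ticketing system); the service names, alert types, and ticket numbers below are entirely made up:

```python
# Map an alert's signature to the past incidents (and resolutions) that
# motivated its creation. In practice this data came from incident tickets.
ALERT_HISTORY = {
    ("payments-api", "latency_slo_breach"): [
        {"ticket": "INC-1042", "resolution": "Restarted stale connection pool"},
    ],
    ("auth-service", "error_rate_spike"): [
        {"ticket": "INC-0977", "resolution": "Rolled back bad config push"},
    ],
}

def past_incidents(service, alert_type):
    """Return prior incidents matching a firing alert (empty list if none)."""
    return ALERT_HISTORY.get((service, alert_type), [])

# When the payments latency alert fires again, the on-call engineer
# immediately sees the earlier ticket and how it was resolved.
matches = past_incidents("payments-api", "latency_slo_breach")
```

The payoff is exactly the two bullets above: a firing alert either surfaces a known resolution, or (if the lookup comes back empty) tells you this is genuinely new territory.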
Automation from an operations/monitoring perspective definitely trumped any automation we did to improve IT deployments.

Challenges to Reliability 

In an organization where the business is not IT centric, Google's concept of the Error Budget sounded great, but was much more difficult to enforce, or even get any kind of commitment on. The main reason was really quite simple - the business that made the money drove what features/projects went into production and when, no holds barred. The business held the trump card - they made the money - and so long as we could keep things running that paradigm didn't seem like it was going to change. Other challenges to reliability that we ran into early on included:

  • Non-existent or out-of-date documentation, which caused us problems on multiple fronts, including network upgrades and developing/consuming internal APIs for application services. 
  • Configuration management issues. 
  • Issues with data integrity across multiple systems. 
  • Lack of discipline and follow-up surrounding post-mortems.
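For what it's worth, the error-budget concept mentioned above boils down to simple arithmetic. A sketch with illustrative numbers (the SLO target and downtime figures here are made up, not from any real service):

```python
# Error-budget arithmetic: a 99.9% availability SLO over a 30-day window
# leaves 0.1% of the time as the "budget" that incidents and risky
# releases are allowed to burn.
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed downtime for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

budget = error_budget_minutes(0.999)   # ~43.2 minutes per 30 days
consumed = 30.0                        # minutes of downtime so far (example)
remaining = budget - consumed
# Per the SRE book, if `remaining` goes negative, feature launches pause
# until reliability work brings the service back within budget.
```

The arithmetic is trivial; as the paragraph above explains, the hard part in a non-IT-centric business is getting anyone to honor the "launches pause" consequence.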


When high-priority incidents were discovered that weren't owned by any team or department, our team took ownership and drove them through to resolution. This generally meant communicating and working toward resolution with many different teams across the enterprise, helping those teams understand the root cause, and enabling/empowering them to build the fix. This kind of cooperation earned us a good reputation within the organization and helped the teams responsible for the root cause feel like they also 'owned' and contributed to the solution in a positive, empowering way.


Getting positive results (ROI) from a new SRE team doesn't happen overnight. Unless you've handpicked seasoned developers from your existing development/operations teams, it will take time for your new hires (even if they have past SRE experience) to get a grasp on your business and technology domain. Their ramp-up time may depend on how you've structured the team itself. Some questions that might have a bearing on this:

  • Is it an entity on its own, or embedded across the development/operations teams? Google and Facebook use embedded team models. We started with an autonomous team, but began to head towards an embedded model. 
  • Are other teams aware of what the SRE team will be doing and how this new team can potentially help them? 
  • How current is your documentation?

Tuesday, January 8, 2019

The Job That Wasn't There - A Lesson in Applying for Jobs

Hermione (a made-up name to protect her identity) was a student in my fast-track Web Developer class at SAIT in early 2017.  Like a lot of students I get in that course, she was anxious about finding work after receiving her diploma.

One day in between exercises, I showed the class the 'Careers' web pages of several good, local web design companies.  One of them was Critical Mass, a company I had actually consulted with before.  I often recommend this company to students because they have a world-class client list, they do internships, and I have experience with them.  That particular day, they happened to have an opening for a Junior Web Designer, but no posted internship opportunities.  I encouraged the students to apply for the Junior Web Designer opportunity and Hermione challenged me...

"How can we do that when we don't have all the qualifications in their list of requirements?"

I often get this question, and I had an answer. "You need to understand how a company creates a job description.  Many put it together as a list of qualifications for the perfect candidate.  Others will build the job description based on an existing successful employee in the company.  They realize that most applicants won't match all of the qualifications - and this is particularly true in the IT industry."

Hermione digested my answer, and piped up again. "But we're still in school and we have several more weeks before we'd be available to start working!  Does it really make sense to apply now for a position like this?"

"Absolutely!" I replied. "You never know what might come out of an application.  The hiring process for many companies takes several weeks.  There's usually a bunch of interviews for them to schedule and have, and then some planning and logistics around actually bringing the successful applicant aboard.  You never know what will happen out of an application."

She still looked skeptical.  I moved the class onto another exercise and didn't think too much more about it.

Several weeks later, I received the following email from Hermione:

"I took your advice about applying for jobs and I applied at Critical Mass for a Junior Web Developer position knowing that I was NOT qualified and that they probably would never call me back. Guess what? They called me back! They don't think I am ready for the Junior Web Developer position, but they want me to interview for their internship program. The interview is on . Which leads me to the crux of this email. Would you consider being a reference for me? And do you have any advice for this interview?"

I responded:

"Lol Excellent!  Good for you, Hermione.

Certainly I can be a reference (as a teacher) for you.
Probably the best advice I have for your interview is if you don't have the right answer, straight up tell them.  But then also tell them you'll have the answer (or know about whatever they're asking you) tomorrow.  In other words, when you get home, you'll investigate it and get the answers.  
Bring a notepad to the interview and make notes about anything like that (so you look like you mean business).  Come with a couple of questions as well.  Research in advance anything in the job description you don't know about so you feel prepared.  Research the company a bit - know where their office is, ensure you can make it there on time, who are their current clients, some of the history, etc. 
Smile!  I don't know if you read my blog post about that, but smiling is HUGE.  If you can, try and get an interview somewhere else first to practice and get the jitters out (and maybe get a competing offer). 
Hope that helps!  Good luck!"

She replied:

"Thank you! I appreciate the reference and the advice.
I've been panicking a little, I really thought they would never call. I'm scrambling to get my portfolio site updated for the interview, as well as just get prepared in general. I do have a practicum lined up though, so no pressure...sort of."

In the end, Hermione got the internship.  She was nervous going in because she didn't feel entirely qualified.  I told her not to worry and to ask LOTS of questions.  She ended up successfully completing her internship and came out feeling better about it than she expected to.  It was a great lesson for her (and for me, and for all the students I tell this story to) about the opportunities you don't see in the job market.

Friday, July 6, 2018

Facebook Job Interview - Production Engineer

Facebook contacted me on LinkedIn recently looking to fill a 'Production Engineer' role.  I wasn't looking for job offers - they reached out to me.  Apparently they had been doing this a bit, though, targeting DevOps professionals: my buddy at the startup I was recently working with also got contacted.

Having Facebook reach out to you for a job opportunity?  Definitely intriguing.  One of the things on my 'IT Career Bucket List' would be to work at one of 'those' companies.  Google, Facebook, etc.  I thought if nothing else, it would be interesting to see where the hiring process went, since in the past they didn't actively recruit unknowns like me. 

I replied that I'd be interested to know more and so a phone interview was arranged with the Facebook technical recruiter.  At the appointed time (a couple days later) he called and I got the skinny on how things would potentially work.


The phone interview was the first step.  After that, I'd provide them with my resume, which some of their technical leads would look at.  If they felt I was a fit based on my resume, there would be two remote technical screens - one focused specifically on the Linux OS, and the other on a coding language of my choice.  They wanted someone who was very comfortable with both.  Over the phone they gave me some sample questions - basic Linux commands - to give me an idea of what the screens would be like.  If I managed to gain their approval in the technical interviews, Facebook would then fly me to the location of my choice (Menlo Park or Seattle) for face-to-face talks - both technical and otherwise.  Following that, if I was still up to snuff, I'd get an offer.  Once hired, there would be a 6-week on-site 'boot camp' where I'd get trained in all things Facebook and brought up to speed on the technical ins and outs of the team I'd be working with.

Getting hired would require me to move to Seattle or Menlo Park.  Real estate at both locations is exorbitant - like ridiculous.  In the event that I was hired, Facebook would offer me a full relocation package, with the potential for a temporary housing situation for 3-4 months while we looked for a permanent residence.  I was told that many employees commute from communities with more reasonable housing prices using Facebook commuter buses that have complimentary drinks and WiFi.  Health and dental would be covered 100% for myself and my dependents.  Wednesdays are an optional work-from-home day, and Facebook offers 21 days of vacation per year - although I neglected to confirm whether that was 21 business days (over 4 weeks) or 3 weeks all in.  I'm a Canadian, so I asked about a work visa.  He replied that Facebook has an immigration/legal department that would arrange a TN1 visa for me, in the event that I was hired.

As far as the job itself - Facebook's version of the Production Engineer role is essentially described here.  At the time I was talking with them, Facebook had 42 teams each responsible for a particular feature in their system (FB messenger, Ads, Newsfeeds, etc.).  These teams consist of 4-5 developers with an embedded production engineer.  Everyone on the team goes on call, one week at a time, so it ends up being a 4-5 week on-call rotation.

In the end, after viewing my resume they decided not to pursue the hiring process with me further at this time.  They had been interviewing 'a lot of strong candidates recently that they felt were a stronger match for their immediate needs.'  I can't say I was heartbroken.  It would have been a big move for us that would have put pressure on me personally and financially - not to mention having one dependent in university and one in high school.

I have been pondering what it was about my resume that flagged it to the hiring managers.  Was it the fact that I've moved jobs every 2-3 years, and they had concerns that I wouldn't last long at FB?  Perhaps it was a perceived lack of focus in technology - moving contracts every 2-3 years means learning lots of new technology and never getting a chance to really specialize.  I've asked the FB recruiter - we'll see what he comes back with.  (You might be wondering why I've moved jobs every 2-3 years...  It's prudent for independent business contractors like myself to 'keep moving' from a Canadian taxation perspective.)

Saturday, September 9, 2017

My Path to the AWS Certified Solutions Architect - Associate Exam

It's been a while since I've tried to get any kind of industry certification. Life was busy. I was consulting full-time and had clients to support after work as well. Why would I even bother with all the study? Does a certification amount to much anymore? Lately, several things converged to motivate me to write this cert exam...

  1. My resume was getting stale. While I was able to keep a steady stream of contracts going, many of them used older technologies. This concerned me.
  2. I got a new gig at a start-up that gave me a hands-on opportunity to work with newer technology in the cloud. After setting up the infrastructure and CI/CD for several QA environments and geo-load balanced UAT and PRD environments in the Google Cloud Platform, a requirement for hosting our data in Canada became paramount. AWS had recently launched data centres in Canada, and so in May we decided to migrate our infrastructure there. Having successfully completed that migration I wondered how easy it would be for me to follow up with a Solutions Architect certification.
  3. A couple of guys at work got me turned on to online certification courses. Catching some sales, I was able to get a couple of courses on AWS certification for $15 each (regularly they are over $150 each). One course was by A Cloud Guru for the AWS SysOps certification. The other was by the Linux Academy for the AWS Solutions Architect certification. These two courses gave me a good foundation for the material that is covered on the exam.
I originally did the SysOps course first, as I thought that course was more in line with my job description at work. Finishing that course (leaving its practice test for later), I took a look at the practice questions on AWS here and felt like I'd come up short if I wrote the exam. That's when I decided to switch gears and do the Solutions Architect course and exam. Most people online recommend doing that one first.

Incidentally, both courses helped me get a better grasp of AWS Best Practices, and I was able to implement several improvements to our infrastructure at work because of what I learned. After another week and a half of going through the Solutions Architect course and reviewing the material, I felt more confident. I took the practice tests from both courses and passed them with good marks, so I thought I was ready. I scheduled my exam for the next week and also, for good measure, purchased a 20-question 'official practice exam'.

I had the week off during which I was scheduled to write my exam. Thursday was the big day. Tuesday morning I wrote the 'official practice exam', which was full of scenario-based questions - and got 60% - A FAIL! Apparently the passing mark floats a bit between 62 and 66% (go figure). The practice exams in the courses seemed a bit easier. One of the things I hadn't realized was how heavily the different 'domains' for the exam were weighted.

In my 'official practice exam' I had scored:
1.0 Designing highly available, cost-efficient, fault-tolerant, scalable systems: 50%
2.0 Implementation/Deployment: 100%
3.0 Data Security: 75%
4.0 Troubleshooting: 50%

Clearly my 100% in implementation/deployment wasn't going to help me much with the domains balanced like that. I had some work to do!
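The reason a perfect score in one domain can't carry the exam is that the overall mark is a weighted average across domains. Here's the arithmetic with hypothetical weights (these numbers are made up for illustration; AWS publishes the real domain weightings in the official exam guide):

```python
# Weighted-average exam scoring with illustrative (not official) weights.
# With the heavily weighted design domain at 50%, a perfect
# implementation/deployment score barely moves the overall mark.
weights = {"design": 0.6, "implementation": 0.1, "security": 0.2, "troubleshooting": 0.1}
scores  = {"design": 0.50, "implementation": 1.00, "security": 0.75, "troubleshooting": 0.50}

overall = sum(weights[d] * scores[d] for d in weights)
# 0.6*0.50 + 0.1*1.00 + 0.2*0.75 + 0.1*0.50 = 0.60 -> a failing ~60%
```

The takeaway: study time is best spent on the heavily weighted domains, which is exactly what the forum notes below pushed me toward.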

A Cloud Guru has a forum specifically for the exam. I pored over it, specifically looking at posts where people who've written the exam share their experiences and what they wished they had studied. I made notes of what I didn't know from those posts, and I also went over a lot of official AWS documentation for specific services (mostly FAQs and Best Practices) and took notes from that, too. And then I studied hard. Six pages of 8-point text. I kept adding things to those notes as well.

Thursday rolled around and I wrote the exam. I tried my best. The exam is 55 multiple-choice questions and you have 80 minutes to complete it. Some of the questions I knew the answers to, hands down. Others (like 'choose two correct answers') got a bit dicey. I chose the answers that made the most sense to me. I didn't feel like I was blind-sided by any of the questions, though there were definitely some things I (still) could have studied more (AD/user federation types, for example). I finished answering the questions with about 15 minutes to spare. I had flagged some questions, so I went back, reviewed them, and changed a couple of answers. When all was said and done, I passed with 72% - not great, but good enough for a certification. 

Here's how things panned out:
1.0 Designing highly available, cost-efficient, fault-tolerant, scalable systems: 75%
2.0 Implementation/Deployment: 80%
3.0 Data Security: 55%
4.0 Troubleshooting: 80%
I was quite happy with the improvements I'd made in the first and last domains.
Apparently there is quite a bit of content cross-over between the AWS Solutions Architect - Associate exam and the other two associate certifications: SysOps and Developer. Potentially I could study a bit and write those fairly soon. However, it's $150 USD to write each exam, and I'm wondering what the difference is in having the three certs versus just the one. I'd be interested to hear your thoughts. Would it be worth the extra $300 USD? Personally, I'm content to take a break from studying for now.

Saturday, August 20, 2016

Google Cloud Platform & Stackdriver Monitoring - First Impressions

I've been working with the Google Cloud Platform at work for a little over a month now, and I'm getting comfortable with it.  I haven't done an exact price comparison with EC2/AWS, but apparently it's a bit cheaper.  Based on my last 6 weeks of experience, I'm pretty happy with the pricing so far.  Static IP addresses are free, and their dashboard tells you where you can make hosting optimizations to save money, which I think is great.  There's lots of good documentation, and the libraries for adding software are full-featured and work well.  While GCP may not be as full-featured as AWS, I find their menus and dashboards easier and more intuitive to navigate than AWS's.

Hiccups I ran into:
- I can't dynamically add CPU or memory.  I've got to stop the instance and restart it.  Same with disk space.  I'd also like to be able to make some of my disks smaller, but I can't seem to do that either without some kind of reboot.
- If an instance has a secondary SSD drive, there seem to be issues with discovering it on a 'clone' or a restart after a memory or CPU change.  I have to log in through the back end (fortunately GCP provides an interface for this) and comment out the reference to the secondary drive in the /etc/fstab file to get it going again.  This seems buggy.

I spent a couple of days this week configuring Google Stackdriver Monitoring, and after a couple of frustrations with not being sure how to get started, I installed an agent on a server and I was off to the races.  Installing agents is definitely the way to get going.  I found that it was easiest to create Uptime Monitors and associated Notifications at the instance level, on a server-by-server basis.  Doing it this way allowed me to group the Uptime Monitors to the instance, something I couldn't do when I created them outside of the instance.  It was simple to monitor external dependency servers that we don't own but need running for our testing.  I integrated my Notifications with HipChat in a snap.  I also installed plugins for monitoring Postgres and MySQL - these worked great so long as I had the users/roles/permissions set correctly.  I'm super impressed with Stackdriver Monitoring, and will probably use it even if we have to switch over to hosting with AWS.

Our biggest roadblock with the Google Cloud Platform currently is that they don't have a datacentre in Canada.  That could be a big selling feature for many of my clients, and because they don't have it, we may have to consider other options (like AWS/EC2, which is spinning up a new data centre in Montreal towards the end of 2016...)  Privacy matters!  Hope you're listening, Google...

Thursday, April 7, 2016

EC2 and the AWS (Amazon Web Services) Free Tier - My First Experience

The Amazon Web Services Management Console
I had my first go-round with EC2 in AWS this last week in a real-life context.  I was teaching a class on Content Management Systems at SAIT, and I wanted my students to experience what it's like to install WordPress and MySQL on a Linux VM.  I also wanted to get some personal experience with AWS, so I thought 'why not kill two birds with one stone?'

About 6 weeks ago I had purchased Amazon Web Services IN ACTION, written by Andreas and Michael Wittig and published by Manning.  It gave me a great primer on setting up my AWS account, my billing alert, and my first couple of VMs.  I leveraged that experience, crossed my fingers, spun up 18 VMs for my students, and hoped I wouldn't get charged a mint for having them running 24/7 for a few days.  It was a Friday around 1pm when I created them and gave them to my students to use.  Imagine my surprise when I checked my billing page in AWS on Monday and discovered they had only charged me $2.81!

I had clearly reached some kind of threshold, as after that day I got charged, on average, about $8/day - for all 18 VMs.  They were charging me 2 cents per hour per VM, plus some small data transfers.  Granted, I used 'T1 Micro' instances - with 1 CPU, 0.613 GiB memory, and 8 GiB storage.  Still, I was quite happy.
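As a sanity check, the compute charge alone works out close to that daily figure (using the per-VM rate quoted above):

```python
# Back-of-envelope check on the daily EC2 cost:
# 18 VMs at $0.02/hour, running around the clock.
vms, rate_per_hour, hours_per_day = 18, 0.02, 24
daily_cost = vms * rate_per_hour * hours_per_day   # dollars per day
# That's $8.64/day in compute; the observed "about $8/day" average fits,
# with small data-transfer charges varying day to day.
```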

On the last day of class, I split the students up into two teams, gave them large web projects to do, and spun up a 'T1 Micro' for each team.  I gave them some scripts they could run to create some swap memory if they needed it.  They ran those scripts right away, and within an hour (with 9 people uploading files and content into those systems continuously) those T1 Micros' CPUs pinned.  So I quickly imaged them over lunch, spun up an 'M3 Large' VM (2 CPUs, 7.5 GiB memory) for each team, and threw their images on.  I ran into one issue spinning up the new team VMs - I had to spin down the original team VMs first, because there is a limit/quota of 20 VMs on the 'free tier' in AWS.  Aside from that and the changed IP addresses, the transition was seamless.  I was a happy camper, and now that the students had responsive VMs, so were they.

My total bill for the week - $38!  A colleague pointed out that I probably could have added auto-scaling to those team VMs and increased the CPU and memory in place, without losing the current IPs.  He's right, I probably could have, but I didn't have the experience and didn't want to waste class time (and potentially lose the students' work) by trying something I didn't know how to do.  All in all, I was very impressed with my first run of EC2 in AWS.  It was very reasonable, responsive, and easy to use.  I'd definitely do it again.

Monday, March 28, 2016

The Perils and Pitfalls of Open-Source Software

Open-source software is the backbone of most of the internet.  Seriously.  For more than a decade, small-business websites and mainstream web applications alike have used open-source software to develop their solutions.  Web projects (and frankly, most software projects in general) that don't have a dependency of some kind on an open-source component are the RARE exception to the rule.

Should we be concerned about this?  I think so.  Here's why:
  1. Coders aren't implementing open-source code properly.  In my experience, any open-source code dependencies should be referenced locally.  Many coders fail to do this and reference external code libraries in their code.  What happens when that external 'point of origin' has a DNS issue, or is hit with a DDoS attack, or is just taken down?  Your site can break.

    Case in point - check out this article on 'How One Programmer Broke the Internet'.  In a nutshell, one open-source programmer got frustrated with a company over a trademark name issue.  This developer's open-source project had the same name as a messaging app from a company in Canada.  He ended up retaliating by removing his project from the web.  It turned out his project had been leveraged by millions of coders the world over for their websites, and once his code was removed, their builds displayed this error:
             npm ERR! 404 'left-pad' is not in the npm registry

    I believe that if developers had downloaded the npm JavaScript libraries and referenced them locally on their servers, they wouldn't have run into this issue (as they'd have had a local copy of the open-source code).
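    One way to catch this class of problem is to audit your pages for externally hosted scripts. Here's a rough, illustrative Python sketch - a regex-based check, not production-grade HTML parsing, and the file names in the sample page are made up:

```python
# Quick-and-dirty audit for the "reference locally" advice: flag any
# <script> tag whose src points at another origin instead of a local path.
import re

def external_scripts(html):
    """Return script src values that reference an external origin."""
    srcs = re.findall(r'<script[^>]+src=["\']([^"\']+)["\']', html, re.I)
    return [s for s in srcs if s.startswith(("http://", "https://", "//"))]

page = '''
<script src="/js/vendor/left-pad.js"></script>
<script src="https://cdn.example.com/left-pad.js"></script>
'''
risky = external_scripts(page)   # only the CDN reference is flagged
```

    Anything the check flags is a dependency your site loses the moment that external host goes away.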

    Another case in point - I worked on a project a number of years ago that had dependencies on an open-source site that, at the time, was hosting a bunch of libraries and schemas for large open-source projects like Struts and Spring.  Some of the code in those projects had linked references to schemas (rules) hosted on that site.  Unfortunately, we didn't think of changing those links, and one day the site went down...  and with it went our web site, because it couldn't reference those simple schemas.  After everything recovered, we quickly downloaded those schemas (and any other external references we found) and referenced them locally.
  2. Security.  Do you know what is in that open-source system you're using?  Over 60 million web sites have been built on an open-source content management system called WordPress.  Because it is open source, everyone can see the code - all the code - that it is built with.  This could potentially allow hackers to spot weaknesses in the system.  However, WordPress has pretty strict code review and auditing in place to guard against this.  They also patch any issues that are found quickly, and release those patches to everyone.  The question then becomes: does your website administrator patch your CMS?

    Another issue I ran into related to security was with a different Content Management System.  I discovered an undocumented 'back door' buried in the configuration that gave anyone who knew about it administrative access to the system (allowing them to log into the CMS as an Administrator and giving them the power to delete the entire site if they knew what they were doing).  Some time later, I found out that some developers who had used this CMS weren't aware of that back door and had left it open.  I informed them about it, and they quickly (and nervously) slammed it shut.  Get familiar with the code you are implementing!
  3. Code Bloat (importing a bunch of libraries for one simple bit of functionality).  Sometimes developers will download a large open-source library to take advantage of a subset of its functionality and save time.  Unfortunately, this can lead to code bloat - your application runs slowly because it's loading a monster library to use one small piece of functionality.
  4. Support (or the lack of it)  Developers need to be discerning when they decide to use an open-source library.  There are vast numbers of open-source projects out there, but one needs to be wise about which one to use.  Some simple guidelines to choosing an open-source project are:
    • How many downloads (implementations) does it have?  The more, the better, because that means it's popular and likely reviewed more.
    • Is there good support for it?  In other words, if you run into issues or errors trying to use it, will it be easy to find a solution for your issue in forums or from the creators?  
    • Is it well documented?  If the documentation is thorough, or if there have been published books written about the project, you're likely in good hands.
    • Is it easy to implement?  You don't want to waste your time trying to get your new open-source implementation up and working.  The facilities and resources are out there for project owners to provide documentation or VM snapshots (or what have you) to make setting up your own implementation quick and easy.
    • How long has it been around?  Developers should wait to implement open-source projects with a short history.  Bleeding-edge isn't always the cutting edge.  Wait for projects to gain a critical mass in the industry before implementing them if you can help it.