Site reliability and why you should plan to fail in advance

“You have failed me for the last time.” If only Admiral Ozzel had known about site reliability engineering, aka SRE, and planning for failure, he might have been spared by Darth Vader.

Now I’m not sure everybody’s favourite Sith Lord would go for breaking things on purpose. He prefers to keep things on schedule. But it is something you, as rebels, would do well to try. Done regularly, it helps make your IT far more resilient to failure.

You know the old adage, “fail to prepare, prepare to fail”? Well in the case of cloud computing you really need to start at the opposite end. Yes, I mean prepare to fail.

Site reliability engineering needs a failure plan

Did you know that the cloud is designed to allow failures? If not, then you have learned something already. But seriously it is. And that’s why you need to prepare a detailed plan to mitigate failure.

The great thing about cloud computing is that cloud infrastructure offers an efficient and responsive way to design for higher reliability. It’s this concept which provides the backbone to SRE.

SRE, as many of you might know, is the brainchild of Google’s Vice President of Engineering Ben Treynor. As it’s still a growing movement, even Treynor himself hasn’t published a single-sentence definition.

In brief, site reliability engineering seeks to effectively end the age-old conflicts between Development and Operations. Fundamentally, it encourages product reliability, accountability, and innovation. But without the bickering. You can read all about it in the free, downloadable ebook written by members of Google’s SRE team. It’s highly recommended bedtime reading.

Break things, then fix them

A big part of making site reliability work for you is planning for failure. High-availability (HA) patterns and disaster recovery (DR) plans can be prepared and put into action. You can even plant failures in your production site(s). This is a popular way to check your reflexes and your plans for increased reliability.

Just break it (on purpose) and watch the impact. Make sure you record any issues and use this log to improve your plan for business continuity.
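
To make the idea concrete, here is a minimal sketch in Python of a "break it and watch" experiment. It is only a toy simulation, not a real fault-injection tool: the Replica and Experiment classes, the replica names, and the rule that a request succeeds while at least one replica is healthy are all assumptions made up for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A hypothetical service instance that is either healthy or broken."""
    name: str
    healthy: bool = True


@dataclass
class Experiment:
    """Breaks one replica on purpose and records what happened."""
    replicas: list
    log: list = field(default_factory=list)

    def serve(self) -> bool:
        # Assumption: a request succeeds if any replica is still healthy.
        return any(r.healthy for r in self.replicas)

    def run(self, rng: random.Random) -> bool:
        victim = rng.choice(self.replicas)
        victim.healthy = False   # the deliberate failure
        ok = self.serve()        # watch the impact
        self.log.append({"broken": victim.name, "served": ok})
        victim.healthy = True    # restore the replica afterwards
        return ok


rng = random.Random(0)
redundant = Experiment([Replica("web-1"), Replica("web-2")])
single = Experiment([Replica("web-1")])

print(redundant.run(rng))  # True: the other replica keeps serving
print(single.run(rng))     # False: nothing is left to serve requests
print(single.log)          # the record you would feed back into your plan
```

The log is the important part: each entry says what was broken and whether the service survived, which is the raw material for improving a continuity plan.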

Don’t feel confident enough to break production? Then simply clone it into a similar environment. Test, test, and test again. And prepare for the real deal – production!

Haven’t gone cloud yet?

If cool concepts like breaking things on purpose and site reliability engineering cannot convince you to make the move to the cloud, well, we don’t know what will. Maybe a lifetime supply of M&M’s or a lifetime cinema pass? In any case, get in touch with us today for a free consultation to discuss how we can help your business plan for IT infrastructure failure – and much more.

Site reliability and why you should plan to fail in advance was last modified: April 13th, 2018 by Stackmasters