Site Reliability Engineering
It’s difficult to walk into a software development organization without hearing about the discipline of Site Reliability Engineering (SRE), though you may discover that SRE means different things to different teams. The practices of Site Reliability Engineering are all well known, and successful teams practiced them before there was a collective name for them. I read Site Reliability Engineering in an attempt to get a handle on what the discipline covers and what SRE teams do. I now have a better understanding of what canonical SRE is, and the book helped explain why different organizations practice the discipline in slightly different ways. I recommend the book to those who want to learn more about deploying systems at scale, though with some caveats.
The book is very Google-centric, which is appropriate since it is subtitled “How Google Runs Production Systems.” How well you can apply its lessons to your team depends on your prior expertise and on the specific topic.
The book is a collection of articles by a variety of contributors rather than the work of a single author, and as a result the writing is a bit uneven. Some chapters do a good job of walking you through the subject area and distinguishing how what Google does could apply to your organization and tool chain. Others are Google-centric to a fault, describing internal tools and approaches with little if any reference to similar, more generally available or even open source tools, or to the tradeoffs to consider when evaluating that decision for your team.
The principles in the book are all generally valid, and you will definitely walk away from it knowing more about the issues to consider in keeping production systems running at scale. You may or may not have a clear idea about how to implement the lessons you learned.
Google is a successful company that has solved some challenging problems, and we can learn a lot from its practices. It’s important to remember that Google is also unique in terms of history and problem space, so one should consider adapting the Google Way rather than adopting it without interpretation. This book is a great launching point for discussion, and it’s worth having a copy if you deploy systems at scale. Just don’t take it as The Way to do Site Reliability Engineering.