Mailing List Cleanup - SRE Weekly (Part 2)

(This is part of an open-ended series of posts where I write down random things I feel are shareable from the years of mailing lists I’ve not caught up on…)

This is part 2 of going through the SRE Weekly folder. There will be at least one more part, and probably another after that, and maybe another after that: there are 242 messages in there going as far back as December 2020, and this post covers through the end of 2021.

Here is Part 1.

What I Wish I Knew About Incident Management – a nice checklist that applies regardless of size. I really like this sentence, though: ‘Not all actionable alerts result in an incident.’
Complexity Has to Live Somewhere – ‘Complexity has to live somewhere. If you are lucky, it lives in well-defined places.’
Generic mitigations – Useful. ‘Call Adam’ was a generic mitigation for a while there…
Tyler Wells on building a culture of reliability at Twilio – ‘First, new engineers are placed into a support queue for one week after onboarding and training.’ I’ve been a huge proponent of this since I did some contracting at Freshbooks waaaaay back in the day. I’m not a fan of ‘culture for culture’s sake’ (remember, the root word of culture is cult), but this sort of intentional culture is key.
Worst Case – What if us-east-1 went away? I’ve not thought about Tim Bray in a while. And it starts by quoting Corey Quinn, which means we’re in for a ride.
The danger of hidden functional roles – ‘[T]here are people who participate in this process that aren’t even aware of this function’
Signs of Triviality – This blog from Jan Schaumann seems just really good. A post on DNS was in the newsletter, but there are a lot of cool things there.
Avoid frostbite: Stop doing code freezes – I link to something similar tomorrow from Charity Majors, but the key part is ‘Your current reliability is based on your current process.’ If you stop deploying, you introduce change, and therefore risk, into your systems. One of my ‘maturity’ measures is whether you do code freezes over Christmas. Spoiler: it is not a positive indicator [in 99% of the scenarios].
Safe schema updates - Near-zero downtime database deployments – a reminder that everything gets rediscovered by every subsequent generation. 3-phase deploys for databases were a popular enough topic in 2006 that a book was written that covered them. And yes, I do think that every technical leader should have read this book. Even in the age of ActiveRecord-based migration patterns.
Howie Guide to post‑incident investigations – Stashing for later. Remember, SOC 2 includes post-incident process design and desktopping.