Mailing List Cleanup - SRE Weekly (Part 4)
(This is part of an open-ended series of posts where I write down random things I feel are sharable from the years of mailing lists I’ve not caught up on…)
This is part 4, which covers posts from the SRE Weekly folder for calendar year 2023.
- Our simple-to-use incident post-mortem template – Until you don’t need a template, you need a template
- Adding Zonal Resiliency to Etsy’s Kafka Cluster: Part 1 – People are rediscovering the importance of multi-AZ as conflicts are multiplying around the world. This is an old article, obvs, but still timely.
- The yaml document from hell – Sometimes YAML is too easy a target
- Good category, bad category (or: tag, don’t bucket) – Tags are, as ever, almost always the better approach.
- Taking the fear out of migrations – Stuff like this is what I fear will be lost when we start just telling AI to generate the migration paths.
- Car alarms and smoke alarms: the tradeoff between sensitivity and specificity – Math behind the sentence:
  “avoid the base rate fallacy by remembering that your false-positive rate needs to be much smaller than your failure rate”
- The Invisible Success of Near Misses – They are interesting too
- Service Delivery Index: A Driver for Reliability – I think I like this
- 10x/9 Rule –
  “For every 9 you add to SLO, you’re making the system 10x more reliable but also 10x more expensive.”
- https://highscalability.com/the-swedbank-outage-shows-that-change-controls-dont-work/ – Ask me what I think about Change Control Boards and their role in risk theatre…
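
Two of the items above (the base rate fallacy quote and the 10x/9 Rule) are really arithmetic claims, so here is a rough sketch of the numbers behind them. This is my own illustration, not code from either article: the function names and the example rates are made up, and the cost half of the 10x/9 Rule is not modelled at all, only the shrinking downtime budget.

```python
def alert_precision(failure_rate, sensitivity, false_positive_rate):
    """Fraction of fired alerts that point at a real failure (Bayes' rule).

    failure_rate:        probability that a given check period has a real failure
    sensitivity:         probability the alert fires when there is a failure
    false_positive_rate: probability the alert fires when there is no failure
    """
    true_alerts = sensitivity * failure_rate
    false_alerts = false_positive_rate * (1.0 - failure_rate)
    return true_alerts / (true_alerts + false_alerts)


def downtime_minutes_per_year(nines):
    """Downtime budget for an availability SLO with the given number of 9s.

    nines=2 means 99%, nines=3 means 99.9%, and so on.
    """
    unavailability = 10.0 ** (-nines)
    return unavailability * 365 * 24 * 60


if __name__ == "__main__":
    # Hypothetical system: 1 in 1000 check periods has a real failure,
    # and the alert never misses one (sensitivity = 1.0).
    failure_rate = 0.001
    for fpr in (0.01, 0.001, 0.0001):
        precision = alert_precision(failure_rate, 1.0, fpr)
        print(f"false-positive rate {fpr}: {precision:.0%} of alerts are real")

    # Each extra 9 divides the yearly downtime budget by ten.
    for nines in (2, 3, 4, 5):
        minutes = downtime_minutes_per_year(nines)
        print(f"{nines} nines: {minutes:.1f} minutes of downtime per year")
```

With these made-up numbers, a false-positive rate equal to the failure rate means roughly every second page is noise, which is why the quote asks for it to be much smaller.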