SRE Monthly Roundup

By António Araújo

SRE Monthly Roundup — March 2022

By António Araújo • Issue #1
Hi friends 👋
Welcome to the very first edition of the SRE Monthly Roundup. I’m thrilled and honored to have you on board.
If you could do me a favor and reply to this email with a quick “hello!”, it would help convince Gmail to keep this newsletter out of the dreaded Promotions folder.

Guest of the month
Our guest this month is Fábio Araújo, SRE at Web Summit. Hey, I swear we’re not related. At least not that we’re aware of. Fábio was actually introduced to me this week by José Velez. Fun facts: Fábio is a football referee on the weekends and has worked as a beach lifeguard in the past.
You might know Web Summit as the annual conference held here in Lisbon, Portugal. It has been described as the most important tech event in the world, but many would be surprised to learn that the company also runs a few other conferences around the world. Here’s a summary of my conversation with Fábio:
career
/ what led them to where they are
I’ve been through a typical path to SRE. I first started as a full stack software engineer at a small startup where I had to do a bit of everything. I started getting interested in the architecture of things and how to build systems that don’t fail. That’s when I got the opportunity to join OutSystems as a Cloud Engineer & SRE. There, I really learned a lot about working with cloud services and monitoring systems at a huge scale. Then, I spent some time as an SRE at Mercedes-Benz.io where my focus was mostly around incident management, monitoring and disaster recovery. Now, I’m at Web Summit doing less ops and more implementation, infrastructure, and automation.
current role
/ their day-to-day, goals, team
We are currently a team of three SREs but we will likely add more people to the team soon (stay tuned). Web Summit has a large group of services, from event management tools (tickets, access, comms, AI recommendations) to external platforms like Summit Engine. We own all service delivery, cloud infrastructure, monitoring assets, and our engineers’ development experience. The team’s success criteria today are still mostly about service performance and health, but we will be adding other metrics such as development pace.
stack
/ what are some of the tools they use
We manage our own Kubernetes cluster on AWS and we’ve containerized everything in a multi-tenant setup. We monitor with Prometheus, Grafana, and New Relic.
proud
/ a recent project they worked on that made them proud
In a previous role, my team was the first one at the company to plan and implement end-to-end monitoring and alerting for a critical service. Our approach was based on three pillars: (1) reliable and trustworthy monitoring, (2) efficient runbooks and guides, and (3) comprehensive mitigation techniques for on-call operators to avoid escalation. Explaining it like this might sound high-level but I’m proud of it because it was an innovative way of doing it in the company and our internal engineering community was extremely pleased with the results. 
improvement
/ what could be improved in the reliability world in general
I feel like there’s an opportunity for an AI layer to be created in the monitoring and observability space. That could be something that helps you make sense of the billions of logs and metrics available, or a system that learns how you’ve managed past incidents and tries to provide better alerting in the future.
advice
/ reliability, organizational, career advice
If you’re tasked with reliability at your company, start by measuring things and setting up monitoring. Even if you feel like you will never have perfect observability, you must create your baseline so you can refine it over time. It’s very easy to showcase feature work but you can only communicate and discuss reliability with metrics.
The Roundup
/ a collection of interesting blogs & articles, news and stories
This one was published in February 2022, but it’s too good to leave out of the newsletter’s first edition! It certainly helped me clarify some of the differences and overlaps between SREs and Platform Engineers. By @rdli and @bjorn_fb
@jwswj has been at Zendesk for almost 10 years and this month he shared 8 lessons learned while working on the platform’s reliability. Incredible nuggets all over. I find it hard to resist real-world stories about implementing Service Level Objectives (SLOs) at scale and linking those to true customer impact. 
@annabkr volunteered for a full week of on-call at LaunchDarkly. Read through the diary, which includes some very helpful tips, if you’re looking to improve your on-call rotation. As Anna tells us, being part of an on-call rotation can sound scary, but with the right tools and processes and a healthy engineering culture, it isn’t so bad!
@reprazent shares GitLab’s approach to measuring availability through SLIs and SLOs. As you’d imagine, GitLab seems to have a pretty mature SLO landscape (at least for GitLab.com’s availability). I’m curious to hear a few things in the upcoming posts, such as (a) more details on how user journeys were represented through SLIs and (b) how all of this becomes part of their GitOps and CI/CD processes.
If you’re interested in cloud costs, you might want to skim through Flexera’s State of the Cloud Report. For example, if you thought that public cloud was already a good business for the likes of Amazon or Microsoft, it’s still far from what it could become. According to the 753 “global cloud decision-makers and users” surveyed by Flexera, only about half of workloads and data sit in a public cloud today.
Bonus: 
What’s coming
/ shortlist of events, meet ups, product launches
@dixie3flatline reminded everyone that dockershim will be removed in the upcoming Kubernetes version. Read carefully before upgrading to v1.24.
Container Solutions is bringing us once again a free, full-day, virtual event about all things reliability, DevSecOps and Observability. 
It’s not a joke. The IRConf is a free, half-day virtual event with a great lineup of speakers discussing all things resiliency and incident response.
HugOps
/ what is HugOps
This month we’re sharing #HugOps with:
Google Cloud, for the March 8th outage that brought down Spotify and Discord
And literally everyone who lost their sleep with the Okta stuff…
Who’s hiring
From the detech.ai blog
_____________
And that’s it for this month! What have I missed? Tell me on Twitter or [email protected]. See you next month!  
António Araújo

A monthly newsletter to help you keep up with everything going on in the site reliability world, covering topics such as performance, scalability, security, DevOps, observability, engineering leadership, and more
