Our guest this month is Fábio Araújo, SRE at Web Summit. Hey, I swear we’re not related. At least not that we’re aware of. Fábio was actually introduced to me this week by José Velez. Fun facts: Fábio is a football referee on the weekends and has worked as a beach lifeguard in the past.
You might know Web Summit as the annual conference held here in Lisbon, Portugal. It has been described as the most important tech event in the world but many would be surprised to know that they actually run a few other conferences around the world too. Here’s a summary of my conversation with Fábio:
/ what led them to where they are
I’ve been through a typical path to SRE. I first started as a full stack software engineer at a small startup where I had to do a bit of everything. I started getting interested in the architecture of things and how to build systems that don’t fail. That’s when I got the opportunity to join OutSystems as a Cloud Engineer & SRE. There, I really learned a lot about working with cloud services and monitoring systems at a huge scale. Then, I spent some time as an SRE at Mercedes-Benz.io where my focus was mostly around incident management, monitoring and disaster recovery. Now, I’m at Web Summit doing less ops and more implementation, infrastructure, and automation.
/ their day-to-day, goals, team
We are currently a team of three SREs but we will likely add more people to the team soon (stay tuned). Web Summit has a large group of services, from event management tools (tickets, access, comms, AI recommendations) to external platforms like Summit Engine. We own all service delivery, cloud infrastructure, monitoring assets, and our engineers’ development experience. The team’s success criteria today are still mostly around service performance and health, but we will be implementing other metrics such as development pace.
/ what are some of the tools they use
We manage our own Kubernetes cluster on AWS and we’ve containerized everything in a multi-tenant setup. We monitor with Prometheus + Grafana and New Relic.
/ a recent project they worked on that made them proud
In a previous role, my team was the first one at the company to plan and implement end-to-end monitoring and alerting for a critical service. Our approach was based on three pillars: (1) reliable and trustworthy monitoring, (2) efficient runbooks and guides, and (3) comprehensive mitigation techniques for on-call operators to avoid escalation. Explaining it like this might sound high-level but I’m proud of it because it was an innovative way of doing it in the company and our internal engineering community was extremely pleased with the results.
/ what could be improved in the reliability world in general
I feel like there’s an opportunity for an AI layer to be created in the monitoring and observability space. That might be something that helps you make sense of the billions of logs and metrics available, or a system that learns how you’ve managed past incidents and tries to provide better alerting in the future.
/ reliability, organizational, career advice
If you’re tasked with reliability at your company, start by measuring things and setting up monitoring. Even if you feel like you will never have perfect observability, you must create your baseline so you can refine it over time. It’s very easy to showcase feature work but you can only communicate and discuss reliability with metrics.
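To make the baseline idea concrete, here is a minimal sketch in Python. It is an illustration only, not something Fábio described: the `baseline` function, the `(latency_ms, succeeded)` sample format, and the choice of availability and p95 latency as the two numbers are all assumptions. In practice these figures would more likely come from a monitoring system than from a hand-built list.

```python
def baseline(samples):
    """Summarize request samples into a simple reliability baseline.

    `samples` is a hypothetical list of (latency_ms, succeeded) tuples,
    e.g. collected from access logs over a fixed time window.
    """
    if not samples:
        raise ValueError("need at least one sample to compute a baseline")

    latencies = sorted(latency for latency, _ in samples)
    successes = sum(1 for _, ok in samples if ok)

    # Nearest-rank 95th percentile, using integer arithmetic
    # (ceil(0.95 * n)) to avoid floating-point rounding surprises.
    rank = (95 * len(latencies) + 99) // 100

    return {
        "availability": successes / len(samples),
        "p95_latency_ms": latencies[rank - 1],
    }


if __name__ == "__main__":
    # Made-up data: 19 fast successful requests and 1 slow failure.
    window = [(100, True)] * 19 + [(900, False)]
    print(baseline(window))
```

Recording these two numbers for every window, even with imperfect data, gives a starting point that later measurements can be compared against.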