Monday, November 15, 2010

Site Reliability Engineers at Google profiled by student blogger

In "Caitlin Talks to Site Reliability," you can get a peek into the daily routine of one of the teams I worked with when I first started at Google. These engineers are constantly at work behind the scenes, turning dials and pulling levers to make Google's incredible infrastructure work smoothly. We're talking about software that runs near the bottom of the stack, way below the Gmail or Calendar UI that you see: file systems, job allocators, and everything else that is just one step above datacenter hardware.

I helped the SREs by capturing what had previously been purely verbal lore: grilling them for troubleshooting tips, writing down day-to-day procedures, and creating training for new team members. I created web pages explaining SRE's habits and requirements to other teams at Google so that they could all work together nicely. I also (gently) corralled the SRE team into adopting some documentation best practices so that when I left them, they could continue to record their knowledge for the benefit of their colleagues.

Caitlin's interview subject, Marc the SRE, talks mostly about being on call and getting late-night pages to fix issues. That's the dramatic part of the job, but from where I sat, I saw the SREs labor constantly (and, mostly, patiently) to make already-up-and-running infrastructure run better, faster, and at ever-larger scale. New systems constantly replaced the old, requiring reshuffling and reconfiguration. Engineers on product development teams had to be educated about how to use the infrastructure appropriately, because knocking over your own or another team's product was a real and ever-present danger.

It was an interesting job trying to pin all this down in words. Even today, when I get the "unavailable, try again in 30s..." message, I picture the SREs scrambling for all they're worth. I'm still amazed at how fast they get these things repaired. I like to think my documentation plays some small part in it, but the truth is, I learned a lot more from them than they did from me.
