Help! My on-call rotation sucks!

Taking time away to develop a top-notch on-call plan might seem counterintuitive. But, as engineering manager Leemay Nassery advises in this edition of The Lossless Leader, it could save your team (and you) from problems down the road.

Wake your on-call teammates up in the middle of the night too often, and they won't be as happy as this guy when you get in touch.

(Photo by Flickr user Denis Malka)

This is The Lossless Leader, an advice column written by engineering manager Leemay Nassery.

Why call it The Lossless Leader? An engineering leader is someone who inspires their team, communicates well, grows their people to become leaders themselves, removes blockers or painful aspects of their team’s day-to-day, delivers on product requests and so much more. In tech, lossless compression is a technique that does not lose any data in the compression process; it reduces the size of files without losing any information in the file so quality is maintained.

Combining the two: Leaders aren’t perfect. Sometimes they manage to not lose any data while leading their org, and other times it may seem like they’re losing it altogether. This column is called The Lossless Leader because we all admire those leaders who strive to stay true to who they are and the people they serve (their team). They admit fault when necessary, learn from their mistakes and sometimes flourish in difficult situations, all while not losing themselves along the way.

Submit your question to The Lossless Leader

The question:

“I’m exhausted from partaking in my team’s on-call rotation. It’s not that I don’t see the need for on-call — we support a high-traffic engineering system, so it’s necessary. However, we have so many pointless alerts that never really get addressed. When I mention this at team meetings, nothing happens. My teammates seem to be on autopilot when it comes to on-call. What can I do to make this situation better?”

The answer: 

It’s time you take matters into your own hands. Here’s the plan:

  • Block four days off in your calendar.
  • Decline all meetings.
  • Do an audit of the last few weeks of PagerDuty alerts. If dashboards aren’t available, there are PagerDuty APIs that can make auditing a bit easier.
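If you go the API route, the pull could look something like the sketch below. It assumes the PagerDuty REST API v2 incidents endpoint with offset-based pagination and a read-only API key; the function names `build_request` and `fetch_incidents` are made up for illustration, so check the current PagerDuty API reference before relying on any of it.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

# Assumption: PagerDuty REST API v2 incidents endpoint.
API_URL = "https://api.pagerduty.com/incidents"


def build_request(api_key, since, until, offset=0, limit=100):
    """Build one paginated GET request for incidents in a time window."""
    params = urllib.parse.urlencode({
        "since": since.isoformat(),
        "until": until.isoformat(),
        "limit": limit,
        "offset": offset,
    })
    return urllib.request.Request(
        f"{API_URL}?{params}",
        headers={
            "Authorization": f"Token token={api_key}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )


def fetch_incidents(api_key, since, until):
    """Page through the results until the API reports there are no more."""
    incidents, offset = [], 0
    while True:
        request = build_request(api_key, since, until, offset)
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        batch = page.get("incidents", [])
        incidents.extend(batch)
        # Stop when the API says we are done, or if a page comes back empty.
        if not page.get("more") or not batch:
            return incidents
        offset += len(batch)
```

Calling `fetch_incidents` with your key and a three-week window (for example, `datetime(2024, 3, 1, tzinfo=timezone.utc)` to `datetime(2024, 3, 22, tzinfo=timezone.utc)`) would return a list of incident records you can dump into a spreadsheet or script for the audit.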

The objective of this focused time isn’t to resolve this bigger alert fatigue problem, because I suspect that would take more time. Instead, it is to craft a document detailing what needs to be addressed and why.

Why four days? Because four sounds better than an entire week and anything less feels too rushed. Someone may think, “One whole week without any meetings just to focus on an on-call audit? That’s crazy talk!” But it’s not crazy at all. You have to be strategic when presenting plans that could be met with resistance.

Don’t just go rogue, though. Let the team know of your plan. Depending on how lax or strict your team is with planning, suggest it prior to sprint planning or send a simple Slack message a few days prior to the time you’ve allocated yourself to focus.

If your manager finds it strange that you’re shifting focus away from your typical work, assure them this effort is time-boxed. It’s just four days. You’re taking initiative. Most people complain about issues, but don’t do anything about them. You’re actually going to do something about it. This is leadership 101, my friend.

Once you’ve made it to the audit …

When auditing the alerts, consider grouping them by the following categories:

  • Alerts that were acknowledged and resolved but no action was really taken.
  • Alerts that resulted in a follow-up ticket or stream of work to address.
  • Alerts that occurred after the typical working hours.
  • The incident level of each alert.
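If your alerts end up in a script rather than a spreadsheet, the grouping above could be sketched roughly like this. The `Alert` record, its field names and the 9-to-6 working hours are all assumptions for illustration; map them to whatever your alerting tool actually exports.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumption: "typical working hours" are 9:00-18:00 local time, Monday-Friday.
WORK_START_HOUR = 9
WORK_END_HOUR = 18


@dataclass
class Alert:
    """Hypothetical audit record; adapt the fields to your own export."""
    title: str
    created_at: datetime          # local time the alert fired
    priority: str                 # incident level, e.g. "P1"
    action_taken: bool            # did anyone do more than ack and resolve?
    follow_up_ticket: Optional[str] = None


def is_after_hours(ts):
    """True on weekends or outside the assumed working hours."""
    return ts.weekday() >= 5 or not (WORK_START_HOUR <= ts.hour < WORK_END_HOUR)


def audit(alerts):
    """Group alerts into the audit categories listed above."""
    return {
        "no_action": [a for a in alerts
                      if not a.action_taken and a.follow_up_ticket is None],
        "follow_up": [a for a in alerts if a.follow_up_ticket is not None],
        "after_hours": [a for a in alerts if is_after_hours(a.created_at)],
        "by_priority": Counter(a.priority for a in alerts),
    }
```

An alert can land in more than one group (a 3 a.m. page that nobody acted on is both "no action" and "after hours"), which is intentional: each group answers a different question.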

With these data points, you should be able to craft a story that potentially answers the following questions:

  • Is your team woken up far too often in the middle of the night?
  • Are alerts skewing toward too many P1s? Are they truly P1 incidents, or should they be downgraded to a lower severity level?
  • If no tickets are actually created to address incidents or pages after the fact, then maybe it’s time to introduce an on-call recap meeting where action items are created, experiences are shared, etc.
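To put numbers behind those questions, a few small helpers are enough. This is only a sketch: the 10 p.m. to 6 a.m. definition of "night" and the function names are assumptions, so adjust them to your team's reality.

```python
from datetime import datetime

# Assumption: "middle of the night" means 22:00-06:00 local time.
NIGHT_START_HOUR = 22
NIGHT_END_HOUR = 6


def is_night(ts):
    """True if the page fired during the assumed night window."""
    return ts.hour >= NIGHT_START_HOUR or ts.hour < NIGHT_END_HOUR


def night_pages_per_week(timestamps, weeks):
    """Average number of night-time pages per week over the audit window."""
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    return sum(1 for ts in timestamps if is_night(ts)) / weeks


def priority_share(priorities, level="P1"):
    """Fraction of alerts filed at the given incident level."""
    if not priorities:
        return 0.0
    return sum(1 for p in priorities if p == level) / len(priorities)


def follow_up_rate(ticket_ids):
    """Fraction of alerts with a follow-up ticket (None means no ticket)."""
    if not ticket_ids:
        return 0.0
    return sum(1 for t in ticket_ids if t is not None) / len(ticket_ids)
```

A high P1 share paired with a low follow-up rate is a strong hint that severities are inflated or that nobody is closing the loop after a page.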

Once the audit is completed, create a document that illuminates specific areas for improvement. The document should include:

  • A summary of your findings, including data points from the audit.
  • Specific areas to focus on now (i.e. where investments need to be made to improve on-call health).
  • Reasons why you’ve selected those specific focus areas.
  • Multiple options for how to proceed.

Be cautious about how you approach what happens next. You likely do not want to become the person known for all things on-call, but you do want to make it better. By attempting to improve on-call, you’re doing something that is very necessary but sometimes receives little or no recognition. As was said earlier, you have to continue to be strategic with this effort.

That’s not to say you should write the document, send it into the Slack ether and never mention it again. Now that you have a clearer path to a better on-call system, you could pass the baton to your manager (if you trust they’ll actually do something).

Alternatively, figure out how to address it as a team. For example, one person in each sprint could focus on the areas defined in your document, or the team could allocate an entire sprint to tackle them together. If you do the latter, keep metrics at the top of your mind so you can measure how you’ve improved the state of on-call and showcase that this time is well spent.

After this is all said and done, “There goes my hero” will be what your teammates say as you walk by, getting them one step closer to a less-sucky on-call experience. Until then, listen to this ’90s throwback for inspiration:

Technically Media