I got called out in email, quite reasonably I think, for using the term “off-times” when referring to choosing a better time to do our sprintly upgrades. In a 24×7 continuous service world, there really is no such thing as a “good” time to have a problem. I’ve written in the past about the extensive lengths to which we’ve gone to ensure that our full upgrade process is “online” and we pretty much never, ever have to take the whole service down for anything. However, there are bad times and worse times to have a problem.
We have designed our current upgrade schedule to make sure we have developers available during reasonably normal working hours to ensure that, if something goes wrong, the right people are on hand to address the issues and we don’t burn people out with too many long hours and late nights. The result has been a schedule where our upgrades happen during peak hours of usage. For a long time, that wasn’t really a problem because, for the most part, upgrades were very seamless and reliable. In the past few months it has been more of a problem, hence we are looking at adjusting the timing.
Back to the original observation about “off-times”…
I love data. I try to share the kinds of data we collect and how we use it. We have a lot of it. In fact, I just saw a mail saying we needed 20 more TB of space to store all of the data we collect about the operation of the service. Although there’s no true “off-time”, there are more and less busy times. Here’s a graph of our weekly activity, starting at 12:00am UTC on Monday and going through the week.
Our peaks happen at about 11:00am EST (4:00pm UTC). Usage bottoms out around 12:00am EST (5:00am UTC). Weekend days (Sat & Sun) have a similar pattern but are, overall, significantly subdued compared to weekdays. In fact, you can see that the 3rd “peak” on Friday (and Friday in general) is much smaller than other weekdays; I suspect that’s a sign people are ready for the weekend.
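Since the graph is in UTC and I keep quoting EST, here is a minimal sketch of the conversion for anyone who wants to check the arithmetic. It assumes EST is a fixed UTC-5 offset with no daylight saving adjustment; the `est_to_utc` helper and the date used are purely illustrative and not part of any tooling we run.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is treated as a fixed UTC-5 offset (no daylight saving).
EST = timezone(timedelta(hours=-5), name="EST")


def est_to_utc(hour: int, minute: int = 0) -> str:
    """Convert a wall-clock time in EST to UTC, formatted as HH:MM."""
    # The date is arbitrary; only the time of day matters here.
    local = datetime(2024, 1, 8, hour, minute, tzinfo=EST)
    return local.astimezone(timezone.utc).strftime("%H:%M")


print("peak:  ", est_to_utc(11))  # 11:00am EST
print("trough:", est_to_utc(0))   # 12:00am EST
```

The two calls correspond to the peak and the low point quoted above.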
For the past many months, I’ve been advocating that we work harder to make updates a “non-event” – in every sense of the word. Customers shouldn’t notice them. Our developers shouldn’t notice them. Our service delivery team shouldn’t notice them. They just happen and people are pleasantly surprised when new features show up. It’s a statement not only about the reliability and performance but also about the degree of automation and process management on our side. That’s the goal but clearly, in the last few months, we haven’t been meeting that goal.
We’ll shift our deployment times from morning EST to evening EST for a while and continue to work on making the process better. I’d very much like to avoid having to ask the team to work weekends, even though that would be the least impactful time. I really want to focus on fixing the problem and have this mitigation only be temporary. We’ll certainly keep you all posted as we make progress.