So I don’t normally inflict more than one blog post on you guys per week, but I feel I owe it to the greater Gotdotnet community to talk more about the site’s ups and downs this week.
I joke around with people with dark humor – especially the old Gotdotnet team, who know the site well – about how sneaky the site is. If there’s a server or application problem, it usually shows up:
1) On a weekend when there’s a skeleton Microsoft.com ops crew that doesn’t know Gotdotnet
2) At 5:30 p.m. Pacific when everyone has been lulled by a false sense of security into going home
3) Friday night
4) Between about 6 and 8 a.m. Pacific, so that by the time the crew comes in at 9 a.m., there’s a decent array of emails from people freaking out that there’s a problem
Once I became the Workspaces program manager, I suddenly became aware of what former Workspaces PM Andy Oakley kept insisting to me was happening back when I was a site manager: the annoying, *self-healing* properties of Workspaces that make it hard to debug.
There’d be a Workspaces problem.
Six people would complain about it.
I’d go over to Andy.
It would start working.
Likewise, over the last few months I’ve had user complaints that neither I nor the rest of the team could reproduce. As the person who answers your emails, I believe every single complaint is true (why would you lie? You have more important things to do).
But I’d turn around and the Workspace would work, or the site would work – so we’d have to systematically drill down with people on their situation and their repro steps. That’s why I ask people to email firstname.lastname@example.org – we need to get intense and personal on these bugs. We even got that guy in NYC to let us onto his machine to figure stuff out (read the new STEP FOUR in the updated FAQ – http://gotdotnet.com/team/betsy/workspacesfaq.aspx – that’s all Jim’s help).
But hahaha, that’s funny, Betsy – the site isn’t alive. It’s just a bunch of ASP.NET code, some of it older than it should be – it can’t possibly be sentient. It can’t possibly, say, time its malfunctions. It can’t possibly scheme to get shiny new hardware in the most flagrant way possible.
Then, on the eve of a code fix(!) to assist you with bugs we’ve finally nailed down, we lose two critical pieces of hardware and have to replace them and restore from backup. They likely started failing over the weekend, but the site is finally up again around 10 a.m. PST Tuesday.
Fine, this stuff happens.
Then something happens that I can’t even talk about rationally; it just flabbergasts me in terms of feeling cursed. We come in Wednesday to find Microsoft.com ops having to rebuild on an entirely new server from scratch and backups, as the same drives that had failed and been replaced were once again failing. That’s why our team stayed until 2 a.m. Thursday morning syncing and restarting and fiddling and restoring based on the previous backups. I’m writing this on no sleep, because once you stay up that long, it’s sometimes just easier to keep going.
So yesterday and today we are combing through things, trying to make sure it all syncs up. We never did get to that code fix (we’ll announce it far in advance when we do, since Workspaces will definitely be down when we do it, and we figure you need a break from all the downtime – might as well have the site up for a while). Most people’s Workspaces are doing great, but we will continue to take feedback, and we want to hear from you if yours isn’t. Let us know.
The dang site’s alive. In more ways than one. Live it vivid!