My first job was as a Support Engineer for developer tools at Microsoft China. Most of the time we remotely helped Microsoft customers solve their development issues, such as deadlocks, heap corruption, and performance problems. For difficult cases we would connect to the customer's machine to debug remotely. In rare cases an engineer would travel to the customer site to provide onsite support.
At that time there was a special engineer title called CPR, meaning Critical Problem Resolver. Only a few engineers held that title. Any problem that other engineers could not solve, a CPR took on and solved.
In the earliest stage of my career I worked closely with a CPR who was considered the youngest CPR. Getting mentoring from him is still the most important experience of my career. Some small tricks I learned from him 10+ years ago still shock my team members today when I demonstrate them.
The story I want to tell today is not about him. I heard the story from him. It’s a story of another legendary CPR. I don’t remember the name of the protagonist.
It was around the year 2000. A traditional company decided to build its IT solution on the Microsoft tech stack. At that time, only banks and ISPs would invest in IT infrastructure. The common IT infrastructure was mainframe systems from IBM or SUN, based on UNIX systems like Solaris. Obviously this was a big bet for that customer. The customer signed a contract for Microsoft consultants to help build the service.
In the pre-production test, the newly developed server application hit random crashes. The same application worked fine in the dev environment. When running in the production environment, it crashed in different places. Our support team had been following up for weeks but had no clue about the root cause. The customer was losing confidence in Microsoft. With the deadline approaching, we sent our CPR to the customer’s site.
Our CPR engineer sat at the customer site, debugged day and night, analyzed tons of dumps, and tried different approaches. After several days, he (or she) told the customer: you have bad hardware, contact the vendor to replace your motherboard.
The customer was angry: “You cannot find your bug and you want to blame the hardware?” At that time, hardware vendors would not replace hardware for no reason. More importantly, shipping new hardware from the USA took time (computers were considered high tech and were not made in China). If the replacement did not fix the problem, we would miss the project deadline.
The customer eventually agreed to replace the hardware, but with a condition: if the problem turned out not to be related to the hardware, Microsoft would refund all the costs of the project, including the consulting fees.
The CPR explained the situation to the team leader. The team leader and other Microsoft leadership decided to support the CPR and agreed to the customer’s condition.
The customer placed the order. The new hardware arrived. It was powered on and ran for days, and the problem never happened again. This is the end of the story.
The story planted a seed in my heart. I wondered whether someday I could be skillful enough to tell a software bug from a hardware bug in a debugging session, and confident enough to ask a customer to replace hardware when software crashes. Would I have the courage to trust the team’s judgement in a similar situation?
A decade passed. I am still at Microsoft, but I am no longer in a support role. I work on the Azure Compute team. We manage all Microsoft server assets, such as those behind Bing.com and Azure. The whole cloud infrastructure is deployed and managed by my team, and our service code runs on every server machine.
10+ years ago, the support team I worked on was so worried about that single customer. We wanted the software to run healthily, and every single crash mattered. Today, our team’s service supports every Microsoft cloud customer. A single crash still matters. With modern instrumentation, we learn of any crash almost immediately. Over the past years I have debugged crash dumps many times. I also introduced some bugs that almost caused a global outage. I saw disk corruption and memory corruption. I saw outlier crash call stacks. From analysis, some of them were likely caused by bad hardware. But I had not caught a hardware failure live and directly in a debugger until last weekend, 11-11-2017.
The following is my analysis from a live debugging session. I think I observed a hardware failure that caused a software crash. The debugging process was not tricky, but I still feel so good and peaceful.
GetAndXXXXXXX has been crashing constantly on SXXPrdXXXXX XXXXXX-c125-XXXX-8202-513e917c5ddc since 2017-11-06 03:40:39.450 UTC.
The same binary is running everywhere else without any issue, and we verified that the binary running on the problematic machine is not corrupted.
From the mini dump we learned that the crash is an access violation in a string operation. Since this is an AMD64 build with optimization turned on, we could not confirm the actual broken pointer.
Since this is a consistent repro, we set up a live debug session.
From the live debug, the crash happens inside this function call. A bad pointer was passed into the AppendF call.
(line 1658) serviceListIniBuffer.AppendF("ServerList=%s\r\n", m_newServiceList.config->GetStringParameter("Manifest", "ServerList", "").c_str());
We set a breakpoint at the line above and got:
00007ff6`36961932 488b00 mov rax,qword ptr [rax]
00007ff6`36961935 4c8bc0 mov r8,rax
00007ff6`36961938 488d1541270b01 lea rdx,[00007ff6`37a14080]
00007ff6`3696193f 488d4d20 lea rcx,[rbp+20h]
>>>>>>>> 00007ff6`36961943 e8981ef3ff call GetAndXXXXXXX!YYYYY<127>::AppendF (00007ff6`368937e0)
From there we know the following (in the x64 calling convention on Windows, the first four arguments are passed in RCX, RDX, R8, and R9, and the this pointer goes in RCX):
r8 -- output of m_newServiceList.config->GetStringParameter
rdx -- should be hardcoded string "ServerList=%s\r\n"
rcx -- this pointer to serviceListIniBuffer
The memory info for these:
0:000> da r8
0:000> da rdx
0:000> dc rcx
00000001`075be190 00000239 00000000 0a3a5710 00000001 9........W:.....
00000001`075be1a0 3b0a0d3b 69685420 69662073 6920656c ;..; This file i
00000001`075be1b0 65672073 6172656e 20646574 47207962 s generated by G
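As a side note on reading the dc output above, here is a minimal Python sketch of how the hex column maps to the ASCII column. It assumes only what the listing shows: dc prints memory as 32-bit values, and on a little-endian AMD64 machine the lowest byte of each value comes first in memory. The variable names are illustrative.

```python
import struct

# The four 32-bit values from the second row of the dc dump above.
row = ["3b0a0d3b", "69685420", "69662073", "6920656c"]

# Each value is stored least significant byte first, so pack it as a
# little-endian unsigned 32-bit integer to recover the bytes in memory order.
raw = b"".join(struct.pack("<I", int(word, 16)) for word in row)

# Show non-printable bytes (here CR and LF) as '.', the way the debugger does.
text = "".join(chr(b) if 32 <= b < 127 else "." for b in raw)
print(text)
```

The result matches the text on the right-hand side of that row, which suggests the this pointer in rcx points at a sane buffer object.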
From the above, we know the bad pointer is the 1st parameter, the one in rdx. Interestingly, the 1st parameter is actually the hardcoded string “ServerList=%s\r\n”. How could this be corrupted?
If we look carefully, the rdx value is 00007ff6`37a14080. Let’s flip a bit and change it to 00007ff6`36a14080, and we get:
0:000> da 00007ff6`36a14080
This is 1 bit corruption.
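The single-bit claim is easy to check by XOR-ing the two addresses. The short Python sketch below does that; the two constants are the addresses from the debugger session, and the variable names are only for illustration.

```python
# Addresses taken from the debugger session above.
observed = 0x00007FF637A14080  # value found in rdx at the call site
expected = 0x00007FF636A14080  # address obtained after flipping one bit back

diff = observed ^ expected           # bits that differ between the two values
flipped_bits = bin(diff).count("1")  # how many bits differ
bit_index = diff.bit_length() - 1    # position of the differing bit (0 = lowest)

print(f"xor = {diff:#x}, differing bits = {flipped_bits}, bit index = {bit_index}")
```

The XOR is 0x1000000, so the two addresses differ in exactly one bit, bit 24.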
We highly suspect this is a hardware failure. My guess is that there is a CPU, memory, or bus failure such that when a specific memory sequence is loaded, a bit gets corrupted. That’s the only explanation I can think of for why the same application fails all the time at the same place.
BTW, the memory address does not change because the string is loaded from the EXE, not a DLL. The EXE always maps to the same start address.
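One more check that can be done from the listing alone: the lea rdx instruction uses RIP-relative addressing, so its target is the address of the next instruction plus a 32-bit displacement stored in the instruction bytes. The Python sketch below decodes the bytes shown in the disassembly. This is an observation added here from the listing, not something the original debugging session recorded, and it assumes the bytes printed by the debugger reflect memory as read on the faulty machine.

```python
# Instruction bytes of the `lea rdx, [...]` line in the disassembly above.
code = bytes.fromhex("488d1541270b01")
next_ip = 0x00007FF63696193F  # address of the instruction following the lea

# 48 = REX.W prefix, 8d = LEA opcode, 15 = ModRM byte selecting rdx with a
# RIP-relative operand; the last four bytes are a little-endian signed disp32.
disp = int.from_bytes(code[3:7], "little", signed=True)
target = next_ip + disp

print(f"disp32 = {disp:#x}, target = {target:#x}")

# Clearing bit 24 of the displacement gives the address of the real string.
fixed_target = next_ip + (disp ^ (1 << 24))
print(f"with bit 24 cleared: {fixed_target:#x}")
```

The decoded displacement already produces the bad address, and clearing bit 24 of it gives the good one. If the listing is accurate, that would mean the flipped bit sits in the code bytes as seen in memory on this machine, which fits the suspicion of a memory or bus fault rather than a software bug.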