Hi, my name is Bob. I’m an Escalation Engineer with the Microsoft Critical Problem Resolution team. One of our readers asked how much we deal with hardware. In response, I thought I would share an interesting problem we recently worked on. It was interesting because it demonstrated an issue specific to multi-processor machines, caused by something that probably seemed innocuous to the driver writer responsible.
What was the problem?
The problem was that everything worked except that the system time was not being updated. The real-time clock (RTC) had stopped. We spent time reviewing the semantics of how the RTC should function, and even hooked up an oscilloscope to the RTC on the motherboard. We were able to turn it off and on with the debugger by writing to the correct port. The trace on the scope validated our understanding of what had to be written to the port to turn the clock off. Once we had a clear understanding of this, we knew what to look for in a driver that might cause the problem. Note that the clock typically fires every 10ms, so you do not need a fast scope to do this.
Special keyboard driver written
To catch a dump while the system was in the failed state, we had to modify the keyboard driver so that it would execute an “Int 3” in its ISR instead of calling “bug check” for an E2 stop. Because the RTC was not running, the idle thread was not getting quantums, and as a result a normal break-in would not work. However, the system would still respond to ISRs.
What was found
All RTC interrupts were stopped – the clock was not running. We checked all the obvious places to see if the RTC was disabled.
We looked at the ICR in the I/O APIC. This is the interrupt configuration register. There is one such register for every interrupt pin on the APIC. These registers tell the APIC which vector to send to the processor so that the processor can service the interrupt. Each register also holds configuration information, such as whether the interrupt is level or edge triggered, and a mask bit. The mask bit was not set.
Below is a reenactment.
0: kd> ed ffd01000
ffd01000 00000034 20   <-- Select register 20, which is pin 8.
0: kd> ed ffd01010
ffd01010 000008d1   <-- Contents: vector D1. Bit 16, the interrupt mask bit, is not set, so it is OK.
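The check above can be sketched in a few lines of Python. This is only an illustration and was not part of the original debugging session. The function names are made up for the example, and the bit positions assume the standard I/O APIC redirection entry layout (vector in bits 0-7, trigger mode in bit 15, mask in bit 16).

```python
def redirection_register_index(pin: int) -> int:
    """Index of the low dword of the entry for an I/O APIC pin (assumed layout)."""
    # Each pin has a 64-bit entry stored as two 32-bit registers, starting at 0x10.
    return 0x10 + 2 * pin


def decode_redirection_low(value: int) -> dict:
    """Decode the fields of interest from the low dword of an entry."""
    return {
        "vector": value & 0xFF,                      # bits 0-7
        "delivery_mode": (value >> 8) & 0x7,         # bits 8-10
        "level_triggered": bool(value & (1 << 15)),  # bit 15: 0 = edge, 1 = level
        "masked": bool(value & (1 << 16)),           # bit 16: 1 = interrupt masked
    }


entry = decode_redirection_low(0x000008D1)
print(hex(redirection_register_index(8)))  # register selected for pin 8
print(entry)
```

For the value in the dump, the decoded vector is D1 and the mask flag is clear, which matches what we read by hand.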
Next, check the RTC status register, which is accessed through I/O ports 70 and 71. Port 70 is the address port. Port 71 is the data port. This information is from an old BIOS book.
0: kd> ob 70 b   <-- ‘B’ is a control register.
0: kd> ib 71
00000071: 42   <-- The value 42 means that the RTC is enabled. Bit 6 is the enable.
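As a rough illustration of how the value 42 breaks down, here is a small Python sketch. The bit names are an assumption based on the common MC146818-style layout of Status Register B, and the helper name is made up for the example.

```python
# Status Register B bits, assuming the MC146818-style layout.
STATUS_B_BITS = [
    (7, "SET"),   # halt clock updates while the time is being set
    (6, "PIE"),   # periodic interrupt enable (the "enable" bit checked above)
    (5, "AIE"),   # alarm interrupt enable
    (4, "UIE"),   # update-ended interrupt enable
    (3, "SQWE"),  # square wave output enable
    (2, "DM"),    # data mode: 1 = binary, 0 = BCD
    (1, "24H"),   # 1 = 24-hour mode
    (0, "DSE"),   # daylight saving enable
]


def decode_status_b(value: int) -> list:
    """Return the names of the bits that are set in Status Register B."""
    return [name for bit, name in STATUS_B_BITS if value & (1 << bit)]


print(decode_status_b(0x42))
```

Under that assumed layout, 42 has only the periodic interrupt enable and 24-hour mode bits set, so nothing in register B explains the stopped clock.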
So what was it?
The RTC interrupts at a set interval. When each interrupt is serviced, the status register has to be read, or the RTC will not interrupt again.
We discovered another driver that was reading the clock. We found it by disassembling various drivers in the dump and looking for I/O operations on ports 70 or 71. The lower addresses selected by port 70 will yield the time when read. That is what the driver was doing.
You would think that simply reading the time in this way would not hurt anything. However, in a multi-processor system, access has to be serialized. There is only one set of I/O ports for the system.
Since it takes two accesses to perform an operation on the clock, one to set the address and one to transfer the data, a collision between two processors can cause unpredictable results.
Below is a timing diagram of the issue:
Time  Proc 0 running OS RTC handler             Proc 1 running XYZ driver
T1    Set register select to status register
T2                                              Set register select to read time
T3    Read status register to restart clock
So at T3 the OS RTC handler reads the wrong register, and the clock does not restart.
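The collision can be replayed with a toy model. The Python sketch below is only a simulation under stated assumptions: SimulatedRtc, its methods, and the register values are hypothetical stand-ins for the real hardware, and the two "processors" are replayed sequentially in the T1/T2/T3 order rather than run as real threads. The lock stands in for whatever serialization the real system uses.

```python
import threading

STATUS_C = 0x0C  # status register that must be read to re-arm the interrupt
SECONDS = 0x00   # one of the low registers that yields the time


class SimulatedRtc:
    """Toy model of an RTC with one shared address port and one data port."""

    def __init__(self):
        self.index = 0
        self.interrupt_pending = True  # an interrupt fired and awaits servicing
        self.seconds = 0x30

    def write_port_70(self, register: int) -> None:
        self.index = register  # select which register port 71 will access

    def read_port_71(self) -> int:
        if self.index == STATUS_C:
            self.interrupt_pending = False  # reading status C re-arms the clock
            return 0xC0
        if self.index == SECONDS:
            return self.seconds
        return 0


def unserialized_access() -> bool:
    """Replay T1..T3 with no lock. Returns True if the clock is left stuck."""
    rtc = SimulatedRtc()
    rtc.write_port_70(STATUS_C)  # T1: Proc 0 selects the status register
    rtc.write_port_70(SECONDS)   # T2: Proc 1 selects the time register
    rtc.read_port_71()           # T3: Proc 0 reads, but gets the time instead
    return rtc.interrupt_pending


def serialized_access() -> bool:
    """Same work, with each address/data pair kept together under one lock."""
    rtc = SimulatedRtc()
    lock = threading.Lock()      # must be shared by every party touching the ports
    with lock:                   # Proc 0: select and read as one unit
        rtc.write_port_70(STATUS_C)
        rtc.read_port_71()
    with lock:                   # Proc 1: select and read as one unit
        rtc.write_port_70(SECONDS)
        rtc.read_port_71()
    return rtc.interrupt_pending


print("stuck without serialization:", unserialized_access())
print("stuck with serialization:", serialized_access())
```

The point of the sketch is that the lock only helps if both sides use the same one. A lock that is private to the XYZ driver would not keep it from colliding with the OS RTC handler.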
I thought this was an interesting problem that illustrates the need for serialization, and it demonstrates what to look out for in a multi-processor environment. You always have to think, “What happens if the other processor does...”
For more information consult any BIOS developer manuals you may have lying around or this link we found http://www.geocities.com/SiliconValley/Campus/1671/docs/rtc.htm
See the “Status Register C” section, “All future interrupts are disabled until this register is read – your interrupt handler *must* do it.”