This is a race the display driver wouldn’t normally expect to lose


One of my colleagues from the Windows 95 project reminded me of a problem that I was called to debug. If the floppy drive was in use, the display driver was more likely to crash.

It wound up being a race condition between the floppy driver and the display driver. This is a race the display driver wouldn't normally expect to lose.

The problem occurred on machines running the kernel debugger. When the kernel debugger was connected, the display driver printed diagnostic information, which was sent over the serial port. That slowed down the display driver and made it more likely that the floppy drive would interrupt it at a bad time.
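To get a feel for how much a serial port can slow things down, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the 9600 baud rate, the 10 bits per character on the wire (8 data bits plus start and stop bits), the 48-character message, the 100-microsecond baseline, and the idea that the driver waits for the output to finish. The function names are made up, too.

```python
def serial_send_seconds(num_chars, baud=9600, bits_per_char=10):
    """Time to push num_chars out a serial port, assuming the sender
    blocks until every bit is on the wire (an assumption)."""
    return num_chars * bits_per_char / baud


def slowdown_factor(base_seconds, num_chars, baud=9600):
    """How many times longer a code path takes once it also has to
    send a diagnostic message of num_chars characters."""
    return (base_seconds + serial_send_seconds(num_chars, baud)) / base_seconds


# Hypothetical numbers: a 48-character diagnostic line added to a
# code path that otherwise takes 100 microseconds.
delay = serial_send_seconds(48)
factor = slowdown_factor(0.0001, 48)
print(f"serial delay: {delay * 1000:.1f} ms, slowdown: about {factor:.0f}x")
```

Under those assumed numbers, a single line of debug output costs tens of milliseconds, which is an eternity compared to the work the driver was trying to finish.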

Comments (11)
  1. creaothceann says:

    Just get a CPU with more cores, jeeze!

  2. Antonio Rodríguez says:

    In most cases, the display driver is like that handsome good guy played by Errol Flynn who never loses, except when the script needs to create some tension.

  3. Baltasar says:

    The kind of bug to debug that would tempt you to jump out of the window. You have to put the floppy, the diagnostic info and the display in debug mode together in order to be able to maybe reproduce the bug… what a nightmare.

  4. Douglas Hill says:

    I’m having trouble understanding what you mean by “wouldn’t normally expect to lose”.
    Was this something that was actually impossible under normal circumstances, but the debugger made it possible? Or was it possible under normal circumstances, but highly unlikely?

    In other words, was the bug in the display driver unmasking interrupts/not raising to a high enough IRQL/not taking a lock/whatever? Or was the kernel debugger doing things like unmasking interrupts or whatever?

    1. Simon Farnsworth says:

      As written, it sounds like the display driver didn’t hold onto a lock for long enough.

      In other words, correct code would look like:

      handle_irq():
      {
        with lock:
        {
          unmask interrupts
          determine cause
          handle interrupt
          clean up ready for next IRQ
        }
      }

      but the driver actually did:

      handle_irq():
      {
        with lock:
        {
          unmask interrupts
          determine cause
          handle interrupt
        }
        clean up ready for next IRQ
      }

      In the unlikely event that the floppy driver was able to claim the CPU (bear in mind that this is almost certainly on a single core and not preempted by a timer interrupt) while “clean up ready for next IRQ” was running, and then another display interrupt happened, you’d be in the bug case. The only way for the floppy driver to claim the CPU is if a floppy command was written to the controller before this interrupt handler fired, and the interrupt happened after lock was released but before “clean up” was finished – if it happened while lock was held, the display driver would still win the race.

      Because the kernel debugger made “clean up” much slower, you increased the chance of a bad hit from the floppy driver, causing trouble for the display driver.

      I suspect that Raymond has long since forgotten the actual details, though…
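      The two handler shapes above can be compared with a toy model. This is only a sketch of the guess in this comment, not the actual driver: the step durations are invented tick counts, and `handler_steps` and `unsafe_arrival_ticks` are hypothetical names. It lists the ticks at which a competing interrupt would arrive while the handler is running without the lock.

```python
def handler_steps(cleanup_under_lock, cleanup_ticks):
    """Steps of the hypothetical handler as (name, duration, lock_held).
    Durations are made-up tick counts for illustration only."""
    return [
        ("unmask interrupts", 1, True),
        ("determine cause", 2, True),
        ("handle interrupt", 5, True),
        ("clean up ready for next IRQ", cleanup_ticks, cleanup_under_lock),
    ]


def unsafe_arrival_ticks(steps):
    """Return the ticks during which the handler runs without the lock,
    i.e. the arrival times at which a competing interrupt wins the race."""
    unsafe = []
    tick = 0
    for _name, duration, lock_held in steps:
        for _ in range(duration):
            if not lock_held:
                unsafe.append(tick)
            tick += 1
    return unsafe


correct = unsafe_arrival_ticks(handler_steps(True, 3))
buggy = unsafe_arrival_ticks(handler_steps(False, 3))
buggy_with_debugger = unsafe_arrival_ticks(handler_steps(False, 30))

print(len(correct), len(buggy), len(buggy_with_debugger))
```

      With the clean-up inside the lock there is no unsafe tick at all. With it outside the lock the window exists but is small, and stretching the clean-up (as the debugger output would) stretches the window by the same amount.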

    2. Ben Voigt says:

      In a race between a tortoise and a hare, you don’t normally expect the hare to lose. The display driver is talking on a much faster bus, to a much faster peripheral, than the floppy driver.

  5. Koro says:

    Doesn’t the Win9x kernel have a concept similar to NT’s IRQLs to prevent lower-priority interrupts from interrupting?

    1. I don’t remember the details. Didn’t realize there would be a quiz 23 years later. Let’s say that the display driver had already returned to “not in a hardware interrupt” mode, but inadvertently expected to be able to get a small amount of additional work done before another interrupt came in.

    2. cheong00 says:

      I think Win9x did not use IRQLs; instead it used IRQs with rules similar to those of the DOS era.

      Add-in display cards usually use IRQ 9-11 and the floppy usually uses IRQ 6, so their requests are handled independently of each other.

      If anything, since IRQ 0-7 are handled by the master i8259 chip and IRQ 8-15 by the slave i8259 chip, floppy drives are more likely to gain higher priority to talk to the CPU.

      1. Cesar says:

        IIRC, the slave IRQ controller was cascaded to IRQ 2 on the master IRQ controller, so the priority sequence was IRQs 0-1, IRQs 8-15, IRQs 3-7 (with IRQ 0 being the timer, with the highest priority). So a floppy drive at IRQ 6 would have the second lowest priority, higher only than IRQ 7 (which IIRC was the parallel port).

        1. cheong00 says:

          Thanks for correcting me.
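The priority sequence described in the thread above falls out of the cascade wiring, and a few lines are enough to derive it. This is a minimal sketch that assumes the default fixed-priority arrangement on each controller (lowest-numbered input wins) and the slave wired to IRQ 2 on the master, as stated in the comment; `cascaded_priority_order` is a made-up name.

```python
def cascaded_priority_order(cascade_irq=2, master_count=8, slave_count=8):
    """IRQ numbers from highest to lowest priority when the slave
    controller's output feeds the master's cascade_irq input and each
    controller gives priority to its lowest-numbered input (assumed)."""
    order = []
    for irq in range(master_count):
        if irq == cascade_irq:
            # The slave's inputs all inherit the cascade input's slot.
            order.extend(range(master_count, master_count + slave_count))
        else:
            order.append(irq)
    return order


order = cascaded_priority_order()
print(order)
print("IRQ 9 outranks IRQ 6:", order.index(9) < order.index(6))
```

In this ordering, a display card on IRQ 9-11 sits ahead of the floppy controller on IRQ 6, and only IRQ 7 ranks below the floppy.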

