Speculation on how a mishandled 13-character string can result in a blue screen


Commenter nolan reminisces about an old Windows 95 bug in the networking layer that crashed if a string was exactly 13 characters long. "So for the past 10 years or so, I've been wondering exactly how one could write code with that bug. Any bug that weird has to have a great story behind it."

I don't know what the story behind it is, but if after ten years you still can't imagine how such a bug could be possible, you don't have a very active imagination.

SomeFunction(char *hostname)
{
 char tmpbuffer[13]; // most host names are less than this size
 char *buffer;
 if (strlen(hostname) > sizeof(tmpbuffer)) {
   buffer = strdup(hostname);
   if (!buffer) return some error;
 } else {
   buffer = strcpy(tmpbuffer, hostname);
 }
 ... do stuff with buffer ...
 if (buffer != tmpbuffer) free(buffer);
}

If the host name is exactly 13 characters, then this code overflows the buffer by one character, corrupting the stack. A crash is hardly surprising at this point.

Comments (35)
  1. Doug says:

    Buffer overflow/Stack corruption bugs are, are, ah, I don’t know what.

    So much of it starts from the base C standard string libraries from the 70s, that had no fear.

    The amusing/sad thing is that these types of bugs have moved from annoyances (crashes on bad input) to security issues.  What used to be a low probability input, not worth fixing has become a priority one fix now.

    Oh, and the Microsoft bashers?  Pls go away.  You are like the little kid in the corner, going "nah nah, make me".  Grow up.

  2. Sunil Joshi says:

    I suppose this is one of those micro-optimisations  which we here about. If the overhead of allocation is so important wouldn’t _alloca be beter then allocating a fixed size buffer on the stack?

    Though, I suppose the author would still need to remember to leave space for the null byte either way.

    [They’re so cute when they’re young. _alloca just swaps one bluescreen for another. -Raymond]
  3. Boris says:

    That (not the buffer overflow, but the original question) reminds me of a bug in SQL code I worked on once. The code failed with a division by zero error only when a date was passed within a specific month (August I think). Turns out we had a function meant to return how many days the month of a passed date had. The code went something like this:

    Declare @ret INT

    Set @ret = 0

    IF @month = 1 or @month = 3 or @month = 5 or @month = 7 or @month = 10 or @month = 12

    @ret = 31

    IF @month = 4 or @month = 6 or @month = 9 or @month = 11

    @ret = 30

    IF @Month = 2

    . . .

    Oops! August has zero days! Took awhile to find it too.

    Side note: this (or similar) code previously had a bug where it would always return 29 for February. Hey, it was tested and worked fine in 2004, right?

  4. Joe says:

    this crashes for any string size >13.  I thought the scenario was "crash at strlen = 13, strlen <13 is ok and so is strlen > 13"…

  5. Bryan says:

    @Joe:

    I made the same mistake at first that you did.  At least I’m not alone.

    if (strlen(hostname) > sizeof(tmpbuffer))

  6. DriverDude says:

    "What used to be a low probability input, not worth fixing has become a priority one fix now."

    Buffer overflows should have been priority-1 bugs since the Morris Worm in 1988. It took Sun and others years to wake up to that, and followed slowly by Microsoft in 2001.

  7. Maurits says:

    an old Windows 95 bug in the networking layer that crashed if a string was exactly 13 characters long

    That’s not weird.  Weird would be if the bug also only happened on Fridays.

  8. John_F says:

    I think they are thinking of IP’s not hostnames – which still would’nt really work.

  9. John says:

    13 characters ought to be enough for anybody :)

  10. Josh says:

    @Joe

    This will only crash if the hostname is exactly 13 characters.  If there are more than 13 characters, strdup allocates enough memory to hold a copy of the string (including the trailing NULL).  If there are less than 13 characters, there’s already enough room in tmpbuffer to hold the whole string and a NULL terminator.  If there are exactly 13, the test fails and tmpbuffer is used, but when strcpy copies the string over and adds in a trailing NULL it overruns the buffer by writing 14 characters to a 13 character buffer.  This is due to the fact that strlen does not include the NULL terminator in its result.

    The easy "fix" would be to change the test to (strlen(hostname) >= sizeof(tmpbuffer))

  11. Sunil Joshi says:

    [They’re so cute when they’re young. _alloca just swaps one bluescreen for another. -Raymond]

    Yes, come to think of it I suppose if you _alloca for any arbitrary size string, you could overflow the stack.

    Forget I mentioned it – in my defence I would just declare a std::string and be done with it. Though, I think std::string performs a similar optimisation internally.

    [std::string swaps one bluescreen for another. -Raymond]
  12. Gabe says:

    Sunil: If you were writing a Win95 network driver, you wouldn’t have use std::string. In fact, you still probably can’t use std::string in network drivers.

  13. Joshua Ochs says:

    None of this explains the strange choice of 13. Anyone can come up with code for a buffer overflow or an edge case not covered – that’s trivial. What the original person probably couldn’t fathom (as I can’t either) is why on earth it would be 13. 255 give or take one I could see for obvious reasons. Perhaps other base-2 boundaries. But 13 is so incredibly arbitrary that it defies all logic.

    Of course, as we’ve seen on this blog, much in Windows’ development defies logic.

    And meanwhile, this would be a good example of how unit tests could have caught this – you always check any edge case plus or minus one for exactly things like this. Does the code behave on small values, large values, and all of the edge cases (integer overflows, signed integers, hardcoded values – like 13).

    [I put the explanation of the number 13 in the comment: “most host names are less than this size”. Instrumentation is often used to guide decisions like this. -Raymond]
  14. John says:

    I think he meant to use std::string regardless of the length of the input string (i.e. removing the bad code); at least that was my interpretation.

  15. J says:

    SomeFunction(char *hostname)

    {

      char tmpbuffer[strlen("microsoft.com")];

      char *buffer;

      …

  16. dave says:

    re Joshua Ochs.

    My interpretation is not that it’s intended to handle a length of 13, it’s intended to handle a length of 12 (but since it’s buggy, who can be sure?).

    12 is short, and I have no explanation for why it was chosen, but it’s not as odd as 13.

  17. Kevin says:

    Too me 13 looks like a good fit for a DOS file name (8 + 3 + separator + null).

  18. Warll says:

    "[std::string swaps one bluescreen for another. -Raymond]"

    How many bluescreens are there anyway?

  19. Maurits says:

    How many bluescreens are there anyway?

    Just one, with varying implementations.

  20. Duke of New York says:

    And here’s the fix:

    char tmpbuffer[PATH_MAX]; // all imaginable strings are less than this size

  21. Mark Steward says:

    According to http://support.microsoft.com/kb/q184242 it’s actually when NBT attempts to resolve a 15-character name with more than one period in.  DNS usually had higher precedence than NetBIOS, so only an invalid domain name would trigger the fault.

    15 is of course the maximum length of a NetBIOS name (see Google for why).  I vaguely remember Win95 truncating anything longer, suggesting a fixed-length buffer.  I wonder if it’s to do with replacing periods with nulls?

  22. Mark Steward says:

    As for the mysterious 13 characters, see http://google.com/search?q=cache:attrition.org/security/denial/w/winsk20.dos.html+%22(not+including+periods)%22

  23. Nick says:

    Aw, what’s one byte* between friends?

    *may or may not be one byte

  24. porter says:

    If only they had come up with memcpy_s etc all these problems would have gone away.

    So I assume any programmer worth their salt now merely adds #define memcpy(a,b,c)  memcpy_s(a,c,b,c)….

  25. Worf says:

    @Gabe:

    "Sunil: If you were writing a Win95 network driver, you wouldn’t have use std::string. In fact, you still probably can’t use std::string in network drivers."

    Sure you can. May not work on all Windows, hell, may only work on one flavor of Windows…

    Just like I used the NDIS library functions to help manage linked lists and buffers and stuff. Include NDIS.LIB, and let people wonder why your code needs ndis.dll.

    Of course, in Windows CE, you can use C++ just fine, and std::string should work in the miniport…

  26. HagenP says:

    Boris wrote:

    The code failed with a division by zero error

    only when a date was passed within a

    specific month (August I think).

    You are not alone. From the mail archive (2006-12-xx):

    […] the WDK tool Signability.exe (from WDK RTM) reports the error

     "Signability Test failed to run, TestDrivers function call failed."

    when in the driver INF file DriverVer is set to December, 2006.

    Examples:

     DriverVer=11/30/2006,6.0.6000.16386  ; Fine

     DriverVer=12/01/2006,6.0.6000.16386  ; Provokes error

     DriverVer=01/24/2007,6.0.6000.16386  ; Fine

    => signability.exe has since been replaced by inf2cat.exe

    (Yes, I actually did try if I could use "0" for the year number. No, it did not work.) :-)

  27. HagenP says:

    (Yes, I actually did try if I could use "0" for the month. No, it did not work.) :-)

    …errors happen. Frequently!

  28. Jonathan says:

    http://attrition.org/security/denial/w/winsk20.dos.html

    Notice how it’s actually "exactly 13 characters long (not including periods)", and that each FQDN contains exactly 2 periods. That brings us up to 15, which is suspiciously close to NETBIOS names. And indeed this page’s final comment confirms it’s some problem with NETBIOS name resolution.

  29. Alexandre Grigoriev says:

    @Mike Dimmick,

    IRQL_NOT_LESS_OR_EQUAL happens when a page fault occurs under IRQL>=DISPATCH_LEVEL. It doesn’t mean the code or data was paged out. It means any page fault occurs, including an access to page zero, or to any not mapped address.

    Paged code is another can of worms, and there is no reason to use it in the kernel anymore. I don’t understand why MS won’t just ignore this attribute for code sections. It even bites MS programmers themselves, which I found one time (ACPI paged code may get called under DISPATCH_LEVEL).

    The issues with C++ in the kernel are solvable, and C++ was used for KMDF runtime.

    You can’t really have C++ exceptions; this is why C++ standard library won’t fly.

  30. Mike Dimmick says:

    On the subject of ‘what is a blue screen’:

    Blue screens are generated (in Windows NT family) by KeBugCheckEx. This routine can be called explicitly when you know you have a fatal problem, or by the system in response to an exception it couldn’t handle, or that you didn’t handle.

    The most common problem is that a coding error causes an access violation, which generally leads to stop code 1E, KMODE_EXCEPTION_NOT_HANDLED. That code is also caused by many other situations, like using a breakpoint instruction with no debugger attached.

    The next most common problem, but quite erratic in how it reproduces, is that you’ve marked a piece of code or static data as pageable (or rather, not marked it as non-pageable) but referenced it when paging is not allowed. This produces stop code 0xA, IRQL_NOT_LESS_OR_EQUAL. The problem only reproduces when the referenced page is not currently in RAM, which on many systems means most of the time it doesn’t occur, only when under memory pressure. IRQL is the Interrupt Request Level, and it has to be at or below DISPATCH_LEVEL to allow paging (because the system needs to suspend this thread and *dispatch* another, so it can do something else while this thread waits for the page to be read from disk). Interrupts and deferred procedure calls are handled at levels above DISPATCH_LEVEL.

    Taking Raymond’s cases: in the original problem, stack corruption will cause a return to a random address, or an incorrect frame pointer will be restored, or some other register value will be set incorrectly. That will either directly cause a reference to an invalid page (jumping to a non-code page, writing to a read-only page) or will corrupt something else, ultimately ending up either in a critical structure not self-checking, which would cause some specific stop code, or a 0x1E or 0xA as above.

    Switching to _alloca is very likely to cause a stack overflow, because kernel stacks are very small (I see references to 32KB). I think you get a KMODE_EXCEPTION_NOT_HANDLED here with the STATUS_STACK_OVERFLOW code (0x800000FD).

    The reason C++ is banned from kernel mode is to do with paging and how you declare code as paged. The Portable Executable file format is divided into Sections, and each Section has attributes (flags) which indicate whether the section is code or data, pageable or non-pageable, etc. One of the flags is the ‘Not Paged’ flag. In the source code, you can use #pragmas to define what the section attributes are and to say which section code or data should go into. The C++ standard library doesn’t specify such things and as such ends up in your default code section (“.text”) or whatever was active when you #included the file. As a result, the code for std::string could be marked pageable, but you could have referenced it from code that must be non-pageable.

    Worse still is that the compiler generates code for you, for example default constructors, destructors, and exception and delete helpers, and there is no way at all to control where that code ends up.

    The result is that std::string is quite likely to cause IRQL_NOT_LESS_OR_EQUAL, but will be maddeningly unreproducible.

    [The original issue was Win95, which has a completely different kernel mode model from the NT family. -Raymond]
  31. Michiel says:

    You could have C++ exceptions. The problem is really MSVC++, which implements them on top of SEH. That’s definitely not a good idea in kernel code. However, it’s conceivable that a future compiler would add a /kernelexception option.

    Pageable/nonpageable is indeed trivially fixed. Mark .text nonpageable (safe default) and use .page as for the functions you explicitly allow to be paged.

  32. Alexandre Grigoriev says:

    Michiel,

    The problem with using .page (and INIT) with C++ code that you don’t know where some implicitly instantiated items end up, such as vtable and ctor/dtor.

  33. GordonSchumacher says:

    @Michiel and Alexandre Grigoriev: Here’s Microsoft’s official summary:

    http://www.microsoft.com/whdc/driver/kernel/KMcode.mspx

  34. Porter, see the FQA for some of the myriad ways in which that macro can make things worse. Non-hygienic textual substitution macros ftl.

Comments are closed.