Why does GetServiceDisplayNameA report a larger required buffer size than actually necessary?

If you call the Get­Service­Display­NameA function (the ANSI version), it reports a buffer size far larger than what appears to be necessary.

For example, if the service display name is awesome, you would expect that the required buffer size is seven characters, one for each character in "awesome". (The Get­Service­Display­NameA function does not count the terminating null; it expects you to add one yourself.)

But instead, the Get­Service­Display­NameA function says that you need fourteen characters. And then when you give it a buffer that is fifteen characters long, it fills in only eight of them, and then says "Ha ha, I wrote only seven characters (plus the terminating null). Silly you allocated far more memory than you needed to, sucka!"

Why is it reporting a required buffer size larger than what it actually needs?

Because character set conversion is hard.

When you call the Get­Service­Display­NameA function (ANSI version), it forwards the call to the Get­Service­Display­NameW function (Unicode version). If the Unicode version says, "Sorry, that buffer is too small; it needs to be big enough to hold N Unicode characters," the ANSI version doesn't know how many ANSI characters that translates to. A single Unicode character could expand to as many as two ANSI characters in the case where the ANSI code page is DBCS. The Get­Service­Display­NameA function plays it safe and assumes the worst-case scenario: that the service display name consists entirely of Unicode characters which each require two ANSI characters to represent.
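The worst-case arithmetic can be illustrated outside of Windows. The following Python sketch is only a model of the calculation, not the actual implementation: `worst_case_ansi_size` and the other helpers are hypothetical names, and Python's `cp1252` and `cp932` codecs stand in for a single-byte ANSI code page and a DBCS ANSI code page respectively.

```python
def utf16_units(text):
    """Number of UTF-16 code units in text, not counting a terminating null."""
    return len(text.encode("utf-16-le")) // 2


def worst_case_ansi_size(text):
    """Hypothetical model: assume every UTF-16 code unit needs two ANSI bytes."""
    return 2 * utf16_units(text)


def actual_ansi_size(text, codepage):
    """Bytes really needed, known only after the conversion is performed."""
    return len(text.encode(codepage))


# Compare the estimate with the real size for a single-byte and a DBCS name.
for name, codepage in [("awesome", "cp1252"), ("サービス", "cp932")]:
    print(name, codepage, worst_case_ansi_size(name), actual_ansi_size(name, codepage))
```

For "awesome" the estimate is double the real size, whereas for the katakana name under the DBCS code page the estimate is exactly what is needed.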

That's why it over-reports the buffer size.

When you call it with a buffer that is fifteen characters long, the Get­Service­Display­NameA function calls the Get­Service­Display­NameW function, which says, "No problem, here's your display name. It's seven Unicode characters long." The Get­Service­Display­NameA function then converts it from Unicode to ANSI, and it turns out that it requires only seven characters plus the terminating null. Hm, how about that. Okay, well here's your seven-character string. Sorry about the extra seven characters you allocated. I asked you to allocate them just in case.
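The whole exchange can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the real code: `get_display_name_w` and `get_display_name_a` are hypothetical stand-ins for the two functions, the display name is hard-coded, `cp1252` stands in for the ANSI code page, and the check that the converted name also fits in the buffer is an assumption about how the failure case is handled.

```python
SERVICE_DISPLAY_NAME = "awesome"  # hypothetical service display name


def get_display_name_w(capacity):
    """Stand-in for the Unicode version.

    Returns (success, text, units), where units is the name's length in
    UTF-16 code units, not counting the terminating null.
    """
    units = len(SERVICE_DISPLAY_NAME.encode("utf-16-le")) // 2
    if capacity < units + 1:  # no room for the name plus the null
        return False, None, units
    return True, SERVICE_DISPLAY_NAME, units


def get_display_name_a(capacity, codepage="cp1252"):
    """Stand-in for the ANSI version, which forwards to the Unicode version.

    Returns (success, data, size). On failure, size is the worst-case
    estimate; on success, it is the number of bytes actually produced,
    not counting the terminating null.
    """
    ok, text, units = get_display_name_w(capacity)
    if not ok:
        # Nothing to convert yet, so assume two bytes per code unit.
        return False, None, 2 * units
    data = text.encode(codepage)
    if capacity < len(data) + 1:  # assumption: the converted name must fit too
        return False, None, 2 * units
    return True, data, len(data)


# First call: no buffer, so the function reports the worst-case size.
ok, data, size = get_display_name_a(0)
print(ok, size)

# Second call: allocate what was asked for, plus one for the null.
ok, data, size = get_display_name_a(size + 1)
print(ok, data, size)
```

The first call fails and asks for fourteen; the second call succeeds with a fifteen-character buffer and reports that only seven characters were written.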

Bonus chatter: These worst-case calculations will break if the ANSI code page were ever UTF-8, because the worst-case expansion becomes three UTF-8 code units for one UTF-16 code unit, rather than just two to one for DBCS code pages. These types of assumptions about the worst-case scenario are buried throughout tens of millions of lines of source code. Finding them is quite a challenge.
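To see why a two-to-one estimate falls short for UTF-8, here is a small Python sketch. The helper names are hypothetical and the sample characters are chosen purely for illustration.

```python
def utf16_units(text):
    """Number of UTF-16 code units in text."""
    return len(text.encode("utf-16-le")) // 2


def bytes_per_unit(text, encoding):
    """Encoded bytes per UTF-16 code unit for the given text."""
    return len(text.encode(encoding)) / utf16_units(text)


euro = "\u20ac"       # EURO SIGN: a single UTF-16 code unit
emoji = "\U0001f600"  # outside the BMP: a surrogate pair, two code units

print(bytes_per_unit(euro, "utf-8"))   # three UTF-8 bytes for one code unit
print(bytes_per_unit(emoji, "utf-8"))  # four UTF-8 bytes for two code units

# A seven-character name made of euro signs: estimate versus reality.
name = euro * 7
print(2 * utf16_units(name), len(name.encode("utf-8")))
```

A name consisting of seven euro signs would be estimated at fourteen bytes but needs twenty-one in UTF-8. Characters outside the BMP are not the worst case, because each one already occupies two UTF-16 code units.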

Comments (32)
  1. Gee Law says:

    Does that mean you should avoid setting your default code page to UTF-8, which is now possible as beta in Windows 10 version 1803?

    1. Antonio Rodríguez says:

      As I understand it, the legacy code page is a concession to legacy software that is not Unicode-aware. As UTF-8 is quite recent, I doubt there are many applications developed taking it into account; but there are scores of them developed for ANSI which assume (wrongly) that one character is always one byte. So I guess setting it would cause all kinds of crashes and compatibility issues with old applications (because of the apps’ internal string manipulation – which Windows has no control over). I would avoid it at all costs.

      1. Cesar says:

        UTF-8 is not “quite recent”, it’s older than Windows 95. But AFAIK the “codepage” concept Windows uses is even older, from the MS-DOS days or earlier.

        The current recommendation seems to be to use the “W” APIs, and convert to/from UTF-8 manually. The “A” APIs, and the whole codepage concept which comes with them, should be treated as deprecated.

        1. Brian says:

          It’s somewhat recent by Windows standards. When Windows NT first adopted Unicode, there were <64k Unicode characters, and the only representation was 1 character = 2 bytes (either Big or Little endian). NT shipped about the same time as the development of UTF-8, but without any knowledge of any non-2-byte Unicode representation. UTF-8 really only took off when HTTP and HTML turned the whole world into a stream of bytes, later in the 90s.
          Windows 95 doesn’t really figure into the equation. Unlike NT, it was an “encode characters using code pages” OS, just like DOS and 16-bit Windows were. UNICOWS fixed that, but too late to have any impact at all.

      2. James Picone says:

        Such programs would have already been wrong, surely? The legacy code page can be multi-byte, I thought, just originally a maximum of two bytes/character. You’re not going to break it any harder because UTF8 matches ANSI in the ANSI range.

        1. CN says:

          It matches “ANSI” in the **ASCII** range. For those who use languages that fit comfortably in 8-bit “ANSI” (e.g. CP1252), but absolutely not in ASCII (French, German, Swedish, …) — that’s a huge difference.

      3. Joshua says:

        On the other hand, I’ve got a dozen or so programs that are waiting for this to appear. (Or alternately a standard library where char * is UTF-8 in fopen(), dir(), printf(), etc.)

        1. GL says:

          I guess they’d better solve that problem themselves. The /execution-charset:utf-8 switch will do the trick.

          1. Joshua says:

            I tried this. Your idea would result in trying to pass UTF-8 strings to fopen with spectacular results. The Windows Console can handle this. fopen() cannot. Pipes can’t be declared to be in UTF-16.

      4. Antonio Rodríguez says:

        I was talking about Windows supporting UTF-8 as a codepage. Which is definitely more recent than Windows 95, given that in 2018 it’s still in beta, but Windows 95 was launched in 1995 (23 years before).

    2. Dan says:

      I’ve long wondered why Windows hasn’t supported UTF-8 for the “ANSI” code page, and now I finally got a straightforward explanation *why*.

  2. ranta says:

    Does Win32 have a constant similar to MB_LEN_MAX? I guess MB_LEN_MAX itself is not guaranteed to be large enough for all code pages, only for CRT locales.

    The Windows equivalent of MB_CUR_MAX apparently is CPINFO::MaxCharSize after calling GetCPInfo, which requires error handling and is too cumbersome to do just in case.

    1. Even if MB_LEN_MAX had existed, its value would have been 2, because at the time (1983), the worst-case length for a character was 2 bytes (DBCS).

      1. Joshua says:

        Indeed, but it makes finding the fixup locations easy.

  3. DWalker07 says:

    If the first sentence had said “If you call the Get­Service­Display­NameA function (the ANSI version), it reports a buffer size twice the size of what appears to be necessary”, I would have thought “DBCS” or “Unicode” right off the bat. As it is, when I read “fourteen” instead of “seven”, I knew that was the answer.

  4. Stefan Kanthak says:

    Bonus question: why does ConvertSecurityDescriptorToStringSecurityDescriptor() return in the 5th argument (StringSecurityDescriptorLen; number of characters written to the output buffer) a value bigger than the length of the string written to the output buffer?

  5. Kirit says:

    You can’t take the UTF-16 code units and convert them to UTF-8 without getting a borked stream, so you need to go UTF-16 -> UTF-32 -> UTF-8. Your worst case here is 4 bytes not 3.

    1. florian says:

      > … three UTF-8 code units for one UTF-16 code unit …

      With UTF-8, code points from the BMP take up (at maximum) 3 bytes, and code points from the Supplementary Planes take up 4 bytes.

      With UTF-16, code points from the BMP take up 2 bytes (1 WORD), and code points from the Supplementary Planes take up 4 bytes (2 WORDs).

      So the mentioned “worst-case expansion” of 3 to one “code units” is 3 to 2 bytes.

      And, it’s possible to perform UTF-16 to UTF-8 (and vice versa) conversion “directly”, it’s just a matter of masking and shifting the non-framing bits to the correct positions:

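      UTF-8 (BMP): 1110xxxx 10xxxxxx 10xxxxxx

      UTF-16 (BMP): xxxxxxxx xxxxxxxx

      UTF-8 (Supplementary Planes): 11110abc 10dexxxx 10xxxxxx 10xxxxxx

      UTF-16 (Supplementary Planes, i.e. surrogate pairs): 110110wx yzxxxxxx 110111xx xxxxxxxx
      (wxyz = abcde-1)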

      Unicode has been “artificially” limited to a 21-bit encoding space, so that all code points can be represented with UTF-16, even if UTF-8 would theoretically cover a 42-bit encoding space (with 8-byte sequences).

      To me, Unicode is one of the most fascinating systems I’ve discovered, to date, a true technical masterpiece.

    2. Kevin says:

      Sure you can, you just have to know what you’re doing (i.e. you have to parse surrogate pairs and convert them into a single character each, which you also have to do to convert to UTF-32). Incidentally, this conversion (UTF-16 surrogate pair -> UTF-8 single character) will only blow up 2x because the surrogate pair occupies two “Unicode characters” (as Raymond has been calling them) to encode one Unicode code point (as the Unicode consortium calls them), which occupies four bytes in UTF-8.

      1. I try to remember to call them “Unicode code units”. If I miss a spot, let me know.

  6. Neil says:

    > You gave it a buffer with room for 15 characters and it filled in only 8 of them.

    Where did it get the buffer to call the wide version of the function? It can’t use your buffer, because it’s only got room for 7 wide characters, and the wide function needs to fill in 8 of them.

    Is there something I’ve forgotten from a previous post whereby the null terminator isn’t written in the limiting case, but the function still succeeds?

    1. It allocates a local Unicode buffer with the same capacity as the ANSI buffer.

    2. florian says:

      There were wonderful MSJ – “Under the Hood” articles by Matt Pietrek, maybe 20 years ago, and unfortunately no longer available on microsoft.com, to explain the techniques used to speed up the ANSI to Unicode (and vice versa) API translations.

      It was (is?) full of fanciness like hand-optimized assembly code with unrolled loops for UCS-2 (later UTF-16) to system ANSI code page conversion, and static/preallocated per-PEB (or, maybe per-TEB) string conversion buffers (with lengths of 32767 “characters”, accounting for many of the string length limits found in Win32 ANSI APIs).

    3. Alex Cohn says:

      This API and her kin were never about saving memory, but rather reducing (sometimes) memory management effort.

  7. Joshua says:

    So I’ve got something useful to say on the UTF-8 codepage. My previous analyses were all based on application code, not library code. It takes a special kind of mistake to make code page UTF-8 break in application code, but it’s very easy for that mistake to exist in the C library itself. Essentially, it’s the same kind of problem as the W-A problem given here. The problem shouldn’t exist in application code because application code has no business doing partial-range code-point conversion. But fgets() and fprintf() need to do this internally, and it’s really easy to mess up the chunk-read algorithm in fgets() by assuming at most two bytes here.

    This is one of the many reasons why you really want a platform libc. While you can service the msvc*.dll files, several VS versions including 2005 went out with broken support for this where servicing one application’s local copy would break another application’s local copy, resulting in people statically linking against libc, which can’t be serviced.

  8. Alex Cohn says:

    To me, this and similar APIs that return the size of the required buffer sans terminating nul are reminiscent of $9.99 price tags.

  9. Anonymous says:
    (The content was deleted per user request)
  10. Anonymous says:
    (The content was deleted per user request)
  11. Anonymous says:
    (The content was deleted per user request)
  12. poizan42 says:

    > Bonus chatter: These worst-case calculations will break if the ANSI code page were ever UTF-8, because the worst-case expansion becomes three UTF-8 code units for one UTF-16 code unit, rather than just two to one for DBCS code pages. These types of assumptions about the worst-case scenario are buried throughout tens of millions of lines of source code. Finding them is quite a challenge.

    Uh-uh https://i.imgur.com/BuxCncw.png. Yes, that does exactly that…

    1. poizan42 says:

      … how do I get the blog software to make a newline? Several line breaks do nothing, and <br> is ignored as well.

      1. poizan42 says:

        Wait, my comment shows up fine when I’m not logged in, but misses the newline when I am logged in.

        Logged in: https://i.imgur.com/OO5PhWA.png

        Not logged in: https://i.imgur.com/TYLc9R7.png
