Keep your eye on the code page: C# edition (the mysterious third code page)


A customer was having trouble manipulating the console from a C# program:

We found that C# can read only ASCII data from the console. If we try to read non-ASCII data, we get garbage.

using System;
using System.Text;
using System.Runtime.InteropServices;

class Program
{
  [StructLayout(LayoutKind.Sequential)]
  struct COORD
  {
    public short X;
    public short Y;
  }

  [DllImport("kernel32.dll", SetLastError=true)]
  static extern IntPtr GetStdHandle(int nStdHandle);

  const int STD_OUTPUT_HANDLE = -11;

  [DllImport("kernel32.dll", SetLastError=true)]
  static extern bool ReadConsoleOutputCharacter(
    IntPtr hConsoleOutput,
    [Out] StringBuilder lpCharacter,
    uint nLength,
    COORD dwReadCoord,
    out uint lpNumberOfCharsRead);

  public static void Main()
  {
    // Write a string to a fixed position
    System.Console.Clear();
    System.Console.WriteLine("\u00C5ngstr\u00f6m");

    // Read it back
    COORD coord = new COORD { X = 0, Y = 0 };
    StringBuilder sb = new StringBuilder(8);
    uint nRead = 0;
    ReadConsoleOutputCharacter(GetStdHandle(STD_OUTPUT_HANDLE),
                               sb, (uint)sb.Capacity, coord, out nRead);
    // Trim off any unused excess.
    sb.Remove((int)nRead, sb.Length - (int)nRead);

    // Show what we read
    System.Console.WriteLine(sb);
  }
}

Observe that this program is unable to read the Å and ö characters. They come back as garbage.

Although there are three code pages that have special treatment in Windows, the CLR gives access to only two of them via DllImport.

  • CharSet.Ansi = CP_ACP
  • CharSet.Unicode = Unicode (which in Windows means UTF-16LE unless otherwise indicated).

Unfortunately, the console traditionally uses the OEM code page.
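
To see which code pages are in play on a particular machine, something like the following sketch can print them. It assumes Windows and uses the kernel32 functions GetACP and GetOEMCP; the class name CodePageReport is made up for this illustration.

```csharp
using System;
using System.Runtime.InteropServices;

class CodePageReport
{
  // Both functions take no parameters and return a code page identifier.
  [DllImport("kernel32.dll")]
  static extern uint GetACP();

  [DllImport("kernel32.dll")]
  static extern uint GetOEMCP();

  public static void Main()
  {
    // The code page that CharSet.Ansi marshaling uses
    Console.WriteLine("ANSI code page: " + GetACP());

    // The code page the console traditionally uses
    Console.WriteLine("OEM code page: " + GetOEMCP());

    // The code page the console is using for output right now
    Console.WriteLine("Console output code page: " + Console.OutputEncoding.CodePage);
  }
}
```

On a typical US-English installation the first two values are 1252 and 437, so the two 8-bit code pages disagree as soon as a character falls outside ASCII.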

Since the DllImport did not specify a character set, the CLR defaults (unfortunately) to CharSet.Ansi. Result: The ReadConsoleOutputCharacter function stores its results in the OEM code page (CP_OEMCP), the CLR treats the buffer as if it were CP_ACP, and the result is confusion.
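
The mismatch can be reproduced without touching the console at all. The sketch below assumes code page 437 for OEM and 1252 for ANSI (typical for US-English systems, but not universal), and assumes .NET Core 3.0 or later, where those code pages come from CodePagesEncodingProvider; on .NET Framework the RegisterProvider line can be dropped. The class and method names are invented for the example.

```csharp
using System;
using System.Text;

static class CodePageMismatchDemo
{
  // Encode text with one code page and decode the bytes with another,
  // which is what happens when OEM bytes are marshaled as if they were ANSI.
  public static string DecodeWithWrongCodePage(string text, Encoding actual, Encoding assumed)
  {
    byte[] raw = actual.GetBytes(text);
    return assumed.GetString(raw);
  }

  public static void Main()
  {
    // Makes code pages 437 and 1252 available on .NET Core 3.0+ / .NET 5+.
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

    Encoding oem = Encoding.GetEncoding(437);   // assumed OEM code page
    Encoding ansi = Encoding.GetEncoding(1252); // assumed ANSI code page

    string garbled = DecodeWithWrongCodePage("\u00C5ngstr\u00F6m", oem, ansi);

    // Print code points so the output does not depend on the console's own code page.
    foreach (char c in garbled)
    {
      Console.Write("U+{0:X4} ", (int)c);
    }
    Console.WriteLine();
  }
}
```

Under these assumptions the ASCII letters survive, while the ö (byte 0x94 in code page 437) comes back as U+201D, a right double quotation mark.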

The narrow-minded fix is to repair the mojibake after the fact: take the misconverted Unicode string, convert it to bytes via the ANSI code page, then convert those bytes back to Unicode via the OEM code page.
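
For illustration only, the round trip could be sketched like this. The helper name Repair is hypothetical, the code pages 1252 (ANSI) and 437 (OEM) are assumptions about the machine, and the RegisterProvider call assumes .NET Core 3.0 or later (it is not needed on .NET Framework).

```csharp
using System;
using System.Text;

static class MojibakeRepair
{
  // Undo a wrong decode: recover the original bytes with the code page that
  // was wrongly assumed, then decode them with the code page actually used.
  public static string Repair(string garbled, Encoding assumed, Encoding actual)
  {
    byte[] raw = assumed.GetBytes(garbled);
    return actual.GetString(raw);
  }

  public static void Main()
  {
    // Makes code pages 437 and 1252 available on .NET Core 3.0+ / .NET 5+.
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

    Encoding ansi = Encoding.GetEncoding(1252); // assumed ANSI code page
    Encoding oem = Encoding.GetEncoding(437);   // assumed OEM code page

    // The letter o-umlaut is byte 0x94 in code page 437;
    // read as code page 1252, that byte becomes U+201D.
    string garbled = "str\u201Dm";
    string repaired = Repair(garbled, ansi, oem);

    Console.WriteLine(repaired == "str\u00F6m"); // prints True
  }
}
```

The repair is only as good as the guess about the two code pages, and it cannot bring back characters that the OEM code page could not represent in the first place.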

The better fix is simply to avoid the 8-bit code page issues entirely and say you want to use Unicode.

  [DllImport("kernel32.dll", SetLastError=true, CharSet=CharSet.Unicode)]
  static extern bool ReadConsoleOutputCharacter(...);
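
Spelled out in full, a corrected version of the program might look like the sketch below. The declarations mirror the ones in the original program; the helper ReadScreenText, the class name, and the error handling are additions made for this illustration. It assumes Windows and a real console attached to standard output.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using System.Text;

class UnicodeConsoleReader
{
  [StructLayout(LayoutKind.Sequential)]
  struct COORD
  {
    public short X;
    public short Y;
  }

  const int STD_OUTPUT_HANDLE = -11;

  [DllImport("kernel32.dll", SetLastError=true)]
  static extern IntPtr GetStdHandle(int nStdHandle);

  // CharSet.Unicode makes the CLR bind to ReadConsoleOutputCharacterW
  // and marshal the buffer as UTF-16, so no 8-bit code page is involved.
  [DllImport("kernel32.dll", SetLastError=true, CharSet=CharSet.Unicode)]
  static extern bool ReadConsoleOutputCharacter(
    IntPtr hConsoleOutput,
    [Out] StringBuilder lpCharacter,
    uint nLength,
    COORD dwReadCoord,
    out uint lpNumberOfCharsRead);

  // Hypothetical helper: read up to `length` characters starting at column x, row y.
  static string ReadScreenText(short x, short y, int length)
  {
    StringBuilder sb = new StringBuilder(length);
    COORD coord = new COORD { X = x, Y = y };
    uint nRead;
    if (!ReadConsoleOutputCharacter(GetStdHandle(STD_OUTPUT_HANDLE),
                                    sb, (uint)length, coord, out nRead))
    {
      throw new Win32Exception(Marshal.GetLastWin32Error());
    }

    // Keep only the characters that were actually read.
    if (nRead < sb.Length)
    {
      sb.Length = (int)nRead;
    }
    return sb.ToString();
  }

  public static void Main()
  {
    // Write a string to a fixed position
    Console.Clear();
    Console.WriteLine("\u00C5ngstr\u00f6m");

    // Read it back and show what we read
    Console.WriteLine(ReadScreenText(0, 0, 8));
  }
}
```

Checking the return value costs a few lines, but it separates "the call failed" from "the call returned nothing useful".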

Comments (12)

  1. Myria says:

    My mentality when it comes to Windows programming is that there are no Win32 APIs that end in a capital A.

    Is there a downside to writing "ReadConsoleOutputCharacterW" in the DllImport instead of "ReadConsoleOutputCharacter"?

    [Not sure what that gains you. You would have to write [DllImport("kernel32.dll", ExactSpelling=true, CharSet=CharSet.Unicode)] bool ReadConsoleOutputCharacterW(...) and you still have to deal with all the W's. To fix that, you would need to add EntryPoint="ReadConsoleOutputCharacter" and now you are way in the hole compared to just letting the CLR add the W. -Raymond]
  2. parkrrrr says:

    @Myria, there's one: OutputDebugStringA.

  3. Myria says:

    @parkrrrr: Yes, I knew that exception, but didn't feel like mentioning it =^-^=

    For those that don't know, Windows doesn't support UTF-16 debug messages, even in kernel mode.  OutputDebugStringW converts your string to the ANSI code page before calling OutputDebugStringA — the reverse of what the rest of the Win32 API does.  There are probably other exceptions, but this is one that stands out.

  4. Nick says:

    I agree with Raymond. Unless an API specifically only has a "W" version (and some do), explicitly say you want Unicode and let the CLR take care of it.

  5. J. Edward Sanchez says:

    OutputDebugStringA() makes me sad. I keep waiting for a proper OutputDebugStringW() to be implemented, but it seems like I'll be waiting forever.

    What's the best place to request that this be fixed?

  6. David says:

    LoadLibrary is probably another example of W delegating to A.

  7. Harry Johnston says:

    @Spire: that would be the "a hundred friends and a pony" department. :-)

    http://www.gocomics.com/…/13

    Seriously, they'd have to change the debugger interface.  Far too difficult a change for too trivial a benefit.  You can always encode your Unicode strings first if you really have to output them to the debugger.

  8. Neil says:

    Does this mean that you'd have been out of luck on Windows 9x/Me?

  9. Myria says:

    @David: No, but it *is* true of the import table.  LoadLibraryW -> LoadLibraryExW -> LdrLoadDll -> NtCreateFile + NtCreateSection + NtMapViewOfSection.  LdrLoadDll and later in that list all take NT's UNICODE_STRING type as parameters.  That function list is highly simplified, of course; loading a DLL actually has way more steps than that.  Those are just the fundamental ones.

    The import table of a PE only has 8-bit characters for the names of the DLLs from which it is importing functions.  Since this is usually from the Win32 API, this is fine, but theoretically could be a problem for application-specific DLLs.

  10. Joshua Bowman says:

    @Neil, well, ReadConsoleOutputCharacter didn't even exist until Win2k. I'm sure you could somehow grovel for the information, but it doesn't seem to have been supported in any way.

  11. WndSks says:

    @Joshua Bowman: ReadConsoleOutputCharacter goes all the way back to NT 3.51 and also exists in Win9x. You cannot trust version information on MSDN…

  12. Joshua says:

    @Spire: You will wait forever. That one goes to VGA text mode in some scenarios, which CANNOT support UTF-anything.
