Keep your eye on the code page: C# edition (warning about DllImport)


Often, we receive problem reports from customers who failed to keep their eye on the code page.

Does the SHGetFileInfo function support files with non-ASCII characters in their names? We find that the function either fails outright or returns question marks when asked to provide information for files with non-ASCII characters in their name.

using System;
using System.Runtime.InteropServices;

class Program
{
 static void Main(string[] args)
 {
  string fileName = "BgṍRồ.txt";
  Console.WriteLine("File exists? {0}", System.IO.File.Exists(fileName));
  // assumes extensions are hidden

  string expected = "BgṍRồ";
  Test(fileName, SHGFI_DISPLAYNAME, expected);
  Test(fileName, SHGFI_DISPLAYNAME | SHGFI_USEFILEATTRIBUTES, expected);
 }

 static void Test(string fileName, uint flags, string expected)
 {
  var actual = GetNameViaSHGFI(fileName, flags);
  Console.WriteLine("{0} == {1} ? {2}", actual, expected, actual == expected);
 }

 static string GetNameViaSHGFI(string fileName, uint flags)
 {
  SHFILEINFO sfi = new SHFILEINFO();
  if (SHGetFileInfo(fileName, 0, ref sfi, Marshal.SizeOf(sfi),
                    flags) != IntPtr.Zero) {
   return sfi.szDisplayName;
  } else {
   return null;
  }
 }

 [StructLayout(LayoutKind.Sequential)]
 struct SHFILEINFO {
  public IntPtr hIcon;
  public int iIcon;
  public uint dwAttributes;
  [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
  public string szDisplayName;
  [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 80)]
  public string szTypeName;
 }

 const uint SHGFI_USEFILEATTRIBUTES = 0x10;
 const uint SHGFI_DISPLAYNAME = 0x0200;

 [DllImport("shell32.dll")]
 static extern IntPtr SHGetFileInfo(
    string path, uint fileAttributes, ref SHFILEINFO info, int cbSize,
    uint flags);
}
// Output:
// File exists? True
//  == Bg?R? ? False
// Bg?R? == Bg?R? ? False

If we ask for the display name, the function fails even though the file does exist. If we also pass the SHGFI_USEFILEATTRIBUTES flag to force the system to act as if the file existed, then it returns the file name but with question marks where the non-ASCII characters should be.

The SHGetFileInfo function supports non-ASCII characters just fine, provided you call the version that supports non-ASCII characters!

The customer here fell into the trap of not keeping their eye on the code page. It goes back to an unfortunate choice of defaults in the System.Runtime.InteropServices namespace: At the time the CLR was originally being developed, Windows operating systems derived from Windows 95 were still in common use, so the CLR folks decided to default to CharSet.Ansi. This made sense back in the day, since it meant that your program ran the same on Windows 98 as it did on Windows NT. With the passage of time, the Windows 95 series of operating systems became obsolete, so the need to be compatible with it gradually disappeared. But too late. The rules were already set, and the default of CharSet.Ansi could not be changed.
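To get a feel for where the question marks come from, here is a minimal sketch of a lossy round trip through a non-Unicode encoding. It is an illustration under assumptions, not what the marshaler literally does: Encoding.ASCII stands in for the system ANSI code page (which varies from machine to machine), and CodePageDemo and RoundTripThroughAscii are hypothetical names invented for this example.

```csharp
using System;
using System.Text;

static class CodePageDemo
{
    // Hypothetical helper: converts a string to a narrow encoding and back.
    // ASCII is only a stand-in for the ANSI code page; any character the
    // encoding cannot represent comes back as '?'.
    public static string RoundTripThroughAscii(string s)
    {
        byte[] narrow = Encoding.ASCII.GetBytes(s);
        return Encoding.ASCII.GetString(narrow);
    }

    static void Main()
    {
        // Same file name as in the article, written with escapes so the
        // source file's own encoding does not matter.
        string original = "Bg\u1E4DR\u1ED3.txt";
        string roundTripped = RoundTripThroughAscii(original);

        Console.WriteLine(roundTripped == original);    // False
        Console.WriteLine(roundTripped == "Bg?R?.txt"); // True
    }
}
```

Once the characters have been replaced on the way into the function, no amount of care on the other side can recover them, which is consistent with the file lookup failing and the display name coming back damaged.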

The solution is to specify CharSet.Unicode explicitly in the StructLayout and DllImport attributes.
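The structure needs the attribute as well as the function because the character set changes the unmanaged layout of the embedded string buffers. The following sketch compares the marshaled sizes of two cut-down structures that contain only the two string fields. The names LayoutDemo, NamesAnsi, and NamesUnicode are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Runtime.InteropServices;

static class LayoutDemo
{
    // Hypothetical cut-down structures: just the two inline string
    // buffers, declared once per character set.
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Ansi)]
    struct NamesAnsi
    {
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string szDisplayName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 80)]
        public string szTypeName;
    }

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    struct NamesUnicode
    {
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string szDisplayName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 80)]
        public string szTypeName;
    }

    public static int AnsiSize() => Marshal.SizeOf<NamesAnsi>();
    public static int UnicodeSize() => Marshal.SizeOf<NamesUnicode>();

    static void Main()
    {
        // The Unicode layout reserves two bytes per character, so its
        // marshaled size is larger than the ANSI layout's.
        Console.WriteLine("Ansi:    {0} bytes", AnsiSize());
        Console.WriteLine("Unicode: {0} bytes", UnicodeSize());
    }
}
```

If only the DllImport were switched to Unicode, the function would be handed a buffer laid out for the narrow structure, so the two attributes should be changed together.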

FxCop catches this error, flagging it as SpecifyMarshalingForPInvokeStringArguments. The error explanation talks about the security risks of unmapped characters, which is all well and good, but it is looking too much at the specific issue and not so much at the big picture. As a result, people may ignore the issue because it is flagged as a complicated security issue, and they will think, "Eh, this is just my unit test, I'm not concerned about security here." However, the big picture is this:

This is almost certainly an oversight on your part. You didn't really mean to disable Unicode support here.

Change the lines

 [StructLayout(LayoutKind.Sequential)]
 [DllImport("shell32.dll")]

to

 [StructLayout(LayoutKind.Sequential, CharSet=CharSet.Unicode)]
 [DllImport("shell32.dll", CharSet=CharSet.Unicode)]

and re-run the program. This time, it prints

File exists? True
Bg?R? == Bg?R? ? True
Bg?R? == Bg?R? ? True

Note that you have to do the string comparison in the program because the console itself has a troubled history with Unicode. At this point, I will simply cue a Michael Kaplan rant and link to an article explaining how to ask nicely.

Comments (19)
  1. acq says:

    Thanks for giving the new links to the texts of Michael Kaplan! Maybe you can fix the "blogroll" section too.

  2. Grzechooo says:

    > 2014

    > codepages

    > ANSI

    Is this serious

  3. Craig says:

    The default for these two attributes is extremely unfortunate. In support, I see issues resulting from this maybe once or twice a year.

    While the defaults cannot be changed, I can think of at least two improvements:

    1) Depreciate the constructor that does not take a CharSet parameter. Of course you can't remove it, but at least it will produce a warning. Some people will still ignore it, but at least some will address it.

    2) Depreciate both StructLayout and DllImport, and add a StructLayoutEx and DllImportEx that default to CharSet.Unicode. Or just leave out the default and force the developer to choose appropriately.

    Also, thanks for the Michael Kaplan link. I had been wondering what happened to him and his blog for months.

  4. Chris Warrick says:

    @Grzechooo: yes.  Windows is stuck with all that, because they just HAVE to be backwards compatible with every single app developed in the last 19 (or so) years.  For some crazy reason, they can’t just say “okay, it should not be used by any good program, we’ll get rid of it”.  Not that there aren't programs that are incompatible with the newer versions; there are tons of them.

  5. Wear says:

    @Craig Deprecating things becomes a documentation nightmare. Java's done this a fair bit and HTML did it heavily with the move to CSS for styling. You have a bunch of code that works fine and then you upgrade to the newest version (Or DOCTYPE for HTML) and suddenly everything is underlined. There are tooltips everywhere telling you how it's really bad for you to do what you've been doing for years. Now you need to go hunting to figure out why you can't do what you've been doing and what you are supposed to be doing. You only upgraded because you wanted to use some new feature, now you have to go and fix a ton of errors.

  6. Dave Bacher says:

    [Obsolete("Use DllImportAnsi or DllImportUnicode instead.")]
    public sealed class DllImportAttribute : DllImportBaseAttribute
    { }

    public abstract class DllImportBaseAttribute : Attribute
    { }

    [Obsolete("Use the Unicode one or else you'll regret it.  Maybe not today, maybe not tomorrow, but soon.  And for the rest of your product's life.")]
    public class DllImportAnsiAttribute : DllImportBaseAttribute
    { }

    public class DllImportUnicodeAttribute : DllImportBaseAttribute
    { }

    Reference the changes made to System.Web between 3.5 and 4.0 — this doesn't generally break binary backwards compatibility, and also doesn't break source backwards compatibility.

    Does potentially break poorly written reflection code, but generally speaking — the number of scenarios that would care about that ought to be pretty close to 0. :)

  7. Eric Wilson says:

    The default is especially insane when you consider they have a 'CharSet.Auto' value available which would have automatically chosen the correct version based on the OS.  How THAT option was not the default is completely beyond me.

  8. voo says:

    Oh gosh I completely missed that Michael got a new blog! Wuhu, great!

    @Wear: If the reason why it's deprecated is due to security problems (hello thread.stop and co) or inherent brokenness (hello Java's calendar APIs) it's better to warn developers than to ignore it. If they don't like the warnings they can always ignore them after all.

  9. Azarien says:

    It's never too late for Microsoft to implement DllImportEx or DllImportW with Unicode as default.

    And you don't really need to deprecate anything…

  10. JM says:

    I always import the W version of the function with [DllImport(ExactSpelling=true)] for exactly this reason. Accidentally calling SHGetFileInfoW with the wrong character set will get you glaringly obvious problems. Since managed strings are Unicode to begin with, the ability to call ANSI versions of functions has no added value now that Windows 95 and relations have gone the way of the dodo.

  11. John Ludlow says:

    Since apps compiled for one version of the CLR aren't compatible with different versions of the CLR* surely they could have changed this when introducing .NET 2.0, or 4.0, since those introduced new CLRs.

    Yes, ok, you can put some values in your config file that allow you to be compatible with both .NET 3.5 (v2.0 CLR) and .NET 4.x (v4.0 CLR) but that's not guaranteed to work.

  12. sevenacids says:

    @Eric Wilson: I just asked myself the same thing.

    Now I'm sure I'm missing something important here, but what would be the issues of changing the default to 'Auto' or 'Unicode' right now? I mean, since the current versions of the CLR/.NET Framework only run on NT-based operating systems (which have been Unicode ever since) one could simply forget about the ANSI versions of the API, couldn't you?

  13. Deduplicator says:

    Might part of the reason it wasn't changed be that there is another implementation (Mono) and it is standardised?

    Or wasn't that part standardised as well?

  14. Joshua says:

    > Yes, ok, you can put some values in your config file that allow you to be compatible with both .NET 3.5 (v2.0 CLR) and .NET 4.x (v4.0 CLR) but that's not guaranteed to work.

    It's not even that hard to deal with. DLLs know what version of the CLR they target.

  15. jonwil says:

    Regarding depreciation, Visual C++ added stuff to give warnings/errors/other things for a whole bunch of "insecure" stuff (among other things) that you can turn off with a global compiler flag or #define or something if you need to. No reason the CLR couldn't do the same (depreciate the use of the dllimport stuff that doesn't specify a char-set and if you want to keep using it instead of fixing your program you flip the switch to shut off the warnings)

  16. cheong00 says:

    Since the CharSet enum has the helpful CharSet.Auto, which automatically chooses Ansi for Win9X/ME and Unicode for others, I wonder why it's not chosen as the default value.

    Note that CharSet.Auto is the default chosen for the CLR itself; it's just that C# and VB.NET override it with CharSet.Ansi, and C++ overrides it with CharSet.None by default.

  17. Scarlet Manuka says:

    Pet peeve: @Craig and @jonwil, please learn the difference between 'deprecate' and 'depreciate'.

    (On my initial read through the thread it seemed like most people had gotten it wrong. But when I went back and counted, three had it correctly and only two incorrectly. A win for humanity?)

  18. Neil says:

    Bah, he's switched to blogging software which thinks that a) it's a good idea to detect page zoom changes b) it's a good idea to do this by polling c) it's a good idea to poll every second d) it's a good idea to use code that takes more than a second to execute.

  19. Wouter Ballet says:

    There's an even easier solution for this problem. Just put the following attribute in your AssemblyInfo.cs file:

    [module: DefaultCharSet(CharSet.Unicode)]

    This will change the default CharSet to Unicode for all DllImport attributes.
