Further adventures in trying to guess what encoding a file is in


The IsTextUnicode function tries to guess the encoding of a block of memory purporting to contain text, but it can only say "Looks like Unicode" or "Doesn't look like Unicode", and there are some notorious examples of where it guesses wrong.
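
As a quick refresher, here's a minimal sketch of how IsTextUnicode is typically called. (The buffer contents are made up for illustration.)

#include <windows.h>
#include <stdio.h>

int __cdecl wmain(int, wchar_t **)
{
 // A made-up buffer: "Hello" encoded as UTF-16LE.
 static const BYTE buffer[] = { 'H', 0, 'e', 0, 'l', 0, 'l', 0, 'o', 0 };

 // On input, the flags say which tests to run; on output,
 // they report which tests passed.
 INT tests = IS_TEXT_UNICODE_UNICODE_MASK;
 BOOL looksUnicode = IsTextUnicode(buffer, sizeof(buffer), &tests);

 wprintf(L"Looks like Unicode? %d (tests passed: 0x%04x)\n",
         looksUnicode, tests);
 return 0;
}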

A more flexible alternative is IMultiLanguage2::DetectCodepageInIStream and its buffer-based equivalent IMultiLanguage2::DetectInputCodepage. Not only can these methods detect a much larger range of code pages, but they can also report multiple code pages, each with a corresponding confidence level.
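
The buffer-based variant looks roughly like this. (A minimal sketch: the helper name is hypothetical, and error checking is omitted.)

#include <mlang.h>

HRESULT DetectBufferCodepage(IMultiLanguage2 *pml,
                             char *pBytes, INT cbBytes,
                             DetectEncodingInfo *pInfo, INT *pcInfo)
{
 // On input, cbBytes is the size of the buffer; on return, it holds
 // the number of bytes actually examined. *pcInfo starts as the
 // capacity of pInfo and ends as the number of entries filled in.
 return pml->DetectInputCodepage(MLDETECTCP_NONE, 0,
                                 pBytes, &cbBytes, pInfo, pcInfo);
}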

Here's a Little Program that takes the function out for a spin. (Remember, Little Programs do little to no error checking.)

#define UNICODE
#define _UNICODE
#include <windows.h>
#include <shlwapi.h>
#include <ole2.h>
#include <mlang.h>
#include <atlbase.h>
#include <stdio.h>

bool IsHtmlFile(PCWSTR pszFile)
{
 PCWSTR pszExtension = PathFindExtensionW(pszFile);
 return
  CompareStringOrdinal(pszExtension, -1,
                       L".htm", -1, TRUE) == CSTR_EQUAL ||
  CompareStringOrdinal(pszExtension, -1,
                       L".html", -1, TRUE) == CSTR_EQUAL;
}

int __cdecl wmain(int argc, wchar_t **argv)
{
 if (argc < 2) return 0;
 CCoInitialize init; // RAII wrapper around CoInitialize/CoUninitialize
 CComPtr<IStream> spstm;
 SHCreateStreamOnFileEx(argv[1], STGM_READ, 0, FALSE, nullptr, &spstm);

 CComPtr<IMultiLanguage2> spml;
 CoCreateInstance(CLSID_CMultiLanguage, NULL,
     CLSCTX_ALL, IID_PPV_ARGS(&spml));

 DetectEncodingInfo info[10];
 INT cInfo = ARRAYSIZE(info);

 DWORD dwFlag = IsHtmlFile(argv[1]) ? MLDETECTCP_HTML
                                    : MLDETECTCP_NONE;
 HRESULT hr = spml->DetectCodepageInIStream(
     dwFlag, 0, spstm, info, &cInfo);
 if (hr == S_OK) {
  for (int i = 0; i < cInfo; i++) {
   wprintf(L"info[%d].nLangID = %d\n", i, info[i].nLangID);
   wprintf(L"info[%d].nCodePage = %d\n", i, info[i].nCodePage);
   wprintf(L"info[%d].nDocPercent = %d\n", i, info[i].nDocPercent);
   wprintf(L"info[%d].nConfidence = %d\n", i, info[i].nConfidence);
  }
 } else {
  wprintf(L"Cannot determine the encoding (error: 0x%08x)\n", hr);
 }
 return 0;
}

Run the program with a file name as the command line argument, and the program will report all the detected code pages.

One thing that may not be obvious is that the program passes the MLDETECTCP_HTML flag if the file extension is .htm or .html. That is a hint to the detector that it shouldn't get faked out by text like <body> and think it found an English word.

Here's the output of the program when run on its own source code:

info[0].nLangID = 9
info[0].nCodePage = 20127
info[0].nDocPercent = 100
info[0].nConfidence = 83
info[1].nLangID = -1
info[1].nCodePage = 65001
info[1].nDocPercent = -1
info[1].nConfidence = -1

This says that its first guess is that the text is in language 9, which is LANG_ENGLISH, and code page 20127, which is US-ASCII. That text occupies 100% of the file, and the confidence level is 83.

The second guess is that the text is in code page 65001, which is UTF-8, but the confidence level for that is low.

The language-guessing part of the function is not very sophisticated. For a higher-quality algorithm for guessing what language some text is in, use Extended Linguistic Services. I won't bother writing a full sample application because MSDN already contains one, but the rough shape of the call sequence is sketched below.
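
Here is that rough sketch. (This is not the MSDN sample: the input string is made up, and error checking is omitted in the spirit of Little Programs. Link against elscore.lib.)

#include <windows.h>
#include <elscore.h>
#include <elssrvc.h>
#include <stdio.h>

int __cdecl wmain(int, wchar_t **)
{
 // Ask ELS for its language detection service.
 MAPPING_ENUM_OPTIONS enumOptions = { sizeof(enumOptions) };
 enumOptions.pGuid = const_cast<GUID *>(&ELS_GUID_LANGUAGE_DETECTION);
 PMAPPING_SERVICE_INFO pService = nullptr;
 DWORD dwServices = 0;
 MappingGetServices(&enumOptions, &pService, &dwServices);

 // Hand the service some text (a made-up sample).
 MAPPING_PROPERTY_BAG bag = { sizeof(bag) };
 PCWSTR pszText = L"Guten Tag, wie geht es Ihnen?";
 MappingRecognizeText(pService, pszText, lstrlenW(pszText) + 1,
                      0, nullptr, &bag);

 // The result is a double-null-terminated list of language names,
 // best guess first.
 for (PCWSTR psz = static_cast<PCWSTR>(bag.prgResultRanges[0].pData);
      *psz; psz += lstrlenW(psz) + 1) {
  wprintf(L"%ls\n", psz);
 }

 MappingFreePropertyBag(&bag);
 MappingFreeServices(pService);
 return 0;
}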

Comments (8)
  1. Malcolm says:

    The HREF for "some notorious examples" is missing its = sign

  2. Deduplicator says:

    Why should the confidence for UTF-8 ever be lower than for ASCII?

    Remember that one of the design-criteria for UTF-8 was that all ASCII data is also valid UTF-8 data, without changing a single bit.

  3. Deduplicator: If I had to guess, I'd say it's because the text doesn't contain any characters outside the US-ASCII range. As Raymond notes, "that text occupies 100% of the file". If there were characters outside the range of 7-bit ASCII, I would assume the confidence would be higher.

    [If a file consists entirely of 7-bit ASCII, you still don't have perfect confidence that it is UTF-8. Perfect confidence would mean that you could change 41 to C3 85 to convert a capital A to a capital Å. -Raymond]
  4. Wear says:

    @Deduplicator It's likely designed to go for the simplest case first. It does indicate that it could be UTF-8 but without any characters that must be UTF-8 there's no way for it to be sure. So it goes for the simpler case of US-ASCII and includes UTF-8 as a possibility.

  5. Dave Bacher says:

    I'm actually surprised it only reports those two encodings -- given that it's program code, there's a lot of non-text, so I'd think it'd match a lot of code pages.

  6. Cube 8 says:

    @Dave Bacher

    I suppose that's why the confidence about the first one is 83%.

  7. @Dave Bacher: 7-bit ASCII is the standard that most other OEM codepages extended from, so it makes sense to only report the parent codepage instead of spamming the results with all the theoretical possibilities.

  8. All this reminds me of how I hate Plain Text (.txt) format. I wish there was a .textx format; something that started with a codepage 4CC before the plain text itself.

Comments are closed.
