What('s) a character!


Norman Diamond seems to have made a side career of harping on this topic on a fairly regular basis, although he never comes out and says that this is what he's complaining about. He just assumes everybody knows. (This usually leads to confusion, as you can see from the follow-ups.)

Back in the ANSI days, terminology was simpler. Windows operated on CHARs, which are one byte in size. Buffer sizes were documented as specified in bytes, even for textual information. For example, here's a snippet from the 16-bit documentation for the GetWindowTextLength function:

The return value specifies the text length, in bytes, not including any null terminating character, if the function is successful. Otherwise, it is zero.

The use of the term byte throughout permitted the term character to be used for other purposes, and in 16-bit Windows, the term was repurposed to represent "one or more bytes which together represent one (what I will call) linguistic character." For single-byte character sets, a linguistic character was the same as a byte, but for multi-byte character sets, a linguistic character could be one or two bytes.

Documentation for functions that operated on linguistic characters said characters, documentation for functions that operated on CHARs said bytes, and everybody knew what the story was. (Mind you, even in this nostalgic era, documentation would occasionally mess up and say character when it really meant byte, but the convention was adhered to with some degree of consistency.)
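To make the one-or-two-byte behavior concrete, here is a minimal sketch (not Windows code) of counting a DBCS string both ways. The lead-byte ranges are hard-coded Shift-JIS-style values purely for illustration; real code would ask the system (for example via IsDBCSLeadByteEx) rather than hard-coding anything.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lead-byte test for a Shift-JIS-like DBCS. Illustration only:
   production code asks Windows which bytes are lead bytes. */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Length in bytes: what the 16-bit documentation called the text length. */
static size_t length_in_bytes(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}

/* Length in linguistic characters: a lead byte plus its trail byte
   together count as ONE character. */
static size_t length_in_characters(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        s += (is_lead_byte((unsigned char)*s) && s[1] != '\0') ? 2 : 1;
        n++;
    }
    return n;
}
```

For a single-byte string the two counts agree; for a string containing a two-byte character they diverge, which is exactly the ambiguity the word "character" papered over.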

With the introduction of Unicode, things got ugly.

All documentation that previously used byte to describe the size of textual data had to be changed to read "the size of the buffer in bytes if calling the ANSI version of the function or in WCHARs if calling the Unicode version of the function." A few years ago the Platform SDK team accepted my suggestion to adopt the less cumbersome "the size of the buffer in TCHARs." Newer documentation from the core topics of the Platform SDK tends to use this alternate formulation.

Unfortunately, most documentation writers (and 99% of software developers, who provide the raw materials for the documentation writers) aren't familiar with the definition of character that was set down back in 1983, and they tend to use the term to mean storage character, which is a term I invented just now to mean "a unit of storage sufficient to hold a single TCHAR." (The Platform SDK uses what I consider to be the fantastically awkward term normal character widths.) For example, the lstrlen function returns the length of the string in storage characters, not linguistic characters.

And any function that accepts a sized output buffer obviously specifies the size in storage characters, because the alternative is nonsense: How could you pass a buffer and say "Please fill this buffer with data. Its size is five linguistic characters"? You don't know what is going into the buffer, and a linguistic character is variable-sized, so how can you say how many linguistic characters will fit?

Michael Kaplan enjoys making rather outrageous strings which result in equally outrageous sort keys. I remember one entry a while ago where he piled over a dozen accent marks atop a single "a". That "a" plus the combining diacritics together form one giant linguistic character. (There is a less extreme example here, wherein he uses an "e" plus two combining diacritics to form one linguistic character.) If you wanted your buffer to really be able to hold five of these extreme linguistic characters, you certainly would need it to be bigger than WCHAR buffer[5].
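A toy illustration of the gap between the two counts, assuming a deliberately simplified rule that treats only the Combining Diacritical Marks block (U+0300 through U+036F) as combining; real classification requires the Unicode character database, so this is a sketch, not a usable implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t WCHAR16; /* one UTF-16 code unit, i.e. one storage character */

/* Simplified: only U+0300..U+036F count as combining marks here. */
static int is_combining(WCHAR16 c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* What lstrlen reports: storage characters. */
static size_t storage_characters(const WCHAR16 *s)
{
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}

/* Linguistic characters: combining marks attach to the preceding base
   character, so only base characters are counted. */
static size_t linguistic_characters(const WCHAR16 *s)
{
    size_t n = 0;
    for (size_t i = 0; s[i] != 0; i++)
        if (!is_combining(s[i]))
            n++;
    return n;
}
```

An "e" followed by two combining diacritics is three storage characters but a single linguistic character, which is why a buffer sized in linguistic characters is meaningless.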

As a result, my recommendation to you, dear reader, is to enter every page of documentation with a bias towards storage character whenever you see the word character. Only if the function operates on the textual data linguistically should you even consider the possibility that the author actually meant linguistic character. The only functions I can think of off-hand that operate on linguistic characters are CharNext and CharPrev, and even then they don't quite get it right, although they at least try.

Comments (53)
  1. Nathan says:

    CharNext? We all know real men and women use ++ ;)

    Anyway, it’s all just ones and zeros, you just want to get them in the right order.

  2. Dave says:

    As an "ugly American programmer" with a few decades of experience, I’ve grown used to handling characters with ASCII. (If it was good enough to represent every character in the US Constitution, it’s good enough for me.) The whole mess with character sets and their type representations, especially in C++, drives me crazy. I’m not blaming anyone, and I know it’s needed. I’m just expressing complete frustration with the complicated mess that exists and annoyance at the need to become an expert in international glyphs as well as programming.

    Let’s go back to the good old days and party like it’s 1999. What was the name of the guy who sang that? Oh yeah, I can’t type it in ASCII.

    http://en.wikipedia.org/wiki/Image:Prince_symbol.svg

  3. oidon says:

    @Dave

    The US Constitution, in its original form, cannot be fully expressed in ASCII. Neither can the Bill of Rights. Here is an example of U+017F (long s) in the Bill of Rights:

    http://en.wikipedia.org/wiki/Long_s

    You may want to move that party to the 1980s. By 1999 much of the world was long down the Unicode road.

  4. David Walker says:

    Um, Raymond, I think you left out a word there.

    You say "the term was repurposed to represent "one or bytes""…..

    One or two maybe?  One or more?

  5. Mihai says:

    There is no need to invent a new term (storage character). There is already something defined by Unicode, named "code unit" (http://unicode.org/glossary/#code_unit).

    We might understand each other better if we would use the same (standard) terminology.

  6. just me says:

    Interestingly, though, in kernel mode the count of a UNICODE_STRING is not in (storage) characters, but in bytes. Thus, you have to read the WDK and the SDK differently.

    I am in kernel mode more often than in user space. I find the convention in kernel better: Every size, whatever it is, is measured in the smallest possible unit (sizeof(char)==1), so you do not have to remember if this or that function uses this or that convention. Easy, isn’t it?
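The kernel convention described above can be sketched like this; the structure is a simplified stand-in for the WDK's UNICODE_STRING (the field names follow the real structure, but this is an illustration, not the actual header):

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t WCHAR;

/* Simplified sketch of the kernel-mode string descriptor: both counts
   are in BYTES, unlike the storage-character counts used in user mode. */
typedef struct {
    uint16_t Length;        /* bytes in use, NOT including a terminator */
    uint16_t MaximumLength; /* bytes allocated for Buffer */
    WCHAR   *Buffer;
} UNICODE_STRING_SKETCH;

/* Converting to the user-mode convention is a division by sizeof(WCHAR). */
static unsigned storage_char_count(const UNICODE_STRING_SKETCH *u)
{
    return (unsigned)(u->Length / sizeof(WCHAR));
}
```

The uniform everything-in-bytes rule means no per-function memorization, at the cost of dividing by sizeof(WCHAR) whenever a character count is wanted.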

  7. Vijay says:

    English we knew was a funny language. But English and Computers together? I guess we’re set for a riot :)

    We definitely need top notch technical writers to rise to the occasion. What a character, indeed!

  8. Ed says:

    repurposed?  There is already a perfectly cromulent word known as "redefined"

    Repurposed sounds to me like "the chair was repurposed to block the door" or "the book was repurposed to level the table."

    /nerd

  9. Demotivator says:

    "Norman Diamond seems to have made a side career of harping on this topic on a fairly regular basis, although he never comes out and says that this is what he’s complaining about. He just assumes everybody knows. (This usually leads to confusion, as you can see from the follow-ups.)"

    http://www.encyclopediadramatica.com/index.php/Image:Inspirational_poster_-_BAN.jpg

  10. Arno says:

    I think there are basically two ways a "storage unit" and a character are not the same, one being diacritics, the other character sets with variable-length encoding. In MBCS, I suppose both may happen. But in Unicode, does Windows support characters outside the Basic Multilingual Plane (which are encoded in >2 bytes)?

  11. Ben Bryant says:

    fantastically awkward term normal character widths

    Why? That is the term that is most immediately understood. To me "storage character" comes in second, and "code unit" (why does MS never look to the Unicode standard before choosing terminology?) comes in third. To make things more confusing, the msdn article you referenced actually also has a typo:

    "count of characters". However, this term is strictly correct because…

    should be:

    NOT strictly correct

  12. Abhi says:

    Having worked with a Unicode library, I know how UTF-8 and other encoding standards can make the code look like an intimidating thingy. UTF-16 is so pretty :-). Well, it's better to think in terms of bytes rather than linguistic characters. When in confusion use (char*), but well, I ain't that talented, just a novice programmer, and I like to keep things simple :-).

  13. Dean says:

    Arno: Windows has had support for surrogate characters since Windows 2000, though it depends on your definition of "support" – NT could have displayed them with proper fonts, but that’s about all.
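For the record, a character outside the Basic Multilingual Plane occupies two UTF-16 storage characters, which is another way one "character" can span multiple TCHARs. A sketch of the standard surrogate-pair encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Encode a Unicode code point as UTF-16; returns the number of 16-bit
   code units written: 1 for the BMP, 2 (a surrogate pair) otherwise. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {              /* BMP: a single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                   /* supplementary plane */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

So even a fully Unicode-aware storage-character count can differ from the code-point count, before combining marks ever enter the picture.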

  14. Norman Diamond says:

    > Norman Diamond seems to have made a side career of harping on this topic

    Surely you know the reason.  Since I’ve had to do Windows programming for several years now, I’ve been forced to have a side career as well:  After reading MSDN pages I have to experiment to see which uses of “character” mean character, which uses mean byte, and which uses mean TCHAR.  It got tiring the first month, and yet it never stops.

    > He just assumes everybody knows.

    That’s because it’s perfectly obvious.  In order to see _why_ it’s obvious (or _how_ or something like that?), take a look at the answer comment in the comment at:

    http://blogs.msdn.com/oldnewthing/archive/2007/01/04/1411080.aspx#1414348

    Microsoft still has employees asserting that in an ANSI string every character fits in a single byte.  One even posted that assertion in a followup to my posting about breakage in StringCchPrintf or something similar.

    The fact that an ANSI character might require one or two bytes is not only at least as obvious as the “fact” that every ANSI character fits in a single byte, it also happens to be more true.

    Well, now I will give some good news, but please notice they still show why fixes are important.

    One of the exceptional cases where Microsoft decided to fix a bug I reported during Visual Studio 2005 betas was where the runtime was corrupting customers’ data files because it didn’t properly figure out how long a string was.  I am glad it was decided to be fixed.  (I haven’t checked if it was actually fixed, because:  a colleague’s product had to be shipped while VS2005 was still in beta, so the product was coded using VB6, and we already had workarounds for miscalculations or misdocumentation in VB6 runtime.)

    I think that MSDN pages for .Net Framework 1.1 were written around year 2002, yet they still included false statements about the meaning of TCHARs.  Taking a look now, I think they’ve been fixed.  I’m glad they were fixed.

    [It is true that every (storage) character in an ANSI string occupies one byte. The count of characters in StringCchPrintf is clearly storage characters not linguistic characters, and that is the context in which the term “character” is being used. This is precisely an example of your taking a hyperliteral interpretation instead of adjusting with context. -Raymond]
  15. For the record, there exists a term for ‘linguistic character’. The term is ‘grapheme’, and it is used by the linguistic community as well as by the Unicode project.

    You could object, of course, that people are less familiar with "grapheme" than "linguistic character" and get scared and run away.

  16. Personally I use the term "glyph" when referring to "a character on the screen".  A glyph may be composed of multiple characters (in both ANSI and Unicode).

    Grapheme works too.

  17. Mihai says:

    <<Personally I use the term "glyph" when referring to "a character on the screen".  A glyph may be composed of multiple characters (in both ANSI and Unicode).

    Grapheme works too.>>

    According the Unicode definitions "glyph" and "grapheme" are different beasts. The grapheme is "what a user thinks of as a character" and the relation between the two is many-to-many.

    • the "fi" ligature (U+FB01) in a font is one glyph, represents 2 graphemes, but one code unit
    • the fi ligature in notepad, created by typing ‘f’ and ‘i’, is still one glyph, but 2 graphemes and 2 code units; the same value in wordpad, which does not know about ligatures, is 2 glyphs (but still 2 graphemes and 2 code units)

    • a + combining acute (<0061 0301>) are two glyphs, form one grapheme, and has 2 code units

    • Arabic shaping has several glyphs for the same "linguistic character" and which one is used depends on the context

    Now, there is also a "grapheme cluster" :-)

    I fully agree that the Unicode glossary is not very clear, but it is clear that glyph and grapheme are not the same thing.

    It is a full mess there, between glyph/glyph code/glyph identifier/glyph image/grapheme/grapheme cluster/graphic character, but I am not sure creating our own definitions is the way to fix it.

  18. Mihai says:

    @Norman

    <<.Net Framework 1.1 were written around year 2002, yet they still included false statements about the meaning of TCHARs>>

    I am quite sure the .NET Framework (no matter version) does not deal with TCHARs

  19. James says:

    Norman, I tend to be a stickler for precise meanings myself, but as Raymond points out, to interpret ‘characters’ in that context as having any meaning other than the number of TCHARs which fit in that buffer is nonsensical. If you found yourself paying for rope by the foot, would you take your shoes off to measure with *your* feet, or use the same units everyone else would? How could you possibly be expected to measure the size of your buffer in variable-size units?!

    I do often find infuriating omissions in the documentation, but I don’t think I’ve ever hit the problem you describe; from the posts I’ve seen here, they seem to be specific to you. Unless you’re doing something involving fonts (or otherwise actually drawing a string somewhere), why would ‘characters’ (or glyphs, or whatever else you like to call them) matter to anyone?

    I’m curious about how a bug in Visual Studio corrupted your customers’ files, though.

  20. Thriol says:

    The character by Michael Kaplan looks as described on Office 2007, but it gives some strange effects. See picture here: http://thomasolsson.spaces.live.com/blog/cns!1EB93731488C4EA3!302.entry

  21. KJK::Hyperion says:

    "just me", that’s because UNICODE_STRING, ANSI_STRING, OEM_STRING and STRING must be freely convertible among one another through a bitwise copy. The idea is that low-level components are supposed NOT to worry with linguistics and just pass strings around as immutable binary buffers. String equivalence is very bare, and not 100% correct linguistically (not to mention subtly inconsistent with the CRT – CRT case insensitivity lowercases, RTL uppercases, which can cause issues with Hungarian filenames)

    Arno: it does, since Windows 2000. As linguistic support for Windows got better, UTF32 support got better with it (in Windows 2000 being basically limited to support in text rendering functions)

    Norman Diamond: shut up. The use is consistent throughout, I can only remember ONE function not acting as documented, and it was an obscure PSAPI routine. Everyone gets it. You are the problem. You don’t encounter (and need not worry with) graphemes well into raw Uniscribe, and I somehow doubt you are reimplementing a rich text control

  22. Mihai says:

    I would really like to see how can you convert between UNICODE_STRING, ANSI_STRING, OEM_STRING through a bitwise copy :-D

    And Windows has only primitive UTF32 support (basically converting to/from it to UTF16, where the real work is done). UTF16 support improved, true.

    And Norman Diamond said nothing about graphemes, I did (nothing to do with the article, but with the terminology in LarryOsterman’s post).

    So, you got 0 points out of 3 :-)

  23. Norman Diamond says:

    > It is true that every (storage) character in an ANSI string occupies one byte.

    In that kind of sentence you need to delete the parentheses from around the word "storage", and you should give some thought to skipping it entirely and just using the word TCHAR (which you and SOME of your colleagues often use correctly).

    > The count of characters in StringCchPrintf is clearly storage characters not linguistic characters,

    Clear to you.  SOMETIMES clear to me, but sometimes not, because:  SOME MSDN pages really count linguistic characters the way they say instead of counting TCHARs the way that some programmers learn to interpret it.  Clear to SOME of your colleagues.  However, SOME of your colleagues still end up thinking that storage characters are linguistic characters, they still end up posting falsities in newsgroups or e-mail, and they still end up writing defective code which we victims have to work around.

    > This is precisely an example of your taking a hyperliteral interpretation instead of adjusting with context.

    Compare that to the results when some of your colleagues guess wrong about interpretations or maybe didn’t get the training they need to do the interpretations.

    It sounds like you’re agreeing with a recent posting by Larry Osterman saying that part of the contract between caller and callee is implicit (the callee’s code determines what the contract is) instead of explicit (the documentation).  My answer is that publication of the contract is overdue.

    Sunday, January 07, 2007 1:46 PM by James

    > I’m curious about how a bug in Visual Studio corrupted your customers’ files, though.

    I think it was a bug in the .Net Framework version 2-beta-something runtime rather than in Visual Studio 2005 beta-something itself.  It WOULD have corrupted customers’ files if we hadn’t discovered it and if we hadn’t decided to stick with VB6 where we knew workarounds (as already stated).  Where a library call was supposed to write a record of some length in bytes, it wrote more bytes than it was supposed to, corrupting the adjacent record which wasn’t supposed to be touched.

    Sunday, January 07, 2007 5:25 PM by Mihai

    > I am quite sure the .NET Framework (no matter version) does not deal with TCHARs

    http://msdn2.microsoft.com/en-us/library/system.runtime.interopservices.unmanagedtype.aspx

    I think that page is one which used to say that ByValTStr counted characters not bytes.  The fact was that by default (in Visual Studio 2003 the default was ANSI) it counted bytes not characters.  As far as I can tell that page is fine now.

    Monday, January 08, 2007 5:41 PM by KJK::Hyperion

    > that’s because UNICODE_STRING, ANSI_STRING, OEM_STRING and STRING must be freely convertible among one another through a bitwise copy

    Converting between UNICODE_STRING and ANSI_STRING by doing bitwise copies instead of conversion tables?  Sounds like there are even more broken APIs than I knew about.

    Regarding James’ tangent:

    > If you found yourself paying for rope by the foot, would you take your shoes off to measure with *your* feet, or use the same units everyone else would?

    Excellent example, thank you.  Prior to adoption of the metric system units like "foot" varied by country.  Besides using the same units everyone else would, you also had to figure out which everyone elses were today’s everyone elses.

  24. James says:

    Norman, it only appears to you as a ‘tangent’ because you missed the point. Asked for a distance in feet, I consider it blatantly obvious that someone is meaning 0.3048 metres, because it’s the only rational interpretation available. The analogy with StringCchPrintf should now be obvious.

    Similarly, can you give an example where any "confusion" could actually exist? StringCchPrintf clearly isn’t one: there is only one possibility, as I explained (measuring in ‘linguistic characters’ simply cannot work). Was your .Net problem to do with expecting an I/O function call to be trying to count the latter when it wasn’t?

  25. stegus says:

    Norman, do you not agree that as long as you interpret the word ‘character’ as almost always meaning ‘storage character’ in the MSDN documentation, everything is quite clear?

    It is only when you are stubbornly interpreting the word ‘character’ as meaning ‘linguistic character’ that there are any confusions.

    I believe that most of the arguments you have had on various forums about these kinds of issues boil down to you insisting that the word character must be interpreted as ‘linguistic character’, while everybody else thinks that ‘character’ means ‘storage character’

    Of course, you could continue to claim that everybody else should change, or maybe you could consider actually listening to what people are saying and adjust your own thinking a little bit?

  26. Mihai says:

    <<http://msdn2.microsoft.com/en-us/library/system.runtime.interopservices.unmanagedtype.aspx

    I think that page is one which used to say that ByValTStr counted characters not bytes.  The fact was that by default (in Visual Studio 2003 the default was ANSI) it counted bytes not characters.  As far as I can tell that page is fine now.>>

    That has nothing to do with TCHAR; the selection of the names is most unfortunate, and it only adds to the confusion.

    LPTStr, TBStr = platform-dependent: ANSI on Windows 98 and Unicode on Windows NT and Windows XP.

    ByValTStr = The character type is determined by the System.Runtime.InteropServices.CharSet

    Although the idea is clear (emulate the generic text data types used in non-managed code), they are affected by completely different things. They are not technically TCHAR, they are "TCHAR-like"; this is what I objected to.

    But I see your point.

  27. stegus says:

    Mihai:

    you say that most programmers think char (meaning byte) when reading character.

    Norman obviously thinks linguistic character when he reads character.

    The correct interpretation is instead to almost always think TCHAR when you read character.

    Of course it would have been clearer if MSDN had used TCHAR everywhere, but I really do not think that the current situation is so bad.

    As you say, the documentation is not incorrect, but it could be clearer.

  28. Mihai says:

    <<The correct interpretation is instead to almost always think TCHAR when you read character.>>

    Well, that "almost" in there is the problem :-)

  29. KJK::Hyperion says:

    Mihai, I’ll pretend I’m being nice to you, so Raymond will let my comment through. Not my fault you sign yourself with an ambiguous nickname, is it?

    First, "isomorphism". Look it up. Preferably with "grep" on PSDK headers. One point for me, because what I did not say cannot be wrong

    Second, with "UTF32" I’m obviously referring to characters outside of the BMP, also quite obviously referring to sort keys and Uniscribe. Two points for me, because nobody likes a nitpicker, an anonymous coward at it

    Third, no dear, I’m not talking with you. Sorry. Here, I baked you a cheesecake to make up for it. See, that Norman in his deep knowledge of linguistics did not use the term "grapheme" when referring to his confusion on StringCchPrintf is surely an unfortunate accident. Nevertheless I was speaking to him. That makes it three points for me, because smileys in serious conversation will earn you no respect. Raymond never used smileys in his articles, did he?

    In closing, since nobody in his right mind would try to win an internet argument with logical reasoning, I’ll add that I despise you and your hypocritical double-speak that makes you worse than a Nazi. By Godwin’s law, I win with a final score of Hitler-0: http://wendykaveney.com/uploads/deluxe/borderless/0005/0612061109361special_olympics_18_puffed_up__l.jpg

  30. stegus says:

    First of all, I believe that the problem is in reality very small.

    I could formulate the rule like this:

    The word ‘character’ should always be interpreted as ‘storage character’ unless it is plainly obvious from context that it is refering to linguistic characters.

    As a matter of fact, even using TCHAR is not entirely correct, since the size of TCHAR is entirely controlled by the UNICODE constant.

    When you are calling an A-function a storage character is one byte, and when you are calling a W-function, the storage character is 2 bytes.

    You can call either version manually regardless of the UNICODE macro.

    So, Raymond has described the situation perfectly as usual. Documenting the behavior in very precise terms leads to extremely long and wordy descriptions, and the current conventions work well enough in practice.

    As a simple improvement suggestion, maybe MSDN should have a link to a page describing the interpretation of ‘character’ in detail on every page that talks about string functions. This should clear up any possible confusion.

    The real problem is actually that Norman Diamond refuses to accept that the word ‘character’ can ever be interpreted as anything other than ‘linguistic character’

  31. stegus says:

    Hmm

    When you start thinking about the mbcs-functions, things start to get really interesting.

    Look at this page for example:

    http://msdn2.microsoft.com/en-gb/library/5dae5d43(VS.80).aspx

    How should the parameters to _mbsncpy_s be interpreted, and why ?

    Saying that everything is perfectly clear might be a slight exaggeration…

  32. Mihai says:

    <<Not my fault you sign yourself with an ambiguous nickname, is it?>>

    For regulars of internationalization blogs/newsgroups, and for regulars of this blog, I did not think it was ambiguous.

    But ok, here it is "Mihai Nita, i18n MVP":

     http://www.mihai-nita.net

     https://mvp.support.microsoft.com/default.aspx/profile=FA049700-6927-4F02-8F91-6552781C7407

    <<First, "isomorphism". Look it up.>>

    "Isomorphism" and "bitwise copy" are not the same beast. Sorry.

    <<Second, with "UTF32" I’m obviously referring to characters outside of the BMP>>

    You obviously have no clue about the Unicode terminology. UTF-8, UTF-16, UTF-32 are 100% equivalent, in that they can address the full Unicode space (0-10FFFF). Since UTF-32 has almost no support in Windows, and a UTF-32 code unit is enough to store any Unicode code point, there is no need to improve anything. The improvement in Windows was in the surrogate support. Surrogates are a mechanism unique to UTF-16. So Windows moved from UCS2 to UTF16.

    <<Norman in his deep knowledge of linguistics did not use the term "grapheme" when referring to his confusion on StringCchPrintf is surely an unfortunate accident. Nevertheless I was speaking to him.>>

    Yes. And even if I mentioned UNICODE_STRING, and UTF32, I was not referring to your post, but I was speaking to Norman ;-)

    And there is nothing to win here, because I was not arguing. Just trying to teach you some basic things. I did not realize that you, in your "deep knowledge," need no such thing.

  34. Mihai says:

    @stegus

    <<as long as you interpret the word ‘character’ as almost always meaning ‘storage character’ in the MSDN documentation, everything is quite clear.>>

    Except in the situations where character *does* mean char, sometimes ‘linguistic character’, or even wint_t.

    I am not arguing that it can be clear when you think about every specific situation. But it is not "automatically" clear, you really have to think about it.

    As a programmer I deal with char, CHAR, WCHAR, TCHAR. As a human being, I deal with (linguistic) characters.

    Using layman terminology (character) to mean programmer concepts (char/WCHAR/TCHAR) is making things more difficult than they have to be. And the problem is that most programmers think char when you say character.

    Read Converting a Project to Unicode: Part 5 (http://blogs.msdn.com/michkap/archive/2007/12/01/1391798.aspx) and you will see size expressed as sizeof(buffer) instead of sizeof(buffer)/sizeof(TCHAR) or sizeof(buffer)/sizeof(buffer[0]).

    This is because of that automatic use of something, without thinking.

    So, in this respect, is the MSDN documentation incorrect? I would say no. Is it clear? Not quite. It would benefit from some improvement. Just use TCHAR when talking about ‘storage characters’, and we will all know what that means without thinking twice.
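A minimal illustration of the byte-count versus element-count mixup Mihai points at; ARRAYSIZE_SKETCH is a hypothetical stand-in for the SDK's array-size macro, shown here only to make the distinction concrete:

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t TCHAR16; /* stand-in for a TCHAR compiled with UNICODE defined */

/* sizeof yields BYTES. Dividing by the element size yields the buffer's
   capacity in storage characters, which is what "count of characters"
   (cch) parameters expect. Passing raw sizeof to a cch parameter
   overstates a Unicode buffer's capacity by a factor of two. */
#define ARRAYSIZE_SKETCH(a) (sizeof(a) / sizeof((a)[0]))
```

The sizeof(buffer)/sizeof(buffer[0]) form is the one that stays correct whether the element type is CHAR, WCHAR, or TCHAR.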

  35. Norman Diamond says:

    Tuesday, January 09, 2007 10:00 AM by James

    > Norman, it only appears to you as a ‘tangent’ because you missed the point. Asked for a distance in feet, I consider it blatantly obvious that someone is meaning 0.3048 metres, because it’s the only rational interpretation available.

    Thereby proving that you missed the point:  you’re in country X, and it’s blatantly obvious that there’s only one rational interpretation available in country Y, therefore country X must bow down to country Y’s interpretation — this fails when country X is sufficiently powerful or independent.  The invention of the metric system had more goals than just getting rid of one king’s foot.

    > Was your .Net problem to do with expecting an I/O function call to be trying to count the latter when it wasn’t?

    There was an I/O function which was supposed to write some number of bytes, but it wrote more bytes than it was supposed to.  One of the arguments to the function was a Unicode string and the function had to convert to ANSI (because the file’s contents are ANSI).  Now all I can do about the internals is guess, but my guess is that the function’s implementor thought that the length of the ANSI string in bytes would be equal to the length of the original string in wchars and thought they could just write the result without doing any actual length checking on the result.  We have seen some Microsoft employees write in English that they think this way and I think I saw one MSDN article in Japanese that depended on the same thinking.  These people need training.

    Tuesday, January 09, 2007 11:30 AM by stegus

    > Norman, do you not agree that as long as you interpret the word ‘character’ as almost always meaning ‘storage character’ in the MSDN documentation, everything is quite clear?

    Clear but not accurate.  For a while I did interpret most MSDN pages exactly that way, but it turned out that I was equally wrong.  There are more cases than I thought there were, where even the ANSI version of an API really counts characters as documented instead of counting bytes.  One is CreateFileA, and some others related to the contents of edit controls (how many characters in a line or how much room they occupied or something like that, it’s been a while).

  36. Myria says:

    I wish code would stop using TCHAR.  Now that almost nobody cares about Win9x compatibility, code should use WCHAR and the W versions of functions exclusively.  It annoys me to see code written recently that still uses the A versions.

    Even if Win9x compatibility is important to you, use unicows…

    Melissa

  37. Dean Harding says:

    One is CreateFileA

    A little bit of thought can set the reason for that straight. Clearly, the person who implemented CreateFileA simply did a MultiByteToWideChar on the passed-in string, and presumably they had a fixed MAX_PATH buffer of WCHARs to hold the result. Then they used that buffer to call the "real" CreateFileW.

    But I don’t think this is an exception. This is a "length of the string" use of the term, which as I mentioned above is the correct usage.

  38. Dean Harding says:

    As far as I know in MSDN, there are only two(*) different uses of the word "character".

    The first, when you’re talking about buffer sizes (as in, "how big is the buffer you’re supplying?") is where "characters" == "TCHARs"

    The second, when you’re talking about string lengths (as in, "what is the length of this string e.g. strlen, mbcslen, etc") is where "characters" == "code points" (for want of a better term — it’s still not talking about "linguistic characters", because it will return "2" for a denormalized á for example)

    Which is which should be perfectly clear from the context. And I would argue this is the correct way to do it. There’s no need to introduce two new words to "disambiguate" something that is not ambiguous to start with.

    (*) the possible exception here is CharNext and CharPrev, but in my opinion, those are two rather under-documented functions anyway.
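    Dean’s point that a string-length count returns code points rather than linguistic characters can be demonstrated directly (a sketch in Python; Python’s len on a str plays the role of a wcslen-style count here):

    ```python
    # A "length of the string" count returns code units / code points, not
    # linguistic characters: the decomposed form of á counts as 2, the
    # precomposed form as 1, even though both render as one character.
    import unicodedata

    decomposed = "a\u0301"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
    precomposed = "\u00e1"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE

    assert len(decomposed) == 2
    assert len(precomposed) == 1

    # NFC normalization maps one form to the other; the rendered glyph is the same.
    assert unicodedata.normalize("NFC", decomposed) == precomposed
    ```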

  39. stegus says:

    Norman:

    << There was an I/O function which was supposed to write some number bytes, but it wrote more bytes than it was supposed to.  One of the arguments to the function was a Unicode string and the function had to convert to ANSI >>

    It would be really interesting to know more details about this. Exactly which function are you talking about? Most .NET functions are very explicit about encoding issues – when you are writing strings, you always specify the number of characters to write. The number of resulting bytes always depends on the selected encoding.

  40. stegus says:

    Dean:

    <<A little bit of thought can set the reason for that straight. Clearly, the person who implemented CreateFileA simply did a MultiByteToWideChar on the passed-in string, and presumably they had a fixed MAX_PATH buffer of WCHARs to hold the result. Then they used that buffer to call the "real" CreateFileW.

    But I don’t think this is an exception. This is a "length of the string" use of the term, which as I mentioned above is the correct usage.>>

    But how are you supposed to know that in this particular case ‘length of a string’ means the number of multi-byte characters?

    The normal rule is that ‘length of a string’ in the ANSI functions means the number of bytes (since the storage character for these functions is a byte).

    CreateFileA is an ANSI function, yet when the documentation talks about MAX_PATH characters, it is really talking about the number of WCHARs after MultiByteToWideChar conversion.

    How is the reader of the documentation supposed to know this?

  41. stegus says:

    James:

    I see your point, I agree that this particular issue is unlikely to cause any problems.

    However, look at _mbsncpy_s – as far as I can understand the numberOfElements parameter is measured in bytes, and the count parameter refers to the number of multibyte-characters (linguistic characters) to copy.

    This is a very strange combination of parameters, and it is not spelled out in the documentation at all.

    The documentation even talks about how you can specify count=size-1 in order to truncate the string. This is obviously nonsense if count and size are measured in different units.

    The mbcs functions desperately need improved documentation. A general note stating that numberOfElements is always measured in bytes and count always refers to the number of multibyte characters would be enough.
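    The unit mismatch stegus describes can be made concrete with a small model (this is Python arithmetic illustrating the hazard, not the CRT implementation; code page 932 stands in for a double-byte ANSI code page):

    ```python
    # If numberOfElements is a byte count and count is a count of multibyte
    # (linguistic) characters, then in a double-byte code page copying `count`
    # characters can need up to 2*count + 1 bytes -- so the documented
    # count = numberOfElements - 1 truncation recipe can overflow the buffer.

    def mbs_bytes_needed(text: str, count: int) -> int:
        # Bytes required to hold the first `count` characters plus a NUL,
        # after conversion to code page 932.
        return len(text[:count].encode("cp932")) + 1

    size = 10                       # a 10-byte destination buffer
    # An 8-character Japanese string; each character is 2 bytes in cp932.
    text = "\u65e5\u672c\u8a9e\u306e\u6587\u5b57\u5217\u3067"

    needed = mbs_bytes_needed(text, size - 1)  # count = size - 1, per the docs
    assert needed > size            # 17 bytes needed, 10 available: overflow
    ```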

  42. James says:

    OK, technically there may still be countries using ‘foot’ to mean something other than the standard agreed decades ago by the US and Commonwealth countries, 0.3048m, just as in theory a byte could be something other than 8 bits in size – but in both cases, I can use them without fear of being misunderstood in good faith.

    The point remains that it is impossible to use "linguistic characters" as the unit of measurement for a buffer passed to StringCchPrintf.

    It isn’t a case of CreateFileA being limited to "MAX_PATH linguistic characters", either. *Some implementations* (those on NT-derived operating systems) of CreateFileA transcribe the buffer into WCHAR buffer[MAX_PATH] in order to call NtCreateFile, giving a limitation of MAX_PATH WCHARs (not ‘linguistic’ anythings) on the input, but you can only rely on support for MAX_PATH bytes safely.

    Stegus, the reader can safely interpret it as intended: you can pass in up to MAX_PATH bytes as the filename. Depending on the platform and the specific bytes you’re passing, you may in fact be able to get away with longer strings some of the time, just as I’m sure other functions will sometimes accept values the documentation doesn’t guarantee will work, but "CreateFileA has a limit of MAX_PATH characters" (characters==bytes here) is sufficient: the limit just isn’t rigidly enforced.

  43. Norman Diamond says:

    Wednesday, January 10, 2007 2:39 AM by stegus

    > There was an I/O function which was supposed to write some number of bytes, but it wrote more bytes than it was supposed to.  One of the arguments to the function was a Unicode string and the function had to convert to ANSI

    It would be really interesting to know more details about this. Exactly which function are you talking about?

    It took a little bit of searching to find these details again.  The FilePut function (including a record number) wrote more bytes than the record length that had been set in calling the FileOpen function.  To repeat in the interests of fairness, this was a beta, and I’m very glad that this was one of the rare cases where Microsoft decided to fix it.

  44. Dean Harding says:

    stegus: I mentioned my "rule" in my second-to-last post: "buffer size" parameters are in bytes (for *A functions; WCHARs for *W functions), while "length-of-string" parameters are in "multi-byte units" (if you know what I mean).

    In the case of _mbsncpy_s, numberOfElements is the size of the "strDest" buffer while "count" is the maximum length of the string to copy.

  45. stegus says:

    Dean: Your rule shows that you have a deep insight into these issues.

    I do not think that all developers are quite so insightful.

    The problem is that in almost all other cases, buffer size and string length parameters are measured in the same unit (what Raymond called storage characters).

    The big exception is the mbcs-functions where string length is measured in multi-byte characters, and buffer sizes are measured in bytes.

    The mbcs functions are extremely dangerous – for example, if you have a 10-byte buffer and you copy an 8-character string (as measured by _mbslen) into the buffer, you risk a buffer overflow.

    I believe that there should be a clear warning about this in the documentation for all the mbcs-functions.

    Note also that if you define _MBCS, sizeof(TCHAR) is still 1, but for example _tcslen() is suddenly mapped to _mbslen() which means that you suddenly have to handle the mixed personality of the mbcs functions.

    A programmer who is used to normal string handling in C will definitely create lots of dangerous bugs in a _MBCS-enabled program.

    Of course, the individual developer is responsible for any bugs he creates, but confusing documentation does not help.

  46. stegus says:

    Norman: OK, so the error was in FilePut, which only exists for backwards compatibility with VB6. Since it is meant for backwards compatibility, it kind of makes sense that it should emulate VB6 behavior as much as possible, even if the behavior contradicts the documentation.

    You might be interested to know that if you are using VB.NET FilePut on a BINARY (not random) file, it will still write the number of bytes that results from converting Unicode to MBCS, even if the documentation specifies that it should write the same number of bytes as the number of characters in the string. So if you have a 10-character Japanese string, FilePut might write anything from 10 to 20 bytes to the file. Horrible!

    Unfortunately the problem is in VB6, where the conversion from Unicode to ANSI is seriously broken. If we had a time machine we could go back and fix this in VB6, but right now it cannot be fixed for backcompat reasons.

    As a general rule you should try to avoid the VB6 compatibility functions in a .net program – use the functions in the System.IO namespace directly instead.

  47. Harvey Pengwyn says:

    Ah, but which foot? :-) There is the U.S. Survey Foot to consider (this is a real issue, not just some bizarre archaic unit no one uses): http://www.vterrain.org/Projections/sp_feet.html

  48. Norman Diamond says:

    Thursday, January 11, 2007 4:47 AM by stegus

    Norman: OK, so the error was in FilePut, which only exists for backwards compatibility with VB6.

    Huh!?  As mentioned, we knew of some workarounds for some VB6 problems, plus VB6 wasn’t in beta so my boss shipped VB6 code.  But we weren’t aware of FilePut in VB6 causing this same corruption to customers’ files.  When I have time to look at VB stuff again I’ll have to take another look at this.  Thanks for the heads-up.

  49. James says:

    Harvey: The difference is just 610 nm – and the term is ‘survey foot’ as opposed to ‘foot’, making it a different unit, just as a "baker’s dozen" is 13 (as opposed to a regular dozen’s 12) and a ‘nautical mile’ isn’t the same thing as a ‘mile’.

    Moreover, it’s used in specific circumstances, so the context helps: it shouldn’t confuse people in usage, any more than I’d be surprised at CreateWindow not installing panes of glass in my home.

  50. Miral says:

    This whole discussion is why I heartily wish that *all* WinAPIs, without exception, exclusively used a count of bytes and not characters or storage characters or whatever.

    I know, I know, no time machines.  Doesn’t stop me grumbling about it though :)


Comments are closed.
