Is the maximum size of the environment 32K or 64K?


There appears to be some confusion over whether the maximum size of the environment is 32K or 64K. Which is it?

Both.

The limit is 32,767 Unicode characters, which equals 65,534 bytes. Call it 32K or 64K as you wish, but make sure you include the units in your statement if it isn't clear from context.
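For the record, the arithmetic behind the two figures is just a unit conversion; a quick Python sketch:

```python
# One environment block: at most 32,767 UTF-16 code units ("characters"),
# each occupying 2 bytes.
MAX_ENV_CHARS = 32_767        # the "32K" figure (characters)
BYTES_PER_UTF16_UNIT = 2

max_env_bytes = MAX_ENV_CHARS * BYTES_PER_UTF16_UNIT
print(max_env_bytes)          # 65534 -- the "64K" figure (bytes)
```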

Comments (39)
  1. Daniel says:

    In any case, much better than 256 bytes. :)

  2. Adam says:

    When is "K" /ever/ used to mean "Kilocharacters"?

    "K" without an additional unit specifier (e.g. "b", "B", "g", etc…) in a computing context /always/ means Kilobytes[0]. "K" is a unit. Calling it "64K" does include a unit and is completely unambiguous.

    If you want to count characters, yes, use an appropriate unit. "K" is /not/ an appropriate unit. "Kilocharacters" would be. "Kc" /might/ be, but I don’t know if anyone else would understand you.

    If the environment is 64 kilobytes, it is 64K. It is not 32K, and never will be.

    ([0] Unless it means Kibibytes if you insist on the computing fraternity not being able to reappropriate "Kilo" to mean "1024", but that’s a whole nother off-topic discussion…)

  3. Matt Green says:

    Is such a limit a problem for any program out there? I’m curious what people do with the environment these days. I had an entire runtime library designed around the fact that you can (ab)use the environment as a global variable, and stored the root bookkeeping structure in a TLS slot whose index I recalled using a certain environment variable. It is pretty wacky, but I wanted to see if I could get away with not needing users to link to something.

  4. pcooper says:

    And to be completely pedantic, by "Unicode characters" you probably mean UTF-16 code units.

  5. dbt says:

    There is no such thing as UCS-2.  There is only UTF-16.

  6. Tony Cox [MSFT] says:

    No such thing as UCS-2?

    Wikipedia begs to differ: http://en.wikipedia.org/wiki/UCS-2

  7. Gabe says:

    The environment is a UNICODE_STRING structure which has a 2-byte header for the count of bytes in the string. Since the 2-byte value would overflow at 65,536 and 65,535 isn’t a valid number of bytes for a UCS-2 string, the maximum usable length is 65,534 bytes for the string. Because of the count, no null termination is required.

    Note that the structure holds the length of the string and the length of the allocated buffer. You could argue that they should have made the structure use 4-byte values for the length, but that would waste 4 bytes for every string in the system.

    I suppose they could have made the counts hold the number of characters instead of bytes, but that would only double the possible string length, while making string handling code more confusing.
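    A ctypes sketch of the layout Gabe describes (field names as in winternl.h; illustrative only):

```python
import ctypes

# Layout of the NT UNICODE_STRING structure (see winternl.h).
class UNICODE_STRING(ctypes.Structure):
    _fields_ = [
        ("Length", ctypes.c_ushort),         # bytes used by the string
        ("MaximumLength", ctypes.c_ushort),  # bytes allocated for Buffer
        ("Buffer", ctypes.c_void_p),         # PWSTR; no null terminator needed
    ]

# A USHORT tops out at 65,535, and a UTF-16 string must occupy an even
# number of bytes, so the largest usable Length is 65,534 bytes:
largest_even = (2**16 - 1) // 2 * 2
print(largest_even)        # 65534 bytes
print(largest_even // 2)   # 32767 2-byte characters
```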

  8. Spire says:

    If you want to be *really* unambiguous: The limit is 32,767 UCS-2 characters, which equals 65,534 bytes.

    Microsoft documentation almost always means "UCS-2" when it uses the term "Unicode", but in reality UCS-2 is just one of many different Unicode character encodings. Not everyone is aware of this convention, so it is safer to use the correct term, especially when there is a chance that it might be taken out of context.

  9. Adam says:

    J. Edward: Don’t you have that backwards? I’d have thought that the limit *is* 65535 bytes, into which you can happen to put 32767 UCS-2 characters. (Or between 16383 and 32767 UTF-16 characters, depending on how many of them are outside the BMP)

  10. Spire says:

    dbt: Officially, UCS-2 may be obsolete/deprecated, but that doesn’t mean that it is no longer in use anywhere.

    Adam: My statement was about equivalence, not causality. However, I’m guessing that you’re correct that the real limit is technically 65,535 bytes — not 65,534 bytes as Raymond seemed to imply.

    (And of course we haven’t even gotten into the amount of *usable* space, taking null termination into account.)

  11. Random Reader says:

    "K" without an additional unit specifier (e.g. "b", "B", "g", etc…) in a computing context /always/ means Kilobytes[0]. "K" is a unit. Calling it "64K" does include a unit and is completely unambiguous.

    In my experience, that’s imposing a convention that simply doesn’t exist.  I see the "K" suffix used to mean 1024 (and occasionally 1000, but that’s another topic) quite often, with the unit being inferred from the context.  A unit of bytes (or octets) is definitely not assumed without the context.

    "KB", on the other hand, does mean kilobytes.

  12. Adam says:

    Did you not see the "in a computing context" part, which you quoted?

    Or do you mean that "K" is used "quite often", in a computing context, without other units being mentioned, and not mean KB?

    Do you have any examples of such use?

  13. Centaur says:

    K never means 1000. k does.

    I was surprised to find out that kilobits and megabits are decimal, despite the computing context…

  14. fraggle says:

    Here’s a more important question: why on earth is there a limit on the environment size?

  15. Random Reader says:

    Or do you mean that "K" is used "quite often", in a computing context, without other units being mentioned, and not mean KB?

    Yes; a common one is when talking about line speeds (cable and DSL especially), where it usually refers to units of bits.  And 1000 instead of 1024, unfortunately.

    For some descriptions of hashtables or cache architectures, I’ve seen terms like "4K entry" used to refer to the number of slots (where each slot is N bytes on its own), but I don’t have a reference for that offhand.  Then there’s the whole Y2K thing…

    As Centaur also notes, there are "correct" forms, but such conventions seem to go out the window in common use.  I tend to use the context to figure out what is meant first, and only analyze the suffix if the context isn’t clear.

  16. Stu says:

    Don’t forget the "hard drive manufacturer"’s definitions of units.

    I have often seen drives advertised as xGB and a note saying "1GB=1,000,000,000 bytes". Same goes for USB pens and other storage devices.

  17. Phoenix says:

    However, according to Intel, 4 KB is 0x1000 bytes (4096).

    There is a big lack standarts…

  18. Dewi Morgan says:

    "There is a big lack standarts…"

    Not so: there are standards, but since the movers and shakers (MS, et al) don’t uphold them, nobody else does. Many don’t even know of them.

    kB to denote 10^3 bytes is a standard.

    kB (or KB) to denote 2^10 bytes is forbidden by the SI, the IEC, and the IEEE.

    Instead, the recommended form for binary values is the prefixes kibi-, mebi-, gibi-, tebi-, pebi-, exbi-, written Ki, Mi, Gi, Ti, Pi, Ei. The postfix units are bit (b), byte (B) and octet (o).

    Thus, the environment storage is 64 KiB, or 64 Kio, since I believe the two are the same even on Win64.

    http://en.wikipedia.org/wiki/Kibibyte

    Raymond is correct that the storage would often be listed as 32K with the units left off, since there’s no official notation for "arbitrary width character".

    So if someone asks "What string length can I fit in the environment?" (which is a far more important and practical question than "how many bytes"), they will most likely be told 32K.

    He was right to point out that it’s important to get the units right, but as others pointed out, it’s almost as important to get the base indicator right: the environment is "32Ki characters".

  19. Mike Jones says:

    This brings up another question… XML files typically use UTF-8 encoding, which is a clever encoding: for "regular" characters it’s the same as old text files, but it can expand to hold the most complex Unicode characters.  UTF-16, which we see all over the Windows API, is a pain to use and cannot hold those high-numbered Unicode characters.  It was a bad choice.

  20. Random Reader says:

    Yes, both UTF-8 and UTF-16 are variable-width encodings and can handle the entire range of Unicode characters.  UTF-16 is the right choice for most general-purpose APIs; it strikes a balance between storage and processing overhead.

    UTF-8’s big advantage is compatibility with ASCII, since Unicode keeps code points 0 to 127 identical to the ASCII standard.  UTF-8 uses the 8th bit to indicate encoding for code points 128 and above, but uses up to 4 code units to do it.  This wide range adds some processing overhead.  There’s also the need to deal with invalid but logically possible code unit sequences; a simple implementation could easily take "C0 80" to mean "00".  The problem of having multiple representations for the same character can lead to security flaws.  There are several well-known exploits of this, including one for an old version of IIS.  UTF-8 is also at a memory disadvantage for much of the common East Asian character set, as it requires 3 octets to represent what UTF-16 can do in 2.

    UTF-16 trades the minimum 2 octets per code unit memory cost, and lack of ASCII compatibility, for easier processing.  Using surrogate code points, UTF-16 only ever expands to 2 code units, and the surrogate encoding does not have the same multiple-representation problem UTF-8 does.
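    The size trade-offs (and the overlong-sequence hazard) are easy to demonstrate in Python:

```python
# Bytes needed per character in each encoding.
for ch in ("A", "\u00e9", "\u6f22", "\U0001d11e"):  # ASCII, é, 漢, 𝄞
    print(f"U+{ord(ch):05X}: utf-8={len(ch.encode('utf-8'))} bytes, "
          f"utf-16={len(ch.encode('utf-16-le'))} bytes")

# A strict decoder rejects the overlong sequence C0 80 (an illegal
# alternate spelling of U+0000):
try:
    bytes([0xC0, 0x80]).decode("utf-8")
except UnicodeDecodeError:
    print("overlong sequence C0 80 rejected")
```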

    Most general-purpose Unicode systems made the same choice, notably Java and Mac OS X.  The Web and related technology in general have gone with ASCII compatibility, a decision which makes sense for its specific domain.

    How various parts of Windows actually handle surrogate pairs is another story entirely.  One that’s been told before:

    http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx

    http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx

  21. BryanK says:

    I am pretty sure that UTF-16 *can* hold characters whose code point is greater than 65535, though.  To do it, you use two (or more?) UTF-16 values, one of which is an escape.

    (Or it’s done using "combining characters", which may amount to the same thing.  Or possibly something else.)

    In short: UTF-16 does not mean that each character is always 16 bits…
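    A short Python check makes the surrogate mechanism concrete:

```python
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP
data = ch.encode("utf-16-be")

# Split the encoded bytes back into 16-bit code units:
units = [hex(int.from_bytes(data[i:i + 2], "big"))
         for i in range(0, len(data), 2)]
print(units)  # ['0xd834', '0xdd1e'] -- one character, two code units
```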

  22. Mike Jones says:

    Since UTF-16 can also expand for high-numbered characters, we can’t accurately say that the environment, which is limited to 64 kilobytes, can hold 32 kilo-UTF-16-chars, since a UTF-16 character can be more than 2 bytes.  We can only say it holds AT MOST 32 kilo-UTF-16-chars … but maybe less.

  23. Gabe says:

    Don’t forget that when Windows NT was being designed in the early 1990s, there was nothing but UCS-2. UTF-8 wasn’t presented to the public until the same year that NT was first released, and not a formal standard until after NT 4.0 was released.

    I don’t think there were any characters that required more than two bytes in UTF-16 until after Windows 2000 was released. The characters that require more than 2 bytes are just special purpose things like old languages and math/music symbols. Considering that, I think UTF-16 was a perfectly logical choice for encoding.

  24. Nick Lamb says:

    Random Reader – In practice though UTF-16 has been a disaster everywhere it has been "implemented" (scare quotes to represent the reality that in practice so much UTF-16 code is really just re-branded UCS-2 code with all the consequences that entails).

    Look at the state of Unicode on a typical user’s Windows desktop after 15 years of "support". Probably only half of their apps are Unicode capable at all, and few if any of those work properly outside the BMP. Why? Because UTF-16 is a lot of extra work for the programmers, and even Microsoft’s own tech writers don’t seem to understand UTF-16 Unicode scenarios well enough to document them properly as Raymond helpfully illustrates here.

    The security differential is null, you must get your implementation details absolutely correct or there are security problems in either encoding. Microsoft screwed up both their UTF-8 and UTF-16 handling from a security point of view, and we’re supposed to be happy that in 2003 or so this is finally mostly fixed.

    IMNSHO Most platforms that married UTF-16 did it because they were already engaged to UCS-2. UTF-16 has all the disadvantages of UCS-2, which were already considerable, plus the burden of being a variable length encoding. The worst of both worlds. It’s like getting to the altar and the bride whispers, "Oh, I forgot, I’m actually sleeping with someone else. I hope that won’t be a problem for our marriage".

    ASCII compatibility wasn’t enough reason to use UTF-8 on the Internet, CESU or other encodings would have met that requirement. UTF-8 won on the Internet because it has the least disadvantages of any encoding. It’s smaller for most of today’s and tomorrow’s data (especially after you factor in the real size with ‘deflate’ as the only compression in practical use); it preserves the code point ordering; it’s endian-neutral and it recovers correctly from dropped bytes, still a common data corruption today.

    And on the platforms that married UTF-8 non-Unicode applications are now an endangered species. Because it was so easy to dip a toe into the water, most programmers were swimming before they even knew it. The BMP is nothing special in UTF-8, so your Unicode support doesn’t stop at its edge either.

  25. Random Reader says:

    There’s nothing inherent in UTF-16 that makes the platforms that use it a "disaster" in practice — Java certainly isn’t, nor is ECMAScript or .NET.

    I didn’t mean to say that ASCII compatibility was the only consideration for Internet use, just that it was a major factor in deciding to use an encoding with 8bit code units.  You do bring up a general point I neglected earlier: the Internet is concerned with information exchange/transfer, not processing.  The two concepts have different goals.

    (As a side note, I’ve seen that comment about UTF-8 recovering from arbitrary lost bytes before.  I must be hanging out in entirely the wrong parts of the ‘net, because I’ve never encountered a situation where it was the responsibility of a character encoding to handle such things.  UTF-8 doesn’t even have error detection!  These are the responsibilities of transport protocols, not language-oriented standards.)

    The platforms I’m referring to, OSes and VMs and what have you, are concerned with processing that data.  UTF-16, even with its variable width, is simply easier to process than UTF-8.  A general-purpose platform like an OS also can’t make the assumption that the East Asian languages simply won’t be "common" — it needs to handle all cases and handle them well.  UTF-16 is a nice balance for doing that.  The security point I brought up is specific to the UTF-8 encoding; I’m not aware of any others that are specific to either UTF-8 or UTF-16.  There are plenty for Unicode in general, depending on the use context.  The whole IDN thing is a recent example.

    You seem to be confusing support for an encoding with actual user-ready support for Unicode.  The two are very far apart.  As a trivial example, look at how many applications support "case-insensitive" operations on UTF-8 backing stores — but only for the ASCII characters.  This problem affects everything from PHP to text editors.  Many applications "handle" UTF-8 quite by accident, as it simply doesn’t get in the way of the ASCII bytes they recognize.  That’s a great property for pass-through transfer, but meaningless for doing actual work on data.

    As far as Windows goes, there are some contributing factors to that.  (15 years?  What are you counting from?  The first seriously public implementation was NT4, 1996.)  One is the API situation: Windows 9x simply didn’t support most of them, and the MSLU was inconvenient when it appeared.  For the general consumer, a UTF-16 platform wasn’t available until XP, 2001-2002.  Without that general availability, not many would want to target it.  And in order to target the UTF-16 APIs at all, you have to want to do Unicode — it’s not accidental like the ASCII->UTF-8 transition is.  Wanting to do Unicode at all is also a relatively recent thing.

    Which brings me to the real point: Unicode is hard.  Even the major web search engines can’t do it right.  (Search Michael Kaplan’s blog for some fun comparisons.)  It takes much more work than just supporting a particular encoding.

    Encodings are easy.

  26. Norman Diamond says:

    Friday, July 07, 2006 4:26 AM by Centaur

    > K never means 1000. k does.

    When I needed a 47K resistor, K didn’t mean 1024.

    Friday, July 07, 2006 7:03 AM by Stu

    > Don’t forget the "hard drive manufacturer"’s definitions of units.

    Here’s someone else who won’t forget:

    http://www.wdc.com/settlement/

    Friday, July 07, 2006 11:01 AM by Mike Jones

    > XML files typically use UTF-8 encoding, which is a clever encoding: for "regular" characters it’s the same as old text files,

    No it is not.  It is an encoding but it is not the same as old national encodings.  (Except in a country which I think is the world’s third largest by population, and a few smaller countries or parts thereof.)

    Friday, July 07, 2006 7:03 PM by Gabe

    > I don’t think there were any characters that required more than two bytes in UTF-16 until after Windows 2000 was released.

    Close.  The characters existed in the world’s largest country (by population).  Obviously there’s no encoding for them in UCS-2 and I’m not sure when UTF-16 was invented in order to allow for them.  Also I don’t know the national encodings of that country (nor provinces thereof) so don’t know if national encodings included those characters.

    In Japan there was some debate over whether to add encodings to represent miswritings of characters, because miswritings had been performed by government officials in registering people’s names, and those people had to use the government-assigned characters rather than the correct characters.  (Sorry I don’t know the outcome.  I don’t even know if the issue has been decided yet.)

    Saturday, July 08, 2006 4:43 AM by Nick Lamb

    > and even Microsoft’s own tech writers don’t seem to understand UTF-16 Unicode scenarios

    Nor even UCS-2, depending on which MSDN pages you read.  Nor ANSI code pages, depending on which MSDN pages you read.

    > UTF-8 won on the Internet because it has the least disadvantages of any encoding. It’s smaller for most of today’s and tomorrow’s data

    That depends on how you define "most".

    > it preserves the code point ordering;

    Does not.

  27. Dean Harding says:

    UTF-8 doesn’t even have error detection!

    It "sort of" does, in that you can tell just by looking at a single byte whether you’re in the middle of a multi-byte sequence, or at the start of a new one. I think that’s what he meant.

    The thing is, UTF-16 IS better for data processing than UTF-8. Technically, since they both represent the same data, anything you can implement in UTF-16 you can also implement directly on the UTF-8 bytes. But generally when processing Unicode data, you would transform the UTF-8 into UTF-16, do the processing, then transform back (if you want everything in UTF-8).

    For example, linguistic sorting, normalization, and so on are all MUCH simpler when working in UTF-16. Sure, it’s POSSIBLE to do it directly in UTF-8, but nobody would WANT to.

    Also, the way surrogates work, you can mostly just process them like combining characters. For example, "A" + an acute accent could be <U+00C1> or it could be <U+0041 U+0301>.

    This means that in UTF-8, "A" + acute accent could be 0xC3 0x81 (the precomposed form) or 0x41 0xCC 0x81 (the decomposed form).  Notice that this is NOT the same as the simple "variable number of bytes per character" that UTF-8 uses anyway.  This is "multiple code points per logical character", which is something else.
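    In Python, for instance:

```python
import unicodedata

precomposed = "\u00c1"                                   # Á, one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # A + combining acute

print(precomposed.encode("utf-8").hex(" "))   # c3 81
print(decomposed.encode("utf-8").hex(" "))    # 41 cc 81
print(precomposed == decomposed)              # False: different code points
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```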

  28. Nick Lamb says:

    I pointed out that UTF-8 preserves the binary code point ordering of Unicode, and Norman replied concisely but inaccurately "Does not". Anyone who wonders about this can trivially visit Unicode.org and read the UTF-8 standard for themselves. I suppose if Norman still thinks he is right he could try to provide an actual example next time…

    Dean, if you were going to transform Unicode text into one of the UTF encodings as a convenience for writing text processing algorithms it would be UTF-32. Of course the additional memory bandwidth means a sober analysis is needed here. Are you really going to do this piece of programming over and over? If not, why not get it right in your native encoding just once and save the overhead? Sure enough, on this UTF-8 system the linguistic sorting, normalisation, case transforms and similar operations are done on UTF-8 strings.

    The idea that you can muddle the surrogates and combining forms doesn’t work, it just creates more bugs. It is interesting how often Win32 developers, exhausted by problems with UTF-16, wish them all away by pretending that surrogates aren’t important enough to handle correctly (even Michael Kaplan wondered if he could treat them as ligatures). This means more places where developers get confused and give up, returning to the sanctuary of their "ANSI" codepages.

    Random, it’s hard to argue that Java or .NET are much better here, programmers just have no choice – they never had any support for other encodings. You could as well say that Java’s IPv6 implementation is great – programmers use it, bugs and all because there’s no alternative. The "major web search engines" all have problems just past the edge of the BMP. Obviously this doesn’t prove that the problem is UTF-16, but it’s a funny coincidence isn’t it?

  29. Dean Harding says:

    Nick: UTF-32 doesn’t save you from combining characters, it’s the same thing in UTF-8, UTF-16 and UTF-32 (or any other encoding of Unicode). I’m not saying it’s exactly the same thing, just that it’s a similar problem, and since it’s one that ALL conformant Unicode implementations (be they UTF-8 or UTF-16) need to solve, most of the work for surrogates in UTF-16 is already in place.

    Anyway, this is getting off-topic now…

  30. Random Reader says:

    Sunday, July 09, 2006 11:36 PM by Dean Harding

    >> UTF-8 doesn’t even have error detection!

    > It "sort of" does, in that you can tell just by looking at a single byte whether you’re in the middle of a multi-byte sequence, or at the start of a new one. I think that’s what he meant.

    Yeah, I figured he was referring to the fact that the UTF-8 decoder has enough information to sense and ignore broken multi-unit sequences, but that’s just a side effect of the encoding form putting size considerations first.  It can’t, for example, detect the loss of a single code point in the ASCII range, or a whole multi-unit sequence.

    A real error detection mechanism could do that.  The classic solution is checksums, which most data transport systems already have in abundance.

    Monday, July 10, 2006 6:03 AM by Nick Lamb

    > It is interesting how often Win32 developers, exhausted by problems with UTF-16 …

    You keep referring to major problems with UTF-16 (that apparently don’t exist in UTF-8), but you haven’t mentioned anything specific.  What do you mean?

    > Random, it’s hard to argue that Java or .NET are much better here, programmers just have no choice – they never had any support for other encodings.

    I’m not following this thought.  To summarize our conversation so far, you said that UTF-16 has been a disaster everywhere, I replied saying there’s nothing inherently disastrous about it, and now you’re saying…?  It’s impossible to determine whether .NET and Java are disasters because they only use UTF-16?  Huh?

    As far as support goes, they both can convert to and from UTF-8 just fine.  They merely use UTF-16 natively, the same way a UTF-8 processing tool uses UTF-8 natively.

    > The "major web search engines" all have problems just past the edge of the BMP.

    Actually I was referring to things like this:

    http://blogs.msdn.com/michkap/archive/2005/11/15/492301.aspx

    It’s completely unrelated to BMP or the encoding used.

  31. Norman Diamond says:

    Monday, July 10, 2006 6:03 AM by Nick Lamb

    > I pointed out that UTF-8 preserves the binary code point ordering of Unicode, and Norman replied concisely but inaccurately "Does not"

    Maybe I need to quote more of your posting that I replied to.

    Saturday, July 08, 2006 4:43 AM by Nick Lamb

    > ASCII compatibility wasn’t enough reason to use UTF-8 on the Internet, CESU or other encodings would have met that requirement.

    It looks like you’re talking about compatibility with code pages.  For some reason I thought you were in the UK where you already need a code page bigger than the 128 code points that ASCII has, so I assumed you didn’t really mean just ASCII.  If you did mean that one high priority requirement for internet communications should be codepoint compatibility with one country’s national standard, then I made a wrong assumption, sorry.

    > UTF-8 won on the Internet because it has the least disadvantages of any encoding. It’s smaller for most of today’s and tomorrow’s data (especially after you factor in the real size with ‘deflate’ as the only compression in practical use); it preserves the code point ordering;

    It looks like you were explaining how UTF-8 has fewer disadvantages than others, where one of the others is UTF-16 and one is the more general way that HTTP headers specify what encoding is used in the content.

    In fact HTTP headers that specify the encoding allow preservation of 100% of the codepoint ordering of whatever national encoding was used.  Furthermore, those HTTP headers work.  Of course those headers don’t work when not used (for example when foreign sites leave the encoding to the viewer’s default but forgot to encode their content in EUC or Shift-JIS), but they do work when they are used.

  32. Nick Lamb says:

    Dean, it’s the "same thing", but not "exactly the same thing" ? Does that mean you don’t "exactly" disagree with me after all ?

    Random, the robustness which you refer to as a "side effect" was a design criterion for the encoding. If you look at earlier variable length character encodings this feature is missing. UTF-16 had the same requirement at the code unit level, because without it seeking is painful, but because transports are /byte/-oriented it’s a less complete fix.

    So, you asked, what’s wrong with Java? It took a good part of a decade for Java to implement UTF-8 correctly (and internally it still uses a CESU-8-like hack for serialisation, making development tools and interoperability one step trickier). There’s lots of Java code that assumes (as Sun wrote in their original documentation) that one Java character = one Unicode character, which means it will break, perhaps spectacularly, outside the BMP. Now the "character" type in Java is more or less useless. It’s not a byte (which Java already has) and it’s not a string (the minimum unit which can hold a Unicode character); it’s just an arbitrary UTF-16 code unit. Bye bye abstraction.

    Yes, there are several bugs in search engines. Some of those bugs are fundamental character encoding problems, which I’d argue (without seeing the code) are symptoms of UCS-2 support masquerading as UTF-16.

    Norman, the phrase I used was "code point ordering of Unicode". If that phrase didn’t mean anything to you, why write that you disagree? ISO 10646 provides a mapping from numbers to characters (and control codes, symbols, etc.) and each of the encodings converts those numbers into byte sequences. This mapping, which underpins all of Unicode, is ordered. UTF-8 encoding preserves the ordering, as explained in its standards documents. UTF-16 does not, because of the surrogate characters introduced mid-way through the code range.
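    The ordering claim can be verified mechanically; a short Python sketch:

```python
# U+FFFD is a high BMP code point; U+10000 is the first supplementary one,
# so the code points order as lo < hi.
lo, hi = "\ufffd", "\U00010000"

# UTF-8 byte strings compare in the same order as the code points:
print(lo.encode("utf-8") < hi.encode("utf-8"))          # True: ef bf bd < f0 90 80 80

# UTF-16 does not: the surrogate range D800-DFFF sorts below FFFD, so the
# supplementary character's bytes compare lower.
print(hi.encode("utf-16-be") < lo.encode("utf-16-be"))  # True: d8 00 dc 00 < ff fd
```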

  33. Nick Lamb says:

    Ugh, now they have me doing it. "surrogate characters" don’t exist, it should read "surrogate code points" in the last line above.

  34. Norman Diamond says:

    Tuesday, July 11, 2006 4:40 AM by Nick Lamb

    > Norman, the phrase I used was "code point ordering of Unicode".

    Mr. Lamb, your messages of July 8 and July 10 are visible on this page for you to see as well as me.  On July 8 you did not say "code point ordering of Unicode".  On July 9 (in the timezone of blogs.msdn.com) I replied to what you wrote on July 8.  During this part of the discussion, your July 10 retroactive rephrasing of your July 8 message had not yet taken effect.  On July 11 I reminded you of what I originally replied to.

    > ISO 10646 provides a mapping from numbers to characters

    It does indeed.  On July 8 it didn’t look like that’s what you were talking about.  Meanwhile ISO 10646 causes some confusion because it preserves compatibility with SOME national codepoint orderings.  In order to create an environment where everyone could see the issues involved and work to overcome the issues, it would have been better to have a (alternate reality here) ISO 10646 whose codepoint ordering was incompatible with every national code page.

  35. Dean Harding says:

    > Dean, it’s the "same thing", but not "exactly the same thing"? Does that mean you don’t "exactly" disagree with me after all?

    My first post didn’t say it was the "same thing"; it said it was mostly the same: "Also, the way surrogates work, you can MOSTLY just process them like combining characters." (emphasis added)

    > Norman, the phrase I used was "code point ordering of Unicode"

    I don’t think it matters anyway. What’s the big deal in preserving the code-point ordering? There’s no reason to ever want to actually SORT by code-point – it’s almost never the linguistically-correct sort order. And if you don’t care about being linguistically-correct (e.g. it’s only used internally or something and not displayed to a user), having non-BMP characters sorted "out-of-order" is nothing to lose sleep over.

    > Now the "character" type in Java is more or less useless. It’s not a byte (which Java already has) and it’s not a string (the minimum unit which can hold a Unicode character); it’s just an arbitrary UTF-16 code unit.

    That’s like saying the "char" type in C/C++ is more-or-less useless because it can’t represent a full Unicode character either.

    > Yes, there are several bugs in search engines. Some of those bugs are fundamental character encoding problems, which I’d argue (without seeing the code) are symptoms of UCS-2 support masquerading as UTF-16.

    I don’t think you understand the concept of combining characters. Combining characters are totally independent of what encoding you’re using, be it UTF-8, UTF-16 or UTF-32. The character ‘Á’ can be represented as either <U+00C1> or as <U+0041 U+0301>. In UTF-8, that would be either 0xC3 0x81 or 0x41 0xCC 0x81. Those two byte sequences are what we call "canonically equivalent". That means, if I enter 0xC3 0x81 into a search box, it should also find the byte sequence 0x41 0xCC 0x81. THAT is the problem that current search engines have: they DON’T find the other normalization forms. It has nothing to do with how the text is encoded.
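    A sketch of the fix, with `nfc_find` as a hypothetical helper name:

```python
import unicodedata

def nfc_find(haystack: str, needle: str) -> int:
    """Search after normalizing both sides to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", haystack).find(
        unicodedata.normalize("NFC", needle))

text = "A\u0301rbol"    # "Árbol" stored in decomposed form
query = "\u00c1rbol"    # the same word typed in precomposed form

print(text.find(query))       # -1: a raw code-point search misses it
print(nfc_find(text, query))  # 0: normalization makes the forms match
```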

  36. Random Reader says:

    Tuesday, July 11, 2006 4:40 AM by Nick Lamb

    > Random, the robustness which you refer to as a "side effect" was a design criterion for the encoding. If you look at earlier variable length character encodings this feature is missing. UTF-16 had the same requirement at the code unit level, because without it seeking is painful, but because transports are /byte/-oriented it’s a less complete fix.

    * Compact variable-width encoding requires the number of code units in the sequence to be explicitly indicated within the sequence itself, lest one be stuck with escape-type schemes to avoid run-ons, or end markers to allow incremental processing.

    * Enabling random access or reverse scanning requires differentiating the first code unit from all others in the sequence.

    Those two requirements result in a decoder having enough information about sequences to detect broken ones, but that’s not a designed-in reliability mechanism.  As you pointed out, UTF-16 has the same invalid sequence detection as a result of the same requirements, and both it and UTF-8 operate on the code unit level.  Neither one cares about the underlying transport, be it bits, bytes, or something else.  If a design goal of UTF-8 was detecting byte-oriented transport errors, it would detect missing _bytes_ (i.e. code units in the ASCII range, or whole sequences).  It doesn’t.
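
    The "detect a lost code unit within a sequence" property is easy to demonstrate for UTF-8 (a Python sketch added for illustration, not part of the original comment): continuation bytes match the bit pattern 10xxxxxx, so a decoder can tell where each sequence starts and the damage stays local.

```python
# Drop one byte from inside a multi-byte UTF-8 sequence; only that one
# character is lost, and decoding resynchronizes at the next code unit.
text = "héllo"
data = text.encode("utf-8")          # b'h\xc3\xa9llo'
corrupted = data[:1] + data[2:]      # drop the lead byte of the 2-byte 'é'
decoded = corrupted.decode("utf-8", errors="replace")
assert decoded == "h\ufffdllo"       # one replacement char, rest intact
```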

    > Now the "character" type in Java is more or less useless. It’s not a byte (which Java already has) and it’s not a string (the minimum unit which can hold a Unicode character) it’s just an arbitrary UTF-16 code unit. Bye bye abstraction.

    I think your argument revolves around a perceived notion that UTF-16 is being treated as a "Unicode character" abstraction, and therefore UTF-16 is worthless.  Well, uhm, no, that’s why we’re talking about an encoding.  UTF-8 platforms are no different.

    When you work on a "Unicode character" abstraction, or a "Unicode glyph" abstraction, the encoding is an irrelevant implementation detail.  You may very well be right that many platforms that attempted to use such abstractions pervasively have failed to maintain them, but that isn’t what we’ve been talking about.

    > Yes, there are several bugs in search engines. Some of those bugs are fundamental character encoding problems, which I’d argue (without seeing the code) are symptoms of UCS-2 support masquerading as UTF-16.

    That would imply they got the Unicode details correct within UCS-2 — combining characters, full case folding, etc — and the problem is merely a case of not handling surrogates.  Do you have examples of such a thing?
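
    The UCS-2-vs-UTF-16 distinction being discussed here comes down to surrogate pairs (a Python sketch added for illustration, not part of the original comment): a UCS-2 implementation treats every 16-bit unit as a character, while UTF-16 must pair surrogates for anything outside the BMP.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies above the BMP, so UTF-16 encodes it
# as a surrogate pair: high surrogate 0xD834, low surrogate 0xDD1E.
clef = "\U0001D11E"
units = clef.encode("utf-16-be")
assert units == b"\xd8\x34\xdd\x1e"  # two 16-bit code units on the wire
assert len(units) // 2 == 2          # a UCS-2 view counts two "characters"...
assert len(clef) == 1                # ...but it is a single code point
```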

    > Ugh, now they have me doing it. "surrogate characters" don’t exist, it should read "surrogate code points" in the last line above.

    One of the most irritating things about Unicode is the terminology.  "Character" used to be so simple :(

  37. Nick Lamb says:

    Random, I wrote that UTF-8 recovers correctly, i.e. the remainder of the text can be decoded, whereas a dropped byte in UTF-16 corrupts the remaining text.

    It seems a bit strange to have an entire datatype devoted to an "irrelevant implementation detail" in a high-level language like Java, doesn’t it? To have tutorials and API families dedicated to such a type…

    You’re right, as I already said, that there are lots of different kinds of Unicode bugs in search engines, but only one of them is relevant to the choice of encoding. Some of Notepad’s Unicode bugs aren’t caused by the use of UTF-16 either…

    And of course characters were always complicated, it’s just that before Unicode people were usually solving only a subset of the problem.

    Norman, I see where your misunderstanding arose now. Of course UTF-8 must encode ISO 10646 as it actually is, and not how you might imagine it could be.

    Dean, you used those two phrases next to one another in a single post. I’m sure you didn’t mean for me to make that contrast, but it’s there.

    The C ‘char’ datatype isn’t useless because it is conveniently byte sized. You can’t store Unicode characters in it, of course, only UTF-8 code units because those are also byte sized. A Unicode character can be stored as a C ‘char *’ string.

    I can’t see why you would think I’m ignorant about combining characters, nor why you seem to believe that normalisation is related to processing of surrogates in UTF-16. On the whole I think you’re very confused about Unicode, and I hope you find out a lot more before writing any code that processes Unicode data.

    The lack of normalisation (which would be easy in the search front end, but trickier for the bulk robots) is just one of many problems in today’s popular search engines. Specifically it’s a problem that’s /not relevant/ to character encodings, unlike the trouble beyond the BMP.

    I don’t think there’s much more to add, and we’ve deviated far off topic.

  38. Random Reader says:

    > Random, I wrote that UTF-8 recovers correctly, i.e. the remainder of the text can be decoded, whereas a dropped byte in UTF-16 corrupts the remaining text.

    For "it recovers correctly from dropped bytes, still a common data corruption today" the implication is that it has features oriented toward use with a particular kind of unreliable byte-oriented transport.  I was pointing out that this implication is not correct.  UTF-8 and UTF-16 both have the "detect a lost code unit that was part of a larger sequence" property at the code unit level, not the transport unit level.  It may seem like I’m nitpicking, but there’s a very large semantic difference between the two.  UTF-8 and UTF-16 decoders have enough information to skip code unit sequences that were made invalid by incorrect truncation or concatenation by code unit processors.  Implying that this property can handle the various corruption scenarios of certain lossy transports is akin to claiming ASCII is robust for lossy bit-oriented transports because bit 8 is always 0.  It’s just not an impression you want to give.

    In other words, for the practical applications of this property, UTF-8 and UTF-16 are on even ground.  Since it’s not a designed-in reliability feature, it relies on specific scenarios with specific decoders — which basically means it should not be considered much of a feature for either encoding.  (I consider the fact that invalid sequences can be detected at all to be very much a feature for reliable programming, though.  It’s always good to know when something is wrong.)
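
    The asymmetry under a byte-oriented transport is worth seeing once (a Python sketch added for illustration, not part of the original comment): because UTF-16's code unit is two bytes, losing a single byte shifts every later code unit out of alignment, which is exactly the "corrupts the remaining text" case above.

```python
# Drop one byte from a UTF-16LE stream: every subsequent 16-bit unit is
# reassembled from the wrong byte pair, so the whole tail decodes as garbage.
text = "hello"
data = text.encode("utf-16-le")      # b'h\x00e\x00l\x00l\x00o\x00'
corrupted = data[1:]                 # drop a single byte
decoded = corrupted.decode("utf-16-le", errors="replace")
assert decoded[0] == "\u6500"        # misaligned units yield CJK characters
assert "ello" not in decoded         # none of the original tail survives
```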

    > It seems a bit strange to have an entire datatype devoted to an "irrelevant implementation detail" in a high level language like Java doesn’t it? To have tutorials and API families dedicated to such a type…

    If you’ll recall, I originally stated that UTF-16 is the most appropriate encoding for general-purpose processing, so I don’t find it strange at all that there’s a datatype dedicated to that.

    You’re also twisting my words a bit; I said the encoding is an irrelevant implementation detail when you’re working with a "Unicode character" abstraction.  I didn’t make any claims about .NET, Java, or ECMAScript applying such an abstraction to their basic datatypes.  I simply noted that they are platforms that have adopted UTF-16 natively.

    The situation isn’t any different on UTF-8 platforms, as you note yourself with the comment on C’s char datatype.  (As a side note, the default signed char is usually inconvenient for UTF-8 processing; I often see a typedef used for those cases in C.  Most platforms with a datatype designed for UTF processing have an unsigned default.)

    In your response to Dean,

    > I can’t see why […] you seem to believe that normalisation is related to processing of surrogates in UTF-16.

    He was referring to the fact that much of the parsing logic for handling surrogates is already required by the need to handle combining characters.  It’s mentioned in Unicode 4.0 section 5.4 under "Strategies for Surrogate Pair Support".

Comments are closed.