Strings, immutability and persistence


Today’s post is based on a question from StackOverflow; I liked it so much I figured hey, let’s just blog it today.

When you look at a string in C#, it looks to you like a collection of characters, end of story. But of course, behind the scenes there is a data structure in memory somewhere to implement that collection of characters. In the .NET CLR, strings are laid out in memory pretty much the same way that BSTRs were implemented in OLE Automation: as a word-aligned memory buffer consisting of a four-byte integer giving the length of the string, followed by the characters of the string in two-byte chunks of UTF-16 data, followed by two zero bytes. (Recall that BSTR originally stood for “BASIC string”, because the OLE Automation team was actually part of the Visual Basic team; this is the format that Visual Basic used.)

Using this as the internal implementation of strings has a number of benefits. For example: it only requires one heap allocation per string. The length can be determined without counting characters. The string can contain embedded zero bytes, unlike formats that use zero bytes as end-of-string markers. If you disregard surrogate pairs then the nth individual character can be fetched in O(1) time, unlike in variable-width encodings like UTF-8. If the string is pinned in place and contains no zero characters then the address of the string data can be passed unmodified to unmanaged code that takes a WCHAR*. And so on.
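
To make that concrete, here is a small sketch in C# (an illustration only, compiled with /unsafe; this is not the CLR's internal code) showing two of those benefits: the length is a stored field rather than something that must be counted, and a pinned string can be handed directly to native code that expects a WCHAR*.

    using System;

    static class PinDemo
    {
        static unsafe void Main()
        {
            string s = "hello";

            // O(1): Length reads the stored length field; no characters are counted.
            Console.WriteLine(s.Length);

            // Pinning the string yields a pointer to its UTF-16 character data.
            fixed (char* p = s)
            {
                // Per the layout described above, the characters are followed by two
                // zero bytes, so p can be passed to APIs expecting a null-terminated WCHAR*.
                Console.WriteLine(p[s.Length] == '\0');
            }
        }
    }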

Strings are immutable in .NET, which also has many benefits. As I’ve discussed many times, immutable data types are easier to reason about, are threadsafe, and are more secure. (*)

One of the benefits of the immutable data types I’ve talked about here previously is that they are not just immutable, they are also “persistent”. By “persistent”, I mean an immutable data type such that common operations on that type (like adding a new item to a queue, or removing an item from a tree) can re-use most or all of the memory of an existing data structure. Since it is all immutable, you can re-use its parts without worrying about them changing on you.
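
To make “persistent” concrete, here is a minimal sketch of a persistent stack (a hypothetical type, not anything in the framework) in which pushing re-uses the entire existing stack, unchanged, as the tail of the new one:

    // Hypothetical persistent stack: pushing allocates one node and copies nothing.
    sealed class PersistentStack<T>
    {
        public static readonly PersistentStack<T> Empty = new PersistentStack<T>(default(T), null);

        public T Head { get; private set; }
        public PersistentStack<T> Tail { get; private set; }

        private PersistentStack(T head, PersistentStack<T> tail)
        {
            Head = head;
            Tail = tail;
        }

        // The existing stack becomes the tail of the new one; nothing is copied or mutated.
        public PersistentStack<T> Push(T value)
        {
            return new PersistentStack<T>(value, this);
        }

        // "Popping" just returns the tail, which already exists.
        public PersistentStack<T> Pop()
        {
            return Tail;
        }
    }

No matter how large the stack grows, Push allocates exactly one new node and Pop allocates nothing; that wholesale re-use of existing structure is what makes the type persistent rather than merely immutable.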

Strings in this format are immutable, but they are not persistent, thanks to that pesky single-buffer-with-both-a-prefix-and-suffix layout. Concatenating two strings does not re-use the content of either of the original strings; it creates a new string and copies the two strings into the new string. Taking the substring of a string does not re-use the content of the original string. Again, it just makes a new string of the right size and makes a full copy of the data.

This means that operations on strings such as taking a substring are O(n) in the size of the substring. Concatenations are O(n+m) in the sizes of the two source strings. These operations could be O(1) or O(lg n) if strings were persistent. For example, we could treat strings as catenable deques of chars; there are ways to do very efficient concatenations of catenable deques. (**) Or, we could say that there are two kinds of strings: “regular” strings and “ropes”, which are binary trees of strings. Concatenating two strings just allocates a new rope. It becomes a bit more expensive to find the nth char of a rope, and you can’t pass them unmodified to unmanaged code, but you always avoid a potentially lengthy copy. Similarly with substring operations: taking the substring of a string could simply allocate a wrapper around the original string that keeps an offset and a length. No copying required.
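
Purely as an illustration of those two alternatives (hypothetical types; System.String does not work this way), a rope node and a substring wrapper might look something like this:

    // A rope is either a leaf holding real character data or a concatenation node.
    abstract class Rope
    {
        public abstract int Length { get; }
        public abstract char CharAt(int index);

        // O(1) concatenation: allocate one new node over the two existing ropes.
        public static Rope Concat(Rope left, Rope right)
        {
            return new ConcatNode(left, right);
        }
    }

    sealed class LeafNode : Rope
    {
        private readonly string text;
        public LeafNode(string text) { this.text = text; }
        public override int Length { get { return text.Length; } }
        public override char CharAt(int index) { return text[index]; }
    }

    sealed class ConcatNode : Rope
    {
        private readonly Rope left, right;
        public ConcatNode(Rope left, Rope right) { this.left = left; this.right = right; }
        public override int Length { get { return left.Length + right.Length; } }

        // Finding the nth char is now a tree walk instead of a single array access.
        public override char CharAt(int index)
        {
            return index < left.Length ? left.CharAt(index) : right.CharAt(index - left.Length);
        }
    }

    // A substring "wrapper": just an offset and a length into the original string; no copying.
    sealed class SubstringView
    {
        private readonly string original;
        private readonly int offset, length;

        public SubstringView(string original, int offset, int length)
        {
            this.original = original;
            this.offset = offset;
            this.length = length;
        }

        public int Length { get { return length; } }
        public char CharAt(int index) { return original[offset + index]; }
    }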

Why don’t we do “persistent” optimizations like this, since we have an immutable data structure already?

The motivation for doing so is based on the incorrect idea that O(1) is always better than O(lg n), which is always better than O(n). The asymptotic order of an operation is only relevant if, under realistic conditions, the size of the problem actually becomes large. If n stays small then every problem is O(1)!

That is the situation we are actually in. Very few developers routinely deal with strings of more than a few hundred or a few thousand characters. If you are dealing with larger strings and doing a lot of concatenation, you can use a StringBuilder; if you’re only doing a small number of concatenations, it is very fast to do the copy. If you’re taking substrings, odds are very good that the substrings are going to be a few dozen characters out of a thousand character string, not a few hundred thousand characters out of a million-character string. Most line-of-business programmers are not in the business of chopping up strings containing million-character DNA strands; they’re in the business of parsing the name of a book out of an XML document or an address out of a comma-separated-value text file. The memory operations for making a relatively small string buffer and copying a few dozen bytes out of it are insanely fast, so fast that there is really little point in optimizing it further.
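
For what it is worth, here is a minimal sketch of that StringBuilder pattern (a hypothetical helper method); the equivalent loop using += would re-copy the accumulated string on every iteration, which is quadratic overall:

    using System.Collections.Generic;
    using System.Text;

    static class Joiner
    {
        static string JoinLines(IEnumerable<string> lines)
        {
            // Appends go into a growable buffer; only ToString() makes the final copy.
            StringBuilder builder = new StringBuilder();
            foreach (string line in lines)
            {
                builder.Append(line);
                builder.Append('\n');
            }
            return builder.ToString();
        }
    }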

Unless you have really good reason to believe that the “optimization” is a win, it probably isn’t. I spent a summer about twelve years ago rewriting the VBScript internal string manipulations to use “ropes” — to build up a tree on every string concatenation, and only turn it back into a “real” BSTR when necessary. Our research on real-world code showed that the “optimization” was no optimization at all; most of the time strings were being turned from a rope back into a BSTR after the second or third concatenation, and the added expense of allocating all the tree structures was actually slower than just making a copy of every BSTR every time. (***)

Now, some people are actually manipulating relatively large strings and inserting and removing substrings on a regular basis. My team does that all the time; we write code editors that have to be able to deal with enormous files being rapidly edited. Because operations on strings are O(n) and n is actually big, we do not use big strings; rather, we have immutable persistent data structures that represent edits to the document, and those edits are then fed into an engine that maintains an immutable, persistent model of the lexical and syntactic analysis of the program. We want to be able to re-use as much of the previously-seen text and analysis as possible, to ensure high performance in both memory size and speed.

That was some tricky code to write, and what works for us almost certainly does not work for the people doing DNA strings, or any other big-string-with-lots-of-substrings problem that you care to name. Rather than optimize CLR strings for narrow, rare cases, it’s better to just keep it simple, optimize for the common case, and rely on the hardware taking care of making blindingly fast copies of short strings.

————-

(*) Consider an attack where partially-trusted hostile code passes a string containing a “safe” path to File.Open. The security checker verifies that the path is safe for partially-trusted code to open. And then the partially-trusted code mutates the string on another thread before the file is actually opened! If they get the timing right then the security check passes and the wrong file is opened. If strings are immutable, this kind of thing does not happen.
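
In sketch form, with a hypothetical trusted helper (and remembering that strings cannot actually be mutated, so this race is impossible in practice):

    using System.IO;
    using System.Security.Permissions;

    static class TrustedLibrary
    {
        static void OpenOnBehalfOfCaller(string path)
        {
            // Time of check: the demand inspects the *current* contents of path.
            new FileIOPermission(FileIOPermissionAccess.Read, path).Demand();

            // Window: if strings were mutable, hostile partially-trusted code on another
            // thread could overwrite the buffer here, turning the "safe" path into a
            // forbidden one.

            // Time of use: the open operates on whatever the string contains now.
            using (FileStream stream = File.Open(path, FileMode.Open))
            {
                // ...
            }
        }
    }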

(**) My article on deques does not show how to make them efficiently catenable; see the referenced papers for details.

(***) If you happen to be one of the people with access to the VBScript source code, I think all that gear is still in there, but turned off. Search for FANCY_STRING in the sources.


Comments (32)

  1. Karellen says:

    "If you disregard surrogate pairs then the nth individual character can be fetched in O(1) time, unlike in variable-width encodings like UTF-8."

    So? If you disregard multibyte sequences in UTF-8, then the nth individual character can be fetched in O(1) time also.

    Sure, but it is far, far more likely for there to be a multi-byte character in a typical UTF-8 string than for there to be a surrogate pair in a typical UTF-16 string. Again, we are optimizing for the common case here. Strings containing modern Japanese, Greek, and so on, are very common; strings containing cuneiform or hieroglyphics are very uncommon. — Eric

    OTOH, if you disregard surrogate pairs in UTF-16, or multibyte sequences in UTF-8, your code is broken.

  2. Stuart says:

    @Karellen: You could perhaps put a flag somewhere (after the double-zero terminator, maybe, or at a negative offset before the length?) that indicates whether or not the string contains any surrogate pairs. Or possibly even an integer storing the first index of such a pair. Then you could get O(1) time access for all characters prior to the first such sequence…

    I have no idea whether the CLR does anything like this.

  3. pete.d says:

    "OTOH, if you disregard surrogate pairs in UTF-16, or multibyte sequences in UTF-8, your code is broken."

    Maybe, maybe not. Surrogate pairs aren't really the same as UTF-8 multi-byte encoding points.

    Think about the usual use cases for using a character index into a string. For example, one might search the string for a specific character or sequence of characters, and then create a new substring based on the resulting index. With a UTF-8 string, to implement the substring logic, you have to scan the string all over again just to translate the character index to a position in the actual data structure's storage. With UTF-16, the presence of surrogate pairs is irrelevant; you just treat each half of the pair as its own character, and the logic works fine.

    Another example might be a line-breaking algorithm that chooses an arbitrary index into the string and then tries to work backwards to find a suitable place to actually break the string (e.g. whitespace). Again, surrogate pairs aren't a problem because it's trivial to identify a pair and whether you're looking at the leading or trailing value (indeed, you're never going to mistake part of a surrogate pair for some other valid encoding point), thus ensuring that the code doesn't inappropriately break the pair apart.

    Fact is, for the purpose of indexing into a string, disregarding surrogate pairs often makes perfect sense with a UTF-16 string, even as that's much less frequently the case when dealing with UTF-8 strings (and in fact, in limited scenarios where one is dealing with _byte_ indexes rather than character indexes, even when working in UTF-8 it's not always a problem).

    You cannot a priori describe code as "broken" unless you know what the code is supposed to do and how it's implemented. Even if it does index into a UTF-16 string using the convenient O(1) time method. There are lots of examples of code that disregards surrogate pairs without being broken.

  4. CarlD says:

    "Now, some people are actually manipulating relatively large strings and inserting and removing substrings on a regular basis.."

    I am the author and maintainer of a document editor for a file type that's used to pass electronic banking transactions between banks and network operators (ACH, for those who are paying attention).  It's not uncommon for these files to be as large as 200Mb in size – well over 1 million lines of text and far larger than the typical documents you'll encounter in a code editor.  Despite the allure of fancy data structures like ropes, my editor is implemented(*) as a view on top of the simplest of all data structures:  a giant array of bytes (the contents are strictly ASCII – or occasionally EBCDIC.  Yes, it's still around).   Inserts, deletes, moves are all implemented in the most straightforward way: by copying large chunks of memory around.  Even with that, today's CPUs are so fast that I could never justify the work to build, test and maintain a fancy rope-like backing store: my simple array is good enough for any user of the program.  Even full undo/redo of whole-file operations that modify a million or more lines of text are fast.

    Much as I like fancy data structures like ropes or dequeues, I suspect I'll never again use one – it's just not needed unless you're working on something that's a very serious outlier.

    (*) Implemented via an abstract base class.  I fully expected to have to retrofit a rope-like data structure once I got to very large files, so I built in a dependency barrier to make the retrofit easy.  Fortunately, Moore's law has outpaced e-commerce growth by a good margin since I first started on the app back in 2004.

  5. Apollonius says:

    "Again, surrogate pairs aren't a problem because it's trivial to identify a pair and whether you're looking at the leading or trailing value (indeed, you're never going to mistake part of a surrogate pair for some other valid encoding point), thus ensuring that the code doesn't inappropriately break the pair apart."

    But in what way is this different from UTF-8? UTF-8 suffix bytes always have the highest bit set and the second highest bit unset, so you can't mistake them for something else, either. The only encodings I know of in which the problem that you describe can occur are the old MBCS encodings.

  6. Simon Buchan says:

    To try and (fail to) head off the inevitable Unicode argument over which UTF is best – almost any algorithm dealing with Unicode will be wrong for some input data. You can't safely line-break arbitrary text without huge lookup tables for word endings in non-spaced languages, and you can't delete the single character before the cursor without doing normalization, e.g. in the case of U+0065 LATIN SMALL LETTER E followed by U+02CA MODIFIER LETTER ACUTE ACCENT, in order to be consistent with U+00E9 LATIN SMALL LETTER E WITH ACUTE. The distinctions between code units and code points pale in comparison to the many-to-many mapping between code points and font glyphs (needed just for basic page layout!), not to mention there's no real definition of a 'character'.

    To be more on topic: is it more likely that a cache-oriented string implementation could give performance benefits? Say, a char array for sizes < one page, but turning into a substring array when bigger – hopefully with the substrings being close to a page in size?

    struct STR_PAGED_SLICE {
        wchar_t* Start;
        wchar_t* End;
    };

    struct STR_PAGED {
        int Size;
        union {
            wchar_t Chars[]; // offsetof(Chars[Size]) <= PAGE_SIZE
            struct { // otherwise
                int Count;
                STR_PAGED_SLICE Items[];
            } Slices;
        };
    };

  7. Michael Cederberg says:

    Can you talk a bit about string interning? I would like to know how efficient it is, especially in a high-performance, highly concurrent system.

  8. Jon Skeet says:

    What interests me most about the String design in .NET is how it's very different to the String design in Java which *does* use a persistent approach – each String has a reference to a char[], along with a start index and a length. I don't know how much JVMs "know" about strings – clearly the CLR needs to know rather more about System.String as it's the only type I'm aware of other than arrays where the size of an object isn't fully determined by the type, on a particular CLR instance. (Obviously 64-bit vs 32-bit makes a difference in many types.)

    The other interesting decision Java took was to cache the hash code of a string on first call. I *suspect* that's another dodgy optimization – I would love to know how often it saves time vs the memory required for another integer in *every* string in the system.

  9. jader3rd says:

    I am so glad that the decision was made to keep things simple for the common case. Trying to solve for 100% of the cases instead of the 90% always results in problems and people developing an expertise in something which shouldn't require expertise.

  10. Mark says:

    Aren't strings in .NET technically mutable with unsafe code? I don't think it would ever be a good idea to actually do it, but it is possible, no?

    Every single bit in the user space of the process, including all of the code that implements the runtime, is mutable in unsafe code. That includes all the strings. — Eric

  11. @Jon:

    >> The other interesting decision Java took was to cache the hash code of a string on first call. I *suspect* that's another dodgy optimization – I would love to know how often it saves time vs the memory required for another integer in *every* string in the system.

    Actually, I've often thought that caching the hash code would be a very nice thing to have, but then I envisioned it being placed into the syncblk – which is already there, and is already a dump of various unrelated "nice to have for some objects, but not worth paying the memory price for all of them" kind of data – which is precisely what a hash code is.

  12. pete.d says:

    "But in what way is this different from UTF-8? UTF-8 suffix bytes always have the highest bit set and the second highest bit unset, so you can't mistake them for something else, either."

    Okay, my oversight. You're right.

    My previous example still holds, and there are others. The fact remains that surrogate pairs aren't necessarily going to be as much of a problem as multi-byte UTF-8 encodings.

  13. Gabe says:

    Can anybody explain why embedded zero characters are only fully supported in the 32-bit CLR? Since the string hashing algorithm on the 64-bit CLR stops when it hits '\0', all strings that start with '\0' have the same hash code as the empty string.

    As an example, executing

    new HashSet<string>(Directory.EnumerateFileSystemEntries(@"Program Files (x86)", "*", SearchOption.AllDirectories).Select(s => s.Replace('\\', '\0')))

    is a couple orders of magnitude faster on my machine when targeting x86 than as x64. Changing the HashSet to a SortedSet makes them run equally fast.

    Incidentally, the 64-bit hash algorithm seems to be about 50% slower than the 32-bit version on my machine, too.

  14. Random832 says:

    "My previous example still holds, and there are others. " – how does your example not apply equally to UTF-8?

    Anything that you can legitimately ignore surrogate pairs for, you can legitimately ignore UTF-8 for.

    "With a UTF-8 string, to implement the substring logic, you have to scan the string all over again just to translate the character index to a position in the actual data structure's storage." is a straw man – if you can gain efficiency without losing correctness with a "character index" for UTF-16 that ignores surrogate pairs, you can do the same in UTF-8 with a byte index.

  15. pete.d says:

    "if you can gain efficiency without losing correctness with a "character index" for UTF-16 that ignores surrogate pairs, you can do the same in UTF-8 with a byte index."

    Not if you want the API to remain character-based.

    There's a difference between designing for absolute efficiency, and designing for usability. You may feel the latter is pointless, but thankfully not all people do (including the designers of .NET).

  16. oak says:

    If I recall correctly, Borland Delphi's Object Pascal strings also worked the same way.

  17. Stilgar says:

    I never had to deal with characters beyond 2 bytes but it seems to me from the documentation that substring and indexing actually work with characters as in the 2 byte char type and not in real Unicode code points. Therefore if you really wanted to handle surrogate pairs as a single character you need to do it yourself. Is that true?

  18. DRBlaise says:

    pete.d – "Not if you want the API to remain character-based."

    Now your arguments are just getting silly. The traditional CHAR is only one byte.  The 2-byte WCHAR was introduced later.  Personally, I like UTF-8 better as it is backward compatible with ASCII and doesn't change the definition of character.  Also, since I have only worked for American companies, all the business text I have worked with has always been ASCII and I can assume no multi-byte characters.  UTF-8 is also the standard for the web.

    Eric has a good argument for UTF-16, in that it is currently rare to have surrogate pairs, but 20 years ago the same thing could be said for UTF-8.  What about in 20 more years?  An argument can be made that code should just be written to handle multi-character code-points now, and in that case I think UTF-8 wins bits down.

  20. Jeroen Mostert says:

    Caching the hash value of a string on creation definitely sounds like a dodgy optimization to me. This disregards the fact that 1) most strings are not keys for hashtables (well, maybe in your app they are, but you probably underestimate the number of strings generated) and 2) keys stored in the hashtable are hashed only once, and keys used for lookup are very likely new strings from external sources, which, again, are typically hashed only once (because they're garbage collected soon afterwards).

    Caching hash values makes sense in a lot of cases where interning also makes sense. Interned strings should have their hash values cached. For other strings, it's probably a waste of storage. This is significant, since there are plenty of applications where strings make up the bulk of the heap.

  21. Gabe says:

    Jeroen: The hashcode for a string would be cached on the first call to GetHashCode, not on the creation of a string. And a hashtable's key's hashcode is recomputed every time the hashtable has to grow or somebody iterating over the keys needs to lookup a value.

    No, the real reason it's unnecessary to cache hashcodes for strings is that hashing isn't slow enough to warrant it. My laptop can compute the hashcode of a string with 1M chars in under 1 millisecond, meaning most strings hash in nanoseconds. If it usually takes 10ns to hash a string, why bother caching the result?

  22. Bill Sorensen says:

    "Concatenation two strings" – I'm assuming you meant "Concatenation of two strings" or "Concatenating two strings". FYI.

  23. pete.d says:

    "Now your arguments are just getting silly. The traditional CHAR is only one byte."

    Calling my posts "silly" doesn't help your position any.  I certainly disagree with your unfounded mischaracterization.

    In any case, what was "traditional" isn't relevant.  .NET strings are an API unto themselves, and are Unicode character based.

    We aren't talking about some random API elsewhere.  We're talking about .NET.

  24. >> In any case, what was "traditional" isn't relevant.  .NET strings are an API unto themselves, and are Unicode character based

    But they aren't – they are 16-bit-char based, and 16-bit char is not a Unicode character.

    He is more or less correct in that the only way in which UTF-16 is better than UTF-8 is that it makes it easier to justify ignoring proper handling of all code points, because most of those that you'll see in practice will not require surrogate pairs to encode, and you can just assume 1 char = 1 Unicode character and pretend that this works for everyone. It's more about hiding the problem by solving the most prominent manifestations and sweeping the rest under the carpet, rather than dealing with it in its entirety.

    That's in the perfect world, though. In practice, when .NET appeared, Win32 had been using 16-bit chars in all its Unicode-enabled APIs for, what, 7 years already? As it is, you can pass .NET strings to Win32 functions with no conversion at all – the P/Invoke marshaler just pins the string and passes the direct pointer to the buffer. That's fast. Returning strings can be handled similarly with StringBuilder, as well. And remember that e.g. WinForms did a _lot_ of P/Invoke!

    Even setting performance aside, there's also the issue of learning curve and familiarity for users of existing tools. Win32 already had it that way, and it was not alone – Java is another language which does it this way, and Qt is a framework with a similar choice. If I remember correctly, Cocoa NSString is UTF-16 encoded, as well. The only UI framework that I can think of that uses UTF-8 for strings is Gtk+ – pretty much everyone else is on UTF-16.

    It's just one of those things that became a de facto standard back in the day when all we had was BMP and UCS2 was all you needed.  For better or worse, it is what we have, and changing it now is a fairly radical, breaking change – while benefits of such a change would be very minor in comparison.

  25. ajdotnet says:

    String itself aside, there are some areas related to strings that might merit some optimization, even if this is still only relevant in extreme cases (as long as it doesn't come with a penalty). See string.Split: ajdotnet.wordpress.com/…/the-cost-of-string-split

    Another example is StringBuilder with long strings (memory going to the LOH) and naive usage (i.e. not properly initializing it with the required size). Some way of at least telling the developer about the anti-pattern (e.g. some trace, or a counter with the number of reallocations) could help.

  26. Gabe says:

    AlexanderJ: The implementation of strings is a trade-off. If you use the CLR's implementation of strings, then performing a Split requires creating smaller strings that just about add up in size to your original large string. The bonus of that, however, is that the larger string can then be garbage collected and only the smaller strings that you actually use need to stick around.

    If you look at an implementation of strings like Java's where strings can reference other strings, then doing a Split is relatively cheap — you just create an array of references to the original string with offsets and lengths of the substrings. The unfortunate problem with that, however, is that the original large string cannot be garbage collected until every single one of those smaller strings is no longer referenced. Every call to Split is a potential memory leak!

    And what's the anti-pattern with StringBuilder? A StringBuilder is actually a linked list, where they try very hard to keep each item in the list out of the large object heap.

  27. Alex says:

    Lua's got quite a nice string system – every string is hashed on creation, and interned (importantly with a weak reference – the CLR intern system really fails here).

    Raw equality is a simple pointer comparison then, and hash lookups of the interned strings extremely efficient (the hash code is stored, the equality is a pointer compare).

    Whenever I'm going dictionary heavy… I'm kind of saddened that the CLR doesn't do the same. Especially as I rarely use strings for data manipulation etc.. prefer to leave that for StringBuilders.

  28. Random832 says:

    ""if you can gain efficiency without losing correctness with a "character index" for UTF-16 that ignores surrogate pairs, you can do the same in UTF-8 with a byte index." Not if you want the API to remain character-based."

    If you're ignoring surrogate pairs, your UTF-16 api _isn't_ character-based. You're ALREADY arguing in favor of a non-character-based API for UTF-16. So why are you insisting that the UTF-8 one has to be character-based?

  29. ajdotnet says:

    @Gabe: The fact that string.Split produces new strings for the parts was not my point. My point is that string.Split internally wastes twice as much memory as the original string needs.

    Regarding StringBuilder you're right, they've changed the implementation in .NET4 (until 3.5 StringBuilder maintained one buffer that was reallocated on demand). My fault.

  30. Here's an interesting fact: SyncBlk for CLR objects _already_ has a hashCode field. It's used by stock (reference-identity) Object.GetHashCode. This begs to be reused for String.GetHashCode. In fact, if CLR ever gets the notion of immutable objects, it would be nice to universally apply this optimization to GetHashCode for all of them.

  31. Michael Starberg says:

    oak recalled that Delphi strings worked the same.

    But if I remember correctly, the Delphi string looked like this:

    [refcount][capacity][length][0][1][2][3][4][5][6]…[N]

    Now it has been many moons since I had to subtract 3 + 4 from a Delphi string to get its capacity, and luckily it is still 32-bit on x64. *phew* And aligning so you could get an 8-bit 255 length from pointer[0] is not that elegant. Delphi having 255-length strings was later known as the big waste of time theory

    But why do strings always have to be immutable and non-persistent?

    I work mostly with 'de interwebz', so thanks to IIS and ASP.NET, thread safety is always an issue. Especially when you have to cache stuff. But for just one request/response I am allocating very, very many small strings that get added to a stream.

    refCount is out when we have a garbage collector, but a memory pool keeping strings with a capacity would be good? I manually coded a 'stack/pool' of 'buffers' with a linear search for a buffer of the right size in Delphi, where their strings were not enough, and it worked with blazing speed.

    Now this is Fabulous Adventures in Coding, C# and CIL. But I wonder how the runtime, the actual runtime, deals with this? Surely tricks like this are used even if not in the specs? I wonder if Mono on a Mac behaves differently than on Windows 2008 Server x64. Does anyone know more about this?

    On a sidenote; I know this is not microsoft connect. But most linq immediate methods like .ToList() .ToArray() and .ToDictionary() are so missing an overload. Initial capacity.

    var customers = SomeQuery.ToList(128); // oh, there about a 100 of those. Peanuts

    var allOrderItems = someOrderItemQuery().ToArray(100000000); // oh, looks like heavy duty

    On the other hand, I have yet to see my work go bad because of memory allocations. When I find bottlenecks, it is often more because of a brainfart than from string allocations.

  32. Mahdi says:

    Excellent article!

    But one more question: why is the string indexer read-only? What would happen if it were possible to change a string object's characters through the indexer?

    Like this:

    string mystr="hello";

    mystr[0]='w';