Psychic debugging: Why can’t StreamReader read apostrophes from a text file?


As is customary, the first day of CLR Week is a warm-up. Actually, today's question is a BCL question, not a CLR question, but only the nitpickers will bother to notice.

Can somebody explain why StreamReader can’t read apostrophes? I have a text file, and I read from it the way you would expect:

StreamReader sr = new StreamReader("myfile.txt");
Console.WriteLine(sr.ReadToEnd());
sr.Close();

I expect this to print the contents of the file to the console, and it does—almost. Everything looks great except that all the apostrophes are gone!

You don't have to have very strong psychic powers to figure this one out.

Here's a hint: In some versions of this question, the problem is with accented letters.

Your first psychic conclusion is that the text file is probably an ANSI text file. But StreamReader defaults to UTF-8, not ANSI. One version of this question actually came right out and asked, "Why can't StreamReader read apostrophes from my ANSI text file?" The alternate version of the question already contains a false hidden assumption: StreamReader can't read apostrophes from an ANSI text file because StreamReader (by default) doesn't read ANSI text files at all!

But that shouldn't be a factor, since the apostrophe is encoded the same in ANSI and UTF-8, right?

That's your second clue. Only the apostrophe is affected. What's so special about the apostrophe? (The bonus hint should tip you off: What's so special about accented letters? What property do they share with the apostrophe?)

There are apostrophes and there are apostrophes, and it's those "weird" apostrophes that are the issue here. Code points U+2018 (‘) and U+2019 (’) occupy positions 0x91 and 0x92, respectively, in code page 1252, and both of those byte values are illegal lead bytes in the UTF-8 encoding. And the default behavior for the System.Text.UTF8Encoding class is to ignore invalid byte sequences. Note that StreamReader does not raise an exception when incorrectly-encoded text is encountered. It just ignores the bad byte and continues as best it can, following Burak's advice.

Result: StreamReader appears to ignore apostrophes and accented letters.
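
If you want to see the effect without a file in the way, here is a minimal sketch that feeds the bytes of "It’s" (as code page 1252 would store them) to two UTF-8 decoders. The class and method names are made up for illustration. One caveat: as far as I know, current .NET runtimes substitute U+FFFD for the bad byte rather than dropping it outright, so what you see depends on the runtime. Either way, the apostrophe does not survive.

```csharp
using System;
using System.Text;

public static class ApostropheDemo
{
    // "It’s" as stored in code page 1252: the curly apostrophe is the single byte 0x92.
    public static readonly byte[] AnsiBytes = { 0x49, 0x74, 0x92, 0x73 };

    // Decode with the lenient UTF-8 encoding (no exception on bad bytes).
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Report whether the bytes are well-formed UTF-8, using a decoder
    // constructed with throwOnInvalidBytes set to true.
    public static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Print the code points the lenient decoder produced.
        foreach (char c in DecodeLenient(AnsiBytes))
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();

        Console.WriteLine(IsValidUtf8(AnsiBytes)
            ? "strict decoder accepted the bytes"
            : "strict decoder rejected the bytes");
    }
}
```

The strict decoder is the one to reach for if you would rather hear about the mismatch than silently lose characters.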

There are therefore multiple issues here. First, you may want to look at why your ANSI text file is using those weird apostrophes. Maybe it's intentional, but I suspect it isn't. Second, if you're going to be reading ANSI text, you can't use a default StreamReader, since a default StreamReader doesn't read ANSI text. You need to set the encoding to System.Text.Encoding.Default if you want to read ANSI text. And third, why are you using ANSI text in the first place? ANSI text files are not universally transportable, since the ANSI code page changes from system to system. Shouldn't you be using UTF-8 text files in the first place?

At any rate, the solution is to decide on an encoding and to specify that encoding when creating the StreamReader.
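
A sketch of what that might look like, assuming you have decided the file really is code page 1252. The helper names and the temporary sample file are hypothetical. I name the code page explicitly instead of relying on Encoding.Default because, to my understanding, Encoding.Default follows the system ANSI code page only on .NET Framework, and on .NET Core and later it is UTF-8. Also on those newer runtimes, code page 1252 is available only after registering CodePagesEncodingProvider; on .NET Framework that line is unnecessary and may need the System.Text.Encoding.CodePages package to compile.

```csharp
using System;
using System.IO;
using System.Text;

public static class ExplicitEncodingDemo
{
    // Obtain the code page 1252 encoding. The registration call is needed on
    // .NET Core and later; calling it more than once is harmless.
    public static Encoding GetWindows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Read a whole text file with the encoding the caller has decided on.
    public static string ReadAllText(string path, Encoding encoding)
    {
        // Passing false disables byte-order-mark detection,
        // so the stated encoding is always the one used.
        using (var reader = new StreamReader(path, encoding, false))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Hypothetical sample file: "It’s" with the apostrophe stored as byte 0x92.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0x49, 0x74, 0x92, 0x73 });
        try
        {
            string text = ReadAllText(path, GetWindows1252());
            // Print code points rather than the text itself, since the
            // console's own code page may not be able to show U+2019.
            foreach (char c in text)
            {
                Console.Write("U+{0:X4} ", (int)c);
            }
            Console.WriteLine();
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

If the decision goes the other way and the file is supposed to be UTF-8, passing new UTF8Encoding(false, true) to the same helper makes a mis-encoded file fail loudly instead of quietly losing characters.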

This exercise is just another variation on Keep your eye on the code page.

Comments (24)
  1. mikeb says:

    So, to summarize, the default encoding for a StreamReader is not Encoding.Default.

  2. Garry Trinder says:

    > First, you may want to look at why your ANSI text file is using those weird apostrophes.

    OK, using my psychic debugging abilities, I’d say it was probably originally written using MS Word.

  3. Gav says:

    Thanks for fixing a bug that was confusing me! Or at least pointing me in the right direction… I was receiving ASCII character streams over a network (it’s an old protocol) and outputting to a rich text box.

    Unfortunately the streams contained user input which included accented characters and I was using ASCIIEncoding.Default.GetString() on them, which gave me a nice long string of nonsense. Changing that to Encoding.UTF8.GetString() has fixed it completely :)

  4. nathan_works says:

    Yay for character map to help me out, since I was wondering what a normal apostrophe is.. Turns out, it’s 0x0027 (I think). Looking up the ones Raymond gave lists them as "(left|right) single quotation marks," not exactly apostrophes.

    (And I think James might be onto the real culprit here, too..)

  5. Dan says:

    it’s only 0x0027 in UTF-16, IIRC.  UTF-8 and ANSI is 0x27.

  6. Mark says:

    Look out… the pedants are back.

  7. Jonathan says:

    Definitely Word, or Outlook with Word as editor. It is quite difficult to completely disable this behavior as well – even if you disable "smart quotes", autocorrect auto-text replacements have them too (for example dont->don’t), so you have to go through all of them and change as well. And there are a lot.

    I also noticed that Raymond’s writing is devoid of these weird quotes (it’s, "weird"), while the quote from the customer has them (can’t, does—almost).

    And on the same note, more than once I’ve caught these chars in interpreted text, such as "smart-quotes" for parameters with space, and em-dashes for command-line options. Presumably these commandline samples were copied into Word and mutilated there. Note that the difference is invisible in the default console font.

  8. jondr says:

    There are beaucoup academics and wanna-bes who intentionally use ` and “ to begin single and double quotes, and end with ‘ and ”.  This is so weird. It is not only StreamReader that gets baffled at this.

    (Maybe it is the AR in them that really must want the slants on the opening quotes. Or they want to pretend to "have been there" before the shift key was invented.)

  9. Poochner says:

    jondr: I think that is related to TeX / LaTeX markup.  Some journals require submissions to be in some dialect of this (the American Mathematical Society publications, do IIRC).  Yeah, it seems odd, but it’s an easy way to express typesetting information using only a standard US keyboard.

  10. Karellen says:

    mikeb > The default encoding for a StreamReader is not the *system* default. It is the *internet* default. (UTF-8 has been adopted as IETF STD 63[0])

    To summarise, the default system encoding is not the default internet encoding.

    Unfortunately, it is impossible to make the default system encoding the default internet encoding, as MS doesn’t support UTF-8 as an "ANSI"/"OEM" multibyte codepage, and has no plans to try to. :(

    (At least, according to public statements made by some of its prominent bloggers, it has no plans to try to.)

    [0] http://www.ietf.org/rfc/rfc3629.txt

  11. Ulric says:

    it’s only 0x0027 in UTF-16, IIRC.  UTF-8 and ANSI is 0x27.

    Am I laughing inappropriately at this comment?

  12. I don’t mind straight apostrophes.  What bugs me is curly apostrophes that go the wrong way – get a deal on our ‘08 Cadillacs! (’Tisn’t ‘08; ’tis ’08.)

    Adding link as a public service:

    http://en.wikipedia.org/wiki/Apostrophe#Entering_typographic_apostrophes

  13. Miral says:

    This is why the naming of the Encoding.Default property bugs me.  It would have been better had it been called CurrentAnsi or something (since it’s not necessarily even the "system ANSI" codepage, as programs like AppLocale can modify it).

    Still, guess it’s too late to change it now.  (Where’s that time machine?) :)

  14. MadQ1 says:

    @Maurits: typing 8217 (yes, I know it’s 0x2019 in hexadecimal) on the numeric pad while holding down the Alt key (as per wikipedia) only works in RichEdit controls on my system. In Edit controls I get a down-arrow (↓). CharMap (sorry, Raymond!) tells me to type Alt+0146 for the Right Single Quotation Mark (’). I’d consider editing the page, but I’ll leave it to the experts.

    And just to show that I can pick nits with the best of them:

    The numeric value of the apostrophe character in ANSI is actually 39, the same as in both UTF-8 and UTF-16. The "0x" notation for hexadecimal numbers is platform independent, and thus 0x27 is equal to 0x0027 and even to 0x00000027, regardless of endianness. (Similarly, in base 10, 0039 is equal to 39, but we don’t usually bother with the leading zeros in real life.)

    So you’re both correct. But, as Napoleon (the porcine one) might say: "All statements are correct, but some are more correct than others."

  15. Worf says:

    @jondr: Ah, but backtick (`) is 0x60 – legal UTF-8. And it better be, because it’s necessary especially on UNIX-like systems (probably why UTF-8 is default – i18n for free in filenames with zero code changes (only two illegal characters – NUL, and slash (/)).

    But for documentation, that’s a TeX thing (pity everyone’s moving away from the nice look that is TeX… but writing in Word is "easier" but typesetting is a pain.)

  16. >it’s only 0x0027 in UTF-16, IIRC.  UTF-8 and ANSI is 0x27.

    *buzzer*

    Sorry, the correct answer is that it’s only 00 27 in UTF-16 BE.  In UTF-16 LE it’s 27 00.

  17. Mo says:

    Those “weird apostrophes” (or rather: single quotation marks) will crop up as a result of pretty much anybody who cares about typography, has copied & pasted text from a web page or Word document or e-mail where ‘smart quotes’ have been used, or from anything written by anybody on a Mac who has Alt(+Shift)+] muscle-memoried (on Mac OS X, Alt(+Shift)+[ is double-quotation-marks and Alt(+Shift)+] is single-quotation-marks).

    I must confess I’ve never really understood why, when NT was heralded as being entirely Unicode-savvy (which of course has been carried through to Vista, and beyond), a UTF-8 multibyte codepage was never introduced.

  18. Larry Lard says:

    Look out… the pedants are back.

    When did we^Wthey leave?

  19. Aaron says:

    People who "care about typography" shouldn’t be pasting their type into plain-text files and expecting them to come out properly.  Do they think that the boldface text and bulleted lists are also going to be fine?

    And I got a good laugh out of the complaint that "a UTF-8 multibyte codepage was never introduced."  I don’t even know where to begin.

  20. Michael J says:

    Is there such a thing as "ANSI" text?  I presume you mean "ASCII".

    Does that qualify as a nitpick?

  21. Spire says:

    ASCII is a 7-bit encoding. The term "ANSI", as commonly used in Windows, refers to an 8-bit superset of ASCII in which the other 128 characters vary depending on the code page in use.

  22. mikeb says:

    > Is there such a thing as "ANSI" text?  I presume you mean "ASCII". <<

    ANSI means: "That thing that isn’t Unicode" (http://blogs.msdn.com/oldnewthing/archive/2005/10/27/485595.aspx).

  23. Jonathan says:

    This blog, found by chance, saved my life… well…maybe not my life… but days of debugging possibly.

    Thanks.

  24. My advice stands.

    Exceptions should be for exceptional cases, not expected errors. Of course, a function should and should be able to return error/warning conditions and it should be checked by the programmer.

    In this respect, I see that UTF8Encoding can be configured to throw on errors, and so, no it does not follow my advice, strictly speaking. Actually, in that case, I don’t agree with the default behavior.

    ASCIIEncoding does not provide error detection. Does not make sense to me. You may say don’t use it, but if it’s there, it shouldn’t be half baked IMHO.

    Reading an ANSI file with UTF8 encoding can throw errors, but reading the same file with ASCII encoding can’t! (As far as I understand it, reading on MSDN, I don’t use .Net).

    Anyway, I’d have ASCIIEncoding return errors, including suggestive warnings if a BOM is found. That would conform to my advice. Exceptions would be reserved for file i/o errors etc.

    On the other hand, this is .Net. A scripting language, managed, most .Net users won’t even know about memory management, considerable some won’t know about code pages and god knows how many does RTFM… Design philosophy for .Net shouldn’t expect them to check for errors, or, turn on error detection manually.

    In summary, my UTF8Encoding would return errors/warnings -not as exceptions, and I’d check the success of any operation. There would be exceptions for exceptional, unexpected cases. For a .Net class library class, it would return all errors as exceptions and error detection will be present for all encodings and turned on by default.
