Why is there an invisible U+202A at the start of my file name?


There's something strange about this property sheet page:

IMG31415 Properties ×
General
Security
Details
 
Object name:       C:\Users\Bob\Desktop\IMG31415.jpg
Group or user names:
👤  👤 SYSTEM
 👤 Bob
👤  👤 Administrators
 
 
 
To change permissions, click Edit. Edit...
Permissions for SYSTEM Allow Deny
Full control  
Modify  
Read & execute  
Read  
Write  
Special permissions    
     
Advanced... For special permissions or advanced settings, click Advanced.
Apply Cancel OK

Okay, that was a trick question, because the thing that's strange is not visible to the eye.

Use the mouse to click in the object name field (the thing with the file path), then press Home, followed by Shift+End to select the entire text, then Ctrl+C to copy it to the clipboard.

Now things get interesting.

Fire up Notepad, paste the path into the Notepad document, and save it to the desktop with the name tricky.txt.

Huh? Notepad says, "This file contains characters in Unicode format which will be lost if you save this as an ANSI encoded text file."

What Unicode characters are we talking about? There are no accented letters here. All the characters in the file name fit in the ASCII repertoire.

Go to a command prompt and type

C:\Users\Bob> copy "

and then paste the path from the clipboard, then close the quotation mark, and hit Enter.

C:\Users\Bob> copy "?C:\Users\Bob\Desktop\IMG31415.jpg"
The filename, directory name, or volume label syntax is incorrect.

Wait, what? Where did that rogue question mark come from?

The answers to the two questions are the same: The mysterious Unicode character, which is invisible in Notepad, and which appears as a question mark on the command line, is U+202A (LEFT-TO-RIGHT EMBEDDING).
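If you'd rather not take that on faith, you can ask the string itself. Here is a minimal sketch, assuming you have the pasted text in a Python string. The `pasted` value and the `describe_non_ascii` helper are made up for illustration, and cp1252 is merely standing in for whatever the "ANSI" code page happens to be on your system.

```python
import unicodedata

# Hypothetical clipboard contents: the path with the invisible prefix.
pasted = "\u202aC:\\Users\\Bob\\Desktop\\IMG31415.jpg"

def describe_non_ascii(text):
    """Return (index, 'U+XXXX', name) for each character outside ASCII."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]

# Reports the single offending character at index 0.
print(describe_non_ascii(pasted))

# Converting to a code page that lacks the character substitutes "?".
# cp1252 is an assumed stand-in for the ANSI code page.
print(pasted.encode("cp1252", errors="replace")[:4])
```

The second print shows one plausible way a question mark can appear: the character has no equivalent in the target code page, so the conversion falls back to a substitute.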

We saw some time ago that you can, as a last resort, insert the character U+202B (RIGHT-TO-LEFT EMBEDDING) to force text to be interpreted as right-to-left. The converse character is U+202A (LEFT-TO-RIGHT EMBEDDING), which forces text to be interpreted as left-to-right.

The Security dialog box inserts that control character in the file name field in order to ensure that the path components are interpreted in the expected manner. Unfortunately, it also means that if you try to copy the text out of the dialog box, the Unicode formatting control character comes along for the ride. Since the character is normally invisible, it can create all sorts of silent confusion.
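If your program accepts paths that somebody may have pasted from a dialog like this one, you might defensively trim directional formatting characters from the ends of the string before using it. This is only a sketch: the `clean_pasted_path` name is hypothetical, and the set of characters to trim is my own choice rather than anything Windows prescribes.

```python
# Directional formatting characters to trim (an illustrative selection):
# LRM, RLM, LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = (
    "\u200e\u200f"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)

def clean_pasted_path(text):
    """Trim directional formatting characters from both ends of text."""
    return text.strip(BIDI_CONTROLS)

# Hypothetical pasted value with the invisible U+202A prefix.
pasted = "\u202aC:\\Users\\Bob\\Desktop\\IMG31415.jpg"
print(clean_pasted_path(pasted))
```

The sketch trims only the ends because a file name may legitimately contain such a character somewhere in the middle, and removing it there would make the path point at a different file.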

(We're lucky that the confusion was quickly detected by Notepad and the command prompt. But imagine if you had pasted the path into the source code to a C program!)

Comments (19)
  1. DebugErr says:

    All "Oh my god, it's not a picture, it's text" comments go here…

  2. Entegy says:

    On Windows 7, I am not getting the character in question. I copy to Notepad and the Command Prompt with no ANSI/Unicode warning or ? appearing.

    Also, to others reading, do this on a real property sheet. Even if you turn on Caret Browsing to select the name from Raymond's example, it won't work (at least it didn't for me).

  3. steven says:

    That's because the character is not in the actual HTML for the dialog. It is just the label followed by some non-breaking spaces. On a real property sheet, it works.

    Oh, and that has to be an image of pi.

  4. Brian_EE says:

    I am going to echo @Entegy's comment. I am on Win7 x64 Enterprise and the character isn't there. I tried pasting (from a real file property sheet's Security tab) into both Notepad and into a code editor that has a hexadecimal viewing mode. The only thing there is just the ASCII filename.

  5. Bradley says:

    Did you create the html for that dialog by hand, or is there some program that can generate it from a real dialog?

  6. Kai Schätzl says:

    I remember seeing this kind of path in the registry for many Office paths in the past. Inserted by the Office Installer. At least for Office 2003. Didn't check later (e.g. I didn't have problems with newer versions).

  7. Dan Bugglin says:

    I'm seeing it on Windows 8.1. Neat!

  8. Joshua says:

    In this case, embedding the Unicode character as &#x202A; in the HTML would have been a good idea. Oh well.

  9. Engywuck says:

    Win7 x64 Pro: character is not there (Notepad does not complain), Win10 x64 Preview Build 10074: character is there. Looks like it was introduced in Win8.

  10. Richie Hindle says:

    This is why Windows 8 got the flat look – so that Raymond could construct disconcertingly accurate HTML screenshots.

  11. Andreas Rejbrand says:

    @Richie: My thought exactly. But from the viewpoint of semantic hypertext documents, Raymond should use SVG instead of nested HTML TABLEs, DIVs, and SPANs. (And U+202A is not present on Windows 7.)

  12. cheong00 says:

    Is there any plan to ignore leading RTL/LTR characters at the filesystem level, or should we write applications with provision for this?

  13. Reseul says:

    I remember we added these to the chat area in Communicator as well (for names, dates, phone numbers, all of these in chat headers or generated notifications). We also had to filter them out when users selected "save to file" + "ANSI encoded", otherwise they would've always been prompted with "chat area contains Unicode etc." even when the real chat had none of these.

  14. Gabe says:

    It seems like the text control needs a property to tell it to always render RTL or LTR, regardless of the contents on the control. That way your file names, email addresses, and such don't need to have this extra confusing character in them.

  15. 12BitSlab says:

    Separate and apart from Raymond's HTML skills (which are impressive), I have to admit that I rather liked the name of the file that was being examined.

  16. Marc Sherman says:

    I recently received via email an Active Directory distinguished name that had a three-per-em space (U+2004) separating the user's first and last name in the CN portion of the DN. I copy/pasted that into a C++ source file in Visual Studio to reproduce a bug that only occurred for that particular user. I didn't know in advance that the U+2004 character was there. When I attempted to save the source file, Visual Studio popped up the same warning as Notepad did for you. This raised a red flag for me and pointed me in the right direction for figuring out the bug.

  17. Alex Cohn says:

    I don't understand why this prefix was introduced. Unless "Object name" and its value are one TextBox (I don't have Windows 8 to check right now), it should be enough to use style flags to ensure correct reading order for the window. Well, WS_EX_LTRREADING is not a "real flag", but the article msdn.microsoft.com/…/ee264314%28v=vs.85%29.aspx carefully describes "How to Ensure Text is Displayed with the Correct Reading Direction".

  18. ta.speot.is says:

    "Properties" dialog for a digital certificate does the same thing with at least the "Thumbprint" property. It took me so long to find the issue, because I copied and pasted the thumbprint out of that dialog and into web.config.

  19. Holger says:

    Now I know why Win8 is so ugly and has no aero – it's drawn in html….

Comments are closed.
