What’s the difference between the zero width non-joiner and the zero width space?


In "Break it up, you two!: The zero width non-joiner," I discussed the purpose of the zero width non-joiner, which is to request that two adjacent characters be rendered without a ligature. Conversely, the zero width joiner requests that they be rendered with a ligature. Of course, it is up to the rendering engine to make the final decision as to how the rendering is done, but at least you can make your intentions clear.

Another character that seems very similar to the zero width non-joiner is the zero width space. Both of them have no width, and both of them break up ligatures. So what's the difference?

Well, one of them is a space, and the other one isn't.

The zero width space is used to indicate where one word ends and another word begins, even though there should not be any space rendered at the word boundary. This is significant for languages which have the concept of words, but not of spaces. For example, the Thai and Korean languages use multiple characters to represent words, but traditionally do not insert spaces between words. The words just run together, and readers are expected to use their experience with the language to know where one word ends and the next begins.

(This sounds kind of unfair to people who are still learning the language, but learning a language is already unfair. After all, in normal speech, people typically do not make discernable pauses between words. All the words run together, and you are expected to use your experience with the language to know where one word ends and the next begins.)

The Unicode Standard calls this out:

Zero-Width Spaces and Joiner Characters. The zero-width spaces are not to be confused with the zero-width joiner characters. U+200C zero width non-joiner and U+200D zero width joiner have no effect on word or line break boundaries, and zero width no-break space and zero width space have no effect on joining or linking behavior. The zero-width joiner characters should be ignored when determining word or line break boundaries.

In English, these special characters don't have much use. You could use the zero width non-joiner to break up a ligature, say to break up the "fl" ligature in the word "wolf‌like", but there's usually not much call for it in English.
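
As a rough illustration (the variable names are made up for this example), the following Python snippet builds the word with a zero width non-joiner between the "f" and the "l" and lists its code points. Whether the ligature is actually suppressed is still up to the font and the rendering engine.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# "wolflike" with a non-joiner between the "f" and the "l".
word = "wolf" + ZWNJ + "like"

# The string looks like eight letters but holds nine code points.
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Removing the non-joiner gives back the plain spelling.
plain = word.replace(ZWNJ, "")
```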

Note that if you use a zero width space instead of a zero width non-joiner, you are telling the layout engine that "wolf" and "like" are two separate words that happen to run together without a space. This means that it is possible for a line break to be inserted between them.
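
Here is a toy sketch of that difference, assuming a drastically simplified rule in which only an ordinary space or a zero width space permits a line break. Real layout engines follow the much more detailed Unicode line breaking rules, and the function name `break_opportunities` is invented for this example.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def break_opportunities(text):
    """Return the indices after which this toy rule allows a line break.

    Assumption: only U+0020 and U+200B create break opportunities.
    The non-joiner is deliberately ignored, in line with the rule that
    joiner characters have no effect on line break boundaries.
    """
    return [i for i, ch in enumerate(text) if ch in (" ", ZWSP)]

with_zwnj = "wolf" + ZWNJ + "like"
with_zwsp = "wolf" + ZWSP + "like"

print(break_opportunities(with_zwnj))  # []
print(break_opportunities(with_zwsp))  # [4]
```

Both strings render as a single unbroken run of letters, but only the second one tells the layout engine that it may wrap after "wolf".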

Comments (22)

  1. Brian says:

    Another good use for the zero-width space is when you are rendering something with markers to indicate spaces (for example, a raised dot, U+2E33). Tossing a zero-width space in after the dot allows the renderer to do line breaks after words properly.

  2. Entegy says:

    Zero width joiner is used to combine emoji as well. 🤷🏼‍♂️ is an emoji rendered by combining 🤷, 🏼, and ♂️ using ZWJ.

  3. Don Reba says:

    > After all, in normal speech, people typically do not make discernable pauses between words.

    They do in English, and languages that really do omit pauses, such as Spanish and Russian, separate words using stress.

    1. Yuri Khan says:

      Zero-width spaces are also useful when marking up source code for rendering in HTML, so that the renderer knows where it’s okay to break up std::numeric_limits<std::iterator_traits<std::vector::const_iterator>::difference_type>::max().

      1. However it means that when people copy/paste your code, it doesn’t compile due to the presence of invisible characters.

        1. Brian_EE says:

          Then they should use an old-school source editor that doesn’t support Unicode characters. Works 4 Me™

          1. mikeb says:

            I’d settle for an editor that intelligently deals with the stupid ‘curly quotes’ and fancy dashes that Word (and therefore Outlook) likes to insert.

            Every couple of months that useless feature causes confusion and aggravation for me or someone I’m communicating with.

        2. ranta says:

          I had a similar problem with soft hyphens in CamelCase identifiers. Solved by replacing them with empty span elements and CSS span.shy::before { content: "\00AD" }. Handling clipboard events might have worked, too.

          Re space characters, I tried to use U+2007 FIGURE SPACE to align a table of numbers today but it turned out to be wider than the actual digits. I wonder if the software substituted a different font for a missing glyph.

        3. Alex Cohn says:

          This may be a good reason to use such characters, especially in code samples marked “this is a wrong way to do it! This code leaks memory!”

      2. Joker_vD says:

        I also heard that ZWS is used to separate bytes in modern memory chips instead of the plain 0x20 space that was used in older RAM. Sadly, the debuggers haven’t caught up with this yet and print them as plain ASCII spaces.

    2. Muzer says:

      Not for every word. Just reading out that first sentence of yours and putting spaces where there were noticeable pauses in my natural speech pattern, it came up as “They doinEn glish”

      (that is, the notable pauses are caused not by grammar, but by the presence of plosive consonants. This can sometimes manifest itself as a glottal stop between two vowels in separate words, but not always – and the faster the speech, the more likely these are to be omitted.)

    3. Aghast says:

      If you think people make discernable pauses between words in English, you’ve never been to New York! nowudimean?

  4. Clockwork-Muse says:

    Japanese (and presumably Chinese?) fits into the “there are no spaces between words” camp as well.
    Which made it interesting when I tried reading Japanese children’s books where they put spaces in only occasionally (phrase boundaries, mostly).

    1. cheong00 says:

      Indeed, but we are not in the habit of putting a “zero width space” between words at all.

      If anything, sometimes a newspaper will put a space (not a zero-width one) between every single character, mostly in titles but sometimes in the body text as well, in order to adjust the “density of characters” on the page.

      1. Gee Law says:

        This is an old trick. If you align text in “Adjust” mode, a Web page renderer can change the width of a normal space so that each line runs the same length. In printed material, people don’t often put spaces, except for names listed in a column, to ensure all names (regardless of the number of characters they contain) run the same length.

        I don’t think the zero-width space is a good solution for these Asian scripts. Take Chinese as an example: there are “composed words”, which can be (gracefully) broken on sub-word boundaries. A complete, thorough solution would be to represent each sentence as its parse tree, so that the engine can determine “scores of breaking” from the tree. However, for Chinese, line breaks can be inserted almost anywhere (except at some punctuation locations). I guess that’s the reason why nobody bothers putting these spaces in a Chinese article.

        1. cheong00 says:

          The spaces are put in printed material because if the article contains English words or numbers, the line will break prematurely there.

  5. pc says:

    Being an American who finds this internationalization stuff really interesting but doesn’t actually deal with non-Latin languages on a day-to-day basis, I’m a little confused. If Thai & Korean don’t put spaces between words when writing on paper, would they actually type zero-width spaces between them when writing on a computer? Does line wrapping only happen between words, so they expect to type zero-width spaces to show where line wrapping can happen even though it’s mostly invisible? Do people usually have their editors show these spaces in some way even though they wouldn’t show in the finished product?

    It reminds me of the “soft hyphen”, which makes sense as an engineering concept of “the user has to tell the computer where hyphens are allowed to go since it’s an aspect of the language and how a word is pronounced”, but I haven’t seen a good usable way of getting the user to put them in everywhere they might want to, since the vast majority of the time it’d just be invisible so people wouldn’t bother for such a small benefit.

    1. I can’t speak for Korean, but when typing Thai, I don’t type out the ZERO WIDTH SPACE explicitly, and as far as I can tell, my editor or word processor doesn’t insert them implicitly either.

      I guess that you could add them explicitly as hints to encourage a text layout engine to perform line breaking, but even that is esoteric at best; most text layout engines on computers do a pretty good job of inferring where line breaks are and are not appropriate.

      For example, if you look at the source of this Thai language news webpage: https://www.voathai.com/a/north-korea-cheerleaders/4254961.html
      you’ll find only a few ZERO WIDTH SPACEs, all between otherwise empty <p></p> element markups (so definitely not being used for their intended purpose).

    2. Siegfried says:

      Korean uses spaces to separate words, but it doesn’t use hyphens to indicate a word was split at line end. This gets confusing rather quickly when you try to read a book that uses two or more vertical columns.

  6. Siegfried says:

    Actually, Korean uses spaces to separate words. That was one new concept my Chinese and Japanese classmates had to learn, when I went to language school in Korea. There are some Koreans who don’t use spaces, but that’s only because they don’t know where to put them.
    AFAIK Japanese does not use any spaces, because it uses Kanji (Chinese characters) for words and Hiragana for grammar particles, so it’s easier to know where words end.

  7. mikeb says:

    Ancient/antique European texts often did not use spaces between words (scriptio continua). I think spaces were a good invention – almost as good as the zero digit.

  8. cheong00 says:

    Btw, I do hope there exists a marker like “rtl” (U+200F) or “ltr” (U+200E) so that all characters after the marker are treated as if there were a soft break between each of them. It would be pretty handy for CJK environments.
