On word breaking in Chinese and Japanese


In Western languages, you can generally break a line at whitespace. (You can also break a line within a word, subject to language-specific hyphenation rules, but let's not get into that.) People unfamiliar with other language families sometimes wonder what's up with line breaking in other languages. In particular, line breaking in Chinese and Japanese tends to elicit confused responses.

> When I put text in a static control and it does not fit, the behavior is different depending on whether I'm using Chinese characters or Latin characters. Why does the Chinese string wrap to the second line, but the Latin string does not?
>
> ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐ
> ABCDEFGHIJKLMNOPQRSTUVWXYZ.

In Chinese and Japanese, there are no spaces between words, so if you're going to wait for a space before inserting a line break, you're going to be waiting a long time. Instead, to a first approximation, line breaks are permitted after almost any character. (You can learn the finer points of line breaking from Wikipedia.)

The static control uses Uniscribe to decide where to insert line breaks, and Uniscribe understands that in Chinese and Japanese text, you can break after almost any character. That's why you're seeing a line break in the static control with Chinese text. On the other hand, the static control cannot find a valid word break in the Latin string, so it all gets jammed onto one line (and the excess gets clipped).
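As a rough illustration of the rule described above (not the algorithm Uniscribe actually uses), the following Python sketch wraps text greedily, allowing a break after any character that Unicode classifies as wide or full-width and otherwise only at spaces. The function name `wrap`, the two-columns-per-wide-character measure, and the 16-column limit are all assumptions made for the example.

```python
import unicodedata


def is_wide(ch):
    """True for characters Unicode classifies as Wide or Fullwidth."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def columns(text):
    """Approximate display width: wide characters count as two columns."""
    return sum(2 if is_wide(ch) else 1 for ch in text)


def wrap(text, max_cols):
    """Greedy wrap: break after any wide character, otherwise only at spaces.

    A run with no break opportunity is kept whole even if it overflows,
    mirroring the clipped Latin string in the question.
    """
    # Split the text into unbreakable chunks; a break is allowed after each.
    chunks, current = [], ""
    for ch in text:
        current += ch
        if ch == " " or is_wide(ch):
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)

    # Fill each line with as many chunks as fit.
    lines, line = [], ""
    for chunk in chunks:
        if line and columns((line + chunk).rstrip()) > max_cols:
            lines.append(line.rstrip())
            line = ""
        line += chunk
    if line:
        lines.append(line.rstrip())
    return lines


print(wrap("ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐ", 16))
print(wrap("ABCDEFGHIJKLMNOPQRSTUVWXYZ.", 16))
```

With a 16-column limit, the Bopomofo string splits after the eighth character, while the Latin string comes back as a single overlong line.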

The DrawText function also has rudimentary understanding of line breaks in Chinese, Japanese, and Korean text. You can override the default line breaking rule of "line breaks allowed after any full-width character" by passing the DT_NOFULLWIDTHCHARBREAK flag, which forces the DrawText function to break only at whitespace. (Basically, have it treat CJK characters as if they were Latin.)

The documentation for DT_NOFULLWIDTHCHARBREAK notes that it may be useful to pass this flag if you know that the text is Korean, because Korean does put spaces between words, and preferring to break Korean text at whitespace can produce more attractive results. (The DrawText function is not very clever and does not try to autodetect whether the string is Korean. It is legal to mix Chinese characters into Korean text, and trying to figure out whether the string is "Mostly Korean with Chinese characters mixed in" or "Mostly Chinese with Korean mixed in" would require too much fuzzy logic for the simple DrawText function.)
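The effect of the flag can be modeled with a toy function. This is an illustration only, not how the function is implemented in Windows. The hypothetical `break_offsets` below lists the offsets at which a line break would be allowed, with and without a whitespace-only switch whose name is borrowed from the flag. The Korean sample string is also an assumption of the example.

```python
import unicodedata


def break_offsets(text, no_full_width_char_break=False):
    """Return offsets i where a break between text[i-1] and text[i] is allowed.

    By default a break is allowed after a space or after any wide or
    full-width character. With no_full_width_char_break=True (a parameter
    named after the flag, but only a model of it), breaks are allowed
    after spaces only.
    """
    offsets = []
    for i in range(1, len(text)):
        prev = text[i - 1]
        wide = unicodedata.east_asian_width(prev) in ("W", "F")
        if prev == " " or (wide and not no_full_width_char_break):
            offsets.append(i)
    return offsets


korean = "한국어 문장"  # sample text: two words separated by a space

print(break_offsets(korean))                                 # after every syllable
print(break_offsets(korean, no_full_width_char_break=True))  # only after the space
```

In this model the default rule permits a break in the middle of either Korean word, whereas the whitespace-only rule keeps each word intact.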

Bonus chatter: You thought Chinese, Japanese, and Korean line breaking is hard. Thai is even harder. In Thai, words are run together with no spaces, but line breaks are permitted only between words. This means that in order to break lines properly, you need a Thai dictionary.
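To make the dictionary requirement concrete, here is a minimal sketch of greedy longest-match segmentation, one simple way of finding word boundaries with a word list. The function name `segment` and the four-word toy dictionary are assumptions for illustration. Production Thai line breakers use far larger dictionaries and smarter disambiguation.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation against a set of known words.

    Characters that start no known word are emitted one at a time, so the
    function always terminates.
    """
    longest = max(map(len, dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary: "I", "love", "language", "Thai".
THAI_WORDS = {"ฉัน", "รัก", "ภาษา", "ไทย"}

words = segment("ฉันรักภาษาไทย", THAI_WORDS)
print(" | ".join(words))  # line breaks are permitted only at the "|" positions
```

Without the dictionary, nothing in the character stream itself marks where one word ends and the next begins.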

Bonus bonus chatter: On that last page I linked to, there is a reference to the Windows Intelligent Font Emulator, which went by the acronym WIFE. Somebody probably worked really hard to retrofit that acronym.

Comments (12)
  1. I was once working on a cross-platform game which we were localizing into Japanese. The engine we were using did not support Japanese word breaking, so the bulk of the engineering work for that localization project was implementing the Japanese word breaking rules (the game was pretty much done otherwise and had previously been localized into EFIGS). It was not at all trivial.

  2. oakfed says:

    My wife tried to explain how to break Thai writing to me a couple times. I still don't understand :-/ I suspect that it's very hard to get used to reading/writing text without spaces between words when you're older, and are conditioned to using spaces to show where the words are.

  3. Mason Wheeler says:

    If Chinese and Japanese have no spaces, how do they deal with possible ambiguous combinations?

    For example: WEFOUNDSOMEBODY

    Did we find somebody (a person) or some body (a corpse)?

    1. In general you can figure it out either by context or using a word that removes the ambiguity, like "WEFOUNDACORPSE" or "WEFOUNDAPERSON" instead.

    2. Brian_EE says:

      Mason,
      They are two different sets of glyphs.

      我们发现一些身体
      我们发现有人

      1. Mason Wheeler says:

        Obviously I wasn't referring to those same literal words in another language, but rather to scenarios analogous to this one in English, because surely they exist. You'll have issues like that in any sufficiently complex natural language.

        1. cheong00 says:

          Try thinking of the "strokes" in Chinese characters as letters of the alphabet and of the "Chinese characters" as "words" in English, and it will make more sense. (In "辭海" you'll see there are almost 20,000 Simplified Chinese characters, although for commonly used ones it's reduced to around 3,500. In Traditional Chinese, the numbers would be around 48,000 and 5,000 respectively.)

          Chinese words can be made with one or more Chinese characters like in English (e.g.: fire engine), just that in Chinese, words made with 2 or more characters are more common.

    3. The same way ambiguous sentences are handled in other languages: Disambiguated from context, or used intentionally for poetic or humorous effect.

    4. HomeCloset says:

      > For example: WEFOUNDSOMEBODY

      The Japanese language has some classes of characters. So at a glance it looks more like WeFoundSomeBody or WeFoundSomebody rather than WEFOUNDSOMEBODY.

  4. The MAZZTer says:

    "Somebody probably worked really hard to retrofit that acronym."

    The word you're looking for is "backronym".

  5. Martin Bonner says:

    For line breaking Thai *properly*, a dictionary isn't necessarily going to be good enough. You are going to need a natural language parser to understand that (something equivalent to) WEFOUNDSOMEBODY sounds unnatural as "we found some body", so it must be "we found somebody", so the break must go before the "some" and not the "body".

    Or you can cheat and insist that the words are delimited with U+200B ZERO WIDTH SPACE. That is probably the right answer for something like a UI where there usually isn't *that* much text, and it is being prepared by experts who understand the strange requirements of L10N.

  6. jgh says:

    Japanese is slightly more subtle. When writing horizontally it is good style to line-break at the end of hiragana word endings, which in most cases can be spotted programmatically as a change from a hiragana character to a kanji character, viz. in 白い川 you would break 白い/川.
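Two of the suggestions in the comments are easy to sketch: splitting at explicit U+200B ZERO WIDTH SPACE delimiters, and preferring a break where a hiragana character is followed by a kanji. The function names and the code point ranges used below are assumptions made for the example, and the kanji test covers only the main CJK Unified Ideographs block.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def split_at_zwsp(text):
    """Return the segments between explicit ZERO WIDTH SPACE delimiters."""
    return [part for part in text.split(ZWSP) if part]


def is_hiragana(ch):
    # Hiragana block, starting at its first assigned character.
    return "\u3041" <= ch <= "\u309f"


def is_kanji(ch):
    # CJK Unified Ideographs only; extension blocks are ignored in this sketch.
    return "\u4e00" <= ch <= "\u9fff"


def preferred_breaks(text):
    """Offsets where a hiragana character is immediately followed by a kanji."""
    return [i for i in range(1, len(text))
            if is_hiragana(text[i - 1]) and is_kanji(text[i])]


print(preferred_breaks("白い川"))  # [2], i.e. 白い/川
print(split_at_zwsp("WE\u200bFOUND\u200bSOMEBODY"))
```

The hiragana-to-kanji test is only a heuristic for preferred break points, so it would sit on top of the ordinary rule rather than replace it.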
