Paragraphs and Paragraph Formatting

What paragraphs are and how they are formatted are questions that continually come up both inside and outside of Microsoft. So this post describes Word/RichEdit paragraphs in general. A subsequent post will describe the “math paragraph”, which is part of a regular paragraph and is used for displayed equations, as distinguished from inline mathematical expressions.

The paragraph is a very important structure in written language. About six years ago, I developed the RichEdit binary format, which shipped with RichEdit 5.0 (Office 2003) as RichEdit’s preferred copy/paste format and was used by OneNote 2003 and 2007. In the design stage I talked with Eliyeser Kohen, of TrueType, OpenType, LineServices, and Page/Table Services fame. I was inclined to have four parallel streams: plain text, character formatting, paragraph formatting, and embedded objects, a format corresponding to the internal RichEdit representation. Eliyezer agreed such parallel streams were important, but insisted that they should be broken up into paragraphs. At the time, this seemed like extra overhead to me and I naturally didn’t want to slow things down. But I followed his advice and it’s right on! First, what’s a paragraph? Then what’s paragraph formatting? Then what’s a “soft” paragraph? And finally, what’s the final EOP?

What’s a paragraph?

 From a natural language point of view, a paragraph is one or (preferably) more closely related sentences that naturally belong together without becoming too long. From the Word/RichEdit point of view, a paragraph is a string of text in any combination of scripts and inline objects, including possible “soft” line breaks and “math paragraphs”, with uniform “paragraph formatting” up to and including a carriage return. The carriage return (CR) is given by the Unicode character U+000D, which you insert by typing the Enter key. In plain text on PC’s, the paragraph is usually terminated by a CRLF (U+000D U+000A) combination, but not ordinarily inside a Word document or RichEdit instance. Just the CR is used.

<rant> It’s quite convenient to use a single character. It takes up less space than the CRLF and it’s easier to parse/manipulate, since it’s an atomic entity. In fact, Unix already used a single character, the line feed (LF—U+000A), back in 1972, several years before the PC operating systems were developed. Unfortunately, the PC with its DEC heritage preferred CRLF, a holdover from the old teletype days, and Word and the Mac shortened it to CR instead of LF. Windows NotePad still isn’t able to display Unix/Linux LF terminated paragraphs correctly after all these years (note that 2008 > 1972). I’m on a mission to fix that, but please don’t hold your breath! Anyhow I like CR better than LF, mostly because of habit. Clearly it would have been better to have a single standard. In this connection, it’s interesting to note that Word and RichEdit can handle CR, LF, and CRLF terminated paragraphs, even though they prefer CR. </ rant>

What’s paragraph formatting?

A key characteristic of a paragraph is its formatting, which is represented by a pretty large set of properties. Most of these properties are settable using a paragraph formatting dialog. In particular, there’s alignment (left, right, center, justify, along with a variety of East Asian options), space before/after, line spacing (single, double, multiple, at least, exactly), left/right margins and wrapped line indent, line/page breaks, tabs (oh, how I wish HTML had tab support!), and bullets/numbering. Internally, paragraphs and their formatting get overloaded with such entities as tables and drop caps, but let’s not get distracted. Using hot keys like Ctrl+E for centering or the paragraph formatting dialog, you can set the formatting for the paragraph(s) in which the current selection occurs. If you just have an insertion point (the blinking caret), only the paragraph containing the insertion point gets the new formatting.

What’s a soft paragraph?

When you create a numbered list, you may want to have an entry with one or more line breaks but no new number or bullet. To insert a line break without ending the paragraph, type Shift+Enter, which inserts a Vertical Tab (VT—U+000B). Even though you get a line break, you don’t end the current paragraph, so no new line number appears. All the paragraph properties remain the same with the new line and the space-before property doesn’t apply to the new line, since the line is inside the paragraph. Sometimes it’s handy to refer to a sequence of lines terminated by such a line break as a “soft paragraph”. In HTML, these “soft line breaks” are represented by the <BR> tag, whereas “hard” paragraphs are identified by the <P> tag.

Thinking of numbered entities, you might want to change the character formatting of the number or bullet out in front. For example, you might want to use a larger font size or a different font. To do this, change the appropriate character formatting of the CR that ends the paragraph.

Final EOP

To provide a place to attach paragraph formatting for the last paragraph, every Word document and every RichEdit rich-text instance has a “final EOP” (end of paragraph), represented by a CR (CRLF in RichEdit 1.0). You cannot delete the final EOP, nor can you move the insertion point past it. In the Word and RichEdit object models, the ranges can select up through the final EOP, but they cannot collapse to an insertion point that follows the final EOP. The farthest they can go is up to just before the final EOP. Similarly messages like EM_EXSETSEL cannot make the RichEdit selection go beyond the final EOP.

RichEdit also supports plain-text controls, which are characterized by uniform paragraph formatting and don’t need, or have, a final EOP. An empty plain-text control is really empty, whereas a rich-text control always has at least one character, the final EOP.