Not long after the Wired article about Word 5.1
(http://wired.com/news/mac/0,2125,63848,00.html), someone
sent me a link to this post
(http://wiredblogs.tripod.com/cultofmac/index.blog?entry_id=344456)
on the Cult of Mac Blog. It
mentions what I had thought to be a rather widely known fact: that most of the
people who first worked on Word came from Xerox PARC. In fact, the one
person most responsible for the design ideas that permeate Word is
Richard Brodie, who also worked on Bravo while at Xerox PARC.
Not surprisingly, a number of ideas that were first explored in Bravo found
their way into Word. The Cult of Mac mentions the file format, but the
statement that’s quoted isn’t quite accurate—at least insofar as it
leaves out some important details.
The basic design goal behind Word’s file format was to be able to read in
only that amount of information that was necessary to fill the document window
with text. You can see the fruit of this today by conducting a little
experiment:
- Boot Word.
- In a new document, type ‘=rand()’
- Save this document as “SmallDoc”, and close it
- Open a new untitled document
- Type ‘=rand(100)’
- Type <Cmd>-y about twenty times (until you have more than 100 pages of text).
- Save this document as “BigDoc”, and close it.
You could, if you wanted to, grab a stopwatch and time the next few steps:
- Select “SmallDoc” from the “File” menu
- Select “BigDoc” from the “File” menu
If you’re timing this, start the stopwatch when you mouse-up on each file
name in the menu, and stop it when you first see the insertion point.
The first thing you’ll note is that there is no appreciable difference
between the amount of time it takes Word to open BigDoc and the amount of time
it takes Word to open SmallDoc—this despite the huge difference in sizes.
In my experiment, BigDoc is over 1 MB in size while SmallDoc is barely more
than 24K. BigDoc is 40 times larger than SmallDoc, but I can’t tell the
difference when I open the files.
Now, there are a few data structures that are stored in the file and are
proportional to the amount of text in the file, but the actual data in them is
so small that reading them in approaches constant time relative to the amount
of time it takes to read in the actual text and formatting. The result is,
even today, a file open time that is proportional to the size of your document
window, not proportional to the size of your document.
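
To make the idea concrete, here is a minimal Python sketch of that kind of open. It is not Word's actual file format: the 8-byte length header, the `WINDOW_BYTES` constant, and the function names are all assumptions made up for illustration. The only point is that the number of bytes requested depends on the window, not on the file.

```python
import os
import tempfile

WINDOW_BYTES = 2048  # hypothetical: bytes of text needed to fill one window


def write_doc(path, text):
    """Write a toy document: an 8-byte length header, then the text.

    This layout is invented for the sketch; it is not Word's format.
    """
    data = text.encode("utf-8")
    with open(path, "wb") as f:
        f.write(len(data).to_bytes(8, "little"))
        f.write(data)


def open_first_window(path, window_bytes=WINDOW_BYTES):
    """Read the header and only as much text as one window needs.

    Returns (bytes_requested, first_window_bytes). Once the file is larger
    than one window, bytes_requested no longer grows with file size.
    """
    with open(path, "rb") as f:
        header = f.read(8)
        text_len = int.from_bytes(header, "little")
        chunk = f.read(min(text_len, window_bytes))
    return len(header) + len(chunk), chunk


def demo():
    """Create a small and a big toy document; report bytes requested on open."""
    with tempfile.TemporaryDirectory() as d:
        small = os.path.join(d, "SmallDoc")
        big = os.path.join(d, "BigDoc")
        write_doc(small, "x" * 24_000)
        write_doc(big, "x" * 1_000_000)
        return open_first_window(small)[0], open_first_window(big)[0]
```

Calling `demo()` returns the same count for both files, even though one toy document is roughly 40 times the size of the other.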
The post on the Cult of Mac Blog quotes Bruce Damer’s claim that, “Bravo and
BravoX stored out files by essentially just dumping the memory heap,” which is
really a gross oversimplification. If the file format consisted of a straight
dump of the memory heap, then opening a document would still take time
proportional to the size of your document.
The formatting in a Word file, however, is allocated in blocks of 512 bytes.
Formatting information is added to each of these blocks until it fills up, at
which point a new block is allocated. These blocks are written to the file as
full 512-byte blocks whether they’re full or not, which is the only sense in
which a Word file consists of a dump of the memory heap.
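
The paging behavior described above can be sketched in a few lines of Python. The 512-byte block size comes from the description; everything else (the `FormatPages` class, the idea of a formatting "record" as an opaque byte string, zero bytes as padding) is an assumption for illustration and says nothing about how Word actually encodes formatting.

```python
PAGE_SIZE = 512  # block size stated in the text


class FormatPages:
    """Accumulates formatting records into fixed-size pages.

    The record encoding is hypothetical; only the 512-byte paging
    follows the description in the text.
    """

    def __init__(self):
        self.pages = [bytearray()]

    def add(self, record: bytes):
        if len(record) > PAGE_SIZE:
            raise ValueError("record larger than a page")
        # Start a new page when the current one cannot hold this record.
        if len(self.pages[-1]) + len(record) > PAGE_SIZE:
            self.pages.append(bytearray())
        self.pages[-1] += record

    def dump(self) -> bytes:
        """Emit every page as a full 512-byte block, padding partial pages."""
        return b"".join(bytes(p).ljust(PAGE_SIZE, b"\x00") for p in self.pages)
```

Adding sixty 10-byte records, for example, fills one page and spills into a second, and `dump()` still produces 1024 bytes: two whole pages, the second one mostly padding.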
Damer attributes his claim to something Charles Simonyi said, but it’s
almost certain that either Damer didn’t fully understand what Simonyi was
saying or that Simonyi wasn’t entirely clear that this “memory dump” aspect of
Word’s file format is limited to the disk pages that hold formatting