Recently a couple people have questioned the decision to leave the legacy Excel date behavior in the Open XML formats. This was primarily triggered by a post from IBM’s Rob Wier. While those folks have referred to it as an Excel bug, it’s actually something that was purposely designed into Excel since the beginning. Whenever you work on a feature, you eventually get to a point where you have to make difficult decisions and there are positives and negatives on each side. Let’s go though a couple points here around this date issue: (1) how the dates are stored; (2) history of the bug; (3) backwards compatibility concerns; (4) drawbacks.
How dates are stored
Let’s first look at the basic issue of how dates are stored in file formats.
When it comes to the storage of dates, there are a number of different approaches you could take. The key issue to first decide is whether you want to store it in a presentation friendly format, or one designed for easy storage and processing. For example, you could just store it as it’s displayed to the user (ie “10/23/2006” or “October 23, 2006”), or maybe using ISO 8601. This could actually be a fine approach for certain types of formats (like a wordprocessing application, or maybe for metadata), but in the case of a spreadsheet that’s probably not the best way to go. In order to quickly consume and create the files, you should probably choose one simple consistent storage method as you may be dealing with hundreds of thousands of dates in one file and you don’t want to have to deal with any complicated parsing.
In the case of spreadsheetML, where you have the possibility of hundreds of thousands (if not millions) of dates existing in a single workbook, you need to think carefully about how those dates are going to be used. You also need to think about the fact that there may be formulas that take those dates into account. For example, you might have a table like this:
The “total cost” column will have a function that is essentially:
=([end date] – [start date]) * [daily cost]
The simplest way to store these dates (both internally and in the formats) is to just pick a specific date as the base date and then every other date is just an index from there. This is similar to the Julian date system. So, if you decided that 9/13/2006 was the base date, then the data stored in the table above would be:
This makes the storage and parsing of the dates a lot easier, and it also makes any formulas that work with dates easy to work with. The dates are always just stored as an index, and then it’s just a matter of formatting to determine how the date is presented to the user. So the following dates (9/13/2006; Sept. 13 2006; Wednesday, September 13, 2006) are all the exact same values, they are just displayed to the user differently. This approach also has the advantage of working with other calendar systems since the storage of the date doesn’t account for months, years, or days of the week, but is instead just a counter of how many days from a given date have passed. Then any unit of measurement smaller than a day (hours, minutes, seconds, milliseconds, etc.) are just a fraction. So 1 second would be 1.157e-5
History of the date “bug”
It’s because of this “index” approach that the bug in question presents itself. Excel uses January 1, 1900 as the base date, meaning January 1, 1900 is stored as “1”. The problem is that 1900 was not a leap year because while it passes the first test of being divisible by 4, it doesn’t pass the second test which is that it must a multiple of 100 that is not divisible by 400. 2000 actually was a leap year, but 2100 will not be. Rob Wier’s blog post has some great information on the history of this behavior in our modern calendar system.
Unfortunately this bug occurs because, February 29, 1900 is actually treated as a date when the indexes are calculated. This means that every date after February 29, 1900 is off by one (or you could also say that the base date was supposed to be 12/31/1899 and every day before March 1, 1900 is off by one). This is only the case for the year 1900 though. 2100 and on are treated properly. Why did this happen in the first place? It’s actually covered pretty well in the following Microsoft KB article:
When Lotus 1-2-3 was first released, the program assumed that the year 1900 was a leap year, even though it actually was not a leap year. This made it easier for the program to handle leap years and caused no harm to almost all date calculations in Lotus 1-2-3.
When Microsoft Multiplan and Microsoft Excel were released, they also assumed that 1900 was a leap year. This assumption allowed Microsoft Multiplan and Microsoft Excel to use the same serial date system used by Lotus 1-2-3 and provide greater compatibility with Lotus 1-2-3. Treating 1900 as a leap year also made it easier for users to move worksheets from one program to the other.
Although it is technically possible to correct this behavior so that current versions of Microsoft Excel do not assume that 1900 is a leap year, the disadvantages of doing so outweigh the advantages.
If this behavior were to be corrected, many problems would arise, including the following:
- Almost all dates in current Microsoft Excel worksheets and other documents would be decreased by one day. Correcting this shift would take considerable time and effort, especially in formulas that use dates.
- Some functions, such as the WEEKDAY function, would return different values; this might cause formulas in worksheets to work incorrectly.
- Correcting this behavior would break serial date compatibility between Microsoft Excel and other programs that use dates.
If the behavior remains uncorrected, only one problem occurs:
- The WEEKDAY function returns incorrect values for dates before March 1, 1900. Because most users do not use dates before March 1, 1900, this problem is rare.
NOTE: Microsoft Excel correctly handles all other leap years, including century years that are not leap years (for example, 2100). Only the year 1900 is incorrectly handled.
While it’s no longer as important to maintain compatibility with Lotus 1-2-3, it is important to maintain compatibility with the billions of existing spreadsheet files out there. Barclay’s Capital was one of the members of the Ecma TC45, and they helped make it crystal clear how important this backwards compatibility was. Imagine how upset our customers would be if we changed the values returned from functions like =WEEKDAY.
I think that one of the reasons some folks think this is just a “stupid decision” made by TC45 is that they aren’t looking at all the places where the behavior of a spreadsheet could be affected. Remember, we don’t just want the date values to be interoperable, but spreadsheet functions have to be as well. We spent a lot of time documenting how every single spreadsheet function would work so any application can come along and properly consume and generate SpreadsheetML files.
If we decided to fix this bug and shift each date value down by one, how many spreadsheet formulas out there would we break? Here’s a really simple example, where the following function that had worked in previous versions would no longer work:
=IF(TODAY()=39013, “Due Today!”, “Not Due Today!”)
I will admit that the above function is probably not the best approach for comparing dates, but it’s something that’s always worked. Believe me, I’ve seen much crazier stuff in spreadsheets. We can’t control how our customers use the product, and if something like this worked before, we can’t break it now. If we changed our behavior so that 39013 no longer represented 10/23/06 and instead represented 10/24/06, that would completely break the function above.
It’s true that we could probably write a solution that parses through all functions out there and tries to determine if the number in the function is supposed to represent a date, but that would be seriously buggy, and to be honest it wouldn’t be worth it.
One way I like to think about this whole issue is to look at this the other way. Let’s instead say that the base date system for SpreadsheetML was supposed to start on December 31, 1899. There’s just a bug where dates between then and March 1, 1900 are off by one. So while we have this odd behavior for the first two months of the last century, everything after March 1, 1900 is stored correctly. I’m actually only half kidding here… since the date is just stored as in index, there is no real problem caused by this bug, other than maybe looking a bit foolish. In fact, OpenOffice has a behavior vary similar to this. In OpenOffice and Microsoft Excel, the numerical value for 9/13/2006 is equivalent. The numberical values for dates map all the way back to March 1, 1900; and then from that point back they are off by one (so in OpenOffice, 12/31/1899 maps to “1”). So the key thing to notice is that this behavior is fully documented, and there is no reason it should prevent anyone from properly consuming and generating valid SpreadsheetML files.
As I said in some earlier comments, this did come up in the TC45 meetings fairly early on. We didn’t spend too much time on it though as it was pretty obvious that while it looked a bit silly, there was no harm. And we all recognized that if we did try to change this, it could seriously change the meaning of existing files.
Remember, this format was not an attempt at building the ultimate generic Office application file format from the ground up. If we did that, and didn’t take our customers existing documents into account, then none of our customers would use the format. I think at times people misunderstand the design goals behind the Office Open XML formats, and I’m sorry if I haven’t been clear enough on that. The Office Open XML formats were definitely designed with the existing base of legacy Microsoft Office file formats in mind, and it’s important to remember that when you look at the format. There are definitely designs in the Open XML formats that clearly show their origins. A key focus of TC45 though was to make sure those behaviors were completely implementable on any other platform. This is covered in more detail in the TC 45 whitepaper.
We not only wanted to create an open format that folks could build solutions on top of, but we wanted the format to be something that our customers would actually use… otherwise what’s the point? We didn’t want this to just be another optional format that only some people would use, it’s the new default format and we hope that all of our customers will use it.