Why can’t VarDateFromStr parse back a Hungarian date that was generated by VarBstrFromDate?

A customer liaison reported a problem with date parsing.

Ugh, date parsing.

The customer receives date information from a scanner and wants to parse it. They are using the COleDateTime::ParseDateTime method. The customer reports that clients in Hungary (locale 1038) are unable to parse dates: the call to COleDateTime::ParseDateTime returns false. That method internally uses VarDateFromStr, and calling VarDateFromStr directly fails with the error DISP_E_TYPEMISMATCH.

This problem is observed only for Hungarian.

The customer included a demonstration program that calls methods on COleDateTime, but I've stripped away the wrapper in the version below so we can focus on the problem.

LCID hungarian = 1038; // the Hungarian locale mentioned above
DATE date = ...; // something
BSTR str = nullptr;
HRESULT hr = VarBstrFromDate(date, hungarian, VAR_DATEVALUEONLY, &str);
// The call to VarBstrFromDate succeeds and returns a string like
// "2010. 12. 05." (note the trailing period). Now let's try to parse it back.

hr = VarDateFromStr(str, hungarian, VAR_DATEVALUEONLY, &date);
// The attempt to parse back to a date fails with DISP_E_TYPEMISMATCH.

The customer noted that this change in behavior was relatively recent.

The reason is that the localization team in Windows 10 made a change to the date formats for Hungarian. In earlier versions of Windows, the call to VarBstrFromDate produced "2010.12.05". Notice the difference?

The date separator changed from a period to a period followed by a space.

This highlights that culture data is not stable. Any code that generates Hungarian-formatted dates will produce different results on Windows 10 compared to earlier versions of Windows.

Of course, date formatting preferences can also be customized by the user at any time, so the statement is even stronger: Any code that generates locale-sensitive formatted dates may produce different results at any time, even within a single run of the program.

So if your goal is to format the date as a string, with the intention of parsing it back, then you don't want to use anything that is locale-sensitive. Instead, use a locale-insensitive format, such as ISO 8601.
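As a rough illustration of the locale-insensitive approach, here is a minimal sketch in portable C++ that writes and reads the ISO 8601 calendar form YYYY-MM-DD. The SimpleDate type and the FormatIsoDate and ParseIsoDate helpers are hypothetical names invented for this sketch. They are not part of OLE Automation, and the range check is deliberately coarse.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical plain-data date type, used only for this sketch.
struct SimpleDate {
    int year;
    int month; // 1-12
    int day;   // 1-31
};

// Format as "YYYY-MM-DD" without consulting any locale settings.
std::string FormatIsoDate(const SimpleDate& d)
{
    char buffer[40];
    std::snprintf(buffer, sizeof(buffer), "%04d-%02d-%02d", d.year, d.month, d.day);
    return buffer;
}

// Parse exactly "YYYY-MM-DD". Returns false for any other shape.
bool ParseIsoDate(const std::string& text, SimpleDate* result)
{
    assert(result != nullptr);
    if (text.size() != 10 || text[4] != '-' || text[7] != '-') return false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i == 4 || i == 7) continue; // the two hyphens
        if (text[i] < '0' || text[i] > '9') return false;
    }
    SimpleDate d;
    d.year = std::stoi(text.substr(0, 4));
    d.month = std::stoi(text.substr(5, 2));
    d.day = std::stoi(text.substr(8, 2));
    // Coarse check only: does not validate the day against the month length.
    if (d.month < 1 || d.month > 12 || d.day < 1 || d.day > 31) return false;
    *result = d;
    return true;
}
```

Because neither helper consults the user's regional settings, a string written on one machine parses the same way on any other, no matter how the date separator for a given locale changes over time.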

The customer said that they were getting the information from a scanner, but it wasn't clear where the scanner was getting it from.

If this is a timestamp generated by the scanner itself, then they should try to configure the scanner to generate timestamps in a locale-insensitive format.

If the timestamp is coming from the document being scanned, then you need to work out who is generating the document. If the document was generated by the same program that is trying to parse the result back (which the sample code seems to be suggesting), then you should fix the program that generates the document so it uses a locale-insensitive format. For human readability, you could have it generate a locale-sensitive version of the date next to the locale-invariant version. On the other hand, if the document was generated by an external source, then you may want to implement a custom parser that handles the date format that the external source uses.

And if you don't know what date format the external source is using, then you're kind of stuck. After all, a date of the form 12-05-2010 is ambiguous. It might be generated by somebody whose locale settings specify a date format of MM-DD-YYYY, or somebody whose locale settings specify a date format of DD-MM-YYYY.
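To make the ambiguity concrete, this sketch parses the same text under both assumed field orders. ParseDashedDate and FieldOrder are hypothetical names for illustration, and the validation is minimal.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical plain-data date type, used only for this sketch.
struct SimpleDate {
    int year;
    int month;
    int day;
};

enum class FieldOrder { MonthDayYear, DayMonthYear };

// Parse text of the form "NN-NN-NNNN" under an assumed field order.
bool ParseDashedDate(const char* text, FieldOrder order, SimpleDate* result)
{
    assert(text != nullptr && result != nullptr);
    int first = 0, second = 0, year = 0;
    char trailing = 0;
    // Exactly three fields must convert; a fourth means junk after the year.
    if (std::sscanf(text, "%2d-%2d-%4d%c", &first, &second, &year, &trailing) != 3) {
        return false;
    }
    int month = (order == FieldOrder::MonthDayYear) ? first : second;
    int day = (order == FieldOrder::MonthDayYear) ? second : first;
    if (month < 1 || month > 12 || day < 1 || day > 31) return false;
    result->year = year;
    result->month = month;
    result->day = day;
    return true;
}
```

Both orders accept 12-05-2010 but yield different dates (December 5 versus May 12), so nothing in the text itself tells you which one was meant. The wrong guess fails visibly only when the first field exceeds 12, as in 13-05-2010.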

Okay, so we've addressed the customer's problem of not being able to round-trip a date-to-string-to-date conversion. But why specifically does changing the date separator from "period" to "period and space" cause VarDateFromStr to be unable to parse back a string that it generated itself?

The string "2010. 12. 05." parses back like this:

  • "2010" is a year, no problem there.
  • ". " is a period followed by a space, no problem there.
  • "12" is a month, no problem there.
  • ". " is a period followed by a space, no problem there.
  • "05" is a day, no problem there.
  • "." is a period not followed by a space, which does not match the date separator, so this parse is rejected.
  • Next, a special-case rule for "." kicks in and says, "Okay, well, if normal parsing rules failed, but I see a period after a complete date, then treat it as a time separator."
  • And then parsing fails, because a time separator is not allowed due to the VAR_DATEVALUEONLY flag.
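The steps above can be mimicked with a toy model. This is emphatically not the real VarDateFromStr logic, only a sketch of the sequence just described: ToyParseDate and ToyResult are hypothetical names, the Hungarian trailing-period special case is omitted, and what the real parser does with the leftover period when the date-only flag is absent is an assumption (the toy simply accepts it).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum class ToyResult { Ok, TypeMismatch };

// Toy model: number, separator, number, separator, number, then look at leftovers.
ToyResult ToyParseDate(const std::string& text,
                       const std::string& dateSeparator,
                       bool dateValueOnly)
{
    std::size_t pos = 0;

    // Consume a run of digits; report whether at least one was found.
    auto readNumber = [&]() -> bool {
        std::size_t start = pos;
        while (pos < text.size() &&
               std::isdigit(static_cast<unsigned char>(text[pos]))) {
            ++pos;
        }
        return pos > start;
    };

    // Consume the full date separator, which may be more than one character.
    auto readSeparator = [&]() -> bool {
        if (text.compare(pos, dateSeparator.size(), dateSeparator) != 0) return false;
        pos += dateSeparator.size();
        return true;
    };

    if (!readNumber() || !readSeparator() ||   // year
        !readNumber() || !readSeparator() ||   // month
        !readNumber()) {                       // day
        return ToyResult::TypeMismatch;
    }
    if (pos == text.size()) return ToyResult::Ok; // nothing left over

    // A leftover "." does not match a two-character separator, so the
    // fallback treats it as a time separator, which date-only mode forbids.
    if (text[pos] == '.') {
        return dateValueOnly ? ToyResult::TypeMismatch : ToyResult::Ok; // Ok is an assumption
    }
    return ToyResult::TypeMismatch;
}
```

With a separator of ". ", the toy accepts "2010. 12. 05" but rejects "2010. 12. 05." in date-only mode, which matches the failure the customer observed.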

There is also some special-case code for Hungarian trailing period, but that code path is no longer being hit, probably because of the change from a one-character date separator to a two-character date separator.

It turns out that the date parsing code has a ton of special-case rules for various languages. (I'm looking at you, Polish, with your genitive month forms.)

Now it looks like it needs a ton plus one.

Comments (29)
  1. Karellen says:

    «”.” is a period not followed by a space»

    …but none of the examples you quoted have a trailing period, irrespective of whether periods are followed (or not) by spaces. You have «// “2010. 12. 05”. Now» (with spaces) and «produced “2010.12.05”. Notice» (without spaces), but no trailing periods in the actual date.

    1. Brian_EE says:

      Yeah I wondered that myself. If I had to speculate, I would suspect he meant the comment line to read:
      // “2010. 12. 05.” Now let’s try to parse it back.

      1. Karellen says:

        I find that doubtful. Standard (non-technical) usage is to put punctuation within quotes[0], and I’ve seen plenty of pedantic hacker types stick to this usage in technical situations where it’s genuinely confusing, rather than using the quotes as literal delimiters of the thing being quoted.

        That someone would flout standard usage, and do so in a context that is also technically incorrect, seems highly implausible.

        [0] http://www.catb.org/~esr/jargon/html/writing-style.html

        1. BobVul says:

          Ahh, “standards”. Funnily enough, the whole punctuation-within-quotes thing is more specific to the US – the UK and others do the opposite. And that’s always felt more natural to me.

          That said, Raymond _is_ in the US, so I suppose he’s most likely to follow US style.

  2. MV says:

    …and this is why “.ToString()” (and its equivalents) should have been defined as producing ISO standard date formats. And there should be a “.ToStringForHumanConsumption()” that uses the locale.

    1. Clockwork-Muse says:

      Frankly? ToString() should only ever be used for debug information. Which, yes, should likely be in ISO for those things that have relevant standards, but – and here’s the big catch – should never be used to output data to the end user even then. Not sure if I’d explicitly document the format as always being that format, though, since I wouldn’t want people to rely on it in any way (IOW, the intent is that it provides useful information for debug, but may not be stable enough for release to users).

      If you want to pass a string to the user, it _must_ go through an explicit cultural formatting. Even if that formatting ends up being the ISO standard. Outputting raw ToString() is pretty much never the right answer, perhaps especially on primitives.
      I keep asking myself about defining a language such that ToString() calls are elided for release builds (and throw errors if that leaves something undefined), just to help programmers spot this kind of dangling localization issue.

      1. cheong00 says:

        > ToString() should only ever be used for debug information

        Oh, really? Our code base has tons of code that uses .ToString() to convert integer values to strings for use in select boxes.

        And then there’s also DateTime.ToString() overloads that accepts a formatting string plus CultureInfo, that I always plug CultureInfo.InvariantCulture into.

        It seems nothing wrong to me.

        1. Clockwork-Muse says:

          We can mostly ignore the overloads in various classes that take formatting parameters, since you would need an equivalent method (that is, the fact it’s called ToString() is unimportant, what’s important is that it can be deterministically controlled).

          Additionally, the fact that int.ToString() returns a (documented, and often relevant) stable result is unimportant. Not all cultures format negative numbers the same way; if you’re outputting numbers for customer viewing (like those select boxes….) you need to take that into account. For that matter, not all ‘negative’ numbers in the same culture are formatted the same way (accounting, anyone?)

      2. Joshua says:

        There really should have been Object.ToString(IFormatProvider). And no it can’t be made with an extension method.

        1. Clockwork-Muse says:

          Really? Why not? In most cases usefully formatted data will involve only publicly visible (or derivable) information. In the somewhat rare case you’re actually outputting to a string, that is (as opposed to something on a web page or in a program). Could you possibly do it faster from the inside? Yeah, maybe. But even in most instances of string serialization you’re not going to use ToString(), but something that generates more specific output.

          This is in contrast to a debug-only ToString(), which I would expect to possibly give private data, including transient, non-reproducible state.

      3. MV says:

        The problem is that so many “frameworks” (using the term loosely) use .ToString() as the default (and sometimes the ONLY) way to convert data. Have a generic serialization mechanism that turns an arbitrary data structure into XML? I bet it uses .ToString()! Or “Value=”+x where x is a date? ToString() again! Or string.Format(“Value={0}”,x)? ToString() again!

        The fact that the formatting mechanism produces ambiguous outputs by default is almost criminal. It makes it WAY too easy to think your program is working right when in fact you’ve created horrible localization / data interop problems. I worked for a place once that required Windows to be set for US locale even when deployed in Europe – because some code explicitly formatted/parsed dates as “mm/dd/yyyy” and some code formatted/parsed dates using the locale. But no one noticed the problem in the US, and by the time they started expanding into European markets (10+ years later) there was WAY too much legacy code to find and fix it all.

        Another place I worked at had a problem where a seldom-used program blew up sometimes, and eventually it was noticed that it only blew up between the 13th and 31st of the month. From the 1st to the 12th it “worked” – but silently gave incorrect outputs.

        Like I said, it’s almost criminal how many problems have been created by this. It’s a “pit of failure”, because the easy obvious thing is exactly the wrong thing to do – and yet it seems to work at first, so it makes it through testing and into production before anyone discovers the problem.

        1. Viila says:

          I remember such a problem in old Borland visual toolkits. It used culture sensitive parsing when reading config files. Finnish locale by default uses “,” (comma) as the decimal separator. So I’d several times seen programs explode on init with parse errors when they tried to read their own configs that, of course, had ‘.’ (period) separated decimals.

          I don’t like comma separated decimals anyway (plus they make it impossible to paste any numbers to spreadsheet programs from elsewhere since most everyone else uses periods), so I typically edit my locale back to period, but that’s still a thing that shouldn’t have ever happened.

  3. alv says:

    “So if your goal is to format the date as a string, with the intention of parsing it back, then you don’t want to use anything that is locale-sensitive. Instead, use a locale-insensitive format, such as ISO 8601.” – There are cases when this can’t be done; for example, when the formatted date is put into a text box, in which the user would like to see and edit it in the localized format. I guess quite a number of applications would fail if that didn’t work. (I’m using .NET on Windows 10 for development, and, interestingly, .ToString() followed by DateTime.Parse() seems to work reliably, so this bug seems not to be hit there.)

    (Useless trivia: in Hungarian, the trailing period is not the time separator, but part of the date itself. ‘X.’ is the Hungarian equivalent of ‘Xth’ in English, so ‘2010. 12. 05.’ is to be interpreted as ‘the 2010th year’s 12th month’s 5th day’. The time separator would be a following space. Officially, the time itself should be written as ‘10.56’, but most people write it as ‘10:56’ – Windows uses ‘:’ as well, following the people not the rules).

  4. Entegy says:

    Interesting stuff. I always customize my date and time formats, but they’re what I would imagine being pretty standard stuff: MMM d yy for short date and MMMM d, yyyy for long date, and the time always changed to 24H format.

    1. French Guy says:

      I guess you’re from the US, since you use month/day/year. If I had to customize a date format, I would go day/month/year for human consumption (since that’s the order people use where I live) or year/month/day (which is the most logical order, given the way we write our numbers) for machine consumption (since it makes comparisons easier). Time would always be in 24h format if present, since it’s what we’re used to (and 12h format is a pain to sort – what do you mean 12 is before 1?). If a machine-oriented format was also meant to be human-readable, I’d put in separators (while I’d probably skip them in a purely machine-oriented format to save space).

      1. Joshua says:

        I guess you don’t care about Y10K.

        1. Nick says:

          RFC 2550 already solved for y10k in the yyyyMMdd scenario, you just have to implement it. I’m sure that will become important in another 6000 years or so

          1. Ben Voigt (Visual Studio and Development Technologies MVP with C++ focus) says:

            You can only cite RFC 2550 one day a year, and today ain’t its anniversary.

      2. Wombat says:

        There is a nice standard date format called ISO 8601 – I’d suggest using that, particularly seeing it’s human and machine readable.

        I think Raymond (and others) have commented in the past about people reinventing stuff that has already been solved.

        1. Nicolas says:

          Some parts of ISO 8601 are human-readable – but barely anyone will understand which day you’re talking about when you write “2010-351” or “2007W416”. Loads of machines have problems with those too.

    2. Richard says:

      Entegy, you’ve joined the members of the “Ooops I broke everything by accident” team.

      There is exactly one standard date format – ISO 8601.

      All other formats are user-specific and can change at any moment, by Government fiat or user whim.

      ISO 8601 also changes as it’s an international standard that gets updated – but will at least give notice and be sufficiently backwards compatible for most uses.

      The only safe assumption you can make is that the time and date storage and manipulation your software does is outright wrong in some subtle way that you’ve yet to find.

      1. Beldantazar says:

        @Richard I think Entegy means that he always customizes his OS display format for date to be like that. As an example of user customization that would break things.

        1. Richard says:

          Good point. Could be.

  5. Marcin says:

    Well, days do belong to a month… :) I was taught that “The 5th (GEN:)December” is to be understood as “The 5th [day of] (GEN:)December”, while “The 5th (NOM:)December” actually means “The 5th (NOM:)December [ever]”. As in “December of 5 AD”.

    And to the point: I always believed cultural data belongs to the presentation layer. Storage should use something universal. And then came Excel, Object Pascal* and others.

    *I do not remember if I had problems with data storage in TurboPascal. Might have, just repressed.

    1. Marcin says:

      I seem to have lost context here. This comment should start with ‘As for the Polish genitive months (“5 grudnia” vs “5 grudzień”), genitive is about possession and, well, days do belong to a month…’

  6. Lars says:

    Surely, Raymond is not complaining that the genitive exists or that its use is nonsensical. But it does require a morphologically aware VarBstrFromDate, which in this case is not one but two extra tables (genitive singular and plural, due to the way numbers work in Slavic languages). Other special snowflake languages have other needs, and it all adds up. That is the complaint, I think.

    1. Indeed. I could’ve written “For example, Polish has genitive month forms.” but that would have been boring. The problem is unavoidable, because we are trying to impose uniformity on something that developed independently in multiple places around the world. (It’s not like the Japanese and the Poles and the Finns all got together in a meeting room thousands of years ago and said, “Okay, so here’s our plan for making date parsing as complicated as possible.”)

  7. Dave says:

    >I’m looking at you, Polish, with your genitive month forms.

    Poland cannot into dates!

    1. Richard says:

      Being Czech myself, I kind of envy the amount of attention our Polish friends got in the Windows source code :-) . Czech grammar is exactly the same hell when it comes to date formatting and parsing. I would not dare to code “linguistically proper” parsing of months myself. If that was included I bet the whole Windows installation would be 1GB larger :-)

      And a funny message to all our native English-speaking colleagues. When we were kids, the 2nd program we all wrote after “Hello World” was:
      >Enter your name: John
      >Hi John!

      CAN YOU IMAGINE HOW DIFFICULT IS THIS ONE IN SLAVIC LANGUAGES??? Nearly impossible. :-) …. that is why I always preferred math application.

      I wish nice Christmas time to Raymond and you all.
