Adventures in application compatibility: The bogus memory calculation

One of my colleagues shared with me one of his application compatibility stories.

There was a program that would fail on some computers but not others, and it wasn't clear why. The problem was traced to the size of an internal cache. Now, GlobalMemoryStatus officially returns unsigned values, but if the calling application is not marked /LARGEADDRESSAWARE, then GlobalMemoryStatus reports a maximum of 2GB − 1 bytes of memory, for compatibility purposes.

You'd think that this would be enough to keep old programs happy, but apparently not. This particular program wasn't content with the values that it got from GlobalMemoryStatus. Instead, it took the dwTotalPhys and added it to the dwTotalPageFile, and treated the result as a signed value. This means that on systems with more than 2GB of memory, both values are reported as 2GB − 1 (0x7FFFFFFF), so the addition produces a total of 0xFFFFFFFE, which is a negative value when interpreted as a signed result, which in turn causes the program to crash.
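To make the arithmetic concrete, here is a minimal sketch in portable C. It does not call GlobalMemoryStatus. The names LEGACY_MAX_BYTES, clamp_for_legacy_app, and buggy_total are hypothetical, and the clamp models only the behavior described above for programs not marked /LARGEADDRESSAWARE. The conversion to a signed value assumes the usual two's complement behavior.

```c
#include <stdint.h>

/* Hypothetical constant: the 2GB - 1 ceiling reported to legacy callers. */
#define LEGACY_MAX_BYTES 0x7FFFFFFFu

/* Models the compatibility clamp: never report more than 2GB - 1 bytes. */
uint32_t clamp_for_legacy_app(uint64_t actual_bytes)
{
    return actual_bytes > LEGACY_MAX_BYTES ? LEGACY_MAX_BYTES
                                           : (uint32_t)actual_bytes;
}

/* Models what the program did: add the two fields, then treat the sum as signed. */
int32_t buggy_total(uint32_t total_phys, uint32_t total_page_file)
{
    uint32_t sum = total_phys + total_page_file; /* unsigned add, wraps modulo 2^32 */
    return (int32_t)sum; /* assumption: two's complement conversion */
}
```

With both fields clamped to 0x7FFFFFFF, buggy_total(0x7FFFFFFF, 0x7FFFFFFF) comes out as -2. On a small machine, say buggy_total(0x20000000, 0x50000000), the sum still fits in 31 bits and stays positive, which would explain why the program failed on some computers but not others.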

My colleague fixed the program by patching out the instructions that added dwTotalPhys to dwTotalPageFile, and had the program operate solely on dwTotalPhys, which is probably what it should have been operating on in the first place.

You see, even though the field in the MEMORYSTATUS structure is named dwTotalPageFile, it doesn't actually give you the size of the page file. The documentation of dwTotalPageFile says:

The current size of the committed memory limit, in bytes. This is physical memory plus the size of the page file, minus a small overhead.

Yes, this is a case of bad naming. (You can come up with your own theories how we ended up with the bad name.)

By adding dwTotalPhys and dwTotalPageFile, the code was double-counting the physical memory.
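The double-counting is easy to see with a toy model. This sketch assumes the commit limit is exactly physical memory plus page file (it ignores the "small overhead" the documentation mentions), and the function names are made up for illustration. Sizes are in whatever unit you like, as long as it is the same one throughout.

```c
#include <stdint.h>

/* Toy model of the documented meaning of dwTotalPageFile:
   the commit limit, i.e. physical memory plus page file (overhead ignored). */
uint64_t model_commit_limit(uint64_t phys, uint64_t page_file)
{
    return phys + page_file;
}

/* The program's sum of dwTotalPhys and dwTotalPageFile:
   physical memory is counted once directly and once inside the commit limit. */
uint64_t sum_as_program_computed(uint64_t phys, uint64_t page_file)
{
    return phys + model_commit_limit(phys, page_file);
}
```

For a machine with 512MB of RAM and a 768MB page file, the model gives a commit limit of 1280MB, but the program's sum is 1792MB. The extra 512MB is the physical memory counted a second time.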

The conclusion my colleague drew from this exercise was that there are still programmers out there who are working hard to skip the documentation, come up with bad ideas, and implement them poorly.

I admire the program's dedication to getting everything wrong despite the operating system's efforts to save them from themselves.

Comments (21)
  1. fm says:

    > I admire the program’s dedication to getting everything wrong despite the operating system’s efforts to save them from themselves.

    And yet, users won’t have an opportunity to appreciate the value created for them by Microsoft, because they didn’t even understand that the application was doing something stupid in the first place.

    This means users are trained to believe that the app is “right”, since it works, and when they upgrade their OS and it doesn’t work any more (shim not applied, 16 bit support not in the OS any more, tighter security policy of new version, etc etc etc) it’s Microsoft’s fault that it broke.

    This is why name & shame should always have been how the OS handled application bugs.

    Now repeat for the web stack where browsers do weird compatibility things silently instead of telling the user the site is stupid due to webkit only APIs, and a shim is applied to make it work.

    1. Brian says:

      WebKit is the new IE6. Developers write to its proprietary features and users wonder why other browsers don’t work “correctly”

      1. cheong00 says:

        I’d say Chrome is the new IE6 when you try to compare it with webkit. On the web application I’m maintaining, more than a dozen CSS and DOM scripting workarounds are for Chrome, where they don’t get the standard right.

        The latest issue I’ve experienced is with DOM object processing of TextArea. On IE and Firefox I can get the correct value from jQuery(obj).val(). On old versions of Chrome it just returns null, which is a non-issue because I then try jQuery(obj).text(), which returns the correct value. But on the latest version of Chrome, jQuery(obj).val() returns the “old value of control” instead of the “current value”, which leaves me between a rock and a hard place, and I’m not sure how to work around it without an explicit browser version check or breaking old versions of Chrome.

        (A browser version check is not desirable because it would break when a newer version comes along that actually fixes the problem)

        1. Drak says:

          Sounds more like the framework you are using (for compatibility reasons, hopefully) isn’t being very compatible with Chrome?

          1. cheong00 says:

            If you mean jQuery, well…

            Btw, jQuery is supposed to help me work around those platform incompatibilities, so I agree that I shouldn’t have had to apply those workarounds.

        2. Viila says:

          Wasn’t Chrome supposed to seamlessly autoupdate so that there wouldn’t be old versions in the wild?

          1. cheong00 says:

            Not if HTTP connections to the internet are blocked and web browsers are strictly for LOB web applications only.

            This is a common configuration for companies in financial sector.

    2. ChDF T says:

      Let’s ignore the potential legal trouble for the moment.
      The common user will always blame the person/enterprise/application that is most convenient to blame. This happens to be the OS most of the time, however it doesn’t have to be: I know some users who like to blame everything on the vendor (which is not Microsoft) of a *very* expensive application, which they happen to use all day long. So let’s ignore the common user as well.

      You propose that the OS should point out that the application is to blame for the malfunction. This assumes two things: 1) the malfunction is a known problem and 2) it can be detected. Not every API-misuse results in an app-crash, so (2) might be considerably hard to achieve.
      Let’s say this behavior would be implemented – the user would use application A and suddenly a modal message box opens: “incorrect API usage detected, this application will be terminated; click here for more details”. Note that we can safely display technical information, after all only advanced users are left. However one of these users may still wonder “The OS can detect the misuse and explain it, which implies that it can infer what the application’s intentions were. Why doesn’t the OS fix the app?”
      In my opinion (which very well may not be yours) this is the argument that kills your proposal. If someone (be it Microsoft or whoever else) goes through all the trouble to track down an issue well enough to safely detect it, they may as well fix it, though the fix might be ugly or ultimately not worth the trouble sometimes.

      Before you say “the developers have to learn how to do it right”: Yes they do, however breaking code in production (i.e. after it was sold & shipped) doesn’t help anyone. There are tools to find such trouble spots before shipping the application, however lower level frameworks usually trust the developer to do the right thing and require more care and experience in most phases of software development – this does include “testing”. It is on the developer to invest the appropriate effort into each phase – or in other words: this isn’t a technical problem, don’t try to solve it with technology.

      1. Anon says:

        It helps the rest of the world when our entire computing infrastructure doesn’t crash and send us back to the Dark Ages.

        But no, it’ll be fine, I’m sure. There’s no reason to bother preventing architects from designing ragged shanties with no safety code enforcement, either.

    3. IanBoyd says:

      You can always download the Application Compatibility Toolkit, and view the list of over a thousand programs that each have their own carefully entered compat shims.

      I seem to remember a note about children’s programs being important, and there certainly are a lot.

      The 32-bit database is quite the trip down memory lane.

  2. ChDF T says:

    Based on them adding dwTotalPhys and dwTotalPageFile, I guess that it was intended to fill all available memory with the cache. What would be the use for such a cache? After all a cache that has to be read from the hard disk through the page file isn’t any faster than reading the data that was the source for the cache from the hard disk.
    I’m also wondering what would happen if two applications did that (trying to fill all available memory with a cache)…

    1. pc says:

      Well, basing your cache size on a percentage of total physical+swap could almost make sense if the data you’re caching isn’t from the hard drive, such as if it’s from some slow network connection or the like. I could imagine there being a use case, like some server application that is recommended to be the only application on the server, caching some really slow data source, where it’d make sense to want to swap out to local disk for some kind of cache. We’re not really told anything in the story about the use case for this application.

      Of course, the explanation of people that “come up with bad ideas, and implement them poorly” is more likely.

      1. Viila says:

        Even in that case I’d imagine you probably should maintain an explicit cache rather, since you can’t exactly control which bits of your application get paged out with any kind of precision.

        Throughput will probably also be better, since you can use bulk transfers: you know exactly the domain you’re interested in at that point, instead of having to fault on every page individually, and you know when you’re no longer interested so you can immediately cache it back out.

  3. cheong00 says:

    However, I suppose even jQuery can do very little in cases like when Chrome chooses to entirely ignore the min-height CSS style for tables and the like.

    1. Anon says:

      The behaviour of min-height on a table is undefined, according to the CSS specs. If you want consistency, you’ll have to choose a style with defined behaviour.

      And then you can complain about all the browsers that STILL get it wrong even when there’s a defined spec.

      1. cheong00 says:

        No, I’m just giving an example of trouble caused by Chrome that is unlikely to be handled by jQuery. Btw, I’ve seen SO suggest that “height” should be used instead, but I tried that and it does not function correctly on Chrome either. (We have multiple TRs, some of which are “display:none” in the beginning but shown on some criteria. We set all TRs except one to a fixed height and leave the remaining TR without a height specified, so this row should expand to fill the remaining space. This works on all browsers except Chrome)

        The other WTF-ty thing is that when setting a GIF background image on a TR, when there is a TD with content covering it, somehow the TD will get the background image’s color with opacity = 1, so the gradient is not shown. But when I try to add background-color: transparent to the TD, the TR itself will somehow become transparent when the page loads. (“Somehow” because when you select the table, the background-image is shown again. And it’s not related to network latency because it happens on localhost and the images are small)

        1. xcomcmdr says:

          And that kind of bullshit is why I hated Web dev’.
          That, and Javascript.
          I’ll stick to non-Web applications, thank you very much.

  4. smf says:

    >The conclusion my colleague drew from this exercise was that there are still programmers out there who are working hard to skip
    > the documentation, come up with bad ideas, and implement them poorly.

    They may not have ignored the documentation, they may just have poor comprehension skills.

    The majority of programmers are only competent to a level that makes them slightly dangerous.

    1. Viila says:

      We’re all also lazy. If I see a member variable that promises to contain what I want (eg. Intellisense popup) I might trust it to actually do what it says on the tin without double checking.

      Remember Raymond’s adage: most software isn’t written to be hostile towards people reading the source. But it does suck when it is by accident like this.

  5. Roel says:

    Looks like it wasn’t always a case of bad naming. It appears the definition has changed at some point in time: my old Visual Studio 6.0 MSDN Library defines dwTotalPageFile as

    “Indicates the total number of bytes that can be stored in the paging file. Note that this number does not represent the actual physical size of the paging file on disk.”

    Maybe this is an old program, and they implemented that functionality when the documentation used the old definition?

    (I don’t think it’s a good idea to use functions like GlobalMemoryStatus for any purpose other than purely informative, but that’s another story)

  6. The bad naming is fairly obvious – if you design memory management so that physical memory is just a view into the page file (so every byte of physical memory is backed by a byte of page file), you simplify the task of paging – at any time, you can free up physical memory for another use by writing its contents to its associated bit of page file.

    Then, because disk space is limited, design the page file such that it’s able to change size to match the committed memory limit; if I’ve got 2 MiB of RAM, and 40 MiB of disk, you limit the page file to 2 MiB if I stay within RAM, but allow it to grow if I run programs that need more memory between them. As long as I have disk space, I can commit more memory – I’ll just see disk thrashing if I do too much multitasking in that little RAM. Equally, though, I can depend on having 16 MiB commit available for apps that care, as long as I keep 16 MiB of disk free; that might be as simple as 16 apps running, each app needing 1 MiB each, but only two in active use at any time. No need to close background apps unless I run low on disk space – they just end up in the page file.

    Then you get an application that depends on the size of the page file matching the size of the committed memory. You could change the meaning, but then you break it – nothing needs the “correct” meaning, so when you change VM handling to decouple dwTotalPageFile from the committed memory limit, you leave the old meaning in place, even though it’s now a lie. Nothing breaks, some things don’t get broken, we’re all happy.

