Why are the module timestamps in Windows 10 so nonsensical?

One of the fields in the Portable Executable (PE) header is called TimeDateStamp. It's a 32-bit value representing the time the file was created, in the form of seconds since January 1, 1970 UTC. But starting in Windows 10, those timestamps are all nonsense. If you look at the timestamps of various files, you'll see that they appear to be random numbers, completely unrelated to any plausible build time. What's going on?
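As a minimal sketch of where this field lives (offsets per the PE specification: the DOS header stores e_lfanew at 0x3C, and TimeDateStamp sits eight bytes into the NT headers, after the 4-byte signature, 2-byte Machine, and 2-byte NumberOfSections):

```cpp
// Minimal sketch: print a PE file's TimeDateStamp, raw and decoded.
// Assumes a little-endian host, which matches the PE on-disk format.
#include <cstdint>
#include <cstdio>
#include <ctime>

int main(int argc, char** argv)
{
    if (argc < 2) { std::fprintf(stderr, "usage: %s file.dll\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    uint32_t e_lfanew = 0, stamp = 0;
    std::fseek(f, 0x3C, SEEK_SET);                  // IMAGE_DOS_HEADER.e_lfanew
    std::fread(&e_lfanew, sizeof e_lfanew, 1, f);
    std::fseek(f, (long)(e_lfanew + 8), SEEK_SET);  // IMAGE_FILE_HEADER.TimeDateStamp
    std::fread(&stamp, sizeof stamp, 1, f);
    std::fclose(f);

    std::time_t t = stamp;  // decoding is meaningful only if it really is a time
    std::printf("TimeDateStamp = 0x%08X (%s)", stamp, std::asctime(std::gmtime(&t)));
}
```

Run this against a Windows 10 binary and the decoded date comes out as gibberish, often decades away from the actual release, which is the puzzle this post answers.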

One of the changes to the Windows engineering system that began with Windows 10 is the move toward reproducible builds. This means that if you start with the exact same source code, then you should finish with the exact same binary code.

There are lots of things that hamper reproducibility. One source is the language itself. For example, an anonymous namespace may not have a programmatically-accessible name, but since the objects within it have external linkage, it needs to have a name nonetheless, and the name must be different for different source files. How does the compiler ensure that the names are unique? Does it use a random number generator to generate these names? Is it a hash of the file name?
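A sketch of the problem (two separate source files; the mangled names a real compiler produces will differ from whatever you imagine here):

```cpp
// red.cpp
namespace {            // no name in the source...
    int counter;       // ...yet 'counter' still needs a symbol that cannot
}                      // collide with any other file's anonymous 'counter'

// blue.cpp
namespace {
    int counter;       // so the compiler synthesizes a per-file namespace
}                      // name; if that name comes from a random number or
                       // the file's absolute path, the build stops being
                       // reproducible
```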

Another source is the compiler's internal code generation algorithms. For example, if a compiler chooses between two optimizations depending on how much RAM is available, or how powerful the processor is, then the result is not reproducible, because two systems with different hardware configurations may end up producing different outputs. Or if the optimizer has a failsafe switch that abandons an optimization if the algorithm is still running after 500ms. Or if the optimizer uses a non-deterministic register allocation strategy. Or if the compiler uses a deterministic algorithm ("sort all local variables") with a non-deterministic criterion ("... by the heap address of the data structure we use to keep track of each variable").
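That last hazard is easy to reproduce outside a compiler. A hypothetical sketch: the sort itself is deterministic, but the key is a pointer value, which can vary from run to run and from machine to machine:

```cpp
// Sketch: deterministic algorithm, non-deterministic criterion.
#include <algorithm>
#include <cstdio>
#include <memory>
#include <vector>

struct VarInfo { const char* name; };   // stand-in for the compiler's bookkeeping

int main()
{
    std::vector<std::unique_ptr<VarInfo>> vars;
    for (auto name : { "alpha", "beta", "gamma", "delta" })
        vars.push_back(std::make_unique<VarInfo>(VarInfo{ name }));

    // The bug: ordering by the heap address of each tracking record.
    // The order can vary with the allocator, ASLR, and prior allocations.
    std::sort(vars.begin(), vars.end(),
              [](const auto& a, const auto& b) { return a.get() < b.get(); });

    for (const auto& v : vars) std::puts(v->name);
    // A reproducible compiler would order by something stable instead,
    // such as the variable's name or its declaration order.
}
```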

There are also inputs to the system outside the compiler that hamper reproducibility. For example, the full path to the file being compiled shows up in expansions of the __FILE__ preprocessor macro, which causes problems when the code is built on different machines with different names for the root directory that holds the source code. (Or even on the same machine with two copies of the source code.) There may be files auto-generated by the build process that go into the compiler (for example, the output of compiler-compilers); those need to be deterministic too.
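The classic offender is a logging or assertion macro. A small sketch of how the path leaks into the image (the paths in the comment are hypothetical):

```cpp
// Sketch: __FILE__ expands to the path the compiler was given, so the
// string literal (build directory and all) is embedded in the binary.
// Building identical source from C:\src and D:\enlistment2 yields
// different bytes.
#include <cstdio>

#define LOG_ERROR(msg) \
    std::fprintf(stderr, "%s(%d): %s\n", __FILE__, __LINE__, (msg))

int main()
{
    LOG_ERROR("something failed");   // the full source path ships in the image
}
```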

Timestamps are another source of non-determinism. Even if all the inputs are identical, the outputs will still be different because of the timestamps.

Okay, at least we can fix the issue in the PE file format: setting the timestamp to a hash of the resulting binary preserves reproducibility.
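The exact hash Windows uses isn't specified here; as a sketch of the idea (using FNV-1a purely for illustration), the tool hashes the image with the timestamp field itself treated as zero, then patches the result in:

```cpp
// Sketch: derive the "timestamp" from the file's contents, so that
// identical inputs yield identical stamps and any change to the image
// changes the stamp.
#include <cstddef>
#include <cstdint>

uint32_t ComputeImageStamp(const uint8_t* image, size_t size, size_t stampOffset)
{
    uint32_t h = 2166136261u;                       // FNV-1a offset basis
    for (size_t i = 0; i < size; i++) {
        // Treat the 4-byte timestamp field as zero while hashing, to
        // avoid the chicken-and-egg problem of hashing the hash.
        uint8_t byte = (i >= stampOffset && i < stampOffset + 4) ? 0 : image[i];
        h = (h ^ byte) * 16777619u;                 // FNV-1a prime
    }
    return h;                                        // store into TimeDateStamp
}
```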

"Okay, but why not set the file timestamp to the the timestamp of the source code the binary was created from? That way, it's still a timestamp at least." That still breaks reproducibility, because that means that touching a file without making any changes will result in a change in binary output.

Remember what the timestamp is used for: It's used by the module loader to determine whether bound imports should be trusted. We've already seen cases where the timestamp is inaccurate. For example, if you rebind a DLL, then the rebound DLL has the same timestamp as the original, rather than the timestamp of the rebind, because you don't want to break the bindings of other DLLs that bound to your DLL.

So the timestamp is already unreliable.

The timestamp is really a unique ID that tells the loader, "The exports of this DLL have not changed since the last time anybody bound to it." And a hash is a reproducible unique ID.
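Conceptually, the loader's check reduces to a comparison like this sketch (not the actual loader code):

```cpp
// Sketch: bindings recorded in the importing module are trusted only if
// the stamp captured at bind time still matches the DLL's current stamp.
#include <cstdint>

bool BindingsStillValid(uint32_t stampRecordedAtBindTime,
                        uint32_t stampInDllHeaderNow)
{
    // Any change to the DLL produces a new stamp (formerly a new build
    // time, now a new hash), so stale bindings are detected either way.
    return stampRecordedAtBindTime == stampInDllHeaderNow;
}
```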

Comments (25)
  1. Brian says:

    And, as many folks know, "a 32-bit value representing … seconds since January 1, 1970 UTC" is the traditional "Unix time" that has been abandoned by everyone who remembers "Y2K". It will be rolling back to 1970 in about 20 years.

    1. SimonRev says:

      Actually, as it is a signed 32 bit integer, it will be rolling back to 1902 or somewhere thereabouts.

  2. kantos says:

    The standard, at least as of N4659, says: "An unnamed namespace or a namespace declared directly or indirectly within an unnamed namespace has internal linkage. All other namespaces have external linkage." I know that MSVC didn't use to respect this, but as far as I'm aware, as of VS 2013 it was being respected, IIRC.

    1. Tanveer Badar says:

      And I am quite sure that Windows is not built using MSVC. It might be something based on MSVC, but the compiler would be different for internal consumption only.

      1. Peter Doubleday says:

        As far as I am aware, the NT kernel (and I could list the other important bits of Windows, but I won't) is indeed built by the compiler that sits underneath MSVC. And tested against that compiler. And indeed debugged against that compiler (I have done this on occasion).

        I’m not entirely sure why anybody would think otherwise. The timelines might be different: my group was stuck on VS10 for quite a lot of years, mostly because the tool-set for builds hadn’t caught up and/or been signed off, and the actual build tools are based on the command line rather than on a GUI (this is hardly a surprise), but — nope, the compiler is the compiler.

        1. kantos says:

          Peeking at the headers for ntdll and gdi32, it shows the linker version as 14.10, which would be a VS 2017 update. As far as I know, the external linkage for unnamed namespaces was fixed prior to that version (I removed a lot of static function prefixes when I migrated and verified this wasn't happening any more). I would not be surprised if insider builds are testing against newer compiler versions. Even Windows devs want new C++ features. Given that parts of the vNext release SDK depend on compiler version 15.3 or later (C++ coroutines in the C++/WinRT SDK components), this may be changing.

    2. Okay, then lambdas. Or template functions which take as type parameters types in unnamed namespaces.

  3. pc says:

    I’m curious about the reason behind needing reproducible builds, given all the challenges involved. Is it a trust thing, where somebody has audited a particular version of the source code, and needs to make sure that they can create the same compiled bits that are deployed somewhere? If so, I think you’d need a compiler that was also made from a reproducible build, preferably at least two from different vendors so that you can check their work with each other to ensure that there isn’t anything nefarious hidden inside the compiler.

    Seems like quite the challenge, though I know some fields (gambling machines, cryptocurrency back-ends, and so forth) find it worth the hassle.

    1. Brian says:

      One example is the gaming (gambling) industry. Gaming machines (slot machines, etc.) get their software certified by a commission that tests that the software is "fair". Any change to that software must be recertified (which is an expensive process). As a result, you need a way to rebuild the same source into a PE file with the same hash as the original build. I assume there are other industries with similar rules.

      1. Tanveer Badar says:

        Reproducible outputs also have the side effect of ensuring that the compiler always hits the same bug, in case it has any.

        1. Peter Doubleday says:

          Which makes it massively easier to debug the problem, reason about the bug, and fix it.

          Sans a provably bug-free release of any program whatsoever, I can’t see why you feel this is a worrying feature.

          1. Tanveer Badar says:

            How did you arrive at the conclusion that I think it is a worrying feature? It is a very good feature.

      2. pc says:

        Right. I guess what I was asking was whether they needed to do this because there are, for example, some slot machine manufacturers that want their product to run on Windows 10, though I suppose that’s not a question that I’m going to actually get an answer to here.

    2. https://reproducible-builds.org/ has a pretty good explanation of the motivation behind reproducible builds: primarily that you can be sure that a given set of binaries comes from a given set of source files (and toolchain). If the binary changes each time you rebuild the source (even if it’s unchanged), you lose the ability to make that determination.

    3. Reproducible builds are essential for build caches and reducing testing time. Hash all the inputs to a build tool, see if you’ve built that thing already, if so, then grab the precompiled result. Consider: You added a #define to windows.h, so every C file needs to recompile, but almost nobody uses that #define, so the OBJ files are nearly all the same, and therefore you still get tons of cache hits on the linker. And nearly all of the resulting EXEs are byte-for-byte identical, so you don’t need to re-test them. If the output of the compiler were not reproducible, then you wouldn’t get a cache hit on the OBJ files, and you end up having to rebuild and re-test the entire system, for a #define that almost nobody uses!
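       As a sketch (all names hypothetical), the cache logic looks something like this; reproducible output is exactly what makes a cache hit interchangeable with a fresh compile:

       ```cpp
       // Sketch of a content-addressed build cache.
       #include <cstdint>
       #include <map>
       #include <string>

       static uint32_t Fnv1a(const std::string& s)
       {
           uint32_t h = 2166136261u;
           for (unsigned char c : s) h = (h ^ c) * 16777619u;
           return h;
       }

       // Stand-in for the real (expensive) tool.
       static std::string Compile(const std::string& source) { return "obj:" + source; }

       static std::map<uint32_t, std::string> g_cache;   // key -> compiled output

       std::string BuildOrFetch(const std::string& source,
                                const std::string& compilerVersion,
                                const std::string& flags)
       {
           // The key is a hash of every input that can affect the output.
           uint32_t key = Fnv1a(source + '\0' + compilerVersion + '\0' + flags);
           auto it = g_cache.find(key);
           if (it != g_cache.end()) return it->second;   // hit: skip the compile
           std::string obj = Compile(source);            // miss: do the work
           g_cache[key] = obj;
           return obj;
       }
       ```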

      1. DWalker07 says:

        Wow, that’s interesting. I hadn’t thought of those aspects — not having to re-test binaries that are byte-identical with a tested binary.

      2. pc says:

        Thank you very much for the reply. I was trying to figure out if there was any motivation beyond just the legal/trust ones, and those links show me that there’s a lot with the build process of large systems that can benefit.

    4. Dave says:

      Same here. In the security field we specifically want non-reproducible builds, or at least builds that are as different to each other as possible, because having everything laid out exactly identically in each build of a binary makes things much easier for attackers.

      1. smf says:

        You also want that when building DirectX for the Xbox, to prevent someone from patching the games to use desktop Windows DLLs.

    5. On top of what everyone else has said, we’ve found that for our work on Firefox having a reproducible build makes reasoning about changes to the build much easier. If you can generate the same exact binaries from the same source, then you can diff the build outputs when you make changes to the build and you don’t get lost in a sea of noise.

  4. JB says:

    This seems like it’s throwing out a number of fairly useful things for “reproducable builds” which are of little or no interest to most people.

  5. IanBoyd says:

    For people who have been trying to reverse engineer the PDB format, this little tidbit finally explains a problem: why module timestamps were so nonsensical in Windows 10.

    For a while I wondered if it was changed so that 1 tick is 2 seconds, and the high bit is set to indicate this custom behavior.

    This is nice to have.

  6. alegr1 says:

    The timestamp was once used to identify the PDB file and to store it on a symbol server. These days it's a GUID combined with a generation ("age") count. The "age" counter increments every time the binary gets built. The GUID persists until you delete the whole build directory. When you build a binary from scratch, the PDB gets a new GUID.

  7. 640k says:

    When was it ever a good decision to include this much random noise in a compiled binary? What were you thinking? If you had been a bit more careful before implementing these noise generators in your tool chain, you wouldn't have needed to refactor your whole tool chain from scratch. Of course, even with a tool chain claiming to generate reproducible results, there are no guarantees; no one can count on it; you have to assume the worst case anyway, i.e. random noise in every other byte.

  8. Bruce Dawson says:

    What about the GUID/age that are embedded in the binary? Are these also reproducible, or are they exempted from the reproducibility checking?

    > The timestamp is really a unique ID that tells the loader

    Well, that is *one* use of the timestamp. Another use is as part of the unique ID used when inserting PE files into a symbol server, as described here:

    In particular note that the path for a PE file on a symbol server is generated like this:

    “%s\%s\%s%s\%s” % (serverName, peName, timeStamp, imageSize, peName)

    Since the peName stays the same for a particular binary, and the imageSize frequently stays the same, the timeStamp *must* be different for every build to avoid collisions. And, 32 bits isn’t really enough – if you hit about 60,000 builds of the same binary on the same symbol server you have to start worrying about collisions, or hope that the file size is growing.
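    (For reference, the "about 60,000" figure matches the birthday bound for a 32-bit value: the probability of at least one collision among n stamps is approximately

    $$ P \approx 1 - e^{-n^2 / (2 \cdot 2^{32})}, $$

    which already reaches roughly 34% at n = 60,000.)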

Comments are closed.
