GUIDs are globally unique, but substrings of GUIDs aren’t


A customer needed to generate an 8-byte unique value, and their initial idea was to generate a GUID and throw away the second half, keeping the first eight bytes. They wanted to know if this was a good idea.

No, it's not a good idea.

The GUID generation algorithm relies on the fact that it has all 16 bytes to use to establish uniqueness, and if you throw away half of it, you lose the uniqueness. There are multiple GUID generation algorithms, but I'll pick one of them for concreteness, specifically the version described in this Internet draft.

The first 60 bits of the GUID encode a timestamp, the precise format of which is not important.

The next four bits are always 0001, which identify that this GUID was generated by "algorithm 1". The version field is necessary to ensure that two GUID generation algorithms do not accidentally generate the same GUID. The algorithms are designed so that a particular algorithm doesn't generate the same GUID twice, but without a version field, there would be no way to ensure that some other algorithm wouldn't generate the same GUID by some systematic collision.

The next 14 bits are "emergency uniquifier bits"; we'll look at them later, because they are the ones that fine tune the overall algorithm.

The next two bits are reserved and fixed at 01.

The last 48 bits are the unique address of the computer's network card. If the computer does not have a network card, set the top bit and use a random number generator for the other 47. No valid network card will have the top bit set in its address, so there is no possibility that a GUID generated from a computer without a network card will accidentally collide with a GUID generated from a computer with a network card.

Once you take it apart, the bits of the GUID break down like this:

  • 60 bits of timestamp,
  • 48 bits of computer identifier,
  • 14 bits of uniquifier, and
  • six bits are fixed,

for a total of 128 bits.

The goal of this algorithm is to use the combination of time and location ("space-time coordinates" for the relativity geeks out there) as the uniqueness key. However, timekeeping is not perfect, so there's a possibility that, for example, two GUIDs are generated in rapid succession from the same machine, so close to each other in time that the timestamp would be the same. That's where the uniquifier comes in. When time appears to have stood still (if two requests for a GUID are made in rapid succession) or gone backward (if the system clock is set to a new time earlier than what it was), the uniquifier is incremented so that GUIDs generated from the "second time it was five o'clock" don't collide with those generated "the first time it was five o'clock".

Once you see how it all works, it's clear that you can't just throw away part of the GUID since all the parts (well, except for the fixed parts) work together to establish the uniqueness. If you take any of the three parts away, the algorithm falls apart. In particular, keeping just the first eight bytes (64 bits) gives you the timestamp and four constant bits; in other words, all you have is a timestamp, not a GUID.

Since it's just a timestamp, you can have collisions. If two computers generate one of these "truncated GUIDs" at the same time, they will generate the same result. Or if the system clock goes backward in time due to a clock reset, you'll start regenerating GUIDs that you had generated the first time it was that time.

Upon further investigation, the customer really didn't need global uniqueness. The value merely had to be unique among a cluster of a half dozen computers. Once you understand why the GUID generation algorithm works, you can reimplement it on a smaller scale:

  • Four bits to encode the computer number,
  • 56 bits for the timestamp, and
  • four bits as a uniquifier.

We can reduce the number of bits to make the computer unique since the number of computers in the cluster is bounded, and we can reduce the number of bits in the timestamp by assuming that the program won't be in service 200 years from now, or that if it is, the items that were using these unique values are no longer relevant. At 100 nanoseconds per tick, 2^56 ticks will take 228 years to elapse. (Extending the range beyond 228 years is left as an exercise, but it's wasted effort, because you're going to hit the 16-computer limit first!)

You can get away with a four-bit uniquifier by assuming that the clock won't drift more than an hour out of skew (say) and that the clock won't reset more than sixteen times per hour. Since you're running under a controlled environment, you can make this one of the rules for running your computing cluster.

Comments (40)
  1. Adrian says:

    Nice analysis.

    I never knew about the algorithm identifier in the GUID.  I remember when there was a privacy concern since GUIDs could be tracked back to the machine that generated them (by the network card address).  As I understand it, the algorithm used by Windows was changed.  It’s neat to see that GUIDs generated by different algorithms can’t collied.

  2. Joe Enos says:

    Wouldn’t a truly random GUID generator be better suited?  If all 128 bits are random, then you could take a subset of them and still be reasonably unique – not as unique as the full 128, but still pretty good.

    The non-random GUID doesn’t really make a lot of sense to me.  If truly random, the odds of finding a duplicate are incredibly ridiculously low – if you have it based on a timestamp, network card, etc., you can run into duplicates much easier, especially as CPUs get faster, and as more and more pieces of hardware get added to the global network.

  3. .. says:

    You basically cannot implement a truly random generator without specific hardware support.

    You can always improve your 128bit UUID adding a random generated number to it (of at least 64bit itself.. remember birthday paradox); adding a second random number to an already random number will not improve the situation that much unless taken from a completely different generator.

  4. Nate says:

    Ok, say it with me:

    "GUIDs are not defined to be random; they are defined to BE UNIQUE in space & time!"

    Joe, the customer didn’t ask for a random number, they asked for a unique number and more specifically about chopping a GUID in half.  Raymond explained why that was a bad idea. And he gave a way to generate a CLUSTER unique number with 0 risk of a collision when taken within the customer specified spatial and temporal domains.

    Also, using a pseudo-random number generator doesn’t help with uniqueness if all the computers use the same seed  (through the use of time or some other error of coding).

    My $0.02.

  5. The "random" GUIDs have an "algorithm identifier" of 0100 (4), FWIW.

    Don’t get me started on the KSDATAFORMAT_SUBTYPE_PCM and KSDATAFORMAT_SUBTYPE_IEEE_FLOAT pseudo-GUIDs.

  6. Josh says:

    Even better.  Sorry, didn’t know the time interval off the top of my head. :-)

    [I didn’t either, but I bothered to follow the link to the GUID spec. -Raymond]
  7. nathan_works says:

    Looks like they are up to algorithm 5, according to a newer spec than the one Raymond linked to: http://www.faqs.org/rfcs/rfc4122.html

  8. Brian says:

    It’s all fine and dandy until someone builds the 140,737,488,355,328th network card.

  9. mikeb says:

    @josh:

    "Not bad if I say so myself."

    So, you’re the creator of the GUID generation algorithm? :-)

  10. Ron Parker says:

    @Josh:

    Where do you get the number 16 from?  That seems to come from the 4-bit clock sequence number Raymond suggested for the 64-bit cluster-unique ID.  GUID algorithm 1 uses 14 bits for the clock sequence number, for 16,384 unique GUIDs per MAC address per 100ns interval, no?

    (Provided, of course, that your clock never goes backward.  If it does, and you’re generating GUIDs at a rate of 16,384 every 100ns, your risk of collision rises to near 100%.)

  11. njkayaker says:

    [I didn’t either, but I bothered to follow the link to the GUID spec. -Raymond]

    Great! A well that will never run dry!

  12. Alexandre Grigoriev says:

    Provided, of course, that your clock never goes backward

    Stupid clock correction in Windows makes it possible, unfortunately.

  13. Alexandre Grigoriev says:

    >[Actually, it’s 16 guaranteed unique GUIDs per 100-nanosecond interval, or 160 million unique GUIDs per second. -Raymond]

    You mean your tick rate is 10 MHz? 100 ns is just kernel time unit, not actual resolution, or rate at which it changes. It only changes about 15 times per second, depending on HAL.

    [If you read the GUID spec (which I linked to) you can see how you can generate timestamps with greater precision than the clock tick. -Raymond]
  14. mikeb says:

    @Ron Parker:

    The clock sequence field is not intended to be used to allow more than one GUID to be generated within a 100ns interval.  The clock sequence field is specifically intended to handle the situation where the clock ‘goes back in time’ (due to a resetting the clock, restarting the machine or whatever).

    What Raymond calls 56 bits of time stamp with 4 bits of ‘uniquifier’ is really a 60 bit time stamp field with 100ns resolution. What’s being described as 56 bit of timestamp with 4 bits uniquifier is probably that the clock being used to get the timestamp does not have a 100ns resolution, and the implementation is using a technique like the one described in RFC 4122 section 4.2.1.2 to get up to 16 UUID’s per native clock tick.

    Basically with algorithm 1 (according to RFC4122) for a single generator instance, you get at most one UUID per 100ns interval. If you need more, the RFC suggests adding more MAC addresses so you can have several instances of the UUID gen going concurrently.  If you can’t do something like that you have to stall or return an error.

    See RFC 4122 4.2.1.2.

  15. Ron Parker says:

    @Alexandre:

    If you can get anything that runs Windows to generate 16,384 GUIDs every 100ns, tell me what the stock market looks like there in the 160+ GHz future.  And where do you store them all?  If you’re not generating them at a sustained rate that’s a significant fraction of that, the risk of collision goes way down, even if the clock goes backwards.

    RFC 4122 actually has a provision for clocks that don’t tick every 100ns, allowing you to use the lower bits of the timestamp as a serial number.  It’s not clear to me on first reading how that serial number interacts with the time sequence number, so the number of GUIDs you can generate on such a system without causing collisions might not be quite as high.

  16. Ron Parker says:

    @mikeb:

    I see.  The RFC says "If the state was available, but the saved timestamp is later than the current timestamp, increment the clock sequence value" but Raymond said "When time appears to have stood still … or gone backward … the uniquifier is incremented."  When I read the RFC, I missed the subtle difference.  So at most one GUID every 100ns it is.

  17. mikeb says:

    [If you read the GUID spec (which I linked to) you can see how you can generate timestamps with greater precision than the clock tick. -Raymond]

    But you can’t generate timestamps with a higher resolution than 100ns – that’s what the timestamp field is defined to be.  So the spec discusses, for example, how to deal with the fact that your clock can only generate ticks with 10 microsecond resolution and you want your generator to be able to generate more than one UUID within the span of 10 microseconds.

    If you need to generate more than one UUID within a 100ns span (on average), you either need to have addition MAC addresses to ‘uniquify’ the node ID field or you need to have your generator stall or return an error.

    On the whole though, I don’t think there are too many people having problems with generating too many GUIDs in too short a time period.  My machine has less than a 4GHz clock – so that’s fewer than 400 clock cycles per 100ns interval here (if my cipherin’ is correct).

  18. Dave says:

    some other interesting notes about GUIDs that try to use MAC addresses as unique values:

    1) many network cards and devices now allow you to set the MAC address to whatever you want; some cable modems and that kind of thing rely on this functionality, apparently.

    2) for a while, at least, the dummy network adaptors created by Windows for dial-up connections and the like all had the same MAC address (i think it was "DEST" or something).  there were a lot of GUIDs floating around that had that decidedly non-unique value…

  19. Anonymous says:

    A related post by Larry Osterman:

    UUIDs are only unique if you generate them…

    http://blogs.msdn.com/larryosterman/archive/2005/07/21/441417.aspx

  20. eff Five says:

    @Thor

    Very nicely put. I’m adding “what are the requirements on which a solution is based and can any of them be relaxed” to the “words to code by” list. I especially like it because its got understanding the requirements baked in

    Similar items on the list

    “never fix a bug you don’t understand” –http://www.links.org/?p=327

    And

    “A cache with a bad policy is another name for a memory leak” – http://blogs.msdn.com/oldnewthing/archive/2006/05/02/588350.aspx

    “Random number generation is hard. [..] leave it to the experts. – http://blogs.msdn.com/oldnewthing/archive/2004/04/12/111596.aspx

  21. Judge Dredd says:

    > “never fix a bug you don’t understand” –http://www.links.org/?p=327

    I disagree with the post..

    Better is

    Never leave ambiguous "bug-smelling" code around, at least not uncommented, and never rely on undetermined system behavior (OpenSSL relies on the memory being uninitialized.. which is a false assumption because it’s not part of any contract and both OS, the loader, libc et al are free to zero-out – or something else – the uninit. memory).

    If it smells like a bug, it’s a bug, at least in form. Correct code should never raise a suspect.

  22. Josh says:

    @Joe:

    Because there is a *chance* of collision, even with pure randomness.  Remember, the birthday problem (aka birthday paradox) means that the time before unacceptably high probability of collision is much lower than the number of possible options.  e.g. For truly random 128 bit numbers (3.4e38 options), you’d have a 50% chance of collision after generating about 22 quintillion GUIDs (2.17e19).  Problem is, this assumes we are *really* random.  And generating numbers that are truly random, to the point of uniqueness, is nearly impossible.  For example, one means of generating very good random numbers is having an antenna count ticks in the cosmic background radiation over time.  If an odd number of ticks occur in a time frame, you get a 1, even, a 0, continue until the number is long enough.  But what if two antennas are generating at the same time in close proximity?  Odds of collision increase greatly.  And of course, it would take a while to collect the arbitrary randomness.  Virtually every form of random number generation would be in some way dependent on the time and location at which it was generated.  And any less extensive randomness generator is likely to be slightly non-random, leading to collisions occurring *much* more often.

    The non-random GUID, if you read the description, allows for any given machine to pump out 16 *guaranteed* unique GUIDs per second.  And it even manages to conserve 6 bits that aren’t part of the uniqueness, to allow for future advances in GUID design.  Not bad if I say so myself.  Keep in mind, this isn’t “based on” a timestamp or network card, it is a time stamp and specific network card ID.  You can permanently remove a whole block of GUIDs from existence by smashing network cards.  No risk of duplicates, period.

    [Actually, it’s 16 guaranteed unique GUIDs per 100-nanosecond interval, or 160 million unique GUIDs per second. -Raymond]
  23. Miles Archer says:

    Judge Dred,

    I hope you don’t work on any of my projects. There’s a risk to fixing any code. If you don’t weigh the risk vs reward you’re asking for trouble. If you worked on my project and fixed code that "smells" without justification, their would be hell to pay.

  24. Thor says:

    Beyond the specifics of the GUID generation, I love the initial problem statement.  (I am fairly new to the blog and haven’t finished going back and reading them all but this must be a pattern.)  

    The customer’s basic question is– "for this problem, there is a solution, but I want it for less".

    Without changing the a basic requirement (like local uniqueness versus global uniqueness in this example), it seems the only way this would be possibile is if the original solution was suboptimal.  While I frequently think that a given solution seems suboptimal, I almost always find that, no, whoever found it knew *tons* more about the problem than I had any idea existed.

    Granted questioning an optimal solution is how technology improves, but in a case like this, unless you are master of the intricacies of unique number generation, the chances seem quite slim that you will achieve the requirement (unique) with fewer resources (half the byters).  

    So, the root of the question pattern may be– what are the requirements on which a solution is based and can any of them be relaxed?

  25. Dean Harding says:

    "OpenSSL relies on the memory being uninitialized"

    If you’re referring to the "Debian incident", that is not true – OpenSSL does not rely on memory being uninitialized.

  26. Jeff Tyrrill says:

    I really do wish hardware random number generators were standard order in modern processors, so we didn’t have to rely on these elaborate algorithms simply to guarantee uniqueness, which are riddled with privacy implications. (I realize newer algorithms are much improved from the one in this post, but the point remains that not having access to guaranteed entropic data means a slew of pragmatic assumptions must be made and carefully weighted in the development of any unique number generation algorithm.)

    Using a cryptographic PRNG and guaranteed entropic source data, you don’t need silly constructs like an "algorithm" counter to prevent collisions between "algorithms". The size of the bitspace (128 bits) is guarantee enough, even with arbitrarily many algorithms, as long as they all have entropic source data and are reliable CSPRNGs.

    (Whatever the probability of number collision, it will be higher with a less entropic algorithm than a cryptographic PRNG anyway, so even if 128 bits isn’t enough bitspace, you’re still better off using a CSPRNG. A CSPRNG satisfies a greater requirement than uniqueness, but it does satisfy uniqueness also.)

    http://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator

  27. Yuhong Bao says:

    Similarly, hashing 2 7-byte values aren’t as secure as hashing 1 14-byte value, as the LM hash showed.

  28. Narr says:

    Whatever the probability of number collision,

    it will be higher with a less entropic

    algorithm than a cryptographic PRNG anyway

    Global registry? That would actually guarantee uniqueness. A 1:1, one-way hashed hardware ID + timestamp comes pretty close, too.

    A CSPRNG satisfies a greater requirement than

    uniqueness, but it does satisfy uniqueness also.

    How? It’s entirely possible for a collision to occur, so there’s no guarantee, and it trivially does not satisfy. Also, what is your ‘greater requirement’?

  29. Xepol says:

    Or they could stop being data-cheap and just use a GUID….

  30. SuperKoko says:

    From mikeb:

    "

    But you can’t generate timestamps with a higher resolution than 100ns – that’s what the timestamp field is defined to be.

    "

    On my computer, QueryPerformanceCounter has 10ns accuracy (the function call time itself is irrelevant). It uses the on-board clock which runs at approximatively 100Mhz.

    Why would you use GetTickCount?

    From Jeff Tyrrill:

    "

    I really do wish hardware random number generators were standard order in modern processors, so we didn’t have to rely on these elaborate algorithms simply to guarantee uniqueness

    "

    On Linux, there’s /dev/random, based on keystrokes entropy, and things like noise recorded from the sound card.

    Birthday’s paradoxes remain.

  31. AndyC says:

    @Jeff Tyrrill

    "I really do wish hardware random number generators were standard order in modern processors, so we didn’t have to rely on these elaborate algorithms simply to guarantee uniqueness"

    If your random numbers are unique, your random number generator is broken.

  32. mikeb says:

    @SuperKoko:

    > On my computer, QueryPerformanceCounter has 10ns accuracy <<

    That’s nice, but it doesn’t solve the theoretical problem of generating more than one UUID every 100ns (on average) – the timestamp field in the UUID is in units of 100ns.

  33. pZy says:

    German Wikipedia says for privacy reasons newer versions of Windows would not make use of the MAC address for creating GUIDs.

    Otherwise it would be possible to draw conclusions about the creator.

    Unfortunatley I didn’t find that statement in the english wiki.

  34. SuperKoko says:

    "

    That’s nice, but it doesn’t solve the theoretical problem of generating more than one UUID every 100ns (on average) – the timestamp field in the UUID is in units of 100ns.

    "

    That’s still better than one GUID every 10 milliseconds with GetTickCount.

  35. mikeb says:

    @SuperKoko:

    > That’s still better than one GUID every 10 milliseconds with GetTickCount. <<

    But you’re not limited to a single UUID per 10ms tick, even if that’s the hardware clock you might be stuck with. That’s one of the things that section 4.2.1.2 of RFC 4122 discusses:


    A high resolution timestamp can be simulated by keeping a count of the number of UUIDs that have been generated with the same value of the system time, and using it to construct the low order bits of the timestamp.  The count will range between zero and the number of 100-nanosecond intervals per system time interval.


    So you can still get a 10000 unique UUIDs per 10ms tick.  But my point wasn’t what timer resolution is best to use, but that by definition you can’t get better resolution than 100ns resolution using Algorithm 1 to generate UUIDs.  Oh, and that the clock sequence field is not intended to be used to simulate a higher resolution – it’s intended to deal with clocks getting out of ‘sync’ (or going back in time).

  36. Andy R says:

    Your reduced GUID format is a good idea, not used often enough, but it makes one more assumption… that there is only one thread or process running on each computer.

  37. SuperKoko says:

    "

    Your reduced GUID format is a good idea, not used often enough, but it makes one more assumption… that there is only one thread or process running on each computer.

    "

    No. It just assumes that you use only one generator function (e.g. in a dynamic or static library). In that case, it’s free to use shared variables and mutexes to safely manager GUIDs generation.

  38. The list is a little longer today because of not posting last week. Enjoy! Microsoft Advanced Windows

Comments are closed.