In the search for the subtle source of the problem, you sometimes overlook the obvious one


A customer was encountering a problem with lots of duplicate GUIDs. How is that possible? The whole point of the GUID generation algorithm is to work hard to avoid duplication. Was one of the fundamental assumptions of the algorithm broken? Maybe there was a duplicate MAC? Was the clock regressing?

One of my colleagues pointed out that in the search for the subtle source of the problem, you sometimes overlook the obvious one. In fact, this is the most common source of problems with so-called duplicate GUIDs. As he so tersely puts it: “A GUID can easily be duplicated by simply copying it.”

In other words, you have a duplicate GUID because you duplicated the GUID. This can happen, for example, if you have a Clone method on an object which creates an exact duplicate of the object. Since the GUID is a property of the object, it too gets cloned. Then you add the clone to the same database as the original.

Boom, instant duplicate.

Comments (36)
  1. David Walker says:

    We have some GUIDs in a SQL database.  We have more than one development database, and a QA database, and a user-acceptance-testing database, and a user-training database (with some "anonymized" data), and a production database.  

    Some of the GUIDs are the same across these databases, even when they are used as record IDs, since they mean the same things from database to database.  And some of the GUIDs are created on the fly and are not meant to be the same anywhere.

    It’s tempting to dismiss using GUIDs as database keys, since they are "so long" (128 bits), but once you have realized that if someone passes you an ID and they claim it’s a customer number, and you can tell them no, it’s not a customer number, it’s an order number, then you may realize their value.   You can’t do that if all tables have IDs that are auto-incrementing numbers.

  2. Anonymous says:

    Reminds me of the time somebody copy-pasted some code that he considered "boiler plate".  It contained a GUID that really had to be unique for multiple of these "boiler plate" sections to coexist.  This was, of course, realized much later when it actually became a problem.

  3. MS says:

    So what you’re saying is that the subtle source of the problem is that dummy in front of the computer, otherwise known as a PEBKAC error.

  4. Bob says:

    There’s at least a few different "versions" of the globally (allegedly) unique ID – and thanks to privacy concerns, the MAC address is no longer a popular choice.

    And unless you have a source of uniqueness, it actually is possible (though unlikely) to get duplicate IDs generated.

    Dumb developers just use "the GUID generation algorithm" and expect it to work. Good developers understand it, and know the tradeoffs involved in the many different GUID algorithms they’re choosing from.

    Of course, the subtle problems aren’t the only ones – really dumb developers blindly copy GUIDs in all sorts of crazy places.

  5. David Walker says:

    @bob: "the MAC address is no longer a popular choice"

    I don’t believe you.  I think the MAC address is part of the algorithm.  But I could be wrong.

  6. Scott Berry says:

    @David Walker: probably instead of saying "I don’t believe you", it would be good to check: http://en.wikipedia.org/wiki/Globally_Unique_Identifier

  7. Carl Daniel says:

    @David:

    http://msdn.microsoft.com/en-us/library/aa379322.aspx

    "For security reasons, UuidCreate was modified so that it no longer uses a machine’s MAC address to generate UUIDs. UuidCreateSequential was introduced to allow creation of UUIDs using the MAC address of a machine’s Ethernet card."

  8. someone else says:

    @Ken Cox:

    Ok, you’re probably just kidding, but there are of course ways to prove if a GUID is, in fact, globally unique. The exact proof depends on the method used.

  9. Michael Stum says:

    And just to add: Using Random() four times to generate the 4 parts of a GUID is NOT a good method. And yes, I’ve seen code for GUID generating that did this. Actually, I’ve seen many really bad "GUID" generators as code for languages that don’t have anything built in.

  10. Hey guys,

    Don’t try to convince me that GUIDs are unique until you test *all* of them against each other. Until you do, their uniqueness is merely idle speculation! <grin>

    Ken

  11. Worf says:

    I remember having to write a GUID generator from scratch – there were very few facilities I could use – I barely had a C library.

    I used the MAC as it was unique, and the onboard fast-running counter – that’s all I had.

    Coding PXE support into a bootloader…

  12. Worf says:

    Addenda to above:

    No RTC – the clock ticked from zero every power-on. Presumably it ran NTP or something once the OS booted.

    PXE was used to load in a bigger bootloader with more support – the boot flash was tiny and we only had around 64kB to be able to load the big loader from (much larger) NAND storage, validate it, and if not (e.g., corrupted or not present), do PXE to find a serve that could serve us the right bootloader, and write it into NAND.

  13. 640k says:
    1. Mail account GUIDs in Exchange are not unique? Why?

    2. Several stupid people I know are copying project files (vbproj). VS doesn’t warn about this when loading av multiproject solutions (doesn’t check for dupes i suppose), but behaves strangely/erroneously when debugging. Why?

  14. Thomas says:

    Ok, you’re probably just kidding, but there

    are of course ways to prove if a GUID is, in >fact, globally unique. The exact proof

    depends on the method used.

    Hogwash. A GUID has 128 bits. If you generate 2^129 GUIDS you will definitely have duplicates. How are you going to prove that nobody on the planet has ever generated a certain GUID?

  15. Kevin Eshbach says:

    What’s really scarey is when a lazy developer just randomly increments one of the guid’s digits.

  16. someone else says:

    @Thomas:

    With the right method, the sun will likely burn out before you run out of GUIDs. Do you have any idea how big a number 2^129 is??

  17. Ooh says:

    @someone else: You’re practically right, there are so many possibilities that the probability of a duplicate is low. However this doesn’t mean that it can’t happen.

    That’s why Thomas is right: you can’t prove that there won’t ever be duplicates or that anybody else has not already generated the same GUID. Try to prove it and you’ll get:

    Step 1: In total there are 2^128 possible GUIDs.

    Step 2: Thus as soon as 2^128+1 GUIDs have been created (by all people on all computers on the whole world) then at least one of them has to be a dup (pigeonhole principle).

  18. David Walker says:

    OK, I may have been wrong about the MAC address.  However, that link also says: "Computers with ethernet/token ring addresses generate UUIDs that are guaranteed to be globally unique".  How does it guarantee them to be globally unique if some portion of the Ethernet or token ring hardware is not part of the equation?

    As for the size of the GUID address space, while I’m sure some commenters were kidding, it’s claimed to be large enough to assign a unique GUID to every atom in the universe.  

    This universe, of course; we don’t consider multiverses here.

    Which means, obviously, that you can’t even store them all.  The address space is many, many orders of magnitude larger than you can even imagine.

  19. Maurits says:

    While it is true that the pigeonhole principle doesn’t kick in until we hit 2^128 GUIDs, probabilistic methods suggest we’ll hit our first (non-trivial) duplicate GUID after 2^64 or so (birthday paradox.)

  20. someone else says:

    Ok, let’s look at this GUID algorithm that I just made up (because it works):

    The first 48 bits will be the generators MAC address. The second will be a timestamp.

    An 80-bit counter is enough for 38,334,786,263 years if you generate a GUID every nanosecond. Per computer. That should do.

    Of course, there are still privacy concerns. Just think of a machine hooked up to the Internet where everybody can get one if needed and which is not supervised by anyone (we hope).

    [Well, you can’t use all 80 bits for the counter because some bits of the GUID are reserved (in order to prevent collisions among various algorithms), and you have to worry about two processors generating a GUID at the same time, plus various clock-related tweaks. But basically you reinvented the classical GUID algorithm. -Raymond]
  21. someone else says:

    “[Don’t forget to handle the case when somebody asks for two GUIDs in rapid sucession (less than 1 microsecond apart). -Raymond]”

    Um… Wait for a microsecond? Actually, I was thinking more along the lines of “the GUID generator generates GUIDs as fast as possible (and necessary; if none are needed, none are generated), and even if it generates a million per second, it will still last <insert long time span here>.” If you need GUIDs faster than it can generate, use a second one.

    [“If you need GUIDs faster than it can generate, use a second one.” If you have two GUID generators, how do you ensure that they don’t both spit out the same GUID? Indeed the simplified algorithm virtually guarantees that they will! -Raymond]
  22. Menno van Lavieren says:

    Birthday paradox indeed. But you should also consider the probability that two GUID’s will ‘meet’.

    Pesonaly I think for GUID to implement Clone() is a bug. It doesn’t makes sense and is wrong. Clone is about creating a different object, so that when I poke in one the other stays unaffected. The objects have their own identity from the cloning on.

    Come to think of it, you can abuse clone() to implement transactional behaviour.

    [It’s not that GUID implements Clone(). The structure inside which the GUID is embedded has a Clone() which does a memberwise copy. Watch:
    S *CloneStructure(const S *p)
    {
     S *q = new S;
     *q = *p;
     return q;
    }
    

    -Raymond]

  23. someone else says:

    Oh, that should have been microsecond, not nano. Damn prefixes …

    A GUID generator should be single threaded. And failsafe (first increase the counter, *then* generate the GUID). And even a 64-bit counter would last long enough (584,942 years at one GUID per microsecond).

    (I didn’t look up the exact algorithm because that wasn’t necessary for the proof.)

    [Don’t forget to handle the case when somebody asks for two GUIDs in rapid sucession (less than 1 microsecond apart). -Raymond]
  24. Ben Voigt says:

    @someone else:

    That doesn’t even begin to guarantee global uniqueness.  What it tries to do is guarantee uniqueness between different users of the same software.  But consider what happens when someone without an Ethernet card generates a GUID.  Or someone else’s GUID software uses (timestamp, MAC) instead of (MAC, timestamp).  You can easily have collisions with other algorithms.  Which is why there cannot be a proof of uniqueness.

  25. David Walker says:

    Don’t forget that “when someone without an Ethernet card generates a GUID”, it’s not likely that anyone will ever know, hence there won’t be a collision in a meningful sense.  The computer without an Ethernet card is pretty much limited to living in its own little world.

    Unless the computer communicates with the outside world in some non-Ethernet or non-Ethernet-like way.

    [In the old days (before laptops had network hardware built-in), this was pretty common. Your machine has no network card, but you generate CLSIDs for your COM object from it. When you return to the office, you dock your laptop (connecting it to the network) and upload your project to the network. You can still see some GUIDs created this way in the registry today.-Raymond]
  26. David Walker says:

    You’re right, Raymond, I overlooked that possibility.

  27. Jeroen Mostert says:

    GUIDs are not absolutely unique across all situations. They have 128 bits, not an infinite number of bits. The appropriate question is not to ask if there can be collisions (there can be) but how likely they are in a given situation. No matter how crafty your algorithm, you could always be tripped up by some clown with another algorithm designed to collide with yours. In particular, a trivial way of doing this is by repeating already generated GUIDs. Which is the subject of this post, so that’s coming full circle for you.

    That reminds me of a debate I had on how to create a temporary file (ignoring the existing library functions for a moment). I proposed using a GUID for the file name and failing if a file by that name couldn’t be created, and was chewed out by two people who pointed out that GUIDs, despite the name, are not necessarily unique, not even on the same machine. In particular, it was pointed out that this didn’t take into account the existing files in the directory that might clash, despite the fact that this is much less likely to happen with GUIDs than with most other schemes (like GetTempFileName(), which is easily thwarted by a mere 65,536 pre-existing files of a particular name), even taking malice into account.

    Instead they unequivocally agreed that using an algorithm that would deterministically fail only when the available names ran out was a better choice, even if that algorithm was less efficient. My argument that the *likelihood* of the GUID not being unique was so incredibly small that you would much sooner need to worry about your file system, your computer or your solar system breaking down was scoffed at; the possibility of failure by the algorithm itself, no matter how jawdroppingly minuscule, was unacceptable. They went to great lengths to point out my idea was "playing the odds" and fundamentally flawed.

    I learned from that that some (most?) programmers do not appreciate probabilistic arguments and will only respect "certainty", even if the "certain" algorithm will "certainly" fail in reality much sooner than the "unpredictable" algorithm will, and even if the "certain" algorithm has other undesirable properties. This distrustful attitude towards small probabilities is healthy in general (you don’t want to hear someone argue that "this race condition is so unlikely") but elevating it to dogma is going too far.

  28. someone else says:

    “[“If you need GUIDs faster than it can generate, use a second one.” If you have two GUID generators, how do you ensure that they don’t both spit out the same GUID? Indeed the simplified algorithm virtually guarantees that they will! -Raymond]”

    Because they have different network cards, and hence, different GUIDs. By “generator,” I meant an actual physical thingy, not a program.

    [Oh, good luck getting people to buy a GUID generator device whenever they need a GUID. -Raymond]
  29. someone else says:

    “[Oh, good luck getting people to buy a GUID generator device whenever they need a GUID. -Raymond]”

    That’s what we have the Internet for. Have some trusted parties set up public GUID generators. Yes, I firmly believe in (flat, reliable and reasonably fast) Internet access for everyone.

    [I’m pretty sure those public GUID generators would be asked to generate GUIDs faster than 1 per microsecond. “Your GUID requests are important to us, and they will be serviced in the order they were received.” (I can see the bug reports now. “Your program runs slowly because the GUID generator is overloaded.”) -Raymond]
  30. mikeb says:

    Using my patent-pending GUID generator physical thingy internet serivce, you can get all the unique GUIDs you want for the low, low price of US$0.05 each.  

    Remember that each GUID is completely unique – no two are alike.  Therefore their value as collectables can only increase over time.

    To get you started with your own GUID collection, here are 6 GUIDs *absolutely free*!!!.  

       – 1084b315-e21f-4fed-86e4-b2550c054b31

       – 9a02559b-af6d-41b2-9e85-f25d0440d732

       – fded0036-69da-498d-a000-24f04b47027c

       – ed6fae15-a98c-4375-b359-b53e55334bcd

       – 3e4f612b-9e8d-4894-b4e3-128b006db38d

       – bce26c02-ea8f-4c64-9cd3-0b8eeda827be

    You’ll never find an offer like this anywhere else or at any other time.  Hurry before they run out!!!

    Visit http://www.guid-ducks.com

    Operators are waiting!

  31. someone else says:

    “[I’m pretty sure those public GUID generators would be asked to generate GUIDs faster than 1 per microsecond. "Your GUID requests are important to us, and they will be serviced in the order they were received." (I can see the bug reports now. "Your program runs slowly because the GUID generator is overloaded.") -Raymond]”

    Cache, distribute, compress. Instead of transmitting every single GUID, you get the OK from the generator to get GUIDs [fixed part].N to [fixed part].(N+1000) (that dot is concatenation). And you can query several generators at the same time.

    (And since pseudorandom GUIDs can be predicted reasonably well, there is little reason not to use consecutive numbers).

  32. non-unique global identifer says:

    Then what if you want to genereate 2^128+1 uuids in 1 second? The whole concept with generating "unique" values with a fixed bitlength is flawed.

  33. Jeroen Mostert says:

    "Then what if you want to genereate 2^128+1 uuids in 1 second?"

    A system fast enough to generate them would be only six orders of magnitude away from doing operations in Planck time, which is the smallest meaningful unit of time under current known natural laws. There is no conceivable use for that many unique identifiers in such a short timespan. By the time hardware arrives that’s capable of this and scenarios that require it, though (if ever) we should have the GUID2, which will have 256 bits.

    I agree that a fully general solution (with truly unique values) needs an infinite number of bits. Obviously, this approach comes with its own implementation issues, but in many cases it can be approximated by using the binary representation of whatever you want to assign a unique identifier to.

  34. avek says:

    There is no conceivable use for that many unique identifiers in such a short timespan.

    Not only that, but it also requires ridiculous amounts of memory to store all that IDs. Something like 500000 times the weight of Earth with today’s technology.

    But it doesn’t matter, after all. Variable-length string IDs won’t be duplicated by Clone() method, just the same as fixed-length GUIDs will not be. No matter how long the identifier is, if it’s blindly propagated into copies without change, it won’t be unique in those copies.

  35. Gabe says:

    Menno, I think it would be a bug if GUID didn’t implement Clone(). Don’t forget that sometimes (most of the time) you WANT the GUID to be copied. It’s only when you’re generating a new object from an old object with a GUID that you want a new one to be generated.

    For example, if I have an object with GUIDs for identity and parentIdentity, I might want a copy of the object to have a new identity but I want it to always have the same parentIdentity.

  36. someone else says:

    “Not only that, but it also requires ridiculous amounts of memory to store all that IDs. Something like 500000 times the weight of Earth with today’s technology.”

    And at least 6.9 zettajoule of energy. That’s 128 * 2^128 * 1 eV or almost 2 exawatt-hours or over 275,000,000,000 years of Three Mile Island’s reactor output.

Comments are closed.