Let's recap: a GUID is a 128 bit integer that is used as a globally unique identifier. GUIDs are not a security system; they do not guarantee uniqueness in a world where hostile parties are deliberately attempting to cause collisions; rather, they provide a cheap and easy way for mutually benign parties to generate identifiers without collisions. One mechanism for ensuring global uniqueness is to generate the GUID so that its bits describe a unique position in spacetime: a machine with a specific network card at a specific time. The downside of this mechanism is that code artifacts with GUIDs embedded in them contain easily-decoded information about the machine used to generate the GUID. This naturally raises a privacy concern.

To address this concern, there is a second common method for generating GUIDs, and that is to choose the bits at random. Such GUIDs have a 4 as the first hex digit of the third section.

First off, what bits are we talking about when we say "the bits"? We already know that in a "random" GUID the first hex digit of the third section is always 4. Something I did not mention in the last episode was that there is additional version information stored in the GUID in the bits in the fourth section as well; you'll note that a GUID almost always has 8, 9, a or b as the first hex digit of the fourth section. So in total we have six bits reserved for version information, leaving 122 bits that can be chosen at random.

Second, why should we suppose that choosing a number at random produces uniqueness? Flipping a coin is random, but it certain does not produce a unique result! What we rely on here is probabilistic uniqueness. Flipping a single coin does not produce a unique result, but flipping the same coin 122 times in a row almost certainly produces a sequence of heads and tails that has never been seen before and will never be seen again.

Let's talk a bit about those probabilities. Suppose you have a particular randomly-generated GUID in hand. What is the probability that a *specific* time that you randomly generate another GUID will produce a collision with your particular GUID? If the bits are chosen randomly and uniformly, clearly the probability of collision is one in 2^{122}. Now, what is the probability that over n generations of GUIDs, you produce a collision with your particular GUID? Those are independent rare events, so the probabilities add (*); the probability of a collision is n in 2^{122}. This 2^{122 }is an astonishingly large number.

There are on the order 2^{30} personal computers in the world (and of course lots of hand-held devices or non-PC computing devices that have more or less the same levels of computing power, but lets ignore those). Let's assume that we put all those PCs in the world to the task of generating GUIDs; if each one can generate, say, 2^{20} GUIDs per second then after only about 2^{72} seconds -- **one hundred and fifty trillion years** -- you'll have a *very high* chance of generating a collision with your specific GUID. And the odds of collision get pretty good after only thirty trillion years.

But that's looking for a collision with a specific GUID. Clearly the chances are a lot better of generating a collision *somewhere else* along the way. Recall that a couple of years ago I analyzed how often you get any collision when generating random 32 bit numbers; it turns out that the probability of getting any collision gets extremely high when you get to around 2^{16} numbers generated. This generalizes; as a rule of thumb, the probability of getting a collision when generating a random n bit number gets large when you've generated around 2^{n/2} numbers. So if we put those billion PCs to work generating 122-bits-of-randomness GUIDs, the probability that two of them somewhere in there would collide gets really high after about 2^{61} GUIDs are generated. Since we're assuming that about 2^{30} machines are doing 2^{20} GUIDs per second, we'd expect a collision after about 2^{11} seconds, which is about an hour.

So clearly this system is not utterly foolproof; if we really, really wanted to, we could with high probability generate a GUID collision in only an hour, provided that we got every PC on the planet to dedicate an hour of time to doing so.

But of course we are not going to do that. The number of GUIDs generated per second worldwide is not anywhere even close to 2^{50}! I would be surprised if it were more than 2^{20 }GUIDs generated per second, worldwide, and therefore we could expect to wait about 2^{41} seconds, for there to be a reasonable chance of collision, which is about seventy thousand years. And if we are looking for a collision with a specific GUID, then again, it will take about a billion times longer than our initial estimate if we assume that a relatively small number of GUIDs are being generated worldwide per second.

So, in short: you should expect that any *particular* random GUID will have a collision some time in the next thirty billion trillion years, and that there should be a collision between *any two* GUIDs some time in the next seventy thousand years.

Those are pretty good odds.

Now, this is assuming that the GUIDs are chosen by a perfectly uniform random process. **They are not**. GUIDs are in practice generated by a high-quality **pseudo-random** number generator, not by a **crypto-strength** random number generator. Here are some **questions that I do not know the answers to**:

- What source of entropy is used to seed that pseudo-random number generator?
- How many bits of entropy are used in the seed?
- If you have two virtual machines running on the same physical machine, do they share any of their entropy?
- Are any of those entropy bits from sources that would identify the machine (such as the MAC address) or the person who created the GUID?
- Given perfect knowledge of the GUID generation algorithm and a specific GUID, would it be possible to deduce facts about the entropy bits that were used as the seed?
- Given two GUIDs, is it possible to deduce the probability that they were both generated from a pseudo-random number generator seeded with the same entropy? (And therefore highly likely to be from the same machine.)

I do not know the answers to any of these questions, and therefore it is wise for me to assume that the answers to the bottom four questions is "yes". Clearly it is far, far more difficult for someone to work out where and when a version-four GUID was create than a version-one GUID, which has that information directly in the GUID itself. But I do not know that it is *impossible*.

There are yet other techniques for generating GUIDs. If there is a 2 in the first hex digit of the third section then it is a version 1 GUID with some of the timestamp bits have slightly different meanings. If there is a 3 or 5 then the bits are created by running a cryptographic hash function over a unique string; the uniqueness of the string is then derived from the fact that it is typically a URL. But rather than go into the details of those more exotic GUIDs, I think I will leave off here.

**Summing up:**

- GUIDs are guaranteed to be unique but not guaranteed to be random.
**Do not use them as random numbers.** - GUIDs that are random numbers are
**not cryptographic strength**random numbers. - GUIDs are only unique when everyone cooperates; if someone wants to re-use a previously-generated GUID and thereby artificially create a collision, you cannot stop them.
**GUIDs are not a security mechanism.** - GUIDs have an internal structure; at least
**six of the bits are reserved**and have special meanings. - GUIDs are
**allowed to be generated sequentially**, and in practice often are. - GUIDs are
**only unique when taken as a whole**. - GUIDs can be generated using
**a variety of algorithms**. - GUIDs that are generated randomly are
**statistically highly unlikely to collide**in the foreseeable future. - GUIDs
**could reveal information**about the time and place they were created, either directly in the case of version one GUIDs, or via cryptanalysis in the case of version four GUIDs. - GUIDs might be generated by some
**entirely different algorithm**in the future.

(*) As the comments point out, this is an approximation that only holds if the probability is small and n is relatively small compared to the total number of possible outcomes.

It should also be taken into account that really collisions are a problem only if they happen in the same domain.

For example, if you use a GUID as an ID in a database, a conflict with a COM component identifier has no effect whatsoever. This reduces the space for the accidental conflict to a point where a collision is practically impossible.

I said *accidental* because maliciousness/stupidity/lazyness/ignorance/naivety/whateverelse may raise this probability to 100%. For example there are COM components shipped (!) with the GUID which was contained in the sources for some sample component.

At least in Java random GUIDs are generated using the default crypto-rng. I wouldn't be surprised if that was the case in Windows API (and .Net which just uses the WinAPI) as well.

Since you usually make pop- (and sub-pop-) culture references that I expect, I was surprised that you talked about the probability of 122 coin tosses producing non-unique results without mentioning the possibility of being "within un-, sub- or supernatural forces" (see en.wikipedia.org/.../Rosencrantz_and_Guildenstern_Are_Dead).

That "rule of thumb" is called Birthday paradox: en.wikipedia.org/.../Birthday_paradox

My colleague wrote a C# implementation of the algorithm for generating version 3 or 5 GUIDs, linked at the bottom of this post: code.logos.com/.../generating_a_deterministic_guid.html

Yes, if you flip a coin just 20 times, and write down the sequence of heads and tails that you got, you can then calculate the odds that you would have gotten that sequence.

You'll conclude that a miracle just occurred.

Indeed, a miracle occurs (something that is extremely unlikely) every time you generate a GUID. Or do any daily activity, if you take into account your specific muscle movements....

@Eugene Good that you mentioned it, because in the original article that Eric linked there's only halve the article about exactly that phenomenon ðŸ˜‰

Personally I've never seen anything but Type1 and 4 GUIDs, which means I'd actually be quite interested where those are used in practice (the URL example makes sense and I think as soon as someone invents a timemachine I'll have to go back in time and rewrite some code.. )

I have an understanding of entropy as it applies in thermodynamics (okay, a weak understanding). But I do not see how it cross-applies to pseudo-random number generation. Is it a direct correlation, or is it a convenient term that works well to get a handle on the concept in numerics? Eric, if you don't mind, can you explain or link to a good explanation?

Thanks.

Entropy is by definition the amount of randomness or disorderliness in a system. When building a deterministic device that generates random or pseudo-random numbers, there have got to be some random, unpredictable inputs to the system for the outputs to be unpredictable. Weak random number generators use bad sources of entropy, like the current time. Crypto strength random number generators use good sources of entropy, like keystroke timings. Exceptionally good random number generators use exceptionally entropic inputs, like the outputs of geiger counters or cosmic background radio signal receivers. Getting good entropy is expensive.

To be a bit more rigorous, I'm using "entropy" here like this: suppose there are 256 possible inputs to our pseudo-random number generator, and the seed is chosen by some random process that really does choose each one of the 256 possibilities with equal likelihood. Such a system is said to have "eight bits of entropy". If some of those 256 possibilities are more likely than others then we have less than eight bits of entropy. You might make a buffer that contains one thousand bits worth of timing data about keystroes. But keystroke timings are not entirely random; there are patterns. Therefore there could be less than a thousand bits of entropy in those thousand bits of seed, depending on how the bits were actually captured. -- Eric

Entropy is simply used as a term for the amount of randomness.

@Aaron Eshbach: "used as a term for the amount of randomness." I disagree, at least with respect to how Eric is using it. What you describe is the non-rigourous definition for entropy in thermodynamics. The way Eric has used it, it has a source (what source of entropy is used), and it has a number of bits (how many bits of entropy are used). I bet he is talking about Information Theory Entropy (en.wikipedia.org/.../Entropy_(information_theory). So it looks like I have answered my own question. (Sorry, this is not my field, but I still want to learn a little about it.)

@Paul As someone with only the most basic understanding of entropy in physics, I must say I still do think that there are quite some parallels. There's actually an - well I'd say interesting but that'd be a lie - article on this topic on wiki: en.wikipedia.org/.../Entropy_in_thermodynamics_and_information_theory

But I think it's much simpler to think of entropy in our case as the amount of information some message has instead of drawing parallels to physics. A perfectly random generated 128bit number has 128bits of entropy since there are 128bits of useful information to it. If there was some pattern in the key (say 1s and 0s always have to alternate) we get less entropy (in the alternating example, the entropy is suddenly just 1bit..)

The other thing you need to include is this assumes that the GUIDs are all being used for exactly the same purpose.

To once again be a pedant, I'll note that the number of bits used for version information could be 5 or 7 in rare cases.

@voo - .NET uses a type 3 GUID fÃ¼r types which don't have an explicit GUID assigned. Type 3 and 5 GUIDs are really useful if you need a unique identifier regenerated consistently.

If I remember correctly they use a hash of

- a seed GUID

- the strong name key (if the assembly has one)

- the fully qualified type name

I think the source for this was in the rotor shared source implementation, but I also accidentally stumbled upon it during a native debug session involving some COM interop.

The seed GUID makes it unlikely to conflict with anyone else, I think it's a quite clever interpretation of Type 3 GUIDs (which normally use domain names to avoid conflicts).

To produce real good Guids, PC's should be equiped with a proper Lava lamp.

See en.wikipedia.org/.../Lavarand

Apologies if this is duplicated; I'm not sure whether my previous comment has been lost due to non-posting or moderation.

Your assertion

"Now, what is the probability that over n generations of GUIDs, you produce a collision with your particular GUID? Those are independent events, so the probabilities add; the probability of a collision is n in 2122."

is incorrect.

Probabilities of independent events do not add. The probability of getting a head on either of two independent coin tosses is not (1/2 + 1/2 = 1).

You are correct; I should have called out that I was making an approximation that holds for rare events. I'll update the text. -- Eric

I'm curious what is probability of a GUID collision if you run two and more separate threads that generates guid sequences? Is algorithms use thread Id or anything like that?

And is there any way to choose what algorithm will be used when I call Guid.NewGuid() ?

Yes, I noticed the same as Dan T, but unfortunately the spam filter ate my calculations. ðŸ™‚ The point is that independent probabilities can be multiplied and in this case the easier way is to multiply the possibilities of not having a collision. So it involve a lot of substraction from 1, but i'm afraid to write it down, because of the spam filter. ðŸ™‚

@Dan T

You're technically right. But for realistic sizes of n, it is a *very* good approximation. The error only becomes significant if you create more that 2^100 GUIDs, and that just doesn't happen.

It should be: 1 - n*(2^122 - 1)/2^122

@CodesInChaos

The conclusion of the article is of course not ruined by this mistake, but i think the addition of independent probabilities is misleading to others who will read this post later.

@Brett: Nope, that'll be negative for n > 1. It should be 1 - ((2^122 - 1)/2^122)^n

@CodesInChaos: It _is_ a very good approximation, and I thank you for prompting me to think for a minute to work out why. Still, as Zsolt says, just saying "independent probabilities add" without qualification is not good.

A PRNG with insufficient internal state is a common problem. Consider that wherever you start with a PRNG will determine the entire sequence--the seed selects a starting point in the PRNG sequence which is limited in length by its internal state. If the PNRG has only 32 bits of state, then there are only about 4 billion GUIDs that it would ever generate.

Consider all those casual card games like Solitaire and Freecell. They typically can only shuffle the deck into one of only 2^32 arrangements. A true random shuffle yields something like 8 x 10^67 possible arrangements.

If one needs GUIDs that are securely random, one could implement an alternative to `NewGuid()` based on `RNGCryptoServiceProvider`. That's not very hard, and `RNGCryptoServiceProvider` produces good secure numbers, and you're guaranteed to get a V4 GUID. If high performance is required, you'll need to buffer the output of `RNGCryptoServiceProvider`, since it has a huge per-call overhead, whereas per-byte performance is actually pretty decent.

@Adrian

PRNGs with such a small internal state typically have horrible biases anyways. Luckily *that* mistake is falling a bit out of favor nowadays.

Unfortunately low quality *seeds* are still very common. .net's `System.Random` is one example of that. It has 2 billion different seeds at best. But in practice the situation is even worse than that, since `GetTickCount()` is biased towards small numbers on computers that don't run for weeks, and also only increments every few milliseconds. MS considers this flaw "by-design".