GUID generation algorithm 4 fills the GUID with 122 random bits. The odds of two GUIDs colliding are therefore one in 2¹²², which is a phenomenally small number. When you are dealing with rates this low, you have to adjust your frame of reference.

Some people don't like algorithm 4 because the algorithm is not designed to ensure uniqueness. There is a chance, admittedly small, that a collision will occur. On the other hand, algorithm 1, for example, uses time-space coordinates to ensure zero chance of duplication, unless someone screws up the generation process.

Hang on a second. "Unless someone screws up the generation process." Ah, so it's not zero after all, because there is of course a tiny chance that something will get screwed up in the generation process.

According to this research paper (pdf), memory chips suffer 25,000 to 70,000 errors per billion hours per megabit. Let's do some back-of-the-envelope calculations.

25,000 to 70,000 errors / (10⁹ hours × 2²⁰ bits) ≈ 2¹⁵ errors / (2³⁰ hours × 2²⁰ bits) = 1 error / (2³⁵ hours × bits)
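These approximations are easy to check mechanically. A quick sketch, assuming the 2¹⁵ ≈ 32,768 midpoint figure for the error count:

```python
# Back-of-the-envelope check of the memory error rate above.
# Assumed figures: ~2**15 errors per 10**9 hours per megabit (2**20 bits).
errors = 2**15
hours = 10**9
bits_per_megabit = 2**20

rate = errors / (hours * bits_per_megabit)   # errors per hour per bit
approx = 1 / 2**35                           # the article's rounded figure

print(rate)     # roughly 3.1e-11
print(approx)   # roughly 2.9e-11, within about 10% of the exact figure
```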

Now let's see how this error rate converts to "corrupted GUIDs".

(1 error / (2³⁵ hours × bits)) × (2⁷ bits / 1 GUID) × (1 hour / (60 × 60 × 10⁹ nanoseconds))

Now, 60 × 60 = 3600 is approximately 4096 = 2¹², and 10⁹ is approximately 2³⁰.

≈ 2⁷ errors / (2³⁵ × 2¹² × 2³⁰ GUID × nanoseconds) = 1 error / (2⁷⁰ GUID × nanoseconds)

The odds that a GUID will be corrupted in memory in the one nanosecond between the time it is created and the time the value is copied somewhere safe is therefore around one in 2⁷⁰.

This is 2⁵² ≈ 10¹⁵ = a quadrillion* times more likely than a collision between two GUIDs generated by algorithm 4.
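The whole chain can be replayed with exact arithmetic, using the same power-of-two approximations as above:

```python
# Sketch of the full back-of-the-envelope chain from the article,
# done with exact fractions so no floating-point rounding sneaks in.
from fractions import Fraction

error_rate = Fraction(1, 2**35)     # errors per (hour * bit), from above
bits_per_guid = 2**7                # a GUID is 128 = 2**7 bits
ns_per_hour = 60 * 60 * 10**9

# Probability that a given GUID picks up a bit error in one nanosecond:
p_corrupt = error_rate * bits_per_guid / ns_per_hour
print(float(p_corrupt))             # roughly 1e-21, i.e. about 2**-70

# Compare against the v4 collision probability of 2**-122:
ratio = p_corrupt * 2**122
print(float(ratio))                 # a few quadrillion, on the order of 2**52
```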

And that's just mechanical failures in the RAM chips. There are other ways that the generation process can get screwed up. For example, there could be a failure in the CPU itself, maybe due to overheating or overclocking. Or a bit error on the hard drive. Or a programming error.

When you start talking about numbers as small as 2⁻¹²², you have to start looking more closely at all the things that you thought were zero.

* Or "a thousand billion" if you use the European scale.

**Bonus chatter**: Note that I'm not saying that a collision caused by a random bit flip is more likely than a collision caused by a v4 GUID happening to match some other v4 GUID. I'm saying that it's wrong to say that the v1 GUID generator will never generate a collision, because there's a nonzero probability that the v1 GUID generator will deviate from the algorithm. Sure, that probability may be minuscule, but when you're dealing with numbers like 2⁻¹²², these minuscule probabilities become significant.

Yes, I get odd stares when I tell people that the odds of winning the Powerball when you buy two tickets are the same as the odds you have if you buy one ticket (twice infinitesimal is infinitesimal). Of course, the odds of winning if you buy a ticket are infinitely greater than if you don’t buy a ticket (infinitesimal is infinitely greater than zero). Weirdness lives at the very edges of the number line.

Your odds of winning the Powerball if you don’t buy a ticket are still nonzero: someone could give you a winning ticket, or you could find one on the ground.

Well, if you say they’re exactly the same, as opposed to them not being significantly different, then the odd stares are deserved. While the odds are still infinitesimally small, they are in fact different. There isn’t a huge difference between 1:292,201,338 (3.42e-9) and 2:292,201,338 (1.71e-9), but there is a difference.

In fact, in the base-2 reckoning we software types so often deal with, the odds of winning having bought two (non-identical) tickets are an order of magnitude greater than the odds of winning having bought one.

Yes, different. But essentially the same as far as the probable impact on your life. As always, Randall says it best: https://xkcd.com/1252/

A book I remember reading once noted that statistically, all species are extinct. :)

I have seen this meme repeated all over lately and I have no idea why people think it’s true.

You at least are playing word games with “infinitesimal” (which is a mathematically invalid concept, viz. the history of the calculus) but I’ve seen others claim that the odds are literally unchanged.

Unless you are buying up a substantial portion of the total stock of potential winning numbers, the relationship between number of tickets bought and chances of winning is almost exactly linear.

What are you talking about?

The relationship between number of tickets bought and chances of winning is *always* exactly linear, regardless of how many you buy (of course assume you don’t repeat numbers).

Unless it is not a random draw.

The real numbers don’t have infinitesimals, (or, I think some definition counts 0 as one but I don’t remember why) but the surreal numbers do, as well as a number of other systems. (e.g. the surreals with birthday less than omega^omega I think? I think that forms a field? I think that might be the smallest ordered field which has the reals and infinitesimals, but I’m not sure.)

Tickets can share the same numbers in Powerball (if those numbers win, the winnings are split proportionally by number of winning tickets between the ticket holders). So as long as every ticket you buy is unique, the relationship is exactly linear, I think, no matter how many you buy.
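That exact linearity is easy to verify: with k distinct tickets out of N equally likely combinations, the chance of holding the winning combination is exactly k/N. A sketch using the 292,201,338 figure cited in the thread:

```python
from fractions import Fraction

# k distinct tickets out of N equally likely combinations win
# with probability exactly k/N -- linear in k, no approximation.
N = 292_201_338   # Powerball combinations, as cited above

p1 = Fraction(1, N)
p2 = Fraction(2, N)
print(float(p1))          # roughly 3.4e-9
print(p2 == 2 * p1)       # True: exactly double, not merely "about the same"
```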

Where did that extra factor of 2^21 come from in the denominator of the last equation?

Math is hard.

Let’s go shopping?

Wouldn’t algorithm 4 have the same chance of corruption, though?

Then again, I suppose collision isn’t the same thing as corruption.

Yes, but the odds of corruption in v4 causing a collision are very small. In v1 each generated guid differs by only a few bits; a bit flip and you've got a collision. In v4 a bit flip gives you another random number that probably was never generated before.

But corruption may screw you up in other ways.

“The odds of two GUIDs colliding are therefore one in 2¹²²” This is not so. This is the chance of guidA colliding with guidB. The odds of there being any collision is the birthday ‘paradox’. A quick look shows for 128 bits that if you generate 10^77 guids there is a 50% chance of a collision. It will be a lot less for 122 bits (96 bits -> 10^57 are required for 50%)

What do you mean by “This is not so.”?

Your argument is not what the article mentions. You don’t have 10^77 GUIDs. You have TWO. And the odds that they collide are 1 in 2^122.

“The odds of two GUIDs colliding are one in 2^122” is as true of a statement as your “the odds of at least two GUIDs among 10^77 colliding are more than 1/2”.

Not only that. The calculations made by Raymond suppose a perfect random generator, correctly seeded and with a perfectly uniform distribution. Sadly, this often isn't the case, because computers use pseudo-random generators. The degree of predictability of a pseudo-random generator significantly lowers the entropy of the random number and makes collisions more frequent. That's without taking into account that most programmers will use the language's standard random generator, which often uses a bit length of 32 or 64 (making a best-case collision rate of 2^-32 or 2^-64). Oh, and the memory/storage media corruption problem applies to random GUIDs, too.

IMHO, type 1 GUIDs are a better solution because they are simpler to generate, and thus it's a lot harder to botch the generation. Type 4 GUIDs, on the other hand, leave you depending on the random generator used (which may or may not be up to the job) or even invite you to write your own generator (which is surprisingly easy to get wrong).
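For what it's worth, mainstream v4 implementations generally do not use the language's seeded PRNG: CPython's `uuid.uuid4()`, for instance, fills the GUID from `os.urandom()` (the OS CSPRNG). A quick sketch of what a v4 GUID looks like:

```python
import uuid

# A v4 GUID carries 122 random bits; 6 bits are fixed by the format.
g = uuid.uuid4()
print(g)

# The version nibble is always 4 (the 13th hex digit)...
print(g.version)              # 4
# ...and the variant bits pin the 17th hex digit to 8, 9, a, or b.
print(str(g)[19] in "89ab")   # True
```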

Yes, the point of GUIDs is that they are meant to be globally unique. What you really want to know is the chance of a collision in the pool of all generated GUIDs.

I don’t know where you got the 10^77 figure from, that’s too high. Given a 122 bit hash/random number, there is a 50% chance of a collision after generating 2^61 numbers, which is the square root of 2^122. This is the birthday paradox.

Comparing this with the probability of a memory flip if you have enough memory to store 2^61 GUIDs is left as an exercise.
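For reference, the usual birthday-bound approximation says the 50% collision point arrives near 1.177·√N draws from N equally likely values. A quick check for N = 2¹²²:

```python
import math

# Birthday-bound sketch: ~1.177 * sqrt(N) draws give a ~50% chance
# of at least one collision among N equally likely values.
n_bits = 122
n_for_half = 1.1774 * math.sqrt(2**n_bits)

print(n_for_half)    # roughly 2.7e18
print(float(2**61))  # roughly 2.3e18 -- so "a bit over 2**61" draws
```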

This is not about “The odds that a bit flip can result in a collision in a pool of all generated GUIDs.” This is just saying that people worry about the probability of a v4 collision, but they don’t consider the probability of the v1 algorithm breaking down.

Note: The new layout broke the graphics markup used on this blog (for example here: https://blogs.msdn.microsoft.com/oldnewthing/20030905-02/?p=42653 ). Neither Firefox 43 nor IE11 display it correctly.

No kidding! Someone didn’t do their job right.

The “zero chance of duplication” link doesn’t work.

Redirects from the old site are broken. I have an open ticket on it.

Thanks.

Of course, the biggest problem with generating algorithm 4 GUIDs is probably finding 122 sufficiently random bits.

The “overclocking” and ” zero chance of duplication, unless someone screws up the generation process. ” blog links are dead :(

Which has already been mentioned, and answered, right here in these comments!

Very nice back-of-the-envelope calculations.

BER-type memory corruption is an interesting consideration, for two reasons.

First: If memory is corrupted and a GUID is read out with a single bit flipped, then the GUID is still wrong, regardless of the algorithm used to generate it – all sorts of catastrophe can result, right? I’m not sure that the chance that the misread is also a collision with another GUID (given the space density of v1 that you noted) represents a significant *marginal* risk over the simple fact that the GUID has spontaneously changed.

Second: Since v1 GUIDs are based partly on the MAC address, it’s possible to detect a BER-type memory misread that way in some cases. For example, if you know that the GUID was generated by a device with a particular MAC address, wouldn’t it be possible to compare it with the MAC address that’s included in the GUID? Granted, this is a pretty artificial scenario, but of course so is BER in and of itself….

Regardless – I think the upshot is that the importance of protecting GUIDs against memory misreads, for all of the reasons you mentioned, is compelling. Perhaps it should be good practice to checksum all GUIDs (either tacked onto the v(x)-generated GUID, or incorporated into a GUID algorithm).
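That checksum idea can be sketched in a few lines. This is a hypothetical illustration only; the layout and the choice of CRC32 are my own assumptions, not part of any GUID standard:

```python
import binascii
import uuid

# Hypothetical sketch: append a CRC32 to a GUID so that a memory
# bit flip is detected when the value is read back.
def guid_with_checksum(g: uuid.UUID) -> bytes:
    crc = binascii.crc32(g.bytes) & 0xFFFFFFFF
    return g.bytes + crc.to_bytes(4, "big")

def verify(blob: bytes) -> bool:
    body, crc = blob[:16], int.from_bytes(blob[16:], "big")
    return (binascii.crc32(body) & 0xFFFFFFFF) == crc

blob = guid_with_checksum(uuid.uuid4())
print(verify(blob))        # True

# Flip a single bit to simulate the BER scenario discussed above:
corrupted = bytes([blob[0] ^ 0x01]) + blob[1:]
print(verify(corrupted))   # False -- the corruption is caught
```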

It’s interesting that Powerball drawings are actually done using physical Ping-Pong-ball-like numbered balls, which are weighed frequently and carefully handled. The numbers picked will not be subject to the vagaries of any pseudo-random number generator…. although they could be subject to other things. The selected numbers are checked (using a method that is not spelled out) for true randomness (per the Powerball web site’s FAQs).

However, if you go to the store and ask the clerk to ask the lottery system’s computers to pick some random number(s) for you (Quick Picks), of course there is a PRNG involved in selecting your numbers.

I don’t really have a big point, except to say that not only is math hard, but random numbers are hard too.

And anything in gambling is very strictly checked for fairness, for legal and financial reasons. (The numbers people pick for their tickets doesn’t matter; but the method in determining the winning numbers does, very much so.)

>> The numbers people pick for their tickets doesn’t matter

I disagree. It is already well known that the expected value of a lottery ticket varies based on the numbers chosen because people who choose their own numbers do not distribute their choices evenly across the set of possibilities. If the “pseudo-random” choices also had a bias, the game could no longer be “fair”.

So this is wrong. No way around saying that really.

There are two generators you consider:

1. A uniform random generator. Its collision probability is C := 1/2^124.

2. Basically a sequence. It’s a sharded sequence (the mac address), and it skips values (the time component), but it is nonetheless a monotonically increasing sequence.

The sequence won’t wrap in your lifetime. So we don’t have to worry about that. The only thing that might concern us are machines that happen to have the same mac. So we can’t trust the user to generate the uuid we have to do it on machines we control, but that is true of either method because the user could always maliciously pick someone else’s published uuid.

Now add in the hardware error with probability epsilon.

The collision probability for method #1 is unchanged. You are just xoring a random error into an already random value.

For #2 there are 3 cases. Errors in both uuids, errors in 1, errors in neither. These events happen with probabilities e^2 2(1-e)e and (1-e)^2 respectively. The collision probabilities are C, C and 0 respectively.

Thus the collision probability is 2eC - e^2 C << C. Basically you are off by a factor of 2^124.

You lose the guarantee but it is still a lot better than the uniform select. Both are insanely improbable, but are susceptible to different issues. For #1 your random seeds could be bad, for #2 you could have bad clocks on your machines.

If you are really worried do both and xor them together.
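The three-case argument above can be checked with exact arithmetic at a toy scale (8-bit "uuids" and an assumed error rate, just to keep the numbers small):

```python
from fractions import Fraction

# Toy-scale check of the three-case argument: two sequential uuids are
# distinct unless an error hits, in which case the hit value is modeled
# as uniform. C is the collision chance of two uniform values.
bits = 8                        # toy size instead of 128
C = Fraction(1, 2**bits)
e = Fraction(1, 1000)           # assumed per-uuid error probability

p_both = e * e                  # errors in both uuids
p_one = 2 * e * (1 - e)         # error in exactly one
p_neither = (1 - e) ** 2        # no errors: collision impossible

p_collide = p_both * C + p_one * C + p_neither * 0
print(p_collide == (2*e - e*e) * C)   # True: matches 2eC - e^2 C
print(p_collide < C)                  # True: far below the uniform case
```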

This is not about “The odds that a bit flip can result in a collision in a pool of all generated GUIDs.” This is just saying that people worry about the probability of a v4 collision, but they don’t consider the probability of the v1 algorithm breaking down. I chose a random bit flip as one of the things which can break the v1 algorithm. Another is of course a systematic MAC collision.

So?

A random bit flip on a randomly selected uuid does nothing to the collision probability because it is just xoring a random into a random. The bits that flip against you are equally likely to flip in your favor.

A random bit flip on the sequence uuid results in a uuid which is a convex combination of a random and sequential uuid. I don’t see how that has a lower collision probability than the first. So what if I lose my guarantee, I’m still less likely to collide because I’m guaranteed to start from a known different value.

If your concern is collisions you would still prefer the sequence uuid if you trusted the inputs like the system clock and mac address, even if you didn’t trust the ram.

To make an analogy your argument reads like:

People say that winning the lottery is unlikely because they have to match all 7 numbers out of the tumbler, but it is actually harder to win than that.

There is some minuscule chance that the numbers on the balls could be misread, and although you have the winning numbers, the commission awards the prize to someone else.

It seems like you have forgotten that this can also work in your favor. You could just as likely be the designated winner when in actual reality you are not the proper winner.

Random bit errors can never increase the probability that a uuid will collide with a random uuid. And you have given no mechanism whereby a value known to be distinct can somehow invert the collision probabilities with ones that have a known probability of colliding.

“Random bit errors can never increase the probability that a uuid will collide with a random uuid”

That is only true when both people generating have an equal probability of creating any uuid.

If I’m only picking balls 1-20 and you are only picking balls 21-49 then we won’t ever share a lottery win.

If you pick low numbered balls and I pick high number balls, then an error (or multiple errors), could turn my high numbers into your low numbers. So I still get collisions.

Not really sure what you are trying to get at there. Maybe the mac sharding?

I have one machine generating all my uuids, so they are close and share the first six bytes.

6/16 of the time the random error is on the mac address and I end up with a uuid I could never generate. 10/16 of the time I get another uuid I might generate at some other time, but now I have to wait maybe a thousand or more years to see if I actually do.

> * Or “a thousand billion” if you use the European scale.

On the continent, we say “a billiard”, not “a thousand billion”.

As for your footnote: 10^(15) would be a Billiard (Billiarde) in German.

UUID v1 was good back when network cards were real, the number of manufacturers was low, and the number of devices was low.

Today, the number of duplicate MAC addresses is very much nonzero, and so even if the hardware is working perfectly, UUID v1 is likely to produce a duplicate between machines.