A customer liaison asked, "My customer is looking for information on the GUID generation algorithm. They need to select *N* items randomly from a pool of *M* (jury selection), and their proposed algorithm is to assign each item a GUID, then sort the items by GUID and take the first *N*." (I've seen similar questions regarding using GUIDs for things like passwords or other situations where the programmer is looking for a way to generate a value that cannot be predicted.)

The GUID generation algorithm was designed for uniqueness. It was not designed for randomness or for unpredictability. Indeed, if you look at an earlier discussion, you can see that so-called Algorithm 1 is *non-random* and *totally predictable*. If you use an Algorithm 1 GUID generator to assign GUIDs to candidates, you'll find that the GUIDs are assigned in numerically ascending order (because the timestamp increases). The customer's proposed algorithm would most likely end up choosing for jury duty the first *N* people entered into the system after a 32-bit timer rollover. Definitely not random.

Similarly, the person who wanted to use a GUID for password generation would find that the passwords are *totally predictable* if you know what time the GUID was generated and which computer generated the GUID (which you can get by looking at the final six bytes from some other password-GUID). Totally-predictable passwords are probably not a good idea.

Even the Version 4 GUID algorithm (which basically says "set the version to 4 and fill everything else with random or pseudo-random numbers") is not guaranteed to be unpredictable, because the algorithm does not specify the quality of the random number generator. The Wikipedia article for GUID contains primary research which suggests that future and previous GUIDs can be predicted based on knowledge of the random number generator state, since the generator is not cryptographically strong.

If you want a random number generator, then *use a random number generator*.

**Bonus reading**: Eric Lippert's GUID Guide, part 1, part 2, and part 3.

Using GUIDs for randomness is like using a chainsaw to hammer in a nail.

@RP: Or a glass bottle or an old shoe (weblogs.asp.net/.../408925.aspx).

Run the generated GUID through MD5 once.

Man!. Who uses a guid for anything but a guid!

If one wants unpredictable values, using a type-4 GUID that is generated by a cryptographically random method should be fine, though, right? So a customer asking for the generation algorithm for a particular GUID generator function seems like a perfectly reasonable question. An answer of "pseudo-random enough so that they're unique, not random" is a good answer, meaning that they'd then look for a different generating function that had a "cryptographically random" answer.

Anyone know if this invalidates the commonly given advice that using "ORDER BY NEWID()" in T-SQL for returning the query in a random order? I'm sure its close enough for government work, but any thoughts?

stackoverflow.com/.../random-record-from-a-database-table-t-sql for example

The example in the documentation gives a result that is of type D. So it's not guaranteed to be type 4. -Raymond]@pc: I'd argue that anyone wise enough to even ask the question "Is the type-4 GUID generation algorithm cryptographically random?" would also be wise enough to avoid GUIDs in the first place and go straight to a cryptographic RNG.

I have code that uses Java's java.util.UUID.randomUUID() to make an identifier that I would expect would not be predictable by others. I think this is fine because the documentation says that it's cryptographically strong, and should have enough bits of entropy for our purposes. I suppose it could use a different cryptographically random generator, but it was quick, easy, and should be fine. Though a comment there saying that it's only okay since it's a cryptographically secure generator might be a wise note to somebody looking at it in the future.

If your GUID generator is documented as being cryptographically strong, then that's great! But GUIDs in general are not. -Raymond]@alegr1: Predictable GUID in = predictable MD5 hash out. It's still not random, I'm not sure what you're trying to achieve.

E Featherstone: If I can look at your GUID and figure out your MAC address and timestamp, it's trivial to guess what other GUIDs you might have generated. If I can only see an MD5 hash of your GUID, I can't make any assumption about what your MAC address is or when it was generated, so I can't predict other GUIDs in the series.

That said, I recently generated a GUID for use as a password. Since it was type 4 and used in isolation, there's no chance of predicting another password based on it.

RP and Adam Rosenfield - I liken it more to using a sledge hammer to drive finishing nails. If you do everything just right, it could work. It's easy to mess it up and create some big holes. There's a much better tool for that. Why not use it instead?

@pc:

"If one wants unpredictable values, using a type-4 GUID that is generated by a cryptographically random method should be fine, though, right?"

If you have access to a cryptographically strong RNG then why use it to create a GUID? Just generate a random number from it and use that. (I'm assuming of course that software is sensible and access to the RNG isn't restricted in some way.)

I found this code once in a project that shall remain nameless:

DWORD SHRandom() {

GUID g = CoCreateGuid();

string s = GuidToString(g);

return HashData(s);

}

(Pseudocode because I'm pre-coffee.)

@Maurits: That's just a variation of the Fisher–Yates Shuffle. It's equivalent to starting with an array of N 1's and (M-N) 0's and shuffling that, just with slightly different bookkeeping.

@PC Seems like quite the unnecessary complicated way to use the perfectly simple SecureRandom class in Java. Although I assume if you needed a string representation, it saved you converting it to base64 and since Java's support for that is a bit weak (well Java5 does have it built in, but not in an especially obvious place), I can actually understand that.

But then Fisher-Yates is a trivial algorithm that really anybody with a CS degree should've heard of, so the whole "generate random numbers and sort it" approach is quite worrying. Heck most sensible languages have a shuffle implementation in their standard library.

You all are so silly. :) Just use the cryptographically secure random number generator and move on with life!

The assign-a-random-number-to-each-element-then-sort trick is clever, but it has two potential weaknesses (even assuming that the random number generator has sufficient strength for the application.)

1) There's a non-zero probability of collisions.

2) It's kinda slow (O(M log M) for the sort.)

Here's another algorithm which is O(M)

// left as an exercise

// returns a random number uniformly distributed between 0 and 1

float rand01();

// chosen is an array of M bools, all initially set to false

// on exit, precisely N of the bools are set to true

// all (M choose N) arrangements are equally likely

// it is assumed that 0 <= N <= M

void chooseNofM(bool *chosen, int M, int N) {

for (; N; chosen++, M--) {

// choose the element we're looking at

// with probability N/M

if (M * rand01() <= N) {

*chosen = true;

N--;

}

}

}

If M is large and N is small, there are other algorithms that are better.

@Adam: That's an awesome article.

So do GUIDs generated by GUIDGen.exe happen to be secure?

To be fair:

Fisher-Yates may look trivial, but it's quite easy to get wrong in subtle ways. You can end up with a shuffler where any given element is equally likely to end up in any finishing position, but yet where some ending combinations are more likely than others.

Worse, it can be quite difficult to tell whether the implementation is correct just by looking at the code, since correct code looks very similar to subtly-wrong code.

So the "sort by a random tag" algorithm has the benefit of being *obviously* correct.

@Maurits Huh? I mean Fisher-Yates is a 3liner, with the only important thing to not swap values that have already been set. Yes you can get it wrong by allowing to swap every number, but then the only interesting thing of the algorithm is the statistical analysis why you may not do that (which is not completely intuitive), so I'd hope that'd stick in mind. But yes, *testing* whether the algorithm is correct is always a problem for these things.

And as we can see you can get the "add random tag to data" algorithm wrong as well - although I agree you have to work a bit harder there ;)

Fisher-Yates assumes a continuously uniform random number generator, which does not exist in reality. If M/N >]`RAND_MAX`

, then there are some people who will never be selected, and even if you are well under`RAND_MAX`

, the selection will not be perfectly fair. (For example, if`RAND_MAX = 32767`

and there are 16383 people, then one unlucky guy will be called to jury duty twice as often as his peers.) -RaymondThe point of Fisher-Yates is that it attempts to uniformly choose one of M!/(M-N)! permutations. A common error is to generate M^N permutations and then map them to one of the M!/(M-N)! permutations. Of course, if your random number generator isn't up to uniformly choosing one of M!/(M-N)! permutations then you're going to fail anyway, but at least with Fisher-Yates you're not adding your own bias to that of your random number generator.

TC is right. Even with cryptographic PRNGs predicting the output from the internal state is possible. But you can't efficiently find out that internal state by just observing outputs. To avoid this predictability you need to reseed regularly from real entropy.

My main doubt about the windows GUID creator is that it's unclear how private it is. AFAIK MS switched to V4 GUIDs to avoid the MAC-Address and creation time leak from V1 GUIDs. But depending on what is used to seed the PRNG, there might be privacy issues even with random looking V4 GUIDs. I'm not aware of any documentation clearly stating what privacy related guarantees the generator has.

So my recommendation is to create your own V4 GUIDs based on

`CryptGenRandom`

/`RNGCryptoServiceProvider`

.> "The Wikipedia article for GUID contains primary research which suggests that future and previous GUIDs can be predicted based on knowledge of the random number generator state, since the generator is not cryptographically strong."

Is that what it actually says?

It says: "Cryptanalysis of the WinAPI GUID generator shows that, since the sequence of V4 GUIDs is pseudo-random, given full knowledge of the internal state, it is possible to predict previous and subsequent values."

Surely if you know a PRNG's internal state, you can predict all of its the following outputs - even if it is cryptographically secure (ie. a CSPRNG)? Whereas the difference between a normal PRNG, and a cryptographically secure PRNG (CSPRNG), as I understand it, is whether you can or can't predict the next output(s) from some number of *previous outputs* - not from a "full knowledge of the generator's internal state".

I like the corollary - random numbers aren't GUIDs. I've seen developers generate really long random numbers and then call those "GUIDs". When told that that is not a GUID, the response is "it's unlikely for there ever to be a collision, so it's good enough".

What I don't get is is WHY a GUID has to be "secure" and "such that it's impossible to predict the next one.

It may be because I have only ever used them for COM, but to me these "security" issues sound like someone is using a GUID as something else than an ID.

@xpclient: Only if you constantly reassure them that they're fine as children.

@Raymond True, so you need at least Log2(N!/(M-N)!) bits of internal state in your generator, because otherwise one can obviously not generate all possible permutations. That's quite a large number for even small sets - hush, that's why I don't implement statistical libraries ;)

Not sure about your example though: Yes, just taking the modulo will result in biased values, but we don't have to use

`nextRand() % maxSize`

to produce a number (although even the correct solution that doesn't introduce bias, still reduces the internal state a bit, so it's back to point 1)Still the algorithm is simple to implement correctly - it may just not do what one would hope it to.. those "bugs" are much worse to find in practice though.

@RandomsArentGuids: Actually, a 128-bit random number from a high-quality RNG is perfectly suitable as a unique ID (but don't call it a GUID, cause it's not).

This reminds me of a "classic" post by Jeff Atwood:

http://www.codinghorror.com/.../shuffling.html

Specifically the bottom bit:

Unfortunately for Jeff's readers, he doesn't follow Raymond's good example of reminding everyone that what he posts is not designed to be copy-pasta'd into production code.

@Simon Buchan: A truly random number generator has to have some chance of generating the same number consecutively (because each number is generated independently, you cannot deny the chance of (1/n)^2 the same number be produce on consecutive trials, where n is the number of possible outputs. Granted the chance is small, it's not impossible.). So by defination, a good RNG is not a suitable source of unique number unless you put additional measure to prevent that.

I thought this problem sounded familiar - it's the subject of Column 12 and Column 13 in Jon Bentley's classic "Programming Pearls" (an excellent book.)

The "better algorithm" (better than Fisher-Yates, in particular) he gives for the case when N is much less than M is as follows:

set<int> S;

while (S.size() < N) s.insert(randomly selected item);

// where s.insert is given to be a no-op if the element is already in the set.

Note that a conspiratorial random number generator can make this into an infinite loop, but under the assumption that N is much less than M this will almost surely run in O(M log M) time and use O(M) space.

er, I mean O(N log N) time and O(N) space, of course (Bentley reverses N and M and I confused myself)

@cheong00: OK, so now go compute how many GUIDs it would take to make collision feasible, and how many it's feasible to generate before programming is obsolete :). There's a reason type 4 GUIDs are just RNG output other than the type tag.

@Simon Buchan: Since all generations are random, you can just get two random numbers once and found they are the same. The "number of generation to make it feasible" is irrelevent (or I should say, incorrect fundamentally), it is the "number of generation to make sure it's impossible to NOT collide". Statistical randomness need to ensure no recognisable pattern in number it generated, and we know uniqueness is a known pattern. So if you run 65536 trials to such a RNG, you can predict to see some number have zero counts, most has one and a few has two or even three occurance.

Oh, I've forgotten to add the condition of "16-bit strength" to my last comment.

@Nick: This looks like the sort key of a given item would vary over time, which is a big no-no (IIRC, the CRT throws or asserts when it encounters that). Was that the point?

@cheong00: "Since all generations are random, you can just get two random numbers once and found they are the same."

Yes, with a 1 in $BIGNUM (excuse me, that's $BIGBIGBIGNUM) probability! Who cares?

The original problem is about shuffling a jury pool, so there is no need for cryptography, passwords or any such thing. If using C/C++, rand() would work just fine. Just walk the list, exchanging each entry with a random entry. No need to sort or actually assign a number.

It may not need to be cryptographically secure, but it definitely needs to be unbiased. The jury pool probably exceeds the resolution of]`rand()`

, so you're going to have systematic bias in your selection. -RaymondThis gives me an idea for an interview question: using a rand() function with RAND_MAX = 32767, write a function which returns 1, 2, or 3 with equal probability.

@Maurits: Not to toot my own horn, but see stackoverflow.com/.../expand-a-random-range-from-1-5-to-1-7 for some interesting solutions to that (several correct, many wrong) with different numbers.

@Raymond, Can I suggest that you fix the error that I pointed out in your summary of the Wikipedia article?

> "The Wikipedia article for GUID contains primary research which suggests that future and previous GUIDs can be predicted based on knowledge of the random number generator state, since the generator is not cryptographically strong."

Wrong. The predictability does not occur because the generator is not cryptographically strong. Nor does the referenced article say that. It occurs because if you can obtain the full internal state of *any* PRNG - cryptographically strong or otherwise - you can certainly predict all following outputs, and possible the previous ones.

So an accurate summary would be more like this: "The Wikipedia article for GUID contains primary research which suggests that the sequence of V4 GUIDs is pseudo random, ie. based on a deterministic PRNG. Therefore, given full knowledge of the internal state, it is possible to predict all previous and subsequent values."

Interesting indeed.

I notice that:

5^1 mod 7 = 5 but is less than 7 so we can't use that

5^2 mod 7 = 4 so we have 4 "repeat" results

5^3 mod 7 = 6 so we have 6 "repeat" results

5^4 mod 7 = 2 so we have 2 "repeat" results

5^5 mod 7 = 3 so we have 3 "repeat" results

5^6 mod 7 = 1 so we have only one "repeat" result

5^7 mod 7 = 5 and it repeats from there

@cheong00

You can't distinguish "impossible" from "extremely unlikely" in this universe. That distinction only exists in mathematics.

You can't construct a computer that always gives the same result. And you can't construct a computer which produces two random numbers of sufficient length that happen to collide. Random errors happen, you can just make them so unlikely it doesn't matter.

There are some situations where random 128 bit numbers have a non negligible chance of happening. It's unique enough for many applications, but there are many scenarios where it's not enough.

256 bit collisions are already almost impossible for several decades even using all the computational power of earth. That's the level you practically use if a powerful adversary can create many ids, and wins if any pair of them collides.

And finally barring major advances of technology(say the invention of reversible computing, or humanity becoming a type III civilization), 512 bit random numbers are unique enough forever.

So I disagree with you. Random numbers make fine unique numbers, as long as you make sure they're long enough for your use-case.

If you need a unique identifier as primary key for something, better do not use pure random numbers. Otherwise, the random collisions can cause absolute "impossible" behavior.

Do you propose to use your principle for software used to control vital functions of a nuclear power plant? Really?

In that case there is no computer that isn't broken. You can make the error probability very small, but not absolutely zero. Cosmic ray hits your RAM, flips a bit,...

Random numbers are perfectly fine for unique ids, if you're sure that the RNG is good enough. In practice PRNGs are often seeded badly, so I'd be very careful. But if I were sure that the RNG is good enough, and the random number large enough, I wouldn't hesitate to bet critical functions on it.

The chances of a bug or random hardware failure are so much larger than the chance that two independently generated 512 bit numbers are the same, that I certainly wouldn't worry about that.

Also what alternative schemes for ID generation do you propose that are more robust? Most other schemes require some persistent state keeping track of which IDs to produce. It's very easy to make a mistake when managing this, for example when cloning a VM or restoring from backup. I'm sure that the chances of such failures are much larger than random number collisions.

Medinoc: The sort key for an item is computed only once, so even a random key won't change during the sort operation.