Factoring large numbers with quadratic sieve

Today I'm going to talk about how the quadratic sieve factoring algorithm works, giving a comprehensive description assuming knowledge of only basic university-level mathematics.

The foundation of the most popular public-key cryptography algorithm in use today, RSA, rests on the difficulty of factoring large integers. When keys are generated, efficient algorithms are used to generate two very large prime numbers and multiply them together. The person who generated the keys knows these two numbers, but everyone else only knows the product. The product contains enough information to encrypt a message to the person; the two primes allow the recipient to decrypt it. There is no known way to decrypt it without using the primes, but by factoring, we can extract the two prime factors from the product and break the encryption.

At the time that RSA was invented in 1977, factoring integers with as few as 80 decimal digits was intractable; all known algorithms were either too slow or required the number to have a special form. This made even small, 256-bit keys relatively secure. The first major breakthrough was quadratic sieve, a relatively simple factoring algorithm invented by Carl Pomerance in 1981, which can factor numbers up to 100 digits and more. It's still the best known method for numbers under 110 digits or so; for larger numbers, the general number field sieve (GNFS) is now used. However, the general number field sieve is extremely complicated, and requires extensive explanation and background for even the most basic implementation. However, GNFS is based on the same fundamental ideas as quadratic sieve, so if factoring the largest numbers in the world is your goal, this is the place to start.

We'll begin by addressing a few problems that at first glance have nothing to do with factoring, then assemble them into a working algorithm. I won't be motivating them first, but trust me - they're important.

Finding a subset of integers whose product is a square

Suppose I give you a set of integers and I ask you to find a subset of those integers whose product is a square, if one exists. For example, given the set {10, 24, 35, 52, 54, 78}, the product 24×52×78 is 97344 = 3122. The brute-force solution, trying every subset, is too expensive because there are an exponential number of subsets.

We'll take a different approach based on prime factorizations and linear algebra. First, we factor each of the input numbers into prime factors; for now we will assume that these numbers are easy to factor. For the above example set, we get:

10 = 2 × 5
24 = 23 × 3
35 = 5 × 7
52 = 22 × 13
54 = 2 × 33
78 = 2 × 3 × 13

When you multiply two numbers written as prime factorizations, you simply add the exponents of the primes used. For example, the exponent of 2 in 24×52×78 is 6, because it's 3 in 24, 2 in 52, and 1 in 78. A number is a square if and only if all the exponents in its prime factorization are even. Suppose we write the above factorizations as vectors, where the kth entry corresponds to the exponent of the kth prime number. We get:

[1 0 1 0 0 0]
[3 1 0 0 0 0]
[0 0 1 1 0 0]
[2 0 0 0 0 1]
[1 3 0 0 0 0]
[1 1 0 0 0 1]

Now, multiplying numbers is as simple as adding vectors. If we add rows 2, 4, and 6, we get [6 2 0 0 0 2], which has all even exponents and so must be a square. In more familiar terms, we want the last bit of each entry in the sum to be zero. But in this case, we don't need to store all of the numbers above, only the last bit of each number. This gives us the following:

[1 0 1 0 0 0]
[1 1 0 0 0 0]
[0 0 1 1 0 0]
[0 0 0 0 0 1]
[1 1 0 0 0 0]
[1 1 0 0 0 1]

Moreover, since we're only interested in last bits, we can perform all our addition using one-bit integers with wraparound semantics (in other words, mod 2). If we add rows 2, 4, and 6 in this way, we get [0 0 0 0 0 0 0], the zero vector. In fact, all squares correspond to the zero vector.

Let's rephrase this as a matrix equation problem. If we transpose the above matrix, so that rows become columns, we get this:

[1 1 0 0 1 1]
[0 1 0 0 1 1]
[1 0 1 0 0 0]
[0 0 1 0 0 0]
[0 0 0 0 0 0]
[0 0 0 1 0 1]

Call this matrix A. If we multiply A by the vector [0 1 0 1 0 1], using one-bit integer arithmetic, we get the zero vector. This tells us precisely which numbers we need to multiply to get a square. So, our goal is to find a nonzero vector x such that Ax=0 (remember, all arithmetic here is with one-bit integers).

If you've had a course in linear algebra, this problem should look very familiar; it's the problem of finding the null space of a matrix, the set of vectors such that Ax=0. The problem can be solved using row reduction (Gaussian elimination). We row reduce the matrix, and then assign values to the free variables in a way that gives us a nonzero solution. The other variables will be determined by these values and the matrix. You probably studied this problem using rational numbers, not one-bit integers, but it turns out row reduction works just as well for these. For example, if we add row 1 to row 3 in the above matrix, we get the following:

[1 1 0 0 1 1]
[0 1 0 0 1 1]
[0 1 1 0 1 1]
[0 0 1 0 0 0]
[0 0 0 0 0 0]
[0 0 0 1 0 1]

Completing the row reduction, we eventually end up with this matrix:

[1 0 0 0 0 0]
[0 1 0 0 1 1]
[0 0 1 0 0 0]
[0 0 0 1 0 1]
[0 0 0 0 0 0]
[0 0 0 0 0 0]

If we turn this back into a system of equations and rearrange, we get this:

x1 = 0
x2 = −x5x6
x3 = 0
x4 = −x6

Suppose we choose x5=0, x6=1. From the above equations, it follows that the first four vectors have the values 0, 1, 0, and 1 (remember, one-bit integer arithmetic). This gives us our final vector, [0 1 0 1 0 1]. If we were to choose x5=1 and x6=0 instead, we'd get a different solution: [0 1 0 0 1 0], corresponding to 24×54 = 1296 = 362.

Moreover, a theorem of linear algebra tells us precisely how many input numbers we need to guarantee that a square can be found: as long as we have more columns than rows, the null space is guaranteed to be nontrivial, so that we have a nonzero solution. In other words, we just need more numbers than prime factors used by those numbers. As this case shows, though, this isn't a necessary condition.

The one remaining problem with this method is that if one of the numbers in our set happens to have very large factors, our matrix will have a large number of rows, which requires a lot of storage and makes row reduction inefficient. To avoid this, we require that the input numbers are B-smooth, meaning that they only have small factors less than some integer B. This also makes them easy to factor.

Fermat's method: factoring using a difference of squares

You might be wondering what squares have to do with factoring. The connection is the very simple factorization method known as Fermat's method. Although not efficient in general, it embodies the same basic idea as quadratic sieve and works great for numbers with factors close to their square root.

The idea is to find two numbers a and b such that a2b2 = n, the number we wish to factor. If we can do this, simple algebra tells us that (a+b)(ab) = n. If we're lucky, this is a nontrivial factorization of n; if we're not so lucky, one of them is 1 and the other is n.

The concept behind Fermat's algorithm is to search for an integer a such that a2n is a square. If we find such an a, it follows that:

a2−(a2-n) = n

Hence we have a difference of squares equal to n. The search is a straightforward linear search: we begin with the ceiling of the square root of n, the smallest possible number such that a2n is positive, and increment a until a2n becomes a square. If this ever happens, we try to factor n as (a−sqrt(a2n))(a+sqrt(a2n)); if the factorization is trivial, we continue incrementing a.

Here's an example of Fermat's method from Wikipedia. Let n=5959; a starts out at 78. The numbers 782−5959 and 792−5959 are not squares, but 802−5959=441=212. Hence (80-21)(80+21)=5959, and this gives the nontrivial factorization 59×101=5959.

The reason Fermat's method is slow is because simply performing a linear search of all possible a hoping that we'll hit one with a2n square is a poor strategy − there just aren't that many squares around to hit. A better way of going about it is to proactively compute an a having this property (actually a similar property).

The key is to notice that if we take a number of a2n values, none of which are squares themselves, and multiply them, we may get a square, say S. Let A be the product of the corresponding values of a. Basic algebra shows that A2S is a multiple of n. Hence, (A−sqrt(S))(A+sqrt(S)) is a factorization of some multiple of n; in other words, at least one of these shares a factor with n. By computing the greatest common divisor (GCD) of each with n using Euclid's algorithm, we can identify this factor. Again, it may be trivial (just n itself); if so we try again with a different square.

All that remains is, given a list of a2n values, to find a subset whose product is a square. But this is precisely an instance of the problem discussed in the last section. Unfortunately, recall that that the method we came up with there is not efficient for numbers with large factors; the matrix becomes too large. What do we do? We simply throw away numbers with large factors! Theoretical results show that there are a fairly large number of values in the sequence a2n that are smooth (recall that smooth numbers have only small factors). This gives us a new factoring method that works pretty well up to a point.

For example, consider the number 90283. If we start a at 301 and increment it up to 360 while computing a2n, we get the following values:

318, 921, 1526, 2133, 2742, 3353, 3966, 4581, 5198, 5817, 6438, 7061, 7686, 8313, 8942, 9573, 10206, 10841, 11478, 12117, 12758, 13401, 14046, 14693, 15342, 15993, 16646, 17301, 17958, 18617, 19278, 19941, 20606, 21273, 21942, 22613, 23286, 23961, 24638, 25317, 25998, 26681, 27366, 28053, 28742, 29433, 30126, 30821, 31518, 32217, 32918, 33621, 34326, 35033, 35742, 36453, 37166, 37881, 38598, 39317

None of these are squares (the first square occurs at a=398); however, if we factor each value we will discover that 7 of these values have no factor larger than 43:

6438, 10206, 16646, 19278, 19941, 30821, 35742

If we take these 7 values and feed them to the algorithm described in the last section, it finds a square: 19278×19941×30821×35742 = 423481541612104836 = 6507545942. The corresponding original a were 331, 332, 348, and 355, and their product is 13576057680. Now, we can factor the number:

(13576057680−650754594)(13576057680+650754594) = 12925303086 × 14226812274 is a multiple of 90283
GCD(90283, 12925303086) = 137
GCD(90283, 14226812274) = 659
137 × 659 = 90283.

Making it faster: sieving for smooth numbers

The factorization algorithm above is considerably better than Fermat's algorithm, but if we try to scale up the size of number we factor, we quickly encounter a bottleneck: finding the smooth numbers in the sequence. Only 7 of the 60 values we computed in our last example were 43-smooth (actually we were lucky to get a square with so few vectors). As the size of the number that we're factoring grows, so does the size of the numbers in the sequence, and the proportion of smooth numbers rapidly shrinks. Although finding smooth numbers doesn't require completely factoring every number in the sequence (we only have to test primes up to the smoothness limit), it's still too expensive to test every number in the sequence this way.

The key is to observe that the prime factors of the sequence a2n follow a predictable sequence. Let's take a look at the prime factorizations of the first ten or so numbers in our example sequence above:

318 = 2×3×53
921 = 3×307
1526 = 2×7×109
2133 = 33×79
2742 = 2×3×457
3353 = 7×479
3966 = 2×3×661
4581 = 32×509
5198 = 2×23×113
5817 = 3×7×277

The most obvious pattern is that every other number is even, beginning with the first one. This should be no surprise, since we're effectively adding 2a+1 to get each new number, which is always odd. Also, you'll notice that the first and second numbers are divisible by 3, as are the fourth and fifth, the seventh and eigth, and so on. If you look at the larger list, you'll notice similar patterns for larger primes; for example, the 3rd and 6th numbers are divisible by 7, and every 7th number after each of them as well. And, mysteriously, not one number in our entire sequence is divisible by 5!

So what's going on? The answer involves what number theorists call quadratic residues. A number a is called a quadratic residue mod p if there is some square S such that Sa is divisible by p. Half of all numbers are quadratic residues mod p, regardless of the value of p, and there's a simple formula for determining whether or not a particular number is: just take a, raise it to the power (p−1)/2, and then take the remainder after division by p. Then a is a quadratic residue mod p if and only if the answer is 1. Although this computation seems to involve very large values, in fact we can compute it quite quickly using exponentiation by squaring combined with frequent remainder operations.

This explains why none of our values are divisible by 5. If we compute 90283(5-1)/2 mod 5, we get 4, which is not 1 (remember that 90283 is our original n to be factored). Thus, there is no square such that Sn is divisible by 5; but all numbers in our sequence have this form. In practice, this means we can compute just once ahead of time which factors may occur in the sequence (primes p such that n is a quadratic residue mod p), and ignore all others.

For our next mystery, why is it that given a number in the sequence divisible by p, every pth number after that is also divisible by p? Well, simple algebra shows that if a2n=kp, then:

(a+p)2n = (a2n)+p(2a+p) = kp+p(2a+p).

But this doesn't explain why it always seems to be the case that there are exactly two different initial values of a such that a2n is divisible by p (with the exception of p=2). For example, in our sequence above the 3rd and 6th values were divisible by 7. The answer again is quadratic residues: it can be shown that the modular equation x2≡y (mod p) has exactly two solutions (if it has any), and in fact there is an efficient algorithm for computing these two solutions called the Shanks-Tonelli algorithm. I won't go into it here since it requires some background in number theory, but for small primes it isn't really needed; it suffices to test the first p numbers to see which are divisible by p. For larger primes, it becomes important to avoid this expensive scan.

Recall the Sieve of Eratosthenes, an algorithm for locating prime numbers. It starts with a list of numbers, then crosses off all numbers not divisible by 2 except 2, then does the same for 3, 5, and so on until it's done. The numbers that remain must be prime. When attempting to find a list of prime numbers, this strategy is much more efficient than running even the most advanced primality test on each number individually.

We take a similar strategy here: we begin with a table of the original values in the sequence. We then visit all the numbers divisible by 2 and divide out a factor of 2. We do the same for each power of 2 up to the size of the sequence. We then do the same for every other prime up to our smoothness bound (43 in our example). In the end, the smooth numbers and only the smooth numbers will have become 1. Since we visit less and less list elements as the prime factor increases, the overall work is much less. For example, here's our original list from the above example:

318, 921, 1526, 2133, 2742, 3353, 3966, 4581, 5198, 5817, 6438, 7061, 7686, 8313, 8942, 9573, 10206, 10841, 11478, 12117, 12758, 13401, 14046, 14693, 15342, 15993, 16646, 17301, 17958, 18617, 19278, 19941, 20606, 21273, 21942, 22613, 23286, 23961, 24638, 25317, 25998, 26681, 27366, 28053, 28742, 29433, 30126, 30821, 31518, 32217, 32918, 33621, 34326, 35033, 35742, 36453, 37166, 37881, 38598, 39317

We visit elements 1, 3, 5, and so on, dividing out 2. Here's the list after this first pass is complete:

159, 921, 763, 2133, 1371, 3353, 1983, 4581, 2599, 5817, 3219, 7061, 3843, 8313, 4471, 9573, 5103, 10841, 5739, 12117, 6379, 13401, 7023, 14693, 7671, 15993, 8323, 17301, 8979, 18617, 9639, 19941, 10303, 21273, 10971, 22613, 11643, 23961, 12319, 25317, 12999, 26681, 13683, 28053, 14371, 29433, 15063, 30821, 15759, 32217, 16459, 33621, 17163, 35033, 17871, 36453, 18583, 37881, 19299, 39317

Here it is after dividing out the prime factors 3, 5, 7, 11, 13, and 17:

53, 307, 109, 79, 457, 479, 661, 509, 2599, 277, 1073, 7061, 61, 163, 263, 3191, 1, 10841, 1913, 577, 6379, 1489, 2341, 2099, 2557, 1777, 1189, 5767, 2993, 18617, 1, 23, 10303, 1013, 1219, 22613, 3881, 163, 12319, 2813, 619, 26681, 4561, 1039, 2053, 9811, 5021, 37, 103, 10739, 16459, 1601, 1907, 35033, 851, 12151, 18583, 1403, 919, 39317

We see a couple 1s have already appeared; these are 17-smooth numbers. When we get all the way up through 43, we have:

53, 307, 109, 79, 457, 479, 661, 509, 113, 277, 1, 307, 61, 163, 263, 3191, 1, 293, 1913, 577, 6379, 1489, 2341, 2099, 2557, 1777, 1, 5767, 73, 18617, 1, 1, 10303, 1013, 53, 22613, 3881, 163, 12319, 97, 619, 26681, 4561, 1039, 2053, 9811, 5021, 1, 103, 10739, 16459, 1601, 1907, 35033, 1, 419, 18583, 61, 919, 39317

We see several numbers set to 53 or 61; these would be smooth if we raised our bound a little bit.

This sieving process is where quadratic sieve gets its name from. This drastically decreases the overall work needed to find a sufficient number of smooth numbers, making it practical for very large numbers. This basic implementation could probably handle numbers up to 50-60 digits.

Improvements and optimizations

Quadratic sieve admits a number of "bells and whistles" to dramatically improve its runtime in practice. We mention only a few of the most important ones here.

The simple row reduction method of Gaussian elimination is not able to accomodate the very large smoothness bounds needed to factor large numbers, which often range in the millions, mostly due to space limitations; such matrices, if stored explicitly, would require trillions of bits. However, this method is wasteful, because most of the entries in the matrix are zero (they must be; each number has no more than log2n prime factors). Instead of using an actual two-dimensional array, we can just keep a list for each column that lists the positions of the 1 bits in that column. We then use a method well-suited to reducing sparse matrices such as the Lanczos algorithm. This still requires a fair amount of space; it's common to use block algorithms that work on small portions of the matrix at one time, storing the rest of the matrix on disk. The matrix step is notoriously difficult to parallelize and for large problems is often done on a single high-performance supercomputer.

The most expensive step by far is the sieving, which can require scanning billions of numbers to locate the needed smooth numbers. A common trick is to only track the approximate logarithm of each number, usually in fixed-point arithmetic. Then, when visiting each number, instead of performing an expensive division we only have to subtract. This introduces a bit of rounding error into the algorithm, but that's okay; by rounding consistently in the correct direction, we can ensure that we don't miss any smooth numbers and only capture a few spurious numbers that we can quickly check and reject. Because the logarithms of small primes are small, and require visiting more numbers than any others, primes like 2 and 3 are often dropped altogether.

Another problem is that a2n grows fairly quickly; because smaller numbers are more likely to be smooth, we get diminishing returns as we scan higher in the sequence. To get around this, we scan values of not just the sequence a2n but also a number of similar sequences such as (Ca+b)2n for suitable constants C, b. This variation is called the multiple polynomial quadratic sieve, since each of these sequences can be seen as the values of polynomial in a.

Finally, although the matrix step does not admit simple parallelization due to many data dependencies, the sieving step is perfectly suited to massive parallelization. Each processor or machine simply takes a portion of the sequence to scan for smooth numbers by itself, returning the small quantity of smooth numbers that it discovers to a central processor. As soon as the central processor has accumulated enough smooth numbers, it asks all the workers to stop. In the multiple polynomial variant, it's common to assign some of the polynomials to each machine.

One peculiar idea for massively parallelizing the sieving step, invented by Adi Shamir, is to use not computers but a specially constructed sieving device based on light emitters and sensors that he calls TWINKLE. The concept is that we have a light for each prime number whose intensity is proportional to the logarithm of that prime. Each light turns on just two times every p cycles, corresponding to the two square roots of n mod p. A sensor senses the combined intensity of all the lights together, and if this is close enough to the logarithm of the current value, that value is a smooth number candidate.

Conclusion

I hope this gives you all some insight into the workings of one of the most powerful factoring algorithms available, and into how unexpected algorithms and mathematics can be applied to familiar problems. Please leave any feedback, whether clarifications, corrections, complaints, or encouragement. Thanks for reading.