http://papersincomputerscience.wordpress.com/

My summary from the introduction post: "*Papers in Computer Science* (originally called *A CS Paper a Day*)

is a blog for abstracts and brief summaries and discussions about

published computer science articles. I’ll cover a variety of areas,

focus on some of the most interesting and important papers in each one,

and I’ll try to explain them to a general computer science audience

without specialist background. Because most of these are available

online, many of them without cost, readers can consult the original

paper for more information if they’re interested."

I hope anyone interested in reading more of my stuff will take a look at *Papers in Computer Science* and leave comments. Thank you all for your patronage!

We're entering an era where CPU clock speeds will soon cease to scale upwards and instead CPU manufacturers are planning to put more and more independent cores on a chip. Intel plans to release an 80-core chip within 5 years. Consequently the research community is setting their eye on the manifold barriers to writing correct and efficient highly-parallel programs. Will we be able to continue to get the same boosts in computational performance that we have been getting through raw clock speed?

You might think it's just a matter of building effective tools for producing programs that can take advantage of highly parallel platforms, and training programmers to take advantage of them. But there are complexity theory results to show that there are some problems that, under widely accepted assumptions, fundamentally cannot be parallelized, and this is the theory of P-completeness. For these problems, it may be literally impossible to get any significant benefit out of throwing additional cores at them - such problems are called *inherently sequential*. More to the point, however long it takes your computer to solve these problems *right now*, that's about how fast it's going to stay for a long time.

The complexity class NC (Nick's Class, named after Nick Pippenger by Steven Cook) is the set of problems that can be solved in polylogarithmic time (O(log^{k} *n*) time for some integer *k*) by a machine with polynomially many processors. For example, given a list of *n* numbers, the typical way to add them up on a single processor would be with a loop, as in this C# snippet:

public static int AddList(List<int> list) {

int sum = 0;

foreach (int i in list) {

sum += i;

}

return sum;

}

But now suppose we had *n*/2 processors. Then each processor can add just two of the numbers, producing a list of *n*/2 numbers with the same sum. We add these in pairs as well, continuing until just one number remains, the sum. This fan-in algorithm requires only O(log *n*) time, placing the problem in NC (or technically the decision problem version of it).

We can use similar ideas to parallelize many algorithms. For example, the best known sequential algorithms for matrix multiplication requires time O(*n*^{2.376}) for an *n* × *n* matrix, but given O(*n*^{3}) processors, we can solve it in O(log *n*) time: for each entry, we assign *n* processors to compute the products that are added to form that entry, then use the fan-in procedure above to add them together. This can still be done in O(log *n*) time. There are efficient parallel algorithms for other common problems too, like sorting, string matching, and algorithms in graph theory and computational geometry.

In real life of course we don't have such a polynomial number of processors, so this is in some ways an impractical definition; all it really tells us is that if we have *many* processors, we can solve the problem dramatically faster. Practical parallel algorithms pay more attention to *speedup* - how many times faster the algorithm is with *k* processors than with one. Effective speedup requires a good way of dynamically decomposing the problem. But speedup is difficult to characterize formally in a machine-independent way, so we'll stick with the definition of NC.

The famous P = NP problem asks whether all problems whose solutions can be verified in polynomial time, can also be solved in polynomial time. In this problem, the problems in P are considered "feasible to solve", and most computer scientists believe P ≠ NP - that is, that some problems which we can feasibly verify cannot be solved feasibly. We believe this in part because we have shown that many important problems are NP-hard. To say a problem is NP-hard means there is a polynomial time algorithm that can take any problem in NP and convert it into an instance of that problem; consequently, if an NP-hard problem can be solved in polynomial time, so can any problem in NP.

But when you have a large number of processors, the question turns around: we no longer consider P feasible, because any algorithm using just one CPU would face a dramatic slowdown compared to algorithms that take advantage of all of them. Now NC algorithms are considered feasible, and we ask: can all problems with efficient sequential solutions be highly parallelized? That is, is NC = P?

Just as with P = NP, nobody knows. However, just as there are NP-hard problems, there are P-hard problems. A P-hard problem is a problem where there is an NC algorithm (running in polylogarithmic time on a polynomial number of processors) to convert any problem in P into an instance of that problem. If we had an NC algorithm to solve a P-hard problem, it would imply that NC = P, providing a constructive and mechanical way of parallelizing any problem we can solve sequentially in polynomial time. Just as most researchers believe P ≠ NP, most researchers believe NC ≠ P, which implies that no P-hard problem has an NC algorithm to solve it.

One of the most important P-complete problems is *linear programming*: given a linear target function in terms of some variables, and a number of linear inequalities expressing constraints between those variables, find the optimal (maximum) value of the target function. It finds uses in operations research, microeconomics, and business management, and it would be great if we could exploit large number of cores to tackle larger instances of these problems. But no one has found an efficient parallel algorithm for this problem.

Another important example is the *circuit-value problem*: given a digital logic circuit and its inputs, compute the output of the circuit. There is a trivial linear time solution in the number of gates: continually get the next gate that you know both inputs of and compute its output, until you're done. It would seem that certain gates don't influence each other and be computed in parallel. But because the wires can express complex interdependencies between values and can create long chains of dependencies, there's no clear way to parallelize a general circuit. Besides obvious applications to circuit design and simulation, many problems can be solved by families of circuits, and if we can evaluate them quickly we could solve these problems. But not only has no one has found a general way of parallelizing such evaluation, they've shown that even if we restrict the circuits so that wires can't cross, or such that only the inputs can be negated, the problem remains P-complete, and so difficult to parallelize (these are the *planar *and *monotone *circuit-value problems, respectively).

The first problem ever shown to be NP-complete, as well as an important problem in practice, is the boolean satisfiability problem: given a formula using AND, OR, and NOT, with a number of variables set to either true or false, determine if there is some way of setting the variables to make the whole formula evaluate to true. The P-complete version of this problem is called Horn satifiability.

Any boolean formula can be reduced efficiently to a form where it represents a three-layer circuit: a literal is either a variable or a negated variable; a clause is the OR of some literals; and an entire formula is the AND of some clauses. A "Horn clause" is a clause where all of the variables are negated except possibly one. This provides an efficient solution to the problem, since Horn clauses can be rewritten as implications (not(x1) or not(x2) or x3 is the same thing as (x1 and x2) implies x3). Formulas made up of Horn clauses appear in the context of logical deduction systems, such as the resolution systems used to model problems in AI. Again, no one has found an effective way to parallelize finding solutions for these formulas.

In the world of P = NP, people have long been studying ways to "get around" the infeasibility of finding a polynomial-time solution for NP-hard problems, such as approximation algorithms, special cases, fixed-parameter tractable solutions, and so on. We could ask the same questions about the NC = P question: could we find an approximate solution much more quickly by using all our cores, instead of running a classical algorithm to find an exact solution on just one? Are there special cases of these problems which are parallelizable? But comparatively little investigation has been done into this area, although some examples exist (like Karger and Motwani's "An NC Algorithm For Minimum Cuts").

Finally, I should mention that P-complete is relevant not only to the study of what problems can be efficiently parallelized, but also to what problems can be solved in a small amount of space. I'll leave this for another time though.

]]>Most computer scientists are familiar with the P = NP problem, which asks essentially whether we can verify more problems in polynomial time than we can solve. So fundamentally does complexity theory hinge on this result that the Clay Mathematics Institute has labelled it one of their seven Millennium Prize Problems, and will award $1,000,000 to the researchers that solve it.

But there are six other Millennium Prize problems that are often less accessible to computer scientists, requiring background in areas such as topology and complex analysis. One of the most important of these is the Riemann Hypothesis, which is normally stated in terms of complex zeros of the Riemann Zeta function. Here, we will describe a recent characterization of the Riemann Hypothesis more suited to a computer scientist's background based on growth of integer functions.

Consider the sum-of-divisors function σ(*n*) that sums up the divisors of the positive integer *n*. For example, σ(15) = 1 + 3 + 5 + 15 = 24. If you've studied asymptotic notation, a natural question to ask is: how fast does σ(*n*) grow?

It's clear that since 1 and *n* always divides *n*, σ(*n*) ≥ n + 1, so σ(*n*) is Ω(*n*). But σ(*n*) cannot be O(n), because σ(*n*)/*n* takes on arbitrarily large values. To see this, consider *n* equal to *k* primorial, the product of the first *k *primes. Because σ(*n*)/*n* is multiplicative, σ(*n*)/*n* will equal the product of σ(*p*)/*p* = (1+*p*)/*p* = (1 + 1/p) for each prime factor. But if you expand out this product, you get a sum including 1/p for each prime factor, and the sums of the reciprocals of the primes, also known as the harmonic series of primes, diverges. But the harmonic series of primes grows quite slowly, and is O(ln ln *n*).

From this we might conjecture that σ(*n*) grows at a rate of *n* ln ln *n*. Gronwall's theorem, proven in 1913, shows that in fact:

lim sup_{n→∞ }σ(*n*)* */ (*n *ln ln *n*) = *e*^{γ}.

Here, *e* and γ are both constants (the latter is the Euler-Mascheroni constant), so *e*^{γ} is a constant, equal to about 1.781. This is somewhat stronger than the statement that σ(*n*) is O(*n* ln ln *n*) - the lim sup actually says that for any ε > 0, some tail of the sequence σ(*n*)* */ (*n *ln ln *n*) must be bounded above by *e*^{γ} + ε.

But instead of a lim sup, what we'd really like is to have a hard bound; we'd like to say:

σ(*n*)* */ (*n *ln ln *n*) < *e*^{γ} for all *n*.

This is false, and a quick search identifies 27 small counterexamples: 2, 3, 4, 5, 6, 8, 9, 10, 12, 16, 18, 20, 24, 30, 36, 48, 60, 72, 84, 120, 180, 240, 360, 720, 840, 2520, and 5040. However, no search to date has located any larger counterexamples. From this we might conjecture:

σ(*n*)* */ (*n *ln ln *n*) < *e*^{γ} for all *n *> 5040.

In 1984, Guy Robin showed that in fact, this statement is true if and only if the Riemann hypothesis is true. This is Robin's theorem. Thus, the Riemann hypothesis could, in theory, be proven false by exhibiting a single positive integer *n* > 5040 such that σ(*n*) > *e*^{γ }(*n *ln ln *n*). Because most mathematicians believe the Riemann hypothesis is true (just as most computer scientists believe P ≠ NP) it's more likely that the above statement is true, and a proof of this fact would win you $1,000,000.

In most data structure and algorithms classes, the model used for basic analysis is the traditional RAM model: we assume that we have a large, random-access array of memory, and count the number of simple reads/writes needed to perform the algorithm. For example, selection sort takes about n(n-1)/2 reads and 2n writes to sort an array of n numbers. This model was relatively accurate up through the 1980s, when CPUs were slow enough and RAM small enough that reading and writing the RAM directly didn't create a significant bottleneck. Nowadays, however, typical desktop CPUs possess deep cache hierarchies - at least a sizable L1 and L2 cache - to prevent runtime from being dominated by RAM accesses. Data structures and algorithms that efficiently exploit the cache can perform dramatically more quickly than those that don't, and our analysis needs to take this into account.

For example, suppose you have a list of bytes stored in both an array and a linked list. Provided the linked list nodes are allocated at random positions, iterating over the array will be dramatically faster, exhibiting optimal locality of reference, while iterating over the linked list will trigger a cache miss for every element. We can formalize this: assume a cache line - the size of the block read into the cache at one time - has size B bytes. Then traversing the linked list requires n cache misses, while traversing the array requires precisely ceiling(n/B) cache misses, a speedup of B times.

It is possible to design data structures that specifically take advantage of the cache line size B. For example, in a previous post I described an unrolled linked list data structure where each node fills up a cache line, achieving similar cache performance to arrays while still allowing efficient insertions and deletions. Such a data structure is termed "cache-aware", because it *explicitly* takes knowledge about the cache model into account in its construction.

Unfortunately, this approach has a compelling disadvantage: one has to tune the data structure based on the cache line size B. This is a problem for many reasons: first of all, programs, even those compiled to native machine code, are often run on many different processors with different cache line sizes. A Pentium 4 can run the same machine code as a Pentium II, but has a very different cache model. But even when running on a fixed machine, these simple cache-tuned data structures can't take advantage of multilevel caches, because each one has a different size and a different block size. One can attempt to build a data structure for a specific cache hierarchy, but this is even less robust against a change of machine.

In 1999, in his masters thesis, Harold Prokop came up with an interesting solution to this problem: the idea of a *cache-oblivious* algorithm. This does not mean that the algorithm does not take advantage of the cache; to the contrary, it does so quite effectively. What it means is that the algorithm does not need to know the cache line size; it works effectively for *all* cache line sizes B simultaneously. This allows the algorithm to robustly exploit caches across many machines without machine-specific tuning. More importantly, it allows the algorithm to exploit multilevel caches effectively: because it works for all B, it applies between each two adjacent levels of cache, and so every level is well-utilized.

In fact, the multilevel advantage extends beyond simply taking advantage of both the L1 and L2 caches - such algorithms perform well even when the storage hierarchy is expanded to encompass much slower media such as slow memory, Flash, compressed memory, disks, or even network resources, without knowledge of block sizes or access patterns. For example, cache-oblivious B-trees can be used to implement huge database indexes that simultaneously minimize disk reads and cache misses.

Note that just because the algorithm doesn't depend on B doesn't mean the *analysis* of the algorithm doesn't depend on B. Let's take an example: iterating over a simple array. As noted earlier, this requires about n/B cache misses, which is optimal. But neither the array's structure nor the iteration algorithm explicitly take B into account. Consequently, it works for all B. In a multilevel cache, all data prefetched into every cache is used. This is the theoretical benefit of cache-oblivious algorithms: we can reason about the algorithm using a very simple two-level cache model, and it automatically generalizes to a complex, deep cache hierarchy with no additional work.

While cache-oblivious algorithms are clearly useful, at first it's not clear that there even exist any other than simple array iteration. Thankfully, extensive recent research has revealed cache-oblivious data structures and algorithms for a multitude of practical problems: searching (binary trees), sorting, associative arrays, FFT computation, matrix multiplication and transposition, priority queues, shortest path, query processing, orthogonal range searching, and more emerging every year. Practical comparison studies have shown a significant performance gain on real processors compared to traditional algorithms, although carefully-tuned cache-aware algorithms still enjoy an advantage on the specific machines they are tuned for (see e.g. [1][2]).

Many of the cache-oblivious data structures and algorithms that have been published are relatively complex, but here I'll describe a simple one just to give you a feel for it. This discussion is based on parts of this paper.

Consider the following simple search problem: we have a static list of records, and we wish to find the record with a particular key. Traditionally, this problem is solved with either an array and binary search, or a binary search tree. Both of these approaches exhibit dismal cache behavior. Database applications rely on B-trees, which group several records into each block. This greatly improves performance, but requires knowledge of the block size, and does not work across all levels of the cache hierarchy.

The key is the *van Emde Boas* *layout*, named after the van Emde Boas tree data structure conceived in 1977 by Peter van Emde Boas. Suppose for simplicity that the number of elements in a power of 2. We create a complete binary tree containing our records (where by "complete" we mean that all leaves are at the same level). This tree will have height *h* = log_{2}*n*.

Take a look at the first *h*/2 levels of the tree. This subtree has 2^{h/2} leaves, each itself the root of a tree of height *h*/2. In the van Emde Boas layout, we first lay out the subtree of height *h*/2 rooted at the root, followed by the subtrees rooted at each leaf of this tree from left to right. Each subtree is recursively laid out using the van Emde Boas layout. An example diagram is shown in Figure 1 of the paper.

The analysis proceeds like this: at each step of the recursion, the size of the subtrees being laid out is the square root of the size of the subtrees laid out at the previous step. Consequently, as we recurse, at some point we will be laying out subtrees between sqrt(B) and B in size (let's call it C). Each of these subtrees can be retrieved in a single memory transfer, and this layout partitions the tree into subtrees of this size.

The search procedure works like this: we access the root, which pulls in the first log_{2}C levels of the tree. We have only cache hits until we access a leaf of the tree, which pulls in the next log_{2}C levels of the tree rooted at that leaf. This continues until we reach the bottom of the tree. Because only 1 out of every log_{2}C accesses causes a cache miss, the total number of misses is log_{2}n/log_{2}C = log_{C}n. Because C is at least sqrt(B), log_{2}C is at least (log_{2}B)/2, and the total number of misses is at most 2log_{B}n.

We made one small assumption above: that each subtree is aligned with the cache block boundaries; if this is not true, each subtree can require two accesses, bringing the final maximum to 4log_{B}n. Probabilistic analysis shows that with a random starting point in memory and sufficiently large block size, the real constant is closer to 2.

This is the same asymptotic performance as B-trees, and is faster than ordinary binary trees by a factor of about log_{2}n/2log_{B}n = (log_{2}B)/2. For example, for a disk block size of 2048 elements, this would be a speedup of about 6.5 times. The advantage is more than theoretical: to quote the paper, "preliminary experiments have shown, surprisingly, that CO [cache-oblivious] B-trees can outperform traditional B-trees, sometimes by factors of more than 2".

But as practical as the research is in cache-oblivious algorithms, many applications and libraries have yet to take advantage of them. The data structures supplied with .NET, Java, Lisp, and so on are not cache-oblivious. We need to start putting this research into practice and reaping the benefits. I hope this gives you some insight into cache-oblivious algorithms, and I hope you will consider taking advantage of it the next time you develop a product whose performance critically depends on memory access patterns.

]]>I apologize to everyone for the hiatus - I realise a post is more than a little overdue and will try to be more regular in the future. I appreciate the support that I've received for the blog and continuing it.

Consider the following problem: you're starting a website for a chain of restaurants, and you want to have a "Locations" feature where a person can enter their address and get a list of the 10 outlets closest to their location, in order of distance from their location (as the crow flies, for simplicity). Moreover, each time they hit the "next" button, they should get the next 10 locations in order of distance.

More abstractly, say you have a list of locations on a 2D map, each specified as a point with an x and y coordinate. You can do any sort of preprocessing you like of this list of points. Then I ask you to efficiently answer: given some point, which is not necessarily one of the points in the list, produce a list of the *k* points closest to that point, in order of their distance from that point.

This problem, called the *k*-nearest neighbor problem, is very well-studied and has a variety of solutions. Here we describe one very simple but effective incremental algorithm first proposed by Hjaltason and Samet in their "Ranking in Spatial Databases" in 1995 (Citeseer). A spatial database is a database specialized to perform this type of query over its (potentially very large) data set.

Consider the following very simple but inefficient algorithm: we take each point, one at a time, and add it to a priority queue with its key equal to the distance from the search point. We then extract minimum repeatedly to obtain the points in order of their distance from the search point. Once we've extracted the first *k*, we're done, and with a good data structure this requires only O(*k* log *n*) time. Unfortunately, even with the best priority queue data structures based on Fibonacci heaps, adding all *n* points requires amortized linear (O(*n*)) time and space, which for a very large number of points as one would expect to find in a spatial database is prohibitive.

One approach to narrowing down the points we have to examine is to split the space into larger regions. The idea is that points close to the search point should be found in regions close to the search point. In more detail, we first divide the space into two rectangular regions, each containing some of the points. Each of these regions is, in turn, subdivided into two rectangular regions, and we continue in this manner until all the resulting regions have some small constant number of points in them. The result is a tree of regions, where each node is a subregion of its parent node, and each point lies in one of the leaves. It really doesn't matter how we choose to divide up the space - the authors used PMR quadtrees, but you could use kd-trees, simple quadtrees, or whatever you like. All that matters is that in the end each leaf region contains only a small number of points.

Creating the regions requires examining all the points, but the key is that this part does not depend on the search point, and so only has to be done once in preprocessing. The regions may require rebalancing as points are added or removed, but this can be done efficiently as well.

However, there's a trick to this: if the search point lies near the boundary of a region, its closest points may actually lie in a different, adjacent region. How can we correctly locate the nearest point without spuriously gathering all the points in all the regions around the point?

Here's the key insight: our priority queue will contain two types of objects, single points and regions. The key of each point is set to its distance from the search point, while the key of each region is set to the *minimum* distance of that region from the search point. Each time we extract a point from the queue, we output it; each time we extract a region from the queue, we expand it by adding its subregions to the queue, or if it's a leaf region, the points in that region. Because the region's key is set to the *minimum* distance of that region from the search point, any point that is outputted before the region is expanded is closer to the search point than *any* point in the region. This ensures correctness.

For example, consider the search problem shown to the right. There are 9 points, and we wish to find the 5 closest points to the big red search point in order. The space is recursively subdivided into regions as shown in the tree.

For the purpose of discussion we will refer to each region by the set of points it contains. Initially, the queue contains only the large region containing all points.

Current queue in order: {1,2,3,4,5,6,7,8,9}

This contains the red search point, so its minimum distance from the search point (its key) is zero. We remove it from the queue and add its two children {1,3,4,8} and {2,5,6,7,9}. The region {2,5,6,7,9} again contains the red search point and so has key zero. The minimum distance to the region {1,3,4,8} is shown by the red line extending out of the red point and to the left and is larger than zero.

Current queue in order: {2,5,6,7,9}, {1,3,4,8}

We extract and expand {2,5,6,7,9}, getting leaf region {2} and region {5,6,7,9}. The distance to region {2} is shown by the red line extending upward from the red point, which is clearly longer than the distance to region {1,3,4,8}. The region {5,6,7,9} has distance zero.

Current queue in order: {5,6,7,9}, {1,3,4,8}, {2}

We extract and expand {5,6,7,9} to get the regions {5,9} and {6,7}. The distance to region {6,7} is shown by the red line extending right from the red point, the longest distance so far encountered. The region {5,9} contains the red point and has distance zero.

Current queue in order: {5,9}, {1,3,4,8}, {2}, {6,7}

Next, we extract and expand region {5,9}. This is a leaf region, so we add the points 5, 9 to the queue. Point 5 is only slightly closer to the red point than region {2}, but farther than region {1,3,4,8}. Point 9 is farther than everything so far encountered.

Current queue in order: {1,3,4,8}, 5, {2}, {6,7}, 9

We extract and expand {1,3,4,8} to get {1,3,4}, at the same distance as {1,3,4,8}, and {8}, the distance of which is shown by the diagonal red line.

Current queue in order: {1,3,4}, {8}, 5, {2}, {6,7}, 9

We extract and expand {1,3,4}, which is a leaf region, and add the three points. Point 4 is closest, point 3 lies between {6,7} and 9 in distance, and point 1 is slightly closer than point 9.

Current queue in order: 4, {8}, 5, {2}, {6,7}, 3, 1, 9

We extract and output point 4, the closest point. We expand region {8} and add point 8, which is slightly farther than point 3.

Current queue in order: 5, {2}, {6,7}, 3, 8, 1, 9

We extract and output point 5. We expand region {2} and add point 2, which is slightly closer than point 8.

Current queue in order: {6,7}, 3, 2, 8, 1, 9

We extract and expand region {6,7}, adding points 6 and 7. Point 6 is farthest yet and point 7 even farther.

Current queue in order: 3, 2, 8, 1, 9, 6, 7

We've now expanded all regions and the remaining points in the queue are the remaining points not yet outputted in order.

The nice thing about this particular algorithm is that it's fully incremental: we could have stopped at any time once we had enough points, and pick up later when we want more. This makes it quite appealing in database queries, where you don't really know how many points you need in advance, because you might want to filter the results according to additional attributes. In practice, in a typical spatial database, one only needs to expand a small fraction of the regions in order to find a small number of closest points. The idea can be extended into higher-dimensional regions, spatial objects other than points such as areas and line segments, and even more general problems where one can easily compute the minimum key of a subtree of objects.

I hope you found this informative and I appreciate your comments.

]]>One of the benefits of functional languages is their great flexibility in list manipulation, which enables them to express certain computations concisely that would require one or more verbose loops in procedural languages. Many of the features that functional programmers take for granted such as first class functions, persistent data structures, and garbage collection were included in Lisp partly for the purpose of convenient list processing (Lisp stands for "LISt Processing").

Although C# is not a functional language, in C# 2.0 and the accompanying .NET Framework 2.0, a number of constructs and library calls have been added that allow generic List objects to be manipulated in a more functional way. The most important of these is anonymous delegates, which are the same construct that functional programmers call *closures*. We start out with the following method, which removes all elements from a list that match a given condition, passed in as a delegate:

delegate bool Predicate<T>(T arg);static void RemoveAll<T>(List<T> list, Predicate<T> condition)

{

int i=0;

while (i < list.Count)

{

if (condition(list[i]))

list.RemoveAt(i);

else

i++;

}

}

The use of the word "predicate" here comes from the mathematical predicate, which is just a function that yields a boolean true/false value.

Now say we have a list of integers, and we want to remove all the multiples of *x*, where *x* is a given number. In C# 1.0, using RemoveAll for this is so no easy task, requiring a temporary object:

class RemoveMultiplesOfHelper

{

private int x;

public RemoveMultiplesOfHelper(int x)

{

this.x = x;

}

public bool IsMultiple(int n)

{

return (n % x) == 0;

}

}static void RemoveMultiplesOf(List<int> list, int x)

{

RemoveAll<int>(list, new RemoveMultiplesOfHelper(x).IsMultiple);

}

The advantages of reusing RemoveAll are important: it's tricky to iterate through a list while removing elements from it, and programmers often introduce subtle bugs while attempting to do so. But this code has serious problems too: it's a hassle to write, it's verbose, the "temporary" class isn't very reusable and is difficult to name, and the code is spread all over the place, making the flow of execution difficult to follow. Because of these issues, many procedural programmers opt to reinvent the wheel instead.

Anonymous delegates solve all these problems. An anonymous delegate allows you to write code right in the middle of a method which has access to all the local variables in that method, and then package it up and pass it to another method to execute later. Here's our example using anonymous delegates:

static void RemoveMultiplesOf(List<int> list, int x)

{

RemoveAll<int>(list, delegate(int y) { return (y % x) == 0; });

}

In this much briefer implementation, the delegate "captures" the local value of x at the instant of its creation, then later RemoveAll uses it in computations.

Another handy trick you can do with anonymous delegates is take a method signature that doesn't quite fit an existing delegate and translate it into one that does. For example, suppose you're sorting a List using the overload of the Sort method that takes a System.Comparison delegate. Now say you're sorting a list of strings, and you have a string comparison method that takes a boolean determining whether the comparison is case-sensitive:

delegate int Comparison<T>(T left, T right);

void Sort<T>(Comparison<T> comparison);

int StringCompare(string left, string right, bool caseSensitive);

At first you appear to be in quite a pickle, as the signature of the `StringCompare`

method does not exactly match the delegate signature `Comparison<string>`

. Anonymous delegates make it easy to overcome this problem:

static void SortStrings(List<string> a, bool caseSensitive)

{

a.Sort(delegate(string left, string right) {

return StringCompare(left, right, caseSensitive);

});

}

We simultaneously match the desired delegate signature and push the decision of whether to sort case-sensitively or not to the caller, where it belongs, all with very little new code.

The advantages of anonymous delegates make functions like our Sort and RemoveAll considerably more useful. Consequently, a number of such functions were included with the generic List class in .NET 2.0:

- RemoveAll(Predicate<T>): Like our RemoveAll implementation above, removes all elements from a list matching a predicate.

- Sort(Comparison<T>): Mentioned above, this is a Sort method that takes a delegate. Sorting with anonymous delegates is very handy for unusual sorting orders; for example, the following sorts a list of integers in reverse order:
list.Sort(delegate(int x, int y) { return y.CompareTo(x); });

- Find(Predicate<T>): Finds the first element in a list satisfying a predicate. Handy for replacing linear searches. For example, this single line of code identifies the first file in the current directory modified in the last hour, or yields null if none:
new List<string>(Directory.GetFiles(Directory.GetCurrentDirectory())).Find(

delegate(string path) { return File.GetLastWriteTime(path) >=

DateTime.Now - new TimeSpan(1, 0, 0); });

- Exists(Predicate<T>): Determines whether a list contains an element satisfying a predicate. Besides its added readability in situations where you only care about the existence of a value, it also has another use: since "Find" returns the default value for the type T to indicate "not found", you need to use Exists for predicates that might match null or zero, such as:
bool containsEven = list.Exists(delegate(int x) { return (x % 2) == 0; });

- TrueForAll(Predicate<T>): Similar to Exists, but determines whether the predicate is satisfied by
*all*elements of the list. For example, this line of code might run some tests and see if they all pass:if (tests.TrueForAll(delegate(Test t) { t.Run(); return t.SuccessCode == 0; })

- FindAll(Predicate<T>): Creates a new list containing all elements satisfying a predicate. In functional languages this is sometimes called
*select*or*filter*, as we're "filtering out" the parts of the list not satisfying the predicate. Here's an example which, in a single line, gets a list of all processes using more than*k*threads:List<Process> x = new List<Process>(Process.GetProcesses()).FindAll(

delegate(Process p) { return p.Threads.Count > k; });

- FindIndex(Predicate<T>): Just like Find, except that instead of returning the first
*value*that satisfies the predicate, it returns its*index*in the list. For example, this expression yields the index of the first alphabetical character in a string:new List

(s.ToCharArray()).FindIndex(Char.IsLetter)

- FindLastIndex(Predicate<T>): Same as FindIndex, except that it finds the
*last*occurrence satisfying the predicate.

- ForEach(Predicate<T>): Simply executes the given delegate on each member of the list. Although just using a foreach loop is cleaner than using ForEach with an anonymous delegate, ForEach is still handy if you already have a named delegate, as in this example that deletes all the files in the current directory:
new List<string>(Directory.GetFiles(Directory.GetCurrentDirectory())).ForEach(File.Delete);

- ConvertAll(Predicate<T>): Perhaps the most powerful of the delegate-based List methods, ConvertAll feeds each element of the list through a conversion function, producing a new list. In functional languages this is often called
*map*because, like a mathematical function, it "maps" one set to another. Here's a simple example which produces a list of strings that are the lowercased versions of the strings in the original input list:list.ConvertAll

(delegate(string s) { return s.ToLower(); });

What's more, besides being individually useful, complex transformations can be achieved by chaining these methods together in clever ways. For example, here's two lines of code that produce a list of all files larger than a given size in a given directory:

static List<string> GetBigFiles(string directory, int bigLength)

{

List<string> paths = new List<string>(Directory.GetFiles(directory));

return paths.ConvertAll<FileStream>( File.OpenRead )

.FindAll( delegate(FileStream f) { return f.Length >= bigLength; } )

.ConvertAll<string>( delegate(FileStream f) { return f.Name; } );

}

The first ConvertAll opens all the files by using the static library method "File.OpenRead" as the converter. The FindAll filters out just the FileStream objects corresponding to large files. The final ConvertAll extracts the filenames from each stream. Each of these "list filters" is independently reusable. At first this way of programming may seem unusual or cumbersome, but just compare the size of the above method to a procedural implementation.

Don't forget also that you can write your own methods taking delegates to reap more benefits from anonymous delegates. A frequently useful application is for error handling and debugging code. Suppose you find yourself writing several methods that all look like this:

void foo()

{

Debug.Write("entering foo()");

try

{

// Do some stuff

}

catch (Exception e)

{

// Do some stuff

throw;

}

Debug.Write("exiting foo()");

}

It's not clear how to factor out the two "Do some stuff" parts, since they're nested right in the middle of the method. Once again, anonymous delegates are the answer. We can write a method like this taking delegates for the parts that get filled in:

delegate void DoAction();

delegate bool ProcessException(Exception e);

void DebugInvoke(string name, DoAction action,

ProcessException processException)

{

Debug.Write("entering " + name);

try

{

a();

}

catch (Exception e)

{

if (processException(e))

{

throw;

}

}

Debug.Write("exiting " + name);

}

Now we can mix-and-match try bodies and catch bodies at will, and get the tracing for free:

void foo(int x, int y)

{

int z = 0;

DebugInvoke("addFoo", delegate() { z = x + y; }, OverflowHandler);

DebugInvoke("addBar", delegate() { z = z + x; },

delegate(Exception e) { Debug.Assert(false); return true; });

DebugInvoke("addBaz", delegate() { y = z + z; }, OverflowHandler);

}static bool OverflowHandler(Exception e)

{

if (e is OverflowException)

{

Debug.Write("Encountered overflow");

return true;

}

return false;

}

Unfortunately, anonymous delegate syntax, while much more concise than the alternative, is still a bit verbose, requiring not only the word "delegate" but full argument lists. One of the goals of the next version of C# is to encourage the use of these functional idioms, partly by introducing a new closure syntax that will take advantage of type inference to overcome this syntactic overhead.

Just remember to keep this rule of thumb in mind: if you're iterating over a list, and the body of the foreach is pretty small, chances are you don't need to. See if you can do the same thing more concisely and with less risk using a combination of the above methods and anonymous delegates (and amaze your coworkers). I hope you found this article helpful.

]]>Today I'm going to talk about how the quadratic sieve factoring algorithm works, giving a comprehensive description assuming knowledge of only basic university-level mathematics.

The foundation of the most popular public-key cryptography algorithm in use today, RSA, rests on the difficulty of factoring large integers. When keys are generated, efficient algorithms are used to generate two very large prime numbers and multiply them together. The person who generated the keys knows these two numbers, but everyone else only knows the product. The product contains enough information to encrypt a message to the person; the two primes allow the recipient to decrypt it. There is no known way to decrypt it without using the primes, but by factoring, we can extract the two prime factors from the product and break the encryption.

At the time that RSA was invented in 1977, factoring integers with as few as 80 decimal digits was intractable; all known algorithms were either too slow or required the number to have a special form. This made even small, 256-bit keys relatively secure. The first major breakthrough was quadratic sieve, a relatively simple factoring algorithm invented by Carl Pomerance in 1981, which can factor numbers up to 100 digits and more. It's still the best known method for numbers under 110 digits or so; for larger numbers, the general number field sieve (GNFS) is now used. However, the general number field sieve is *extremely complicated*, and requires extensive explanation and background for even the most basic implementation. However, GNFS is based on the same fundamental ideas as quadratic sieve, so if factoring the largest numbers in the world is your goal, this is the place to start.

We'll begin by addressing a few problems that at first glance have nothing to do with factoring, then assemble them into a working algorithm. I won't be motivating them first, but trust me - they're important.

Suppose I give you a set of integers and I ask you to find a subset of those integers whose product is a square, if one exists. For example, given the set {10, 24, 35, 52, 54, 78}, the product 24×52×78 is 97344 = 312^{2}. The brute-force solution, trying every subset, is too expensive because there are an exponential number of subsets.

We'll take a different approach based on prime factorizations and linear algebra. First, we factor each of the input numbers into prime factors; for now we will assume that these numbers are easy to factor. For the above example set, we get:

10 = 2 × 5

24 = 2^{3} × 3

35 = 5 × 7

52 = 2^{2} × 13

54 = 2 × 3^{3}

78 = 2 × 3 × 13

When you multiply two numbers written as prime factorizations, you simply add the exponents of the primes used. For example, the exponent of 2 in 24×52×78 is 6, because it's 3 in 24, 2 in 52, and 1 in 78. A number is a square if and only if all the exponents in its prime factorization are even. Suppose we write the above factorizations as vectors, where the *k*th entry corresponds to the exponent of the *k*th prime number. We get:

[1 0 1 0 0 0]

[3 1 0 0 0 0]

[0 0 1 1 0 0]

[2 0 0 0 0 1]

[1 3 0 0 0 0]

[1 1 0 0 0 1]

Now, multiplying numbers is as simple as adding vectors. If we add rows 2, 4, and 6, we get [6 2 0 0 0 2], which has all even exponents and so must be a square. In more familiar terms, we want the last bit of each entry in the sum to be zero. But in this case, we don't need to store all of the numbers above, only the last bit of each number. This gives us the following:

[1 0 1 0 0 0]

[1 1 0 0 0 0]

[0 0 1 1 0 0]

[0 0 0 0 0 1]

[1 1 0 0 0 0]

[1 1 0 0 0 1]

Moreover, since we're only interested in last bits, we can perform all our addition using one-bit integers with wraparound semantics (in other words, mod 2). If we add rows 2, 4, and 6 in this way, we get [0 0 0 0 0 0 0], the zero vector. In fact, all squares correspond to the zero vector.

Let's rephrase this as a matrix equation problem. If we transpose the above matrix, so that rows become columns, we get this:

[1 1 0 0 1 1]

[0 1 0 0 1 1]

[1 0 1 0 0 0]

[0 0 1 0 0 0]

[0 0 0 0 0 0]

[0 0 0 1 0 1]

Call this matrix A. If we multiply A by the vector [0 1 0 1 0 1], using one-bit integer arithmetic, we get the zero vector. This tells us precisely which numbers we need to multiply to get a square. So, our goal is to find a nonzero vector x such that Ax=0 (remember, all arithmetic here is with one-bit integers).

If you've had a course in linear algebra, this problem should look very familiar; it's the problem of finding the null space of a matrix, the set of vectors such that Ax=0. The problem can be solved using row reduction (Gaussian elimination). We row reduce the matrix, and then assign values to the free variables in a way that gives us a nonzero solution. The other variables will be determined by these values and the matrix. You probably studied this problem using rational numbers, not one-bit integers, but it turns out row reduction works just as well for these. For example, if we add row 1 to row 3 in the above matrix, we get the following:

[1 1 0 0 1 1]

[0 1 0 0 1 1]

[0 1 1 0 1 1]

[0 0 1 0 0 0]

[0 0 0 0 0 0]

[0 0 0 1 0 1]

Completing the row reduction, we eventually end up with this matrix:

[1 0 0 0 0 0]

[0 1 0 0 1 1]

[0 0 1 0 0 0]

[0 0 0 1 0 1]

[0 0 0 0 0 0]

[0 0 0 0 0 0]

If we turn this back into a system of equations and rearrange, we get this:

*x*_{1} = 0*x*_{2} = −*x*_{5} − *x*_{6}*x*_{3} = 0*x*_{4} = −*x*_{6}

Suppose we choose *x*_{5}=0, *x*_{6}=1. From the above equations, it follows that the first four vectors have the values 0, 1, 0, and 1 (remember, one-bit integer arithmetic). This gives us our final vector, [0 1 0 1 0 1]. If we were to choose *x*_{5}=1 and *x*_{6}=0 instead, we'd get a different solution: [0 1 0 0 1 0], corresponding to 24×54 = 1296 = 36^{2}.

Moreover, a theorem of linear algebra tells us precisely how many input numbers we need to guarantee that a square can be found: as long as we have more columns than rows, the null space is guaranteed to be nontrivial, so that we have a nonzero solution. In other words, we just need more numbers than prime factors used by those numbers. As this case shows, though, this isn't a necessary condition.

The one remaining problem with this method is that if one of the numbers in our set happens to have very large factors, our matrix will have a large number of rows, which requires a lot of storage and makes row reduction inefficient. To avoid this, we require that the input numbers are *B-smooth*, meaning that they only have small factors less than some integer *B*. This also makes them easy to factor.

You might be wondering what squares have to do with factoring. The connection is the very simple factorization method known as Fermat's method. Although not efficient in general, it embodies the same basic idea as quadratic sieve and works great for numbers with factors close to their square root.

The idea is to find two numbers *a* and *b* such that *a*^{2}−*b*^{2} = *n*, the number we wish to factor. If we can do this, simple algebra tells us that (*a*+*b*)(*a*−*b*) = *n*. If we're lucky, this is a nontrivial factorization of *n*; if we're not so lucky, one of them is 1 and the other is *n*.

The concept behind Fermat's algorithm is to search for an integer *a* such that *a*^{2}−*n* is a square. If we find such an *a*, it follows that:

*a*^{2}−(*a*^{2}-*n*) = *n*

Hence we have a difference of squares equal to *n*. The search is a straightforward linear search: we begin with the ceiling of the square root of *n*, the smallest possible number such that *a*^{2}−*n* is positive, and increment *a* until *a*^{2}−*n* becomes a square. If this ever happens, we try to factor *n* as (*a*−sqrt(*a*^{2}−*n*))(*a*+sqrt(*a*^{2}−*n*)); if the factorization is trivial, we continue incrementing *a*.

Here's an example of Fermat's method from Wikipedia. Let *n*=5959; *a* starts out at 78. The numbers 78^{2}−5959 and 79^{2}−5959 are not squares, but 80^{2}−5959=441=21^{2}. Hence (80-21)(80+21)=5959, and this gives the nontrivial factorization 59×101=5959.

The reason Fermat's method is slow is because simply performing a linear search of all possible *a* hoping that we'll hit one with *a*^{2}−*n* square is a poor strategy − there just aren't that many squares around to hit. A better way of going about it is to proactively compute an *a* having this property (actually a similar property).

The key is to notice that if we take a number of *a*^{2}−*n* values, none of which are squares themselves, and multiply them, we may get a square, say *S*. Let *A* be the product of the corresponding values of *a*. Basic algebra shows that *A*^{2}−*S* is a multiple of *n*. Hence, (*A*−sqrt(*S*))(*A*+sqrt(*S*)) is a factorization of *some multiple* of *n*; in other words, at least one of these shares a factor with *n*. By computing the greatest common divisor (GCD) of each with *n* using Euclid's algorithm, we can identify this factor. Again, it may be trivial (just *n* itself); if so we try again with a different square.

All that remains is, given a list of *a*^{2}−*n* values, to find a subset whose product is a square. But this is precisely an instance of the problem discussed in the last section. Unfortunately, recall that that the method we came up with there is not efficient for numbers with large factors; the matrix becomes too large. What do we do? We simply throw away numbers with large factors! Theoretical results show that there are a fairly large number of values in the sequence *a*^{2}−*n* that are smooth (recall that smooth numbers have only small factors). This gives us a new factoring method that works pretty well up to a point.

For example, consider the number 90283. If we start *a* at 301 and increment it up to 360 while computing *a*^{2}−*n*, we get the following values:

318, 921, 1526, 2133, 2742, 3353, 3966, 4581, 5198, 5817, 6438, 7061, 7686, 8313, 8942, 9573, 10206, 10841, 11478, 12117, 12758, 13401, 14046, 14693, 15342, 15993, 16646, 17301, 17958, 18617, 19278, 19941, 20606, 21273, 21942, 22613, 23286, 23961, 24638, 25317, 25998, 26681, 27366, 28053, 28742, 29433, 30126, 30821, 31518, 32217, 32918, 33621, 34326, 35033, 35742, 36453, 37166, 37881, 38598, 39317

None of these are squares (the first square occurs at *a*=398); however, if we factor each value we will discover that 7 of these values have no factor larger than 43:

6438, 10206, 16646, 19278, 19941, 30821, 35742

If we take these 7 values and feed them to the algorithm described in the last section, it finds a square: 19278×19941×30821×35742 = 423481541612104836 = 650754594^{2}. The corresponding original *a* were 331, 332, 348, and 355, and their product is 13576057680. Now, we can factor the number:

(13576057680−650754594)(13576057680+650754594) = 12925303086 × 14226812274 is a multiple of 90283

GCD(90283, 12925303086) = 137

GCD(90283, 14226812274) = 659

137 × 659 = 90283.

The factorization algorithm above is considerably better than Fermat's algorithm, but if we try to scale up the size of number we factor, we quickly encounter a bottleneck: finding the smooth numbers in the sequence. Only 7 of the 60 values we computed in our last example were 43-smooth (actually we were lucky to get a square with so few vectors). As the size of the number that we're factoring grows, so does the size of the numbers in the sequence, and the proportion of smooth numbers rapidly shrinks. Although finding smooth numbers doesn't require completely factoring every number in the sequence (we only have to test primes up to the smoothness limit), it's still too expensive to test every number in the sequence this way.

The key is to observe that the prime factors of the sequence *a*^{2}−*n* follow a predictable sequence. Let's take a look at the prime factorizations of the first ten or so numbers in our example sequence above:

318 = 2×3×53

921 = 3×307

1526 = 2×7×109

2133 = 3^{3}×79

2742 = 2×3×457

3353 = 7×479

3966 = 2×3×661

4581 = 3^{2}×509

5198 = 2×23×113

5817 = 3×7×277

The most obvious pattern is that every other number is even, beginning with the first one. This should be no surprise, since we're effectively adding 2*a*+1 to get each new number, which is always odd. Also, you'll notice that the first and second numbers are divisible by 3, as are the fourth and fifth, the seventh and eigth, and so on. If you look at the larger list, you'll notice similar patterns for larger primes; for example, the 3rd and 6th numbers are divisible by 7, and every 7th number after each of them as well. And, mysteriously, not one number in our entire sequence is divisible by 5!

So what's going on? The answer involves what number theorists call *quadratic residues*. A number *a* is called a *quadratic residue mod p* if there is some square *S* such that *S*−*a* is divisible by *p*. Half of all numbers are quadratic residues mod *p*, regardless of the value of *p*, and there's a simple formula for determining whether or not a particular number is: just take *a*, raise it to the power (*p*−1)/2, and then take the remainder after division by *p*. Then *a* is a quadratic residue mod *p* if and only if the answer is 1. Although this computation seems to involve very large values, in fact we can compute it quite quickly using exponentiation by squaring combined with frequent remainder operations.

This explains why none of our values are divisible by 5. If we compute 90283^{(5-1)/2} mod 5, we get 4, which is not 1 (remember that 90283 is our original *n* to be factored). Thus, there is no square such that *S*−*n* is divisible by 5; but all numbers in our sequence have this form. In practice, this means we can compute just once ahead of time which factors may occur in the sequence (primes *p* such that *n* is a quadratic residue mod *p*), and ignore all others.

For our next mystery, why is it that given a number in the sequence divisible by *p*, every *p*th number after that is also divisible by *p*? Well, simple algebra shows that if *a*^{2}−*n*=*kp*, then:

(*a*+*p*)^{2}−*n* = (*a*^{2}−*n*)+*p*(2*a*+*p*) = *kp*+*p*(2*a*+*p*).

But this doesn't explain why it always seems to be the case that there are *exactly two* different initial values of *a* such that *a*^{2}−*n* is divisible by *p* (with the exception of *p*=2). For example, in our sequence above the 3rd and 6th values were divisible by 7. The answer again is quadratic residues: it can be shown that the modular equation *x*^{2}≡y (mod *p*) has exactly two solutions (if it has any), and in fact there is an efficient algorithm for computing these two solutions called the Shanks-Tonelli algorithm. I won't go into it here since it requires some background in number theory, but for small primes it isn't really needed; it suffices to test the first *p* numbers to see which are divisible by *p*. For larger primes, it becomes important to avoid this expensive scan.

Recall the Sieve of Eratosthenes, an algorithm for locating prime numbers. It starts with a list of numbers, then crosses off all numbers not divisible by 2 except 2, then does the same for 3, 5, and so on until it's done. The numbers that remain must be prime. When attempting to find a list of prime numbers, this strategy is much more efficient than running even the most advanced primality test on each number individually.

We take a similar strategy here: we begin with a table of the original values in the sequence. We then visit all the numbers divisible by 2 and divide out a factor of 2. We do the same for each power of 2 up to the size of the sequence. We then do the same for every other prime up to our smoothness bound (43 in our example). In the end, the smooth numbers and only the smooth numbers will have become 1. Since we visit less and less list elements as the prime factor increases, the overall work is much less. For example, here's our original list from the above example:

318, 921, 1526, 2133, 2742, 3353, 3966, 4581, 5198, 5817, 6438, 7061, 7686, 8313, 8942, 9573, 10206, 10841, 11478, 12117, 12758, 13401, 14046, 14693, 15342, 15993, 16646, 17301, 17958, 18617, 19278, 19941, 20606, 21273, 21942, 22613, 23286, 23961, 24638, 25317, 25998, 26681, 27366, 28053, 28742, 29433, 30126, 30821, 31518, 32217, 32918, 33621, 34326, 35033, 35742, 36453, 37166, 37881, 38598, 39317

We visit elements 1, 3, 5, and so on, dividing out 2. Here's the list after this first pass is complete:

159, 921, 763, 2133, 1371, 3353, 1983, 4581, 2599, 5817, 3219, 7061, 3843, 8313, 4471, 9573, 5103, 10841, 5739, 12117, 6379, 13401, 7023, 14693, 7671, 15993, 8323, 17301, 8979, 18617, 9639, 19941, 10303, 21273, 10971, 22613, 11643, 23961, 12319, 25317, 12999, 26681, 13683, 28053, 14371, 29433, 15063, 30821, 15759, 32217, 16459, 33621, 17163, 35033, 17871, 36453, 18583, 37881, 19299, 39317

Here it is after dividing out the prime factors 3, 5, 7, 11, 13, and 17:

53, 307, 109, 79, 457, 479, 661, 509, 2599, 277, 1073, 7061, 61, 163, 263, 3191, 1, 10841, 1913, 577, 6379, 1489, 2341, 2099, 2557, 1777, 1189, 5767, 2993, 18617, 1, 23, 10303, 1013, 1219, 22613, 3881, 163, 12319, 2813, 619, 26681, 4561, 1039, 2053, 9811, 5021, 37, 103, 10739, 16459, 1601, 1907, 35033, 851, 12151, 18583, 1403, 919, 39317

We see a couple 1s have already appeared; these are 17-smooth numbers. When we get all the way up through 43, we have:

53, 307, 109, 79, 457, 479, 661, 509, 113, 277, 1, 307, 61, 163, 263, 3191, 1, 293, 1913, 577, 6379, 1489, 2341, 2099, 2557, 1777, 1, 5767, 73, 18617, 1, 1, 10303, 1013, 53, 22613, 3881, 163, 12319, 97, 619, 26681, 4561, 1039, 2053, 9811, 5021, 1, 103, 10739, 16459, 1601, 1907, 35033, 1, 419, 18583, 61, 919, 39317

We see several numbers set to 53 or 61; these would be smooth if we raised our bound a little bit.

This sieving process is where quadratic sieve gets its name from. This drastically decreases the overall work needed to find a sufficient number of smooth numbers, making it practical for very large numbers. This basic implementation could probably handle numbers up to 50-60 digits.

Quadratic sieve admits a number of "bells and whistles" to dramatically improve its runtime in practice. We mention only a few of the most important ones here.

The simple row reduction method of Gaussian elimination is not able to accomodate the very large smoothness bounds needed to factor large numbers, which often range in the millions, mostly due to space limitations; such matrices, if stored explicitly, would require trillions of bits. However, this method is wasteful, because most of the entries in the matrix are zero (they must be; each number has no more than log_{2}*n* prime factors). Instead of using an actual two-dimensional array, we can just keep a list for each column that lists the positions of the 1 bits in that column. We then use a method well-suited to reducing sparse matrices such as the Lanczos algorithm. This still requires a fair amount of space; it's common to use block algorithms that work on small portions of the matrix at one time, storing the rest of the matrix on disk. The matrix step is notoriously difficult to parallelize and for large problems is often done on a single high-performance supercomputer.

The most expensive step by far is the sieving, which can require scanning billions of numbers to locate the needed smooth numbers. A common trick is to only track the *approximate logarithm* of each number, usually in fixed-point arithmetic. Then, when visiting each number, instead of performing an expensive division we only have to subtract. This introduces a bit of rounding error into the algorithm, but that's okay; by rounding consistently in the correct direction, we can ensure that we don't miss any smooth numbers and only capture a few spurious numbers that we can quickly check and reject. Because the logarithms of small primes are small, and require visiting more numbers than any others, primes like 2 and 3 are often dropped altogether.

Another problem is that *a*^{2}−*n* grows fairly quickly; because smaller numbers are more likely to be smooth, we get diminishing returns as we scan higher in the sequence. To get around this, we scan values of not just the sequence *a*^{2}−*n* but also a number of similar sequences such as (*Ca*+*b*)^{2}−*n* for suitable constants *C*, *b*. This variation is called the multiple polynomial quadratic sieve, since each of these sequences can be seen as the values of polynomial in *a*.

Finally, although the matrix step does not admit simple parallelization due to many data dependencies, the sieving step is perfectly suited to massive parallelization. Each processor or machine simply takes a portion of the sequence to scan for smooth numbers by itself, returning the small quantity of smooth numbers that it discovers to a central processor. As soon as the central processor has accumulated enough smooth numbers, it asks all the workers to stop. In the multiple polynomial variant, it's common to assign some of the polynomials to each machine.

One peculiar idea for massively parallelizing the sieving step, invented by Adi Shamir, is to use not computers but a specially constructed sieving device based on light emitters and sensors that he calls TWINKLE. The concept is that we have a light for each prime number whose intensity is proportional to the logarithm of that prime. Each light turns on just two times every *p* cycles, corresponding to the two square roots of *n* mod *p*. A sensor senses the combined intensity of all the lights together, and if this is close enough to the logarithm of the current value, that value is a smooth number candidate.

I hope this gives you all some insight into the workings of one of the most powerful factoring algorithms available, and into how unexpected algorithms and mathematics can be applied to familiar problems. Please leave any feedback, whether clarifications, corrections, complaints, or encouragement. Thanks for reading.

]]>If you've ever done work with Web graphics, I'm sure that at some point you reduced an image with thousands of colors, such as a screenshot or photograph, to an image with only 256 colors, for example to store the image in the GIF format. Remarkably, the reduced image is usually very visually similar to the original image, despite the massive loss of information. Most paint programs are capable of performing this transformation, and operating systems such as Windows will perform the transformation on-the-fly when the user is running a video mode with only 256 colors. But how does it work?

In the literature, this problem is called the color quantization problem -- generally, the word "quantization" refers to any process that maps several values to a single representative value. In this case, the values that we're mapping are colors. There are two main steps to color quantization:

- Generate a good palette

- Dither

First we'll talk about generating a good palette. The simplest approach is to use a fixed palette for all images. We can map pixels in the source image to pixels in the reduced image by simply choosing the closest color. What do we mean by "closest"? Well, we can identify each color with a point in three-dimensional space whose coordinates are the red, green, and blue values of the color. Then, we consider two colors to be close if their points are close in this three-dimensional space. This isn't ideal, but it's very simple.

The fixed palette approach has the advantage that all images using the same palette can be rendered on the same screen simultaneously; unfortunately, it severely compromises image quality. Making this even worse is that typical fixed palettes, such as the Windows standard palette and the web colors palette, are chosen as uniform palettes, meaning that the colors are evenly spaced throughout RGB space. This is a terrible way to construct a palette, as humans can distinguish much finer color differences in some areas of RGB space than others.

Below, we show a photograph of a rose taken by Stan Shebs, but modified to remove its blue channel. This means that all of its colors lie in a single plane, the plane blue = 0. That color plane is shown in the graph below. We also show a 16-color optimized palette found by Adobe Photoshop - each blue point is a palette color, and each region shows the section of the color space mapping to that palette color. As you can see, the palette is far from uniform, and there are many colors that the image doesn't even use.

From this picture, we can see that generating a good palette is essentially a point clustering problem. We want to identify clusters in the color space that contain a large number of points close together and assign a single palette entry to those colors.

One of the most popular algorithms for this problem is called the *median cut algorithm*. It was invented by Paul Heckburt in 1980 in his paper "Color Image Quantization for Frame Buffer Display" and is still in wide use today. The algorithm follows just a few basic steps:

- Create a block or box surrounding the points; it should be just large enough to contain them all.

- While the number of blocks is less than the desired palette size:

- Find the block with the longest side out of all blocks.

- Cut this block into two blocks along its longest side. Cut it in such a way that half of the points inside the block fall into each of the two new blocks (that is, we split it through the median, thus the name).

- Shrink these two new boxes so that they just barely contain their points.

- Find the block with the longest side out of all blocks.
- Average the points in each box to obtain the final set of palette colors.

Despite a number of more sophisticated modern algorithms, this strikingly simple approach is adequate for generating 256 color palettes for most images. You can read about a C++ implemention of the median cut algorithm at my wiki.

If the palette is large enough and of high enough quality, Reducing by nearest color is usually sufficient. If not, however, distinct artifacts will be visible, such as banding, where what were once smooth gradients become striped regions (see this image; look closely at the cat's neck). One effective technique for dealing with these palette limitations is called *dithering*, which mixes dots of different colors to simulate a larger number of colors and smooth out gradients. For example, if you mix a grid of red and blue dots, the result will appear purple. Dithering is also commonly used in printing, where for example a grayscale image can be simulated with great accuracy using a higher-resolution dithered black and white image.

The trick to dithering is to effectively simulate the original color in each area of the image while avoiding telltale artifacts formed by poor dithering patterns; see ordered dithering, commonly used in newspapers, as an example of a dithering scheme with clearly visible artifacts. One of the most popular dithering schemes to this day is the Floyd-Steinberg algorithm, a classic dithering algorithm falling into the general class of *error diffusion *dithering algorithms.

The basic concept behind error diffusion is to begin with nearest color matching, but whenever we can't match a pixel color exactly, the error between the original color and the palletized color is distributed into adjacent pixels not yet visited. In Floyd-Steinberg, the pixel to the right receives 7/16 of the error, the pixel below 5/16, the pixel below and to the left 3/16, and the pixel below and to the right 1/16. When these pixels are visited, this added error affects the palette entry selected for them, leading to an average colour over that portion of the image close to the actual average color of the pixels in that area.

What's great about Floyd-Steinberg is that it generalizes easily to reducing full 24-bit color images to palettes of any size. The only modification we need to make is that we perform the error distribution step for each of the three channels. The palette entry is still chosen by nearest color. You can read about a C implementation of Floyd-Steinberg dithering on my wiki.

This isn't the end of the story of course - there is ongoing research and all kinds of fancy variations on these approaches, many of which can produce much nicer results than these classical algorithms. But at least now the next time you see a GIF or 8-bit PNG file on the web, or use a computer in 256-color mode, you can appreciate the algorithms under the cover that helped make your images look good under these severe limitations.

]]>similar images effectively, often 5 to 15 times over a raw bitmap

format. It's a lossy format that exploits properties of human vision to

eliminate information that is difficult to distinguish. You see JPEGs

all the time, and if you've seen a JPEG compressed with a low quality

level, you have a good idea of the kinds of visual artifacts that

emerge when JPEG compression is pushed to its limits. But how does it

really work, and how does this cause JPEG artifacts?

The JPEG standard defines a number of different encoding options,

and there are a number of container file formats defined other than the

usual JFIF, but I'm going to discuss a single popular JPEG encoding

method used by many applications. This encoding combines three

different completely independent methods of image compression:

downsampling the chrominance channels, quantization of the discrete

cosine transformation, and entropy coding.

The most obvious way to make an image file smaller is to make the

image smaller - for example, a 100×100 bitmap is 1/4 the size of a

200×200 bitmap. You can expand the image back out to the original size

when viewing it by simply upsampling it. This isn't a great compression

method, since it causes clearly visible artifacts, usually blocky or

blurry effects, especially around detailed, sharp areas.

A different approach is instead of downsampling the entire image, to

only downscale certain channels of the image. For example, the eye's

red-green sensors are more numerous and sensitive than its blue-yellow

sensors, so if we shrink just the blue channel and later re-expand it,

the effect is much less noticable. But the RGB color space is not our

only choice - images can be encoded in many color systems, and for each

of these the channels we choose to downsample will have different

visual effects. This experiment

by Eric Setton of Stanford University demonstrates the effect on image

quality of downsampling various channels in various color systems.

The results are visually obvious: we are sensitive to red and green,

but even more sensitive to intensity or luminance, the relative

brightness of pixels compared to other nearby pixels. This is expressed

by the Y component of the YCbCr system and the V component of the HSV

system. Any downsampling of this information is visually disastrous. On

the other hand, downsampling the *chrominance*, expressed by the

Cb and Cr components of the YCbCr system, has almost no visual effect

(Cb and Cr stand for "chrominance blue" and "chrominance red" - see a sample image broken up into YCbCr channels in the middle of this page).

For this reason, typical JPEG files convert images to the YCbCr color

space, and then downsample both the Cb and Cr channels by a factor of 2

in both width and height. Further processing of the image proceeds on

each of the three channels independently.

This is the method in JPEG encoding largely responsible for the

artifact known as "color bleeding", where along sharp edges the color

on one side can "bleed" onto the other side. This is because the

chrominance channels, which express the color of pixels, have had each

block of 4 pixels averaged into a single color, and some of these

blocks cross the sharp edge.

Normally, we view an image as simply a grid of pixels, and if we

alter a pixel or group of pixels dramatically, the effect will be

visually obvious. A different approach to storing images is to come up

with a set of "basis" images which can be added together in various

combinations to form the image we want. JPEG encodes each 8-by-8 square

of pixels using the following basis image set of 64 images:

We assign each of these images a relative weight determining how

much influence it has on the image. This weight can be positive or

negative (when negative, the basis image is subtracted). Taken

together, these 64 weights can describe any 8×8 image at all, and they

describe the image precisely. This example at Wikipedia

shows an 8×8 image, its original pixel array, and its new 8×8 array of

weights after being converted using the discrete cosine transform

(which I'll discuss later).

But what's the point of transforming 64 pixel values into 64

weights? For one thing, the resulting data may exhibit more structure,

and so be easier to compress. For example, with blocks from typical

photographs, the basis images in the upper left above have large

weights and the ones in the lower right have small weights.

But to store the weights, which are real numbers, we have to encode

them somehow. The simplest encoding is to just convert them to integers

between -*k* and *k* by scaling and rounding them. This works, but there's a tradeoff in our choice of *k*:

if it's too big, the encoded image will take too much space, but if

it's too small, the rounding might visibly distort the image, since

it's effectively altering the weights.

A more effective approach is to use a different *k* for each

basis image. This works well because the visual difference caused by

variations in the weights of different basis images is different. The

eye can detect small variations of the weights of the images in the

upper-left, but can overlook much larger variations in the weights of

the images in the lower-right, so we can get away with smaller *k* and larger rounding error for these weights.

In practice, rather than deal with expensive floating-point numbers,

we round the initial weights to integers and later divide them by fixed

factors called the *quantization factors *using integer division.

During decoding, we multiply them by these factors again. The

upper-left images have small quantization factors, so that little

information is lost in this process, while the lower-right images have

larger factors. This example shows a commonly used matrix of quantization

factors, one for each basis image, and the result of quantizing the

sample image using them. Observe how much simpler the matrix becomes -

it now contains a large number of entries that are small or zero,

making it much easier to compress. This example

shows how the weight values expand back out during decoding - notice

that although some of the weights are now different, the visual

difference is almost undetectable.

Quantization is the primary source of JPEG artifacts. Because the images in the lower-right

tend to have the largest quantization divisors, JPEG artifacts will

tend to resemble combinations of these images. The matrix of quantization factors can be directly controlled by

altering the JPEG's "quality level", which scales its values up or down.

Quantization compresses the image in two important ways: one, it

limits the effective range of the weights, decreasing the number of

bits required to represent them. Two, many of the weights become identical or zero,

improving compression in the third step, entropy coding.

But this discussion leaves one important question unanswered: how do

we take an arbitrary square of 8×8 pixels and decompose it into a

combination of the basis images? As it turns out, we specifically chose

the images above in a way to make this easy. These images are formed by

cosine waves of various frequencies in the x and y directions, and a

transformation called the discrete cosine transform from Fourier transformation theory allows us to rapidly decompose a signal into these parts in only O(*n* log *n*)

time. Because of the importance of JPEGs, over the years very fast and

clever specialised solutions for 8×8 blocks have been created in both

software and hardware. This is also the reason we use small,

constant-sized blocks instead of processing the image as a whole: the

cosine transform step doesn't need to scale up to large bitmaps, so the

transformation as a whole can run in linear (O(*n*)) time.

Unlike the other steps, entropy coding is lossless. If you've

studied the simple compression techniques used in ZIP files and BMP

files, the techniques of entropy coding should come as no surprise.

We start by turning our 8×8 grid of integers into a linear sequence

of integers. The obvious row-major and column-major orders are not

ideal here, since we expect there to be many zeros in the lower-right

area. Instead, we start with the top-left corner and zig-zag in a

diagonal pattern across the matrix, going back and forth until we reach

the lower-right corner.

Next, we apply the same run-length encoding algorithm used by GIF

files to condense sequences of identical integer values. In areas where

the values are small, we can expect many such runs, particularly runs

of zeros. We use a special code to say "all the remaining values are

zero", which comes in handy.

Finally, we apply Huffman compression to the remaining sequence,

which is a standard lossless compression technique that encodes

frequently used values using less bits than infrequently used values.

And that's it! For each YCbCr channel, for each 8×8 block, we

perform these transformations, and we store the results in a bitstream.

This data is wrapped up in a JFIF container giving metadata about the

image, and the file goes out ready for web browsers to view, completely

unaware of all the data that has been lost. Just don't go sending JPEGs

into space and expect aliens to see them the same way.

Finally, no discussion of the JPEG encoding would be complete

without noting that JPEG is based on research ideas that are now very

old and largely supplanted. The more modern JPEG 2000 encoding

based on wavelets can be pushed much farther before visible artifacts

appear, and includes a surprisingly efficient lossless mode. Of course,

all this improvement is rather moot until common applications actually

support it.

This article will discuss effective use of types to catch some common problems at compile time not normally found by the typechecker.

The typechecker is the most important tool in the programmer's arsenal. Types enable you to structure code in a more maintainable way and prevent entire classes of errors from ever occurring at runtime, many of which can be insidiously difficult to debug. If you've written anything extensive in assembly language, you can understand the frustration of dealing with misinterpretations of data leading to corruption of program state. Researchers like to say "well-typed programs never go wrong".

But what does "go wrong" mean? One answer is: "more than you think". By making effective use of the type system to encode invariants about your program, you can use the compiler to catch more errors at compile time that you currently catch at run time.

A simple example is SQL injection, where insertion of user-supplied data into a SQL statement string can enable attacks where the user modifies the nature of the statement with carefully chosen data. At first the solution seems simple: double your apostrophes to escape them, and then the value cannot be misinterpreted. This can be trickier than sounds, though, since in a complex program you might not be able to identify which values come from an untrusted source, directly or indirectly, which are safely produced from trusted sources, and which are untrusted data that has already been properly escaped. These three types of data travel from method to method, following a winding path that obscures their source. The compiler is of no help, since it sees them all as just strings.

The solution, called "tainting" in languages like Perl and Ruby, is to make them not "just strings", but to create two types of data: tainted data, which is unsanitized data derived from an untrusted source, and untainted data, which is either derived from a trusted source or already sanitized. For our example, suppose we're performing our queries with SqlDataReader. We create a wrapper class for it which has an identical interface except that wherever SqlDataReader takes a string, our wrapper takes a SafeSqlString. This new type is simply a wrapper for *string *that can only be generated in a few strictly limited ways:

- static SafeSqlString AssumeSafe(string): Used for trusted data, just wraps it
- static SafeSqlString Escape(string): Escapes apostrophes; used for untrusted data.
- SafeSqlString operator+(SafeSqlString): Concatenates two SafeSqlStrings
- static SafeSqlString Format(SafeSqlString format, params object[] args): Takes a safe format string and a list of arguments of safe types such as SafeSqlString, int, and double, and produces a new SafeSqlString.

What does this buy us? Now, as long as we use our new SqlDataReader wrapper for all database access, we are essentially immune to SQL injection: the only way it could sneak in is through AssumeSafe(), which is obviously dangerous and makes the code easy to review. As an added bonus, double escaping can never happen: the data only gets escaped once as it changes from *string* to *SafeSqlString*, and Escape() cannot be applied to *SafeSqlString*.

Because these types flow through the program, we can specify that a particular method always returns a SafeSqlString, or assumes that its arguments are SafeSqlStrings, or assumes that some private variable is always safe, and enforce this statically instead of by convention. This is important, because it allows us to use AssumeSafe() at a point where it's *obvious* that the string is safe, as opposed to five calls up the call stack where we really have no idea who else might be calling us.

What else can we encode in types? All sorts of things! Here are some more examples:

- We could have a type PositiveInt for positive integers, which will throw a runtime error if constructed with a nonpositive integer. Now, if a method receives a PositiveInt, it can safely pass it on to another method expecting a positive integer without any additional runtime checks. This trick works with any object where you want to enforce a complex runtime property on an object but only want to check it once (or only when it might change). In unmanaged code, it's particularly useful for checking buffer/string lengths.
- Some binary tree algorithms require that every node has zero or two children, never one. We can represent this using a subclass hierarchy, where the abstract base class is TreeNode and the subclasses are Leaf and InternalNode. Leaf contains no data at all, while InternalNode contains a piece of data and two (non-null) references to TreeNodes.
- There are many other properties of data structures that can be encoded in types. For example, in a red-black tree, red nodes cannot have red children. We can enforce this statically by creating a subclass hierarchy where BlackNode objects contain references to the abstract base class Node, while RedNode objects contain only references to BlackNode objects. Similarly, the root must be black, so we just use a BlackNode reference for the root.
- If you're writing a sorting algorithm, it might not be clear that the array you get back contains a rearrangement of the same elements as the array you passed in. There's no quick way to check this, even at runtime. On the other hand, if instead of an array you pass in an object whose only valid non-read-only operation is to exchange two elements, then this becomes statically obvious.

In short, wherever you find yourself wanting to keep track of a static property as it flows through a program, types are a great way of doing so - use them to catch errors sooner.

In programming languages with more sophisticated type systems, you can encode even more information in types. Taking this to the extreme are dependent types, a relatively new system of types based on values that are extremely difficult to typecheck but so powerful that they can express many correctness properties of real programs. A number of research languages such as Dependent ML and Epigram already implement dependent types. With the power of systems like these, maybe in a few years the primitive C-based type systems of C# and C++ will look a lot like assembly language does now. But before we can take advantage of new technology, let's start making better use of what we have - encode your program's properties using types.

]]>