# A closer look at the complexity analysis of finding the k’th smallest element in two sorted arrays

Given two sorted arrays of the same length, with unique elements, find the kth smallest element in the combined collection. One solution involves the double binary search. I'll let you go read the article to see how it works. I'm here to dissect the complexity analysis.

Since every time we remove ¼ elements, the complexity is O(2*log(N)), i.e., O(log(N)).

Okay, that was pretty glib. How did we get there?

If an algorithm reduces the problem to a smaller problem that is half the size, then the number of reduction steps needed to reduce the problem to size 1 is ⌈lg N. More generally, if you have an algorithm where the ratio of the size of the next iteration to the size of the current iteration is r, then the number of reduction steps is ⌈log1/r N⌉.

In the above example, the ratio is r = ¾, so the number of reduction steps is ⌈log4/3N. We can change the logarithm to base 2 to arrive at ⌈(lg N)÷(lg (4/3))⌉ ⌈2.4 × lg N.

This is not the same as 2 × lg N given in the original article. My guess is that the original article calculated the conversion factor incorrectly as lg 4 = 2.

But since this is all for order of magnitude calculations, an error in the constant factor is forgiven because order of magnitude ignores constant factors.

But wait, does the formula even apply in the first place?

The formula applies if the reduced problem is the same as the original problem, but with a smaller size. In the case under consideration, the starting point was two equal-sized arrays, but when we're done, we have two unequal-sized arrays: One of the arrays shrunk in half, but the other stayed the same size.

We used the wrong formula!

Consider the second iteration, where we have one small array and one large array. And suppose the second iteration of the algorithm decides to throw away half of the smaller array. We did not throw away a quarter of the elements. We threw away half of the smaller array, which is one third of the total number of elements, which means that we threw away only a sixth!

It gets worse at the third iteration: If we are unlucky and the algorithm decides to throw away half of the tiny array, we discarded only one tenth of the elements.

So in the worst case, we get into a case of diminishing returns, where were throw away less and less from the small array and never make a dent in the big array. Does this algorithm even guarantee termination?

In mathematics, sometimes the way to solve a problem is to convert it to a harder problem, and then solve the harder problem. In this case, the harder problem is "Given two sorted arrays of possibly-unequal length, with unique elements, find the kth smallest element in the combined collection."

Let f(n, m) represent the number of reduction steps needed to solve the problem, where n and m are the lengths of the two arrays. If you decide to throw away half of the first array, then the number of remaining steps is fn, m). Similarly, if you decide to throw away half of the second array, then the number of remaining steps is f(n, ½m). You now have the recursion f(n, m) = 1 + max(fn, m), f(n, ½m)).

The solution to this recursion is f(n, m) = ⌈lg n⌉ + ⌈lg m.

You can think of the problem this way: You have two arrays, and based on the algorithm, you cut one in half or you cut the other in half, until both arrays are cut down to just one element. It takes ⌈lg n cuts to reduce the first array and ⌈lg m cuts to reduce the second array. The algorithm tells you which piece to cut next, but regardless of what order the algorithm gives, the total number of cuts is the same.

In the original problem, n = m = ½N. Therefore, the number of reduction steps is ⌈lg ½N⌉ + ⌈lg ½N⌉ = 2⌈lg ½N⌉ = 2⌈lg N⌉ - 2, which is O(lg N).

So the answer given in the article was off by two. This doesn't make any difference when calculating order of magnitude, but it was interesting that the incorrect calculation was so close to the correct one.

Tags

1. nathan_works says:

I remember cs304/algorithms (from a long time ago) saying something like 80% of all CPU time is spent sorting. A CMU slide says 25%. I wonder how true that still is today.

1. Brian_EE says:

With multiple cores, and most PCs (at least in work/industrial environments) being on 24/7, I would suggest that 80% of all CPU time is spent in the idle loop.

1. Aged .Net Guy says:

And 80% of the rest is spent processing or filtering spam or displaying inane or inflammatory videos.

What have we collectively wrought? Von Neumann would be simultaneously awed with our progress and appalled by our uses of it.

2. Antonio Rodríguez says:

Some time ago I read a joke. A 1950s-era man in a suit in front of a teletype was talking by (wired) phone: “Will we have inside our pockets machines capable of billions of operations per second? And will we use them to do “likes” and “faves”, and share pictures of our breakfast with our friends? Really?”. There’s nothing more to add.

2. Fabian Giesen says:

These figures are from way back in the punchcards-and-tapes era of mainframe computing.

There’s still a lot of sorting and searching going on (e.g. inside database engines, especially if you also count things like B-tree index maintenance as “sorting” in a sense), but computer workloads are a lot more heterogenous now than when the primary commercial usage of computers was accounting or payroll :).

2. Bryce Wagner says:

A binary search makes sense here, but the the choice of bounds does not. The search space is size k, so a binary search on it should be O(log2(k)), both the best case and worst case. We’re searching for how many of k elements should come from each array. At one extreme we have k elements coming from array 1. At the other extreme, we have k elements coming from array 2. So we can represent 0 as one extreme, and k as the other extreme. Your binary search ends up looking something like this pseudocode:

binarysearch(0, k, i => arrax1[k -i].CompareTo(array2[i])).

And then you can shrink your bounds further so it can’t go past the end of either array.

1. Excellent!

2. Hi, in those cases where k is bigger than half of the total size of array 1 and array 2, it should be faster to instead binary search downwards for (total size) – k as the new k.

1. That’s what Bryce’s “shrink your bounds further so it can’t go past the end of either array” step does. Magic.

3. Simon Clarkstone says:

I was thinking in similar ways; I imagine it as an ascending line and a descending line plotted on the same axes and trying to find where they cross.

Since we want to know which specific element is the answer and hence which array to get it from, we might want to consider the bottom k elements of array1 to be in some sense “interleaved” with the reversed bottom k elements of array2, so the index not only specifies at a specific bottom slice of both arrays, but also indicates which of those two slices has the answer at the top. I haven’t thought about exactly how this search would work.

1. Ben Voigt says:

@Simon: Yours is a simple binary search (for zero) on the array representing (the signed difference between the two signals).

3. Markus Falk says:

As both lists are already sorted, only k steps are needed to get the result.
2) Choose the smaller one and put it on the new list. If it was from A take the next element from A, otherwise from B.
3) repeat 2) until you have k elements.

This is always O(k)

1. But O(log k) is better than O(k).

1. Richard Wells says:

It’s not just better; it’s exponentially better. I’m ashamed to say it took one of my students pointing this out for me to realize it; I had never thought of it that way.
Perhaps a better challenge is to find the kth smallest item in two unsorted arrays in O(n) time. I know it can be done with a single array, and I’m pretty sure that solution can be extended to two arrays in a manner similar to this problem, but I have to admit I haven’t bothered working out the details.