I just got a question this morning about how to take two collections of items and determine how many of those items had the same name. The user had written this straightforward but extremely slow VBScript algorithm:

For Each Frog In Frogs

For Each Toad In Toads

If Frog.Name = Toad.Name Then

SameName = SameName + 1

Exit For

End If

Next

Next

There were about 5000 frogs, 3000 toads and 1500 of them had the same name. Every one of the 3500 unsuccessful searches checked all 3000 toads, and the 1500 successful searches on average checked 1500 toads each. Each time through the inner loop does one loop iteration, two calls to the Name property, one string comparison. Add all those up and you get roughly 50 million function calls to determine this count.

This code has been somewhat simplified – the actual user code was doing more work inside the loop, including calls to WMI objects. The whole thing was taking 10+ minutes, which is actually pretty fast considering how much work was being done. Each individual function call was only taking a few microseconds, but fifty million calls adds up!

Now, we’ve been down this road before in this blog (

__here__, __here__ and __here__) and so of course I recommended building a faster lookup table rather than doing a full search through the collection every time.

Set FrogLookup = CreateObject(“Scripting.Dictionary”)

For Each Frog In Frogs

FrogLookup(Frog.Name) = Frog

Next

For Each Toad In Toads

If FrogLookup.Exists(Toad.Name) Then SameName = SameName + 1

Next

Which is much, much faster. That’s only about 16 thousand function calls. Now, this is maybe not an apples-to-apples comparison of function calls, but we at least have good reason to believe that this is going to be several times faster.

And indeed it was. But I’m still not sure how much, which brings me to the *actual subject* of today’s blog. The user reported that the new algorithm was “250% better”. I hear results like this all the time, and I always have to ask to clarify what it means. You can’t just say “n% better” without somehow also communicating what you’re measuring.

(UPDATE: This reported result understandably confused some readers. Clearly the new loop here is *thousands* of times faster, not a mere 250% faster. As I said before, the sketch above is highly simplified code. The real-world code included many thousands of additional calls to WMI objects which were not eliminated by this optimization. Eliminating these 50 million function calls helped — you should always eliminate the slowest thing first! But doing so also exposed a new “hot spot” that needed further optimization. However, the point of this article is not the benefits of using lookup tables, but rather that using unexplained percentages to report performance results is potentially misleading. The result above is merely illustrative. See the comments for more details.)

Allow me to illustrate. Suppose I have a web server that is serving up ten thousand pages every ten seconds. I make a performance improvement to the page generator so that it is now serving up fifteen thousand pages every ten seconds. I can sensibly say:

- performance has improved by 50%, because we are now serving up 5000 more pages every ten seconds, and 5000 is 50% of the original 10000. In this world, any positive percentage is good.
- performance is now 150% of original performance because 15000 is 150% of 10000. In this world, 0%-100% worse or the same, 100%+ is good.
- We’ve gone from 1000 microseconds per page to 667 microseconds per page, saving 333 microseconds per page. 333 is 33% of 1000, so we’ve got a 33% performance improvement. In this world, 0% is bad, 100% is perfect, more than 100% is nonsensical.

If I can sensibly say that performance is better by 50%, 150% and 33%, all referring to exactly the same improvement, then I cannot actually be communicating any fact to my listeners! They need to know **whether I’m measuring speed or time**, and if speed, whether I’m comparing **original speed to new speed** or original **speed to the difference.**

So what is “250%” better? Clearly not raw timings. If this loop took 10 minutes to run before, 250% of 10 minutes is 25 minutes, but surely it is not running in -15 minutes now! I assume that the measurement is speed — say, number of loops runnable per hour. If before it took ten minutes and therefore we could run this loop six times per hour, and now it is 250% better, 250% of 6 is 15, and this is “better”, so we need to add. So that’s 21 loops per hour, or about 3 minutes per loop. Had the user accidentally said “250% faster” but meant to say “250% of previous performance” then we’d conclude that 250% of 6 per hour is 15 per hour, so now we’ve got 4 minutes per loop.

And of course, this is assuming that the person reporting the improvement is actually trying to communicate facts to the audience. Be wary! Since 33%, 50% and 150% are all sensible ways to report this improvement, which do you think someone who wants to impress you is going to choose? There are opportunities here for what Kee Dewdney calls “percentage pumping” and other shyster tricks for making weak numbers look more impressive. Professor Dewdney’s book on the subject, “200% of Nothing”, is quite entertaining.

The moral of the story is: **please do not report performance improvements in percentages**. Report performance improvements in terms of **raw numbers with units attached**, such as “microseconds per call, before and after change”, or “pages per second, before and after change”. That gives the reader enough information to actually understand the result.

(And the other moral is, of course, **lookup tables are your friends**.)

I’m reading the Dictionary docs:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/script56/html/jsobjDictionary.asp

and I have no idea what this line does:

FrogLookup(Frog.Name) = Frog

As I mentioned in yesterday’s blog entry, passing an argument to an object that has a default property where the property takes an argument is equivalent to calling the default property.

The default property of Dictionary is Item, and Item takes an argument. Therefore

dict(x) = y

is just a syntactic sugar for

dict.Item(x) = y

See

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/script56/html/jsproitem.asp

for the Item documentation.

"The default property of Dictionary is Item…"

How do you know what the default property of an (arbitrary) object is, if in fact there is one? The Dictionary object documentation doesn’t seem to identify Item as the default property (unless I’m missing something).

Is the existance and name of the default property something you can easily determine from VBScript? My apologies if you’ve answered this in the past.

I know this isn’t the topic, but wouldn’t this be even faster?

Function FrogExists(sName)

On Error Resume Next

FrogLookup = False

FrogLookup = IsObject(Frogs.Item(sName))

End Function

For Each Toad In Toads

If FrogExists(Toad.Name) Then

SameName = SameName + 1

End If

Next

In general there is no way from the language itself to determine any information about what the functions, properties, events, etc, on an object are. That information can be determined by examining the type library for the object of course, but there’s no facility from within the VBScript language for reflection on arbitrary objects.

However, one of the OLE Automation design guidelines is that all collection classes have at default getter (and preferably also a setter) called Item.

I regret that this wasn’t called out in the dictionary documentation; it probably should have been. The scripting documentation team was very, very small at the time this documentation was written and there were a lot of missed stuff like this.

If the Frogs or Toads collection supported a by-name indexer, then indeed, that would probably be faster than building an index. (Assuming that the people who wrote the indexer used an efficient lookup table internally!)

However, in the actual user code, I believe that the collection only supported indexing by ordinal, not name.

Good idea though.

I think the first poster is asking why you set the value of FrogLookup(Frog.Name) to Frog and not to some constant – since Exists() doesn’t care about the value anyway.

Still too many iterations for really big collections though…

If you sort both lists then you can compare for matches in n+m time where n is the size of list one and m is the size of list two..

This is because you only have to look at the second list until you find a member that is logically after it in the second list. Since the next number is after the first in order, you can start at the position in the second list you last stopped at and then go until you find a member that is logically after it. Eventually, you will exhaust both lists.

This gives an overall runtime of m log m + n log n + m + n or roughly n log n which is less for large sets than n * m (which is the complexity of the algorithm you suggested.

I leave the code as an exercise for the reader. I wonder if Larry Wall will consider me virtuous.

TANSTAFC!!

Shaun Bedingfield

blogsb.blogspot.com

shaunbed@houston.rr.com

Oh, by the way.. 5000 * 3000 = 15 M

and 5000 log2 5000 ~= 13 * 5000 = or 65 K.

If the constant is equal for both sets, which it won’t be… My algorithm would be roughly 23000 % faster. Not a small gain.

Using a constant value would work of course, and if you wanted to wring every last bit of performance out, that would probably be better. But being able to map back to the object from the name is often useful.

Shaun, I’m not following your logic here. How is my algorithm not O(n+m) already? We have O(n) hash table adds and O(m) hash table lookups.

Surely your algorithm will be dominated by the sort time, which will be larger than O(n+m).

It depends on your hash function.. If you are only getting a 33% increase over the original algorithm, the built in hash algorithm must be missing almost everything. Hash functions are only O(1) if you have a good hash.

Yes, my algorithm is O(n log n).

Hash functions are not magic and while they can be O(1), their worst case performance is O(n).. You seem to be much closer to the worst case so you are O(m*n)

The algorithm I proposed is O(m+n), not O(mn) as the original algorithm. That would have about 16000 function calls instead of 50 million. Assuming an apples-to-apples comparison — which is actually reasonable in this particular case — we’d probably get a factor of 3000x speedup.

For all practical purposes, the cost of the loop is now zero compared to the original cost.

The fact that the user was reporting a 3x speedup, not a 3000x speedup tells me that the likely cost of the original algorithm was 7 minutes in the loop, 3 minutes in other WMI calls. Now we’re down to 0 minutes in the loop, 3 minutes in other WMI calls, for a total speedup of 3x. Something else is the gating factor now; probably something else with thousands of WMI calls.

The dictionary hash function is reasonably good on sets of 5000 objects. I would not be putting millions of things into there though. Essentially the hash table is an O(n) table with a very small constant factor.

I am sorry, however, all of the information I had pointed to a very suboptimal hash function. This can happen.

At the end of the day, it is always best to profile before assuming that your hash function is indeed O(c) as it may not be.

You’re algorithm is O(n + m) AVERAGE time complexity (assuming the hash algorithm works for the problem. It is O(mn) worst time complexity as a hash lookup operation can be O(n).

Boy, do I know that the hard way.

As I’ve mentioned in this blog before, when I wrote the scripting dictionary hash algorithm, the first version had a bug, the upshot of which was that short strings consisting solely of digits would get hashed to a very small number of buckets.

We discovered this the hard way when http://www.msn.com performance started to suck one day. They had a dictionary with every zip code in the US in it, and the lookup cost very quickly became O(n). Fortunately it was an easy fix!

Of course, there are tradeoffs there as well.

The obvious solution is to just use a standard lookup area because there are so few zip codes. Excluding zip + 4 you have 10^5 or ~2^17 zip codes. The advantage here is you now exactly how much time every query will take and exactly how much memory you need. Properties like this are really important in certain fields like game programming.

Zip + 4 is not practical as you have 10^9 or ~2^30 (a good way to convert base 10 to 2 is remember that every 1000 is roughly equal to 2^10 = 1024) zip codes with a look up table though and would quickly consume almost all of a machine’s memory. You can add a two table system with a second layer of indirection but this only aids memory usage a little because you don’t need a secondary table for every zip.

In addition, changes to the zip system would require serious rework. You also start to run into problems as your table is no longer very page friendly and page faults abound.

At that stage, you pretty much need a good hash table to exploit the fact that zip codes are sparse (not everyone zip or zip +4 is actually used). A generic hash table has the advantage of being able to expand easily if the zip code properties change. All you really need to do is change the code to verify something IS a zip code. A more specialized function can take into account properties of the hash codes and may perform better but when hash codes are added or if the US government decides to change the rules, rework ensues.

When you don’t know the properties of the data going into your hashtable, you are playing a probability game. Bet on the library functions first and don’t customize until they fail you. Remember Knuth "Premature optimization is the root of all evil" and don’t try to optimize until library routines have failed. Besides, not everyone can write a hash function that out performs the library routines. Those routines are also tested.

But always, profile and make sure the code is doing what you think it is. I like to think I know everything, however, I often find the only thing more astouding than my intellect is my ignorance. Things don’t always work the way you think they will and when they don’t, you need to be able to figure out why. Always, learn as much as you can and then be ready to find out you were wrong.

Overall, the guys at Microsoft do write some nice code and make some great libraries which is why we all like giving them such a hard time ðŸ˜‰

Shaun Bedingfield

I assumed the problem was going to be duplicate names. If you have frogs "Abe, Bill, Bill, and Bill", and Toads "Abe, Abe, Bill, and Colin" how many are the same? Two? Three? Four? Seven?

Good point. I believe in this particular case the collections already ensure that every name within the collection is unique. But if you couldn’t make that guarantee then you would need some more specific statement of the desired output, and a different algorithm.

If your only counting, it is an easy mod.

You just have the associative array point to the number of items with that name and then add that number when you get a match.

Actually it is a multiply.. gr.. Looping through an associative array (size n) and getting the value of the other..

I am just not a morning person.

This is an obvious case of trying to optimise the code, and not the algorithm.

Sorting the lists, and comparing them is by far the best solution to this problem in the real world (As apposed to a simple sample).

Chances are you either need to do this test several times, or you have the data in trees , or inserting new items can be done in order, all of those leaving you with the sorting time a non-existant problem.

About 15 years ago, I was a programmer on IBM mainframes. I read Computerworld every week. The company Syncsort ran a full-page ad touting how much better their Syncsort product was than the built-in sort that the IBM operating system provided.

The Syncsort ad I remember best featured a quote from a customer in large type, crowing:

"Since we switched to Syncsort, our sort jobs take 120% less time than before!"

I felt sure that Syncsort had some employees with mathematics aptitude, but of course they didn’t vet the ads. That particular ad ran for several weeks at least.

One problem with not putting differences in Percentages–> There seem to be very few people who won’t look at the specifics and turn it into a percentage or fraction anyway. For example, if I say that we used to be able to process 176/second and now we can process 323/second, I invariably get asked what percentage increase it is? It is how many people think and understand data like this.

Statistics are fun things … as Samuel Clemens famously said, "There are lies, darn lies, and statistics."

Mark Twain is indeed famous for saying that, but he was quoting someone else.

http://www1c.btwebworld.com/quote-unquote/p0000149.htm