Cost of JetIntersectIndexes

We recently had a question about JetIntersectIndexes()'s performance. Here's what Brett wrote up:


Intersecting indices is a way to save disk IO at the cost of CPU time.

The cost of intersecting indices is the cost of iterating over ALL of the index ranges (case #3 below). That may not sound like a savings; HOWEVER, index entries are stored together and we try to keep them sequential, so iterating over index entries is generally very fast even off disk. If instead you walk a single index and retrieve columns that have to come from the real record, you have to jump back and forth between the index and the table: every record is a seek into the table, and seeking is slow if the cache is not already full of that data. Likely your cache is full of data, but let me continue with the cold-cache points first …
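To ground the discussion, here is roughly what this pattern looks like at the API level: one cursor per secondary index, each with an index range set on it via JetSetIndexRange(), all handed to JetIntersectIndexes(), which returns a temporary table of primary-key bookmarks for the records common to every range. This is only a minimal sketch: the table name ("People"), index names, key columns, range values, and the Call() error macro are invented for illustration, and cleanup of the cursors and temp table is omitted.

    #include <esent.h>

    // Hypothetical helper: bail out of the current function on any ESE error.
    #define Call( fn )  do { JET_ERR errT = (fn); if ( errT < JET_errSuccess ) return errT; } while (0)

    // Open a cursor on szTable, switch it to szIndex, and set an inclusive
    // index range [lo, hi] over a single long-valued key column.
    static JET_ERR SetRange(
        JET_SESID sesid, JET_DBID dbid, const char *szTable, const char *szIndex,
        long lo, long hi, JET_TABLEID *ptableid )
    {
        Call( JetOpenTable( sesid, dbid, szTable, NULL, 0, JET_bitTableReadOnly, ptableid ) );
        Call( JetSetCurrentIndex( sesid, *ptableid, szIndex ) );

        Call( JetMakeKey( sesid, *ptableid, &lo, sizeof(lo), JET_bitNewKey ) );
        Call( JetSeek( sesid, *ptableid, JET_bitSeekGE ) );   // fails if the range is empty

        Call( JetMakeKey( sesid, *ptableid, &hi, sizeof(hi), JET_bitNewKey ) );
        Call( JetSetIndexRange( sesid, *ptableid, JET_bitRangeUpperLimit | JET_bitRangeInclusive ) );
        return JET_errSuccess;
    }

    // Intersect three index ranges (think of the 3k / 240k / 71k ranges below)
    // and visit each surviving record via its primary-key bookmark.
    static JET_ERR IntersectExample( JET_SESID sesid, JET_DBID dbid )
    {
        JET_TABLEID rgtableid[3];
        Call( SetRange( sesid, dbid, "People", "IxBirthYear",  1970,  1980, &rgtableid[0] ) );
        Call( SetRange( sesid, dbid, "People", "IxZipCode",   98000, 98999, &rgtableid[1] ) );
        Call( SetRange( sesid, dbid, "People", "IxDeptId",        7,     9, &rgtableid[2] ) );

        JET_INDEXRANGE rgindexrange[3];
        for ( int i = 0; i < 3; i++ )
        {
            rgindexrange[i].cbStruct = sizeof(JET_INDEXRANGE);
            rgindexrange[i].tableid  = rgtableid[i];
            rgindexrange[i].grbit    = JET_bitRecordInIndex;
        }

        JET_RECORDLIST recordlist = { sizeof(JET_RECORDLIST) };
        Call( JetIntersectIndexes( sesid, rgindexrange, 3, &recordlist, 0 ) );

        // recordlist.tableid is a temp table holding one bookmark per record
        // that fell inside all three ranges.
        JET_TABLEID tableidPrimary;
        Call( JetOpenTable( sesid, dbid, "People", NULL, 0, JET_bitTableReadOnly, &tableidPrimary ) );

        JET_ERR err;
        for ( err = JetMove( sesid, recordlist.tableid, JET_MoveFirst, 0 );
              err >= JET_errSuccess;
              err = JetMove( sesid, recordlist.tableid, JET_MoveNext, 0 ) )
        {
            unsigned char rgbBookmark[256];   // plenty for a typical bookmark
            unsigned long cbBookmark;
            Call( JetRetrieveColumn( sesid, recordlist.tableid, recordlist.columnidBookmark,
                                     rgbBookmark, sizeof(rgbBookmark), &cbBookmark, 0, NULL ) );

            // This is the one random seek into the primary data per surviving record.
            Call( JetGotoBookmark( sesid, tableidPrimary, rgbBookmark, cbBookmark ) );
            // ... JetRetrieveColumn() the real record data here ...
        }
        return ( err == JET_errNoCurrentRecord ) ? JET_errSuccess : err;
    }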

Some examples help in imagining the cold-IO cases:

  1. Base Index Range Scan – Scanning or counting a range of secondary index entries (where you do not have to retrieve the main record) can walk hundreds of thousands, or even single-digit millions, of entries per second. The index entry data is often laid out sequentially, and a good index range tells us exactly what you’ll want to read, so we preread aggressively, often in just a few IOs [1] … let’s say 3 IOs for the sake of argument, though a range of 240k entries would probably take more than 3. Index entries are also very dense / small, so we sometimes end up CPU bound here even when the index is cold.
  2. Base Index Range + Primary Data Scan – On a _cold_ DB you can generally walk only ~100 records per second via a secondary index. While you can retrieve hundreds of thousands of index entries per second (per #1), each seek into the main table is cold and requires a disk operation, and random disk operations run at roughly 100 IOs per second (IOPS) [2] on HDs (not SSDs). Such a case is thoroughly random-IO disk bound.
  3. Intersected Index Ranges – iterating over all of the ranges at once:
     a. Best Case 3-way Index Range – Say you have 3 index ranges that select (as you do) 3k, 240k, and 71k records, and it turns out only 1 record satisfies all 3 ranges. That costs roughly 3 IOs × 3 index ranges + 1 seek IO for the final primary record ≈ 10 random IOs total (the sketch after this list works through these numbers). But how much did the 240k index range actually save you?
     b. Even the most costly range, the 240k one, probably processes in ~1 second or less (maybe a bit more; I’m not sure of the exact speed of our insert into the sort). So if you dropped that range and doing so gave you an extra 150 records to process (over the 1 you’ll eventually select), you only just barely lose out by dropping it: 150 extra random seeks is ~1.5 seconds of disk time versus ~1 second of CPU to process the range.
     c. If dropping the 240k index range gave you an extra 2,999 records to process (because, in case it isn’t obvious, we would implicitly know in this case that the 71k range didn’t add much selectivity over the 3k range), then that index range was EXTREMELY beneficial: 2,999 extra seeks is ~30 seconds of disk IO to find the 1 truly good record, versus ~1 second of CPU to process the very large range. That’s roughly a 30x improvement.
     d. If dropping the 240k index range gave you NO extra records (because there is already only 1 record in common between the 3k and the 71k ranges … sort of the opposite of 3.b), then the 240k range is pure cost. This is of course difficult to predict.
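To make the arithmetic in 3.a through 3.d concrete, here is a small back-of-envelope sketch. The constants are just the rough planning numbers used above (~100 random IOPS on an HD, ~3 IOs to preread a range, ~250k index entries per second of CPU for the intersection work); they are assumptions for illustration, not anything measured or reported by the engine.

    #include <stdio.h>

    // Rough planning constants from the discussion above (assumptions, not measurements).
    #define RANDOM_IOPS          100.0      // random seeks per second on an HD
    #define IOS_PER_RANGE        3.0        // IOs to preread one index range
    #define ENTRIES_PER_CPU_SEC  250000.0   // index entries processed per second of CPU

    // Estimated seconds to intersect the given ranges and then seek each survivor.
    static double intersect_cost( const double *range_sizes, int cranges, double survivors )
    {
        double io_seconds  = ( IOS_PER_RANGE * cranges + survivors ) / RANDOM_IOPS;
        double cpu_seconds = 0;
        for ( int i = 0; i < cranges; i++ )
            cpu_seconds += range_sizes[i] / ENTRIES_PER_CPU_SEC;
        return io_seconds + cpu_seconds;
    }

    int main( void )
    {
        double all3[]    = { 3000, 240000, 71000 };
        double drop240[] = { 3000, 71000 };

        // 3.a: all three ranges, 1 survivor -> ~10 random IOs plus ~1.3 s of CPU.
        printf( "keep 240k range, 1 survivor:     %.2f s\n", intersect_cost( all3, 3, 1 ) );
        // 3.b: drop the 240k range, 151 survivors -> ~1.5 s of extra seeks; a slight loss.
        printf( "drop 240k range, 151 survivors:  %.2f s\n", intersect_cost( drop240, 2, 151 ) );
        // 3.c: drop the 240k range, 3000 survivors -> ~30 s of seeks; keeping it wins ~30x.
        printf( "drop 240k range, 3000 survivors: %.2f s\n", intersect_cost( drop240, 2, 3000 ) );
        // 3.d: drop the 240k range, still 1 survivor -> the big range was pure cost.
        printf( "drop 240k range, 1 survivor:     %.2f s\n", intersect_cost( drop240, 2, 1 ) );
        return 0;
    }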


Every time another index range is added, it generally costs a few IOs [1] (call it ~30 ms) plus CPU at, let’s say, 1 second per 250k entries in the range. For a small index range of, say, 100 entries (where the IO cost dominates), if adding it improves the selectivity of the eventual set of records you have to look up by even 3 records per 3 IOs spent, you come out about even or better, because you save those costly #2-type seeks that can only be performed at a measly 100 IOPS. For a large index range, if the selectivity of the eventual set of records improves by 100 records per 250k entries processed (CPU-wise), then you save enough costly #2-type disk seeks to pay back the CPU cost. Keep in mind also that you have 4 cores to process things but generally speaking only 1 disk, so again, spending CPU instead of disk is nicer to the system as a whole.
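Folded into a single rule of thumb (again using only the assumed constants above, nothing the engine reports), the break-even test for adding one more index range looks like this:

    // An extra index range pays for itself when the random seeks it eliminates
    // cost more disk time than the IOs plus CPU spent processing the range.
    // Assumed constants as before: ~100 IOPS, ~3 IOs per range, ~250k entries/CPU-sec.
    static int range_worth_adding( double range_entries, double seeks_eliminated )
    {
        double seconds_to_add = 3.0 / 100.0 + range_entries / 250000.0;
        double seconds_saved  = seeks_eliminated / 100.0;
        return seconds_saved >= seconds_to_add;
    }

Plugging in the two cases above: a 100-entry range that eliminates roughly 3 seeks sits right at break-even, and a 250k-entry range needs to eliminate roughly 100 seeks. Anything beyond that is savings, and the savings come off the single shared disk rather than one of the four CPUs.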

However, if everything is warm, it generally becomes an "even numbers" game, where more entries is simply more processing. In that case it probably would’ve been better to just pick the most selective (smallest) index range and process its records straight off, applying the leftover terms (the residual predicate, in query parlance) against the primary record: that’s 2,000 index entries + 2,000 seeks, but no IO, just CPU. Every index range you add beyond that adds CPU processing to your operational time, at likely little improvement in selectivity. That is likely what you’re seeing here, but this is, again, another thing that is difficult to know a priori.
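For comparison, the warm-cache alternative described above looks roughly like this: walk only the most selective secondary index range and apply the remaining terms as a residual predicate while reading each primary record. This sketch reuses the hypothetical SetRange() and Call() helpers from the earlier example; the index name, column IDs, and predicate values are likewise invented for illustration.

    // Walk the smallest index range only and filter the rest in application code.
    // Retrieving non-indexed columns through a secondary-index cursor makes ESE
    // fetch the primary record for us; those fetches are the "2,000 seeks" above,
    // which are cheap when the cache is warm.
    static JET_ERR ResidualPredicateScan(
        JET_SESID sesid, JET_DBID dbid,
        JET_COLUMNID columnidBirthYear,   // obtained via JetGetTableColumnInfo (not shown)
        JET_COLUMNID columnidZipCode )
    {
        JET_TABLEID tableid;
        Call( SetRange( sesid, dbid, "People", "IxDeptId", 7, 9, &tableid ) );  // smallest range

        JET_ERR err = JET_errSuccess;   // SetRange left us on the first entry in the range
        for ( ; err >= JET_errSuccess; err = JetMove( sesid, tableid, JET_MoveNext, 0 ) )
        {
            long birthYear, zipCode;
            unsigned long cbActual;

            // These retrieves come from the primary record, not the index entry.
            Call( JetRetrieveColumn( sesid, tableid, columnidBirthYear,
                                     &birthYear, sizeof(birthYear), &cbActual, 0, NULL ) );
            Call( JetRetrieveColumn( sesid, tableid, columnidZipCode,
                                     &zipCode, sizeof(zipCode), &cbActual, 0, NULL ) );

            // Residual predicate: the terms the other index ranges would have covered.
            if ( birthYear >= 1970 && birthYear <= 1980 && zipCode >= 98000 && zipCode <= 98999 )
            {
                // ... this is one of the records you actually wanted ...
            }
        }
        return ( err == JET_errNoCurrentRecord ) ? JET_errSuccess : err;
    }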

[1] I am assuming throughout that we have reasonably good indices, such as those over dates, auto-increment-style IDs, common terms, or common names. Keep in mind that highly randomized indices, such as GUID indices, can have much poorer contiguity and thus can cost a lot more IO to traverse as a range, and that can change these results dramatically.

[2] These numbers are for HDs. SSDs change things significantly: they can often do 2,000+ random IOPS and don’t suffer nearly as large a penalty for random versus sequential IOs.