Foxpro’s legendary speed is due in part to its index technology.
About 14 years ago, the speed of Foxpro was demonstrated to live audiences around the country. A table (places.dbf) of almost 3 million streets in the
Q: How do you search a 200 megabyte file for a single record?
A: Use an efficient compressed index.
Accessing the disk is orders of magnitude more expensive in time than accessing memory, so the goal is to keep as much in memory as possible, thus reading the disk as little as possible. Data compression helps to keep more data in memory, even if there is computation cost in compressing/uncompressing.
The index of a book is a table with 2 columns: the 1st column lists keywords from the text in alphabetical order and the 2nd has the page numbers on which the keywords occur.
A FoxPro index is very similar: An index on the key “streetname” can be thought of as just a table with 2 columns: the streetname values and the record # where the streetname occurs. Using the index to find data would then just be a search of this 2 column index table. This table could be searched using a binary search and would be fairly fast: however, it could be made faster.
Imagine the entries for this table. Because they’re in order, there will typically be a lot of repetition:
Name Rec #
The first 5 letters of these are the same, so these can be shortened to:
Each entry contains an integer indicating how many letters of the prior word to use as the first letters of the current entry. (This encoding is how I was able to compress 2 English spelling dictionaries into a tiny DLL.)
At about 15 megabytes, the index used (places.cdx) is a tree structure of about 30,000 nodes that is only 7 levels deep. (each node is 512 bytes: 512 * 30,000 = 15 megs). For 3 million streets, that’s an average of almost 100 records per node, or about 5 bytes per record. The entire index is so compressed that it can fit in memory.
Use a binary editor (like VFP\tools\hexedit\Hexedit.APP or Visual Studio (File->Open->Open With->Binary Editor) to examine a CDX file. You can see that the data is structured in 512 (0x200) byte segments or nodes. Not coincidentally, this is also exactly the size of a disk sector.
A VFP CDX file is a data structure that
- is compact to minimize disk access.
- Is Balanced: All records can be reached with roughly the same cost: any leaf can be reached by a path through roughly the same number of interior nodes.
- Minimizes Insertion/deletion costs
- Has nodes that contain left/right pointers to sibling nodes, so traversing data in indexed order (BROWSE, SCAN) does not require going back to the root node.
The sample code creates a sample table with 3 character fields: name, address and zip. Then an index is created on each field. The Names are generated as 10000N00001, 20000N00002, etc. and the Addresses are 10000A00001, etc. This data doesn’t compress well, but is uniform for easier analysis..
The sample then decodes the CDX and displays the data found in the CDX file. It does not read the original DBF table at all.
The output shows that each index node is identified by its address within the file. Using hexadecimal makes it easier to read because each file offset is a multiple of 0x200. It also shows that each node has a left and right node pointer. These make it efficient to traverse records without going back to the root.
The interior nodes are not compressed. You can read the uncompressed keys right in the raw hex dump of the file.
Each line is indented by the node depth, and contains several columns. The ixnode columns are:
- the node depth (distance from the root)
- the Node ID: a multiple of 0x200
- the node type: Tagnode or Ixnode
- the node attribute
- node left pointer, or 0xFFFFFFFF if none
- node right pointer, or 0xFFFFFFFF if none
- The # of key/value pairs in this particular node (nKeys)
· For interior nodes:. (not compressed). nKeys pairs of (key and Node ID of child node)
· For leaf nodes: nKeys pairs of (key and Record # )
This code creates a test index with 3 index tags.
CREATE TABLE test (name c(20), address c(30), zip c(11))
INDEX on UPPER(name) FOR 7 < 8 AND SECONDS()>0 TAG name && have a FOR expression that is always true for demo
INDEX on UPPER(address) TAG address
INDEX on UPPER(zip) TAG zip
INSERT INTO test (name,address,zip) VALUES ("fred ","52 anywhere","z11111")
INSERT INTO test (name,address,zip) VALUES ("barney","23 anywhere","z22222")
INSERT INTO test (name,address,zip) VALUES ("wilma ","33 anywhere","z33333")
INSERT INTO test (name,address,zip) VALUES ("betty ","43 anywhere","z44444")
FOR i = 5 TO 10 && try 100, 1000, 10000
INSERT INTO test (name,address,zip) VALUES (padr(i,5,"0")+"n"+PADL(i,5,"0"),padr(i,5,"0")+"a"+PADL(i,5,"0"),padr(i,5,"0")+"z"+PADL(i,5,"0"))
The output from running the Index decoder program
0 0x00000000 Tagnode 0x000000E0 0x00000400
1 0x00000400 Ixnode 3 0xFFFFFFFF 0xFFFFFFFF 3 ADDRESS 0x00000C00 NAME 0x00000600 ZIP 0x00001200
2 0x00000C00 Tagnode 0x00000060 0x00001000 UPPER(address)
3 0x00001000 Ixnode 7 0xFFFFFFFF 0xFFFFFFFF 10 10000A00010 10 23 ANYWHERE 2 33 ANYWHERE 3 43 ANYWHERE 4 50000A00005 5 52 ANYWHERE 1 60000A00006 6 70000A00007 7 80000A00008 8 90000A00009 9
2 0x00000600 Tagnode 0x00000068 0x00000A00 UPPER(name) FOR 7<8.AND.SECONDS()>0
3 0x00000A00 Ixnode 7 0xFFFFFFFF 0xFFFFFFFF 10 10000N00010 10 50000N00005 5 60000N00006 6 70000N00007 7 80000N00008 8 90000N00009 9 BARNEY 2 BETTY 4 FRED 1 WILMA 3
2 0x00001200 Tagnode 0x00000060 0x00001600 UPPER(zip)
3 0x00001600 Ixnode 3 0xFFFFFFFF 0xFFFFFFFF 10 10000Z00010 10 50000Z00005 5 60000Z00006 6 70000Z00007 7 80000Z00008 8 90000Z00009 9 Z11111 1 Z22222 2 Z33333 3 Z44444 4
You can see that the first NodeID is 0x0000. It points to NodeID = 0x400, which contains the tag names and NodeIDs of each tag root.
Each tag root (0xC00, 0x600, 0x1200), in turn, contains the NodeID of any child nodes, as well as the index expression and any FOR condition. In this small sample there are no interior nodes:, each of these child nodes are leaf nodes, which contain the actual key/record # pairs.
For complete source code of the index decoder (about 250 lines): click here. (The sample will not work in VFP8: it requires the new VFP9 CTOBIN features for decoding binary numbers.)
Experiment with changing the data, index expressions, field sizes and record count. Examine your own indexes.
This code only works with character string index expressions.