A Group Project – Create an Index File System

Article
06/05/2008

Back when I first started writing code for commercial use I was working on a project on a small computer (about the size of a small refrigerator) that was also a very low cost computer (well under $100,000). Obviously a computer like that could not support a real database engine. Whoa! That sounds ridiculous in today’s world doesn’t it! Well in the mid 1970s it was true. Back then databases and even most good keyed ISAM (indexed sequential access method) file systems were very expensive and only ran on really large expensive computers. That didn’t last long of course but it was true for one of the first systems I worked on. So we had something that was home grown.

Oh it was horrible. It was slow, inefficient, and complex to use. In fact they way it was set up when I started we had to create a set of custom subroutines for each new file we were going to use. The first thing I did was to create a single parameterized set of routines so I could reduce duplicate code and make the whole thing more maintainable, modifiable and supportable. I think that got me my first raise. It was also a great learning experience. Which got me to thinking recently.

It occurs to me that students could reasonably be expected to duplicate this tool as a group project. There are a number of useful things it would teach plus it would be large enough that students working as a team could do it. This would promote that whole team work soft skill that is in such high demand these days. So let me explain how this system worked and how I think it could be used for a lot of discussion points.

First off remember that this is a very inefficient tool. I know that. It makes it useful as a learning tool though as it automatically opens lots of doors for discussion. I’m also not looking to discuss the discussion questions below here on this blog. Rather I am suggesting them as useful in the classroom. With that said, this is how it worked.

What we are talking about is an indexed access method file. The user (a program) enters a key and the record that matches that key is returned. If the record does not exist an error is returned. The system includes two files – an index file and a data file. The index file has a header record, a sorted key area and an overflow area. The header record has pointers to the first and last record in the sorted area and the first and last record in the overflow area. There is also a number that indicates the last possible record in the index file. The data file includes only data records which may or may not be sorted – and are mostly inserted in order that they were added in the first place. The key records include a key value, a pointer to the location in the data file where the data is and a flag that indicates if the record exists or has been marked for deletion. Note that in an alternative implementation the deleted flag could be in the data record rather than the key record. There you have your first discussion item – which is more efficient?

The methods you would need in your program are:

OpenFile – Open the key and data files. BTW let’s not allow shared files for now. Maybe later after more discussion about what that means.

AddRecord – write the data record to the data file and write the key to the key file. If the key fits at the end of the sorted keys and there is room in the sorted key section add it there. If not, add it to the end of the overflow area.

Discussion:

What are the costs of sliding keys around to make the new key “fit” in the right place?
What about duplicate keys? Can they be allowed or do they complicate the system too much to be practical?
If the location we want is occupied by a record that is marked for deletion can we use it? If so do we add a new data record or just replace the existing “deleted” record?
What do we do if there is no more room in the file?
Is there a point where we should return a code that says “reorganize this file?” When is it?

FindRecord – given a key do a binary search of the sorted area to see if it exists there. If if it not in the sorted area search the overflow area sequentially, If the key exists read the record from the data file and return it. If it doesn’t exist return a status or error code.

Discussion:

If the keys are all sorted we could retrieve data in sequential order easily. What would a FindNext method look like and how would it work if the overflow is in use?
Error handling?

UpdateRecord – Replace data in an existing record by re-writing it in the data file.

Discussion:

Do we just pass in the index of the data record as returned from a previous FindRecord call or should we use a key to call FindRecord to verify the address?
What errors could we have and how and where do we handle them?

DeleteRecord – Set the deleted flag for the key/record.

Discussion:

How do we prevent deleting the wrong records? Do we delete based on a key search?
Should we over write the data in the data file? Why or why not?
Is this just a special case of UpdateRecord (which if might be if the deleted flag is in the record) or should be be completely separate?
Record not found? What other errors could there be?

CloseFile – Close the open files and flush any buffers.

Did I leave anything out?

Now we probably also want three programs.

A test bed program to test the method created.
A reorganization program to run periodically. This program would recreate the key file by sorting keys including the ones in the overflow area and remove records that are marked for deletion. We probably want to recreate the data file to get back the space used by “deleted” records. Should we place the data in a sorted order? Why or why not?
A bulk load program. This program would take a flat sequential file, find the keys and create a new key/index file and a new data file. How that should work should be quite a topic of discussion. Should it require that the file be in sorted order for example?

I’m not sure how long this should all take. I think it would depend on the number of students and their level of comfort programming. It would also depend on how much up front design work they do before they start coding. My gut tells me that more planning would result in a faster completion of a working project. The students will want to have all the method interfaces defined early as well and the internal structure of the files. Code re-use would be a time saver as well.

Of course once it is done the next step is to find a much better system (have them look up B* trees [B star] for example) and see if the design they have allows them to completely change the file structure without needing to change the method calls for accessing data.

I’d like to hear what others think of this as a project. Too much? Outdated by ease of using things like SQL, MySQL, Access, etc? Or is the learning and discussion of various issues worth the time themselves? If anyone tries it with a class I would really love to hear how it goes.

A Group Project – Create an Index File System

Additional resources