Fragmentation (part 4): what are heaps?

Ok - really catching up with the various blog post series I started back in June/July - time to bang out the next few posts in the series on index fragmentation. Remember I'm starting from first principles and covering

before getting into the details of fragmentation.

 

So what is a heap?

  • A heap is the simplest storage arrangement and is an unordered table (or you could think of it as a collection of unordered pages).
  • This means that rows will be inserted into the heap anywhere where there is space (don't let this statement confuse you - for a table that has nothing but inserts, records will always be appended to the last allocated page).
  • A heap consists entirely of data pages - as there is no b-tree, there are no index pages.
  • As a heap is unordered, the data pages are not linked in any way. (Ah - there are some exceptions. In SQL Server 2000 and 2005 (and maybe in 7.0 - I don't remember), the sysfiles1 table is a linked-heap. This table contains the locations of the files comprising the database and so for safety's sake we want the pages to be linked, but not with the complexity of having a b-tree because we read the pages directly at startup time. In SQL Server 2005, some other system tables are also linked-heaps).
  • If you want to do a singleton lookup from a heap, the whole heap has to be scanned to find the row(s) matching the search predicate.
  • Doing a select (*) of the contents of a heap will not guarantee to return the rows in the order they were inserted (as it did in 6.5 when all heaps were linked as sysfiles1 is and the space from singleton deletes was not reclaimed). This is because the pages are accessed in allocation order.
  • Non-clustered index records contain the physical RID (record ID or record locator) of the corresponding heap record.

Heaps have one interesting feature - forwarded records. If a record needs to be updated, the updated record size is greater than the current record size, and there is no space on the page to fit the new record in then we have two choices:

  1. move the record to a new page and change all the non-clustered index records that point to it to point to the new location of the record
  2. move the record to a new page and leave a forwarding record in the original location to point to the new location

Guess which approach we took? Right, #2. Approach #1 has enormous performance implications because of the extra IOs and logging involved. The forwarding record has the physical location of the new record and the forwarded record has a back-pointer to the forwarding record (so that if its deleted, the forwarding record can be deleted too).

 

This is one drawback of using heaps - all the extra space that's "wasted" with the forwarding/forwarded records. Another drawback is that when scanning through the heap, forwarding records have to followed immediately (as opposed to ignoring them and just reading the forwarded records when they're encountered). This is to vastly reduce the possiblity of read anomalies (such as non-repeatable reads or missed rows if a row moves before the scan point during a scan).

 

And that's it. Next up is clustered indexes...