Inside of Table and Index Partitioning in Microsoft SQL Server


 


 


Introduction


 


With those huge fact tables, you typically find in today’s data warehouses, you’re often confronted with restrictions concerning the manageability of these tables and their indexes. Some actions like generating full database backups, checking the database consistency, or reorganizing indexes are becoming prohibitive in terms of time required to perform these activities. Additionally, bulk loading big chunks of data into these tables incurs big overhead for index maintenance and transaction logging and results in reduced concurrency on the tables due to blocking.


 


Most of these problems are being addressed by a feature available in the Enterprise Edition of SQL Server 2005 and later, called table and index partitioning. Quite a few information about partitioning is available in various sources, like SQL Server Books Online, or the excellent White Papers “Partitioned Tables and Indexes in SQL Server 2005" http://msdn.microsoft.com/library/ms345146.aspx,  or “Partitioned Table and Index Strategies Using SQL Server 2008" http://msdn.microsoft.com/library/dd578580.aspx. In the following, I’d like to dig a little bit deeper into this topic, showing the effects of partitioning on SQL Server metadata, with special focus on sliding window scenarios. Also, I’ll present some T-SQL queries that might come handy for administering your partitions. And finally, I will talk about the improvements SQL Server 2008 has up its sleeves for partitioning.


 


 


Table Partitioning


 


Partitioning a table means splitting the table data horizontally into smaller units, called partitions, that are easier to manage than the table as a whole. While a non-partitioned table is stored on a single filegroup, each of the partitions of a partitioned table can be mapped to a different filegroup, thus allowing for better control over the physical distribution of table space. With this separation, filegroup backups and checking the consistency of a single filegroup instead of the whole database are becoming a reality. Additionally, by marking some filegroups, that are associated with partitions that do not change, as READ_ONLY, you can take advantage of the new partial backup feature or the reduced locking when querying data on those filegroups.


 


So how does SQL Server determine, which partition a row of a partitioned table belongs to? For this, we first have to identify the column by which the table is being partitioned. The data type of this column defines the maximum range of possible values, e.g. for integer the range is from negative infinity to positive infinity. Next, this range is further divided into consecutive, non-overlapping sub-ranges, each of them defining a partition range.  Thus, the sub-range, into which the value of the partitioning column for a particular row falls, determines the partition the row belongs to. 


 


Therefore, the first decision to make, when creating a partitioned table, is which column to use to partition the table. This column is called the partitioning column. In SQL Server 2005 and SQL Server 2008 only one column can be used for partitioning a table, but this may change in future releases. The decision which column to choose will be based on data distribution and the intended usage of the partitioned table and is mostly based on the same criteria we apply, when identifying the columns of a clustered index. In fact, the partitioning column also has to be a member of the clustered index keys.  It’s obvious, that a partitioning column needs to be a fairly static one, as a change of the partitioning column also might result in a change of the partition the row belongs to and thus, the row would have to be copied physically from one partition to another. If the partitioned table is intended for business analysis purpose, the most predominant dimension used for analysis is a good candidate for the partitioning column, e.g. date, product group, sales region.  If the table stores historical data and at any time we only want to keep a certain range of data, i.e. we constantly age old data out and move new data in, the table needs to be partitioned on some date value stored in the table. This is the typical usage of the sliding window scenario as we will discover later.


 


 



Partition Functions


 


Once the partitioning column has been identified, we need to define the number of partitions to use and the boundary values for each of the partition ranges. This designation has to be set up by means of a Partition Function. The partition function is being created using the following T-SQL statement:


 


CREATE PARTITION FUNCTION <name> (<type>) AS RANGE {LEFT | RIGHT} FOR VALUES (v1, …, vn);


 


By specifying n boundary values v1 through vn of the given type, this function defines n+1 partitions. A maximum of 1000 partitions can be created this way. The type the function is based on must be of an integral SQL type,  LOB’s, xml, timestamp and user defined types are excluded. The order of the boundary values does not matter as long as all values are distinct. The system orders them implicitly in case they are not given in ascending order. LEFT or RIGHT specifies to which side of a boundary value the pertaining partition extends. If the boundary values are arranged in ascending order on a horizontal axis, the partition range alignment can be illustrated as: 


 


             partitioni                         partitioni+1
LEFT:   ---|[---------|]---        RIGHT:  ---[|---------]|---           
          vi-1         vi                       vi         vi+1


 


 


This arrangement probably becomes more obvious when listing for the partitioning column x the partitions and their pertaining ranges in tabular format: 


 






























 


RANGE LEFT


RANGE RIGHT


partition1:


x <= v1


x < v1


partition2:


x > v1 AND x <= v2


x >= v1 AND x < v2


         :


 


 


partitionn:


x > vn-1 AND x <= vn


x >= vn-1 AND x < vn


partitionn+1:


x > vn


x >= vn


 


 


The choice whether we specify a LEFT or a RIGHT range ordering is basically an arbitrary one with only minor functional differences. But both have their pros and cons. In most scenarios a RIGHT range seems to be easier to formulate. On the other hand, a LEFT range function might be somewhat more intuitive to use when it comes to switching partitions in and out. We will come back to this topic later when talking about sliding window scenarios. 


 


Also keep in mind that partition functions are not static. You can adjust the number of partitions at any time. By merging two partitions into one, you decrease the number by one, and by splitting one partition into two, you increase the number by one.


 


 



Partition Schemes 


 


Next, before starting to fill the partitions with data, we have to define where to store the partition data. This is being accomplished by means of a Partition Scheme. Each partition scheme is associated with a specific partition function, and for each partition defined by this function the filegroup where to store the data for this partition is being specified. A partition scheme is created using the following T-SQL syntax:


 


CREATE PARTITION SCHEME <name> AS PARTITION <partition function> TO (<fg1>, …, <fgm>); 


 


There has to be an assignment to a filegroup for each partition defined by the function, and the assignments have to be in order, i.e. partition1 maps to fg1, and so on. But it’s also legitimate to specify more filegroups than partitions exist. In this case, the first filegroup in excess will be marked as NEXT USED (more on this later). All filegroups listed must already exist in the database. But the filegroups do not need to be distinct, i.e. several partitions can be mapped to the same filegroup. There’s even a special abbreviated syntax for the case where all partitions map to the same filegroup, ALL TO (<fg>).


 


 



Creating Partitioned Tables 


 


Now that we have everything in place, let’s create the partitioned table. There is no big difference from creating a non-partitioned table, except the ON clause at the end of the statement. While for a non-partitioned table we can specify a filegroup where the table should be created on (if nothing is specified, the table is being created on the default filegroup), for a partitioned table we simply specify the partition scheme together with the partitioning column. So, the T-SQL statement for creating a partitioned table looks like:


 


CREATE TABLE <name> ( ) ON <partition scheme> (<col>)



 


The partitioning column specified must be a column of the table (even computed columns are allowed as long as they are persisted) and it must be of the same type (or implicitly convertible to this type) defined for the partition function the partition scheme is associated with. Practically, this column is being defined as NOT NULL, although this is not a requirement. The same partition scheme can be used for other tables as well.


 


 



Partitioning Metadata 


 


Now let’s have a closer look at the metadata associated with the various partition objects. SQL Server provides several system catalog views to make the internals of partitioning visible. For a comprehensive description of these catalog views please consult BOL. Here we are just focusing on the most meaningful columns of those views while following the creation process of a sample partition.



 


So let’s start with creating a partition function for an integer column: 


 


CREATE PARTITION FUNCTION PFTest (int) AS RANGE RIGHT FOR VALUES (10, 20, 30, 40)



 


Every partition function has an entry in the sys.partition_functions catalog view. The above sample partition function might create the following row: 


 


select name, function_id, fanout, boundary_value_on_right
  from sys.partition_functions
 where name = N'PFTest' 


 


name      function_id fanout      boundary_value_on_right
--------- ----------- ----------- -----------------------
PFTest    65546       5           1


 


The function_id is a unique identifier assigned to each partition function. The fanout column specifies the number of partitions defined by this function (Note: this number is always the number of boundary values given in the partition function definition plus 1). And the value for boundary_value_on_right is 1 for RIGHT aligned partition ranges and is 0 for LEFT ones.



 


The data type the partition function is based on can be determined from the catalog view sys.partition_parameters: 


 


select parameter_id, TYPE_NAME(system_type_id) as [type]
  from sys.partition_parameters
 where function_id = 65546



 


parameter_id type
------------ ----------
1            int 


 


Note that the partitioning catalog views are already prepared to deal with more than one partitioning column. But the current T­­-SQL syntax only allows to create partitions based on a single partitioning column. Therefore, sys.partition_parameters currently will only have one row per partition function and parameter_id always will be 1. 


 


And finally, the sys.partition_range_values catalog view has an entry for every boundary value given in the function definition:



 


select boundary_id, parameter_id, value
  from sys.partition_range_values
 where function_id = 65546 


 


boundary_id parameter_id value
----------- ------------ -----
1           1            10
2           1            20
3           1            30
4           1            40 


 


The boundary values specified in the CREATE PARTITION FUNCTION statement are assigned to boundary_id’s in ascending order. Also note that the value column is of type sql_variant having the base type of the partition function parameter, i.e. SQL_VARIANT_PROPERTY(value, 'BaseType') should give the same type returned by the previous query on sys.partition_parameters. 



 


Before creating the partition scheme, we have to create the filegroups to hold each partition, in case they do not yet exist. It’s also legitimate to place all partitions in the same filegroup, but doing so would sacrifice much of the benefits partitioning does offer to you. For our sample partition scheme let’s create five filegroups named FG1, FG2, FG3, FG4 and FG5 using the ALTER DATABASE ADD FILEGROUP statement.



 


Now we are ready to create the partition scheme based on our partition function and map each of the 5 partitions to one of our filegroups: 


 


CREATE PARTITION SCHEME PSTest AS PARTITION PFTest TO (FG1, FG2, FG3, FG4, FG5)



 


This creates an entry for the catalog view sys.partition_schemes that can be queried like: 


 


select name, data_space_id, function_id
  from sys.partition_schemes
 where name = N'PSTest' 


 


name      data_space_id function_id
--------- ------------- -----------
PSTest    65608         65546 


 


The data_space_id uniquely identifies the data space that is associated with the partition scheme (a data space represents the storage scheme of a table/index and either can be a filegroup, or a partition scheme). And function_id links the partition scheme with the partition function it is based on.



 


The data space of a partition scheme is further divided into destination data spaces where each destination data space is having an entry in the system catalog view sys.destination_data_spaces. Every destination data space is linked to its parent partition scheme through its partition_scheme_id column. The data_space_id column links the destination data space to its pertaining filegroup. The destination_id is equivalent to the partition number the destination data space is mapped to. The following query returns each destination data space for our sample partition scheme and the filegroup assigned to it: 


 


select dds.destination_id, fg.name as [filegroup]
  from sys.destination_data_spaces dds join sys.filegroups fg
    on dds.data_space_id = fg.data_space_id
 where dds.partition_scheme_id = 65608 


 


destination_id filegroup
-------------- ---------
1              FG1
2              FG2
3              FG3
4              FG4
5              FG5


 


 


Putting it all together 


 


In order to display all this information within a single query, I have created the following view that will be used extensively within the next sections:


 








CREATE VIEW [dbo].[PartitionRanges] AS
select pf.name as [partition_function],
       ps.name as [partition_scheme],
       1 as [partition_number],
       case when prv.value is null then NULL else '<' end as [relation],
       prv.value as [boundary_value],
       type_name(pp.system_type_id) as [type],
       fg.name as [filegroup],
       case when ps.name is null then NULL else N'IN USE' end as [status]
  from sys.partition_functions pf
  join sys.partition_parameters pp on pp.function_id = pf.function_id
  left join sys.partition_schemes ps on ps.function_id = pf.function_id
  left join sys.destination_data_spaces dds
    on dds.partition_scheme_id = ps.data_space_id and dds.destination_id =
  left join sys.filegroups fg on fg.data_space_id = dds.data_space_id
  left join sys.partition_range_values prv
    on prv.function_id = pf.function_id and prv.parameter_id = 1 and
       prv.boundary_id = 1
 where pf.boundary_value_on_right = 1
 union all
select pf.name as [partition_function],
       ps.name as [partition_scheme],
       prv.boundary_id + cast(pf.boundary_value_on_right as int) as [partition_number],
       case when pf.boundary_value_on_right = 0 then '<=' else '>=' end as [relation],
       prv.value as [boundary_value],
       type_name(pp.system_type_id) as [type],
       fg.name as [filegroup],
       case when ps.name is null then NULL else N'IN USE' end as [status]
  from sys.partition_functions pf
  join sys.partition_range_values prv on
       prv.function_id = pf.function_id and prv.parameter_id = 1
  join sys.partition_parameters pp on pp.function_id = pf.function_id
  left join sys.partition_schemes ps on ps.function_id = pf.function_id
  left join sys.destination_data_spaces dds
    on dds.partition_scheme_id = ps.data_space_id and
       dds.destination_id = prv.boundary_id + cast(pf.boundary_value_on_right as int)
  left join sys.filegroups fg on fg.data_space_id = dds.data_space_id
 union all
select pf.name as [partition_function],
       ps.name as [partition_scheme],
       pf.fanout as [partition_number],
       case when prv.value is null then NULL else '>' end as [relation],
       prv.value as [boundary_value],
       type_name(pp.system_type_id) as [type],
       fg.name as [filegroup],
       case when ps.name is null then NULL else N'IN USE' end as [status]
  from sys.partition_functions pf
  join sys.partition_parameters pp on pp.function_id = pf.function_id
  left join sys.partition_schemes ps on ps.function_id = pf.function_id
  left join sys.destination_data_spaces dds
    on dds.partition_scheme_id = ps.data_space_id and dds.destination_id = pf.fanout 
  left join sys.filegroups fg on fg.data_space_id = dds.data_space_id
  left join sys.partition_range_values prv
    on prv.function_id = pf.function_id and prv.parameter_id = 1 and
       prv.boundary_id = pf.fanout - 1
 where pf.boundary_value_on_right = 0
 union all
select pf.name as [partition_function],
       ps.name as [partition_scheme],
       NULL, NULL, NULL, NULL,
       fg.name as [filegroup],
       case when dds.destination_id = pf.fanout + 1
            then N'NEXT USED' else N'NOT USED'
       end as [status]
  from sys.partition_functions pf
  join sys.partition_schemes ps on ps.function_id = pf.function_id
  join sys.destination_data_spaces dds
    on dds.partition_scheme_id = ps.data_space_id and
       dds.destination_id > pf.fanout
  join sys.filegroups fg on fg.data_space_id = dds.data_space_id 


 

 


To watch this view working, simply execute:



 


SELECT * FROM PartitionRanges WHERE partition_function = N'PFTest' 


 


partition_function partition_scheme partition_number relation boundary_value type filegroup status
------------------ ---------------- ---------------- -------- -------------- ---- --------- ---------
PFTest             PSTest           1                <        10             int  FG1       IN USE
PFTest             PSTest           2                >=       10             int  FG2       IN USE
PFTest             PSTest           3                >=       20             int  FG3       IN USE
PFTest             PSTest           4                >=       30             int  FG4       IN USE
PFTest             PSTest           5                >=       40             int  FG5       IN USE 


 


As one can see, this view returns a row for every partition defined by the partition function together with the range of values covered by this partition and the filegroup the partition is mapped to by the partition scheme. If there is any filegroup assigned in the partition scheme currently not mapped to any partition, it will also be shown together with its status, e.g. NEXT USED.



 


 


Performing Alterations on Partitions 


 


Next we will perform modifications on the partition function and study the changes to the metadata. The two possible modifications of a partition function are the deletion of a partition by merging two adjacent partition ranges into one, and the creation of a new partition by splitting one existent partition into two. This is being accomplished by the two variants of the ALTER PARTITION FUNCTION statement.



 


The first one we are investigating in is the MERGE. The syntax is:


 


ALTER PARTITION FUNCTION <partition_function_name> () MERGE RANGE (<boundary_value>); 


 


The boundary_value has to be one of those already stored in sys.partition_range_values for this partition function.



 


If we perform the following MERGE operation on our sample partition function: 


 


ALTER PARTITION FUNCTION PFTest() MERGE RANGE (20)



 


and then check again the results of our PartionRanges view, we’re getting: 


 


partition_function partition_scheme partition_number relation boundary_value type filegroup status
------------------ ---------------- ---------------- -------- -------------- ---- --------- ---------
PFTest             PSTest           1                <        10             int  FG1       IN USE
PFTest             PSTest           2                >=       10             int  FG2       IN USE
PFTest             PSTest           3                >=       30             int  FG4       IN USE
PFTest             PSTest           4                >=       40             int  FG5       IN USE 


 


As one can see, the former partition 3, which had been mapped to filegroup FG3, has been removed and partition 2 now covers the range between 10 and 30, i.e. the dropped partition has been merged with its predecessor partition. This is true because our partition function was defined as a RANGE RIGHT function. For a RANGE LEFT function the partition would have been merged with its successor partition. Since we have not yet partitioned any table using this partition scheme, no real data have been moved by the merge. But otherwise, all data that would have been placed in partition 3 prior to the merge, would have to be moved from FG3 to FG2 during the merge. Obviously, that is something that should be avoided for big partitions. Therefore, usually we only merge partitions that are already empty. Later we will see how a partition easily can be emptied by switching partitions out.



 


But, how do we drop partition 1? Because, if we specify for the boundary value in the MERGE statement the lowest possible value, which is 10,  partition 2 will be dropped, and not partition 1. The answer is, you can’t drop partition 1 if the partition function defines a RANGE RIGHT partition. Therefore, when having the need to regularly switch out the partition with the oldest data, as in a sliding window scenario, and the partition function is defined as RANGE RIGHT, partition 1 should be an empty partition, not containing any data. Hence, when defining the partition scheme we can specify any filegroup, like PRIMARY, for the empty first partition as it never should receive any data.



 


Because of this, you might argue that RANGE RIGHT partitions are not the best choice for sliding window scenarios. But stay tuned. We will discover a similar problem for LEFT RANGE partitions, just on the opposite end.


 


Now let’s inspect the effects of the other operation on partition functions, the SPLIT. The syntax is:


 


ALTER PARTITION FUNCTION <partition_function_name> () SPLIT RANGE (<boundary_value>); 


 


The boundary_value must be a new one, i.e. not yet existing in sys.partition_range_values for this partition function, and the data type of the new boundary value must be implicitly convertible to the type of the partition function parameter.



 


As the name implies, a SPLIT operation splits an existing partition range into two ranges. All the existing data that fall into the new range have to physically move into the new partition. Again, this should be avoided for big partitions by enforcing that none of the existing data rows falls into the new range, e.g.  by means of a CHECK CONSTRAINT. Also note, that even if no data are being moved, the filegroup mapped to the partition to be split must not be designated as READ_ONLY. 


 


Another thing we have to make sure before performing the SPLIT, is that a filegroup marked as NEXT USED exists for all partition schemes based on the partition function that we’d like to split. The new partition being created by the SPLIT implicitly will be mapped to this NEXT USED filegroup. If such a filegroup does not yet exist for a partition scheme, you can create one using the following T-SQL statement:


 


ALTER PARTITION SCHEME <partition_scheme_name> NEXT USED <filegroup_name>;



 


You can also designate a NEXT USED filegroup in the CREATE PARTITION SCHEME statement by listing more filegroups than partitions exist. In this case, the first filegroup not mapped to a partition will be marked as NEXT USED.  


 


In our previous MERGE example we have removed filegroup FG3 from the partition scheme. So let’s mark this one as our NEXT USED filegroup:


 


ALTER PARTITION SCHEME PSTest NEXT USED FG3 


 


Again, let’s see the effect of this statement by querying our PartitionRanges view. Here’s the result:



 


partition_function partition_scheme partition_number relation boundary_value type filegroup status
------------------ ---------------- ---------------- -------- -------------- ---- --------- ---------
PFTest             PSTest           1                <        10             int  FG1       IN USE
PFTest             PSTest           2                >=       10             int  FG2       IN USE
PFTest             PSTest           3                >=       30             int  FG4       IN USE
PFTest             PSTest           4                >=       40             int  FG5       IN USE
PFTest             PSTest           NULL             NULL     NULL           NULL FG3       NEXT USED 


 


As one can see, the partition scheme now has filegroup FG3 designated as the NEXT USED filegroup.


 


Now we are able to issue a SPLIT operation on our partition function. Here’s the statement we’d like to execute:



 


ALTER PARTITION FUNCTION PFTest() SPLIT RANGE (50) 


 


This statement splits the existing partition range for values >= 40 into the range >= 40 and < 50 and the range >= 50 and thus adds an additional partition at the right end of our partition ranges. Again, let’s verify this using the PartitionRanges view:  



 


partition_function partition_scheme partition_number relation boundary_value type filegroup status
------------------ ---------------- ---------------- -------- -------------- ---- --------- ---------
PFTest             PSTest           1                <        10             int  FG1       IN USE
PFTest             PSTest           2                >=       10             int  FG2       IN USE
PFTest             PSTest           3                >=       30             int  FG4       IN USE
PFTest             PSTest           4                >=       40             int  FG5       IN USE
PFTest             PSTest           5                >=       50             int  FG3       IN USE 


 


Previously, I mentioned that we also might run into troubles removing the outmost partition with a LEFT RANGE partition function. So let’s create another RANGE LEFT partition function together with a partition scheme and see what might be the effect of adding a partition at the right end. Here are the statements to create the function and scheme:



 


CREATE PARTITION FUNCTION PFTestL (int) AS RANGE LEFT FOR VALUES (10, 20, 30)
GO
CREATE PARTITION SCHEME PSTestL AS PARTITION PFTestL TO (FG1, FG2, FG3, FG4, FG5)
GO
SELECT * FROM PartitionRanges WHERE function_name = N'PFTestL' 


 


partition_function partition_scheme partition_number relation boundary_value type filegroup status
------------------ ---------------- ---------------- -------- -------------- ---- --------- ---------
PFTestL            PSTestL          1                <=       10             int  FG1       IN USE
PFTestL            PSTestL          2                <=       20             int  FG2       IN USE
PFTestL            PSTestL          3                <=       30             int  FG3       IN USE
PFTestL            PSTestL          4                >        30             int  FG4       IN USE
PFTestL            PSTestL          NULL             NULL     NULL           NULL FG5       NEXT USED 


 


Note, that we already have created a NEXT USED filegroup in the CREATE PARTITION SCHEME statement by specifying more filegroups than required by the underlying partition function. Now let’s see the effect of splitting the rightmost partition by running the following statement and looking at our PartionRanges view output:



 


ALTER PARTITION FUNCTION PFTestL() SPLIT RANGE (40) 


 


partition_function partition_scheme partition_number relation boundary_value type filegroup status
------------------ ---------------- ---------------- -------- -------------- ---- --------- ---------
PFTestL            PSTestL          1                <=       10             int  FG1       IN USE
PFTestL            PSTestL          2                <=       20             int  FG2       IN USE
PFTestL            PSTestL          3                <=       30             int  FG3       IN USE
PFTestL            PSTestL          4                <=       40             int  FG5       IN USE
PFTestL            PSTestL          5                >        40             int  FG4       IN USE 


 


As you can see, the new partition has been inserted logically between partition 3 and the previous partition 4, and all the rows from previous partition 4 with values > 30 and <= 40 would have been moved into the new partition 4, i.e. from filegroup FG4 to filegroup FG5. Thus, with a LEFT RANGE partition function you can’t add a new partition at the right end. The answer to this dilemma again is the use of an empty partition at the high end of the partition range when using a LEFT RANGE partition function.



 


Therefore, the general advice, when defining the partition scheme for a sliding window scenario, is to keep the first partition, for RIGHT RANGE partition functions, or respectively the last partition, for LEFT RANGE partition functions, empty and to map this partition in the partition scheme to a special filegroup, like PRIMARY, in order to make your intention, to keep this partition empty, more clear. Additionally, a CHECK CONSTRAINT can be defined to enforce this requirement. 


 


Hence, in our previous examples the partition schemes should have been defined the following way, if they are intended to be used in a sliding window scenario:



 


CREATE PARTITION SCHEME PSTest AS PARTITION PFTest TO ([PRIMARY], FG1, FG2, FG3, FG4)


CREATE PARTITION SCHEME PSTestL AS PARTITION PFTestL TO (FG1, FG2, FG3, [PRIMARY], FG4) 


 


Note, that if we specify a NEXT USED filegroup in the partition scheme creation statement for a LEFT RANGE partition function, this filegroup mapping has to appear after the mapping for the empty partition. Therefore, in the second example, the NEXT USED filegroup FG4 has to come after PRIMARY.



 


 


Partitions and Allocation Units 


 


So far we have covered all the basics about partitioning. We have seen how to define the partition ranges using a partition function and how to map a filegroup to each partition using a partition scheme. Next, I will talk about the effects on the metadata when partitioning a table or an index.



 


First, let’s create a table and partition this table based on the partition scheme PSTest defined earlier. We want to make the table definition fairly simple. The table just consists of a unique key column and the column we want the table to be partitioned on. Here’s the table definition: 


 


CREATE TABLE TabTest (pk uniqueidentifier not null, pc int not null) ON PSTest(pc)



 


Now, let’s explore the changes introduced into the metadata by creating this partitioned table. Every table/index has a data space assigned. As we have seen, a data space either can be a filegroup, or a partition scheme. The sys.indexes row for the table has a column data_space_id which you can use to join sys.indexes with the system catalog view sys.data_spaces. Here you can query the type column to see if the data space refers to a filegroup, if the value is ‘FG’, or to a partition scheme, if the value is ‘PS’. But if you know in advance that you are dealing with a partition scheme, you’d better join with sys.partition_schemes, because this derived view returns only the subset of sys.data_spaces where type is ‘PS’, augmented with the ID of the partition function the scheme is based on. 


 


In order to determine the partitioning column, one has to look in the sys.index_columns view for rows pertaining to this sys.indexes entry and having  a non-zero partition_ordinal value.



 


Given that, the following query returns all partitioned tables and indexes together with the partition scheme and the partitioning column: 


 


select object_name(i.object_id) as [table_name],
       i.index_id, ps.name as [partition_scheme],
       col_name(i.object_id, ic.column_id) as [partitioning_column]
  from sys.indexes i
  join sys.partition_schemes ps
    on ps.data_space_id = i.data_space_id
  join sys.index_columns ic
    on ic.object_id = i.object_id and ic.index_id = i.index_id and
       ic.partition_ordinal > 0 


 


table_name index_id    partition_scheme partitioning_column
---------- ----------- ---------------- -------------------
TabTest    0           PSTest           pc  


 


Next, we’d like to enter data into the partitioned table. But before doing so, we have to create database files within each of the filegroups defined for the partition scheme. Just add a secondary data file to each of the 5 filegroups we have created previously. For our purpose it’s not important how big those files are and which drive they are placed on. But in practice it deserves some careful planning to specify a file size big enough to avoid auto-grow and small enough to avoid wasting precious disk space. Therefore you need some good estimates on the expected maximum partition size. Also, for optimal performance, it would be beneficial to place each of the files on a separate physical disk.



 


After having created the disk files, we should enter a few rows in our sample table such that each partition has at least 1 row. But leave the first partition empty, as later we want to use this table for demonstrating a sliding window process (as we have seen in the previous section, for a RANGE RIGHT partition function, the first partition should be left empty). 


 


After having filled the partitions with data, we’d like to  see how the data are being distributed over the partitions. Again, there are several system catalog views available to make this information visible. First there is the view sys.partitions, providing an entry for every partition defined in the database (note, that even non-partitioned tables consist of one partition). This entry contains the partition_number, the number of rows within this partition and the hobt_id column, which you can join with the container_id column of the sys.allocation_units view. Each partition can consist of up to three allocation units, for in-row data, LOB data and row-overflow data. From the sys.allocation_units view you can gain information about the number of pages allocated for each allocation unit. If you also want to retrieve information about the actual pages allocated, you’d better join with the undocumented sys.system_internals_allocation_units view instead of sys.allocation_units, because this view additionally returns the first_page, root_page and first_iam_page values for traversing the page chain or index tree.



 


In order to get this information for our sample table, you can issue the following select statement and get these results returned:  


 


select object_name(p.object_id) as [table_name], p.index_id,
       p.partition_number, p.rows, au.total_pages as [pages],
       au.type_desc as [allocation_type], au.first_page, au.root_page
  from sys.partitions p join sys.system_internals_allocation_units au
    on au.container_id = p.hobt_id
 where p.object_id = object_id(N'TabTest') 


 


table_name index_id partition_number rows   pages  allocation_type first_page     root_page
---------- -------- ---------------- ------ ------ --------------- -------------- --------------
TabTest    0        1                0      0      IN_ROW_DATA     0x000000000000 0x000000000000
TabTest    0        2                10     2      IN_ROW_DATA     0x080000000400 0x000000000000
TabTest    0        3                10     2      IN_ROW_DATA     0x080000000500 0x000000000000
TabTest    0        4                10     2      IN_ROW_DATA     0x080000000600 0x000000000000
TabTest    0        5                10     2      IN_ROW_DATA     0x080000000700 0x000000000000 


 


As you can see, our table has 5 partitions. Partition 1 has no data allocated and all other partitions are having 1 allocation unit assigned for in-row data pages, consisting of 2 pages (1 data and 1 IAM page). Since our table still is a heap, the root page pointer is NULL. 


 


 


Partitioned Indexes 


 


In the next section we will have a closer look at partitioned indexes. Besides partitioning tables, you may also partition indexes. Partitioning an index transforms the single B-tree of a non-partitioned index into several smaller B-trees, one for each partition defined by the partition function, where each B-tree only consists of index rows where the partitioning column falls into the partition’s range. Therefore, the partitioning column always has to be part of a partitioned index.


 


Splitting a partitioned index into several independent B-trees obviously also has consequences for index seeks. As long as the partitioning column is one of the search predicates, the optimizer knows which of the trees to choose for the index seek. But if the filter criteria do not include the partitioning column, the optimizer would not know where to start the index seek or scan when using a partitioned index. E.g. if we create the following composite partitioned clustered index “CREATE CLUSTERED INDEX IX_Tab ON Tab (a, b)”, where table Tab is being partitioned on column b, the query “SELECT * FROM Tab WHERE a = @x” cannot be performed as a simple clustered index seek, like in the non-partitioned case, but each of the index trees would have to be traversed separately to find the qualifying records. As this example shows, for partitioned tables the rule, that the table rows are sorted by the clustered index keys, does not necessarily hold true for the whole table, but only for single partitions. Therefore, it’s always good practice to include the partitioning column in your search predicates, even if it does not affect the result set. 


  


If an index is partitioned on the same column as its base table and the partition scheme used is based on the same partition function used for partitioning the table, both table and index are said to be aligned. If you do not specify otherwise, an index is automatically aligned with its table. A clustered index always has to be aligned with its base table. In fact, if you create the clustered index differently, the underlying table will adopt the allocation scheme of the clustered index. Thus, to enforce a new allocation scheme for a table, just rebuild the clustered index with the new scheme. Therefore, the easiest way to partition an existing table is to create a partitioned clustered index on the table, although, it might not be the most efficient way doing so.



 


Having all indexes aligned with its base table is essential for switching table partitions in and out, as we’ll see later. But it’s also legal to partition an index on a different partitioning column or using a different partition function with a different number of partitions or different boundary values.  You may also create an index, that is not partitioned, on a partitioned table, or partition an index although the base table isn’t partitioned. If you want to apply a different allocation scheme on the index, just use the ON clause in the index creation statement just like for the table creation statement. 


 


As we have seen, a partitioned index always has to include it’s partitioning column. There are a few rules to be aware of concerning the inclusion of the partitioning column:


 





  • For every unique index (also implies primary key and unique constraints), the partitioning column must be part of its index keys.


  • For a non-unique clustered index, that does not contain the partitioning column, that column implicitly is being added to the index keys.


  • For a non-unique nonclustered index, that does not include the partitioning column, the partitioning column is added at the leaf level as a non-key column.


 


The major consequence is, that you cannot partition an unique index on a column that is not one of the index keys. Therefore, if you need an unique index or constraint to enforce the uniqueness of a set of columns, that does not include the partitioning column, it will be necessary to make the unique index a non-partitioned one, or find other ways to check for uniqueness.  


 


Now let’s create two indexes on our sample table, a clustered one on the partitioning column and a primary key constraint on column pk, which is not partitioned:



 


CREATE CLUSTERED INDEX CIX_TabTest ON TabTest (pc)
GO
ALTER TABLE TabTest ADD CONSTRAINT PK_TabTest PRIMARY KEY (pk) ON FG1 


 


And with query from the previous chapter we can inspect again the allocations:



 


table_name index_id partition_number rows   pages  allocation_type first_page     root_page
---------- -------- ---------------- ------ ------ --------------- -------------- --------------
TabTest    1        1                0      0      IN_ROW_DATA     0x000000000000 0x000000000000
TabTest    1        2                10     2      IN_ROW_DATA     0x0B0000000400 0x0B0000000400
TabTest    1        3                10     2      IN_ROW_DATA     0x0B0000000500 0x0B0000000500
TabTest    1        4                10     2      IN_ROW_DATA     0x0B0000000600 0x0B0000000600
TabTest    1        5                10     2      IN_ROW_DATA     0x0B0000000700 0x0B0000000700
TabTest    2        1                40     2      IN_ROW_DATA     0x100000000300 0x100000000300 


 


As you can see the clustered index is partitioned the same way as its base table and is spread over the partitions 2 through 5, with a separate index root page on every partition. Thus, the one clustered index is divided into 4 separate B-trees. Whereas the index for the primary key (index_id = 2) is not partitioned. But at the leaf level this index still has the partitioning column included as this is the clustering key, and therefore, when performing the bookmark lookup to get to the data page, SQL Server immediately knows in which partition to look for.



 


 


Sliding Window Scenario 


 


One of the major benefits of partitioning is the ability to efficiently move chunks of data in and out of a table. If the chunks happen to be multiples of a partition, no data physically have to be moved, and only metadata for allocations have to be adjusted. This fact is being exploited in sliding window scenarios where on a regular basis (e.g. quarterly, monthly, weekly, or even daily) aged data are being moved out of a table, e.g. for archiving,  and new data swapped in, that up to this point have been accumulated in a staging table. By collecting the data in a staging table instead of inserting directly into the target table, one can take advantage of the fact, that bulk copying data into an empty staging table with no indexes in place is much more efficient than inserting data directly into a big table with all indexes in place. (See “SQL Server 2000 Incremental Bulk Load Case Study” in http://technet.microsoft.com/library/cc917716.aspx, which is equally applicable to SQL Server 2005 and later).



 


A typical requirement for a sliding window scenario might look like this:


 




  • The last 12 month of business transaction data should be stored in a partitioned fact table of a relational data warehouse.


  • During each month, the current transaction data from various sources, e.g. the OLTP databases at each business site, are being collected within a staging table of the data warehouse.


  • On the first of each month, all data older than 12 month should be archived  and the data of  the previous month should be transferred from the staging table into the fact table. 

 


To put these requirements into reality, we first have to build the partitioned fact table. Either we build that table up from scratch by adding one partition each month until we have reached the final size of 12 month worth of data, or we have the data already available somewhere else, let’s say in a non-partitioned table, and simply have to transition this table into a partitioned table (e.g. by creating a new partitioned clustered index on the table). Either way, we first have to decide on how to distribute the 12 partitions over our storage media. Ideally, we devote one filegroup for every partition. Next, the 12 filegroups have to be spread over the available disk drives. Let’s say 4 drives are available, so we assign each of the filegroups in a round robin fashion to one of the drives by creating a secondary data file for the filegroup on this drive. The files should be sized big enough to hold one month of data plus some room for future growth. Creating the partition function is straight forward. Just define the 12 boundary values, being either the first day or the last day of each month, depending on whether we plan to create a RIGHT or LEFT range function.



 


Now having the previous 12 month data within the partitioned table, let’s start the preparation work for the next month. We need another filegroup with a data file assigned where we create the non-partitioned staging table. Filling this staging table can be a continuous process during the month, inserting each days transactions at the end of the day, or the staging table is being filled with data at the end of the month (for alternative load scenarios also see the SQL Server Best Practices article “Loading Bulk Data into a Partitioned Table” in http://technet.microsoft.com/library/cc966380.aspx). Nevertheless, the staging table needs to have the same structure as the partitioned table. Whether you add the same indexes to the staging table already at the beginning or after filling the table with data, depends on the usage of the staging table during the month. If you just insert new data, for optimal bulk insert performance, you should refrain from creating the indexes before all data have been loaded into the table. But, if the staging table already participates in queries, some indexes might be essential. Anyway, before merging the staging table with the partitioned table, the staging table must have the same indexes and foreign key constraints as the partitioned table and a check constraint must be defined on the partitioning column, that restricts the data in the staging table to values conformant with the boundaries of new partition that will absorb the data. 


 


Now we have reached the end of the month and during the coming night the staging table’s data need to be moved into the partitioned table, so that this month’s data is available at the beginning of next business day for starting various data analysis jobs. We start with preparing the staging table. We create all the necessary indexes and foreign key constraints and define a check constraint for the partitioning column. Afterwards, we might try to shrink the file(s) associated with the staging table’s filegroup and run a final defragmentation on the staging table. Finally, it also might be a good idea checking the consistency of the table by running a DBCC CHECKTABLE. Now the time has come to move the data. 


 


The essential operation to perform this data movement is the partition switch operation. In T-SQL this is being accomplished by the ALTER TABLE…SWITCH statement. The syntax is:



 


ALTER TABLE source_table_name SWITCH [PARTITION partition_out] TO target_table_name [PARTITION partition_in]



 


This statement supports three different scenarios:


 




  • You can move a partition of a partitioned table out into a non-partitioned table


  • You can move a non-partitioned table into a partition of a partitioned table


  • You can move a partition of one partitioned table into a partition of another partitioned table 

 


All participating tables need to be structurally equivalent, i.e. need to have the same columns, same clustered index, etc. The target table/partition must be empty, and source and target partition must reside in the same filegroup. If the switch target is a partitioned table, all indexes on this table have to be aligned with the table. Any non-aligned index has to be dropped or disabled prior to the switch operation. For an exact list of requirements see the topic ‘Transferring Data Efficiently by Using Partition Switching’ in SQL Server Books Online.



 


Here’s the list of steps you would have to perform moving the data to/from the partitioned table:


 




  1. Determine the filegroup where the partition with the oldest data resides (this is partition 1 for a LEFT RANGE partition or partition 2 for a RIGHT RANGE partition).


  2. If this filegroup is marked as READ_ONLY, change it to READ_WRITE.


  3. On this filegroup create a non-partitioned archive table, which is an exact copy of the partitioned table.


  4. If the partitioned table is clustered, create the same clustered index on the archive table.


  5. Using the ALTER TABLE…SWITCH command, switch the partition identified in step 1 into the archive table created in step 3.


  6. Using the ALTER PARTITION FUNCTION command, merge the partition with the lowest partition boundary value, thus dropping the partition switched out in step 5.


  7. If not done yet, for the partition scheme mark the filegroup, where the staging table resides, as NEXT USED using the ALTER PARTITION SCHEME command. (Note, that all partition schemes based on the same partition function need to have a NEXT USED filegroup.)


  8. If the filegroup assigned to the last partition of a RIGHT range partition is marked as READ_ONLY, change it to READ_WRITE.


  9. Add a new partition using the ALTER PARTITION FUNCTION…SPLIT command with the new boundary value.


  10. If CHECK constraints exist on the partitioned table that restrict the possible values of the partitioning column, adjust them accordingly (actually, only the upper bounds need to be adjusted).


  11. Determine the partition number of the new partition added in step 9 (this is the last partition for a RIGHT RANGE partition, or its predecessor for a LEFT RANGE partition).


  12. Switch the staging table into the partition identified in step 11 using the ALTER TABLE…SWITCH command.


  13. Now you can drop the staging table and you might seal the filegroup by designating it as READ_ONLY.


  14. Create a new staging table for the upcoming month, possibly in the filegroup identified in step 1.


  15. Update statistics on the partitioned table.


  16. Save and drop the archive table created in step 3 (this step does not have to be performed immediately and can be deferred to a later point in time).

 


Although this might sound complex, most of it can be scripted. And best of it, these steps (except the last two) can be executed within seconds as no data really are moved and only metadata have to be changed. See http://sqlpartitionmgmt.codeplex.com for a program that might be helpful for performing these steps.


 


  


Enhancements in SQL Server 2008 with Regard to Table and Index Partitioning 


 


SQL Server 2008 provides several enhancements that affect partitioned tables and indexes. Here I’ll discuss some of those enhanced features. See also http://technet.microsoft.com/magazine/2008.04.dwperformance.aspx.




 


 


Enhanced parallel query execution


 


In SQL Server 2005, if a query covers more than one partition, only a single thread is being used for executing the query on each partition in parallel. This limits the parallelism of a query in case the number of partitions touched by the query is less than the number of processors available to SQL Server. SQL Server 2008 removes this restriction and distributes the available threads for query processing evenly on all partitions participating in the query. 


 


For a detailed discussion of this topic see “Parallel Query Execution Strategy for Partitioned Objects” under “Query Processing Enhancements on Partitioned Tables and Indexes” in SQL Server 2008 Books Online. 


 


 


Partition-aligned indexed views 


 


In SQL Server 2005 you are able to create indexed views on a partitioned table. But these indexed views cannot be partitioned. Hence, before switching partitions you would have to drop the indexed views before the switch and recreate them afterwards. This obviously leaves indexed views on partitioned tables useless in case partition switching is a requirement.



 


Now, SQL Server 2008 allows to partition indexed views as well. Principally, you define the indexed view the same way you do it in SQL Server 2005. There are a few restrictions concerning the inclusion of the partitioning column in the view definition (see “Partition Switching When Indexed Views Are Defined” in SQL Server 2008 Books Online). Then, when materializing the view by creating the unique clustered index on the view, simply use the ON clause to specify the partition scheme and partitioning column. The same rules for partition alignment apply as for partitioned indexes.


 


  


Data Compression



 


SQL Server 2008 allows for compressing data on a row or page level. The good news is, that this feature not only can be set on a whole table or index, but for each partition individually. Thereby, you can enable the more costly page level compression on older partitions, that are not altered very often, and leave the actual partition, that gets all the new data, uncompressed or row compressed. You can change the compression state of a partition using the ALTER TABLE .. REBUILD PARTITION statement. 


 


For tables with mixed compression settings for different partitions, you should be aware of http://support.microsoft.com/kb/957810.


 


 



Lock Escalation 


 


Instead of escalating the existing fine-grain locks on a table to a single table lock in case the lock escalation threshold has been exceeded, SQL Server 2008 allows to escalate the locks on a single partition into a lock on the heap or B-tree level, thus allowing for greater concurrency than a full table lock.



 


This feature can be set for each table individually by specifying SET (LOCK_ESCALATION = AUTO) with the ALTER TABLE statement.  


 


 


Support for partitioning in SQL Server Management Studio



 


The support for partitioning, which Management Studio in SQL Server 2005 provides, is pretty sparse. For instance, you can list and delete the partition schemes and functions defined for a database. But that’s it. Management Studio in SQL Server 2008 extends this functionality by providing several Wizards that may help in creating and maintaining partitioned tables. Just right-click a table in the Object Explorer and select Storage in the context menu. Depending on whether the table is already partitioned or not, you can activate the Create Partition Wizard or the Manage Partition Wizard. These wizards allow you to graphically design partition functions and schemes, create a staging table, or to manage partition switching in a sliding window scenario. Although the wizards are quite neat, you always should be alert if the solution to a complex problem is just a few mouse clicks away. The solutions chosen by the wizards might not always be the most practical ones, e.g. if you ask the wizard  to partition a table, the solution proposed by the wizard, although correct, certainly is not the one you’d apply to a existing huge table coming with a bunch of indexes. You’d better script the solution first for further examination, instead of immediately running it. But the wizards come in handy during the design phase of a database or for becoming acquainted with partitioning. 


 


 


Partitioning of tables with FILESTREAM data



 


SQL Server 2008 incorporates a new storage scheme for large binary data. If you define a column of type VARBINARY(MAX) with the FILESTREAM storage attribute, the column data will be stored in the servers file system instead of database storage. The locations within the file system where these data are placed, are defined by special FILESTREAM filegroups. Within the CREATE TABLE statement you have to specify in a FILESTREAM_ON clause the filegroup to be used for the FILESTRAEM data in the table. But, if the table is partitioned, also the FILESTREAM storage is going to be partitioned. Therefore, you have to define a separate partition scheme, based on the same partition function and same partitioning column as the table, and mapping each partition to a FILESTREAM filegroup. And this partition scheme has to be used in the FILESTREAM_ON clause instead.  


 


But there is one more requirement for FILESTREAM columns that lessens the usefulness of partitioning tables with FILESTREAM data. A table with FILESTREAM data also needs to have a column with the ROWGUIDCOL property and an UNIQUE or PRIMARY KEY single-column constraint defined on this column. But as you recall, unique partitioned indexes also have to add the partitioning column to their index keys. Hence, the index for this UNIQUE constraint cannot be partitioned. But as you have learned, an index automatically inherits the storage scheme of its base table and this also applies to UNIQUE constraints defined with the table. Thus, in order to make an UNIQUE constraint within a partitioned table a non-partitioned one, you explicitly have to specify a filegroup to store this index on within the CONSTRAINT definition.



 


But the more important catch here is, that you cannot switch partitions in or out while there is a non-aligned index defined on the table. And you cannot simply drop the UNIQUE constraint or disable the pertaining index while performing the switch, as long as there are FILESTREAM data in the table.


 

Comments (6)

  1. Jimmy George says:

    Great Article ! Explains inside out of table partitioning.

    I have a VLDB which is used to load data continuously. Thnking of partitioning it on a monthly basis – which is about 30 million rows. I need to keep only 12 months of data on the table before archiving into a slower disk.  I plan to have 12 partitions each in its own filegroup by a datetime column.

    I read about a "Rotating Window" mechanism where instead of   a sliding  window that moves forward in time by creating new partitions, you can use the partitions in a ring structure, so that new data rotates around the partition scheme one partition at a time. The advantage is that the partition number could be easily identified with the current month.

    Do you have any suggestions/comments to make on this continuous loading, rotating window mechanism ?

    Regards

    Jimmy

  2. Jimmy George says:

    Great Article ! Explains inside out of table partitioning.

    I have a VLDB which is used to load data continuously. Thnking of partitioning it on a monthly basis – which is about 30 million rows. I need to keep only 12 months of data on the table before archiving into a slower disk.  I plan to have 12 partitions each in its own filegroup by a datetime column.

    I read about a "Rotating Window" mechanism where instead of   a sliding  window that moves forward in time by creating new partitions, you can use the partitions in a ring structure, so that new data rotates around the partition scheme one partition at a time. The advantage is that the partition number could be easily identified with the current month.

    Do you have any suggestions/comments to make on this continuous loading, rotating window mechanism ?

    Regards

    Jimmy

  3. Mark Whitfield says:

    Hi, What is the maximum size of a partitioned SQL table in Microsoft SQL Server (and a non partitioned for that matter)…. thanks in advance for any input here… Regards Mark.

  4. Jake says:

    How can you programatically define the next used partition?

  5. Rose K says:

    great article, this is also a great guide about using a SQL Server distributed partitioned view to allow transparent querying of multiple Azure SQL databasesp sqlturbo.com/tutorial-how-to-query-multiple-azure-sql-databases-transparently-from-one-on-premise-sql-server

  6. Stuart says:

    Brilliant article. Was particularly interested in what happens with MERGE and this explained it better than any other article I have read about it.

    The view is extremely helpful as well.

    Thanks

Skip to main content