DATALENGTH optimizations for LOB data types...

DATALENGTH function in TSQL can be used to find the actual length in bytes of the data in a specific value. The value can be any of the data types. It is often used to determine length of LOB data type columns (text, ntext, image, varchar(max), nvarchar(max) and varbinary(max)) in a table. One of the question that comes up regarding use of DATALENGTH on LOB columns is whether it requires reading the entire value. For example, if the text or varchar(max) value is 16 MB in size does SQL Server need to read the entire value to determine the size. In this post, I will describe why SQL Server doesn't need to read the entire value and show you some mechanism to ascertain that easily in SQL Server 2005.

The default storage option for text/ntext/image data type in a table is to store it out of row. So the text/ntext/image value will take up 16-bytes per data row. And the value is stored in separate set of pages. The 16-bytes pointer points to the root node of the tree built of pointers that map the pages into the value fragments. The root structure also contains the size of the text/ntext/image value. Lastly by using the system stored procedure "sp_tableoption" with 'text in row' option, part of the text/ntext/image value can be stored in row for quicker access.

So the DATALENGTH function needs to access only the root structure for BLOB values stored out of row to determine the size and the in-row data if the 'text in row' option is set. This means that if your BLOB data is very large in size there is one additional IO incurred to read the root structure and the entire value need not be accessed to compute the size. This can be demonstrated by using the SET STATISTICS IO command in SQL Server 2005 which exposes new counters for LOB reads (logical, physical and read-ahead). The script below shows the amount of IO performed when only DATALENGTH function is used on a BLOB column vs reading the entire column value.

use tempdb
go
create table blob_t( i int not null primary key, t text not null )
go

--- Insert a row in the table:
insert into blob_t values( 1, replicate('x', 8000 ) )
go

-- Set 'text in row' option so we can create a value that is stored in-row:
exec sp_tableoption blob_t, 'text in row', 7000
go

-- Insert a row with text value that will be stored in-row:
insert into blob_t values( 2, replicate('x', 7000 ) )
go
set statistics io on
go

-- Find length of data stored in row #1
select datalength(t) from blob_t where i = 1

/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 1, lob physical reads 0, lob read-ahead reads 0.
*/

-- Find length of data stored in row #2
select datalength(t) from blob_t where i = 2

/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

*/

-- Read row #1
select * from blob_t where i = 1

/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 5, lob physical reads 0, lob read-ahead reads 0.

*/

-- Read row #2
select * from blob_t where i = 2

/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

*/
go
set statistics io off
go
drop table blob_t
go

As you can observe from the statistics io output, when the DATALENGTH function is used to determine the length of value in the row (i =1) that contains the text value out of row there is one extra read to locate the root structure. This is shown in the log logical reads counter output. Similarly, in case of the row (i=2) that contains the text value in row there is no extra IO to read any of the LOB pages and the size can be determined by reading the in-row data.

The same behavior can be observed for the new LOB data types in SQL Server 2005. For the new data types, the defaults are however different in that the LOB valued is stored in row by default (SQL Server determines the size of the in-row data automatically). The setting can be modified by using "sp_tableoption" with the option "large value types out of row". The script that demonstrates the behavior of DATALENGTH for varchar(max) data type column is shown below:

use tempdb
go
create table blob_t( i int not null primary key, t varchar(max) not null )
go

-- Add new rows. SQL Server will automatically determine size of in-row data
insert into blob_t values( 1, replicate(cast('x' as varchar(max)), 8000 ) )
insert into blob_t values( 2, replicate(cast('x' as varchar(max)), 8000*4 ) )
go

-- Set option large values to be stored out of row:
exec sp_tableoption blob_t, 'large value types out of row', 1
go

-- Insert a new row that will be stored out of row:
insert into blob_t values( 3, replicate(cast('x' as varchar(max)), 8000*8 ) )
go
set statistics io on
go
select datalength(t) from blob_t where i = 1

/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
*/

select datalength(t) from blob_t where i = 2

/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
*/

select datalength(t) from blob_t where i = 3

/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 1, lob physical reads 0, lob read-ahead reads 0.

*/

select * from blob_t where i = 1
/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

*/

select * from blob_t where i = 2
/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 12, lob physical reads 0, lob read-ahead reads 0.

*/

select * from blob_t where i = 3
/*

Table 'blob_t'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 47, lob physical reads 0, lob read-ahead reads 0.

*/

go
set statistics io off
go
drop table blob_t
go

I hope you found the information useful and also learnt some new techniques on using SET STATISTICS IO output in SQL Server 2005.

To summarize, the use of DATALENGTH function on LOB data type (text, ntext, image, varchar(max), nvarchar(max) and varbinary(max)) column values incurs one additional IO to access the root structure if the value is stored out of row. And the root structure contains the size of the entire value so the value need not be accessed or read in it's entirety to determine the size in case of out of row storage.