Writing New Hash Functions for SQL Server

Author: Thomas Kejser

Contributors/Reviewers: Alexei Khalyako, Jerome Halmans, Fabricio Voznika, Sedat Yogurtcuoglu, Mike Ruthruff, Tobias Ternstrom and Steve Howard

In this blog, I will explore ideas for extending SQL Server with new, fast hash functions. As will be shown, the high-speed, built-in functions CHECKSUM and BINARY_CHECKSUM are not always optimal hash functions when you require the function to spread data evenly over an integer space. I will show that it can be useful to extend SQL Server with a new CLR User Defined Function (UDF).

In distributed workloads, hashing a column in the data and evenly distributing the hash values into buckets is often a good way to scale work evenly over multiple scale units (either machines or NUMA nodes). In such an architecture, each scale unit will be responsible for handling a subset of the total buckets. Achieving a good distribution in hash buckets will depend on the skew of the input data, but also on how well the hash function behaves on the input.

In this blog, I will use a simple input table to test the properties of different hash functions. For my example, let us imagine that the input is 10M customer keys that have been generated either by a SQL Server 2012 SEQUENCE or an IDENTITY(1,1) column, with no holes in the sequence. This table can be created like this:

CREATE TABLE CustKey (SK INT NOT NULL)

INSERT INTO CustKey WITH (TABLOCK) (SK)
SELECT n FROM dbo.fn_nums(10000000)

(The fn_nums function can be found here.)

In this test, the input has no skew, and I will assume that we want to split the input into 65536 approximately equal-sized hash buckets, spread over the 16-bit integer space, as the desired output of the hash function. This means that ideally, each bucket will contain about 152 values (10M / 65536).

In other words: we are trying to achieve a mapping from a sequence of values in the 32-bit integer space to an evenly spread out, 16-bit integer space, like this:

[Image: mapping a dense sequence of 32-bit integer keys onto evenly spread 16-bit hash buckets]

For convenience, I will use 64K = 65536 and 32K = 32768 below.

Using CHECKSUM and BINARY_CHECKSUM

The built-in functions CHECKSUM and BINARY_CHECKSUM both operate on any input and return a 32-bit, signed integer (SQL type: int). By integer dividing the output by 64K we can split the integer space into the 64K buckets we want: a signed 32-bit value divided by 65536 yields a bucket number between -32K and 32K-1, i.e. 64K buckets.

With the following query, we can get a rough idea of the distribution in the buckets:

WITH Input (bucket, bucket_count) AS
(
    SELECT BINARY_CHECKSUM(SK) / 65536 AS bucket
         , COUNT(*) AS bucket_count
    FROM CustKey
    GROUP BY BINARY_CHECKSUM(SK) / 65536
)
SELECT MIN(bucket_count) AS min_count
     , MAX(bucket_count) AS max_count
     , COUNT(*) AS filled_buckets
     , SUM(bucket_count) AS total_rows
FROM Input

Running this for both BINARY_CHECKSUM and CHECKSUM, I get:

[Image: min/max bucket counts, filled buckets and row totals for BINARY_CHECKSUM and CHECKSUM]

This does not look promising: we would expect the min/max to linger around 152 and the number of filled buckets to be 64K. Books Online does state:

[…] we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

So, it seems we should use HASHBYTES.

Using HASHBYTES

HASHBYTES with MD5 returns a 128-bit value. We can turn this into 64K buckets by taking modulo 32K on the output; because the value is treated as a signed integer, the remainder spans roughly 64K values, from -32K+1 to 32K-1. We can now write:

WITH Input (bucket, bucket_count) AS
(
    SELECT HASHBYTES('MD5', CAST(SK AS VARBINARY)) % 32768 AS bucket
         , COUNT(*) AS bucket_count
    FROM CustKey
    GROUP BY HASHBYTES('MD5', CAST(SK AS VARBINARY)) % 32768
)
SELECT MIN(bucket_count) AS min_count
     , MAX(bucket_count) AS max_count
     , COUNT(*) AS filled_buckets
     , SUM(bucket_count) AS total_rows
FROM Input

As I run this, I notice that it takes very long to run. But the output is quite good compared to BINARY_CHECKSUM and CHECKSUM:

[Image: min/max bucket counts, filled buckets and row totals for HASHBYTES with MD5]

To benchmark the speed of the hash function, we can use this little script, which minimizes the transmission time to the client:

SET STATISTICS TIME ON

SELECT MAX(bucket)
FROM (
SELECT HASHBYTES('MD5', CAST(SK AS VARBINARY)) % 32768 as bucket
FROM CustKey
) AS input
OPTION (MAXDOP 1)

On my machine, the CPU time is around 14 seconds, a little over a microsecond per hash calculation. Note that MD5 has cryptographic properties that we may not need. Are we paying too high an overhead for these? I tested with the other hash functions supported by HASHBYTES, and all of them take a lot of CPU per hash calculation.

Building a SQLCLR function on Binary data

At this point, it seems natural to ask if there is a way to write our own hash function. Fortunately, SQL Server provides a great way to extend the engine with new functionality: SQLCLR.

In order to explore this option, I wanted to quantify the cost of marshaling data from SQL Server to the CLR. This simple function was my initial test:

[Image: C# source of the initial SQLCLR test function, taking a SqlBytes argument]
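A minimal sketch of such a marshaling test, assuming a function named NoHash that accepts SqlBytes and simply returns a constant (so any measured cost is pure SQL-to-CLR marshaling overhead):

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // Accept the bytes, ignore them, return a constant:
    // everything we measure is then marshaling overhead.
    [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
    public static SqlInt32 NoHash(SqlBytes input)
    {
        return new SqlInt32(0);
    }
}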

Running the performance test above, I saw a shocking 56 seconds, just to return an integer!

At times like these, it is very useful to run xperf. I ran the following command line before starting the test again:

xperf -on latency -stackwalk profile

After about one minute, when the test was done, I ran:

xperf -d C:\dumps\nohash_trace.etl

This allows me to quantify all the CPU time spent in SQL Server, and the result was very interesting. On my 4-core machine, the single-threaded execution (25% of the CPU) breaks down like this:

[Image: xperf CPU usage breakdown across processes during the test]

With a properly configured symbol path, we can actually zoom into sqlservr.exe and see what is going on (you can do this too, as it does not require source code access; see this link on how to get started).

[Image: xperf stack breakdown inside sqlservr.exe]

UDFInvokeExternal sounds like the call that wraps my CLR function. But wait a minute: what are those UrtReadLob, UrtGetLobLength and ExecUdfLobAccess calls doing in there? They add up to more than the time taken by the highest CPU consumer (GetNextRowValuesInternal).

Looking up SqlBytes on MSDN, we find:

“Represents a mutable reference type that wraps either a Buffer or a Stream.”

This doesn't sound like the type I wanted: buffers and streams; perhaps this is where the LOB-ish calls come from? My next attempt replaces SqlBytes with SqlBinary:

[Image: C# source of the test function rewritten to take a SqlBinary argument]
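The same sketch with SqlBinary in place of SqlBytes (again, the name NoHashBinary is illustrative):

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // SqlBinary wraps a plain byte[] rather than a buffer/stream,
    // which should avoid the LOB code paths seen in the xperf trace.
    [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
    public static SqlInt32 NoHashBinary(SqlBinary input)
    {
        return new SqlInt32(0);
    }
}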

Runtime with the new data type: 7 seconds. A factor of 8 improvement! This is also faster than the HASHBYTES run. Now, if I could just come up with a hash function that has a good spread, but which is cheaper than MD5…

Testing CRC32 and CRC16

Fortunately, the field of computer science is full of work on hash functions, and you can pick one that fits your needs. Two easy-to-implement functions are CRC32 and CRC16. You can find reference implementations on Wikipedia.

CRC32 returns a 32-bit integer, and we can turn it into the 64K buckets by dividing by 64K. CRC16, on the other hand, returns exactly what we want: a 16-bit value.
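As a minimal sketch of one possible implementation, here is a table-driven CRC16 exposed as a SQLCLR function, assuming the reflected CRC-16/IBM polynomial (0xA001); a CRC32 version has the same structure, with a 32-bit table built from the polynomial 0xEDB88320:

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // Precomputed lookup table for CRC-16 (reflected polynomial 0xA001).
    private static readonly ushort[] Crc16Table = BuildCrc16Table();

    private static ushort[] BuildCrc16Table()
    {
        var table = new ushort[256];
        for (int i = 0; i < 256; i++)
        {
            ushort crc = (ushort)i;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1) != 0 ? (ushort)((crc >> 1) ^ 0xA001) : (ushort)(crc >> 1);
            table[i] = crc;
        }
        return table;
    }

    // Hash the input to a 16-bit value. SqlBinary is used instead of
    // SqlBytes to stay off the expensive LOB code path measured earlier.
    [SqlFunction(IsDeterministic = true, DataAccess = DataAccessKind.None)]
    public static SqlInt32 Crc16(SqlBinary input)
    {
        if (input.IsNull) return SqlInt32.Null;
        ushort crc = 0;
        foreach (byte b in input.Value)
            crc = (ushort)((crc >> 8) ^ Crc16Table[(crc ^ b) & 0xFF]);
        return new SqlInt32(crc);
    }
}

Since CRC16 returns a value in the range 0 to 65535, the output can be used as the bucket number directly, with no further division or modulo.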

Without further ado, here are the results I obtained by writing my own CRC16/CRC32 hash functions:

[Image: runtime and bucket distribution results for the custom CRC16/CRC32 functions]

Conclusion

If you need hash functions that spread data evenly over an integer space, you have to choose the hash function carefully. The built-in hash functions BINARY_CHECKSUM and CHECKSUM, while very fast, do not provide a good spread over a 16-bit integer space. The built-in SQL Server HASHBYTES function offers a good spread over the space, but comes at a high computational cost.

If you do not need the cryptographic properties of the HASHBYTES function, you can consider writing your own SQLCLR function to calculate hash values. With a properly chosen hash algorithm, and by avoiding the SqlBytes data type, you can hash values faster than HASHBYTES, while maintaining the benefits of a good spread over the integer space.