Using Hashing to Obscure Sensitive Data

At least two customers that I’ve worked with in the past month have been building Data Warehouses that contain Social Security Numbers (SSN) – in one case it is even being used as a foreign key for a very large fact table.  These customers were interested in scrambling the SSN so that it couldn’t be identified in reports, but still allow it to remain useful as a unique person identifier and accessible for lookup by authorized queries.

 

While SQL Server 2005 offers encryption, a SQL encrypted value is not deterministic and cannot be used for key lookup purposes.  A better solution is to use a Hash value based on the SSN, or even a bigint derived from a Hash. 

 

The SSN can be deterministically mapped (one-way) to a hash using both Integration Services (SSIS) or TSQL.  This allows the encoding to be efficiently performed within an Extract-Transform-Load pipeline or later on in queries that need to access the hash result.

 

To construct a 20-byte SHA1 hash of SSN within SSIS, input the SSN as a string in the pipeline, then append a leading “0” to the string using the Derived Column transform so that you obtain a 10-digit string.  In the same Derived Column transform, convert that result to a 5 byte stream (this treats the incoming characters as hex).  The Derived Column Transformation will look something like this:

 

Derived Column Name Derived Column Expression Data Type Length

Converted SSN <add as new col> “0”+[Raw SSN] byte stream 5

 

You can then build a 20-byte hash of this very easily with a couple of lines of code in a Script Transform.

 

. . .

Imports System.Security.Cryptography

Public Class ScriptMain

    Inherits UserComponent

    Dim mySHA1 As New SHA1CryptoServiceProvider()

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        Row.Sha1 = mySHA1.ComputeHash(Row.ConvertedSSN)

    End Sub

End Class

The resulting hex characters can be used as a string to insert in the database, or type conversion can be performed.  Since a 20-byte hash value may be overkill for the number of unique SSNs that you will encounter in your data, you can convert it to a positive bigint using the leftmost 7 bytes (or any set of 7 bytes you choose) within the same script.

 

Imports System.Security.Cryptography

Public Class ScriptMain

    Inherits UserComponent

    Dim mySHA1 As New SHA1CryptoServiceProvider()

    Dim SHA1Result(20) As Byte

    Dim longResult As Long

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        SHA1Result = mySHA1.ComputeHash(Row.ConvertedSSN)

        longResult = 0

        For i As Integer = 0 To 6

            longResult = CLng(longResult + (CLng(SHA1Result(6 - i)) * (256 ^ i)))

        Next

        Row.Sha1 = SHA1Result

        Row.Long7Bytes = longResult

    End Sub

End Class

This SSIS-based conversion can be performed with identical results within TSQL.  For example, to convert the SSN 566098776 to a 20 byte hash or a bigint associated with the first 7 bytes, use the SQL expressions below.  Don’t forget the leading 0 in the hex representation so that you have a full 5 bytes as an argument to the hashbytes function!

 

select hashbytes('SHA1', 0x0566098776)

select cast(substring(hashbytes('SHA1', 0x0566098776),1,7) as bigint)

 

There are 2 points of caution to keep in mind:

 

1)  Remember these hashed values are subject to a dictionary attack, so this technique is useful for obscuring data, but not sufficient for securing data.  A malicious user could construct a table of hash values (and decimal equivalents) for all possible SSNs and use that table to perform a reverse lookup.  The risk of this can be minimized by appending a ‘salt’ value – a secret long string of digits – to the SSN string before performing the hash.  Of course your code and the equivalent TSQL procedure to perform the salted hash must be secured so that the salt value cannot be discovered.

 

2)  Using a bigint built from a truncated hash may create an unacceptable risk of a hash collision if you are working with millions of SSN values.  The likelihood of a hash collision can be computed using the estimate that there is a 50% chance of a collision given 2^(#bits/2) input values.  For the bigint computed using 7 bytes, that means that there is a 50% chance of a collision if you hash 2^28 different values – or 268 million different SSNs.   So if your universe of input values is large, you may want to use the full hash string, or convert a longer substring of the hash to a Decimal value. 

 

By the way, this sort of deterministic hashing can also be very useful in clickstream data warehousing applications, where you want to construct a compact, unique value associated with a long URL string to use as a key.

 

-- Stuart Ozer