SQL Server 2005: searching encrypted data


How to search encrypted data is a question that came up several times on forums and I should have blogged about this issue earlier, but better later than never.


So the problem is that we have just encrypted that confidential column in our table, but we would also like to continue retrieving records based on its values, and we would like to do this in an efficient manner, which in a database means making use of an index. How can we achieve this?


Let’s first start by figuring out why we can’t directly index the encrypted data and search on it. The encryption algortihms in SQL Server 2005 are salted. By salting, I mean that the encryption algorithms are always using a random initialization vector (IV), which leads to the following property: encrypting twice the same piece of data using the same key will produce two different ciphertexts. The benefit of this is that if in a table we have a column value appearing several times and we encrypt that column, then the fact that the value repeats in the column is no longer apparent by examining the encrypted data. This adds additional protection against the analysis of the ciphertext (encrypted data), but also prevents the ability to search it efficiently. Also note that while the absence of salting would have permited equality searches, range searches would still not be possible as encryption algorithms are not preserving order.


So, how can we perform an equality search given that the encryption is salted and does not permit this? The solution is to add an additional column for holding hashes of the cleartext. The column containing hashes can then be indexed, and searching a piece of sensitive data can be done by first hashing it and then searching the hash value in the new column. This is all nice and easy, but it introduces the threat of a dictionary attack: an attacker could build hashes for as much secret data as he can generate or guess, and then he can verify whether this secret data exists in the table by comparing his hashes with the ones stored in the hash column. This is a significant threat.


How can we mitigate the threat of a dictionary attack against the secret data hashes? I will first describe the general idea behind the solution before moving to its description. The general idea is that we would want to add some secret to the hash computation, such that an attacker would not be able to construct a dictionary of hashes without knowing that secret. This mechanism is known as a MAC – message authentication code. There can be several ways of implementing MACs, but one solution can be to generate some random secret X and keep it stored encrypted some place in the database – it could be, for example, stored in a special table and encrypted with the same key that encrypts the secret data. When building the hashes of the secret data, this random secret can be added to the secret data before computing the hash. This would make the hash values dependent on this X, so even if an attacker knows an individual secret and the hashing algorithm, he still cannot compute the proper hash that we store, without knowing X. When searching data, we would hash X and the secret that we need to look up, and then we would search the resulting value in the hash column (which is now actually a MAC column). For more information on MACs, see for example “Applied Cryptography” by Bruce Schneier (MACs are covered in pages 455-459 of the second edition).


To conclude, encrypted data cannot be directly searched, but searches can be performed on MACs of the secret data that are stored in separate columns.

Comments (19)

  1. Ravi says:

    Hi,

    You mentioned that encrypted data cannot be directly searched, I am little confused with this statement may be i did not understand properly. Please clarify

    I have a column encyrpted with a symmetric key and a certificate and I created an index on the encrypted column and in my query i put a where clause on the encrypted column and it is doing an index scan using the index. Do you mean that because it is encrypted data it cannot do an index seek and hence doing an index scan?

    Regards,

    Ravi

  2. Yes, what I meant by not being able to search the encrypted data is that it cannot be searched _efficiently_.

    To elaborate, let’s say I have an encrypted SSN column and one SSN value I want to search for is ‘111-11-1111’. I could write two query clauses (syntax is simplified):

    (a) select … where decryptbykey(SSN) = ‘111-11-1111’

    This will do an index scan and will not be able to do a seek, as you mentioned. The reason for this is that we cannot know which of the index values is actually going to decrypt to ‘111-11-1111’, so we have to scan them. This will be very slow, not only because of the scan but also because of all of the decryptions that will be made for each entry that we scan.

    (b) select … where SSN = encryptbykey(‘111-11-1111’)

    This will simply not work, because of the salting. Even if ‘111-11-1111’ is in the table, this query will not be able to find it. But note that this type of query can use an index seek and will only perform the encryption operation once.

    Here’s a small demo of these ideas:

    =======

    create database test

    use test

    create symmetric key skey with algorithm = triple_des encryption by password = ‘Test!@#$5’

    open symmetric key skey decryption by password = ‘Test!@#$5’

    create table t (ssn varbinary(256) primary key clustered)

    insert into t values (encryptbykey(key_guid(‘skey’), ‘111-11-1111’))

    insert into t values (encryptbykey(key_guid(‘skey’), ‘222-22-2222’))

    insert into t values (encryptbykey(key_guid(‘skey’), ‘333-33-3333’))

    select * from t

    select convert( varchar(256), decryptbykey(ssn) ) from t

    — In SSMS, use Ctrl-L or from menu: QueryDisplay Estimated Execution Plan, to check execution plan

    — scan

    — a decryption will take place for each scanned row



    select convert( varchar(256), decryptbykey(ssn) ) from t where convert( varchar(256), decryptbykey(ssn) ) = ‘111-11-1111’

    — seek, but no result, because of IV use



    select convert( varchar(256), decryptbykey(ssn) ) from t where ssn = encryptbykey(key_guid(‘skey’), ‘111-11-1111’)

    use master

    drop database test

    =======

    Hope this makes it clearer.

    Thanks

    Laurentiu

  3. Mark says:

    There’s one aspect of your article I need to clarify. While using MACs allows you to perform indexed searches on data without having to decrypt it, am I correct in stating it will only return direct matches, but not wildcard searches?

    If, for example, I wanted to perform a search like:

    Select * …. where Surname like ‘and%’

    Would this be possible using hashing?

    Thanks,

    Mark

  4. Yes, the MAC solution I described above only allows index use for direct searches; it does not allow index use for wildcard searches or range searches.

    No, you cannot use hashing to enable efficient wildcard searches. Encryption and searching are conflicting objectives.

  5. Iaan says:

    I want to give a client a secured database (sql 2005), then only allow my .net application which i also supply to acces the data. The client shouln’t be able to read the data directly. I do need to do wild card searches on the encrypted / secured data /database. I also need to index my data, as it has 3 million records times 10 columns of varchar data.

    It doesn’t have to be too secure. Just enough to prevent the honest client from directly accessing it. I can even just hard code a password into my app and execute my queries directly from the app.

    What are my options? What do you suggest

    Thanks

    Iaan

  6. You cannot perform wildcard searches efficiently on data that is encrypted with a strong algorithm. If you could, then it would mean that the encryption algorithm is not as strong as you think it is.

    You can perform searches on data that is encrypted with a simple substitution cipher, but such a cipher can be broken easier than solving a crossword puzzle. This would only protect you from people that are not interested in your data, which were not a threat in the first place.

    What you’re looking for is a type of DRM solution. The encryption in SQL Server is using algorithms designed for data protection. A DRM solution would need to use different techniques.

  7. Raul’s new post provides additional information on indexing encrypted data, as well as a demo: http://blogs.msdn.com/raulga/archive/2006/03/11/549754.aspx.

  8. Bruce says:

    We have a requirement that we encrypt SSNs but they also must be searchable. I’ve played around with the hashing the cleartext for searching, but I’m getting results that concern me. If I have 2 SSNs that are close, a hash generated with SHA1 will generate a duplicate hash value. For instance:

    DECLARE @Hash1 nvarchar;

    DECLARE @Hash2 nvarchar;

    SELECT @Hash2 = CONVERT(nvarchar,’501679344′);

    SELECT @Hash1 = CONVERT(nvarchar,’501679493′);

    IF (Select HashBytes(‘SHA1’,@Hash1)) = (Select HashBytes(‘SHA1’,@Hash2))

     PRINT ‘Same’

    Else

     PRINT ‘Different’

    The output is ‘Same’ for 2 different SSNs. Granted, they’re not far from each other, but they are different. What am I doing wrong?

  9. You are declaring your variables as nvarchar. They should be nvarchar(9). nvarchar only stores one character, so the values that you hash are both ‘5’, so they obviously hash to the same value.

    It is however theoretically possible to hit a collision for two different SSNs, such that they both hash to the same value. To cover this possibility, you would first want to allow such a hash to be inserted in the table, so the index you build on the MAC column should not be declared UNIQUE. Your search code should be prepared to retrieve more than one row, and if it does that, it should compare the searched SSN with the decrypted SSNs from the resultset to decide which row is the correct one. You’re probably not going to ever hit this code, but it should be there to take care of this unlikely scenario.

    I want to stress that I find a collision very unlikely to happen. This is because SSNs are just 9 digit values and SHA1 hash values are 20 byte long. So we’re hashing from a set with cardinality 10^9 to a set with cardinality 2^160 (160 = 20 * 8), which is much larger (10^9 is about 2^30).

  10. Mike C# says:

    According to BOL, SQL 2005 does *not* salt the encryption when using RC4 and RC4_128 stream ciphers.

  11. That is correct. But you should not use RC4. The non-salted behavior is due to a bug. Also, there is no keystream burning, so I’ll say it again: you should avoid using SQL Server 2005’s RC4.

  12. Mike C# says:

    What options do you have if you don’t want a random IV, or if you want to specify the IV/Salt?  Write your own SQLCLR UDF’s?  Just wondering…  Thanks

  13. Yes, CLR would be an option. Extended procedures would be another.

  14. Arash says:

    Hi Cris,

    I have a similar problem about searching but not exactly the same. I have a table of first & last names plus some additional data fields. I need to do range and LIKE searches on names. To secure it I encrypted only the last name column since in my case it is not important if others can see a part of the data because it would be useless to them.

    Now the problem of searches came about as my database is around 150,000 records. I was wondering if I move the last name column along with a copy of the ID column to a different table and instead of encrypting the last name column, encrypt the ID column. As I said I only need to hide the link between the two tables to make the data unusable, so that now I can create an index on last name (that is plaintext now).

    What do you think?

    By the way, it seems reasonable to mark the function EncryptByKey as indeterministic (because of the salt, as you explained); but why so for DecryptByKey? Doesn’t it always return the same output for the same encrypted value?

    Thanks for your time.

  15. You won’t be able to join the two tables if the ID is encrypted. And if you use hashing, you’re not hiding the link.

    DecryptByKey is non deterministic, because the result will vary depending on whether the key was opened or not. The return value depends on more than just the input arguments.

  16. maysam says:

    First of all let me thank you for your informative article!

    There is still something that needs to be clarified. In my humble opinion, there is no need to add a MAC column to make our data searchable. This is why I think it’s not required:

    The data in our Plaintext column are either unique or not.

    If they are unique, we are sure that encrypting them with the same key will not (as you would expect) result in the same ciphertext. Saying that, there is no need for salting, which means we can use the same IV for all encryptions or even no IV at all. In this case, there is no need for a MAC column, because we can simply encrypt our search phrase using no IV and just compare it to the encrypted data to find the desired value.

    In the latter case, when our plaintext data are not unique, using different IVs (using salting for encryption) seems to be a redundant and useless step. Using salting for encryption in this case, as you’ve mentioned in your article, adds additional protection against the analysis of the ciphertext (encrypted data), but also prevents the ability to search it efficiently. Therefore, we need to add a MAC column. Now, I just want you to answer this question: Is it true that the MAC column for two similar plaintext values is the same? I think your answer is yes. If so, an attacker can easily guess the plaintext value of a record by analyzing other MAC records if only he knows the MAC value of just one record. This is obviously against what we intended to achieve by salting the encrypted data, isn’t it? Again encrypting data with no salting and searching on encrypted data makes more sense.

    At this point, I can only see two reasons to use a hash (MAC) column. A hash value is shorter than the encrypted data and can be searched more quickly. Also when we use encryption with authenticator the ciphers won’t be the same and hash column will be useful for search purpose.

    Please correct me if I haven’t understood something properly.

    Thanks in advance

    – Maysam

  17. Hi maysam,

    The IV is important whether the data encrypted is unique or not. See http://blogs.msdn.com/lcris/archive/2006/05/08/592868.aspx for a more in-depth explanation, but the short version is that if you do not use an IV, you can still figure out data that has common prefixes and if you happen to know what one ciphertext corresponds to (maybe it’s your data or data that you have inserted), then you can find out something about the other as well.

    There could be scenarios where an IV may not be necessary, but these would be special cases and the SQL Server encryption routines were designed to be safe no matter the scenario.

    Yes, the MAC is the same for the same data. But you seem to think of a MAC as if it were a simple hash, and that is not the case. There is a big difference between these two. A hash would not be sufficient because a hash allows a dictionary attack. I will use SSNs as an example, as these are often what people are encrypting. An attacker can just determine what hash function is used and then he can hash all possible SSN numbers to determine which ones you have in your data. A MAC prevents this attack because the MAC is a keyed hash – it depends on a key and the attacker does not know that. The attacker can determine that two values have the same MAC, thus they are the same, but he cannot figure what SSN value they correspond to. So, your observation that an attacker can determine the plaintext value from the analysis of a MAC is incorrect – he can only determine if the plaintexts are the same, but not what they are.

    Hope this helps

    Laurentiu

  18. noter says:

    Thanks for the article. I have a question that I hope is neither too stupid nor too obvious. Why store both a MAC column and an encrypted value column? If the MAC column is necessary to facilitate indexing/searching, and it’s "less secure" than the encrypted value column, what good does the encrypted value column accomplish? It seems like the equivalent of putting burglar bars on one window of your house but not on the window next to it. What am I missing?

  19. Who said the MAC is "less secure"? It allows you to figure out which pieces of data are equivalent, yes, but we’re starting from the assumption that we don’t care about this, or we wouldn’t discuss about searching encrypted data, because that’s what you have to give away to enable such search.

    To retrieve the data, you must have it in some form – the MAC cannot get you the data back; only the encrypted value can allow you to retrieve the data, by using the decryption key, so you cannot just use a MAC.

    You have two things that you want to achieve:

    (1) store data in a secure way

    (2) search the data

    (1) is achieved via encryption, but encryption doesn’t allow you to achieve (2) as well. The MAC helps you with (2) but doesn’t allow you to get the data back, so you don’t get (1).

    So, to achieve both (1) and (2), you can combine encryption with a MAC value.

    A way to work around this problem of searching encrypted data is to not attempt it at all: you can have encryption/decryption performed at a lower level than the data access, as is the case with the new Transparent Data Encryption feature – with this approach, all your searches will be done on decrypted data, because decryption happens before you access that data.