Compression for Speed and Cost Savings

[This article was contributed by the SQL Azure team.]

SQL Azure doesn’t currently support page level or row level compression like the enterprise edition of SQL Server 2008. However, you can implement your own column level compression in your data access layer to take advantage of the performance and cost savings of compression. I will discuss how to do this and provide some example code in this blog post.

Column level compression is the concept of compressing your data before you write it to the database, and decompressing it when you read it from the database. For example, instead of having a varchar(max) column of text you have a varbinary(max) column of compressed text that holds on average 80% less data.

The Right Scenario

Only in certain scenarios does column level compression work well. When compressing columns consider that:

  • Large text columns are the best to compress; they gain you the most savings. The gain from compression must exceed the cost of creating the compression dictionary (a result of using deflate compression); this only happens when there is a large amount of data that repeats itself. This technique will benefit not only text but large xml and binary data as well, depending on the content. For example, you don’t want to compress image blobs – they are usually already compressed. A quick way to test whether a column will benefit is sketched just after this list.
  • Don’t compress columns that appear in the WHERE clause of your queries. With this technique you can’t query on the compressed text without decompressing it. You also won’t be able to query or access these fields through SSMS or load data directly using BCP.exe.
  • Compress columns that can be cached on the application side, to avoid multiple reads; this avoids the costs of decompressing the text.
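
If you are not sure whether a column is a good candidate, a quick test is to compress a few representative values and compare the sizes. Here is a minimal sketch (the helper name and sample input are mine, not part of the data layer) that returns the compressed-to-original ratio; anything that does not come in well under 1.0 is probably not worth compressing:

 static Double CompressionRatio(String text)
{
    // Unicode == nvarchar
    Byte[] encoded = Encoding.Unicode.GetBytes(text);

    using (MemoryStream memoryStream = new MemoryStream())
    {
        using (GZipStream gzipStream = new GZipStream(memoryStream,
            CompressionMode.Compress, true))
        {
            gzipStream.Write(encoded, 0, encoded.Length);
        }

        // Compressed size divided by original size; lower is better
        return ((Double)memoryStream.Length / encoded.Length);
    }
}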

An example scenario that works well is a web-based product catalog where you compress the product description, you don’t need to search within the description, and it doesn’t change very often.

The Benefits

Compression can reduce the amount of data you are storing, creating potential cost savings. It can also potentially help you stay below the maximum 50 Gigabyte database size for SQL Azure and avoid the development costs of partitioning.

In certain scenarios compression can result in speed improvements for queries that are performing full table scans on the clustered index. When dealing with large-value data types, if the data in the column is less than 8,000 bytes it is stored in the page with the rest of the row data. If compression brings the column under that 8,000 byte limit, more rows fit on each page, giving you performance gains on full table scans of the table.
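
One way to see whether this is happening is to count how many compressed values fit under the 8,000 byte threshold. The sketch below runs against the Images table and PageUriCompressed column used later in this post; treat it as an outline and adjust the names for your own schema:

 static Int32 CountRowsStoredInPage()
{
    using (SqlConnection sqlConnection =
        new SqlConnection(
            ConfigurationManager.ConnectionStrings["SQLAzure"].ConnectionString))
    {
        sqlConnection.Open();

        // Rows whose compressed value is small enough to be stored in-page
        SqlCommand sqlCommand = new SqlCommand(
            "SELECT COUNT(1) FROM [Images] " +
            "WHERE DATALENGTH(PageUriCompressed) <= 8000",
            sqlConnection);

        return ((Int32)sqlCommand.ExecuteScalar());
    }
}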

Table Modification

Compressed data should be stored in a varbinary(max) column. If you are compressing an nvarchar(max) column, you will need to create an additional column to compress your data into. You can do this with an ALTER TABLE command. Once all the text has been compressed, you can delete the nvarchar(max) column. Here is a little example Transact-SQL that adds a column to a SQL Azure table:

 ALTER TABLE Images ADD PageUriCompressed varbinary(max) NOT NULL DEFAULT(0x0)

The Data Layer

Fortunately, .NET CLR 2.0 has some great compression built into the System.IO.Compression namespace with the GZipStream class. The first thing I need to do is create a throw-away console application that I will use once to compress all the existing rows into the new column. Here is what it looks like:

 do
{
    using (SqlConnection sqlConnection =
        new SqlConnection(
            ConfigurationManager.ConnectionStrings["SQLAzure"].ConnectionString))
    {
        String pageUri;
        Int64 Id;

        // Open the connection
        sqlConnection.Open();

        // Pull One Row At A Time To Prevent Long Running
        // Transactions
        SqlCommand sqlCommand = new SqlCommand(
            "SELECT TOP 1 ID, PageUri FROM [Images] WHERE PageUriCompressed = 0x0",
            sqlConnection);

        using (SqlDataReader sqlDataReader = sqlCommand.ExecuteReader())
        {
            // WWB: Exit Do Loop When There Are No More Rows
            if (!sqlDataReader.Read())
                break;

            pageUri = (String)sqlDataReader["PageUri"];
            Id = (Int64)sqlDataReader["ID"];
        }

        Console.Write(".");

        // Compress Into the Memory Stream
        using (MemoryStream memoryStream = new MemoryStream())
        {
            using (GZipStream gzipStream = new GZipStream(memoryStream,
                CompressionMode.Compress, true))
            {
                // Unicode == nvarchar
                Byte[] encodedPageUri = Encoding.Unicode.GetBytes(pageUri);
                gzipStream.Write(encodedPageUri, 0, encodedPageUri.Length);
            }

            // Now Everything is compressed into the memoryStream
            // Reset to Zero Because We Are Going To Read It
            memoryStream.Position = 0;

            // WWB: Stream for Writing
            using (SqlStream sqlStream = new SqlStream(sqlConnection,
                "dbo",
                "Images",
                "PageUriCompressed",
                "ID",
                SqlDbType.BigInt, Id))
            {
                using (BinaryReader binaryReader = new BinaryReader(memoryStream))
                {
                    using (BinaryWriter binaryWriter = new BinaryWriter(sqlStream))
                    {
                        Int32 read;
                        Byte[] buffer = new Byte[1024];
                        do
                        {
                            read = binaryReader.Read(buffer, 0, 1024);
                            if (read > 0)
                                binaryWriter.Write(buffer, 0, read);

                        } while (read > 0);
                    }
                }
            }
        }
    }
} while (true);

Console.WriteLine("");

This code uses the SqlStream class that was introduced in this blog post. It also tries to make good use of local memory and not consume too much if the string being compressed is very large. However, this results in a very “chatty” application that creates a lot of connections to SQL Azure and runs slower than I would like.
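
One way to reduce the chattiness is to pull a batch of rows per connection instead of one. The sketch below assumes a hypothetical CompressAndWrite() helper that wraps the GZipStream and SqlStream steps from the loop above against the already-open connection:

 // Sketch of a less chatty variant: read up to 100 uncompressed rows,
// then compress and write each one over the same connection.
// CompressAndWrite() is a hypothetical helper wrapping the GZipStream
// and SqlStream steps shown in the loop above.
using (SqlConnection sqlConnection =
    new SqlConnection(
        ConfigurationManager.ConnectionStrings["SQLAzure"].ConnectionString))
{
    sqlConnection.Open();

    SqlCommand sqlCommand = new SqlCommand(
        "SELECT TOP 100 ID, PageUri FROM [Images] WHERE PageUriCompressed = 0x0",
        sqlConnection);

    List<KeyValuePair<Int64, String>> batch =
        new List<KeyValuePair<Int64, String>>();

    // Read the whole batch first so the reader is closed before any writes
    using (SqlDataReader sqlDataReader = sqlCommand.ExecuteReader())
    {
        while (sqlDataReader.Read())
            batch.Add(new KeyValuePair<Int64, String>(
                (Int64)sqlDataReader["ID"],
                (String)sqlDataReader["PageUri"]));
    }

    // One connection now services up to 100 rows
    foreach (KeyValuePair<Int64, String> row in batch)
        CompressAndWrite(sqlConnection, row.Key, row.Value);
}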

Evaluation

My next step is to evaluate whether the compression really helped. I do this because it can be hard to estimate the benefits of compression until you have compressed your real world data. To do this I use the DATALENGTH function in Transact-SQL to sum up the sizes of the two columns, i.e. before and after compression. My query looks like this:

 SELECT COUNT(1), SUM(DATALENGTH(PageUri)), SUM(DATALENGTH(PageUriCompressed))
FROM Images

The results look like this:

[Screenshot: the query results, showing the row count and the summed DATALENGTH of PageUri and PageUriCompressed]

I can see that compression is actually going to reduce the size of my database in this case. In some scenarios compression makes the data bigger, usually when the overhead of the compression dictionary exceeds the gains from compression. As a rule of thumb, to get effective compression on text you need multiple repeating phrases, which happens with longer blocks of text.
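
If you prefer to run the evaluation from code, here is a minimal sketch (the method name is mine) that prints what fraction of the original column size the compressed column takes up; it assumes the same connection string and table as above and that both columns contain data:

 static void PrintCompressionRatio()
{
    using (SqlConnection sqlConnection =
        new SqlConnection(
            ConfigurationManager.ConnectionStrings["SQLAzure"].ConnectionString))
    {
        sqlConnection.Open();

        SqlCommand sqlCommand = new SqlCommand(
            "SELECT SUM(DATALENGTH(PageUri)), " +
            "SUM(DATALENGTH(PageUriCompressed)) FROM Images",
            sqlConnection);

        using (SqlDataReader sqlDataReader = sqlCommand.ExecuteReader())
        {
            sqlDataReader.Read();

            // Convert handles int or bigint, whichever SUM returns
            Double before = Convert.ToDouble(sqlDataReader[0]);
            Double after = Convert.ToDouble(sqlDataReader[1]);

            Console.WriteLine("Compressed column is {0:P1} of the original size",
                after / before);
        }
    }
}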

Code

Now that the column is compressed, you need to be able to read and decompress data from it, and compress data before writing it back to the column. Here is a little bit of example code to help you do that:

 protected static String Read(Int64 id)
{
    using (SqlConnection sqlConnection =
        new SqlConnection(
            ConfigurationManager.ConnectionStrings["SQLAzure"].ConnectionString))
    {
        sqlConnection.Open();

        SqlCommand sqlCommand = new SqlCommand(
            "SELECT PageUriCompressed FROM [Images] WHERE ID = @Id",
                sqlConnection);

        sqlCommand.Parameters.AddWithValue("@Id", id);

        using (SqlDataReader sqlDataReader = sqlCommand.ExecuteReader())
        {
            sqlDataReader.Read();

            Byte[] compressedPageUri = 
                (Byte[])sqlDataReader["PageUriCompressed"];

            using (MemoryStream memoryStream = 
                new MemoryStream(compressedPageUri))
            {
                using (GZipStream gzipStream = new GZipStream(memoryStream, 
                    CompressionMode.Decompress))
                {
                    using (StreamReader streamReader =
                        new StreamReader(gzipStream, Encoding.Unicode))
                    {
                        return (streamReader.ReadToEnd());
                    }
                }
            }
        }
    }
}
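
Since one of the guidelines above is to compress columns that can be cached on the application side, you might wrap Read() in a simple cache so each row is only decompressed once. This is a minimal sketch using a Dictionary; it is not thread safe and has no eviction policy:

 // Minimal application-side cache so each row is only decompressed once.
// Not thread safe; a real cache would also need an eviction policy.
private static Dictionary<Int64, String> pageUriCache =
    new Dictionary<Int64, String>();

protected static String CachedRead(Int64 id)
{
    String pageUri;

    if (!pageUriCache.TryGetValue(id, out pageUri))
    {
        pageUri = Read(id);
        pageUriCache.Add(id, pageUri);
    }

    return (pageUri);
}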

For writing back to the compressed column:

 protected static void Write(Int64 id, String pageUri)
{
    using (SqlConnection sqlConnection =
        new SqlConnection(
            ConfigurationManager.ConnectionStrings["SQLAzure"].ConnectionString))
    {
        // Open the connection
        sqlConnection.Open();

        // Compress Into the Memory Stream
        using (MemoryStream memoryStream = new MemoryStream())
        {
            using (GZipStream gzipStream = new GZipStream(memoryStream,
                CompressionMode.Compress, true))
            {
                // Unicode == nvarchar
                Byte[] encodedPageUri = Encoding.Unicode.GetBytes(pageUri);
                gzipStream.Write(encodedPageUri, 0, encodedPageUri.Length);
            }

            // Now Everything is compressed into the memoryStream
            // Reset to Zero Because We Are Going To Read It
            memoryStream.Position = 0;

            // WWB: Stream for Writing
            using (SqlStream sqlStream = new SqlStream(sqlConnection,
                "dbo",
                "Images",
                "PageUriCompressed",
                "ID",
                SqlDbType.BigInt, id))
            {
                using (BinaryReader binaryReader = new BinaryReader(memoryStream))
                {
                    using (BinaryWriter binaryWriter = new BinaryWriter(sqlStream))
                    {
                        Int32 read;
                        Byte[] buffer = new Byte[1024];
                        do
                        {
                            read = binaryReader.Read(buffer, 0, 1024);
                            if (read > 0)
                                binaryWriter.Write(buffer, 0, read);

                        } while (read > 0);
                    }
                }
            }
        }
    }
}
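
Usage is then just a matter of calling Write() and Read(); the id and text below are made up for illustration:

 // Compress and store the text, then read it back and decompress it
Write(42, "http://example.com/a/very/long/page/uri");
String pageUri = Read(42);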

SQL Compression

For comparison, the compression built into on-premises SQL Server is transparent to the application and benefits a much wider range of types, from decimal and bigint all the way up to the larger types, because it is scoped to the row and the page. In other words, it can use the compression dictionary across all the columns in the page. You can read more about SQL Server 2008 compression here.

Summary

Do you have a better way to accomplish the same thing? Post it in the comments below. Do you have questions, concerns, comments? Post them below and we will try to address them.