What I learned at PASS - ETL in Queries.

I attended only a small number of the talks at PASS.  I really enjoyed Bob Ward's talk on memory and Paul Randal's talk on DBCC corruption/repair.  Otherwise, I spent a lot of time answering questions from customers.  LOTS of questions.  So many questions I pretty much lost my voice.

Overall, I was amazed at how many people had challenging questions at PASS - I didn't expect to be busy the whole day answering questions like that. I conclude that SQL Server users are all above average, better looking, and generally smarter than non-SQL Server users :).

While I received lots of questions, advice, feature requests, requests to debug live servers, and much more, the topic that came up the most was ETL - lots of data loading questions, specifically.  So, I'll talk a bit about data loading in SQL Server and in query plans.

Most query plans are complicated because we have to deal with tree normalization, statistics, exponential search spaces, and a host of other problems. Whenever you have queries with lots of joins, this is typically the kind of stuff I end up examining to see why a query is fast or slow. ETL _can_ be about this, but often it is more about how much computing power you can throw at the problem.

In automotive terms, algorithmic optimization is about knowing what gear to use, what line to take in the turns, and when to do smart things like draft off someone else. Loading lots of data, on the other hand, is about horsepower. The more, the merrier.

As I had this conversation a few times with customers, I'll share the questions I asked and the typical advice that came out of those discussions:

1. Have you separated out the major streams of data in your system?  Is your data on a different drive than your log and a different drive again from tempdb?  If they are on different drives, do you need different storage controllers?
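To make that concrete, here's a minimal sketch of the separation in T-SQL. The database name, logical file names, and drive letters (D:, E:, F:) are hypothetical stand-ins for physically separate volumes:

```sql
-- Hypothetical layout: data on D:, log on E:, tempdb on F:.
CREATE DATABASE LoadTarget
ON PRIMARY
    (NAME = LoadTarget_data, FILENAME = 'D:\Data\LoadTarget.mdf')
LOG ON
    (NAME = LoadTarget_log,  FILENAME = 'E:\Log\LoadTarget.ldf');

-- tempdb can be relocated the same way; the new location takes
-- effect the next time the service restarts.
ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, FILENAME = 'F:\TempDB\tempdb.mdf');
```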

2. Some people buy storage aggregators - basically big striping solutions over a LOT of drives (think RAID). These are nice because you can theoretically load at the combined throughput of many drives. You can either federate traffic on these (put tempdb on a subset of the drives, for example) or buy multiple aggregation solutions, one for each set of traffic, if you prefer.

Remember, traditional drives are fastest at sequential writes because moving the disk head is relatively expensive. So, you want to minimize disk head seeks, which means writing sequentially when possible. I had one customer who had put far more files than spindles on one of these, and I have no idea why (if someone has a good reason, let me know - I don't buy these things every day, so perhaps there is a good reason that eludes me).
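If you want to check whether a layout like that is actually keeping the drives busy or stalling on them, the per-file I/O stats DMV (available since SQL Server 2005) is a quick sanity check. This is just a generic diagnostic sketch, not something from those conversations:

```sql
-- Per-file write volume and write stalls, hottest files first.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.num_of_bytes_written,
       vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id
ORDER BY vfs.io_stall_write_ms DESC;
```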

So the advice from the first two points is simple - separate things out so you can get greater throughput.

Now we get into the more interesting things.

3. Bulk insert can work in parallel; a regular INSERT typically doesn't. SQL Server typically can't support more than one writer to a table in a single query, so you often get blocked on this if you want to insert into an existing table (and bulk insert has restrictions). If you want to add more stuff to your table quickly, consider partitioning. You can create a new table, bulk insert into that table, and then SWITCH the partition into your big table - a sketch of the pattern follows. In fact, we did substantial work in SQL Server 2005 and 2008 to make that work well. It's called the Sliding Window Scenario (look it up in Books Online).
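Here's a simplified sketch of that load-and-switch pattern. Every name in it - BigTable, the partition function PF_LoadDate, the staging table, the file path, the dates - is a hypothetical placeholder, and the real checklist is longer (the staging table must match the target's schema, indexes, and filegroup):

```sql
-- Staging table on the same filegroup as the target partition;
-- its schema must match dbo.BigTable exactly.
CREATE TABLE dbo.Staging
(
    LoadDate DATETIME     NOT NULL,
    Payload  VARCHAR(100) NULL
) ON [PRIMARY];

-- Constrain staging to exactly one partition's range so the
-- SWITCH can be verified as a metadata-only operation.
ALTER TABLE dbo.Staging
    ADD CONSTRAINT CK_Staging_LoadDate CHECK (LoadDate = '20081120');

-- Bulk load; TABLOCK makes the load eligible for minimal logging.
BULK INSERT dbo.Staging
    FROM 'C:\loads\20081120.dat'
    WITH (TABLOCK);

-- Metadata-only move of the loaded rows into the big table.
ALTER TABLE dbo.Staging
    SWITCH TO dbo.BigTable PARTITION $PARTITION.PF_LoadDate('20081120');
```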

If you understand that query plans can't write to a single rowset in parallel, you'll understand why the bottleneck moves from points 1-2 to point 3. Once you hit the rate at which your CPU and memory can push rows on a single thread, you're in trouble. So, federating your work is key to a highly scalable app: you can bulk insert into the same table with multiple threads, and you can build multiple partitions-to-be at the same time. This lets you exploit all of the CPU cores you have as well. Eventually you will run out of something else (memory bandwidth?), but by then I'll say that most customers are happy.
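Continuing the hypothetical names from the sketch above, the federated version looks like this - each session loads its own staging table, so no two writers contend on a shared rowset:

```sql
-- Session 1 and session 2 run concurrently, each against its own
-- staging table (no shared rowset, so no writer contention).

-- Session 1:
BULK INSERT dbo.Staging_A FROM 'C:\loads\20081120.dat' WITH (TABLOCK);

-- Session 2, at the same time:
BULK INSERT dbo.Staging_B FROM 'C:\loads\20081121.dat' WITH (TABLOCK);

-- Once each staging table is ready, switch it into its own partition
-- of the big table; the switches are cheap metadata moves.
ALTER TABLE dbo.Staging_A
    SWITCH TO dbo.BigTable PARTITION $PARTITION.PF_LoadDate('20081120');
ALTER TABLE dbo.Staging_B
    SWITCH TO dbo.BigTable PARTITION $PARTITION.PF_LoadDate('20081121');
```

(As an aside, if the target is a heap with no indexes, multiple concurrent BULK INSERT sessions with TABLOCK can even load the same table at once, since the bulk-update locks they take are compatible with each other.)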

The advice is basic, but most of the problems I saw around loading lots of data boiled down to one of these issues - understand how to keep all of the drives busy, and then all of the CPUs busy and as productive as possible, and you will maximize your hardware.

Conor