ETL World Record!

Today at the launch of SQL Server 2008, you may have seen the references to world-record performance doing a load of data using SSIS. Microsoft and Unisys announced a record for loading data into a relational database using an Extract, Transform and Load (ETL) tool. Over 1 TB of TPC-H data was loaded in under 30 minutes. I wanted to provide some background material in the form of a Q&A on the record, since it’s hard to give many details in the context of a launch event. We are also planning a paper that talks about all this, so think of this article as a place-holder until the full paper comes along. I hope you find this background information useful.

- Len Wyatt

How fast was the data load?

More than one terabyte of data was parsed from flat files, transferred over the network and loaded into the destination database in less than 30 minutes, a world record beating all previously published results using an ETL tool. That is a rate in excess of 2 TB per hour (650+ MB/second). To be precise, 1.18TB of flat file data was loaded in 1794 seconds. This is equivalent to 1.00TB in 25 minutes 20 seconds or 2.36TB per hour.

Why is this important?

Businesses have ever-increasing volumes of data stored in many heterogeneous systems. Thay want to know that an ETL tool they choose will be able to support any data volumes they might require. Microsoft has been making a significant investment in SQL Server Integration Services (SSIS), and this record illustrates the capability of SQL Server Integration Services 2008, SQL Server 2008 and the Unisys ES7000 to handle a significant volume of data at a dramatic speed.

Why not just do a bulk load of the data?

It is rare in businesses today that data is always available on the destination system, and does not need to be standardized or corrected for errors before loading. These rare cases are the times that bulk loading data makes sense. Data integration can involve complex transformation rules, error checking and data standardization techniques. ETL tools like SSIS can perform these functions such as moving data between systems, reformatting data, integrity checking, key lookups, tracking lineage, and more. SSIS has proven itself to be a versatile ETL tool, and now it is shown to be the fastest one as well.

What data did you choose to load?

DBGEN tool from the TPC-H benchmark was used to generate 1.18 TB of source data. The data were partitioned by DBGEN, allowing it to be loaded in parallel from multiple systems. DBGEN generates data on customers, parts, suppliers, orders and line items. It is broadly representative of a wholesale business. The data contain a variety of data types, including dates, money amounts, integers, strings and flags.

Please note that the ETL loading results are not TPC-H benchmark results and should not be compared to TPC-H benchmark results.

Was this a certified benchmark?

There is no commonly accepted benchmark for ETL tools. Microsoft thinks there should be. Industry standard benchmarks can lead to healthy competition, better products, and better publication of the techniques used to get high performance. Microsoft would welcome the opportunity to join with others in the industry to define a common benchmark that reflects the real-world uses of ETL tools.

The use of TPC-H data for this project was a convenience. This is not a TPC-H benchmark result.

How does this compare to your competitors?

Multiple competitors have published results based on TPC-H data. Informatica has the fastest time previously reported, loading 1 TB in over 45 minutes. SSIS has now beaten that time by more than 15 minutes.

There are other claims of fast times that have been made, but on non-standard data sets and without enough information to allow any meaningful comparison. This is part of the reason Microsoft would support the creation of an industry standard ETL benchmark.

What system configuration was used?

The database server ran on a Unisys ES7000/one Enterprise Server , with 32 socket dual core Intel® XeonTM  3.4 Ghz (7140M) processors , 256 GB RAM and 8 dual port 4Gbit HBA’s . The SQL Server data was stored on an EMC Clariion CX3-80 SAN with 165 (146 GB/15 krpm) spindles. The database server ran a pre-release build of SQL Server 2008 Enterprise Edition (V10.0.1300.4, built just before the “February 2008 CTP”) on the Windows Server 2008 x64 Datacenter Edition operating system.

 

Four servers acted as data sources, modeling the fact that data comes from a variety of systems in a modern enterprise. Each source server ran SSIS packages that sent data across the network to the database server. The source servers ran SSIS from SQL Server build V10.0.1300.4, on the Windows Server 2008 operating system. Source data came from flat files, as it was generated by DBGEN.

For the source servers, 4 Unisys ES3220L servers with Windows2008 x64 Enterprise Edition were used. Each server is equipped with 2 x 2.0GHz quad core Intel® processors, 4GB RAM, a dual port 4Gbit Emulex HBA and Intel PRO1000/PT network card. The source data was read from 2 x EMC Clariion CX600 SAN’s with 45 spindles each.

The Source servers were connected to the ES7000/one server database server with private dual port 1Gb Ethernet connections.

Why use multiple source systems?

Modern large businesses are complex operations. Large data sets are often the result of multiple data feeds. This made the test more realistic by mimicking a real world ETL scenario.

What do the SSIS packages look like?

There was just one package, though the source systems ran multiple instances of it. It is quite simple: There is one control flow for each “stream” of data generated by DBGEN. The control flow has one data flow for each table, each data flow reading data from a flat file source and writing to the SQL Server database via OLEDB. Using this data set there is a one-to-one column mapping between the flat file data and the database tables.

Did Windows Server 2008 figure in to this?

A lot of innovative engineering work in Windows Server 2008, including significant improvements in memory management, PCI and block storage I/O, and core networking, helped achieve this great performance. Because of these advances, Windows Server 2008 sustained about 960 megabytes per second over the Ethernet network, during processing of one large table.

Were secret internal tricks were needed to make this work?

No secret internal tricks or special builds were needed. Although this project used a pre-release version, it was a regular SQL2008 Enterprise Edition build. No special code in the product was used. Everything we did could be replicated by others.

The main thing done in the relational database was to use “soft NUMA” and port mapping to get a good distribution of work within the system. This is a published technique; you can find articles about it on MSDN. We also set the –x flag on starting SQL Server. This reduces the time SQL Server spends collecting performance statistics at run-time.

In SSIS we made sure the data types used in the SSIS data flows matched the types used in SQL Server, so the data did not need to be converted again after the initial conversion of strings read from flat files. Fast Parse is set on the text file fields where it applied.

The network connections on the server used the built-in Intel PRO/1000 GbE controllers. Released versions of network drivers were used, and Ethernet jumbo frames were configured to better support this bulk streaming scenario. Window Server 2008’s new TCP/IP receive window autotuning was set to “restricted”. The IntPolicy tool was used to ensure the ES7000 server NICs’s interrupts & DPCs occurred on a CPU affinitized to the same NUMA node as the NIC.

A complete list of settings and optimizations will be included in the paper when it is released.