The “Balanced Data Distributor” for SSIS


The “Balanced Data Distributor” for SSIS
Len Wyatt, SQL Server Performance Team

There is a new transform component available for SQL Server Integration Services.  It’s called the Balanced Data Distributor (BDD) and the download is available here.  The BDD provides an easy way to amp up your usage of multi-processor and multi-core servers by introducing parallelism in the data flow of an SSIS package.

Functionality of the BDD

The functionality of the BDD is very simple:  It takes its input data and routes it in equal proportions to its outputs, however many there are.  If you have four outputs, roughly ¼ of the input rows will go to each output.  Instead of routing individual rows, the BDD operates on buffers of data, so it’s very efficient.

Some of you will already be noticing that there is no transformational value in the BDD, and no control over which data rows go to which output.  You may be wondering, what the heck is the value of that?

The value of the BDD comes from the way modern servers work:  Parallelism.  When there are independent segments of an SSIS data flow, SSIS can distribute the work over multiple threads.  BDD provides an easy way to create independent segments. 

This diagram gives a trivial example:

 

If you run this data flow on a laptop, there probably won’t be any speed advantage, and there may even be a speed cost.  But suppose you run this on a server with multiple cores and many disk spindles supporting the destination database.  Then there might be a substantial speed advantage to using this data flow.

When to use the BDD

Using the BDD requires an understanding of the hardware you will be running on, the performance of your data flow and the nature of the data involved.   Therefore it won’t be for everyone, but for those who are willing to think through these things there can be significant benefits.  Here is my summary description of when to use BDD: 

  1. There is a large amount of data coming in.
  2. The data can be read faster than the rest of the data flow can process it, either because there is significant transformation work to do or because the destination is the bottleneck.  If the destination is the bottleneck, it must be parallelizable.
  3. There is no ordering dependency in the data rows.  For example if the data needs to stay sorted, don’t go and split it up using BDD.
  4. 

Relieving bottlenecks in the SSIS Data Flow

Let’s talk about bottlenecks, since changing bottlenecks is what BDD is all about.  A bottleneck is whatever limits the performance of the system.  In general there are three places that could be the bottleneck in an SSIS data flow:  The source, the transformations, or the destination.

Bottlenecks in the Source

If the limiting factor is the rate at which data can be read from the source, then the BDD is not going to help.  It would be better to look for ways to parallelize right from the source.

Bottlenecks in the Transformations

If the limiting factor is the transformation work being done in the data flow, BDD can help.  Imagine that there are some lookups, derived columns, fuzzy lookups and so on:  These could easily be the components limiting performance.  Make two or four or eight copies of the transformations, and split the data over them using the BDD.  Let the processing run in parallel.  If there are several transformations in the data flow, put as much as you can after the BDD, to get more things running in parallel.

Bottlenecks in the Destination

If the limiting factor is the destination, BDD might be able to help – you need to determine whether the destination can be run in parallel.  You might be surprised at some times when it can.  One example is when loading data into a simple heap (table with no indexes) in SQL Server.  With the database properly distributed over a number of disks, it is quite possible to load in parallel with good performance.  When working on the ETL World Record a while ago, we used a heap for a side experiment and found that loading 56 streams concurrently into a single heap was almost as fast as loading 56 streams into 56 independent tables.  Many sites already drop or disable their indexes during data loading, so this could be more of a freebie than you would expect.  More recently we saw a benefit from parallel loading into SQL Server Parallel Data Warehouse (PDW).  PDW is an architecture designed for parallelism!

When the destination does not support parallel loading

A final case to consider is when the limiting factor is the transformation work being done in the data flow but the destination cannot receive data in parallel for some reason.  In this case, consider using BDD to parallelize the transforms followed by a Union All to rejoin the data into a single flow; then a single destination can be used.  Here is an illustration:

 

Best practice – balanced!

One final note:  Whatever you put behind the BDD, be sure the same work is being done on all paths.  It doesn’t make logical sense to have the paths be different, and from a performance point of view, you want them all to be the same speed.  Remember, the “B” in BDD stands for “Balanced”.

Conclusion

Someday maybe SSIS will be able to do the work of the BDD automatically, but for now you have an easy way to amp up your usage of multi-processor and multi-core servers by introducing parallelism in the data flow of an SSIS package.

Comments (17)

  1. Is this balace data Distributor work same for all type od Data source or is there any performance difference with different Data Source

  2. When i am clicking on download of balace data distributer link it just download .Doc file where i can get actuall file will pleases explain me same

  3. Len Wyatt says:

    I agree that the way to do this is not totally obvious, but there are instructions on the download page.  It says:

    • Please download the License on this page and accept the terms.

    • Once you accept the License terms, a download link will display giving you the option to download the X86 or X64 version of the software

    The Accept button is at the bottom of the License agreement.

  4. Carla Sabotta says:

    To view a demo of the Balanced Data Distributor, go to technet.microsoft.com/…/hh369962 .

  5. Michael Brönnimann says:

    Parallelization for queries (SELECT) is covered quite well by the SQL engine itself, but when it comes to large volume data modifications (UPDATE, INSERT, DELETE), the standard engine does parallelize towards best use of all available resources (disk, multiple cpu-cores, etc.).

    Next to SSIS, there exists also a lightweight more generic and engine focused approach 😉

    The concept of SQL Parallel Boost is a native T-SQL based solution an allows a performance gain up to factor 10 (!) in a 8-16 CPU core environment.

    The purely SQL engine related parallelisation solution takes advantage of minimized complexity and has no 'external' components like SSIS involved, Furthermore it's the best performing solution regarding task splitting and synchronization, as it hasn't potential connection and communication overhead. The solutioncan also be used to execute multiple SQL statements in parallel initiated thru the same connection.

    The free 'Community Edition' can be downloaded at sqlparallelboost.codeplex.com

  6. Hady Ziade says:

    what would be the main advantage of having a solution using BDD or running many packages in parallel by splitting the data in the query .

    what is obvious is : instead of having many sources it will be one query returning all the data … what else ?

  7. Victor Isakov says:

    What's happening with the BDD and SQL Server 2012?

    At present I cannot get it to work with RC0. Have not tried on subsequent builds.

  8. Ke Yang says:

    We're actively working on making BDD work in SQL Server 2012.

  9. Hussein Yousef says:

    Thanks, great explanation…

  10. Theron Knapp says:

    I was looking for the 2012 version and since there isn't a link on this page already, here it is:

    http://www.microsoft.com/…/details.aspx

  11. Tibco Online Training says:

    It is so nice article. I was really satisfied by seeing this article and we are

    also giving Tibco Online Training. Tibco <a href="http://www.tibco-online-training.com">Tibco online training</a>is best online training institute in USA.

  12. Theron Knapp says:

    And the 2014 version is part of the SQL Server 2014 Feature Pack: http://www.microsoft.com/…/details.aspx

  13. L.McArthur says:

    Hi there. My name is Lee McArthur – I am an account manager for a re-insurance company here in Bermuda that is doing some amazing things with SQL 2014 and Fusion IO.

    Fusion IO – going great. However, its being help up by SQL 2014 barfing / rejecting / blocking connection requests far before it should. Can someone help and hop on a quick call to discuss please? Client more than happy to pay as well. We are all Microsoft partners, customers, etc.

    Lee McArthur

    441-278-5239

    Thanks in advance!

  14. Scott says:

    Is there a performance benefit to using this instead of a conditional split? I would normally take an ID field and split the data based on ID % 3=0, ID %3=1, ID % 3=2 and so on.