By Wayne Clark
What is Sustainable?
Of primary concern when planning, designing, and testing business solutions built on BizTalk Server 2004 SP1 is that the solutions must be able to handle the expected load and meet required service levels over an indefinite period of time. Given the number of solution architectures, configurations, and topologies possible on BizTalk Server 2004 SP1, there are many things to consider when evaluating a proposed or existing deployment. The purpose of this, our inaugural BizTalk Performance blog posting, is to provide guidance on:
- Understanding BizTalk Server 2004 SP1 throughput and backlog capacity behavior and how to observe the behavior of your system.
- Critical success factors when planning for capacity.
Let me start off by defining some terms and concepts:
Maximum sustainable throughput is the highest load of message traffic that a system can handle indefinitely in production. Typically this is measured and represented as messages processed per unit time. Solution design choices such as the choice of adapters, pipeline components, orchestrations, and maps all have a direct effect on system performance. In addition, BizTalk offers scale-up and scale-out options that provide flexibility when sizing a system. Often overlooked, however, are things like standard operations, monitoring, and maintenance activities that have an indirect effect on sustainable throughput. For example, performing queries against the messagebox database (e.g., from HAT) requires cycles from SQL Server and affects overall throughput, depending on the type and frequency of the query. Backup, archiving, and purging activities on the database also have an indirect effect on throughput, and so on.
Engine Capacity, also known as backlog capacity, is the number of messages that have been received into the messagebox, but have not yet been processed and removed from the messagebox. This is easily measured as the number of rows in the messagebox database table named spool.
BizTalk Server 2004 SP1 Backlog Behavior
BizTalk 2004 SP1 implements a variety of store-and-forward capabilities for messaging and for long- and short-running orchestrations. The messagebox database, implemented on SQL Server, provides the storage for in-flight messages and orchestrations. As messages are received by BizTalk 2004 SP1, they are en-queued, or published, into the messagebox database so they can be picked up by subscribers to be processed. Subscribers include send ports and orchestrations. Some of the arriving messages activate new subscriber instances. Other messages arrive and are routed, via a correlation subscription, to a waiting instance of an already running subscriber such as a correlated orchestration.
In order for correlated orchestrations to continue processing, arriving correlated messages must not be blocked. To facilitate this, BizTalk does its best to make sure messages (both activating and correlated) continue to be received, even under high load, so that subscribers waiting for correlated messages can finish and make room for more processes to run. This means it is possible to receive messages faster than they can be processed and removed from the messagebox, thus building up a backlog of in-process messages. Being a store-and-forward technology, it is only natural for BizTalk to provide this type of buffering.
Every message that is received by, or created within, BizTalk 2004 SP1 is immutable. That is, once it has been received or created, its content cannot be changed. In addition, received messages may have multiple subscribers. Each subscriber of a particular message references the same, single copy of that message. While this approach minimizes storage, a ref count must be kept for each message, and garbage collection must be performed periodically to get rid of messages that have a ref count of 0. A set of SQL Agent jobs in BizTalk 2004 SP1 performs garbage collection for messages:
- MessageBox_Message_Cleanup_BizTalkMsgBoxDb – Removes all messages that are no longer being referenced by any subscribers.
- MessageBox_Parts_Cleanup_BizTalkMsgBoxDb – Removes all message parts that are no longer being referenced by any messages. All messages are made up of one or more message parts, which contain the actual message data.
- PurgeSubscriptionsJob_BizTalkMsgBoxDb – Removes unused subscription predicates left over from things like correlation subscriptions.
- MessageBox_DeadProcesses_Cleanup_BizTalkMsgBoxDb – Called when BizTalk detects that a BTS server has crashed and releases the work that that server was working on so another machine can pick that work up.
- TrackedMessages_Copy_BizTalkMsgBoxDb – Copies tracked message bodies from the engine spool tables into the tracking spool tables in the messagebox database.
- TrackingSpool_Cleanup_BizTalkMsgBoxDb – Flips which table TrackedMessages_Copy_BizTalkMsgBoxDb job writes to.
The first two from the above list are the ones responsible for keeping the messagebox cleared of garbage messages on a regular basis. To do their work, they sort through the messages and message parts, looking for messages with a ref count of 0 and for parts that are not referenced by any messages, respectively, and remove them.
So, what does all this have to do with throughput and capacity?
When a system is at steady state, that is, processing and collecting garbage as fast as messages are received, this is clearly indefinitely sustainable. However, if for some length of time the system is receiving faster than it can process and remove, messages start to build up in the messagebox. As the amount of this backlog builds up, the amount of work that the cleanup jobs have to do increases and they typically start taking longer and longer to complete. In addition, the cleanup jobs are configured to be low priority in the event of a deadlock. As a result, when running under high load, the cleanup jobs may start to fail as the result of being the victim of deadlocks. This allows the messages being en-queued to have precedence and not be blocked.
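The balance described above can be sketched with a few lines of arithmetic. The following Python model is purely illustrative (the rates and durations are made up for the example, not measured values):

```python
# Minimal model of messagebox backlog growth: the spool builds up whenever
# the receive rate exceeds the combined processing/garbage-collection rate.
# All rates here are illustrative, not measured values.

def backlog_after(receive_rate, drain_rate, seconds, start_backlog=0):
    """Approximate spool depth after `seconds` at the given per-second rates."""
    growth = receive_rate - drain_rate      # net msgs/sec added to the spool
    return max(start_backlog + growth * seconds, 0)

# Steady state: drain keeps up, so the spool stays flat.
print(backlog_after(150, 150, 3600))        # 0
# Overdriven by 25 msgs/sec for an hour: a 90,000-message backlog.
print(backlog_after(175, 150, 3600))        # 90000
```

The point of the model is simply that sustainability is a rate balance: any sustained surplus on the receive side, however small, accumulates in the spool.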
As an example, let’s take a look at a system that we have driven at various throughput levels and investigate the observed behavior. The system is configured as follows:
- Two BizTalk Servers – These servers are HP DL380 G3, equipped with dual 3GHz processors with 2GB of RAM. BizTalk 2004 SP1 is running on these two servers. Local disks.
- One SQL Server Messagebox – This server is an HP DL580 G2, equipped with quadruple 1.6GHz processors with 4GB of RAM. This server is connected to a fast SAN disk subsystem via fiber. The server is dedicated to the messagebox database and the data and transaction log files for the messagebox database are on separate SAN LUNs.
- One SQL Server All Other Databases – This server is an HP DL580 G2, equipped with quadruple 1.6GHz processors with 2GB of RAM. This server is also connected to the SAN. This server houses all BizTalk databases other than the messagebox, including the management, SSO, DTA, and BAM databases.
- Load Driver Server – This server is an HP DL380 G3, equipped with dual 3GHz processors with 2GB of RAM. This server was used to generate the load for testing the system using an internally developed load generation tool. This tool was used to send copies of a designated file to shares on the BizTalk servers to be picked up by the file adapter.
The Test Scenario
The test scenario is very simple. The load generation tool distributes copies of the input file instance evenly across shares on both BizTalk servers. Using the file adapter (we’ll explore other adapters in subsequent blog entries), files are picked up from the local share and en-queued into the messagebox. A simple orchestration containing one receive and one send shape subscribes to each received message. Messages sent back into the messagebox by the orchestration are picked up by a file send port and sent to a common share, defined on the SAN. Files arriving on the output SAN share are immediately deleted in order to avoid file buildup on that share during long test runs.
There are 4 hosts defined for the scenario: one to host the receive location, one to host orchestrations, one to host the send port, and one to host tracking. For the purposes of observing engine backlog behavior, tracking is completely turned off during the test runs. Turning tracking off involves more than just stopping (or not creating) a tracking host instance. To turn tracking completely off, use the WMI property MSBTS_GroupSetting.GlobalTrackingOption. For more information on turning tracking on and off using this property, please see: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sdk/htm/ebiz_sdk_wmi_msbts_groupsetting_fqxz.asp.
Both BizTalk servers are identical in that they each have instances of the receive host, orchestration host, and send host. No instances of the tracking host were created since tracking was turned off to isolate core messagebox behavior for these tests.
A simple schema was used and the instance files used for the test were all 8KB in size. No mapping or pipeline components were used inbound or outbound in order to keep the test scenario simple to implement and keep the behavior observations focused on the messagebox.
As a first step, the system is driven at a level near, but below, its maximum sustainable throughput so that observations of a healthy system can be made. The growth rate of the messagebox backlog is a key indicator of sustainability. Clearly, the messagebox cannot continue to grow indefinitely without eventually running into problems. So, the depth of the messagebox database backlog, monitored over time, is used to evaluate sustainability. The messagebox table named spool contains a record for each message in the system (active or waiting to be garbage collected). Monitoring the number of rows in this table, and the number of messages received per second, while increasing system load provides an easy way to find the maximum sustainable throughput. Simply increase the input load until either (a) the spool table starts to grow indefinitely or (b) the number of messages received per second plateaus, whichever comes first, and that is your maximum sustainable throughput. Note that if you are not able to drive a high enough load to cause the spool table to grow indefinitely, it simply means that the slowest part of your system is on the receive side, rather than the processing/send side. The following graph shows key indicators after using this approach to find the maximum sustainable throughput of our test system (described above).
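The ramp procedure just described can be sketched as a simple decision rule. In this hypothetical Python sketch, each load step records the offered load, the observed receive rate, and whether the spool depth kept growing; the 95% plateau threshold and the sample data are illustrative assumptions, not product behavior:

```python
# Sketch of the ramp-up rule for finding maximum sustainable throughput (MST):
# raise the input load step by step and stop when either (a) the spool table
# keeps growing, or (b) the observed receive rate plateaus below the offered
# load. The 0.95 plateau threshold is an illustrative assumption.

def max_sustainable(steps):
    """steps: list of (offered_load, observed_rate, spool_growing) tuples in
    increasing load order. Returns the highest sustainable observed rate."""
    mst = 0.0
    for offered, observed, spool_growing in steps:
        if spool_growing:                 # condition (a): backlog building up
            break
        if observed < offered * 0.95:     # condition (b): receive rate plateaus
            mst = max(mst, observed)
            break
        mst = observed
    return mst

steps = [(100, 100, False), (125, 125, False), (150, 150, False),
         (175, 152, True)]                # last step overdrives the system
print(max_sustainable(steps))             # 150
```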
The blue line shows the total messages received per second by the system (i.e., for both BizTalk servers), the pink line shows the running average of the messages received per second, and the yellow line shows the spool table depth (x 0.01) over the test duration shown on the X axis. What this graph shows is that, for the hour of the test, the spool was stable and not growing, making the sustainable throughput equal to the number of messages received per second, in this case 150 msgs/sec.
Part of any analysis of a BizTalk deployment performance should include checking some key indicators to understand resource bottlenecks. The key measures and their values used for this deployment running under maximum sustainable throughput (i.e., the test in the graph above) were as follows:
BTS Servers (each): Avg CPU Utilization = 59%
MsgBox DB Server: Avg CPU Utilization = 54%
Mngmt DB Server: Avg CPU Utilization = 13%
Physical Disk Idle Time:
MsgBox DB Server, Data File Disk: Avg Disk Idle Time = 69%
MsgBox DB Server, Transaction Log File Disk: Avg Disk Idle Time = 98%
MsgBox DB Server: Avg Total Lock Timeouts/Sec = 1072
MsgBox DB Server: Avg Total Lock Wait Time (ms) = 40
MessageBox_Message_Cleanup_BizTalkMsgBoxDb: Typical Runtime = 30 Sec
MessageBox_Parts_Cleanup_BizTalkMsgBoxDb: Typical Runtime = 15 Sec
No errors in any of the server application event logs.
From these data, we can draw the following conclusions: There are no obvious resource bottlenecks in our system. All of these indicators are well within healthy limits. CPU and Disk Idle Times show that there is plenty of headroom and they are not even close to being pegged. The SQL lock indicators look good. Lock Timeouts/sec doesn’t start to become an issue until around 5000 or so (depending on your SQL Server) and Lock Wait times under .5 – 1 second are also healthy. Finally, the cleanup jobs are completing successfully every time and are taking 30 seconds or less to run. If the cleanup jobs start failing often, or start taking over a minute, this is an indication that the system is being over-driven and will typically be accompanied by an increasing spool depth.
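The rule-of-thumb limits quoted above can be collected into a quick health check. This Python sketch simply encodes the thresholds from the text (around 5000 lock timeouts/sec, roughly 1 second of lock wait time, and 60 seconds of cleanup-job runtime); the thresholds are guidance, not hard product limits:

```python
# Quick health check encoding the rule-of-thumb limits discussed above:
# ~5000 lock timeouts/sec, ~1000 ms lock wait time, and 60 sec cleanup-job
# runtime. These are guidance values from the text, not hard product limits.

def health_report(lock_timeouts_per_sec, lock_wait_ms, cleanup_runtime_sec):
    issues = []
    if lock_timeouts_per_sec >= 5000:
        issues.append("lock timeouts/sec high")
    if lock_wait_ms >= 1000:
        issues.append("lock wait time high")
    if cleanup_runtime_sec > 60:
        issues.append("cleanup jobs running long: system may be over-driven")
    return issues or ["healthy"]

# The values measured for the sample deployment above pass the check.
print(health_report(1072, 40, 30))    # ['healthy']
```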
TIP: You can expose the number of rows in the spool table by using the user defined counter capability provided by SQL Server. Create a stored procedure (in your own database) as follows:
CREATE PROCEDURE [dbo].[SetSpoolCounter] AS
SET NOCOUNT ON
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
SET DEADLOCK_PRIORITY LOW
declare @RowCnt int
select @RowCnt = count(*) from BizTalkMsgBoxDB..spool with (NOLOCK)
execute sp_user_counter1 @RowCnt
By running this stored procedure periodically (e.g., once per minute) as a SQL Agent job, you can add the depth of the spool table to your counters in Performance Monitor. For more information, search on sp_user_counter1 in SQL Server Books Online.
For additional useful messagebox queries, check out Lee Graber’s paper on advanced messagebox queries: http://home.comcast.net/~sdwoodgate/BizTalkServer2004AdvancedMessageBoxQueries.doc.
Now that we have shown how to find the maximum sustainable throughput and seen what the key indicators look like for a sustainable, healthy system, let’s explore some behavior associated with a system that is receiving faster than it is processing and collecting garbage.
To simulate a continuously overdriven system, we configured the load generation tool to send in about 175 msgs/sec, 25 msgs/sec more than our measured maximum sustainable throughput. The test was designed not only to overdrive the system, but to get an idea of how long it would take to recover from a spool backlog depth of around 2 million messages. To accomplish this, we drove the system at the increased rate until the spool depth was around 2 million and then stopped submitting messages altogether. The following graph shows the same indicators as in the sustainable graph above.
As can be seen from the graph, the spool depth started building up immediately, peaking at just above 2 million records. At this rate, it took just under 3 hours to get to the targeted 2 million record backlog. After the load was stopped, it took around 4.5 hours for the cleanup jobs to recover from the backlog.
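From these two figures we can derive a back-of-the-envelope drain rate and reuse it to estimate recovery times. The Python sketch below is only rough arithmetic on the numbers reported above; it assumes the net drain rate stays roughly constant, which real cleanup behavior will not do exactly:

```python
# Back-of-the-envelope recovery arithmetic for the overdrive test above:
# roughly 2 million spool rows were drained in about 4.5 hours once the
# load stopped, implying an average net drain rate we can reuse.

backlog_rows = 2_000_000
recovery_hours = 4.5
drain_rate = backlog_rows / (recovery_hours * 3600)   # rows/sec
print(round(drain_rate))                              # 123

def recovery_time_hours(backlog, rows_per_sec):
    """Hours to drain `backlog` rows at a constant net drain rate (any
    incoming load already subtracted). A rough estimate only."""
    return backlog / rows_per_sec / 3600

# At that rate, a 1-million-row backlog would take roughly 2.25 hours.
print(round(recovery_time_hours(1_000_000, drain_rate), 2))
```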
Note that, even though the receive rate started out at 175 msgs/sec, it didn’t take long for it to degrade to an average less than our maximum sustainable throughput. This is primarily due to the throttling that BizTalk provides and to increased lock contention on the messagebox. During the overdrive period of the test, BizTalk throttled the receiving of messages (by blocking the thread the adapters submit messages on) based on the number of database sessions open between a host instance and the messagebox database. This throttling is indicated by messages in the application event log that indicate when BizTalk starts and stops throttling.
Taking a look at our other key indicators during this test, we see some interesting trends. Consider the following graph showing the physical disk idle time for the messagebox data file disk, average CPU utilization (%) for the messagebox server, and average lock timeouts per second on the messagebox database ( x 0.01).
Comparing this graph to the one above it, we can see that, while the system is being overdriven and the spool is building up, the disk gets more and more saturated (i.e., disk idle time is trending down). Also notice that, once the cleanup jobs are given free rein after the load is stopped, disk idle time drops to near zero. If it weren’t for the fact that the cleanup jobs are configured for low deadlock priority, they would take much more of the disk I/O bandwidth even earlier in the cycle. Instead, what we see from the job histories is that they are failing nearly every time they are executed because of lock contention while the load is still underway (as indicated by the avg lock timeouts/sec). Once the lock contention is reduced (at the point the load is stopped), the cleanup jobs are able to succeed and begin removing messages from the spool. It’s interesting to note that the message cleanup job ran only twice after the load was stopped, but ran for hours each time in order to collect all the unreferenced messages.
Not shown in the above graph, the lock wait times were also quite high, averaging 7 seconds during the load period, and then dropping to normal sub-100ms levels during the recovery period.
QUESTION: “But what if I only have one or two ‘floodgate’ events per day? Do I really have to build a system that will handle these peaks with no backlog, only for it to sit idle the rest of the time?”
ANSWER: Of course not. As long as the system can recover from the backlog before the next floodgate event, you will be fine.
There are a number of scenarios where there are just a few large peaks (a.k.a., “floodgate events”) of messages that arrive at the system each day. Between these peaks, the throughput can be quite low. Examples of these types of scenarios include equity trading (e.g., market open and market close) and banking systems (e.g., end of day transaction reconciliation).
Other types of events cause backlog behavior similar to floodgate events. For example, if a partner send address goes offline so that messages destined for that address must be retried and/or suspended, this can result in backlog building up. When the partner comes back online, there may be a large number of suspended messages that need to be resumed, resulting in another type of floodgate event.
To illustrate how this works, consider a third test of our system. We drove the system at around half the maximum sustainable throughput. This was, of course, very stable. Then, to simulate a floodgate event, we dropped 50,000 additional messages all at once (as fast as we could generate them) and monitored the system. The graph below provides our now familiar indicators of messages received per second and spool depth.
Note from the graph that the spool indeed built up a backlog during the floodgate event. However, because the event was relatively short lived and the subsequent receive rate after the event was below the maximum sustainable rate, the cleanup jobs were able to run and recover from the event without requiring a system receive “outage”.
Of course, every system is different, so “your mileage will vary”. The best way to verify that you can recover is to test with a representative load before going into production.
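The sizing question from the Q&A above reduces to a simple headroom check: a burst backlog is acceptable as long as it drains before the next burst arrives. This Python sketch uses illustrative numbers loosely based on the floodgate test (a 50,000-message burst against a 150 msgs/sec maximum sustainable throughput); your own rates will differ:

```python
# Floodgate sizing check: a burst is survivable if the post-burst headroom
# (MST minus the baseline receive rate) drains the burst backlog before the
# next burst arrives. All numbers here are illustrative.

def recovers_in_time(burst_msgs, baseline_rate, mst, hours_between_bursts):
    """True if the burst backlog drains before the next burst."""
    headroom = mst - baseline_rate        # msgs/sec available for draining
    if headroom <= 0:
        return False                      # baseline alone already overdrives
    return burst_msgs / headroom <= hours_between_bursts * 3600

# 50,000-message burst at a 75 msgs/sec baseline against a 150 msgs/sec MST:
# 75 msgs/sec of headroom clears the burst in roughly 11 minutes.
print(recovers_in_time(50_000, 75, 150, 24))   # True
```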
Findings and Recommendations
Know your load behavior profile: As our three examples above have shown, it is critical to know the profile of your load in terms of messages processed over time. The better this is understood, the more accurately you can test and adjust your system capacity. If all you know is your peak throughput requirement, then the most conservative approach would be to size your system so that your maximum sustainable throughput is the same or higher than your peak load. However, if you know that you have predictable peaks and valleys in your load, you can better optimize your system to recover between peaks, resulting in a smaller, less expensive overall deployment.
Test performance early: A common situation that we encounter at customers is that they have invested significant effort in designing and testing the functionality of their scenario, but have waited until the last minute to investigate its performance behavior on production hardware. Run performance tests on your system, simulating your load profile, as early as you possibly can in your development cycle. If you have to change anything in your design or architecture to achieve your goals, knowing this early will give you time to adjust and test again.
Emulate your expected load profile when testing performance: There are two primary components to this: 1) emulate the load profile over time and 2) run the test long enough to evaluate if it is sustainable. If, like many customers, your cycles are daily in nature, you should plan to run performance tests for at least one day to validate sustainability. The longer the tests, the better.
Emulate the production configuration: For example, the number and type of ports, the host and host instance configuration, database configuration, and adapter setup. Don’t assume that configuration changes will be neutral from a performance standpoint.
Use real messages: Message sizes and message complexity will have an effect on your performance, so be sure to test with the same message schemas and instances that you plan to see in production.
Emulate your normal operations during performance tests: Though the examples above did not include them, standard operations activities such as periodic database queries, backups, and purging will affect your sustainable throughput, so make sure you are performing these tasks during your performance and capacity test runs. This includes both DTA and BAM tracking, if you plan to use them in production.
The speed of the I/O subsystem for the messagebox is a key success factor: Remember that, for this scenario, we are using a fast SAN for the messagebox database files that is dedicated to this build-out. Even in this case, the cleanup jobs were able to drive the idle time to near zero for the SQL data file. The I/O subsystem is the most common bottleneck we have seen in customer engagements. A common strategy to optimize SQL I/O, for example, is to place the database data file(s) and log file(s) on separate physical drives, if possible.
Make sure the SQL Agent is running on all messagebox servers: The cleanup jobs will never run if the SQL Agent service is stopped, so verify that it is running on every messagebox server.
Spool depth and cleanup job run time are key indicators: Regardless of other indicators, these two measures will give you a quick and easy way to assess whether your system is being over-driven or not.
I would like to thank the following contributors to this blog entry:
Mitch Stein: Thanks for helping set up the test environment and generate the test data!
Binbin Hu, Hans-Peter Mayr, Mallikarjuna rao Nimmagadda, Lee Graber, Kevin Lam, Jonathan Wanagel, and Scott Woodgate: Thanks for reviewing and providing great feedback that improved the content!
Disclaimer: This posting is provided “AS IS” with no warranties, and confers no rights.