Microsoft SQL Server Escalation Services
If you’re familiar with Microsoft SQL Server Replication technology you already know each Replication Agent has various parameters or knobs you can set via Agent Profile or SQL Agent job parameter. This series of posts provide insight into tuning Replication performance using these knobs. Following the 80-20 rule, 80% will find this helpful while 20% will have unique situations, okay, perhaps more like 1% will be unique, but I’ve never heard it called the 99-1 rule.
While applying these guidelines, monitor your server performance before and after, then change and monitor again. Run Agent under “Verbose History” profile or Windows Performance Counters for Replication to better monitor performance.
In this posting I’ll talk about Log Reader. This SQL Replication Agent works to move data from a “publisher” to the “distributor” where the data is cached. The Log Reader is 2 components. First the LogRead.exe called from SQL Agent job, next internal SQL engine component reading the log and sending commands back the LogRead.exe which then stores those commands in the Distribution database MSrepl_transactions and MSrepl_commands as a cache for the Distribution Agent to read and apply changes to 1 or more Subscribers. This dual component of Log Reader comes into play as we discuss tuning knobs.
PacketSize – For this on today’s networks I’ve rarely seen a different in performance, I’d leave it alone. If you wanted to change, contact your networking team, adjust to match see if measurable difference.
PollingInterval – Only comes into play when LogReader goes idle after delivering transactions. It is the “I’m done with backlog; now how long do I wait before starting again”. If your Agent is always working, it’s always working, never goes to sleep, PollingInterval never comes into play. For system with short burst of activity, perhaps dropping to 2 seconds, but were talking 1% of users will ever need to change or see performance difference.
LogScanThreshold – Changes the “ping” frequency from SQL engine log reader components back to LogRead.exe SQL Agent job. No changes should be needed (more below).
QueryTimeOut – How long LogRead.exe waits with 0 data movement and 0 ping back from SQL Server engine. Used to monitor timeout of new data from SQL Server publisher and timeouts while writing to the Distribution database. Under normal data flow change should not be needed. If you hit a timeout, investigate, don’t just increase and wait longer without first investigating the “why” LogRead.exe had to wait. Most likely a blocking and\or storage performance problems preventing data movement.
LogScanThreshold and QueryTimeOut working together- When LogRead.exe connects to SQL Server engine to perform actual reading of the transaction log, the Log Reader agent starts a “conversation” with the SQL Server engine. This conversation moves data from SQL Server to LogRead.exe then on to Distributor database. If the SQL engine components are reading the transaction log while looking for next records to transmit, they “keep alive” this conversation with the LogRead.exe based on the LogScanThreshold. If SQL engine log reader is reading a large transaction, it keeps alive the conversation pinging the LogRead.exe preventing a false QueryTimeOut abort. Normally this conversation is very chatty pushing new data from 500-50,000 commands per second, however, if serious problem prevents SQL Server from moving data or pinging back LogRead.exe the conversation stops and LogRead.exe aborts with QueryTimeOut.
MaxCmdsInTran – Yes, I‘ve blogged about this before and yes, I would recommend first changing your application to perform data changes in smaller batches, however, if you need Log Reader to breakup large transactions into smaller batches, adjust this setting high enough to prevent flood of tiny transactions. For example, if you Log Reader is moving data around 5,000 commands per second and your batches are touching 5,000,000 rows try setting at 5,000 (the commands per second rate). At this setting the Log Reader will commit a batch of 5,000 commands every 1 second for next 20 minutes. If you didn’t break up the transaction, Log Reader will report “no response in the last 10 minutes” messages during the 20-minute large transaction transfer leading you to incorrectly believe the Log Reader is “hung” and restart Log Reader.
If you’re running fast drives perhaps your Log Reader is running at 20,000 or even 50,000 commands per second, then bump up the value even higher archiving about 1 batch per second.
ReadBatchSize and ReadBatchThreshold – Finally we’re getting to knobs you can tune. These setting are caps SQL engine log reader components used to signal its time to send data to the SQL Agent LogRead.exe. ReadBatchSize is transactions cap and ReadBatchThreshold is the commands cap. Following our 80-20 rule, the ratio is 10×1, that is 10 commands for every 1 transaction. ReadBatchSize default is 500 sending data through system every 500 transactions. By default, the Log Reader doesn’t monitor the commands count. For systems processing +5,000 transaction per second, the default batch size 500 is committing batch every 1/10th second. To reduce overhead associated with “commit” try increasing range 2,000-5,000 then monitor Log Reader performance to achieve balance. If you have a lot of data changes to non-replicated tables, also try increasing batch size to skips non-replicated batches. If you have frequent large transactions the setting will have little or no overall impact.
ReadBatchSize 1 – I could not talk about Log Reader without talking about incorrectly setting ReadBatchSize 1 in attempts to “speed-up” the LogReader. This often proceeded by large transaction or other activity which bloated the transaction log and “no progress for 10 minute” messages. Setting to 1 does provide “feedback” every 1 transaction but setting ReadBatchSize 1 often hurts overall performance more than it helps.
Let’s say an application calls a stored procedure sp_Update2Rows which does Begin Transaction, updates 1 row, performs some other work, updates another row. Two users call sp_Update2Rows while large index rebuild is happening. The transaction log now contains 2 “begin”, about 100gb of index modifications, then 2 “commit”.
User 1: sp_Update2Rows (begin tran\update 1)
User 2: sp_Update2Rows (begin tran\update 1)
100gb of non-replicated log records (index maintenance)
User 1: sp_Update2Rows (update 2\commit)
User 2: sp_Update2Rows (update 2\commit)
Log Reader starts, find User 1 and User 2 calls to sp_Update2Rows Begin Tran statements, starts reading transaction log for the Commit Trans (technically not exactly what happens, but still shows problem). Log Reader engine components keep pinging LogRead.exe at LogScanThreshold with status update. After 10 minutes the LogRead.exe reports “no response”, thinking Log Reader is “hung” the user manually aborts Log Reader. In truth, the Log Reader was just going through 100gb of non-replicated data trying to find the complete transactions.
Believing something is broken with Replication, the user restarts LogReader with Batch 1 but this time waits out the Log Reader. Again, Log Reader see User 1 sp_Update2Rows begin trans, reads 100gb of non-replicated data, finds the commit, sends down 1 transactions. Replication monitor shows 1 transaction delivered. LogReader recycles back through the log for the next transactions, locates call for User 2 sp_Update2Rows begin trans, again re-reads that 100gb of non-replicated data finds the commit, sends “1 transaction delivered”. The LogReader had to twice skip over the 100gb of non-replicate data contributing to replication latency.
Little Tweaks – As you can see I’ve only documented a few tweaks to the ReadBatchSize for system with fast storage moving data at 5,000-50,000 cmds per second, for most Replication topologies, leave setting as default.