Why Extended Procedures Have Inherent Performance Issues

To follow up on the recent discussions about regular expressions in T-SQL, I thought it would make sense today to talk about why extended procedures (xprocs) have inherent performance issues.  While they provide great extensibility to SQL Server, they have intrinsic performance and scalability implications that you should be aware of if you use them in SQL Server applications.  Moreover, most of what limits them in terms of performance applies equally well to other types of extensibility such as COM objects.  I’ll explain why in a moment.

To understand why xprocs have inherent scalability issues, let’s walk through what happens when a T-SQL batch or Remote Procedure Call (RPC) makes an xproc call.  Xprocs are always called via T-SQL batches or RPCs.  Even if the only thing you call from your client application is the xproc itself, it is still submitted to SQL Server as either a T-SQL batch or RPC, depending on how you call it. 
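
For example, suppose the only thing your application sends is a one-line call to an xproc (I’m using the stock xp_cmdshell here purely for illustration).  Submitted as ad hoc text, it arrives at the server as a T-SQL language batch; submitted as a parameterized procedure call through your client API (e.g., ODBC’s {call} syntax or ADO’s adCmdStoredProc), it arrives as an RPC:

-- A client request consisting of nothing but an xproc call.
-- Whether this arrives as a language batch or an RPC depends on how the
-- client submits it, but either way the server has to schedule it.
EXEC master.dbo.xp_cmdshell 'dir *.log';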

Once the client request is received by SQL Server, it is processed by the Open Data Services (ODS) layer and passed on to the database engine for execution.  This results in a work request being processed by SQL Server’s User Mode Scheduler (UMS) component.  UMS, as I’ve covered in my books and previous blogs, is SQL Server’s work scheduling system.  All work performed by the database engine is scheduled via UMS.

Because UMS implements a cooperative multitasking system, scheduling code that is external to the database engine (e.g., xprocs and COM objects) to run within the SQL Server process presents special challenges.  Chief among these is that external code, particularly code written by end users, cannot be depended upon to yield often enough to keep from disrupting the other work being scheduled by UMS.  For a cooperative multitasking environment to operate properly, each participant must avoid monopolizing processing time and yield often enough to other participants such that the environment runs smoothly and each participant’s work is carried out in a timely fashion.  This is a major design tenet of the database engine itself, but is not something that is enforceable (or even technically possible) for external code running within the server process such as xprocs.

To be more specific, there are special yield function calls that code running inside the database engine calls on a regular basis to yield to other tasks needing to run.  These are not even available to xprocs.  Even if they were available, the database engine could not count on external code to call them and, thus, the mere execution of an xproc could destabilize the database engine from a work scheduling standpoint.  One user could execute an xproc and keep another user’s work from being carried out.

To address this, UMS runs xprocs and other external code in “preemptive mode.”  When the server is not in lightweight pooling mode, this means the UMS worker (the thread) on which the xproc is to run is temporarily disassociated from UMS and allowed to run like any other Windows thread.  UMS fires up another worker to handle the work of the scheduler object from which the xproc’s thread was taken.  The two threads – the original one that has now been set aside for the xproc to use and the new one activated by UMS – then compete for scheduling time from the operating system.  Windows ultimately decides who runs and when they run – the type of “one-size-fits-all” situation that UMS was created to avoid.

If the server is running in lightweight pooling mode, the situation is even worse.  Because the UMS worker is a fiber when lightweight pooling mode is enabled, UMS cannot simply turn over its host thread to Windows for scheduling.  This would have the undesirable effect of taking all of the other fibers hosted by that thread along with it.  Instead, the xproc is moved to a separate UMS scheduler object whose whole purpose is to run external code such as xprocs for other schedulers.  This scheduler object is thread- rather than fiber-based, so the xproc execution can then be turned over to Windows just as it would have been in thread mode.
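
In case you’re wondering, fiber mode is controlled through the lightweight pooling configuration option.  Here’s a quick sketch of how it’s toggled (it’s an advanced option, and the change doesn’t take effect until the server is restarted):

-- 'lightweight pooling' is an advanced option, so expose advanced options first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- 1 = fiber mode (lightweight pooling), 0 = thread mode (the default);
-- the new setting takes effect the next time the server starts
EXEC sp_configure 'lightweight pooling', 1;
RECONFIGURE;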

So, right off the bat, you have context switching, kernel mode switches, and additional thread resources being used in order to run an xproc, regardless of what it actually does.  If you’re running in lightweight pooling mode, the inherent xproc execution overhead is even more significant.

If the xproc makes any ODS calls – which is very common (e.g., to send a message to the client via srv_sendmsg() or to send a result row via srv_sendrow()) – it must be rescheduled by UMS.  In other words, it must be brought back into the UMS fold and scheduled like any other work.  This requires more context switching and switches into kernel mode since work is being moved between threads.  Moreover, when execution returns from ODS to the xproc, the thread it was running on may have changed.  Particularly when running in lightweight pooling mode, there is no guarantee that the switch back to the xproc scheduler after an ODS call will result in the xproc running on the same thread it ran on previously.  Beyond mere performance issues, this can actually break the xproc if it was not coded with this in mind.  It’s not that uncommon to see xprocs that run fine in thread mode behave erratically when fiber mode is turned on – yet another reason not to use fiber mode except when recommended by Microsoft.

If an xproc interleaves ODS calls such as srv_sendmsg() or srv_sendrow() with its own code – something that’s also very common – this results in a ping-pong-like effect for the work request: it is continually scheduled onto and off of a regular UMS scheduler object.  It repeatedly bounces between being scheduled cooperatively by UMS and preemptively by Windows until this interleaving of ODS calls and xproc code ends.  Obviously, this can be devastating to performance.

Note that all of this is true regardless of what the xproc actually does.  All I’m talking about here is the overhead associated with calling xprocs in the first place.

And this brings me to my second point.  COM objects executed within the server process have the same issue.  Why?  Because they’re initially created and invoked via xprocs.  The sp_OA xprocs are implemented in odsole70.dll and are executed just like any other xproc.  Every COM object you instantiate, every method you call, and every property you set results in UMS scheduling the work to run preemptively because it occurs via an xproc call.  This means that every COM object call you make carries at least as much overhead as an empty xproc, and, given the same amount of work, a COM object call cannot help but cost at least as much as a comparable xproc because it is accessed via xprocs.  Add to this interaction with the object via IDispatch and the other facilities the server provides for managing and interacting with COM (e.g., tracking the objects you create and automatically destroying them when your batch ends), and you can easily see why COM object interaction has the same performance implications that xproc execution does.  The same is true for any external code that runs within the server.  Because the engine cannot count on the code yielding often enough to allow the system to run smoothly, it must run external code in preemptive mode.
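
To make this concrete, here’s the classic COM-based approach to the regex problem that prompted this discussion.  The ProgID and pattern are just for illustration, and I’ve omitted error checking for brevity, but notice that every one of these sp_OA calls is itself an xproc call into odsole70.dll, so each one incurs the preemptive scheduling overhead described above before the COM object does any actual work:

DECLARE @hr int, @obj int, @match int;

-- Each sp_OA call below is an xproc call and is scheduled preemptively,
-- exactly like any other xproc.
EXEC @hr = sp_OACreate 'VBScript.RegExp', @obj OUT;
EXEC @hr = sp_OASetProperty @obj, 'Pattern', '^[0-9]+$';
EXEC @hr = sp_OAMethod @obj, 'Test', @match OUT, '12345';
EXEC @hr = sp_OADestroy @obj;

-- @match is nonzero if the input matched the pattern
SELECT @match AS IsAllDigits;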

The moral of the story?  Running external code within the server isn’t cheap.  It’s nearly always better to use built-in facilities such as T-SQL commands, functions, and expressions when they meet your needs.  Xprocs have innate performance issues and should be avoided when scalability is critical and you can meet your needs through built-in mechanisms.  Also, don’t assume that you can move the functionality in an xproc to a COM object to get around this.  COM object interaction from T-SQL will have at least as much performance overhead as an xproc that carries out the same work.
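
To put that last bit into practice: when a built-in mechanism can do the job, use it.  For the simple pattern test above, for example, plain T-SQL does the work entirely inside the engine, with no xproc call, no preemptive scheduling, and no COM object to create and destroy:

-- The same "is it all digits?" test using nothing but built-in T-SQL.
DECLARE @input varchar(50);
SET @input = '12345';

SELECT CASE
         WHEN LEN(@input) > 0 AND @input NOT LIKE '%[^0-9]%' THEN 1
         ELSE 0
       END AS IsAllDigits;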