ADO.NET new for 2.0, Clearing the pool on Server recycle


This blog talks about behavior in the SqlClient and the Oracle managed provider exclusively.

 

One of the biggest problems with pooling before ado.net 2.0 has to be the fact that we did not know how to handle the pool when the server went down. Let me describe just how ugly this behavior was using the following scenario:

 

PRE-ADO.NET 2.0

Open 3 connections, close 3 connections, we now have 3 connections in the pool. Now stop Server and start Server.

All 3 connections in the pool are no longer valid since they hold references to Server connections that no longer exist, the pool is corrupted.

So far this is by design pooling behavior, the ugly part comes next.

You open a connection, the pooler checks to see if we have connections waiting in the pool. We have 3 connections so it just hands you back one of these. No exception

You execute a command. Exception, when we tried to do a roundtrip to the Server it told us that it had no idea what the connection we were trying to use was.

The connection becomes closed and we throw it away.

You open a connection, the pooler checks to see if we have connections waiting in the pool. We have 2 connections so it just hands you back one of these. No exception.

You execute a command. Exception, when we tried to do a roundtrip to the Server it told us that it had no idea what the connection we were trying to use was.

The connection becomes closed and we throw it away.

You open a connection, the pooler checks to see if we have connections waiting in the pool. We have 1 connections so it just hands you back one of these. No exception

You execute a command. Exception, when we tried to do a roundtrip to the Server it told us that it had no idea what the connection we were trying to use was.

The connection becomes closed and we throw it away.

You open a connection, the pooler checks to see if we have connections waiting in the pool. We have no connections in the pool so we do a roundtrip to the server to get a new connection.

You execute a command. Everything works as expected.

 

 

As you can see we did a terrible job of handling something as simple as a server stopping and starting again. Developers that had applications that dealt with server reboots (like a failover cluster) definitely felt the pain. Some people tried to automate the process above to flush the pool and some people tried changing connection strings on exception (this creates a new pool), the result is quite ugly code and it is directly our fault.

 

In ADO.NET 2.0 we have improved this scenario (somewhat) by realizing that after certain types of exceptions our entire pool must be corrupted, so instead of throwing away an individual connection we will mark the entire pool as doomed and the next time that you attempt to open a connection you will always attempt to establish a fresh connection.

 

ADO.NET 2.0

Open 30 connections, close 30 connections, we now have 30 connections in the pool. Now stop Server and start Server.

All 30 connections in the pool are no longer valid since they hold references to Server connections that no longer exist, the pool is corrupted.

You open a connection, the pooler checks to see if we have connections waiting in the pool. We have 30 connections so it just hands you back one of these. No exception

You execute a command. Exception, when we tried to do a roundtrip to the Server it told us that it had no idea what the connection we were trying to use was.

The connection becomes closed and we realize that this exception means that the pool is corrupted, so we mark all the connections in the pool as doomed.

You open a connection, the pooler checks to see if we have connections waiting in the pool. All of the connections in the pool are marked as doomed so we do a roundtrip to the server to get a new connection.

You execute a command. Everything works as expected.

 

We still throw an exception in ado.net 2.0, but when we realize that the exception means that the pool is corrupted we walk the pool and mark each connection as doomed. A doomed connection is a connection that cannot be taken out of the pool, if the connection is currently out of the pool and in use marking it as doomed only means that the next time the connection comes into the pool it will not come out again. This is to protect against scenarios where I open 2 connections, start and stop the server and then try to execute on one connection. This will throw an exception and mark the pool as doomed. If I go to the second connection now and call Close() I will not get an exception and the connection will go back in the pool, but since it was marked as doomed when the exception happened it will not come out again.

 

Q: What do you mean by “realize that an exception means that the pool is corrupted”? Are you going to be throwing away pools that are perfectly fine just because I get an exception?

A: We determine that a pool is corrupted _only_ when we get a fatal exception from the network layer on a previously opened connection. Exceptions from the server, exceptions that happen during log in and of course client side exceptions are ignored.

 

Q: So when you get an exception from server X you will clear all the pools that connect to server X right?

A: No, we will only clear  the pool where this exception occurred.

 

Q: It is really bad that you still throw the first exception, couldn’t you have caught the exception, cleared the pool, opened a new connection and continued?

A: Yes we could have, this was too hard to get right at this time but I certainly hope it makes it into a future version.

 

Q: What if I know that the server has stopped and started again, is there anything I can do?

A: Yes, you can use the new static SqlConnection and OracleConnection methods ClearPool(connection) and ClearAllPools(). These methods expose the pool clear functionality described here. They will walk the pool (or all the pools) and mark all of the connections as doomed.

 

Q: So I can use ClearPool and ClearAllPools() to synchronously close all of my connections?

A: No, actually not at all. The only thing that these methods do is walk the pool and mark the connections as doomed. The only thing that marking a connection as doomed does is to stop it from being taken out of the pool. The connections are still cleaned in the background using the same 4-8 minute algorithm that we have for current idle connections in the pool.

 

Rambling out, The information posted here is provided “AS IS” and confers no rights.

Comments (14)

  1. lexp says:

    A: Yes we could have, this was too hard to get right at this time but I certainly hope it makes it into a future version.

    Future version means .NET 2.0 beta2 or we have to wait until Longhorn? Is it really too hard to implement such a behaviour?

  2. Great post! More! More! :-)

  3. Angel says:

    lexp,

    future version as in not in ado.net 2.0. Yes implementing this behavior is hard to get right. We need to keep track of too many things, it must be the first time you are executing on a connection, you could have started a local transaction, the connection could be in a distributed transaction. Then we need to catch the exception and evaluate it before you look at it, hide it from you(this is bad!), close the connection clear the pool open a new connection, if the server is still not up and we can’t open a connection you will now get a connection open exception during an execute (auch!)

    This has to be done at every possible place where we do a roundtrip and it makes our code much harder to maintain (opening us up for more bugs). I pushed hard for this feature but it is VERY hard to get right and any bug in this code can be disastrous.

    We felt that combining one exception with the ability to have no exceptions by using ClearPool was a clear improvement over our previous behavior. Quite frankly I am more looking forward to a syncrhonous version of ClearPool than for exception free connection retry logic.

    John,

    Thanks for the encouragement, very much appreciated!

  4. pat.piccolo@gmail.com says:

    When we got stomped by this over a year and a half ago I just added a custom .Open to my wrapper around the Oracle DAL. In the .Open I take the hit of doing a sure-win query… like Select SysDate From Dual; If the open succeeded but the query failed with a ORA-03114 I’d just close the connection, parse the connection string, and add POOLING=FALSE so that I wouldn’t get a dead connection, and do the open again. Sure this makes for somewhat inefficient use of the pool, but desparate times call for desparate measures.

    So what your saying now is that with all the great improvements in ADO.Net 2.0 I still have to take the hit, trap the exception close the connection, and try again, but this time I won’t have to forcibly bypass the pool. Wow… now THAT’s progress 😉

  5. Angel says:

    pat,

    "somewhat ineficient use of the pool" indeed… I feel your pain, the ora 03114 is a special case and not at all what I am discussing here. The problem with this particular exception is that after this exception instead of "The connection becomes closed and we throw it away." we did "The connection becomes closed and we put it back in the pool" This is a HUGE bug (you can get a QFE fix for this by contacting PSS directly) since we would never be able to get rid of this bad connection.

    As far as the extra roundtrip, the ODP.NET provider (Oracles managed provider) actually provides a connection string keyword that will automate that process for you, this almost makes sense because opening a new Oracle connection is expensive compared to the roundtrip cost. For SqlClient the cost of the roundtrip is so expensive as compared to the cost of opening a new connection that you might as well turn pooling off.

    If you want to ensure that your application works across server reboots you have three options, your current failsafe but slow verifying that a connection is working/turning pooling off. Finding out when the server goes down and using the ClearAllPools() method or passing on the exception to the user and continue. For asp.net applications that connect to a failover cluster the last is not unacceptable, for your specific scenario it might be.

    Of course pooling is only a convenience, there is nothing stopping you from turning it off and managing the connections in your application as you wish.

  6. Wallym says:

    What exception are you looking for regarding the attempt to execute the command, seeing that the connection is invalid and those marking the connections in the pool as doomed.

    Wally

  7. Angel says:

    Wallym,

    I don’t really have an answer for this. In the case of SqlClient we look for a fatal netlib exception at the tds level, things like out of memory, connection broken, server not found, invalid connection, encription not supported, network not working.

    Examples of non fatal netlib exceptions would be timeout expired, connection busy, too many connections, network busy.

  8. Do you still throw SqlException for fatal exceptions? It would be very useful if fatal exceptions threw a different exception, so callers could easily treat fatal (i.e., no database) and nonfatal (e.g., ‘your free text search contained only noise words’) differently.

    Example usage:

    try {

    cmd.Execute();

    } catch (FatalSqlException e) {

    log.Fatal("Database is down",e);

    } catch (SqlException e) {

    log.Error(cmd.CommandText,e);

    }

  9. DevPrime says:

    About a year and a half ago, a large project I was working on started having intermitten problems with…

  10. gay rape says:

    We are wellocme to it’s configuration.