You will sometimes see this error on Get calls, and it can alarm people. There are a few reasons why it can occur. Internally we have a 15-second timeout for calls (either Get or Put). If we are not able to satisfy the request within that time, we time out and throw this error to the user.
This can be fine-tuned using the DataCacheFactory.Timeout property to make it higher or lower. In typical scenarios, you should not hit this error. However, there are cases where you will:
* The specific machine that the client is routing the request to has just gone down. We try to establish a TCP connection twice before deciding to refresh our routing table. The TCP connection open timeout is 15 seconds (the minimum send timeout is 10 seconds). These limits are not exposed. Since we try twice for a connection, it can take as long as 40 seconds before we raise a complaint. During that window, incoming calls start timing out at 15-second intervals.
Leases are maintained between machines; they are 3-minute leases renewed every 1.5 minutes. So if a machine has not responded within 1.5 minutes, its neighbours will suspect something and try to establish a connection. If the machine is running but the process is down, the connection is refused instantly. If the machine itself is down, the TCP timeout applies, and then an arbitration process starts to kick the machine out of the cluster. All in all, it can take about 2 minutes before the server side decides that a machine is down and reconfigures the cluster.
* Machine is saturated: here the connection is slow and the Put/Get calls might start timing out. Not much can be done other than adding new machines or reducing the load. Post V1, we also have automated load balancing that will shift some of the load around in these scenarios. But the distribution of load is typically good enough that, with uniformly sized objects and load, you wouldn’t get into this scenario unless you have not sized your servers properly.
* HA is needed and a machine is down: this is the most common problem that we see. People install Velocity either on a single machine with HA turned on, or on two machines, one of which is down. In either case, if we don't have a place to write a backup copy, we throw this error. In V1, we will fix it so that we raise a different error (Secondary not available, or something similar), so that you can do something other than retrying.
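The timing figures in the first case above can be put into a quick back-of-the-envelope calculation. This is only a sketch: the constants come from the post, but the way they are combined (two connection attempts plus the minimum send timeout on the client side, one lease renewal interval plus one TCP open timeout on the server side) is an assumed reading, and the function names are made up for illustration.

```python
# Figures quoted in the post. How they add up is an assumption of this sketch.
TCP_OPEN_TIMEOUT_S = 15   # TCP connection open timeout
CONNECT_ATTEMPTS = 2      # connection attempts before the routing table is refreshed
MIN_SEND_TIMEOUT_S = 10   # minimum send timeout
LEASE_RENEWAL_S = 90      # leases are renewed every 1.5 minutes


def client_worst_case_s():
    """Assumed worst case before the client raises a complaint."""
    return CONNECT_ATTEMPTS * TCP_OPEN_TIMEOUT_S + MIN_SEND_TIMEOUT_S


def server_detection_floor_s():
    """Assumed time for neighbours to notice a dead machine.

    One missed lease renewal plus one TCP open timeout. Arbitration
    time is not included, so the real figure is somewhat higher.
    """
    return LEASE_RENEWAL_S + TCP_OPEN_TIMEOUT_S


print(client_worst_case_s())       # 40
print(server_detection_floor_s())  # 105
```

Under this reading, the 40-second client figure falls out directly, and 105 seconds plus arbitration is in line with the "about 2 minutes" quoted for the server side.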
Part of the diagnosability work happening for the V1 release is to make this error occur only when the problem is truly transient.
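When the timeout really is transient, a bounded retry on the caller's side is usually enough. The following is a minimal sketch of that idea, not Velocity's API: `call_with_retry` and `flaky_get` are hypothetical names, and Python's built-in `TimeoutError` stands in for whatever exception the cache client actually throws.

```python
import time


def call_with_retry(operation, attempts=3, delay_s=0.5, sleep=time.sleep):
    """Run operation(), retrying on TimeoutError up to `attempts` calls in total.

    Hypothetical helper: TimeoutError stands in for the cache client's
    own timeout exception.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts:
                raise  # out of attempts, let the caller decide what to do
            sleep(delay_s * attempt)  # simple linear backoff between attempts


# Usage: a stand-in Get that times out twice and then succeeds.
calls = {"n": 0}


def flaky_get():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated cache timeout")
    return "value"


print(call_with_retry(flaky_get, sleep=lambda s: None))  # value
```

Retrying only helps in the transient cases. If the timeout comes from a missing secondary in an HA setup, retrying will keep failing until a second machine is available, so it is worth capping the number of attempts.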