The dreaded error – Cache::Get: The request timed out

Sometimes you see this error on Get calls, and it can freak people out. There are a few reasons why it may occur. Internally we have a 15-second timeout for calls (either Get or Put), so if we are not able to satisfy the request within that time, we time out and throw this error to the user.

This can be fine-tuned using the DataCacheFactory.Timeout property to make it higher or lower.
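Here is a minimal sketch of raising that timeout. The exact configuration surface varies across builds; the type and property names used here (DataCacheFactoryConfiguration, RequestTimeout) follow the later AppFabric client API and should be treated as assumptions, with the cache host endpoints assumed to come from the application config file.

```csharp
using System;
using Microsoft.ApplicationServer.Caching; // assumed client namespace

class TimeoutConfig
{
    static void Main()
    {
        // Raise the per-request timeout from the 15-second default to 30 seconds.
        // Cache host endpoints are assumed to be supplied via app.config.
        var config = new DataCacheFactoryConfiguration();
        config.RequestTimeout = TimeSpan.FromSeconds(30);

        var factory = new DataCacheFactory(config);
        DataCache cache = factory.GetCache("default");

        cache.Put("key", "value");
        Console.WriteLine(cache.Get("key"));
    }
}
```

In typical scenarios, you should not hit this error. However, there are cases where you will: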

 * The specific machine the client is routing the request to has just gone down. We try to establish a TCP connection twice before deciding to refresh our routing table. The TCP connection-open timeout is 15 seconds (the minimum send timeout is 10 seconds), and these limits are not exposed. Since we retry the connection twice, it can take as long as 40 seconds before we raise a complaint, so during that window incoming calls would keep timing out at 15-second intervals (a retry sketch for riding out this window follows below).

 Leases are maintained between machines: they are 3-minute leases with a 1.5-minute renewal. So if a machine has not responded within 1.5 minutes, its neighbours will suspect something and try to establish a connection. If the machine is running but the process is down, the connection will be refused instantly. If the machine itself is down, the TCP timeout applies, and then an arbitration process is started to kick the machine out of the cluster. All in all, it could take about 2 minutes before the server side decides a machine is down and reconfigures the cluster.
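For both of these failure-detection windows, a bounded client-side retry is the usual mitigation. The sketch below assumes the exception and error-code names of the later AppFabric API (DataCacheException, DataCacheErrorCode.Timeout); older builds may surface the timeout differently.

```csharp
using System;
using System.Threading;
using Microsoft.ApplicationServer.Caching; // assumed client namespace

static class CacheRetry
{
    // With a 15-second request timeout and up to ~40 seconds before the client
    // refreshes its routing table, a few attempts with a short pause are
    // usually enough to outlast a single machine failure.
    public static T WithRetry<T>(Func<T> operation, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (DataCacheException ex)
            {
                // Rethrow anything that is not a timeout, or when retries run out.
                if (ex.ErrorCode != DataCacheErrorCode.Timeout || attempt >= maxAttempts)
                    throw;

                // Pause briefly so the routing-table refresh can complete.
                Thread.Sleep(TimeSpan.FromSeconds(5));
            }
        }
    }
}

// Usage: var value = CacheRetry.WithRetry(() => cache.Get("key"));
```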

 * Machine is saturated – Here the connections would be slow and Put/Get calls might start timing out. Nothing much can be done other than adding new machines or reducing the load. Post-V1, we will also have automated load balancing that shifts some of the load around in these scenarios. But typically the distribution of load is good enough that, with uniform-sized objects and uniform load, you wouldn't get into this scenario unless you have not sized your servers properly.

* HA is needed and a machine is down: This is the most common problem we see – people install Velocity either on a single machine with HA on, or on two machines with one machine down. In either case, if we don't have a place to write a backup copy to, we throw this error. In V1, we will fix this so that we raise a different error (Secondary not available, or some such thing), so that you can do something other than retrying. In the meantime, a best-effort fallback sketch follows below.
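Until that distinct error exists, one defensive pattern is to treat the cache as best-effort and fall through to the backing store whenever a cache exception surfaces. This is a sketch only; LoadFromDatabase is a hypothetical stand-in for your real data source, and the exception type is again the later AppFabric name.

```csharp
using Microsoft.ApplicationServer.Caching; // assumed client namespace

static class BestEffortCache
{
    // Read-through helper: on any cache failure (including the timeout thrown
    // when HA cannot place a backup copy), fall back to the source of truth.
    public static string GetCustomer(DataCache cache, int customerId)
    {
        string key = "customer:" + customerId;

        try
        {
            var cached = (string)cache.Get(key);
            if (cached != null)
                return cached;
        }
        catch (DataCacheException)
        {
            // Cache unavailable; fall through to the database.
        }

        string value = LoadFromDatabase(customerId);

        try { cache.Put(key, value); }          // best effort; the cache may
        catch (DataCacheException) { }          // still be unable to take writes

        return value;
    }

    // Hypothetical helper standing in for a real query.
    static string LoadFromDatabase(int customerId)
    {
        return "customer-" + customerId;
    }
}
```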

 One piece of the diagnosability work happening for the V1 release is to make this error occur only when the problem is truly transient.



Comments

  1. MichaelGG says:

    We should be able to set the timeouts as low as possible, and initiate a rapid failover. Please consider this.


  2. bryars says:

    I’m reading up on your Velocity project and I’m interested in how the HA works. If I have two machines and specify that I need 1 backup, and one of my machines fails, then you’re saying that I can’t access the cache.

    1) I would have thought that Cache::Get would work in this scenario, but the save API would fail.

    2) This behaviour is good if you want disaster recovery. My application will stop working until I commission another machine to replace the broken one, and then we will continue where we were. But if I want a different type of HA, where the cache continues to work when a machine fails, albeit with no disaster recovery (until I get another server online), is this possible to configure?