Take it easy on the automatic retries


When I saw a discussion of how to simulate retry via try/catch, using as inspiration a Ruby function that retried a network operation three times before finally giving up, I felt the need to caution against automatic retry.

Your natural inclination when faced with a failure that has a good chance of being caused by a transient condition is to retry it a few times. The second, or possibly third, try will finally work, and your function can continue. The user gets what they want without an annoying "Abort, Retry, Cancel"-type dialog, one less support call for you. What could possibly go wrong?

I've seen this go wrong many times. So much so that my personal recommendation is simply never to retry automatically. If something fails, then report the failure. If the user wants to retry, let them be the ones to make that decision.

Here's how it goes wrong. This is a real example, but the names have been removed because I'm not trying to ridicule anybody; I want you to learn. There was a networking feature that implemented some type of distributed networking capability. It is the nature of networks to be unreliable, so the implementors of the functionality decided to retry ten times before finally giving up. The operation they were performing was implemented by another group, and that other group also decided to retry five times before giving up. That second group called a networking function with a timeout of thirty seconds. Meanwhile, the application that used this networking capability attempted the operation fifteen times.

Let's do some math. At the bottom was a timeout of thirty seconds. Five retries comes out to two and a half minutes. Ten retries from the next layer brings this to twenty-five minutes. Fifteen retries from the application layer takes us to over six hours. An operation that would normally have completed (with a failure code) in thirty seconds became, through the multiplicative effect of multiple layers of retrying, a six-hour marathon. And then you get a very angry call from one of your customers demanding that you deliver them a fix yesterday because this problem is taking down their entire sales force.

(Explorer is hardly blameless in this respect. Though the article's attempt to patch shell32.dll is doomed to failure since shell32.dll is frequently updated by security patches.)

Comments (54)
  1. I have many nice anecdotes on retries… Once I refactored a device driver where the original implementor had decided to retry all operations 3 times instead of checking any error codes.

    But my all-time favourite (because it took me two weeks to find) was in a network driver where someone had decided that the correct way to write to the card’s registers was through a:

        do {
            outp(reg, val);
        } while (inp(reg) != val);

    Never caring that one of the bits in one of those registers was write-only…

  2. Gene says:

    And the tough thing is this sort of stuff is hard to test. It works fine on a normally reliable network. Plus you have to have code visibility down to almost the driver level to fix it.

  3. AC says:

    Well, this really explains how things get so slow on a modern computer which should, theoretically, be able to process billions of instructions per second.

    My biggest problem is the delay when changing the current directory to a network folder that is not available. Apparently the process gets stuck even if it tries such a thing from a separate thread (I haven’t tried that myself, but I use Total Commander and the author claims the symptoms are as described).

  4. andy says:

    Another common mistake I see often is increasing a server timeout value when it starts getting overloaded. What happens then is that those long running operations take up all the threads on the server and none of the other operations get a chance to run, and the entire server goes down, instead of just a few slow operations being denied.

  5. dhchait says:

    This is a solved problem; it’s called "Exponential Backoff". Subsequent retries wait longer and longer periods of time (exponentially); introduce a small random jitter to prevent several clients from harmonizing, and collar it at some maximum timespan to prevent it from getting ridiculous.
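
A minimal sketch of the scheme Daniel describes, assuming nothing beyond the comment itself: exponential growth, a small random jitter, and a cap ("collar"). The function name and defaults here are illustrative, not from any particular library.

```python
import random

def backoff_delays(base=0.5, factor=2.0, cap=30.0, max_attempts=6, jitter=0.1):
    """Yield the number of seconds to sleep before each retry.

    Delays grow exponentially, get a small random jitter so that many
    clients don't retry in lockstep, and are collared at `cap` so they
    never get ridiculous.
    """
    delay = base
    for _ in range(max_attempts):
        yield min(delay, cap) + random.uniform(0.0, jitter)
        delay *= factor
```

Note that this only bounds the waiting done at *your* layer; it does nothing about retries stacked in the layers beneath you.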

  6. BlackTigerX says:

    I use the retry technique (3 times) on systems where we have about 100 computers running the same process; if I had the user do the "retry" every time the network failed, we couldn’t operate at all.

  7. Steve says:

    Daniel: Your solution doesn’t take into account the fact that even though you collar the retries, if the three guys in the chain below you all do the same thing the problem is still there. It’s just not as bad. Your solution only works if you are in absolute control of the whole sequence of operations.

  8. Laurence Hartje says:

    Is this related to what happens when you click on a network drive in Explorer, when the destination machine is unavailable? I’ve seen this hang up the specific Explorer window for over a minute…

    And annoyingly enough Explorer is "smart" enough to not open a duplicate window — so when the "My Computer" window is sitting there and spinning, I can’t open another "My Computer" window to try to navigate to the mapped drive.

    The only workaround I’ve figured out is Start / Run / explorer – but it’s still annoying.

  9. K.T. says:
    • "Exponential Backoff"

      Daniel, doesn’t this make this particular problem exponentially worse?

      To me it sounds like "Exponential Backoff" is designed to stop DoS rather than solve timeout woes.

  10. Peter Ibbotson says:

    This really does depend. For example, with SQL Server clustering, if you are trying to keep connections alive transparently across the server swap from the perspective of your upper layers, then you need retry code that pops up a box telling the user to wait.

    Often a good idea is to auto-retry but give the user the chance to abort. After two to three seconds users assume the PC has hung completely, so give them feedback that if they wait it may recover (while doing that recovery automagically).

    The real problem here is that users won’t wait for very long.

  11. Gabe says:

    Raymond’s example provides a good argument for open source. Most people aren’t going to sit there and rewrite their software, but when your whole sales force is down it would be nice and easy to just change that 15 to 1 and get on with business.

    Raymond is also correct that most times you do not actually want to retry. The only times you would are in situations where you know that specific errors are transient. For example, pretty much every UNIX system call can exit with the error EINTR if a signal was caught during the call, so after every important system call you might want to check for EINTR and try again because the error condition is transient. EVERY OTHER ERROR you would not want to retry though.
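
The EINTR case is the textbook exception. The classic loop looks like this, sketched in Python for brevity; note that Python 3.5+ already retries EINTR for you (PEP 475), so the pattern matters mainly in C and older runtimes:

```python
import os

def read_retrying_eintr(fd, nbytes):
    # EINTR means a signal arrived before any data moved; the condition
    # is transient by definition, so reissuing the call is safe. Every
    # other error is reported to the caller, not retried.
    while True:
        try:
            return os.read(fd, nbytes)
        except InterruptedError:  # errno EINTR
            continue
```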

    This is particularly maddening with something like Explorer which will sit there trying to delete something 5 times because whatever is causing it to fail (like Explorer has the directory open) isn’t likely to clear up spontaneously. It’s even worse when the function that’s being retried is one that has side-effects. I’ve seen accounts get locked out because WinXP tried to log somebody in with the same incorrect credentials 3 times in a row! As if the wrong password would somehow become correct?

    On the other hand, there is nothing worse than having a program so stupid that it gives up at the first sign of failure. Most FTP programs are like this. Network connections are unreliable and prone to failure, so any large transfer has a decent probability of not finishing. If there is a transient failure (and most FTP failures are transient) retry automatically and let the user know so they can cancel it.

  12. Stu says:

    If the user wants to retry, let them be the ones to make that decision.

    Then there’s the programs that follow the "fail once, never try again, even when the user wants to" method. I just now had a certain Microsoft program tell me "The service is busy" after a connection failure; then when the "Retry" button was pressed, the same message reappeared instantly, without an attempt to connect. Closing and re-opening the program fixed the issue, although it can be more annoying when the program decides that Outlook is using its services and refuses to close…

  13. oldnewthing says:

    If only it were as easy as "changing that 15 to a 1". It’s really fifteen different "1"s scattered all over the place.

  14. Tim says:

    "the article’s attempt to patch shell32.dll is doomed to failure since shell32.dll is frequently updated by security patches."

    Wouldn’t it, then, be super-duper fantastic if one day shell32.dll got patched to not retry 5 times by, ooh, someone on the Shell team? :-)

    "Exponential Backoff"

    Isn’t that for negotiating who gets access to a resource – it’s nothing to do with network reliability (e.g. link might be down)? It’s used by stuff like CSMA/CD (by e.g. ethernet) etc.

    "And the tough thing is this sort of stuff is hard to test. It works fine on a normally reliable network."

    I’ve always wanted a much simpler way of writing a file system driver on Windows, so that I can write file systems/layers that fail on demand. e.g., this file is undeleteable, when I read the 2nd 8k of data from this file, I want it to fail, etc. Or, for example, many years ago, I wanted a CD filing system that would mount an ISO and then simulate the speed of a CD drive seeks/reads, etc. Unfortunately, this seems to require IFS and kernel coding, which is the programming equivalent of paying a friend to come and stamp on your reproductive organs every 15 minutes. I’ve often wondered if it would be possible to write a generic IFS stub that somehow talked to userland processes, so you could just provide a nice simple set of entry points that could allow you to develop a filing system using the usual tools without having to get into kernel debugging. But I’ve never had the time. Or the protective clothing :-). Plus, the last time I checked, IFS dev involved paying £1000 for a header file, or downloading some iffy GPL’d subset of it.

  15. pass says:

    When logging on to the network, Windows tries up to 3 times, with "Password", "PASSWORD" and "password". Some servers have a maximum number of logon attempts before the account is locked (3 failed tries = locked for 30 minutes); with 3 automatic retries, the user must type the correct password on the first try or wait 30 minutes. This also confuses the user about what his password actually is.

  16. Alex says:

    Thousand retries and lots of waiting? Sounds like DCOM to me. I feel sorry for the Industrial Automation folks who bought hook, line, and sinker into that stuff and spun their whole "OPC" methodology around it on the promise of easy linkages between devices. Everyone has to run Windows, everyone has to wait a long time, and everyone has to guess what the one error code returned by DCOM actually means.

  17. oldnewthing says:

    Now consider the compatibility risks of removing the 5 retries from shell32.

  18. Michael says:

    Gabe,

    Changing that 15 to a 1 is hardly that easy. Open source doesn’t always mean that you have that kind of control or flexibility. As Raymond noted, the 15 may be in several places, and referenced in several ways.

    But more importantly, you’re assuming that the company has taken the original source code and is now maintaining an in-house branch. In that circumstance, I can see your point, although it’s still a little more complicated than that, since I’m assuming there’s some red tape if the company is large enough to have an in-house development team (unless, of course, they’re a dev shop). Such a problem would still have to be reported internally, diagnosed, and a new build compiled, tested, and deployed. The big difference is now you’re sucking up the costs, rather than a vendor.

    Very often, however, you’re using an open source product that is not maintained by your company. While you can modify the source code, any changes made in-house now have to be carefully tracked and integrated into future product versions, which won’t reflect your change. More likely, you’ll end up making the same call to whomever supports the product that you would have made to Microsoft et al.

    My point isn’t so much to refute open source, nor to prove you wrong. It’s more to point out that open source does not equate literally to flexibility nirvana. Open source software is still software, and the process of debugging, coding, testing, and deploying must still be managed.

    Despite the differing philosophies of open and closed source software, there are some things you just can’t shortcut.

  19. Ifeanyi Echeruo says:

    Retry is an application level policy decision.

    Components and libraries should fail fast and report an error or provide an alternative with a timeout (not a retry count).

    Whether the application should make the user decide, retry automatically, retry while informing the user of current status or just format your drive for giving it so much hard work, is another discussion altogether.
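
One way to realize the "timeout, not a retry count" rule is to thread a single time budget through every layer. The `Deadline` class below is a hypothetical illustration, not a real Windows or library API:

```python
import time

class Deadline:
    """One wall-clock budget shared by all layers of an operation.

    Instead of each layer picking its own retry count, the top layer sets
    the budget once; lower layers ask how much time remains and fail fast
    when it is gone, so worst cases cannot multiply through the stack.
    """

    def __init__(self, seconds):
        self._expires = time.monotonic() + seconds

    def remaining(self):
        return max(0.0, self._expires - time.monotonic())

    def expired(self):
        return self.remaining() == 0.0
```

A networking layer handed `Deadline(30)` would pass `deadline.remaining()` as the timeout of each blocking call instead of inventing its own retry loop.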

  20. Koro says:

    Really insightful post, I did not think about that point, in fact, I may have the same bug in one of my programs.

    Do you know, however, if TCP does attempt reconnection on behalf of the application if the host can not be reached when trying to connect to it?

  21. Cooney says:

    andy:

    Another common mistake I see often is increasing a server timeout value when it starts getting overloaded. What happens then is that those long running operations take up all the threads on the server and none of the other operations get a chance to run, and the entire server goes down, instead of just a few slow operations being denied.

    This is a symptom of poor server design. The fix here is to use roughly 20 threads that service all incoming connections and perhaps add a sweeper that notices dead connections.

    Tim:

    "Exponential Backoff"

    Isn’t that for negotiating who gets access to a resource – it’s nothing to do with network reliability (e.g. link might be down)? It’s used by stuff like CSMA/CD (by e.g. ethernet) etc.

    It’s for avoiding a progressive degradation that would kill a network in a high-traffic situation, so yes, it is reliability related.

    Koro:

    Do you know, however, if TCP does attempt reconnection on behalf of the application if the host can not be reached when trying to connect to it?

    Sort of. TCP rides on top of IP, and packets that are lost get retransmitted. Once the connection dies, it’s dead.

  22. Norman Diamond says:

    Explorer is hardly blameless in this respect.

    Thank you, it is a relief to see this kind of frankness.

    Now speaking of drivers, I’ve read reports of open source IDE drivers doing the same kind of multiplicative retries so a bad block on a CD-ROM causes hangs for ages before the user can abort the operation. For some reason I have the impression that certain closed source IDE drivers do the same. What we really need is flexibility. If the user wants a tool to do maximum tweaks and retries in hopes of recovering some fraction of the data, do it. If a user is just doing an ordinary operation then let the abort be quick. Maybe the user has or can make extra copies of the defective CD.

    Monday, November 07, 2005 12:24 PM by Daniel Chait

    > "Exponential Backoff".

    Yes (for issues involving contention rather than defects).

    > […] longer and longer period of time

    > […] random small jitter

    Yes.

    > collar it at some max timespan to prevent it

    > from getting ridiculous.

    That I’ve never understood. It seems to me that collaring is going to recreate the same problem that you started out pretending to solve. If you think that the length of time has grown ridiculous, but the amount of contention is so high that you still can’t get served, you don’t want to collar it, you want to abort.

  23. Cooney says:

    If you think that the length of time has grown ridiculous, but the amount of contention is so high that you still can’t get served, you don’t want to collar it, you want to abort.

    That’s what you do – collar should refer to the total time spent attempting transmission, not wait time.

  24. Bilal says:

    You actually see the .NET framework doing something like this. (I know, this is not a .NET blog, but this was something that came into mind when I read this blog)

    Try this

    Set up an isolated network (hook the PC up to a router so it has an IP address, but no Internet connection).

    Create a .NET assembly that is strong named, digitally signed and installed in the GAC.

    Now create an exe that uses that assembly. You will see that it takes 10+ seconds for the exe to load the assembly.

    I came to find out that the CLR was doing a certificate revocation list (CRL) check, and the reason for the delay is that the WinTrust API (which is used by the CLR) keeps retrying to connect to the CRL server to verify the certificate of the assembly (even though it was in a trusted location, the GAC). The only way to get around this is to disable the CRL check completely (using the registry or IE). The number of retries and the delay are not configurable, which kinda sucks… So it looks like your app initialization takes more than 10 seconds… You can imagine how exciting people find that ;)

    But I’m just venting..

    Hey, my first comment on this blog :)

  25. Anonymous Coward says:

    Now consider the compatibility risks of removing the 5 retries from shell32.

    Compatibility is important, yes, but how exactly does anyone write a program that relies on 5 attempts to delete a file in some critical manner? That’s a very special achievement, and I didn’t know there was a special olympics for the computing world.

    Even if there’s some legitimate reason why a multi-threaded app wants to delete a file but isn’t sure when another thread will release it, this retry scheme is still not guaranteed to work, because the relevant thread only needs to get blocked for a short while and it’ll miss all the retries. It might increase the number of transitory failures, but it’s not like 5 retries is guaranteed to always work while only having 4 retries could occasionally fail.

  26. Michael Pryhodko says:

    > I feel sorry for the Industrial Automation folks who bought hook, line,

    > and sinker into that stuff and spun their whole "OPC" methodology around it

    > on the promise of easy linkages between devices.

    Right… The guys who decided to use COM/DCOM in OPC should have their hands broken, fixed and broken again. I wrote some OPC-compliant stuff — it really hurts, especially considering how simply and efficiently it could have been done without COM.

  27. abhinaba says:

    I am honoured to be linked by Raymond, for good or bad :)

    I agree that just retrying does not work in all situations. In http://blogs.msdn.com/abhinaba/archive/2005/10/01/476026.aspx

    the example I used does not arbitrarily retry the operation 3 times. It uses an Exception class which explicitly uses a public member to signal whether the operation is retryable.

    Asking the user to retry does not work in every situation. Let’s take the example of one of the source-code repository converters we are working on. It takes a VSS repository and migrates all the data in it to a Team Foundation Server repository. This includes thousands of files, hundreds of versions of each, and takes a day to migrate. In this high-stress situation the VSS server sometimes acts up and either times out or throws an error. So what do we do? Expect the user to sit around the whole day, watching each failure and being prompted Retry (Y/N)? Or fail a migration that has been running for the last 8 hours?

    My point is that in some situations, like an interactive program (UI client), prompting the user to retry is the correct thing to do. In a long-running unattended batch conversion job, where we know for sure that transient failures occur and get resolved on retrying, automatic retry is the right approach.

  28. Jerry says:

    As always – it depends.

    We write a lot of automation stuff and know that every customer is totally different, with networks where small locations are connected over a few Kb and the bigger ones over Mbit lines, and with different network load and latency.

    If you know that sometimes you hit a timeout – and that the layers "down there" just don’t care – it’s a good thing to retry.

    I think it’s about intelligent programming: carefully deciding whether it is necessary, and whether it makes the user’s life better.

  29. mirobin says:

    I’ll admit it, I’m an evil auto-retrier. :)

    My implementation lets the user know that the first operation failed, and that it is trying the operation again. The user can cancel the operation if they get tired of waiting.

    I’d have to give a nod to Ifeanyi, as I think he’s pretty much nailed the proper way to handle such cases.

  30. oldnewthing says:

    Giving the user the ability to cancel an automatic retry is great, but what if you’re a layer with no UI? How can the networking layer display a cancel button?

  31. oldnewthing says:

    "how exactly does anyone write a program that relies on 5 attempts to delete a file in some critical manner?"

    Ah, that was my exercise for you. Here’s one example: Consider an automation script. It opens Excel, creates a spreadsheet, saves it, then opens Word, embeds the spreadsheet into a document, prints the result, then closes Word and Excel and then deletes the temporary spreadsheet. Repeat four times, once for each department.

    Remove the retries and now the script gets a "file in use" error because it tried to delete the temporary spreadsheet before Word finished closing the document. Script fails, monthly sales reports not generated for three of the departments, angry call to Microsoft coming soon for breaking a script that is essential to their business.

  32. I’d posted along similar lines just recently. My concern was that nobody ever puts down WHY it retries. A simple comment in code like, "we retry because of an issue with the xxx connection failing consistently on the first try but then works" at least tells me that the developer was aware of the condition.

  33. Tim Dawson says:

    The more you pander to the types of people that write code like that (bad code) the more these compatibility issues will grow – exponentially.

    People are sick of waiting for explorer to timeout, as documented by various posters in this thread. Would the benefits of removing these retries not outweigh the small number of scripts that are broken?

  34. mirobin says:

    If you’re a network layer, how retries are performed, if any at all, should be determined by the API specification.

    It is up to the API to determine how to handle the cases that no application would ever care about (if such a thing is possible ;)) and to surface cases that an application could possibly care about.

    At minimum I would expect one or more of the following:

    1) API documents how long an attempt takes to time out or otherwise fail

    2) API documents how many times and under what conditions it attempts to perform a retry if at all

    3) API provides a mechanism to cancel a call

    4) API provides a way to specify or imposes a time limit on the total length of the call

    5) API provides a way to specify number of attempts and/or intervals between attempts

    6) API just fails and returns an error code on timeout

    In the example you cite in this post, the issue could probably have been avoided if each API documented the worst-case timeout for each operation (assuming someone using the API reads the docs beforehand and is smart enough to realize that making 5 calls each taking > n minutes is a dumb thing to do).

  35. Derek says:

    I agree that in nearly all cases, lower-level layers should not retry. The exceptions are when retries are a requirement, such as the TCP retransmissions others have pointed out. It should NEVER be a "convenience" retry for the application.

    As for the .NET CRL thing, that’s broken. It needs to let the user know what’s going on so they can cancel, or at least understand the delay. And don’t say it’s not a UI layer: I bet it notifies the user if it actually encounters a revoked certificate, so it should be able to notify the user in the event of a retry. The whole CRL thing is doubly broken if the final default is to assume that the certificate isn’t revoked; at that point, it should require admin intervention to even allow the component to run. Otherwise it’s a security risk.

    Raymond, the real problem isn’t that Explorer retries calls. As someone pointed out, that decision is an application-level choice. Explorer is application-level, so it can make that choice. The problem is that you’re leaving the user in the dark. It shouldn’t take explorer five seconds to let me know it’s trying to do something. TELL ME if you’re retrying. Show a dialog with a cancel button. That’s all I want, and then the behavior becomes tolerable (though still pretty much useless). The same with accessing a network folder. Long operations should not hang the UI. That’s basic design everyone knows (or should know). Give me an intervention mechanism, and then I won’t have to sit at my computer cursing Microsoft.

    Sometimes it seems Microsoft uses the "compatibility" excuse to avoid work. Seriously, this kind of compatibility shouldn’t need to be carried across OS versions, possibly not even service packs. What corporation can afford to deploy a new OS (or service pack) but can’t afford to check its scripts on the new OS? Put a disclaimer on the service pack saying that things have changed and scripts should be checked before deployment. That should be enough, and new OS versions shouldn’t even require that disclaimer.

  36. oldnewthing says:

    "Would the benefits of removing these retries not outweigh the small number of scripts that are broken?"

    The retries were originally introduced not at the request of scripts, but at the request of beta testers! "If I try to delete a file too quickly after closing the program that had the file open, the delete fails. Explorer should sleep a little bit and try again because the reason it can’t delete might be transient."

    One man’s feature is another man’s bug.

  37. Gabe says:

    I think the easiest solution to this problem is not to retry N times, but to retry for at most X seconds. That way you don’t have to worry about how many times the layer below you has retried and retry delays won’t be multiplicative.
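
That idea can be sketched as a small wrapper, assuming only what the comment says: bound the retrying by wall-clock time, not by attempt count. `op` here is any zero-argument callable that raises on failure; the names are illustrative.

```python
import time

def retry_for(op, budget_seconds, delay=0.1):
    # Bounding by time instead of attempt count means stacked layers
    # cannot multiply: whatever the layer below does, this layer is
    # finished (success or final failure) within about budget_seconds.
    deadline = time.monotonic() + budget_seconds
    while True:
        try:
            return op()
        except Exception:
            if time.monotonic() + delay >= deadline:
                raise  # budget spent; surface the last failure
            time.sleep(delay)
```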

    Also, if an API is going to retry something, it should be documented and/or configurable (how many times, for how long) so that the layer(s) above it can do something intelligent. Batch processes would want to retry indefinitely or after the queue is empty, while interactive processes would want to be able to wait a brief period and allow the user to cancel.

    In the case of Explorer, it should pop a message box that says "Cannot delete file XYZ123 because it is locked. Retrying in 5…4…3…2…1" with a Cancel button. Even better would be to have a button that would show me what process has the file locked. As an added bonus, Explorer would be able to tell if it’s the process that has the object locked and automatically close all handles to it.

  38. Bryan says:

    Explorer should autoretry, but the code that runs when scripts are using it should not. Explorer runs at the app level, unless it’s being used from scripts.

    It should have always failed the call that the script made, using an error that was clearly documented as "this may be a transient error", and the script should have decided whether to retry (and how many times).

    In other words: The script should not be calling into the exact same routine that the Explorer UI code calls into when you press the delete key. The actual "delete file" code should be in a separate function, which is called by both the scripting interface and Explorer’s UI.

    Of course you can’t change that now (not without breaking a bunch of scripts, at least). But I still think that it’s the Right Way to do it.

  39. Cooney says:

    Remove the retries and now the script gets a "file in use" error because it tried to delete the temporary spreadsheet before Word finished closing the document. Script fails, monthly sales reports not generated for three of the departments, angry call to Microsoft coming soon for breaking a script that is essential to their business.

    So Word returns before the operation is complete? What kind of scripting support is that?

  40. oldnewthing says:

    What’s this "returns" thing? You’d be surprised how many scripts consist of replaying mouse clicks and keystrokes. Simulate a click on Word’s "X" button, then simulate a click on the temporary file, then simulate a press of the DELETE key, then simulate a press of the "Y" key. That’s why I’m kind of baffled by the people who talk about removing the delay from the "scripting interface" to Explorer. These scripts aren’t using the scripting interface in the first place. They’re simulating an end user.

  41. Bryan says:

    … Oh.

    OK, I read "script" and thought "VBscript, i.e. COM/ActiveX". I figured there was an ActiveX interface to most of what Explorer allows (even though I’ve never used it, that obviously doesn’t mean it doesn’t exist), and these scripts that you were referring to used it.

    So never mind; many of those objections don’t hold if the script is acting like a user.

    (Of course, in that case, Explorer has a UI, too — that’s what the script is using. So Explorer could show a window saying "delete in progress, cancel?" or something.)

  42. Bryan says:

    > How can the networking layer display a cancel

    > button?

    The network layer shouldn’t be retrying *in the first place*! (Except for the TCP layer’s retransmissions — but that stuff’s specified by the appropriate RFCs; you can’t get rid of it and claim to have a TCP/IP stack. It should also be documented in the APIs.)

    As several people have said — retrying should be an application level decision, *NOT* a lower-layer decision. No lower layers should be auto-retrying *AT ALL* — not network, not DCOM, not the .net framework code, nothing.

    Put information in the API documentation to tell developers that operation X might fail due to transient conditions, and if they’re writing *application layer* code, they may want to retry the operation. It would probably be a good idea to make it explicit that code without a UI should pass the failure up to its caller instead of retrying on its own.

    Services may be a problem, unless their only goal in life is to provide some functionality to a higher layer of code. (Then they should not be retrying.) Although when are people waiting on a service? Usually, only on boot — and that might be solvable by making the "OK, this service is started" call happen earlier.

    And yes, I realize Windows maybe can’t fix this anymore, but it’s still the right fix. In the meantime, documenting which API calls auto-retry (and how many max retries happen) should forestall a lot of future issues with this. If I see in the documentation for some IE activex function that it will retry up to 10 times, I’ll definitely think twice about making that call more than once. If it fails back to me, that means it’s already tried 10 times and failed all 10, so there’s likely not much I can do about the problem by just autoretrying.

  43. Derek says:

    >I think the easiest solution to this problem is

    >not to retry N times, but to retry for at most X

    >seconds. That way you don’t have to worry about

    >how many times the layer below you has retried

    >and retry delays won’t be multiplicative.

    The problem is that lower layers shouldn’t be retrying at all. If you decide to retry for, say, 5 seconds, and the function you call decides to keep going for 3 years, what can you do about it? If the layer underneath doesn’t have some sort of timeout functionality, you’re just going to have to wait. You could create a second thread to monitor and kill the first thread if it takes too long, but that’s problematic because it’s 1) way too complex a fix for something that shouldn’t be happening at all, and 2) going to cause massive problems with resource leaks in the unknowingly-terminated thread.

  44. Universalis says:

    But the whole trouble here was that Windows Explorer *was* an application layer program and as soon as you added scripting to it, it ceased to be.

    This is an example of the philosophical incoherence of the "automate by scripting an application program" model. Unfortunately, apart from being philosophically incoherent, that model is damned useful!

  45. Bryan says:

    Universalis:

    > But the whole trouble here was that Windows Explorer *was* an application layer program and as soon as you added scripting to it, it ceased to be.

    Sort of. It was an app-level program, yes. So the retries should have been done in Explorer’s code, yes.

    But when scripting got added, it should not have been added as a layer above this retry code; that’s the problem. The code that handles the retries should have been at the same layer as the scripting consumer.

    "Automate by scripting an application program" is broken, I’ll agree — instead of automating the app itself, I think it’s smarter to expose the code one layer below the autoretry code to scripts.

  46. Cooney says:

    > You’d be surprised how many scripts consist of replaying mouse clicks and keystrokes.

    *shudder* Maybe not.

  47. Anonymous Coward says:

    As I said – if 4 retries are not sufficient, then 5 retries are no guarantee of success. The current system can *at best* mask problems up until something makes excel take 5.01 seconds instead of 4.99 seconds to release a file.

    This means that scripts will fail erratically, not consistently, and if they’re really that important then they damn well ought to be written by someone who’s done more than read halfway through "Learn Visual Basic For Applications in 21 Days". (Either that, or the accountant writing scripts should quit whining when some coder who’s read half of "Accounting For Dummies" thinks they know more about business management than the guy who spent 5 years at Harvard. Apparently there’s something in the water that makes people have no respect for any profession besides their own.)

    Yeah, the current system can mask problems and make amateurs think that their badly written code is actually reliable. I’m not sure that’s worth the irritation of every user who gets an unresponsive user interface, but there are programmers all over the world who’d disagree with me.

  48. mirobin says:

    Terminating a thread to abort an operation is never a safe thing to do, and is a good way to introduce a deadlock in your application. You have no idea what kind of resources/locks the thread you are terminating has a hold on (ex: the loader lock) when you yank the rug out, leading to unpredictable behavior later down the line.

    At best, you would write the thread in such a manner that it could be abandoned by your application.

    As far as the points the AC raises about retrying always being an unreliable solution, I would disagree to a point. If you expect that the application could take up to 5 seconds to shut down, then yes, retrying for 5 seconds is not reliable. If you expect that the *worst case* is that the application takes 5 seconds to shut down, then it is reliable.
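mirobin's "abandon, don't terminate" approach can be sketched with a hypothetical helper like this: the work runs on a daemon thread that hands its result back through a queue, and on timeout the caller simply walks away instead of killing the thread (which could leave locks, such as the loader lock, held forever).

```python
import queue
import threading

def run_abandonable(operation, timeout):
    """Run `operation` on a worker thread.

    If it finishes within `timeout` seconds, return (True, result).
    Otherwise return (False, None) and *abandon* the thread: it keeps
    running to completion on its own rather than being terminated,
    so any locks it holds are eventually released normally."""
    results = queue.Queue()
    worker = threading.Thread(target=lambda: results.put(operation()),
                              daemon=True)  # won't block process exit
    worker.start()
    try:
        return True, results.get(timeout=timeout)
    except queue.Empty:
        return False, None  # abandoned, not killed
```

The cost is that an abandoned thread still consumes its resources until it finishes, which is exactly why the better fix is not to need the watchdog at all.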

  49. Tim says:

    "Sometimes it seems Microsoft uses the "compatability" excuse to avoid work."

    Somewhere, Raymond’s head is exploding :-D

  50. Dan says:

    OK – a bit off topic, but tangentially related to the retry delay on delete. When you are viewing a subdirectory in an Explorer window and right-click on a parent directory (in the folder list) and choose delete, you experience the delay while Explorer retries deleting something that it has locked itself.

    I can’t tell you how many times I do this as it is so natural. You’d think I’d learn…

  51. Neil says:

    Similarly, I look at the files on a floppy before deciding I can format it (see later blog!) and of course the format fails because Explorer has locked the drive…

    Surely that "Deleting…" progress dialog should display while Explorer is retrying? Or maybe it does now, and I should upgrade.

  52. Nekto says:

    "If I try to delete a file too quickly after closing the program that had the file open"

    ;)

    The "best" solution whould be for DeleteFile function is to trace who is using file and find out if it trying to close itself. If yes – show message "Sorry we are waiting for that slow thing.." in notfication area. And to delete it after.

    *joke*

    What do you think about the "unix way"? Delete the file from the directory listing, but not from disk while any handle to it exists, and only reclaim it when the last handle is freed. That way the user sees the file deleted instantly.
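The "unix way" Nekto describes is observable from any POSIX system (on Windows, the unlink would normally fail with a sharing violation instead). A small demonstration:

```python
# POSIX semantics: unlink() removes the directory entry immediately,
# but the file's data survives until the last open handle is closed.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as w:
    w.write("still here")

f = open(path, "r")              # hold a handle open
os.unlink(path)                  # directory entry is gone at once...
assert not os.path.exists(path)  # ...so the name no longer resolves...
data = f.read()                  # ...but the open handle still reads the data
f.close()                        # storage is reclaimed only now
print(data)
```

This is also why, as BryanK notes below would be desirable on Windows, POSIX systems can replace an in-use executable during an upgrade: the old inode lingers invisibly until its last user exits.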

  53. BryanK says:

    Nekto — I would love to see that done (for one, it would mean you could install most patches without a reboot: the only updates that would require a reboot would be the ones that affect services or drivers that are required at all times by the kernel).

    Unfortunately, I don’t know how many programs use that (IMO mis-) feature as a way to do a lock-file with automatic clean-up (where if a process holds the lock-file and crashes, it’ll get cleaned up the next time someone requests the lock). A change like that would break compatibility, which is extremely unfortunate.
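The crash-safe lock-file pattern BryanK describes (on Windows, "the file is open, so it can't be deleted" doubles as the lock) has a POSIX counterpart that gives the same automatic cleanup without delete-blocking semantics: an advisory `flock` lock, which the kernel drops the instant the holding process exits. A POSIX-only sketch with a hypothetical helper:

```python
# Crash-safe lock file via an advisory lock (POSIX only). If the holder
# crashes, the kernel releases the lock automatically -- no stale lock
# file to clean up, and no reliance on open files being undeletable.
import fcntl
import os

def acquire_lock(path):
    """Try to take an exclusive, non-blocking advisory lock on `path`.

    Returns an open fd on success (keep it open for as long as you hold
    the lock), or None if another holder already has it."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        os.close(fd)
        return None
```

Note that on Linux, `flock` locks belong to the open file description, so even a second `acquire_lock` on the same path from the same process is refused until the first fd is closed.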

Comments are closed.