This code would be a lot faster if it weren’t for the synchronization


This is a story from a friend of a friend, which makes it probably untrue, but I still like the story.

One of my colleagues jokingly suggested that we could speed up our code by adding these lines to our project:

#define EnterCriticalSection(p) ((void)0)
#define LeaveCriticalSection(p) ((void)0)

I replied, "You think you're joking, but you're not."

According to legend, there was a project whose product was running too slowly, so they spun off a subteam to see what architectural changes would help them improve their performance. The subteam returned some time later with a fork of the project that they had "tuned". And it was indeed the case that the performance-tuned version ran a lot faster.

Later, the development team discovered that part of the "tuning" involved simply deleting all the synchronization. They didn't replace it with lock-free algorithms or anything that clever. They just removed all the critical sections.
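
For context, here is a minimal sketch (with hypothetical function and variable names) of the kind of code those macros would quietly break. With the real API, the critical section serializes the read-modify-write of the shared counter; with the joke #defines in effect, both calls expand to ((void)0) and concurrent increments can be lost.

#include <windows.h>

static CRITICAL_SECTION g_lock;   // assumed to be set up once with InitializeCriticalSection(&g_lock)
static LONG g_counter;

void IncrementCounter(void)
{
    EnterCriticalSection(&g_lock);   // expands to ((void)0) under the joke macro
    g_counter = g_counter + 1;       // then this is an unguarded read-modify-write
    LeaveCriticalSection(&g_lock);   // expands to ((void)0) under the joke macro
}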

Comments (56)
  1. I was at a conference recently where a researcher got up and told us that his major performance improvement was due to removing synchronisation around a result variable being accessed by multiple threads… (to be fair, the writers did not have to read the variable or write to it in sequence, but it was still too big for an implicit atomic write)

  2. Jack B Nimble says:

    You can't argue with progress.

  3. Marc says:

    This is uncomfortably familiar.  We had a team "speed things up" by throwing a bunch of threads into the application.  Synchronizing everything, not so much…

  4. NotARealName says:

    Reminds me of a dev I know who shall remain nameless. He thought he didn't have to add any synchronization to an application with ~15 threads because the target hardware was single core.

  5. Gabe says:

    I thought the punchline was going to be that it was the proper thing to do. I've seen programs with completely unnecessary threads and synchronization. Sometimes it is faster to just remove the cruft, while still ending up with a correct program.

  6. Mc says:

    I thought you just go in and reduce the sleep value a little bit.  

  7. 12BitSlab says:

    Given that I am old enough to have voted for Lincoln — when he ran for the Senate — I have seen my share of code over the years.  It has been my experience that when code is not "fast enough", very rarely is the CPU used by synchronization the true root cause.  In most cases, performance problems come from poorly thought-out algorithms.

    I am still amazed at the so-called skill level of some developers.  Please note that it is not worse today than it was in the olden times.  My guess is that 10% of the people who work in IT truly understand what they are doing – today or in times past.  The rest are charlatans.

    As the old saying goes:

    Those who can, do.

    Those who can’t, teach.

    Those who can’t teach, teach PhysEd.

    Those who can’t teach PhysEd, work in IT for the government.

  8. nksingh says:

    A former Wall Street software type told me that their high-frequency trading algorithms work in a similar way. Lock-free, unsynchronized, and running the world's financial market.

  9. tain says:

    First rule of optimization: it doesn't need to be correct as long as it's fast.

  10. Sigh, I always hate it when the thing I care most about (stability) takes a distant second place to cost and speed in the minds of our customers.  I mean they want stability of course, but if we're going to be slower or more costly than our competitors, that's a whole different kettle of fish.  To stay in business, the company goal became "somewhat less buggy than our competitors" :-).

    What does one say to the fact that de-emphasizing quality control saved the company?

  11. jmdesp says:

    I once had code whose synchronization routine was broken like that. On one platform, some code that was supposed to run only once at the end of the application was actually running every time, soon after the start, and its biggest side effect was actually just to disable the multi-thread synchronization callbacks.

    It took us more than one year before we saw a case where a client actually had a problem with the synchronization disabled.

  12. Riataman says:

    That's why I always sprinkle sleep() calls all around my version 1.0. Makes it easy to make speed improvements in later versions.

  13. Frank says:

    While this story might be apocryphal, I did run into something similar.  Our project was handed off to a junior dev to profile and fix the biggest offenders he found.  Much later he reported his results: his profiling had determined that the synchronization methods were very slow and accounted for most of the execution time.  He was able to fix that by removing the synchronization methods, but sadly that appeared to uncover a host of other existing bugs he was working to fix.  He was flabbergasted when we suggested that removing the synchronization methods might be the cause of those bugs.

  14. j b says:

    Another (true) story about synchronization: For years, the Norwegian state railroads ran a "distributed" database across five computers in different towns, keeping track of all their carriages. In those days, communication was slow and expensive, so they didn't… Each regional database was operated independently. When carriages moved between regions, each regional database made an educated (more or less…) guess about what happened to it on the other side of the region border. Every two weeks, logs were exchanged among the regions to see which assumptions had been made, and if the guess was incorrect, a database guy went into the database to patch up, by hand, the data structures to correspond to reality!

    Since the great majority of carriages were running according to fairly static schedules, the frequency of wrong assumptions was quite low; there wasn't much manual database patch-up. But when called for, such procedures were an established part of the database maintenance work. The database had a common data model, with a distribution scheme clearly defining which data was affected.

  15. JM says:

    I had trouble on an SQL Server once where contention during one particularly heavily executed transaction was a major performance killer.

    I fixed the issue by taking an exclusive table lock at the beginning of the transaction and holding it until commit. Although only a single application could now actually do anything, overall performance improved.

  16. Aaron Murray says:

    jb – that Railroad story still happens all of the time, only it has a fancier name now: Eventual Consistency

  17. j b says:

    While removing synchronization may lead to catastrophes, locking only what needs to be locked, and only for as long as it needs to be locked, requires skill and experience. Seemingly, some professors have scared the sh*t out of their students about the dangers of access conflicts, so junior programmers more or less start any operation by locking more or less all the data, and keep it locked until they are done, "to be on the safe side". Which certainly can reduce system throughput to a crawl.

    My impression is that synchronization frequently is taught as an add-on, not as an integral part of either system design or algorithm design. Most students will establish all their fundamental concepts of programming before they are introduced to multithreaded or multiprocess designs. Some programming languages provide a far lower threshold for parallelism, with synchronization being an integral part of the language design (e.g. to start a thread, call something called a 'proc' rather than a 'procedure'), but few of these languages are in much use today. In most popular languages the syntactic overhead for process management and synchronization is so large that you can, sort of, understand why the lecturers don't want to overload the minds of freshman students with such complexities.

    In the long run, we probably would have gained a lot if parallelism and synchronization were introduced from day one as a fact of life, rather than being kept back as something "advanced" which the students don't have to worry about when learning basic concepts. Because they ARE basic concepts!

  18. JustSomeGuy says:

    Zeroth rule of optimisation: you can't get any less optimised than "wrong".

    Reminds me of something that happened to me very early in my career.  It was my first Windows, multi-user SQL application – way back in 1990/91.  My previous – brief – experience of DB work was with BTrieve, so both my colleague at the time and I knew nothing of "transactions" or implicit locking.  We had been told "With RDBMSs, that's taken care of for you."

    So when our code was deployed into a multi-user situation for the first time (pre-production beta, thankfully), we found that our app was riddled with locking issues.  After a crash course from an RDBMS consultant who told us how things really were, we spent a couple of weeks going through the app (already quite large at this point) adding the additional Commit calls we needed at appropriate points.

    A few weeks later, it happened that both of us were on leave at the same time, and our Project Manager said he would cover for us while we were away, dealing with any major bugs that came up etc.

    When we got back, our inbox started filling up with reports of the app locking and hanging – the exact same problems we fixed with the transaction handling.

    After looking at the code, we discovered that all our carefully inserted COMMIT calls had gone!!!

    In our absence, the Project Manager had been exploring our work and, seeing some calls to COMMIT that he thought were unnecessary, had decided to go through and "tidy up" our code by removing them all!!

    It was this incident that also taught us the lesson that your backup strategy is only as good as your most recent restore test.  We had no VCS at the time (we had no real excuse for that, thinking in our youthful naivety that our daily backups were a crude but sufficient alternative).

    I think it's fair to say we both learned a great deal on that particular job.  :)  

  20. Evan says:

    @j b: "In the long run, we probably would have gained a lot if parallellism and synchronization was introduced from day one as a fact of life, rather than being kept back as something "advanced" which the students don't have to worry about when learning basic concepts. Because they ARE basic concepts!"

    There's some push toward this — introducing concurrency early. Curricula are slow to change, and that's compounded by the problem that we don't really know how to teach programming well already, let alone parallel programming. And of course universities, even if they wanted to teach some language that has parallelism worked in from the ground up, face a lot of pressure from companies to teach more immediately practical skills, and so they stick with Java and the like. Even Python, the new-fangled choice for intro courses, isn't very good for teaching parallelism.

    @JustSomeGuy: "Zeroth rule of optimisation: you can't get any less optimised than "wrong"."

    Reminds me of a joke I've heard for which I forget correct attribution: Every program can be optimized, and every program has bugs. Therefore, every program can be optimized down to a single procedure which does nothing.

  21. foo says:

    Depending on your choice of framework, you don't need to hand-roll a typedef. Just (mis)use a built-in framework class. msdn.microsoft.com/…/65s59wwc(v=vs.110).aspx

  22. James says:

    Reminds me of a project I once worked on which was having performance and deadlock problems in the database. The "fix" was to add WITH NOLOCK to all select statements… with no consideration as to whether dirty reads were acceptable or not. Following this, there were countless non-reproducible bug reports of impossible data being generated.

  23. Henning Makholm says:

    Sounds like a perfect example of test-driven development. Remove the synchronizations one at a time and check each time that the code still passes the test suite. If somebody wants synchronization in the code, they'd better start by writing an automated test case that demonstrates the need for it!

  24. JM says:

    @Henning: if there's one thing that's very hard to catch with testing, it's synchronization bugs. What you're essentially arguing is that we care more about the performance of the code than about getting it correct, because this errs on the side of not having synchronization.

    At the very least you'd need something more advanced than your regular testing frameworks, because an undesirable concurrent execution may be possible to demonstrate in theory, but impossible in practice on the hardware you happen to be using. You'd need to instrument the code just to make it appear in the first place — yet "it works on this machine" is not a valid argument for leaving out synchronization.

  25. Smit-Tay says:

    What's not indicated in the story is whether the synchronization was needed in the first place.  Maybe that really was a valid solution, we, the readers of this story, have no way to know.

    One thing that is certain is that most developers use multi-threading far too eagerly.  In my experience, over 25 years' worth, 99% of all applications written don't require any more than 2 threads.  One for the UI, and one to do work.  If your app doesn't need a UI, it probably doesn't need more than a single thread either.

  26. Neil says:

    Well of course the obvious example of this is the single-processor version of the kernel. I was particularly gratified when I added a second processor to a PC and Windows proudly proclaimed that it had found new hardware and upgraded me to the multiprocessor kernel.

  27. Danny Moules says:

    @Henning: Good luck making bulletproof deterministic tests for all possible race conditions. Excellent as your particular hammer may be, this isn't the nail you're looking for.

    @Neil: Um… multitasking (and the problems it creates) occurs regardless of how many processors you have on Windows.

    @JustSomeGuy Stealing that quote!

  28. Anon says:

    Therac-25 used the "no locks" solution.  On a radiation therapy machine used by hospitals.  The results were what you'd expect – fatal.  See en.wikipedia.org/…/Therac-25

  29. foo says:

    @Anon. The Therac-25 problem was due to bugs, not due to someone trying to optimise the program. Duh.

  30. foo says:

    [Neglecting the hardware reductions due to cost or whatever optimisation] Sorry – didn't consider that at first.

  31. ulric says:

    We did that in one of our apps: stubbed out all the interlocked increment/decrement calls in an IUnknown implementation that every object used. But we didn't really need it, not for every object anyway. It was just one of those things where a dev wants to write the most future-looking code evar.  However, the actual problem IMHO was that we shouldn't have been calling AddRef and Release so much that it would come up in VTune as the top bottleneck in a performance-critical scenario.  The whole approach was probably wrong.

  32. Gabe says:

    Danny Moules: When the kernel is running on just a single processor, nothing else can be running simultaneously, so it doesn't need locks. Of course it can still be interrupted by hardware, so it just has to disable interrupts when it enters critical sections.

    In other words, the uniprocessor kernel just defines EnterCriticalSection to be the CLI instruction and LeaveCriticalSection to be STI. This makes it faster and smaller.

  33. alegr1 says:

    @Gabe:

    This is incorrect.

    The uniprocessor kernel (which doesn't exist anymore, by the way) has the same implementation of Enter/LeaveCriticalSection. Its implementation of spinlocks, however, only sets the IRQL, and doesn't bother with setting/clearing the spinlock variable. In the big picture, this "optimization" is not worth it, because an interlocked operation on a single-proc Pentium and later is as fast as a non-locked one, and only has a trivial cost.

  34. Joshua says:

    > Thread A can be interrupted by timer expirations after its read but before its write.  In the meantime, while Thread A is suspended, Thread B can read the (old) value (that has not been incremented by Thread A), increment it, and write it back.  Then Thread A is dispatched again, when it writes its incremented value back from its own copy.  

    How did you split the assembly instruction inc [address], which is what my compiler generates for increment?

    [Your analysis may be true for a single instruction on a uniprocessor system, but in the general case, the operations are multi-instruction. -Raymond]
  35. j b says:

    The first 32-bit CPU I ever programmed (well.. I had been programming 36-bit ones earlier) had a user-level instruction SOLO, disabling all interrupts for up to 255 clock cycles, or until a TUTTI instruction was executed. If no TUTTI was executed, a non-maskable exception was raised in the running process on the 256th clock cycle. Each of these instructions took one clock cycle to execute.

    You can do quite a lot in 255 clock cycles (most non-FP instructions were single cycle on this CPU). If the entire update couldn't be completed in that time, you could certainly test and set a semaphore (even maintain a queue for it). With no privilege change and no MMS-related changes involved, just two single-cycle instructions made for a very low-cost (in terms of CPU load) synchronization mechanism. It is a pity that instructions like SOLO/TUTTI are not available on every machine architecture today.

    (For the curious ones: I am talking about a machine called the ND-500, introduced to the market in 1980. Obviously, I/O was handled by DMA and interrupt handlers which could buffer input for 256 cycles or more; low-latency interrupts were provided through programmable controllers, so no system component was dependent on the CPU giving it attention in less than 256 clock cycles.)

  36. BingItOn says:

    Little things we ignore in life turn out to be most important. In the same way, your blog post is talking about the "presence of synchronization" being the culprit, but in reality it's the "bad algorithm" that's causing the problem.

  37. dsn says:

    @joshua – on a multiprocessor, the inc instruction turns into the uops

    load [address]

    inc

    store [address]

    which can be interleaved.  To test, just run two threads which each increment n times simultaneously.  When I did, I didn't get 2n in the counter at the end.
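
    A minimal sketch of that experiment (hypothetical names; the counter is volatile so the compiler keeps a per-iteration load/inc/store, and the data race is deliberate):

    #include <cstdio>
    #include <thread>

    volatile long counter = 0;        // deliberately unsynchronized shared counter

    void bump(int n)
    {
        for (int i = 0; i < n; ++i)
            counter = counter + 1;    // load, inc, store -- interleaves with the other thread
    }

    int main()
    {
        const int n = 1000000;
        std::thread a(bump, n), b(bump, n);
        a.join();
        b.join();
        std::printf("expected %d, got %ld\n", 2 * n, counter);   // typically prints less than 2n
        return 0;
    }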

  38. steveg says:

    Reminds me of the time I told the client that their application was not, in fact, running on all 5 (very expensive) servers, it was running on just 1, and it could not work on more than 1 server due to complete absence of synchronisation between servers. It took a while before they believed me. Exhibit A: Task Manager on servers 2->5.

    "But it used to work on all of them." Not possible. "But…". Not possible. "But…". Not. Possible.

  39. Joshua says:

    @dsn: Oh right I forgot it needs to be lock inc [address] (which it is when I need to force it to be safe).

  40. David Walker says:

    @Gabe:  you are completely wrong.  Consider one thread reading a value from storage, incrementing it, and writing it back, while another thread reads the same value, increments it, and writes it back.  On a single-processor system.  Disabling all interrupts is not always possible or desirable; should interrupts from the keyboard, or I/O completion, or timer expirations, be halted while a thread is running?  What about runaway threads?

    Thread A can be interrupted by timer expirations after its read but before its write.  In the meantime, while Thread A is suspended, Thread B can read the (old) value (that has not been incremented by Thread A), increment it, and write it back.  Then Thread A is dispatched again, when it writes its incremented value back from its own copy.  

    Thread B's increment has been LOST; its incremented value was overwritten by Thread A.

    You need to realize that single-processor systems have multiple things going on at once, and you shouldn't (or can't) disable all interrupts in user code.  Critical sections are designed to prevent this.  Threads A and B *each* need to obtain a lock of some kind before reading the value, and release that lock AFTER each has written the value back.

  41. j b says:

    steveg,

    Do you know for sure that the application was not differently structured in an earlier version?

    After all, some (maybe most) applications have some parts that are completely independent of other parts, so they don't NEED any synchronzation. Old *nix programmers, from before *nix got decent threading mechanisms (well, do they have decent ones, even today?) were often good at splitting up their applications into *nix processes that didn't share working data structures. Synchronization had to use the file system, which required far more resources than simple critical regions or semaphores. You completed your part of the work before you wrote it to a file/pipe to another process, and very often, shared data had one producer, one consumer (pipe fashion).

    If the application had old *nix origins, but a more recent refactoring gathered five *nix-style processes into one process with five threads, then it could be that the old version could utilize five CPUs (assuming that there was, e.g., pipe communication between them) while the new version cannot.

    In the old days of the VAX 730 (known as "Turtle VAX"), switching between VMS processes could take up to 100 ms (yes, one tenth of a second!). A friend of mine who worked on a DBMS on a Turtle VAX significantly sped up the system by merging three VMS processes into two, to avoid those process switches. If your application ran *on a single server*, redesigning it from five processes to five threads might have sped it up significantly, but it would kill the opportunity to distribute it across five servers.

  42. Zan Lynx' says:

    j b,

    "Old *nix programmers, from before *nix got decent threading mechanisms (well, do they have decent ones, even today?)"

    Yes. Some bits are much faster than WNT. I am always annoyed by how slow an inter-process Mutex is. A shared pthread_mutex on Linux is about 5x faster and doesn't usually take a kernel syscall hit. Not to mention the WNT lack of condition variables (in XP which everyone must still support, sigh) or inter-process read-write locks.

    [Actually, inter-process read/write locks do exist. They're called file region locks. -Raymond]
  43. Cavaler says:

    As horrible as it may seem, this is indeed a valid solution in at least one case I encountered myself:

    Developing a single-threaded application with MS Visual Studio 2005 or later, which doesn't have a single-threaded runtime at all now, only a multi-threaded one. And I did not really need any std::_Lockit.

  44. Medinoc says:

    @JM, @Danny Moules regarding Henning Makholm's comment:

    I'm not in Henning's head, but in his shoes I'd answer with the "That's the Joke" reaction image.

  45. alegr1 says:

    @ZanLynx:

    I am always annoyed by how slow an inter-process Mutex is. A shared pthread_mutex on Linux is about 5x faster and doesn't usually take a kernel syscall hit.

    How do you implement inter-process sync object without having to handle it in kernel mode? Shared memory?

  46. alegr1 says:

    @ j b:

    Unfortunately, SOLO/TUTTI would only work for single processor systems. You can't have protection by just using some glorified CLI/STI.

  47. j b says:

    Zan Lynx',

    There is a lot more to "decency" than mere execution speed – such as syntactical cleanliness and simplicity, consistency with other architectural elements, protection against incorrect use (/non-use) etc. And there is more to threading than just semaphores.

    I got my parallelism education in a time when mutexes were not synchronizing mechanisms, but building blocks (at a level comparable to assembly coding) for _creating_ synchronizing mechanisms: critical regions, monitors, rendezvous, buffers. This was in the age of Brinch Hansen and Concurrent Pascal, and the only *nix provision for synchronization was to use the (non)existence of a file as a "semaphore". A few years later, semaphores did creep into *nix, and those of us who had come to know all sorts of high-level mechanisms expected *nix development to progress to mechanisms we had considered basic since around 1980. It never came; *nix people simply replaced the binary file "semaphore" with an in-memory binary semaphore, and never established regions and monitors as the standard way of doing things. Sure, you can build them yourself from the primitives, but that is sort of like building your own OO mechanisms from traditional C structs…

    Furthermore: Being faster than WNT doesn't make you the king of the world, especially when we are talking about primitives at the very lowest level. (You would never argue that the speed of one specific machine instruction makes one machine faster, or "more decent", than another.) I am quite sure, using the mechanisms I mentioned in an earlier posting (about 8 entries before this one), that a machine with the SOLO and TUTTI instructions could implement mutexes an order of magnitude faster than the Linux pthread_mutex. If your programming system had decent(!) sync mechanisms, such as the CHILL language, the compiler could generate the code for protecting its critical regions or monitors inline, in user-space code. But as long as you rely on the application programmer to make the association between a semaphore and what it protects, you cannot easily do much optimization, such as not setting the actual lock until immediately before variables are accessed and releasing it immediately after the last access. (Certainly, the use of, say, SOLO/TUTTI instructions is in principle independent of, say, compiler-supported regions/monitors, but combining the two could give you synchronization at a very low cost.)

  48. Mike Dimmick says:

    alegr1: Jeff Richter's 'Programming Server-Side Applications for Windows 2000' book has sample code implementing a shared-memory critical section object. He calls it an 'optex'. I'm using it to protect another shared-memory structure for old-style performance counters (also borrowed from that book).

  49. j b says:

    alegr1,

    Certainly true, but for thread synchronization it was very valuable. Even if they do not solve all sorts of problems, that shouldn't be a reason to reject them in cases where they might have an essential effect.

    These instructions could still be valuable as building blocks for making a distributed sync mechanism. But once you go into distributed synchronization requiring message exchanges, you raise the minimal cost by a few orders of magnitude anyway.

    Could any mechanism based on atomic operations (i.e. various variants of turning off the interrupt system) by itself serve in a distributed system? This also includes the case when the kernel (or whatever handles the sync mechanism) runs at a higher hardware priority level: while it prevents local processes from interfering with the semaphore update, it will not be communicated to other CPUs.

    As long as the processors have physically common RAM, the CPU and memory controller could offer some uncached and memory locked read&set or increment instruction, where one CPU may ask the memory controller to delay accesses from other CPUs until the composite access is completed. That wouldn't help you if the CPUs don't have common memory. SOLO might also set similar flags to the memory controller. (In fact, you could have up to four ND-500 CPUs working on the same RAM, but I never checked out the effect of the SOLO instruction – I never worked on a multi-CPU system in those days.)

  50. alegr1 says:

    @Mike, ZanLynx:

    With cooperating processes, one can come up with very elaborate shared-memory synchronization, including a shared-memory equivalent of CRITICAL_SECTION.

    But a KMUTANT solves the generic problem of synchronization for different processes, even across the security boundary, which is not possible with shared memory. It also solves the priority inversion problem, which CRITICAL_SECTION and equivalents don't do.

  51. David Walker says:

    As an alternative example, if two threads each need to read a value from storage, increment it by four or eight, and write it back… most machine languages don't have a single instruction primitive for that.

  52. steveg says:

    @j b: yes, the application never worked properly — it had been in production for a year when the old vendor lost the contract to the company I worked for. I discovered it didn't/couldn't work in Test, so I went looking at Production to find out what was going on, checked the logs and confirmed servers 2-5 had been sitting idle for a year. FWIW it was Windows not Unix.

  53. David Totzke says:

    @Anon – thanks for the reminder of the Therac-25 story.  I remember reading the full account of this by Nancy Leveson some time ago.  It is a very well written article and a fascinating story.  I would recommend giving it a read.  The bullet-point findings don't do the story justice.  IIRC, what seem to be obvious bugs to us now (irrespective of the lack of hardware lockouts) were in fact quite subtle and even examination of the code seemed to confirm that what was happening was "impossible".  

    Of course, I myself have created many "impossible" bugs in my career; sometimes even before lunch on a good day.  Fortunately, the only biological damage was to my ego. :)

  54. alegr1 says:

    As an alternative example, if two threads each need to read a value from storage, increment it by four or eight, and write it back… most machine languages don't have a single instruction primitive for that.

    This is where InterlockedCompareExchange helps a lot.

  55. alegr1 says:

    increment it by four or eight, and write it back

    In IA32, XADD (InterlockedExchangeAdd) would do.
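
    For example, a minimal sketch (hypothetical variable name) of both approaches for atomically adding 8:

    #include <windows.h>

    static volatile LONG g_value;

    void AddEightWithXadd(void)
    {
        InterlockedExchangeAdd(&g_value, 8);    // a single lock xadd on x86
    }

    void AddEightWithCompareExchange(void)
    {
        LONG oldValue, newValue;
        do {
            oldValue = g_value;                 // snapshot the current value
            newValue = oldValue + 8;            // compute the desired value
        } while (InterlockedCompareExchange(&g_value, newValue, oldValue) != oldValue);
    }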

Comments are closed.
