Anybody who says "I can write correct multi threaded code" probably should be saying "I don't test my multi-threaded code".
It is very difficult to write correct multi-threaded code.
One way to appreciate this is various "find-the-bug" pop quizzes that occasionally come up.
Another approach that I'll go after here is to look at the state space.
If single-threaded apps have O[n] execution paths where n = number of atomic operations, just two threads have O[n!] paths or worse. (It's not O[n*n], or even polynomial - see below). And n is usually a very large number.
Each of those execution paths is an angle to hit your program and find a bug. Maybe you can test against O[n], but you're delusional if you think you can protect against O[n!] or worse.
15 is a small number, and 15 factorial is 1,307,674,368,000 = ~1.3 trillion. Testing a million cases is hard enough. "Trillion" may sound like an unrealistic number, but listen to some of the ship-stopper stress bugs that show up in product's end games followed by the "what are the odds that could ever have actually happened" comments, and it's not so unrealistic.
Practically, that means multi-threaded apps have so many states that you can't possibly keep track of them all in your head. And even if you do, the next guy that goes and changes a line of code just ruined your model.
Counting the execution paths
Suppose you have a function that executes sequence of atomic operations: A0, A1, A2, A3.
I want to just compare single-threaded and multi-threaded cases, so for now, let's ignore loops, conditionals, etc. because that's common across both and so it just complicates the comparison. (This simplification actually downplays the complexity of adding additional threads). In this model, there's only 1 code-path through this sequence of N states: A0, A1, A2, A3.
Now suppose you have a second function B0, B1, B2, B3 executing on another thread.
The threads can interleave these operations in any order. The convenient case would be (A0,A1,A2, A3, ... B0,B1,B2,B3). But maybe it's something goofy like: (A0,A1,B0,A2,B1,B2,A3,B3). If the bug only occurs when the pattern (A1,B0,A2, B1) is executed, good luck finding it.
You can see how much more complex just simple execution gets with threads. Now imagine how much more complicated threads make loops, conditionals, etc.
How many states?
Locks don't make your code safe. They just reduce the number of states and hedge your bets. At one extreme, 1 giant lock around everything can reduce you to 1 state per thread. If you put 4 states under 1 big lock (eg, A0...A3), and you put that same lock around every other state (eg, B0...B3), then you've reduced those 4 states to 1 state, thus eliminating 3 states.
You likely don't control the number of states. An operation that appears atomic (like a memory read) may actually be multiple states. Maybe you're looking at the source-line and the compiler split it out into multiple states. Or maybe there are windows where you communicate with the file-system, or other processes - and so they can be effectively injecting states into your state space. Or maybe the hardware has multiple states (usually around memory models). That's why there's InterlockedIncrement(&i) instead of just doing i++.
Threading vs. Reentrancy:
Now threading models aren't the only way to rack up states. Single-threaded models (like UIs) that queue messages can create a ton of states too, and those tend to create nasty reentrancy bugs. And there will be times when another thread would actually produce fewer overall states then a single-threaded model. Just be wary.
What if the states independent from each other?
Granted many of the states may appear to be independent from each other. So one may claim that O[N!] is way to pessimistic. But unless you have a very thorough isolation model, it tough to prove the states can't impact each other.
For example, states A0 and B1 may not directly interact, but maybe A0 impacts A1 which impacts B0 which impacts B1. Maybe A0 impacts how much memory A1 allocates, and that impacts whether an allocation in B0 fails (especially if there's a process wide memory allocator), and then the bug is in B1 handling the out-of-memory failure case.
And even if most of your states actually are isolated, factorial growth is so fast that you only need a small number before you're doomed.
See also: Measuring how difficult a threading bug is.