Ok, I admit it. I oversimplified, again. But this is the last time, I promise!
To obtain maximum parallelism between different stages in a pipeline, you should ideally buffer up an entire frame of data in between each stage. This is exactly what happens when sending data from CPU to GPU, but the buffers between the internal GPU pipeline stages are much smaller, holding at most a few hundred triangles or pixels.
This means the GPU bottleneck can change over the course of a frame. For instance, you might find your terrain rendering limited by texture fetches, while animated characters are limited by the vertex shader, and bloom postprocessing is limited by pixel shading.
If you use the techniques from my previous post to examine such a game, you will find it speeds up when you shrink the textures (because this makes the terrain rendering faster), and also when you simplify the vertex shaders (because this makes the character rendering faster). Performance of the game as a whole is affected by more than one thing, but each individual piece still has just one bottleneck. We would see no performance gain if we optimized our terrain vertex shader, or compressed the character textures, for instance.
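To make that concrete, here is a toy model in Python. It assumes each piece of the frame costs as much as its slowest pipeline stage, and that the pieces simply add up with no overlap between them. The stage names and millisecond figures are invented for illustration, not measurements from any real game.

```python
# Toy model: each piece of the frame is as slow as its slowest stage.
# All costs are made-up milliseconds, chosen only to illustrate the idea.
frame = {
    "terrain":    {"vertex": 1.0, "texture": 4.0, "pixel": 2.0},
    "characters": {"vertex": 3.0, "texture": 1.0, "pixel": 1.5},
    "bloom":      {"vertex": 0.1, "texture": 1.0, "pixel": 2.5},
}

def bottleneck(stages):
    """Name of the slowest stage for one piece."""
    return max(stages, key=stages.get)

def piece_time(stages):
    """A piece takes as long as its slowest stage."""
    return max(stages.values())

def frame_time(frame):
    """Assumption: pieces run back to back with no overlap."""
    return sum(piece_time(stages) for stages in frame.values())

def with_scaled_stage(frame, piece, stage, factor):
    """Copy of the frame with one stage of one piece made cheaper or dearer."""
    copy = {name: dict(stages) for name, stages in frame.items()}
    copy[piece][stage] *= factor
    return copy

print(frame_time(frame))                                                   # 9.5
print(frame_time(with_scaled_stage(frame, "terrain", "texture", 0.5)))     # 7.5
print(frame_time(with_scaled_stage(frame, "characters", "vertex", 0.5)))   # 8.0
print(frame_time(with_scaled_stage(frame, "terrain", "vertex", 0.5)))      # 9.5
print(frame_time(with_scaled_stage(frame, "characters", "texture", 0.5)))  # 9.5
```

Halving the terrain texture cost or the character vertex cost speeds up the frame, while halving the terrain vertex cost or the character texture cost changes nothing, because neither was the bottleneck for its piece.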
To diagnose such a case, we must split the game into pieces and examine each part individually. But this is easier said than done! Sherlock Holmes is of limited use here: instead we must call in Dirk Gently to understand the fundamental interconnectedness of all things.
For instance, you might think we could examine terrain performance by temporarily commenting out the character and bloom rendering, letting us measure the terrain in isolation. Trouble is, if we make that change our game is likely to go from being GPU bound to CPU bound, at which point we can no longer experiment with GPU performance at all!
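A rough sketch of why this happens, assuming the CPU and GPU run in parallel so the slower of the two sets the frame time. The costs are invented milliseconds, and `frame_time` and `bound_by` are hypothetical helpers rather than anything from a real API.

```python
def frame_time(cpu_ms, gpu_pieces):
    """CPU and GPU work in parallel, so the slower one sets the frame time."""
    return max(cpu_ms, sum(gpu_pieces.values()))

def bound_by(cpu_ms, gpu_pieces):
    """Which processor is the frame waiting on?"""
    return "GPU" if sum(gpu_pieces.values()) > cpu_ms else "CPU"

cpu_ms = 10.0  # made-up CPU cost per frame
full_scene = {"terrain": 6.0, "characters": 5.0, "bloom": 4.0}
terrain_only = {"terrain": 6.0}

print(bound_by(cpu_ms, full_scene), frame_time(cpu_ms, full_scene))      # GPU 15.0
print(bound_by(cpu_ms, terrain_only), frame_time(cpu_ms, terrain_only))  # CPU 10.0

# With everything else commented out, even halving the terrain's GPU cost
# makes no difference to the framerate, so there is nothing left to measure.
print(frame_time(cpu_ms, {"terrain": 3.0}))                               # 10.0
```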
A better technique is not to remove anything, but instead to add a loop that repeats the same terrain rendering 100 times in a row. The framerate will plummet (if it doesn't, that proves terrain rendering must be an insignificant part of the overall performance profile) and this new lower framerate will be entirely dominated by terrain. By increasing the amount of terrain being drawn, we can dwarf the other scene elements into insignificance, letting us use the techniques from my previous post to measure the terrain while basically just ignoring everything else.
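In code, the idea might look something like the sketch below. The `draw_terrain`, `draw_characters`, and `draw_bloom` callbacks are hypothetical stand-ins for whatever your renderer actually calls, and the arithmetic uses made-up costs of 6 ms for terrain and 9 ms for everything else. It also assumes the CPU cost of issuing the extra draw calls is small enough that the game stays GPU bound.

```python
def draw_frame(draw_terrain, draw_characters, draw_bloom, terrain_repeats=1):
    """Draw one frame, repeating only the terrain pass."""
    for _ in range(terrain_repeats):
        draw_terrain()
    draw_characters()
    draw_bloom()

def terrain_share(terrain_ms, others_ms, repeats):
    """Fraction of GPU time spent on terrain when it is drawn `repeats` times."""
    terrain_total = terrain_ms * repeats
    return terrain_total / (terrain_total + others_ms)

for repeats in (1, 10, 100):
    print(repeats, round(terrain_share(6.0, 9.0, repeats), 3))
# prints:
# 1 0.4
# 10 0.87
# 100 0.985
```

At 100 repeats the terrain accounts for more than 98% of the GPU time in this model, so any experiment you run on the frame as a whole is effectively an experiment on the terrain.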
If you split a game into pieces and measure each one in isolation, then put everything back together and measure the final result, it is common to find the whole runs faster than the sum of its parts. This is because, even though the GPU pipeline buffers are limited in size, they do offer some parallelism from one task to the next. You might find that out of 10 characters, 9 are limited by vertex shading, but the vertices for the first are being processed in parallel with the last few hundred terrain pixels, so you are effectively getting that first character for free. It is also common that the amount of such parallelism will change in unpredictable ways if you swap the order in which things are drawn.
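Here is a deliberately tiny simulation of that effect. It assumes a two-stage pipeline (vertex, then pixel) with a small buffer between the stages, and it splits each piece into a handful of batches with invented costs. With only four batches per piece the overlap is wildly exaggerated compared to a real frame, but the shape of the result is the point.

```python
def pipeline_time(batches, buffer_size=2):
    """Time for a list of (vertex_ms, pixel_ms) batches to clear both stages.

    Assumption: the buffer holds at most `buffer_size` batches (at least 1)
    that have finished vertex work but not yet started pixel work.
    """
    vertex_free = 0.0   # when the vertex stage can accept the next batch
    pixel_done = 0.0    # when the pixel stage finishes its current batch
    pixel_starts = []   # pixel-stage start time of every batch so far
    for i, (vertex_ms, pixel_ms) in enumerate(batches):
        finished_vertex = vertex_free + vertex_ms
        if i >= buffer_size:
            # Stall until batch (i - buffer_size) has left the buffer.
            leaves_vertex = max(finished_vertex, pixel_starts[i - buffer_size])
        else:
            leaves_vertex = finished_vertex
        start_pixel = max(leaves_vertex, pixel_done)
        pixel_starts.append(start_pixel)
        pixel_done = start_pixel + pixel_ms
        vertex_free = leaves_vertex
    return pixel_done

terrain = [(1, 4)] * 4      # cheap vertices, expensive pixels
characters = [(3, 1)] * 4   # expensive vertices, cheap pixels

print(pipeline_time(terrain))                # 17.0
print(pipeline_time(characters))             # 13.0
print(pipeline_time(terrain + characters))   # 21.0
print(pipeline_time(characters + terrain))   # 29.0
```

Measured separately the two pieces add up to 30, but drawn together they take 21 or 29 depending on the order, because the character vertices get processed while the last terrain pixels are still being shaded.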
There are no hard and fast rules here. Measure everything you can. Try to break out different parts of your rendering and measure them in isolation. Don't expect everything to make perfect sense: all things are interconnected when they run in parallel, so a seemingly trivial change in one place can have unexpected performance implications somewhere entirely different. It often takes a leap of intuition to look at a collection of measurements and figure out which contain important clues, which are just side effects of the measurement process, and which were caused by the fundamental weirdness of parallel processing.