Stress-driven development


Grady Booch notes a great post by ckw on Stress-driven development.
I’ve long been an advocate of reversing the “premature optimization is evil” mantra when it comes to the design of system architecture.
 
In my recent past life in building enterprise systems, my top three pain points were:
 
  1. Getting enough time with the right people in customer organizations to get reasonable specs.
  2. Deploying complex systems reliably and repeatably.
  3. Being hit by non-functional requirements (especially perf goals) late in the lifecycle.
 
My old group got pretty darn good at (2), along with some of our partners. However, (1) and (3) are still bugbears as far as I’m aware.
 
Now in many ways, (3) is just a special case of (1), and I’m here to tell you that I don’t have many good answers to (1) – I’m not sure that I believe the agilists do either, to be honest.  However, I do think there is a better way to reach agreement earlier on (3), simply because a supplier has the right domain knowledge to be able to put one possible answer on the table.
 
Typically the only performance goals you can get from your customer without spending a lot of time with a lot of hard-to-get-hold-of people will be of the form “The whole population of Ecuador must be able to transact concurrently with a response time of 1 microsecond” or “Erm, well, we’ve got 300 users on the current system and it seems quite slow”.  And if you spend that time without putting anything more concrete and useful on the table, it may be too late by then to design your architecture to meet the goals.
 
I found the best starting point for driving a realistic discussion was to go in with realistic numbers for what is achievable at the expected cost.  What you need to do is build out a representative rig of the system architecture, including the packaged products and middleware involved in the solution, and drive it via a stress rig.  Simulate things like WANs with whatever tools you have available – make worst-case guesses if you need to – better yet, build out a link.  Do your best to get the actual hardware you’ll deploy onto.  Deploy just enough custom code to reach all the way through the architecture without implementing any real functionality.
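
To make this concrete, here is a minimal sketch of the kind of stress driver I mean – the endpoint URL, thread count and request count are placeholders, and the “ping” page is assumed to be a thin pass-through that touches every tier without doing any real work:

```csharp
// Minimal load driver: fire concurrent no-op requests through the full stack
// and record end-to-end latency. All names and numbers are placeholders.
using System;
using System.Diagnostics;
using System.Net;
using System.Threading;

class StressDriver
{
    const string Endpoint = "http://test-rig/ping"; // hypothetical pass-through page
    const int Threads = 50;
    const int RequestsPerThread = 200;

    static long totalMs;
    static long requests;

    static void Main()
    {
        Thread[] workers = new Thread[Threads];
        for (int i = 0; i < Threads; i++)
        {
            workers[i] = new Thread(Worker);
            workers[i].Start();
        }
        foreach (Thread t in workers) t.Join();
        Console.WriteLine("Requests: {0}, mean latency: {1:F1} ms",
            requests, (double)totalMs / requests);
    }

    static void Worker()
    {
        WebClient client = new WebClient();
        for (int i = 0; i < RequestsPerThread; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            client.DownloadString(Endpoint); // web tier -> app tier -> database and back, no real work
            sw.Stop();
            Interlocked.Add(ref totalMs, sw.ElapsedMilliseconds);
            Interlocked.Increment(ref requests);
        }
    }
}
```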
 
Now what numbers do you have?  Can they get worse than this?  Sure – almost all the custom code you write will degrade your performance from these numbers.  But you have fairly fine-grained control over this code and you can do optimization late in the day.  Can you optimize the hardware, middleware, packages and network infrastructure?  Sure, but that is typically a much coarser-grained process and much harder to do late in the cycle when hardware and licenses have already been procured.
 
Talk to your customer about the numbers you’re getting with this proposed architecture.  Going in with a line like “With the current plan it can’t get better than X” certainly opens a few ears.  If they really need more, you’ve got time to work out how to eliminate a network hop with a cache or double the number of disk spindles somewhere.
 
And of course, once you have this infrastructure, keep it running permanently to catch the points in development where you dip below your targets.
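
One way to keep it honest is to wire a pass of the stress rig into the nightly build and fail loudly when the numbers slip – a sketch only, assuming a hypothetical RunStressPass helper and an agreed target figure:

```csharp
// Hypothetical nightly gate: fail the build when measured latency drifts above target.
using System;

class PerfGate
{
    const double TargetMeanMs = 250; // assumed figure agreed with the customer

    static void Main()
    {
        double measured = RunStressPass(); // placeholder: run the stress driver and return mean latency
        if (measured > TargetMeanMs)
        {
            Console.Error.WriteLine("Perf regression: {0} ms > {1} ms target", measured, TargetMeanMs);
            Environment.Exit(1); // break the build so the dip is noticed immediately
        }
    }

    static double RunStressPass()
    {
        // Stub for illustration; a real rig would drive the deployed stack and aggregate the timings.
        return 0;
    }
}
```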
 
Comments (9)

  1. Dennis says:

    Some C++ guy once said yeah, but also keep in mind that "belated pessimization is the leaf of no good."

  2. Jim Arnold says:

    As an ‘agilist’, I can confirm that we don’t have many good answers for (3), other than to make our architecture flexible enough to support last-minute optimisations (I have witnessed post-hoc optimisation phases which were very successful because the system was so easy to modify).

    I absolutely agree that some ‘up-front’ performance analysis is important, as long as it is informed by the requirements. Similarly, I think customers should be encouraged to think about performance as just another requirement, and that it is highly irresponsible for developers to release poorly-performing software, wait for the customer to come back and complain, and then say ‘hey, performance wasn’t in the spec…’

    Jim

  3. Damon Carr says:

    Jim,

    I would actually disagree with you on your point that ‘agilists’ do not have a good answer for (3). I would say we have the best answer available, in fact, and I will explain why.

    “As an ‘agilist’, I can confirm that we don’t have many good answers for (3), other than to make our architecture flexible enough to support last-minute optimizations (I have witnessed post-hoc optimization phases which were very successful because the system was so easy to modify). “

    Being Agile means we know that optimizations should occur towards the end of iterations, or we would be predicting the future and possibly optimizing the wrong thing, which is waste – the very thing we are trying to reduce if not eliminate. Therefore we must stress test to find bottlenecks EVERY ITERATION. We know this because we build iterations with promised production quality. We have loads of Unit Tests, Continuous Integration, etc., but those are functional. If we fail to stress test we cannot know whether we met our promise of a production-quality code base after any given iteration. Many agile teams fail to understand this and wait until the VERY end. By then it is probably too late to fix anything serious without major disruptions. I have missed scope in iterations because of focus on a serious scalability problem, but based on your implicit contract you must – there is no choice here. As an agilist I am assuming you know this.

    In my Agile process, which will finally make it into book form this year, let me walk through what happens. (I am leaving out many major innovations that are not relevant to performance, including the Probabilistic Risk Assessment and the Core Agility Ratio Calculation – both tracked metrics over each iteration, and both can cause an emergency fix-it session in the same way scalability can. The solutions are different, however.)

    Let me walk you through an Iteration so this makes more sense to everyone:

    1) We do two-week fixed iterations – no flexibility; they are timeboxed.

    2) Around Tuesday of the second week we put our dedicated QA person on stress tests. (He is also running the system-level regression tests NIGHTLY, starting right after we have a build that does something, and he runs those ever-growing SYSTEM regression scripts every night, writing scripts one step behind the developers. When a pair claims completion on their stories/tasks, the work is reviewed for functional correctness and the scripts are written – captured, actually, with some tweaks – but again these are purely functional and don’t help in performance assessment.)

    3) Stress tests start on the Tuesday of week 2, and bottlenecks are discovered and discussed in a serious way starting at the Wednesday daily stand-up meeting. Serious problems are fixed, and scope might get bumped to the next iteration because of it. This is the key to why Agile has a better story than others. I could have developers stress test before checking in code, after all-green on TDD, but I need SYSTEM tests, not unit level. They could wait for everyone to check in their code, but then it is a one-man job. It makes no sense for developers to do this on code that is not the final build for the day, as they typically do a ‘get latest’ in the morning (although they are configured to get latest at the start of Visual Studio every time).

    We always architect inter-layer communication to be stateless. This could be a load-balanced ‘web farm’ for horizontal scalability, or the use of a load balancer for TCP or HTTP (depending on what is being used – if .NET to .NET then obviously TCP; if .NET to J2EE, for example, we may not have any choice).

    If it is WinForms or ASP.NET and the business tier is out of process on a physically separate set of servers (a real performance hit that requires a fairly coarse-grained design of your API), it is still effectively a web farm, because we DEMAND that we architect our inter-tier communications to be stateless. That is part of the process! Just like Refactoring to Design Patterns. You might say ‘But that is architecture, not process’, and I would say ‘TDD is architecture, as it drives design; Refactoring and RTP are architecture that is just formalized as actions.’ Well, you can formalize the actions of architecting for stateless communications in the same way (unless for whatever reason it is not possible). Do you follow?
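
    To make the stateless mandate concrete, here is a minimal sketch of what such an inter-tier contract might look like (the names and types are purely illustrative, not part of agilefactor): every call carries its own context, so any server behind the load balancer can handle any request.

    ```csharp
    // Illustrative stateless business-tier contract: no per-client state lives on the server.
    using System;

    public interface IOrderService
    {
        // The caller's token and all inputs travel with each request;
        // the server holds nothing between calls, so any node can service it.
        OrderResult PlaceOrder(Guid sessionToken, OrderRequest request);
    }

    [Serializable]
    public class OrderRequest
    {
        public int CustomerId;
        public int ProductId;
        public int Quantity;
    }

    [Serializable]
    public class OrderResult
    {
        public int OrderId;
        public bool Accepted;
    }
    ```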

    So I will come back to that as it is core to Agile success.

    Let’s say for performance we add servers horizontally and perhaps add vertical capability as well. As we are stateless and using a load balancer on the appropriate transport, we should achieve good ‘close to linear’ scalability if we did our job right (again, we must stress test on each iteration to know).

    4) Statistically, we know bottlenecks are usually out-of-process calls, and we don’t physically separate the business layer all that often (it only makes sense for security reasons, as the performance hit is massive). So that leaves the nasty case where the RDBMS or external system connectivity is the culprit and we cannot leverage a Message Queue (the call must be synchronous).

    5) When the database is the bottleneck we have many decisions to make: denormalize, use/add more prepared statements and precompilations (stored procs as well), or add more well-placed indexes (if it is not insert-dominant activity, as the only real downside to indexes is slower inserts, or frequently modifying the Clustered Index – Primary Keys should NEVER change). But, unfortunately, the #1 case I see is a bug (or design flaw) in .NET 1.1 SP1 and before related to DataSet Remoting.

    DataSets in .NET 1.1 SP1 and before always go to XML first, regardless of serialization method, and THEN to binary (if you are using binary). So the payload is HUGE on DataSets, as you have XML converted to binary! This is a nasty one… Luckily it is fixed in 2.0.
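
    To see the effect for yourself, here is a rough sketch (a made-up table and row count, not from a real engagement) that serializes a 5,000-row DataSet with the default format and then, on .NET 2.0 or later, with RemotingFormat set to Binary, printing both payload sizes:

    ```csharp
    // Rough illustration of the DataSet payload problem and the 2.0 binary fix.
    using System;
    using System.Data;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    class DataSetPayload
    {
        static void Main()
        {
            DataSet ds = new DataSet("Orders");
            DataTable t = ds.Tables.Add("Order");
            t.Columns.Add("Id", typeof(int));
            t.Columns.Add("Customer", typeof(string));
            for (int i = 0; i < 5000; i++)
                t.Rows.Add(i, "Customer " + i);

            Console.WriteLine("Default (XML-based): {0:N0} bytes", Measure(ds));

            // .NET 2.0 onwards: opt into true binary serialization.
            ds.RemotingFormat = SerializationFormat.Binary;
            Console.WriteLine("Binary RemotingFormat: {0:N0} bytes", Measure(ds));
        }

        static long Measure(DataSet ds)
        {
            using (MemoryStream ms = new MemoryStream())
            {
                // This is what remoting and SQL Server session state do under the covers.
                new BinaryFormatter().Serialize(ms, ds);
                return ms.Length;
            }
        }
    }
    ```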

    For .NET 1.1 there is a surrogate from Microsoft that fixes this with reasonably good results, described here:

    http://support.microsoft.com/default.aspx?scid=kb;en-us;829740

    There are other solutions as well, like applying compression before sending and decompressing on receipt. This was a MAJOR problem, and it still astounds me that even many Microsoft Consulting Services ‘Architects’ recommend DataSets for inter-layer serialization and are not aware of this issue. Bill Gates should send a critical memo to all MSFT staff globally on this one, as I can count on two hands the companies that were nailed by it (it makes boxing look like nothing in comparison).

    I can say that in all the consulting I have done related to performance problems (and all we do is Agile – specifically agilefactor, the name of our process, when we take on a project), the #1 cause is session state stored in SQL Server with large DataSets being serialized there (or general DataSet remoting of too many records or too wide a resultset). 5-10MB is not uncommon in my experience. Why do people learn about this so late? Even really good teams? The explanation is simple: they develop with InProc session state and move to SQL Server session state at the very end. All of a sudden, PANIC! It must be a bug! Something must be wrong!

    Well, that 5,000-row resultset was great in-proc, but serialize that sucker into a SQL Server table and you’re looking at perhaps 30 seconds on a web page. I have seen people close to tears. Luckily I can spot it fast, and there are good solutions.

    This is unique to the .NET environment, and we know we can use this Microsoft surrogate, or we can write our own lightweight class(es) that do just what you need them to do (though if you need the advanced functionality of DataSets, I recommend staying with them, using the surrogate, and making the retrieval of data more intelligent – say, no more than 100 rows are ever retrieved, or shared read-only data is stored in the cache of each server). You can even compress the result after applying the surrogate, I believe, but you are trading CPU cycles at that point. If you have 2 CPUs then perhaps grab a thread from the pool and do it on a separate thread. A DataSet in 1.1 can be made to work, and they are great in terms of functionality.
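
    As a rough illustration of that compression idea (a sketch only – the class and method names are made up), the already-serialized bytes can be gzipped before they cross the wire or land in the session-state table, trading CPU cycles for payload size:

    ```csharp
    // Illustrative only: compress an already-serialized payload (e.g. DataSet bytes).
    using System.IO;
    using System.IO.Compression;

    static class PayloadCompressor
    {
        public static byte[] Compress(byte[] payload)
        {
            using (MemoryStream ms = new MemoryStream())
            {
                using (GZipStream gz = new GZipStream(ms, CompressionMode.Compress))
                {
                    gz.Write(payload, 0, payload.Length);
                }
                return ms.ToArray(); // read after the GZipStream is closed so everything is flushed
            }
        }
    }
    ```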

    If that doesn’t get you there you can beef up your RDBMS hardware to a certain point but you hit a wall that is quite messy if even a large cluster is not enough.

    Remember to PIN certain key tables in memory using the DBCC PINTABLE command. SQL Server (or Oracle) optimization is WAY beyond the scope here.

    At this point I have shown that Agile is the best way of ensuring this does not happen, or at least that you learn about it as early as possible. This is something everyone would face, and Agile processes would know sooner due to the iteration-based stress testing.

    This is exactly the same problem any other project would face; they would just learn about it at the end.

    In closing, I believe firmly that to succeed in Agile you must combine architecture with your process. ‘Legacy’ Agile already does, with Fowler-style refactoring and TDD (both core drivers of your technical design).

    In agilefactor (the latest Agile process), the architecture becomes a part of the process and we do not try to separate church and state. That is why we do a formal system-wide ‘refactoring to patterns’ analysis, based on a reverse-engineered UML Class Diagram, when the system gets too big for people to do it off the top of their heads, and why Design Patterns are a critical success factor (something I prove in the book) when you look across scope, time, quality and cost. YOU MUST aggressively refactor to patterns (and refactor in general) and get your ‘Core Agility Ratio’ down, or you will never achieve a flat cost curve. This is a technical design activity, just like our mandate (as a process mandate) that we be stateless.

    ***************************

    “I absolutely agree that some ‘up-front’ performance analysis is important, as long as it is informed by the requirements.”

    Huh? You mean every iteration? How can you test up front without a system?

    “Similarly, I think customers should be encouraged to think about performance as just another requirement, and that it is highly irresponsible for developers to release poorly-performing software, wait for the customer to come back and complain, and then say ‘hey, performance wasn’t in the spec…’”

    I agree, but anyone who would do that would not be Agile by definition, given their ignorance and the fact that this is all out there now and has been for, what, 7 years at least? I don’t feel all that sorry for them, as they have chosen not to change and chosen to fail by their inaction (always easier). In fact, they are why only 16% of all corporate software projects succeed and why they deliver only 42% of what they promised (see the Standish Group studies).

    My point? I have proven (not here – in the book and in my work at Columbia) that the only rational (no pun intended) method for a corporate development project is Agile (and to be more specific, agilefactor, as it is the only Agile process that can pass a rigorous audit by most internal IT compliance staff that I have ever seen). It is also the only Agile process that doesn’t try to separate out things that must be considered together, while providing metrics to measure your progress and known patterns to see whether your metrics are following a ‘success’ pattern or a ‘failure’ pattern. It would take far more to explain in full, and my 2 minutes are up (grin).

    Sorry for the long rant… But this is information people need I believe.

    Kind Regards,

    Damon Carr, Chief Technologist and CEO

    agilefactor