My old team coined a phrase to describe their performance engagements: “Test and Attack.” It is meant to conjure up a picture of exactly what happens during the execution phase of a performance tuning and optimization engagement. So what does happen?
Test With You, Not For You
We refer to typical engagements as “toss it over the wall” gigs: an app team hands the desired testing plan to a test team, which builds a harness, executes the tests, compiles some results, and tosses the results back to the app team. The motto of many corporate test teams is “We’ll test for you.” Our motto is “We test WITH you, not for you.” To do this, we need to partner with the application team, the IT department, the testing team, and any other teams that have a stake in the outcome. Once we complete a lengthy and thorough test plan, we can prepare for and execute the testing.
For the execution phase, we set aside a testing room that contains several workstations with access to the various servers and test rig machines, plus a large overhead display so we can share screens as needed. The room is set up so that each Microsoft Subject Matter Expert (SME) is partnered with the corresponding customer representative: SQL with SQL, .NET/IIS with .NET/IIS, tester with tester, etc. We also have a designated project manager for Microsoft and one for the customer. These people all interact as a group and also break off into their respective pairs as the testing progresses.
Each day starts with a five-minute stand-up meeting where we review the current list of tests and actions for the day. As soon as that is done, we kick into the work, which follows a pattern similar to this:
1. Run a Test of Record and save the results (no instrumentation or monitoring beyond what is normally allowed in production).
2. Take 2-3 minutes to discuss the results. Reset the testing environment as needed.
3. SMEs turn on any extra monitoring or logging to look for issues.
4. Run a Break/Fix test. SMEs collect logs, analyze data, find a pain point, implement a change, make a note of the change, and reset the environment.
5. Run a Test of Record. Compare the results to the previous Test of Record. Determine whether the change is worth keeping.
   - If YES: file a bug in the TFS database so the app dev team can implement the change in the main code branch and perform the appropriate regression testing and documentation updates.
   - If NO: file a bug in the TFS database and mark it “WON’T FIX,” along with a note as to why. This helps prevent duplicated effort down the road.
6. Repeat steps 2-5 as needed.
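The keep-or-revert decision in step 5 can be sketched in a few lines. This is an illustrative assumption, not the team's actual tooling: the metric name and the 5% improvement threshold are hypothetical, chosen only to show the shape of the comparison between a candidate run and the previous Test of Record.

```python
def keep_change(baseline: dict, candidate: dict, min_gain: float = 0.05) -> bool:
    """Return True if the candidate Test of Record improves mean response
    time by at least `min_gain` (0.05 = 5%) over the baseline run.
    True -> file the bug so the change lands in the main branch;
    False -> file it as WON'T FIX with a note explaining why."""
    base = baseline["mean_response_ms"]
    cand = candidate["mean_response_ms"]
    return (base - cand) / base >= min_gain

# A change that cut mean response time from 420 ms to 350 ms (~17% gain)
baseline = {"mean_response_ms": 420.0}
after_fix = {"mean_response_ms": 350.0}
print(keep_change(baseline, after_fix))  # -> True
```

In practice the comparison spans many metrics at once, but the principle is the same: every change is judged against the prior Test of Record, never against a gut feeling.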
The Pain Points
When I start testing engagements with customers, they often tell me that they already know where the major issue in their system is, and they start adding it to the test plan. I tell them not to add it yet, because once we start testing, the system will tell us where it is hurting. Besides, if you already know where it hurts, why are you planning to test it before you fix the injury? Maybe it’s just me, but the last time I tried to go out and play with my teenage boys while I was sick, my wife smacked me and said I was an idiot for trying to keep up with healthy kids when I knew I was not healthy. I knew I was being foolish, but I wanted to go play. My wife had to be the sensible one and bring me back to reality. Now I can return the favor to you. If you know where the system hurts, go fix it… I’ll wait…
Now that all of the known issues are fixed, we can test. Asking the system where it hurts is not difficult, but it does take a fair amount of knowledge about system monitoring, application and system behavior, and experience in several different areas. This is why we pull several people together as a team. It is also why we work to set boundaries on the environment setups (see the section on monitoring and tuning SANs for an example).
When a test run is completed, we pull up the results on the big screen and look through several of the key metrics. Based on what everyone sees, we either drill into a specific problem, or we let each group go analyze its own data. SQL DBAs might read through DMVs looking for indexing issues or blocking. The .NET people look for CPU, memory, GC, and marshaling issues. Someone scours the event logs on the machines. We might have a network trace to analyze, or use LogParser to crank out statistics from web logs or application logs.
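To give a flavor of the kind of statistics we crank out of web logs, here is a minimal sketch in Python. The real engagements use LogParser against full IIS logs; the two-field line format below (`url time-taken-ms`) is a simplified assumption purely for illustration.

```python
from collections import defaultdict

def url_stats(lines):
    """Aggregate hit counts and average time-taken per URL from
    simplified log lines of the form 'url time_taken_ms'."""
    totals = defaultdict(lambda: [0, 0])  # url -> [hit count, total ms]
    for line in lines:
        url, ms = line.split()
        totals[url][0] += 1
        totals[url][1] += int(ms)
    return {u: {"hits": c, "avg_ms": t / c} for u, (c, t) in totals.items()}

log = ["/checkout 900", "/checkout 1100", "/home 80"]
stats = url_stats(log)
print(stats["/checkout"])  # -> {'hits': 2, 'avg_ms': 1000.0}
```

Even a crude aggregation like this quickly surfaces which pages are hot and which are slow, which is exactly what the group scans for on the big screen.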
When you discover an issue, or an area that you want to explore further, it is fine to make a note of it, but do NOT act on it immediately unless it falls within the scope originally laid out in the test plan. You will have time at the end of each day, during the EOD stand-up, to discuss whether to add it to the mix. One of the most common reasons engagements fail or slow down is getting off track and growing the scope, so make sure you do not fall victim to scope creep.
One of the most critical parts of really tuning an application is the ability to share information across many different levels and apply that knowledge to the issues at hand. If a SQL SME tells me that a particular table suffers from a poor design and is really slow, I might ask the business partner whether the functionality that uses that table is critical to our efforts. If less than 1% of the traffic uses it and it does not block anything else, we may lower that priority and move on. But it takes the business partner collaborating with the SME to make a good decision.
Another critical part is determining which area to focus on when multiple areas get flagged. This is another good reason to have multiple SMEs together in one place. If SQL is screaming because of a missing index that could improve performance by 200%, and IIS is screaming because it is seeing 60+ unhandled exceptions per second in a small piece of code that would be quick to fix, which fix should I take? Both should be quick and easy, but I should make only one change at a time, because I want to KNOW what the impact of each change is. In this case, I usually let the SMEs discuss it for a couple of minutes, or I try to work in the order that makes the most sense:
- If the app is bound by CPU and the index is slowing down SQL, fix the app side first because the SQL change will most likely not impact the app at all.
- If the app is not showing excessive CPU, I might fix SQL first to see how the app’s CPU changes when SQL is faster.
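The two bullets above amount to a small ordering heuristic, which can be sketched as follows. The 80% CPU threshold is an illustrative assumption on my part, not a number from the engagement playbook; the point is only that the decision is driven by whether the app tier has CPU headroom.

```python
def next_fix(app_cpu_pct: float, cpu_bound_threshold: float = 80.0) -> str:
    """Pick which single change to make first when both the app tier
    and SQL have flagged issues. Returns 'app' or 'sql'."""
    if app_cpu_pct >= cpu_bound_threshold:
        # App is CPU-bound: fix the app side first; speeding up SQL
        # would most likely not move the app's numbers at all yet.
        return "app"
    # App has CPU headroom: fix SQL first and watch how the app's
    # CPU changes once SQL is faster.
    return "sql"

print(next_fix(92.0))  # CPU-bound app -> "app"
print(next_fix(35.0))  # app has headroom -> "sql"
```

Either way, only one change goes in per Break/Fix cycle, so the next Test of Record attributes the delta to exactly one fix.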
Staying in Character
I have more than 20 years of experience with Microsoft and over 30 years of computer usage, but despite this background, there’s no way I could tune an application by myself. I rely on my teammates to fill in gaps and bring up other ideas. I also stay focused on what my job is. I spent 10 years working as an IIS debug engineer. I wrote the original IIS Critical Problem Management workshop and delivered it more than 50 times. However, I have been working as a test consultant for 8 years now, and my job is to drive the testing and manage the engagement, so I do not perform any debugging. I let the .NET SMEs handle that.
SAN Monitoring – An example of “Over-Engineering”
This section shows how our team approaches parts of the application and infrastructure that may be out of our control. When we set up a lab environment for a customer, we try to get a SAN similar to the one they will be using, but we then “over-engineer” it, meaning we configure it to be as fast as possible, regardless of the final production setup. For example, we might use a RAID 0 layout instead of RAID 1, because we do not need the fault tolerance in the lab and RAID 0 is faster.
We do this because there is NO WAY we can successfully mimic all of the traffic on a shared SAN, so we are already skewing results. However, if we make the SAN as fast as possible, we limit its impact as a choke point, which lets us more easily find the choke points in SQL or in the application that uses the SAN. We can fix those, run our tests, and get valid results. When it comes time to roll the system out to production, we can use the disk and drive metrics gathered during testing to show how the system behaved. If it is slower in production, we can gather the same numbers from the production system and see whether the issue is indeed in the SAN or not.
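That lab-versus-production comparison can be sketched as a simple check: take a disk latency number recorded during testing and the same number gathered from production, and flag the SAN when production is meaningfully slower. The counter choice (average disk seconds per read) and the 25% tolerance are assumptions for illustration, not fixed rules.

```python
def san_suspect(lab_avg_disk_sec_read: float,
                prod_avg_disk_sec_read: float,
                tolerance: float = 0.25) -> bool:
    """Return True if production disk reads are more than `tolerance`
    (25% by default) slower than the lab baseline, which points the
    investigation at the SAN rather than at SQL or the app."""
    return prod_avg_disk_sec_read > lab_avg_disk_sec_read * (1 + tolerance)

# Lab baseline: 8 ms per read. Production: 30 ms -> SAN is the suspect.
print(san_suspect(0.008, 0.030))  # -> True
# Production at 9 ms is within tolerance -> look elsewhere.
print(san_suspect(0.008, 0.009))  # -> False
```

Because the lab SAN was deliberately made as fast as possible, a clean lab number plus a slow production number is strong evidence that the problem lives in the shared storage, not in the code we tuned.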
We take this same approach for all external or shared systems. It is the only way I know of to get accurate, repeatable testing done while still considering the system end to end.
It takes a lot of effort and resources to perform one of these engagements, but the payoff is well worth it. I will write more posts drilling into some of the specifics as time permits.