Performance and Scalability Checkpoint to Improve Your Software Engineering

When a patterns & practices deliverable was ready to ship, our General Manager (GM) would ask me to sign off on its performance and security. I was usually spread thin, so I needed a way to scale. To do so, I created a small checkpoint for performance and scalability. The checkpoint is simply a set of questions that act as a forcing function to make sure you've addressed the basics (and avoid a lot of "do-overs"). Here's what we used internally:

Checkpoint: Performance and Scalability

Customer Validation

  1. List 3-5 customers who say performance/scalability for the product is a "good deal" (i.e. they pay for play)

Product Alignment

  1. Do you have alignment with the future directions of the product team?
  2. Who from the product team agrees?

Supportability

  1. Has Product Support Services (PSS) reviewed and signed off?
  2. Which newsgroups would a customer go to if performance and scalability problems occur?

Performance Planning

  1. Performance model created (performance modeling template)?
  2. Budget. These are performance and scalability constraints (see the sketch after this list for one way to capture them in code). What are the maximum acceptable values for the following?
    1. Response time?
    2. Maximum operating CPU usage?
    3. Maximum network bandwidth usage?
    4. Maximum memory usage?
  3. Performance Objectives
    1. Workload
    2. % of overhead over relevant baselines (e.g. within 5% performance degradation from version 1 to version 2)
    3. Response Time
    4. Throughput
    5. Resource Utilization (CPU, Network, Disk, Memory)
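
To make the budget and objectives actionable, it helps to capture them as data that automated tests can check. Here's a minimal sketch in Python; the class, field names, and every threshold value are illustrative placeholders rather than recommendations:

```python
# A minimal sketch of a performance budget expressed as data, so that test
# runs can be checked against it automatically. All threshold values below
# are illustrative placeholders, not recommendations.
from dataclasses import dataclass

@dataclass
class PerformanceBudget:
    max_response_time_ms: float   # maximum acceptable response time
    max_cpu_percent: float        # maximum operating CPU usage
    max_network_mbps: float       # maximum network bandwidth usage
    max_memory_mb: float          # maximum memory usage

    def check(self, response_time_ms, cpu_percent, network_mbps, memory_mb):
        """Return a list of budget violations for one measurement."""
        violations = []
        if response_time_ms > self.max_response_time_ms:
            violations.append(f"response time {response_time_ms}ms > {self.max_response_time_ms}ms")
        if cpu_percent > self.max_cpu_percent:
            violations.append(f"CPU {cpu_percent}% > {self.max_cpu_percent}%")
        if network_mbps > self.max_network_mbps:
            violations.append(f"network {network_mbps}Mbps > {self.max_network_mbps}Mbps")
        if memory_mb > self.max_memory_mb:
            violations.append(f"memory {memory_mb}MB > {self.max_memory_mb}MB")
        return violations

# Example: a hypothetical budget for a web page scenario.
budget = PerformanceBudget(max_response_time_ms=2000, max_cpu_percent=75,
                           max_network_mbps=10, max_memory_mb=512)
print(budget.check(response_time_ms=2500, cpu_percent=60,
                   network_mbps=4, memory_mb=300))
```

An empty result means the run is within budget; anything else is a concrete, reportable violation instead of a judgment call.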

Hardware/Software Requirements

  1. List requirements for customer installation.
    1. Hardware?
    2. Minimum hardware requirements?
    3. Ideal hardware requirements?
    4. Minimum software requirements?

Performance Testing

  1. Lab Environment. What is the deployment scenario configuration you used for testing?
    1. Hardware?
    2. CPU?
    3. # of CPUs?
    4. RAM?
    5. Network Speed?
  2. Peak conditions. What do peak conditions look like?
    1. How many users?
    2. Response time?
    3. Resource Utilization?
    4. Memory?
    5. CPU?
    6. Network I/O?
  3. Capacity. How many users until your response time or resource utilization budget is exceeded? (See the load-test sketch after this list.)
    1. What is the glass ceiling? (i.e. the breaking point)
    2. Number of users?
    3. Response time?
    4. Resource Utilization?
    5. CPU?
    6. Network?
    7. Memory?
  4. Failure. What does failure look like in terms of performance and scalability?
    1. Does the application fail gracefully?
    2. What fails and how do you know?
    3. Response time exceeds a threshold?
    4. Resource utilization exceeds thresholds? (CPU too much?)
    5. What diagnostic/monitoring clues do you see?
      1. Exceptions?
      2. Event Log entries?
      3. Performance counters to watch?
  5. Stress Scenarios. What does stress look like and how do you respond?
    1. Contention?
    2. Memory Leaks?
    3. Deadlocks?
    4. What are the first bad signs under stress?
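
One way to answer the capacity question is a step-load test that increases concurrent users until the response-time budget breaks. The sketch below assumes a hypothetical make_request() standing in for a real end-to-end operation; the step sizes and thresholds are arbitrary stand-ins for your own budget numbers:

```python
# A sketch of a step-load capacity test: increase concurrent users until the
# response-time budget is exceeded, and report the last load level that passed.
# make_request() and the thresholds are hypothetical placeholders; substitute
# your real client call and your budget numbers.
import time
from concurrent.futures import ThreadPoolExecutor

MAX_RESPONSE_TIME_S = 2.0   # illustrative response-time budget

def make_request():
    """Placeholder for one end-to-end operation against the system under test."""
    time.sleep(0.05)        # simulate work; replace with a real request

def measure(concurrent_users, requests_per_user=10):
    """Run one load level and return the worst observed response time."""
    def one_user():
        worst = 0.0
        for _ in range(requests_per_user):
            start = time.perf_counter()
            make_request()
            worst = max(worst, time.perf_counter() - start)
        return worst
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(lambda _: one_user(), range(concurrent_users)))
    return max(results)

# Step up the load until the budget breaks; the step sizes are arbitrary.
for users in (1, 5, 10, 25, 50, 100):
    worst = measure(users)
    print(f"{users:>4} users -> worst response {worst:.3f}s")
    if worst > MAX_RESPONSE_TIME_S:
        print(f"Glass ceiling reached at {users} users.")
        break
```

Watching CPU, memory, and network alongside the response times (e.g. with performance counters) tells you which resource gives out first, which is exactly what the failure and stress questions above are probing for.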

Instrumentation

  1. What is the technology/approach used for instrumenting the codebase?
    1. Are the key performance scenarios instrumented?
    2. Is the instrumentation configurable (on/off? levels of granularity?)?
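
As a concrete example of configurable instrumentation, here's a sketch using Python's standard logging module; the scenario name and timings are illustrative, and the same idea applies to whatever instrumentation technology you choose:

```python
# A sketch of configurable instrumentation for a key performance scenario,
# using Python's standard logging module: it can be switched off entirely or
# dialed between coarse (INFO) and fine-grained (DEBUG) output at run time.
# The scenario name and timings are illustrative.
import logging
import time
from contextlib import contextmanager

perf_log = logging.getLogger("app.perf")

@contextmanager
def instrumented(scenario):
    """Time a key scenario and emit the duration if instrumentation is on."""
    start = time.perf_counter()
    perf_log.debug("start: %s", scenario)          # fine-grained detail
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        perf_log.info("%s took %.1f ms", scenario, elapsed_ms)  # coarse summary

# Configuration, not code changes, controls granularity:
logging.basicConfig(level=logging.INFO)    # coarse; use DEBUG for detail
# logging.disable(logging.CRITICAL)        # or switch instrumentation off

with instrumented("checkout"):
    time.sleep(0.02)                       # stands in for real work
```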

The checkpoint helped the engineering team shape their approach, and it simplified my job when I had to review. You can imagine how some of these questions can shape your strategies. It's by no means exhaustive, but it was effective enough to tease out big project risks. For example, do you know when your software is going to hit capacity? Did you completely ignore the customer's practical environment and use up all their resources? Do you have a way to instrument for when things go bad, and is that instrumentation configurable? When your software is in trouble, what sort of support did you enable for troubleshooting and diagnostics?

While I think the original checkpoint was helpful, I think a pluggable set of checkpoints based on application types would be even more helpful and more precise.  For example, if I'm building a Web application, what are the specific metrics or key instrumentation features I should have?  If I'm building a smart client, what sort of instrumentation and metrics should I bake in? … etc.  If and when I get to a point where I can do more checkpoints, I'll use a strategy of modular, type-specific, scenario-based checkpoints to supplement the baseline checkpoint above.