This week I hosted an interactive session at the Architect Insight Conference with my colleague Josh Twist… if you were there, thanks for being so interactive! We really enjoyed it and found it useful, so I hope you did too.
One concept that I mentioned during the session was an approach I use to help guide performance and scalability problem resolution, or to guide architecture and design of scale-critical systems. It’s by no means a formal process, just something I’ve found useful that has taken shape in my head over time.
This concept is “Boundary Analysis”. Having vocalised this concept to other Architects, I’ve been interested to hear how many people think along similar lines, or quote their own mantras back at me (such as “architects care about boundaries”). This seems to validate my thinking further. I’m not claiming invention rights here; just that this is a useful way of structuring your thoughts, so if you do something similar please shout up!
Simply put, the idea is that scalability and performance blockers can be unearthed by looking at the boundaries in both Logical and Physical architectural views. This contrasts with Node Analysis, which focuses on monitoring individual components (nodes) within an architecture rather than the interactions between them. Both have their place, but my observation is that people have very few methods for looking at boundaries, and hence get stuck once they’ve determined that the database is the bottleneck based on Node Analysis (usually using performance counters).
In many modern systems (for example, online retail sites), a huge amount of the logic consists of taking data from a shared source and displaying it on the web site, and then taking more data and pushing it back down to the shared data source. This means that a lot of the work your system does is pretty much moving data around. Not all systems are like this, but if yours is, you should be considering Boundary Analysis.
A “boundary” is basically the line between two nodes on an architecture diagram. It might be physical, logical, cross-machine, cross-network, cross-boundary, or just inter-component.
When looking at each boundary I consider four measures:
1. The frequency of calls across the boundary. Are there any ways you can reduce the frequency of calls? Does the unusually high frequency lead you to suspect a problem somewhere in the system?
2. Whether the communications are asynchronous. If they are, the chances are you’ve got a good design pattern in place so the duration of the call is not directly affecting the performance of the system. If they are not, consider ways you could adopt this pattern as it is usually significantly more scalable.
3. The distance the boundary represents. Is it across processes, or across machines separated by a WAN? If you’re trying to improve performance and scalability, try to reduce the distance for boundaries within the confines of your other requirements.
4. The size of the messages exchanged over the boundary. Many modern systems communicate with XML, but serializing data into XML adds significant size. If you are transmitting large amounts of static data, you’ll find that is often a problem too. Look at ways to shrink your messages.
You might notice these spell FADS. That’s how I remember them :)
This is the easy bit. Grab your Logical and Physical Architecture diagrams, and start looking at each boundary in turn. Prioritising them can help; if you do, start with the boundaries that span the greatest “distance”. Beyond that, prioritisation can be difficult, because the scalability blocker is often some unexpected behaviour, so assumptions you make while prioritising can turn out to be wrong.
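If it helps to make this concrete, here is a minimal sketch of how you might jot down the four measures for each boundary and sort them for review. Everything in it is an assumption made for illustration: the `Distance` scale, the `Boundary` fields, the boundary names and the figures are all invented, and the tie-break on bytes moved per minute is just one plausible choice, not part of the approach itself.

```python
from dataclasses import dataclass
from enum import IntEnum


class Distance(IntEnum):
    """Hypothetical distance scale; a higher value means further apart."""
    IN_PROCESS = 0
    CROSS_PROCESS = 1
    CROSS_MACHINE = 2
    CROSS_WAN = 3


@dataclass
class Boundary:
    name: str
    calls_per_minute: float   # Frequency
    asynchronous: bool        # Asynchronous?
    distance: Distance        # Distance
    avg_message_bytes: int    # Size


def prioritise(boundaries):
    """Greatest distance first; ties broken by bytes moved per minute (an assumption)."""
    return sorted(
        boundaries,
        key=lambda b: (b.distance, b.calls_per_minute * b.avg_message_bytes),
        reverse=True,
    )


# Invented figures, purely for illustration.
boundaries = [
    Boundary("web -> services", 1200, False, Distance.CROSS_MACHINE, 40_000),
    Boundary("services -> database", 3000, False, Distance.CROSS_MACHINE, 8_000),
    Boundary("browser -> web", 600, True, Distance.CROSS_WAN, 150_000),
    Boundary("web -> in-memory cache", 9000, False, Distance.IN_PROCESS, 2_000),
]

for b in prioritise(boundaries):
    print(f"{b.name}: {b.distance.name}, async={b.asynchronous}")
```

Treat the resulting order as a starting point only; as noted above, the real blocker is often somewhere you did not expect.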
You’ll find a lot of tooling helps with this kind of analysis; IIS logs, Fiddler, SQL Profiler, FireBug, and more. If you can’t measure what’s crossing a boundary, you should consider custom instrumentation.
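As a rough idea of what custom instrumentation could look like, the sketch below wraps any function that crosses a boundary and records how often it is called, how long it takes and roughly how much data comes back. The `crosses_boundary` decorator, the `stats` dictionary and `load_product` are hypothetical names made up for this example, and the pickled length is only a stand-in for the real size on the wire.

```python
import functools
import pickle
import time
from collections import defaultdict

# Per-boundary totals: call count, bytes returned, seconds spent.
stats = defaultdict(lambda: {"calls": 0, "bytes": 0, "seconds": 0.0})


def crosses_boundary(name):
    """Record frequency, size and duration for calls that cross the named boundary."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            entry = stats[name]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
            # Pickled length is only a rough stand-in for the real wire size.
            entry["bytes"] += len(pickle.dumps(result))
            return result
        return wrapper
    return decorator


@crosses_boundary("services -> database")
def load_product(product_id):
    # Hypothetical stand-in for a real database call.
    return {"id": product_id, "name": "Widget", "description": "x" * 500}


for product_id in range(3):
    load_product(product_id)

print(stats["services -> database"])
```

In a real system you would write these totals to your logging or monitoring of choice rather than a module-level dictionary.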
Always measure rather than guess (unless you’re designing up front, in which case model rather than measure). Prove performance bottlenecks before remediation as you’ll often find they’re not where you expect.
This approach is by no means a complete solution to your problem… but it does lead you to thinking about the right things.
For example, if you find that a huge amount of static data is being transmitted from the database to a web tier via a services tier, and this message coincides with observed poor performance, you would next consider:
1. Reducing the frequency by adding caching at the front end.
2. Altering the front end to fetch the data asynchronously if the cache expires.
3. Keeping the cache as close to the front end as possible to minimise the distance.
4. Breaking the static data into chunks that are cached independently, to reduce the size of individual messages.
Makes sense huh? Like I said – it doesn’t give you the solution, but it leads you in the right direction.
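To illustrate points 1, 3 and 4 from that list, here is a minimal sketch of an in-process cache that holds each chunk of static data under its own key and only crosses the boundary when a chunk is missing or has expired. `ChunkCache`, `fetch_from_services_tier`, the chunk names and the five minute expiry are all assumptions for the example. The asynchronous refresh from point 2 is left out to keep it short.

```python
import time


class ChunkCache:
    """In-process TTL cache; each chunk of static data is cached under its own key."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # function that crosses the boundary
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # served locally, no boundary crossing
        value = self._fetch(key)     # missing or expired: cross the boundary
        self._entries[key] = (now + self._ttl, value)
        return value


backend_calls = []


def fetch_from_services_tier(key):
    # Hypothetical stand-in for the real call to the services tier.
    backend_calls.append(key)
    return f"static data for {key}"


cache = ChunkCache(fetch_from_services_tier, ttl_seconds=300)

for _ in range(100):
    cache.get("countries")
    cache.get("currencies")

print(len(backend_calls))  # 2: one fetch per chunk, rather than 200
```

The point is simply that 200 requests at the front end turn into two calls across the boundary, each carrying one smaller chunk rather than the whole lot.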
So give it a go – and let me know how you get on by commenting below!