A few weeks back Micke (one of our Architect Evangelists) and I had a session at TechDays where we talked about “things that looked good on paper”, i.e. things that sound pretty OK in the design/development phase but sometimes turn out to be a disaster in production.
We are both pretty passionate about making the lives of the ops people easier by thinking about the maintenance of the sites/apps at design time, rather than having it be an afterthought. I stole the title of this post from one of Micke’s talks about bridging the gap between dev and ops.
The topics we brought up are based on issues that we commonly see in production environments. We started off each section with a quote and dissected the pros and cons and what we think people should think about…
Here is a summary:
1. With web services we can use the same interface from both our web apps and win forms apps
While this is perfectly true, there is a right and a wrong time and place for everything. When you make a web service call, a remoting call or a WCF call for that matter, there is a lot of stuff that goes on behind the scenes, like getting a connection, serializing and de-serializing parameters and return values, spinning up new threads to make the new HttpWebRequests etc.
I’ve talked a lot about issues with serialization and de-serialization, specifically when it comes to serializing large sets of data, complex objects or datasets for example. Serialization of these types of objects generates a lot of memory usage and is often quite expensive when it comes to CPU usage. Also, if you call web services within the same app pool you can run into issues like thread pool exhaustion.
The moral of the story? Use web services if you need to get data that you couldn’t get by loading up a component in the app. In other words, if you need to go to a DMZ or a different network to get it.
If you want to create a web service (hosted on the same server as your asp.net app) so that you can get the same functionality both from your asp.net app and your win forms apps, a good option is to write a component that does this, and then wrap it in web service calls for your win forms apps to use.
At the very least you should be really frugal with the amount of data you send back and forth. For example filter the data before bringing it back so you transfer as little data as possible.
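To make the “component first, thin service wrapper second” idea concrete, here is a minimal sketch. It is written in Python purely for illustration (the post itself is about .NET), and every name in it (`ORDERS`, `find_orders`, `handle_service_request`) is hypothetical. The in-process caller uses the component directly and pays no serialization cost, while the service wrapper filters before it serializes.

```python
import json

# Hypothetical in-memory data standing in for whatever the component reads.
ORDERS = [
    {"id": i, "customer": "acme" if i % 10 == 0 else "other", "total": i * 1.5}
    for i in range(1000)
]


def find_orders(customer):
    """The shared component: plain in-process logic, no serialization involved."""
    return [order for order in ORDERS if order["customer"] == customer]


def handle_service_request(customer):
    """Thin service wrapper for remote clients: filter first, then serialize."""
    return json.dumps(find_orders(customer))


def handle_service_request_unfiltered():
    """The pattern to avoid: serialize everything and let the caller filter."""
    return json.dumps(ORDERS)
```

Comparing `len(handle_service_request("acme"))` with `len(handle_service_request_unfiltered())` gives a rough feel for how much less data crosses the wire when the filtering happens before the call returns.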
2. Bob, just turn on tracing on the WCF end point
A lot of app configuration these days is done in XML. You often hear that XML is so great because it is human readable/writable, but is it really??? Even with XML configuration some things are extremely wordy and require a lot of xml code to configure. Imagine that you have an issue in production where you need Bob (or Jerry, or Ruth or <replace the name of your favorite ops guy/gal here>) in operations to turn on WCF tracing on all the servers in the web farm. He probably doesn’t have Visual Studio handy to swap this through the UI so he’ll probably use the incredibly useful configuration tool Notepad to write the 10+ lines of XML needed to enable the tracing. Rinse and repeat for all servers in the web farm.
Is that fair? What if there is a mistake in the XML?
To make it a bit easier on Bob you could provide him with two web.config files (with and without tracing) and that is at least an improvement, but then there is of course the issue of forking, if you have to change something in one config you need to change it in both etc.
A nicer way would be to create some PowerShell cmdlets to enable tracing or whatever config items you want, like connection strings or whatever else you might have stored in your configs. The nice part about this is that it is scriptable, so you could create one script and run it on all servers.
While you’re at it, why not implement a PowerShell provider for your app that allows the ops guys to configure parts of your application, or get values from your application, from PowerShell? PowerShell is built on .NET and its objects are .NET objects, so cmdlets and providers can be written in .NET.
Taking it one step further, you can even call PowerShell cmdlets from an MMC snap-in in case you want to configure things from there.
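The scripted approach to switching tracing on and off can be sketched roughly like this. Note that this is Python rather than the PowerShell cmdlet suggested above, so it only illustrates the idea of a repeatable, re-runnable toggle. The function name `set_wcf_tracing` and the default log path are assumptions; the XML it writes follows the usual `System.ServiceModel` trace source layout with an `XmlWriterTraceListener`.

```python
import xml.etree.ElementTree as ET

SOURCE_NAME = "System.ServiceModel"


def _get_or_create(parent, tag):
    """Return the first child element with this tag, creating it if missing."""
    for child in parent:
        if child.tag == tag:
            return child
    return ET.SubElement(parent, tag)


def set_wcf_tracing(config_path, enabled, log_path=r"C:\logs\wcf.svclog"):
    """Add or remove the WCF trace source in a .config file, in place.

    Caveat: ElementTree drops XML comments when it rewrites the file.
    """
    tree = ET.parse(config_path)
    root = tree.getroot()  # the <configuration> element
    diagnostics = _get_or_create(root, "system.diagnostics")
    sources = _get_or_create(diagnostics, "sources")

    # Remove any existing entry first so the script is safe to run twice.
    for source in list(sources):
        if source.tag == "source" and source.get("name") == SOURCE_NAME:
            sources.remove(source)

    if enabled:
        source = ET.SubElement(sources, "source", {
            "name": SOURCE_NAME,
            "switchValue": "Information, ActivityTracing",
            "propagateActivity": "true",
        })
        listeners = ET.SubElement(source, "listeners")
        ET.SubElement(listeners, "add", {
            "name": "xml",
            "type": "System.Diagnostics.XmlWriterTraceListener",
            "initializeData": log_path,
        })

    tree.write(config_path, encoding="utf-8", xml_declaration=True)
```

Bob would then run one command per server (or one loop over all servers) instead of hand-editing XML in Notepad, and a typo can only happen once, in the script, where it can be tested.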
3. Let’s put this data in session scope so we don’t have to go back and forth to the database all the time
Got a tweet from Fredrik earlier this week where he suggested a title for a podcast: “Session state is the Achilles heel of ASP.NET”. I definitely agree… Again, everything has its pros and cons, and session state is nice for saving SMALL pieces of user-specific data, but if you have a high-load web app, I would say that you seriously need to consider going stateless.
Over the years I have seen many many web apps with lots of data in session state. A favorite seems to be to put datasets in session scope to avoid hitting the database all the time. Especially if the query to get the data is pretty complex.
Now imagine that this site grows and needs to be replicated on different servers in a web farm, so we can’t use in-proc session state anymore. In that case you would need to put session objects in an out-of-proc session store like state server or SQL Server. Just like with the web service calls, you need to serialize and de-serialize data since you are now doing cross-process calls, which again uses a lot of memory and a lot of CPU.
Even if you have one server and store it in-proc, there is a real chance that you will rack up a lot of memory if you have a lot of concurrent sessions.
For out of proc session state it gets even worse… without putting it in session scope you would go out and get the data when you need it. If you have out of proc session scope you will go out and get every single session var for the given user on every single request (with session state enabled), and then put it back in the session store on end request. That’s a lot of serialization/de-serialization.
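The load-everything, save-everything cycle described above can be sketched with a toy model. This is not how ASP.NET implements its session providers; it is a small Python simulation, with hypothetical names (`OutOfProcSessionStore`, `handle_request`), that just counts how many bytes get serialized and de-serialized per request.

```python
import pickle


class OutOfProcSessionStore:
    """Toy stand-in for an out-of-process store: everything is kept as bytes."""

    def __init__(self):
        self._blobs = {}      # session id -> serialized session dictionary
        self.bytes_moved = 0  # total bytes serialized plus de-serialized

    def load(self, session_id):
        blob = self._blobs.get(session_id)
        if blob is None:
            return {}
        self.bytes_moved += len(blob)
        return pickle.loads(blob)

    def save(self, session_id, session):
        blob = pickle.dumps(session)
        self.bytes_moved += len(blob)
        self._blobs[session_id] = blob


def handle_request(store, session_id):
    """Every request pays for the whole session, even to read one small value."""
    session = store.load(session_id)       # begin request: de-serialize it all
    visits = session.get("visits", 0) + 1  # the only value this page needs
    session["visits"] = visits
    store.save(session_id, session)        # end request: serialize it all again
    return visits
```

If you first save a session that contains a large “dataset” and then call `handle_request` a few times, `store.bytes_moved` grows by roughly twice the size of the whole session on every request, even though the page only touches a counter.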
Another thing about hoarding data, unrelated to session state, that is a bit of a pet peeve of mine is when apps bring in loads of data from the database and process it in-proc, based on the notion that they don’t want the DB to be a bottleneck.
Just some food for thought there… I think that very few applications are better/faster at handling/processing data than database engines, hence if the code in the app is not better at data processing than the DB, then aren’t chances pretty good that this will just create a bottleneck in the app instead?
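As a small illustration of the difference, here is a sketch using Python’s built-in `sqlite3` module with an in-memory database. The table and function names are made up for the example, and SQLite is only a stand-in for whatever database engine you actually run. Both functions return the same totals; the difference is how many rows have to leave the database to get them.

```python
import sqlite3


def make_db(rows=10_000):
    """Create a throwaway in-memory database with some sample orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        ((f"customer{i % 50}", float(i)) for i in range(rows)),
    )
    return conn


def totals_in_app(conn):
    """Hoarding: pull every row across, then aggregate in application code."""
    totals = {}
    for customer, total in conn.execute("SELECT customer, total FROM orders"):
        totals[customer] = totals.get(customer, 0.0) + total
    return totals


def totals_in_db(conn):
    """Let the engine aggregate; only one row per customer comes back."""
    query = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
    return dict(conn.execute(query))
```

With the default 10,000 rows, `totals_in_app` drags all 10,000 rows into the app while `totals_in_db` brings back 50.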
4. HttpUnhandledException, does that mean I should restart the server?
There are a lot of posts and discussions around which logging framework is best, and I think at least some of you agree with me that a lot of time is spent in the design phase to work out which one to use, but how much time do you spend thinking about what to log?
Often when I get cases and ask for event logs, the event logs literally look like a nice and very ornate Christmas tree. Most of the entries contain stack traces, and handled exceptions are mixed in with unhandled ones. That’s OK, at least the part about stack traces. I love them, they make sense to me and I can use them to troubleshoot once I have waded through the unimportant events with Log Parser or some other tool. But… does this really make any sense to Bob in operations? Unless he has a dev background, chances are that it makes no sense at all, and unless it is something he has seen a million times before, he probably won’t know how to act, or not act, on the events.
In my humble opinion Bob should only get about 5 events a day max in his log, and that’s on a busy day. Every event should have a nice problem description and, most importantly, an action like restart the server, run diagnostics on the DB etc. Sometimes you don’t know the action or even the problem description, and then maybe the action could be “report this unknown failure to dev”. I bet that your ops guys/gals would be a lot happier…
I am not saying that you should stop logging the exceptions, but preferably not in the same log as the ops logs.
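One way to picture the split between an ops log and a dev log is the sketch below, using Python’s standard `logging` module rather than any particular .NET logging framework. The logger names `myapp.ops` and `myapp.dev` and the helper names are hypothetical. The ops log gets a problem description and an action; the stack trace only goes to the dev log.

```python
import logging

UNKNOWN_ACTION = "Report this unknown failure to dev"


def build_loggers(ops_handler, dev_handler):
    """Create two independent loggers so ops events and dev details never mix."""
    ops = logging.getLogger("myapp.ops")
    dev = logging.getLogger("myapp.dev")
    for logger, handler in ((ops, ops_handler), (dev, dev_handler)):
        logger.handlers = [handler]
        logger.propagate = False  # keep entries out of any shared root log
        logger.setLevel(logging.INFO)
    return ops, dev


def report_failure(ops, dev, problem, action=UNKNOWN_ACTION, exc=None):
    """Ops gets a problem and an action; dev gets the stack trace."""
    ops.error("PROBLEM: %s | ACTION: %s", problem, action)
    if exc is not None:
        dev.error("%s", problem, exc_info=exc)
```

In a real app the two handlers would point at different destinations, for example an event log that ops monitors and a rolling file that only developers read.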
Oh, just one more thing about this… I’ve talked before about apps that throw a lot of exceptions and the perf impact this has even if the exceptions are handled. Even if you forget about the perf impact, there is another disadvantage to throwing a lot of exceptions: the app becomes a lot less supportable. Why? Because if you need to debug the app and dump or log on a specific exception type, a lot of benign exceptions will generate lots of dumps or logs, which takes time and disk space, and more importantly makes it very hard to find the needle in the haystack.
5. With ASP.NET we can update the sites even when they are live, ASP.NET will handle the rest
When you update the site with a number of new assemblies for example, old requests will finish running with the old assemblies and a new appdomain will be created when the next request comes in with the new assemblies loaded.
So far so good…
Now, picture that you have a lot of assemblies in your update and that a new request comes in when you’re halfway done copying the assemblies. In that case new requests will be serviced in a partially updated application, and you may even see locking issues if the load is really high.
If you have a web farm and don’t use sticky sessions, a post can be done from one server (updated) to another (not updated), and if you have changed user controls etc. then the view state might become invalid.
So, for low-volume sites, updates to a live environment are usually cool, but if you have a lot of load you need to consider taking the server out of rotation before updating, or updating when load is not as high.
6. It works on my machine, let’s go live
Test and load test, with lots of scenarios and with appropriate load levels. ’Nuff said.
I know that you know this already, and yet we get so many cases where issues that come up in production and become crisis situations could have been avoided if the applications had been properly load tested.
A lot of issues, such as a specific method causing a leak, or a hang, can even be discovered with very simple load testing at the dev stage.
There are plenty of really good load testing tools and profiling tools out there like LoadRunner, ANTS Profiler, Visual Studio Team System Test etc. I’m sure you have your own favorites.
For poor man’s stress testing that can be done on the dev machine, you can use the free tool TinyGet that comes with the IIS 6.0 Resource Kit.
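If you don’t have a tool handy, the basic idea behind this kind of poor man’s stress test (a few threads each looping over the same request) fits in a handful of lines. The sketch below is Python and is not a replacement for TinyGet or a real load testing tool; `run_load` and `fetch` are hypothetical helper names, and the URL in the usage note is a placeholder.

```python
import threading
import time
import urllib.request


def run_load(action, threads=5, loops=20):
    """Call `action` threads * loops times concurrently.

    Returns a tuple of (succeeded, failed, elapsed_seconds).
    """
    counts = {"ok": 0, "failed": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(loops):
            try:
                action()
                outcome = "ok"
            except Exception:
                outcome = "failed"
            with lock:
                counts[outcome] += 1

    started = time.perf_counter()
    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counts["ok"], counts["failed"], time.perf_counter() - started


def fetch(url):
    """One request; raises on connection errors and HTTP error statuses."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()
```

A call such as `run_load(lambda: fetch("http://localhost/myapp/default.aspx"), threads=10, loops=50)` would hammer a single page while you watch memory and CPU on the dev box, which is often enough to spot an obvious leak or hang.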
7. - Do we have a plan for crashes? - We’ll document it in phase 2
Crashes, hangs and memory leaks are not usually something that people really plan for. I have seen extreme examples of planning for these types of issues at some of my customers, like:
#1 One company that I work with has a full fledged plan for what will happen if a crash/hang/memory leak is discovered in production. The plan includes documentation for ops with step by step instructions on how to get dumps, how to upload them etc. They even do fire drills with ops to test that their plans work.
#2 Another company I work with has included code in their app to dump the process under certain conditions. The dumps are then automatically bucketized by issue type, and scripts are auto-run to debug the dumps and collect vital information about the issue. In other words, most of their analysis for these cases is automated to a tee.
Not everyone has to go to these extremes, especially if the app isn’t mission critical, but a good recommendation would be to have some documentation for ops on how to act in general cases so that you get the most data possible about the issue. Like how to gather performance counter data or dumps.
In my last post I described how you can set up rules with DebugDiag that ops can activate as needed.
8. What do you mean baseline? I think CPU usage is usually around 40%, maybe 50
When you troubleshoot a problem like a hang, memory leak or crash a key piece of information is often “what is different in the failure state compared to the normal state”.
Setting up a performance counter log that rolls over for example every 24 hours and alerts ops when certain values exceed some predefined number is very easy and has extremely low impact on the system.
Having this history when you troubleshoot something is, as I mentioned, very useful since you can see things like “right around the time it crashed, memory went up to x MB”, or “we started seeing a large number of exceptions” etc. Although it might not solve the issue, it can often give you a good direction to move in.
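To show what “compare against the baseline” can mean in practice, here is a small Python sketch. It does not read real performance counters; it assumes you already have counter samples exported from your perf log, and the counter names (`cpu_percent`, `private_bytes_mb`), the thresholds and the three-standard-deviation tolerance are all illustrative assumptions rather than recommended values.

```python
import statistics


def build_baseline(samples):
    """Summarize normal-state samples of one counter as (mean, standard deviation)."""
    return statistics.mean(samples), statistics.stdev(samples)


def deviates(value, baseline, tolerance=3.0):
    """True if value is more than `tolerance` standard deviations from normal."""
    mean, stdev = baseline
    return abs(value - mean) > tolerance * stdev


def exceeded(sample, thresholds):
    """Return the names of counters in `sample` that are above their alert threshold."""
    return sorted(
        name for name, limit in thresholds.items() if sample.get(name, 0) > limit
    )
```

With a baseline built from a normal week, a question like “is 90% CPU unusual for this box?” gets answered by `deviates(90, baseline)` instead of by someone’s memory of what the number usually is.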
This article about Performance monitoring and when to alert administrators is from 2003, but except for a few changes in a few of the counters most of it still holds true. The article describes what counters to look at, suggested trigger values and some typical causes for issues that cause the counter to hit the trigger. It’s simply a must read.
9. I don’t need to care about memory management, isn’t that what the GC is for?
True, true, the GC manages the .net memory in the sense that it will automatically collect any objects that are collectable and free them so you don’t have to call free or release as you would in native languages.
Memory problems are very seldom caused by the GC not collecting as it should. Instead they are often caused by the app unintentionally holding on to the objects, through references or through not disposing/closing/clearing disposable objects.
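A forgotten reference is easy to demonstrate. The sketch below is Python, whose collector works differently from the .NET GC (it relies mostly on reference counting), so take it only as an illustration of the reachability principle: an object cannot be collected while something long-lived still points at it. `Page`, `EventBus` and `still_alive_after_collect` are made-up names.

```python
import gc
import weakref


class Page:
    """Stand-in for any object you expect to be short-lived."""


class EventBus:
    """A long-lived object; anything it references stays alive with it."""

    def __init__(self):
        self.subscribers = []


def still_alive_after_collect(unsubscribe):
    """Report whether a Page survives collection once we drop our own reference."""
    bus = EventBus()
    page = Page()
    bus.subscribers.append(page)  # the reference that is easy to forget
    probe = weakref.ref(page)     # observes the object without keeping it alive

    if unsubscribe:
        bus.subscribers.remove(page)
    del page                      # our own reference is gone
    gc.collect()
    return probe() is not None
```

`still_alive_after_collect(False)` reports that the page is still in memory because the bus still references it, while `still_alive_after_collect(True)` reports that it is gone. The collector did its job in both cases; the only difference is whether the app let go.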
I have written tons of articles around memory management and memory issues in .net so rather than listing them all, just look at the Memory #tag to get more info about .net memory issues, how the GC works and how .net memory management works.
The moral of the story here is that even if you have a garbage collector, you still need to make sure that your memory is ready to collect.
I would love to hear your comments on these topics.
This is by no means supposed to be a complete list, so I would also love to hear about your own tips on how Devs can make life easier for the Ops guys and avoid production issues.
On the soapbox,