In this post, Principal Application Development Manager Reed Robison explains how a periodic review of IIS log data, often ignored and archived, provides valuable insight into the operational health and performance of web applications.
We recently posted about how a focus on DevOps is changing the way IT Operations and Developers work together to deliver higher quality solutions. IT/Dev collaboration takes practice and finding ways to effectively work together might be a little different in every organization. Examples are always helpful, so let’s talk about web application performance.
It is pretty common for teams to do performance testing before releasing an application. While testing, developers intensely analyze and optimize code. After applications are released into production, they rarely receive the same level of operational scrutiny. The focus tends to shift toward basic monitoring of a few high-level metrics like CPU, memory, or response time. While this may be sufficient for daily monitoring, it can mask details and trends that are captured in logs and creep in over the lifecycle of an application. If testing is not part of every release cycle, it’s not uncommon for new bottlenecks to manifest through code updates, changes in usage patterns, or a steady increase in load as the business grows. These little problems may go unnoticed until they become big problems. A good way to regularly and comprehensively assess operational environments is through RAP as a Service; however, you have data available right now that provides operational insight into the health of your web apps. Almost every organization records and archives web logs. In this post, we’ll look at some tips for using IIS logs to quickly produce operational insights about the performance of your application.
Using data you already have
In Premier Support, customers routinely engage us with performance and stability problems in web applications. While Testing Services engagements can comprehensively help customers plan for and avoid capacity issues, not every application gets that level of attention. Similarly, there are many powerful diagnostics and monitoring tools that can help isolate problems once integrated into a solution, but they can require significant code and architectural changes. Sometimes we have to work with whatever data we can get out of a live production system, and very often that means IIS logs. Most operational teams can provide a history of IIS logs for any web application. It turns out that IIS logs are generally full of valuable trend data that developers can use to isolate performance problems in their code. They also provide a historical record for analyzing trends over a period of days or months.
In order to size up the health of a web application very quickly and with limited data, consider three things:
- Are the top requests completing quickly?
- How much load is the system handling on a given day?
- Are there any errors being thrown?
IIS logs record data that answers all of these questions. Let’s take a look at some of the data that IIS logs provide and examine how it is not only useful for isolating performance problems but also a great way to gain insight into the operational health of your web applications.
IIS Log Parser – Fast and simple
A simple tool that can quickly analyze log data and answer these questions is IIS Log Parser. It has been around for years, it is a free download, and it includes sample queries that quickly parse and present this data in a variety of formats. This tool is still really useful for quickly analyzing activity for any app or service that records data into IIS log files. It’s worth noting that there are many community tools and samples that build on Log Parser. I won’t cover all those here, but I encourage you to explore all this tool can do.
In its simplest form, running a command-line query against the log files in the current directory looks like this:
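A minimal invocation might look like the following (a sketch: it assumes LogParser.exe is on your PATH, that the logs are in the default W3C extended format, and the query itself is just an illustrative example):

```
LogParser.exe -i:IISW3C "SELECT TOP 10 cs-uri-stem, COUNT(*) AS Hits FROM *.log GROUP BY cs-uri-stem ORDER BY Hits DESC"
```

The -i:IISW3C switch tells Log Parser to interpret the input as IIS W3C extended log files. The shipped samples can be run the same way using the file: syntax (for example, file:Top20URIs.sql); some samples expect parameters, which are documented in the sample files themselves.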
Using Samples and Building Queries
Requests per Hour quickly identifies how busy a system is and when peaks may be occurring. It’s good to keep this number in mind when you are monitoring production load against tested capacity limits and anticipating future growth. Every application has capacity limits, and it is important to understand when these thresholds are being hit. This also tells you how much room you have to grow before adding more hardware or further optimizing the system. (RequestPerHour.sql from the LogParser Samples\Queries directory)
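The shipped sample is along these lines (a sketch, not the exact file contents; the field names assume the default W3C logging fields are enabled):

```sql
SELECT QUANTIZE(TO_TIMESTAMP(date, time), 3600) AS Hour,
       COUNT(*) AS Total,
       SUM(sc-bytes) AS TotBytesSent
FROM *.log
GROUP BY Hour
ORDER BY Hour
```

QUANTIZE rounds each request’s timestamp down to the hour (3600 seconds), so the counts bucket naturally into hourly totals.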
Top 20 URIs is a health indicator of the top 20 most requested resources. Top requests are very important to overall performance and should be optimally tuned for the best response times possible. Anytime you see a page that is both a top requested page and a slow page, that’s a red flag. (Top20URIs.sql from the LogParser Samples\Queries directory)
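A query in the spirit of the Top20URIs sample looks like this (a sketch under the same W3C-field assumptions; time-taken is in milliseconds):

```sql
SELECT TOP 20 cs-uri-stem,
       COUNT(*) AS Total,
       MAX(time-taken) AS MaxTime,
       AVG(time-taken) AS AvgTime,
       AVG(sc-bytes) AS AvgBytesSent
FROM *.log
GROUP BY cs-uri-stem
ORDER BY Total DESC
```

Sorting by request count rather than by time is the point here: it surfaces the pages your users actually hit, which are the ones where slowness hurts most.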
Slow Pages are simply the top 25 slowest requests on average (in milliseconds). Every application will have different response time targets, but anytime the average response time exceeds the patience of a user waiting for data, that’s a problem. Consider how long you will wait on a page before hitting F5 (refresh) or clicking on the same option to try again. Pay special attention to both Average and Max Times. Average Times represent how a page normally behaves. Max Times represent the slowest recorded case.
LogParser does not include a sample of a slow page query, but you can find many examples in the LogParser community and query syntax in the Help file. I use something like this:
SELECT TOP 25 cs-uri-stem AS URL,
MAX(time-taken) AS Max,
MIN(time-taken) AS Min,
AVG(time-taken) AS Average
FROM *.log
GROUP BY URL
ORDER BY Average DESC
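Saved to a file (I’ll call it SlowPages.sql here; the name is arbitrary), the query can be run against the logs in the current directory with:

```
LogParser.exe -i:IISW3C file:SlowPages.sql
```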
Finally, it is always good to run an Errors query to identify any errors being reported in the logs. Go through all of these errors and make sure you understand why they are occurring. If you are seeing 500s or application errors that are the result of code exceptions, you should fix those immediately. (Errors.sql and AspErrors.sql from the LogParser Samples\Queries directory)
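If you prefer to roll your own, a query along these lines reports client and server errors with their sub-status codes, matching the Request/Status/Total layout shown later in this post (a sketch; field names assume default W3C logging):

```sql
SELECT cs-uri-stem AS Request,
       STRCAT(TO_STRING(sc-status), STRCAT('.', TO_STRING(sc-substatus))) AS Status,
       COUNT(*) AS Total
FROM *.log
WHERE sc-status >= 400
GROUP BY Request, Status
ORDER BY Total DESC
```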
Log Parser can slice through megabytes of log files and can even aggregate multiple log files in a single query if you want to combine results from several servers.
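For example, to combine the logs from two front-end servers into one result set, list both file masks in the FROM clause (the server names and UNC paths below are illustrative):

```
LogParser.exe -i:IISW3C "SELECT cs-uri-stem, COUNT(*) AS Hits FROM \\WEB01\logs\u_ex*.log, \\WEB02\logs\u_ex*.log GROUP BY cs-uri-stem ORDER BY Hits DESC"
```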
Interpreting Log Parser Results
Requests per Hour provides insight into peak load times during a production day and exactly how many requests are being served. For example:
In the query results below, you can see that the site is serving under 5,000 requests per hour until about the 13th hour, then peaks at about 19k requests in hour 15. 19k requests per hour is roughly 5 req/s. It’s a good time to ask: What is the tested capacity of this server? Do I need more servers in this farm? Do I need to look at optimizing code? Should I be able to achieve more than 5 req/s?
Hour Total TotBytesSent
------------------- ----- ------------
2014-12-05 00:00:00 5166 91816872
2014-12-05 01:00:00 3271 11851947
2014-12-05 02:00:00 3078 5350705
2014-12-05 03:00:00 3004 15081820
2014-12-05 04:00:00 3273 19099935
2014-12-05 05:00:00 2880 3780950
2014-12-05 06:00:00 2924 5315683
2014-12-05 07:00:00 2879 3779657
2014-12-05 08:00:00 2583 3375392
2014-12-05 09:00:00 2880 3780215
2014-12-05 10:00:00 2936 5903824
2014-12-05 11:00:00 2951 7119071
2014-12-05 12:00:00 5301 176485655
2014-12-05 13:00:00 10797 671642461
2014-12-05 14:00:00 16422 976279264
2014-12-05 15:00:00 19052 935087820
2014-12-05 16:00:00 17926 981329410
The Top 20 URIs represent the requests with the greatest potential impact on site performance. These are the requests served more than any others, so they should be fast and healthy. If you see slow requests among them, it’s important to dig in to understand WHY.
cs-uri-stem Total MaxTime AvgTime AvgBytesSent
-------------------- ------ ------- ------- ------------
/ 141729 41965 6 1305
/Common/Search.aspx 8224 305233 7993 140597
/ScriptResource.axd 6729 3837 70 5398
/CustomerProfile.aspx 5505 224034 1043 444144
/SiteWrapper.aspx 5325 3822 111 1530
In the data above, the root page is clearly requested more than any other page (141729 times). Its average performance is very good (6 ms), but there was an instance where the page took over 40 seconds. You could further explore the logs to understand when that request occurred and what else was going on with the system.
More important is the Search page, which is one of the most frequently requested pages. It takes an average of about 8 seconds to complete and, in the worst case, took almost 5 minutes to finish. That represents a huge bottleneck and could have a significant impact on site performance. That page is worth spending some time tuning, and it’s worth exploring which search queries lead to these results.
Slow Pages are a great way to spotlight any requests that do not meet your minimum performance requirements for a site. It is a good, common practice to establish a response-time goal for any web site (e.g., every page returns in less than 5 seconds). A slow pages query can help you understand whether you are meeting that goal. In addition, slow requests can be expensive requests in terms of resources (CPU, memory, complex database queries, etc.). They can block important worker threads and may represent opportunities to increase a site’s performance. You never want to see Top URIs also showing up in the Slow Pages results. Simply put, slow pages are bad. It only takes a few slow pages that are also top requested pages to have a dramatic, negative impact on a site.
Looking at the Slow Pages query results below, there are several pages that are taking over 10 seconds to complete on average.
URL Max Min Average
--------------------------------- ------ ---- -------
SupervisorAudit.aspx 349957 78 46376
Employee.aspx 302409 0 38396
FTEInformation.aspx 38267 156 19211
HRDiscounts.aspx 314297 31 13051
AllEmployees.aspx 40202 46 10336
NotificationHistory.aspx 9672 9672 9672
ChangesAndApprovals.aspx 54709 62 9463
ManageEmployee.aspx 184160 124 8609
While there are no hard rules about how long a request should be given to complete, the user experience is always paramount. Long-running requests may be fine if they are handled asynchronously and the user is properly informed about progress and status. Requests that take 20-40 seconds to complete without informing the end user will lead to a bad experience and commonly lead to serious performance problems as a result of tying up worker threads. The real goal is to identify slow requests and ensure the behavior is what you expect. It’s common to see sites that start out very well and degrade as infrastructure changes or as load increases with growth. Periodic review of this information will help ensure you stay ahead of capacity problems.
Running the Errors query is a great way to see if there are significant errors occurring in your app, as well as a good way to spot bad URI references and malicious activity or security scans occurring within your environments. While 404s won’t impact your performance, they can impact the user experience if ignored. 500s can represent a range of exceptions; it is advised that you investigate, understand, and eliminate all of these. If you see malicious-looking requests, check the source IP (also in the logs) to see whether it is coming from a known source (internal security scans) or an unexpected one, and make the appropriate inquiries.
Request Status Total
-------------------------------------------------- ------ -----
/touch-icon.png 404.0 43
Images/MenuButtons/tab_left_on.gif 404.0 40
/scripts/biztalkhttpreceive.dll 404.0 1
/scripts/..%2f../winnt/system32/cmd.exe?/c+dir+c:\ 500.0 1
Even after applications are tested and turned over to production, it’s important to pay attention to operational health. Environments change. Use cases change. Companies grow, and demand on operational services grows with them.
- If you write code for web apps, ask your IT/Operations team for production IIS logs periodically and review them for ongoing awareness of health and performance.
- Use LogParser to run RPH, Top URI, Slow Pages, and Errors queries and see how production metrics hold up against the results you observed during performance benchmarks and against operational targets. Consider Requests Per Hour results to gauge how much your load has increased since initial release and how much room you have to grow.
Premier Support for Developers provides strategic technology guidance, critical support coverage, and a range of essential services to help teams optimize development lifecycles and improve software quality. Contact your Application Development Manager (ADM) or email us to learn more about what we can do for you.