As a server administrator (SA), your responsibilities may include supporting web applications in the production and pre-production environments, including deployments and performance monitoring. There will be times when an application's performance begins to lag or the application starts experiencing errors. The more you know about what the application does and how it performs *before* such a situation, the easier it will be to begin the troubleshooting process.
Based on my experience working with SAs, troubleshooting can be long and painful when there is little information about what the application is doing and which technologies it uses to access databases or third-party resources. Here are some questions to ask the developers, tools to use to inspect the application, and information to catalog to assist with the maintenance of the web application.
Questions for the Development Team Before Transition to Production:
There are times when an SA will inherit a server with legacy applications and little or no documentation to help in understanding them. In these situations, the tools mentioned below will assist in inspecting the application and providing the details needed to troubleshoot.
In the situation where the developers are available to provide information, ask the following questions, catalog the details in Excel, SharePoint, or OneNote, and share them with the other SAs on the team:
- What version of .NET is used in the application? If not .NET, what is the managed engine or runtime used (PHP, CGI, ColdFusion, etc.)?
- What dependencies are used by the application? List all third-party dependencies and the versions of the assemblies.
- Does the application make web service calls to servers other than the one it is hosted on?
- What are the IP addresses of the web services?
- Are the web services maintained in-house or are these third-party?
- What database engine is used by the application? SQL Server, Oracle, DB2, etc.
- Are there embedded SQL statements in the application? (These are potential performance and security issues!)
- What is the IP address of the database server?
- What is the database name on the server? Who is the DBA responsible for the database and server?
- Is a reporting service used by the application? Can this impact performance of the application with long running reports?
- Is out-of-process session state used for the .NET application? If so, where is the SQL Server or ASP.NET session state service running?
- Is the web application load balanced? If so, and in-process session state is used, are sticky sessions used on the load balancer?
- Has the application been load tested?
- What are the average response times for the pages? What are the longer running pages and why?
- What issues encountered during the development and testing process could impact the user experience or performance? There is always something of concern in every web application.
Obtain information about the release cycle for the web application so the SA team can plan and prepare for deployments:
- What is the cadence of upcoming releases?
- Is there documentation with each release to explain the new features or bugs fixed as well as files being changed?
- Is there a back-out process in case there is a major issue in the latest release?
- Is there a test process in place for the application in the pre-production environment?
- Is there a process for emergency hotfix releases?
- What is the SLA for this application? Meaning, how critical is the application to the business if it goes offline or if its performance is impacting the work of the employees?
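One way to keep the answers to these questions consistent across applications is to record them in a fixed structure and export it as CSV, which opens directly in Excel. The following is only a sketch: the field names, the sample values, and the `write_catalog` and `session_risk` helpers are all hypothetical, so adjust them to whatever your team actually tracks.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class AppCatalogEntry:
    """One row of the SA team's application catalog (illustrative fields only)."""
    app_name: str
    runtime: str            # e.g. ".NET 4.0", "PHP"
    database_engine: str    # e.g. "SQL Server", "Oracle"
    database_server: str
    dba_contact: str
    session_state: str      # "InProc", "StateServer", or "SQLServer"
    load_balanced: bool
    sticky_sessions: bool
    load_tested: bool
    release_cadence: str
    sla: str


def write_catalog(entries, stream):
    """Write catalog entries to a text stream as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(AppCatalogEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))


def session_risk(entry):
    """True when in-process session state sits behind a load balancer without sticky sessions."""
    return entry.load_balanced and entry.session_state == "InProc" and not entry.sticky_sessions


# Hypothetical sample entry, for illustration only.
entry = AppCatalogEntry(
    app_name="Orders Portal", runtime=".NET 4.0", database_engine="SQL Server",
    database_server="dbserver01", dba_contact="dba-team", session_state="InProc",
    load_balanced=True, sticky_sessions=False, load_tested=False,
    release_cadence="monthly", sla="business hours",
)
buffer = io.StringIO()
write_catalog([entry], buffer)
print(buffer.getvalue())
print("Session state risk:", session_risk(entry))
```

The `session_risk` check simply encodes the load balancer question from the list above, so a risky combination stands out when the catalog is reviewed.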
Determine the Web Application's Heaviest Load Period:
To better identify when issues can arise due to load on the application, run the following Log Parser queries to determine its heaviest load periods. The load pattern may be easier to determine for a line-of-business (LOB) application than for an application exposed to the internet. With several months of IIS logs and some persistence, load patterns can be identified, and this helps with supporting the servers because additional servers or services can be added for the anticipated load.
Future posts will dive into using Log Parser to inspect the web application using the IIS logs. The queries below provide the number of requests and the average response time per hour. Run them against the IIS logs, export the details from the datagrid to Excel, and determine the load pattern.
Load per hour on the server (includes all pages and HTTP Status Codes):
Logparser "SELECT quantize(time,3600), count(*) as Frequency from <path to logs>\*.log GROUP BY quantize(time,3600) order by quantize(time, 3600)" -i:IISW3C -o:datagrid
This query lists, for each hour, the number of requests and the average response time (time-taken, in milliseconds) for each page returning HTTP status code 200:
logparser "SELECT TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)), count(*) as Requests, avg(time-taken) as AvgTime, cs-uri-stem from <path to logs>\*.log WHERE sc-status=200 GROUP BY cs-uri-stem, TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) ORDER BY TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) DESC" -i:IISW3C -o:datagrid
This query shows the number of requests and the average response time per hour for a single page or a grouping of pages. Just modify the LIKE pattern in the WHERE clause to zero in on a single page:
logparser "SELECT TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)), count(*) as Requests, avg(time-taken) as AvgTime, cs-uri-stem from <path to logs>\*.log WHERE sc-status=200 AND (cs-uri-stem like '%/<pagename>.aspx') GROUP BY cs-uri-stem, TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) ORDER BY TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) DESC" -i:IISW3C -o:datagrid
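If Log Parser is not available on the machine where the logs end up, the same hourly bucketing can be approximated with a short script. This is a rough sketch and not a replacement for the queries above: it assumes W3C-format logs with a `#Fields:` directive that includes `date`, `time`, and `time-taken`, the `hourly_load` function name is my own, and the sample lines are made up. Unlike the queries, it keeps the timestamps as logged (IIS logs in UTC by default) and does not convert them to local time.

```python
from collections import defaultdict


def hourly_load(lines):
    """Return {(date, hour): (request_count, avg_time_taken_ms)} from W3C log lines."""
    field_names = []
    totals = defaultdict(lambda: [0, 0])  # (date, hour) -> [count, total ms]
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive names the columns of the rows that follow.
            field_names = line.split()[1:]
            continue
        if not line or line.startswith("#") or not field_names:
            continue
        row = dict(zip(field_names, line.split()))
        if "date" not in row or "time" not in row:
            continue
        taken = row.get("time-taken", "0")
        bucket = totals[(row["date"], row["time"][:2])]
        bucket[0] += 1
        bucket[1] += int(taken) if taken.isdigit() else 0
    return {key: (count, total / count) for key, (count, total) in sorted(totals.items())}


# Made-up sample log lines, for illustration only.
SAMPLE = """#Software: Microsoft Internet Information Services 7.5
#Fields: date time cs-method cs-uri-stem sc-status time-taken
2013-01-15 09:05:00 GET /default.aspx 200 120
2013-01-15 09:45:10 GET /report.aspx 200 880
2013-01-15 10:02:33 GET /default.aspx 200 100
""".splitlines()

for (date, hour), (count, avg_ms) in hourly_load(SAMPLE).items():
    print(f"{date} {hour}:00  requests={count}  avg_ms={avg_ms:.0f}")
```

To run it on a real log, pass an open file instead of `SAMPLE`, for example `hourly_load(open(path))`.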
Identify the Longest Running Pages:
A user can have a fantastic experience in the web application with fast response times as they work through the application. However, there could be one or two pages with terrible response times, and these pages can impact the web application overall. It is crucial to identify these pages and work with the development team to try to improve the response times.
The first step is to identify the pages using Log Parser.
This query identifies the 20 slowest ASPX pages by average response time and can be modified to capture other page types:
logparser "SELECT TOP 20 cs-uri-stem, avg(time-taken) as AvgTime, count(cs-uri-stem) as RequestCount FROM <path to logs>\*.log WHERE (cs-uri-stem like '%.aspx') GROUP BY cs-uri-stem ORDER BY AvgTime DESC" -i:IISW3C -o:datagrid
The second step is to output the Log Parser results to an Excel file and set a response-time threshold above which a page is discussed with the developers. For example, if a dynamic page averages longer than 2 seconds, the development team looks into the issue and tries to reduce the response time. Alternatively, the development team may state that this is the best the page will do, and that is documented in case a support case arises about the response time.
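The threshold check itself is easy to automate once the results are in a file. Here is a minimal sketch, assuming the slowest-pages query was written to CSV (Log Parser's `-o:CSV` output format) so the columns are `cs-uri-stem`, `AvgTime`, and `RequestCount`. The `pages_over_threshold` function and the sample rows are hypothetical, and the 2000 value is just the 2-second example expressed in milliseconds, since IIS logs time-taken in milliseconds.

```python
import csv
import io

THRESHOLD_MS = 2000  # the 2-second example threshold, in milliseconds


def pages_over_threshold(csv_text, threshold_ms=THRESHOLD_MS):
    """Return (page, avg_ms, request_count) for pages above the threshold, slowest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    slow = []
    for row in reader:
        avg_ms = float(row["AvgTime"])
        if avg_ms > threshold_ms:
            slow.append((row["cs-uri-stem"], avg_ms, int(row["RequestCount"])))
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Made-up sample results, for illustration only.
SAMPLE_CSV = """cs-uri-stem,AvgTime,RequestCount
/default.aspx,180,9210
/search.aspx,2300,412
/reports/monthend.aspx,5400,37
"""

for page, avg_ms, count in pages_over_threshold(SAMPLE_CSV):
    print(f"{page}: {avg_ms:.0f} ms average over {count} requests")
```

Keeping the request count next to the average helps the conversation with the developers: a slow page requested 37 times is a different priority than a slow page requested thousands of times.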
Inspect the Web Application:
Whether this is a new or legacy application, it is important to understand the calls the application is making to internal and external services. Utilities such as Process Monitor and TCPView allow the SA team to see these application-level requests. These tools are also very helpful for identifying the files and other resources the application requests when there is limited documentation about the application.
The first step is to review the web.config for the application and catalog the following:
- Web service(s)
- Database connection(s)
- Assembly references and version numbers
- .NET version and other managed engines used
- Custom security through Forms Authentication
- Certificates used
- File references from UNC or the local server
- Any other references or configurations that could cause a support issue
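Part of this catalog can be pulled out of web.config with a script. The sketch below assumes a conventional ASP.NET web.config with no default XML namespace on the `<configuration>` element; the `summarize_web_config` function and the sample file contents are made up for illustration. It deliberately records only the names and providers of connection strings, not the strings themselves, because those often contain credentials.

```python
import xml.etree.ElementTree as ET


def summarize_web_config(xml_text):
    """Collect the catalog items that can be read directly from web.config."""
    root = ET.fromstring(xml_text)
    compilation = root.find("./system.web/compilation")
    authentication = root.find("./system.web/authentication")
    return {
        # Name and provider only; the connection string may hold a password.
        "connection_strings": [
            (add.get("name"), add.get("providerName"))
            for add in root.findall("./connectionStrings/add")
        ],
        "assemblies": [
            add.get("assembly")
            for add in root.findall("./system.web/compilation/assemblies/add")
        ],
        "target_framework": compilation.get("targetFramework") if compilation is not None else None,
        "authentication_mode": authentication.get("mode") if authentication is not None else None,
        # appSettings keys often point at web service URLs or file paths.
        "app_settings": [add.get("key") for add in root.findall("./appSettings/add")],
    }


# Made-up sample web.config, for illustration only.
SAMPLE_CONFIG = """<configuration>
  <appSettings>
    <add key="InventoryServiceUrl" value="http://services.example.internal/inventory.asmx" />
  </appSettings>
  <connectionStrings>
    <add name="OrdersDb" connectionString="Data Source=dbserver01;Initial Catalog=Orders;Integrated Security=True" providerName="System.Data.SqlClient" />
  </connectionStrings>
  <system.web>
    <compilation targetFramework="4.0">
      <assemblies>
        <add assembly="System.Data.Linq, Version=4.0.0.0, Culture=neutral, PublicKeyToken=B77A5C561934E089" />
      </assemblies>
    </compilation>
    <authentication mode="Forms" />
  </system.web>
</configuration>"""

for item, value in summarize_web_config(SAMPLE_CONFIG).items():
    print(f"{item}: {value}")
```

Certificates, UNC paths, and custom sections vary too much between applications to parse generically, so those still need a manual read of the file.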
These are the standard utilities used to understand application-level process calls:
- TCPView (http://technet.microsoft.com/en-us/sysinternals/bb897437.aspx) shows database and web service calls.
- Process Monitor (http://technet.microsoft.com/en-us/sysinternals/bb896645) captures file, registry, and thread activity.
- Process Explorer (http://technet.microsoft.com/en-us/sysinternals/bb896653) identifies the PID for a worker process and the DLL files loaded in the process.
- PerfView (http://www.microsoft.com/en-us/download/details.aspx?id=28567) is a low-impact .NET profiler.
Baseline the Server using Perfmon:
Once the highest load is identified from the IIS logs, set up a Perfmon capture to understand how the server is performing at the OS level under load. A Perfmon capture is easy to set up on the production web servers, and the overhead on the servers is minimal. It is best to use a template of counters and schedule the capture to start just before business hours and stop afterwards. This can be done after the IIS log review has identified the busiest periods on the web servers, such as month-end reporting or a deadline for entering data, and the schedule can be set up to capture this load.
It is best to perform the Perfmon capture and obtain the IIS logs for the same time period to provide a holistic overview of the server and the web application performance.
Here is an example of the counters to capture:
- \.NET CLR Data(*)\*
- \.NET CLR Exceptions(*)\*
- \.NET CLR Interop(*)\*
- \.NET CLR Jit(*)\*
- \.NET CLR Loading(*)\*
- \.NET CLR LocksAndThreads(*)\*
- \.NET CLR Memory(*)\*
- \.NET CLR Networking(*)\*
- \.NET CLR Remoting(*)\*
- \.NET CLR Security(*)\*
- \.NET Data Provider for SqlServer(*)\*
- \ASP.NET Applications(*)\*
- \ASP.NET Apps v1.1.4322(*)\*
- \ASP.NET Apps v2.0.50727(*)\*
- \ASP.NET Apps v4.0.30319(*)\*
- \ASP.NET v2.0.50727(*)\*
- \Active Server Pages(*)\*
- \HTTP Service Request Queues(*)\*
- \HTTP Service Url Groups(*)\*
- \HTTP Service(*)\*
- \Internet Information Services Global(*)\*
- \Network Interface(*)\*
- \Paging File(*)\*
- \Web Service Cache(*)\*
- \Web Service(*)\*
After the capture is completed, use the PAL utility (http://pal.codeplex.com) to analyze the capture and produce the output in a report.
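A counter template can be kept as a plain text file, one counter path per line, and handed to the `logman` command-line tool that ships with Windows. The following sketch only writes the counter file and prints a `logman create counter` command for review; it does not run anything. The collector name `WebBaseline`, the output path, the file name, and the 15-second sample interval are my own assumptions, and the counter list is a small subset of the one above.

```python
import tempfile
from pathlib import Path

# A subset of the counters listed above; extend with the full list as needed.
COUNTERS = [
    r"\.NET CLR Memory(*)\*",
    r"\ASP.NET Applications(*)\*",
    r"\HTTP Service Request Queues(*)\*",
    r"\Web Service(*)\*",
]


def write_counter_file(path, counters=COUNTERS):
    """Write one counter path per line, the layout logman expects for -cf."""
    path = Path(path)
    path.write_text("\n".join(counters) + "\n", encoding="ascii")
    return path


def build_logman_command(name, counter_file, output_path, interval_seconds=15):
    """Build (but do not run) the command that creates a counter data collector."""
    return [
        "logman", "create", "counter", name,
        "-cf", str(counter_file),       # file with the counters to collect
        "-si", str(interval_seconds),   # sample interval in seconds
        "-f", "bin",                    # binary .blg log format
        "-o", str(output_path),         # output log path
    ]


# Hypothetical names and paths, for illustration only.
counter_file = write_counter_file(Path(tempfile.gettempdir()) / "web_baseline_counters.txt")
command = build_logman_command("WebBaseline", counter_file, r"C:\PerfLogs\WebBaseline")
print(" ".join(command))
```

Once the collector exists, `logman start WebBaseline` and `logman stop WebBaseline` can be scheduled around the busy period identified from the IIS logs. Verify the command against `logman /?` on your server version before relying on it.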
Categorize the Support Cases:
Too often, support cases handled by the SA team arrive through email, phone conversations, or a cubicle drive-by from an end user who wants to discuss an issue with a web application. The details of the support case are essentially lost: the problem statement, the troubleshooting steps, the root cause, and the time spent troubleshooting the issue. A ticketing system or problem management system will assist in capturing this information and will provide the most important detail, which is how much time is spent troubleshooting a web application. If an SA spends 75% of their time dealing with issues from 2 out of the 10 applications on a server, it is time to have a meeting with management and the development teams to find a way to reduce the support cost of those applications. The SA does not have the metrics to have that discussion until they are captured and analyzed.
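Once the time is being recorded, the analysis is simple arithmetic. As a rough sketch, assuming the ticketing system can export each case as an application name plus hours spent, the share of support time per application can be computed like this. The `time_share_by_app` function and the ticket data are hypothetical.

```python
from collections import defaultdict


def time_share_by_app(tickets):
    """tickets: iterable of (application, hours). Returns {app: percent of total}, highest first."""
    hours = defaultdict(float)
    for app, spent in tickets:
        hours[app] += spent
    total = sum(hours.values())
    if total == 0:
        return {}
    ranked = sorted(hours.items(), key=lambda item: item[1], reverse=True)
    return {app: round(100 * spent / total, 1) for app, spent in ranked}


# Made-up ticket export, for illustration only.
TICKETS = [
    ("Payroll", 12.0),
    ("Intranet", 2.0),
    ("Payroll", 6.0),
    ("Timesheets", 4.0),
    ("Orders", 6.0),
]

for app, percent in time_share_by_app(TICKETS).items():
    print(f"{app}: {percent}% of support time")
```

With the sample data, Payroll accounts for 60% of the recorded hours, which is the kind of number that makes the conversation with management concrete.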
Bottom line, get or create a problem management system and use it!