Diagnosing failures in DTM

So you've run all the tests in DTM and tried to create a submission package, but you find that you are failing tests. Don't panic: DTM was built with testing in mind and will give you all the information you need to fix the problem fast.

The first thing you need to know to diagnose a test failure is that each test consists of a job made up of tasks. These tasks may be copy tasks, execute tasks, run job tasks, or copy result tasks. Because one job can call another, any particular job has a tree of tasks beneath it. This can seem intimidating at first, but don't lose hope: with a systematic approach to examining the failure we can quickly converge on the root cause.

When analyzing a failure you need to home in on the task that contains the point of failure. You may see many failures in a job, but not all of them matter. I know this is confusing, but stay with me. There are two key concepts to understand: roll up results and failure action. Whenever a task fails (whether that is determined from the log, the exit code, an unexpected reboot, etc.) DTM marks the task with a red X. A red X alone does not mean the task caused the job to fail. If the task marked with the X has roll up results set to false, it will not cause the job to fail. Tasks that do not roll up results also commonly have a failure action of IgnoreFailAndContinue, which tells DTM not to halt job execution when the task fails. Let's use this knowledge to analyze a failure.
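The roll-up rule can be sketched as a small tree walk. This is only an illustrative model, not a DTM API: the Task class, its fields, and find_root_cause are hypothetical names, and the example tree is a simplified assumption about how the tasks discussed in this post are nested.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical model of a task in a job tree (not a DTM type)."""
    name: str
    failed: bool = False      # task is marked with a red X
    rolls_up: bool = True     # the "roll up results" setting
    # For a Run Job task, the tasks of the child job it calls.
    children: List["Task"] = field(default_factory=list)


def find_root_cause(task: Task) -> Optional[Task]:
    """Return the deepest failed task whose failure rolls up, or None."""
    # A failure that does not roll up cannot fail the parent job.
    if not task.failed or not task.rolls_up:
        return None
    # "Push down" into children, looking for a more specific culprit.
    for child in task.children:
        culprit = find_root_cause(child)
        if culprit is not None:
            return culprit
    # No child explains the failure, so this task is the culprit.
    return task


# Simplified, assumed layout of the job discussed in this post.
job = Task("iSCSI Digest", failed=True, children=[
    Task("iSCSI Digest Setup", children=[
        Task("Create Skip Parameter", failed=True, rolls_up=False),
    ]),
    Task("iSCSI Digest Disk Header", failed=True, children=[
        Task("Copy SDStress ntlog based log", failed=True, rolls_up=False),
        Task("Execute SDStress", failed=True),
    ]),
])

culprit = find_root_cause(job)
print(culprit.name if culprit else "no rolled-up failure found")
```

The walk skips any red X that does not roll up and keeps descending while a failed parent has a failed child that does, which is the same manual process described in the rest of this post.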

I have consolidated all the task information for this job into one view. You can get the same information by selecting the job in job monitor and viewing the Job and Task panes. To look at the tasks of a job called by a Run Job task, right-click the Run Job task and choose “Child Job Result”; I call this “pushing down”. To go back to the tasks of the parent job, right-click the job and choose “Parent Job Result”; I call this “popping up”.

If we look at the iSCSI Digest results, our first intuition might point to the “Create Skip Parameter” task as the cause of the job failure. This is incorrect. The Result Report for the “iSCSI Digest Setup” job shows that the “Create Skip Parameter” task has roll up results set to false. To see this, push down so that “iSCSI Digest Setup” is in the job pane, right-click the job, and choose “View Result Report”.

The next possible failure is “Copy SDStress ntlog based log” under iSCSI Digest Disk Header. If we view the result report for this, we’ll see that this task does not roll up results either. A good rule of thumb is to investigate failures in Copy Results tasks and Cleanup phase tasks last: these are generally non-essential.

The next failure we see is the “Execute SDStress” task. Execute tasks are normally the prime candidates for the cause of a test failure, because this is where all the testing happens. Another key thing to notice is that all parents of this task are marked as failed. Finally, we can look at the result report and confirm that this task rolls up results and is shown as failing.

The next thing to look at is the “Task Result Error Details” section of the report. I will discuss a few common errors and how to debug them:

· Task is Marked Failed as it had non-zero Fail Counts in the LogFile.

o This is the most useful error. It means that the task produced a log and the log contains an error message. To open the log, either click the link in the result report or right-click the task and choose “View Task Log”.

· Task Cancelled Because of an Unexpected Reboot

o This usually indicates a system crash or bugcheck. To debug it, either connect the test client to a kernel debugger, or enable crash dumps and, after the crash, open the resulting memory dump using the -z flag in your preferred debugger. Check the documentation for Debugging Tools for Windows for more hints.

· The Execute Task with Commandline ...Failed with ExitCode XXXXXXXX

o Check to see if the exit code matches a Win32 Error Code. If the error code looks like an HRESULT (starts with 0x8 for failures) try using this tool to decode it. Alternatively you could enter the code into the Error Lookup tool in Visual Studio.

o If this doesn’t get you anywhere, check the test’s parameters. You can find them in the Effective Parameters section of the Result Report of the topmost job (pop up all the way). Normally a bad parameter should be caught by our Parameter Validator (the annoying thing that puts the yellow exclamation mark on all the storage jobs and brings up a dialog if there’s a problem), but this only runs on storage jobs.

o A number of issues resulting from network outages and unavailable sessions will report a failure based on exit code. Check DTM’s “Resolution” section in the error view for more details on this.
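As a rough aid for the exit code case above, the bit layout of an HRESULT can be unpacked by hand. The sketch below is not part of DTM; describe_exit_code is a hypothetical helper, and it assumes you have already converted the exit code from the error message into an integer. It relies only on the standard HRESULT layout (high bit for failure, 11 facility bits, 16 code bits) and on facility 7 being FACILITY_WIN32. Other codes with the high bit set, such as NTSTATUS values, will also be reported as HRESULT-like.

```python
FACILITY_WIN32 = 7  # HRESULTs in this facility wrap a Win32 error code


def describe_exit_code(code: int) -> dict:
    """Split an exit code into HRESULT-style fields where that makes sense."""
    value = code & 0xFFFFFFFF          # normalize signed 32-bit values
    info = {"hex": f"0x{value:08X}"}
    if value & 0x80000000:             # failure bit set: looks like an HRESULT
        facility = (value >> 16) & 0x7FF
        low = value & 0xFFFF
        info.update(kind="HRESULT-like", facility=facility, code=low)
        if facility == FACILITY_WIN32:
            # The low 16 bits are the underlying Win32 error code.
            info["win32_error"] = low
    else:
        # Small positive values are usually plain Win32 error codes.
        info.update(kind="Win32", win32_error=value)
    return info


print(describe_exit_code(0x80070005))
print(describe_exit_code(2))
```

Once you have a Win32 error number, you can look up its message text in the Error Lookup tool mentioned above; on Windows, Python's ctypes.FormatError(number) should return the same system message.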

 

Figure: digest_fail.png (the consolidated task view for the failing job discussed above)