Troubleshooting guide - part 1

Below is a well-executed document from guestRx: Bulent Elmaci. Bulent has worked with Windows Mobile debugging for a long time and backs up his writing with a lot of experience. It is the first of a series of articles he has written to help our OEMs become better debugging developers; we will follow up with the second part soon. Enjoy!

 

 Jumpstart Guide to Troubleshooting on Windows Mobile – Part I

Every Windows Mobile device goes through a full software project cycle before it can be commercialized, made available to mobile operators, and placed in customers’ hands. In this cycle, the development, customization and testing phases play a crucial role in achieving partner and end user satisfaction, and in making sure Windows Mobile phones are of high quality.

As in any other project, various technical problems are often discovered during these phases, related to different aspects of the device and the experience it offers. In terms of their effect on overall device quality and project schedule (i.e. time to market), these issues can range from insignificant ones to device blockers. Regardless of their size or effect, all of these issues need to be attacked by engineers and experts, to identify and implement effective, timely and high quality resolutions.

Troubleshooting (i.e. investigating an issue’s root cause for the purpose of removing it) is the first step in reaching an acceptable resolution for any issue we face during device/application development. More accurately, it is a series of steps that make sure issues are correctly identified in all their aspects, so that effective, timely and high quality resolutions can be reached. To achieve that, using sound strategies, techniques, and tools is crucial. Without the correct approaches and tools, troubleshooting could take more time than it should, or more than is available, or even worse, could lead to incorrect resolutions.

This article will present general guidelines and strategies that should be employed while troubleshooting Windows Mobile device issues. It is the first part of a series of articles I’d like to do. In the second part of this series, I will provide an overview of the tools that are available to make life a little easier for engineers.

STRATEGIES

Using sound strategies while approaching and analyzing a problem directly affects how sound the final resolution will be, how long it will take to reach it, and how painful the process will be. The strategies briefly discussed below are general in nature and can be applied to any development problem, but they still act as the basis for our purposes.

Get a Clear Picture of the Problem

The issue in question might be found by the same engineers who will be troubleshooting it, or by some other team/partner. Regardless of the source, getting a clear picture of the problem at hand is the most important point.

Asking and finding the answer to the following questions would be helpful to better understand the problem:

- Which version of the device is the issue reported/found for? The version applies to WM version, BSP version, radio version, etc.

- What is the expected behavior?

- What is the actual behavior?

- Did the issue exist in previous versions?

- Does the issue occur only on one device (a particular device), one set of devices (with same hardware, WM version, BSP version, configuration and customizations), or all devices with the same hardware and WM/BSP versions?

- Is the reported issue an isolated one? This means finding out whether a similar problem (e.g. connectivity, etc.) exists in another area or use case on the device that might not have been reported initially, but might be related.

- What is the use case and the expected user experience? Although this is closely related to the “expected behavior” mentioned above, in some cases it might be completely different. An example is the case where the expected behavior is described in terms of a part of the user interface, but the actual expectation underlying it is something else entirely (which would cause us to look at the problem from a completely different angle).

Get a Consistent Repro, If Possible

Even if the engineers who reported the issue are the same ones working on it, it is always a good exercise to write down the steps to reproduce the problem in detail (commonly referred to as “repro steps”). Having these is especially important if the issue is being reported by another team or partner.

Repro steps should at a minimum include the following data and characteristics:

- The pre-conditions that existed before the repro was done. This can include the same information that we mentioned above for understanding the problem, or some other relevant data specific to the issue itself (e.g. is there radio connectivity, is the SIM used, what is the meta network used, what are the applications that were running on the device, etc.).

- What are the steps, in order, that were done until the issue occurred? Supporting these with screen captures from the device, and including the user interface (UI) elements’ names as they appear on the device, would be more than helpful.

- What was the result? Supporting this with data related to the result, e.g. screen captures, error message texts/numbers, etc. would be very helpful.

- Is the repro consistent? What is the failure rate?

Although not always possible, one thing that would ease the analysis of the issue is getting the same repro on a device you have access to. This can reveal a lot of new information, especially details that you might not have gotten when the issue was reported.
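
To put a number on the “failure rate” question in the list above, the repro can be repeated a fixed number of times and the outcomes counted. The following is only a minimal sketch: `failure_rate` and `make_fake_attempt` are hypothetical names, and the fake attempt stands in for whatever manual or automated repro procedure is actually used on the device.

```python
from typing import Callable


def failure_rate(attempt: Callable[[], bool], runs: int) -> float:
    """Run the repro `runs` times and return the fraction of runs that failed.

    `attempt` performs one repro and returns True when the issue occurred.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(1 for _ in range(runs) if attempt())
    return failures / runs


def make_fake_attempt(fail_every: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a real repro: fails on every Nth call."""
    calls = {"count": 0}

    def attempt() -> bool:
        calls["count"] += 1
        return calls["count"] % fail_every == 0

    return attempt


if __name__ == "__main__":
    rate = failure_rate(make_fake_attempt(4), runs=20)
    print(f"issue reproduced in {rate:.0%} of runs")
```

Recording the number of runs together with the rate makes it easier to judge later whether a proposed fix really changed anything or the issue simply did not show up in a small sample.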

Identify Direct and Indirect Factors

Indirect and direct factors can include a lot of things. They are the data points that can give you important clues on the environment and pre-conditions on the device when the issue occurred.

Direct factors can simply include the configuration on the device relevant to the issue at hand (e.g. connection manager configuration for a network issue, the active theme for a UI issue, etc.), or they can be the pre-conditions for the repro that might not have been mentioned before (e.g. the initialization APIs used before the failed API, the radio state for a connection issue, available memory for an issue involving processing a large amount of data, etc.).

Indirect factors are similar to the direct ones, but at first glance they don’t seem related to the issue. Collecting this data can prove useful for some issues, especially if there is a chance that a different view of the data could reveal some indirect relationship to the issue (e.g. another application running on the device holding a lock on the same file, another application keeping the radio busy causing your connectivity issue, etc.).
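
One possible way to surface such factors, assuming a device that does not show the issue is available for comparison, is to capture the same set of settings from both devices and list only the differences. This is a sketch rather than a prescribed method: `config_diff` and the setting names below are hypothetical, and how the settings are actually read from a device is left out.

```python
from typing import Dict, Optional, Tuple


def config_diff(
    failing: Dict[str, str], working: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return settings whose values differ, as (failing_value, working_value).

    A setting missing on one side is reported with None for that side.
    """
    diff = {}
    for key in sorted(failing.keys() | working.keys()):
        failing_value = failing.get(key)
        working_value = working.get(key)
        if failing_value != working_value:
            diff[key] = (failing_value, working_value)
    return diff


if __name__ == "__main__":
    # Hypothetical snapshots; real keys depend on what is collected.
    failing_device = {
        "radio_state": "on",
        "active_theme": "custom",
        "free_memory_kb": "2048",
    }
    working_device = {"radio_state": "on", "active_theme": "default"}
    for key, (bad, good) in config_diff(failing_device, working_device).items():
        print(f"{key}: failing={bad!r} working={good!r}")
```

Each difference is only a candidate factor, not a conclusion; it still has to be confirmed by changing it and repeating the repro.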

Do Your Homework

Attacking a problem effectively always requires knowing the problem surface. Some issues require a deep level of knowledge about the problem domain, while for some issues, a general familiarity with the workings is sufficient. If you feel there are things that you can’t explain in the problem, or don’t know enough to collect relevant data or analyze them, take a look at the available documentation, and if possible, find an expert who can quickly demystify things for you.

Collect Sufficient and Relevant Data

To make sure the investigation has the chance to reveal useful results, all possible data that relate to the issue should be collected. Since the data should be relevant to the issue and should be sufficient (not too little or too much), a good understanding of the end-to-end scenario as well as the underlying modules/architecture/inner-workings is almost a must. For example, for a connection issue, it is almost always useful to collect the Connection Manager configuration on the device as well as a capture of the network traffic. Another example is collecting Radio Interface Layer (RIL) logs for an issue that involves radio connectivity.

Collecting these logs will require knowing and having access to the right tools for the job. Some of these tools will be discussed in the second part of this series.

Analyze with Correct Tools

Once all data is collected, the analysis phase also requires using the right tools, which can convert your raw data into useful information and clues about your issue. There are a variety of tools available for Windows Mobile, which will be discussed later.

Use an Unlocked, KITL-Enabled Device, If Possible

Having access to an unlocked device can save you a lot of time, when you need to collect data from the device. Having a KITL-enabled image on the device, on the other hand, is the ideal situation that would allow you to connect Platform Builder to the device and run the repro under a debugger.

Isolate the Problem

The goal of the data collection and analysis is obviously to find the root cause of the issue as well as a resolution. But since in most cases it won’t be possible to reach this goal right away, intermediate goals are required, where the problem area is narrowed down gradually and isolated from all other irrelevant factors one step at a time. At each step, new data might need to be collected. As the problem is isolated more and more, the problem area will get more manageable and have a better chance of leading to the root cause.
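
When the issue did not exist in an earlier version, one common way to narrow the problem area step by step is to bisect the builds between the last known good one and the first known bad one. The sketch below assumes a consistent repro, builds ordered oldest to newest, and that every build after the first bad one is also bad; `first_bad_build`, the build names, and the `reproduces` callback are hypothetical and stand in for flashing a build and running the repro steps.

```python
from typing import Callable, Sequence


def first_bad_build(
    builds: Sequence[str], reproduces: Callable[[str], bool]
) -> str:
    """Return the earliest build on which the issue reproduces.

    Assumes `builds` is ordered oldest to newest and that once the
    issue appears it is present in all later builds.
    """
    if not builds:
        raise ValueError("no builds to search")
    if not reproduces(builds[-1]):
        raise ValueError("issue does not reproduce on the newest build")
    low, high = 0, len(builds) - 1  # invariant: builds[high] is bad
    while low < high:
        mid = (low + high) // 2
        if reproduces(builds[mid]):
            high = mid      # first bad build is at mid or earlier
        else:
            low = mid + 1   # first bad build is after mid
    return builds[low]


if __name__ == "__main__":
    # Hypothetical build labels; pretend the issue was introduced in "b4".
    builds = ["b1", "b2", "b3", "b4", "b5", "b6"]
    bad_builds = {"b4", "b5", "b6"}
    print(first_bad_build(builds, lambda build: build in bad_builds))
```

Each step halves the remaining range, so only a handful of repro runs is needed even for a long list of builds. With an inconsistent repro, each check would need to be repeated enough times to trust a "does not reproduce" result.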

Trace-back, Review and Evaluate Assumptions

While trying to understand the problem, to identify the different factors that are in play and to gradually isolate the problem, the need to trace back to a previous step and redo some steps is sometimes unavoidable. To keep this to a minimum, a good strategy is to review the assumptions made (consciously or not) and to evaluate them, so that unnecessary trace-backs are avoided.

Don’t Eliminate Alternatives Too Early or Too Late; Don’t Jump to Conclusions

Especially for tough problems, where it never seems possible to collect sufficient data or to gain a deep enough understanding of the factors in play, it might be tempting to take shortcuts and eliminate alternatives by jumping to conclusions. This is not completely avoidable, especially in cases where trial-and-error (based on sound and informed assumptions/decisions) is the only way left to go. For most issues, however, considering all available alternatives and having a high degree of confidence in every conclusion will help avoid rework. One principle that always helps here is to separate facts or observations that are based on concrete data collected from possibilities and conclusions that are not yet proven.

Focus on Both Short and Long Term Resolutions

For some issues, troubleshooting might end up identifying bigger underlying problems. For those cases, it is important to evaluate the cost of the proposed resolutions against the project timeline and time-to-market requirements. If the proposed resolutions are costly, the investigation might need to re-focus on identifying cheaper alternatives to employ in the short term, as well as the longer term resolutions.