Remote Kernel Tracker: More things you should recognize

I have a few more examples of patterns you can recognize when you’re looking at CeLog threading data in Remote Kernel Tracker.  Note: these images are wacky heights because I tweaked them to cut off white space that didn’t contain any CeLog data.

Thread Migration

I realized that I meant to add a screen shot of thread migration to my previous Kernel Tracker post, and forgot.  So here is an example of what thread migration looks like in Remote Kernel Tracker.  The raised dashed line indicates that the thread is still running, just inside a different process.

(scroll down if you can't see anything)

Screen Shot 1: Example of what thread migration looks like.
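
In case you’re wondering what kind of code causes a thread to migrate: on Windows CE it happens whenever a thread makes an API call that is serviced by another process, for example a file system call handled by filesys.exe.  The little program below is only an illustration of that idea (it is not the code behind the screen shot), but if you ran something like it under Remote Kernel Tracker you’d expect to see its thread migrate for the duration of the CreateFile call.

// Illustration only: a thread migrates when it calls an API that another
// process implements.  On a CE 5.0-era kernel, file system calls are serviced
// by filesys.exe, so in Remote Kernel Tracker you'd see this thread's raised
// dashed line appear under filesys.exe while the CreateFile call executes.
#include <windows.h>

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance,
                   LPWSTR lpCmdLine, int nCmdShow)
{
    // The thread starts out running in this process...
    HANDLE hFile = CreateFile(TEXT("\\Windows\\ceconfig.h"), GENERIC_READ,
                              FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);

    // ...and migrates into the file system process while the call executes.
    // Even a failed open makes the cross-process call.
    if (hFile != INVALID_HANDLE_VALUE)
    {
        CloseHandle(hFile);
    }
    return 0;
}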

KITL / debug messages

I like this example because it hammers home how much work is involved in heavy KITL traffic.  Here I run a program called enumfiles.exe, which enumerates all of the files in the device file system.  The KITL traffic is due to a few different things: mainly the heavy debug messages, plus the enumeration of files inside the release directory via RELFSD (the release directory file system driver, which mounts the release directory from the desktop PC as \Release on the device).  In this case I also loaded the program over KITL from my release directory, so some of this traffic is the work of copying the code onto the device.

Screen Shot 2: For a program that’s just enumerating the file system, that’s a lot of time on the KITL thread.

As you can see, when the enumfiles.exe program starts, a lot of other activity begins too.  The KITL interrupt (25 on this device) and the KITL IST start working overtime.  I like this example because it shows that your code doesn’t always live in the vacuum you might expect.  The system calls you make, or the pure existence of your program, can fire off other activity in the system.  Does the system really behave as you expect?  Tools like CeLog and Remote Kernel Tracker are the way to find out for sure.

Mike Hall uses an older version of my enumfiles program (walktree.exe) when he gives labs to teach people how to use Remote Kernel Tracker, so if you’ve taken any labs from Mike at DevCons or other events, you may have played with this same program yourself!
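
If you’d like to reproduce this kind of experiment yourself, a bare-bones enumeration program only takes a screenful of code.  The sketch below is not the real enumfiles.exe or walktree.exe, just a rough stand-in: it walks the device file system and prints each path as a debug message, which is exactly the kind of output that turns into KITL traffic when a debugger is attached.

// Rough stand-in for a file-enumeration tool (not the real enumfiles.exe).
// Walks the device file system recursively and prints each path as a debug
// message.  Error and buffer-length checking omitted to keep it short.
#include <windows.h>

void EnumDirectory(LPCTSTR pszDir)
{
    TCHAR szSearch[MAX_PATH];
    wsprintf(szSearch, TEXT("%s\\*"), pszDir);

    WIN32_FIND_DATA fd;
    HANDLE hFind = FindFirstFile(szSearch, &fd);
    if (hFind == INVALID_HANDLE_VALUE)
        return;

    do
    {
        TCHAR szPath[MAX_PATH];
        wsprintf(szPath, TEXT("%s\\%s"), pszDir, fd.cFileName);

        // Each of these messages goes out over KITL on a connected device.
        RETAILMSG(1, (TEXT("%s\r\n"), szPath));

        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
        {
            EnumDirectory(szPath);      // recurse into subdirectories
        }
    } while (FindNextFile(hFind, &fd));

    FindClose(hFind);
}

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance,
                   LPWSTR lpCmdLine, int nCmdShow)
{
    EnumDirectory(TEXT(""));            // "" + "\" = "\", the root directory
    return 0;
}

Run something like that from \Release on a KITL-connected device and you should see the same kind of pattern as in Screen Shot 2.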

Flushing

It’s good to know what activity represents side effects of the data gathering itself.  Whether you have Remote Kernel Tracker actively connected to a device and collecting data, or you’re running CeLogFlush.exe to save data on a standalone device, the thread that manages the data has a small impact on the system itself.  First, Screen Shot 3 shows CeLogFlush.exe flushing periodically.  In this case I specified that CeLogFlush.exe should flush at 3-second intervals, so that more than one flush would show in the same window.

Screen Shot 3: CeLogFlush data flushing at 3-second intervals when data’s not being logged heavily.

As you can see above, when there’s not a lot of activity, flushing can be a pretty fast operation.  The vertical lines showing a lot of thread switching aren’t very wide; the flush is over quickly.

Incidentally, you can see that the switching is between the CeLogFlush.exe thread and the KITL thread.  OK, now tell me: where was I storing the log file?  That’s right, in \Release, which requires KITL to transport the file data to the release directory on the PC.  If I had written to local storage on the device instead, there would be less thread switching, because the flush wouldn’t involve the KITL thread at all.

In contrast, Screen Shot 4 shows how much of a load flushing can be on a busy system.  There are a lot more events being logged, so there is more data to flush and the flushes take longer.  The RAM buffer that temporarily holds the data fills in less than 3 seconds, so the flushes are also more frequent.

Screen Shot 4: CeLogFlush flushing during heavy logging activity.

For completeness I’m including Screen Shot 5 which shows what flushes look like when you’re using Remote Kernel Tracker to transport the data to your PC, instead of using CeLogFlush.exe.  You can see that RTH.EXE does the flushing, and that it flushes over KITL about once per second.

Screen Shot 5: Remote Kernel Tracker data flushing.

Data Loss

I wrote up a couple of sections about data loss in our newly released CeLog documentation:

· Analyzing Data Loss with Remote Kernel Tracker

· Analyzing Data Loss with Readlog

But I couldn’t include screen shots there.  So here’s one to show you how to recognize data loss when you’re looking at it in Remote Kernel Tracker.

Screen Shot 6: The signs of data loss.

The three characteristics I describe in the Kernel Tracker article are visible in Screen Shot 6.  First, Remote Kernel Tracker pops up a dialog box telling you how many bytes of data have been lost over time in the log.  Second, the interrupt activity appears to stop completely, and one thread appears to be running for a long period.  In this screen shot I didn’t expand explorer.exe to show that thread, but it was a single thread and it had no icons showing events occurring during its run time.  Finally, when activity picks up again, the flush thread (RTH.EXE) is busy working.  All of these characteristics are strong hints that the apparently “inactive” period is really a period from which data was lost.

More Tips

While I was poking around building up screen shots for this write-up, I caught one view that surprised me a bit.  Look how much priority inversion is going on here!  This is a database test program spewing lots of debug output.  KITL isn’t something that actually requires “real” real-time, but perhaps I should get someone to examine why we hit priority inversion so regularly.

Screen Shot 7: Debug output leading to lots of priority inversion.

Let it be a lesson that while you’re testing real-time response in your system, you should disable KITL.  And that while you’re running performance tests, for heaven’s sake don’t write a lot of debug messages.
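
If you can’t remove the debug output entirely, at least make it cheap to turn off.  Here’s a rough sketch of the usual debug-zone pattern (the zone names and mask are made up for this example): the chatty messages only fire when their zone is enabled, and DEBUGMSG compiles away completely in retail builds.  Registering the zones with DEBUGREGISTER also lets you flip the mask from the debugger at run time, without rebuilding.

// Sketch of zone-gated debug output (made-up zone names).  With the verbose
// zone off by default, the per-record messages generate no debug output and
// therefore no KITL traffic during performance runs.
#include <windows.h>

#define ZONE_VERBOSE    DEBUGZONE(0)    // bit 0 of the zone mask
#define ZONE_ERROR      DEBUGZONE(1)    // bit 1 of the zone mask

DBGPARAM dpCurSettings =
{
    TEXT("testapp"),
    {
        TEXT("Verbose"), TEXT("Error")
    },
    0x0002                              // only the error zone on by default
};

void ProcessRecord(int iRecord)
{
    // Silent unless someone turns on the verbose zone.
    DEBUGMSG(ZONE_VERBOSE, (TEXT("Processing record %d\r\n"), iRecord));

    // ... real work here ...

    // Errors would still be reported with the default mask, for example:
    // DEBUGMSG(ZONE_ERROR, (TEXT("Record %d failed\r\n"), iRecord));
}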

Another little exercise.  On the left and right of Screen Shot 7 are some periods where activity drops dramatically.  Do those periods represent data loss?

The answer is no, because of two clues.  First, SYSINTR 0 is still going strong.  Keep an eye on that whenever you’re running a profiling build.  (I’d say all the time, but only profiling builds record interrupts.)  SYSINTR 0 occurs frequently enough that you should always be able to find it.  Second, there are lots of events occurring on the testdb.exe thread during those periods.  There are so many icons drawn on that thread that they don’t show up clearly in this screen shot.  If it were data loss, there’d be no such activity.  Since this particular test program was writing a lot of debug output, the less busy periods probably represent stretches when the test was doing solid work instead of printing debug messages.

When you’re not running a profiling build, look for events on the thread that was supposedly executing during the quiet period, to see whether that really is a period of data loss.