Event Tracing for Windows (ETW) -- Part 2

Article
06/09/2005

So, there were two major groups of comments on the last post, and I'll try to address them.

The first was a question about managed support
for ETW. I talked to the ETW team, and the current state is that
there is no official managed interface for ETW. Being a standard
Win32 API, it is posisble to PInvoke
the functions involved, and several internal teams have written their
own managed wrappers around ETW. This isn't expected to change up
through Whidbey (Visual Studio 2005); for Orcas (the version of VS
after that!) an official managed interface is on the table. The
second is one of backwards compatibility
-- ETW is only available in Win2K and later OSes. Users will
expect that software works similarly on all OSes; thus, if you want to
support 9x-era OSes, you have to write your own logging code
anyways. So, if you are already putting in old-style logging, why
use ETW? I'll try to answer that with this entry.

One of the gains of ETW is that it's fast; you can
spit out thousands of events per second while using relatively little
CPU, far faster than you can fprintf() a string to disk. The
biggest gain, though, is combining multiple sources -- including
information outside your own process. And the most notable
external source is the kernel.

The XP and Codename Longhorn kernels are extremely extensive providers, and can be enabled to log any or all of these to a log, and we publish decoding information for:

Hardware configuration events -- notes on the system's CPUs, hard drives, NICs, video card, and ACPI power states
Disk-level I/O -- every I/O on the system, including IRP flags, operation time in microseconds, number of bytes, and target disk
File-level I/O -- every access to every file on the system (including information to tie it to the disk I/Os above)
Image layouts -- filenames, locations in memory, and PIDs for every image in the system
Page faults -- pointers
to instructions and pages whenever a fault occurs (including COWs,
demand-zero faults, hard page faults, transition faults, and guard
pages)
Network I/O -- all TCP and UDP actions, including connects/accepts, transmits (and retransmits!), recieves, etc.
Registry I/O -- all Registry key/value creation/deletions/changes, registry flushes, etc.
Process and thread info -- all creations/deletions of processes and threads

And Codename Longhorn adds even more events -- most notably,
the ability to trace extremely fine-grained high-frequency events such
as individual context switches, interrupts (ISRs and DPCs), etc.

(FYI, since all of the above information is
exceedingly detailed, you can only enable the kernel provider if you
have Administrator privileges, are part of the Performance Log Users
group, or a service running as LocalSystem, LocalService, or
NetworkService.)

Thus, ETW can be used as an effective debugging
tool. By allowing ETW to pull from, sort, and combine events from
multiple providers, you can get a powerful log of everything the system
was doing, probably the most accurate log available (save for running
the entire OS in a debugger). It's an incredible tool for
noticing "hey, things act strangely when X, Y, and Z, but not W, are
happening" at a system level, as well as a code level, and it takes far
less time than getting symbols and attaching a debugger/profiler to the
system. And it's all available to devs -- and even to users,
given a generic ETW tool such as tracelog in Server 2003!

Even if you don't specifically use the kernel as an
information source, ETW's ability to combine providers is useful for
mixing and matching information from multiple DLLs, EXEs, etc. in a
system. ETW events are timestamped by the kernel to extremely
high resolution (RDTSC on stable machines, converted to microsecond
intervals; MM timers on others) and are automatically sorted at process
time, so you don't have to write or parse plaintext date/time formats.

The goal I'm driving at is that when you have more
than one DLL or EXE providing information, individually implementing
logging for each component means that you usually need a third app to
read in the logs from each component and combine them into a single log
with coherent event ordering, and this can be difficult -- especially
if you have to tie it to some event. ETW allows you to automate
all that, and it's exceedingly efficient at it as well. Even if
you are only personally maintaining one component, ETW can log very
quickly, and it can be shipped in retail builds -- and if you publish
the structures of some or all of the events you provide, you can give
valuable information to your consumers and to future devs without ever
needing to work with them.

Next entry, I'll start discussing how providers are
written, starting with thread structures and common ways of publishing
event structs.

Event Tracing for Windows (ETW) -- Part 2

Additional resources