Error handling part 6: ETW logging example


<< Part5

I want to show an example of how to do the error logging through ETW, for the error objects described in the Part 5. It goes against some of the commonly accepted ETW principles but I think that it would out better this way. I think that these commonly accepted ETW principles are largely wrong (and indeed the newer manifest-less logging goes largely against these principles too, although in a different way). Basically, ETW was born as a part of Xperf, for logging the performance profiling data. It’s great for this purpose. Then it was repurposed for the general logging, and I think that it really sucks at that. ETW includes a lot of complexity that is useless for logging but causes a great amount of pain.

Recapping the part 5, the error reports there consist of a chain of messages, each element of the chain containing a code and a text message. Creating an ETW message type for each possible error code would have been a terrible pain. So in my approach one message type is created per the severity level, otherwise with the same format. ETW supports up to 255 severity levels but as it turns out, there are other limitations: the rule book says that the Admin and Operational logs must only contain the severities Error, Warning and Info, and the rest are for the debugging logs. So for the normal service logging this pretty much says that there are only 3 message types to define.

Each message contains the “main” error code, the text that contains the combined text of the whole chain of messages, and a variable-sized array with the whole list of nested error codes starting with the main one, in case if anyone is interested in them. The manifest fragments with these definitions look like this:

  <instrumentation>
    <events
        xmlns="http://schemas.microsoft.com/win/2004/08/events"
        xmlns:win="http://manifests.microsoft.com/win/2004/08/windows/events"
        >
      <provider
          <!-- make your own GUID with uuidgen -s -->
          guid="{12345678-9012-3456-7890-123456789012}"
          message="$(string.Provider.Name)"
          messageFileName="%ProgramFiles%\my\myprogram.exe"
          name="My-Program"
          resourceFileName="%ProgramFiles%\my\myprogram.exe"
          symbol="MyEtwGuid"
          >
        <channels>
          <channel
              chid="admin"
              enabled="true"
              name="My-Program/Admin"
              type="Admin"
              />
        </channels>
        <templates>
          <template tid="GeneralMessage2">
            <data
                inType="win:UnicodeString"
                name="Text"
                />
            <!-- the top-level error code -->
            <data
                inType="win:UInt32"
                name="Code"
                />
            <!-- the error codes of the nested messages, including top-level -->
            <data
                inType="win:UInt32"
                name="NestedCodeCount"
                />
            <data
                count="NestedCodeCount"
                inType="win:UInt32"
                name="NestedCode"
                />
          </template>
        </templates>
        <events>
          <event
              channel="admin"
              level="win:Error"
              message="$(string.Text.Contents)"
              symbol="ETWMSG_LOG_INST_ERROR2"
              template="GeneralMessage2"
              value="0x0001"
              />
          <event
              channel="admin"
              level="win:Warning"
              message="$(string.Text.Contents)"
              symbol="ETWMSG_LOG_INST_WARNING2"
              template="GeneralMessage2"
              value="0x0002"
              />
          <event
              channel="admin"
              level="win:Informational"
              message="$(string.Text.Contents)"
              symbol="ETWMSG_LOG_INST_INFO2"
              template="GeneralMessage2"
              value="0x0003"
              />
        </events>
      </provider>
    </events>
  </instrumentation>
  <localization>
    <resources culture="en-US">
      <stringTable>
        <string
            id="Provider.Name"
            value="My Program"
            />
        <string
            id="Text.Contents"
            value="%1"
            />
      </stringTable>
    </resources>
  </localization>

Now on to the logger object. I’ve implemented the logging as a base Logger class defining the interface and the subclasses that provide the specific implementations for the various destinations: memory buffer (very convenient for testing), stdout, files, and ETW. This is very convenient, you can redirect your logging anywhere by just changing the logger object.

The logger also has the concept of “log entity”: a name that identifies where the log message is coming from. It might be the name of a thread, or if multiple processes write logs to the same channel then it might include the name/id of the process, or any extra information. Very convenient for analyzing the logs.  If you don’t care about the entity, you can always use NULL in its place.

Typically a message gets logged like this:

logger_->log(err, Logger::SV_ERROR, logEntity_);

or with the error object creation right in it:

logger->log(ErrorSource.mkString(1, L"Assigned host:\n%ls",
    strHexDumpBytes(ptr, 18, 0, L"  ").c_str()),
    Logger::SV_INFO, NULL);
logger_->log(
    ErrorSource.mkMui(EPEM_WMI_OPEN_STATUSLOG, rawcfg->statusLog_[0].c_str()),
    Logger::SV_INFO, logEntity_);

The severity level of a message is specified when it’s passed to the logger, since the error objects don’t have one, and the severity is really determined at the top level.

The logger has the ability to throw away the messages below a certain minimal severity, to keep the log cleaner from the excessive detail. The generation of the message might be pretty expensive, so the program might want to skip it in case if the message will be thrown away anyway. One way to do it is to check it explicitly:

if (logger_->allowsSeverity(Logger::SV_DEBUG))
  { ...create and log the message... }

The other way is to use a macro that does the same internally:

LOG_SHORTCUT(logger_, Logger::SV_INFO, logEntity_,
    ErrorSource.mkMui(EPEM_WMI_OPEN_STATUSLOG, rawcfg->statusLog_[0].c_str())
    );

In retrospect, I should probably have used the same order of parameters for log() and LOG_SHORTCUT(), so feel free to change it to be the same.

There is also a convenient method for the small tools that treats the current error as globally fatal: i.e. it checks if the error object reference is not NULL, and if so, it logs the errors and exits with the code 1:

 logger->logAndExitOnError(ErrorSource.mkString(1,
        L"Encountered an incomplete buffer."
        ), logEntity);

Before we go into the ETW specifics, here is what the base class looks like:

/*++

Copyright (c) 2016 Microsoft Corporation

--*/


class LogEntity
{
    // Normally stored in a shared_ptr.
    // The address of this object is used as a token used to group
    // the messages from this entity inside the loggers.
public:
    LogEntity(const std::wstring &name):
        name_(name)
    { }

    std::wstring name_; // name of the entity
};

class Logger
{
public:
    enum Severity {
        SV_DEBUG,
        SV_VERBOSE, // the detailed information
        SV_INFO,
        SV_WARNING,
        SV_ERROR,
        SV_NEVER, // pseudo-severity, used to indicate that the logger logs nothing
        SV_DEFAULT_MIN = SV_INFO // the default lowest severity to be logged
    };

    // Specify the minimum severity to not throw away.
    Logger(Severity minSeverity):
        minSeverity_(minSeverity)
    { }

    // A Logger would normally be reference-counted by shared_ptr.
    virtual ~Logger();

    // Log a message. Works even on a NULL pointer to a logger
    // (i.e. if the logger is not available, throws away the messages).
    //
    // err - the error object
    // sev - the severity (the logger might decide to throw away the
    //      messages below some level)
    // entity - description of the entity that reported the error;
    //      may be NULL
    void log(
        __in Erref err,
        __in Severity sev,
        __in_opt std::shared_ptr<LogEntity> entity
        )
    {
        if (this != NULL && err) // ignore the no-errors
            logBody(err, sev, entity);
    }

    // May be called periodically in case if the logger
    // needs to do some periodic processing, such as flushing
    // the buffers. The EtwLogger uses this method to send
    // the collected backlog after the provider gets enabled
    // even if there are no more log messages written.
    // The default implementation does nothing.
    virtual void poll();

    // A special-case hack for the small tools:
    // If this error reference is not empty, log it and exit(1).
    // As another special case, if this logger object is NULL,
    // the error gets printed directly on stdout.
    // The severity here is always SV_ERROR.
    void logAndExitOnError(
        __in Erref err,
        __in_opt std::shared_ptr<LogEntity> entity
        );

    // The internal implementation of log()
    // This function must be internally synchronized, since it will
    // be called from multiple threads. Implemented in subclasses.
    virtual void logBody(
        __in Erref err,
        __in Severity sev,
        __in_opt std::shared_ptr<LogEntity> entity
        ) = 0;

    // The smarter check that works even on a NULL pointer.
    Severity getMinSeverity() const
    {
        if (this == NULL)
            return SV_NEVER;
        else
            return minSeverity_;
    }

    // Check that the logger will accept messages of a given severity.
    // Can be used to avoid printing message sthat will be thrown away.
    bool allowsSeverity(Severity sv) const
    {
        if (this == NULL)
            return false;
        else
            return (sv >= minSeverity_);
    }

    virtual void setMinSeverity(Severity sv);

    // Translate the severity into an one-letter indication.
    // Returns '?' for an invalid value.
    static WCHAR oneLetterSeverity(Severity sv);

    // Translate the severity into a full name.
    // Returns NULL for an invalid value.
    static const WCHAR *strSeverity(Severity sv);

    // Translate the human-readable name of severity to
    // enum value.
    // Returns SV_NEVER if it cannot find a name.
    static Severity severityFromName(const std::wstring &svname);

    // Returns the list of all supported severity levels.
    static std::wstring listAllSeverities();

public:
    volatile Severity minSeverity_; // the lowest severity that passes through the logger;
        // normally set once and then read-only, the callers may use it to
        // optimize and skip the messages that will be thrown away
};

// A shortcut if the message is intended only for logging:
// skips the message creation if the logger won't record it anyway.
#define LOG_SHORTCUT(logger, severity, entity, err) do { \
    if (logger->allowsSeverity(severity)) { \
        logger->log(err, severity, entity); \
    } } while(0)

Logger::~Logger()
{ }

void Logger::poll()
{ }

void Logger::logAndExitOnError(
    __in Erref err,
    __in_opt std::shared_ptr<LogEntity> entity
    )
{
    if (!err)
        return;
    if (this != NULL) {
        log(err, Severity::SV_ERROR, entity);
    } else {
        wprintf(L"%ls", err->toString().c_str());
    }
    exit(1);
}

void Logger::setMinSeverity(Severity sv)
{
    minSeverity_ = sv;
}

WCHAR Logger::oneLetterSeverity(Severity sv)
{
    switch (sv) {
    case SV_DEBUG:
        return L'D';
    case SV_VERBOSE:
        return L'V';
    case SV_INFO:
        return L'I';
    case SV_WARNING:
        return L'W';
    case SV_ERROR:
        return L'E';
    default:
        return L'?';
    };
}

const WCHAR *Logger::strSeverity(Severity sv)
{
    switch (sv) {
    case SV_DEBUG:
        return L"DEBUG";
    case SV_VERBOSE:
        return L"VERBOSE";
    case SV_INFO:
        return L"INFO";
    case SV_WARNING:
        return L"WARNING";
    case SV_ERROR:
        return L"ERROR";
    default:
        return NULL;
    };
}

Logger::Severity Logger::severityFromName(const std::wstring &svname)
{
    if (svname.empty())
        return SV_NEVER;
    // just look at the first letter
    switch(towlower(svname[0])) {
    case 'd':
        return SV_DEBUG;
    case 'v':
        return SV_VERBOSE;
    case 'i':
        return SV_INFO;
    case 'w':
        return SV_WARNING;
    case 'e':
        return SV_ERROR;
    };
    return SV_NEVER;
}

std::wstring Logger::listAllSeverities()
{
    std::wstring result;

    for (int sv = 0; sv < (int)SV_NEVER; ++sv) {
        strListSep(result);
        result.append(strSeverity((Severity)sv));
    }

    return result;
}

Now to the ETW part. From the API standpoint, the new part is only the constructor, the rest of it are the common virtual methods:

// the GUID symbol gets generated by the manifest compiler
shared_ptr<EtwLogger> etwlog = make_shared<EtwLogger>(MyGuid);

But as far as the implementation is concerned, there are more tricks.

The first trick is that the ETW logging has its own concept of what minimal severity is accepted, and communicates this information back to the program. The program can then use this to void creating the messages than nobody reads. This is actually a pretty neat concept for the debug logs: the debugging information can create a lot of overhead, so if nobody is listening to it, skip it, but if someone is interested then generate it. I’ve never got much hang of it but I think it can be selected like this:

wevtutil sl Provider/Debug /level:200

It doesn’t seem to work on the Admin logs which seem to always get all the levels enabled but the code that implements even the admin logs has to still honor it.

The second trick is that when you open the log handle, you can’t start writing the events to it right away. If you do, they will be thrown away. Instead the ETW subsystem will go initialize things, and when it’s done it will call a callback function that among the other tings will tell you what severity levels are accepted. Very convenient for the ETW subsystem, very inconvenient for all the programs that write logs.

My implementation handles this by keeping an in-memory buffer of the early messages, and then flushing them to ETW after it gets a call-back. The buffer size is limited, so if more messages get written in this time interval, the extra messages will be lost. A normal program shouldn’t be writing the messages that fast but if you write the debugging messages, your luck may vary. You might want to add a bit of a timeout after opening the logger. An even better approach might be to have an event that gets signaled in the ETW callback, feel free to add it.

Oh, and another trick is that as the error codes get written into the ETW messages, their upper 2 bits get cleared. These bits denote the severity of a message but I’ve found that the severity doesn’t make much sense except at the very top level anyway, so it’s better not to add to the confusion.

The rest is fairly straightforward:

/*++
Copyright (c) 2016 Microsoft Corporation
--*/

class EtwLogger: public Logger
{
public:
    // The approximate limit on the message string length
    // in one ETW message, in characters. The message chains longer
    // than that will be split up. The individual messages in the
    // chain will not be split, so if a real long message happens,
    // it will hit the ETW limit and be thrown away.
    //
    // Since the strings are in Unicode, that the character number
    // must be multiplied by 2 to get the byte size.
    // Keep in mind that the ETW message size limit is 64KB, including
    // all the headers.
    enum { STRING_LIMIT = 5 * 1000 };
    // The maximum expected number of fields in the message,
    // which drives the limit on the number of the error codes per message
    // (i.e. the nesting depth of the Erref).
    enum { MSG_FIELD_LIMIT = 100 };

    // guid - GUID of the ETW provider (must match the .man file)
    // minSeverity - the minimum severity to not throw away.
    //
    // The errors are kept, and can be extracted with error().
    // A EtwLogger with errors cannot be used.
    EtwLogger(
        LPCGUID guid,
        _In_ Severity minSeverity = SV_DEFAULT_MIN
    );

    // Closes the file.
    virtual ~EtwLogger();

    // Close the logger at any time (no logging is possible after that).
    // The errors get recorded and can be extracted with error().
    void close();

    // from Logger
    virtual void logBody(
        __in Erref err,
        __in Severity sev,
        __in_opt std::shared_ptr<LogEntity> entity
        );
    virtual void poll();

    // Get the logger's fatal error. Obviously, it would have to be reported
    // in some other way.
    Erref error()
    {
        return err_;
    }

protected:
    // Callback from ETW to enable and disable logging.
    static void NTAPI callback(
        _In_ LPCGUID sourceId,
        _In_ ULONG isEnabled,
        _In_ UCHAR level,
        _In_ ULONGLONG matchAnyKeyword,
        _In_ ULONGLONG matchAllKeywords,
        _In_opt_ PEVENT_FILTER_DESCRIPTOR filterData,
        _In_opt_ PVOID context
    );

    // The actual logging. The caller must check that
    // the logging is already enabled and that the severity
    // level is allowed.
    // The caller must also hold cr_.
    void logBodyInternalL(
        __in Erref err,
        __in Severity sev,
        __in_opt std::shared_ptr<LogEntity> entity
        );

    // Check if anything is in the backlog, and forward it.
    // The caller must also hold cr_.
    void processBacklogL();

    enum {
        // Up to how many entries to keep on the backlog.
        BACKLOG_LIMIT = 4096,
    };
    struct BacklogEntry {
    public:
        Erref err_;
        Severity sev_;
        std::shared_ptr<LogEntity> entity_;

    public:
        BacklogEntry(
            __in Erref err,
            __in Severity sev,
            __in_opt std::shared_ptr<LogEntity> entity
        ):
            err_(err),
            sev_(sev),
            entity_(entity)
        { }
    };

protected:
    Critical cr_; // synchronizes the object
    std::wstring guidName_; // for error reports, GUID in string format
    REGHANDLE h_; // handle for logging
    Erref err_; // the recorded fatal error
    Severity origMinSeverity_; // the minimal severity as was set on creation
    std::deque<BacklogEntry> backlog_; // backlog of messages to send when the provider becomes enabled
    bool enabled_; // whether anyone is listening in ETW
};


EtwLogger::EtwLogger(
    LPCGUID guid,
    _In_ Severity minSeverity
):
    Logger(minSeverity),
    guidName_(strFromGuid(*guid)), h_(NULL),
    origMinSeverity_(minSeverity), enabled_(false) 
{
    NTSTATUS status = EventRegister(guid, &callback, this, &h_);
    if (status != STATUS_SUCCESS) {
        err_ = LogErrorSource.mkMuiSystem(status, EPEM_LOG_EVENT_REGISTER_FAIL,
            guidName_.c_str());
        return;
    }
}

EtwLogger::~EtwLogger()
{
    close();
}

void EtwLogger::close()
{
    ScopeCritical sc(cr_);

    if (h_ == NULL)
        return;

    NTSTATUS status = EventUnregister(h_);
    if (status != STATUS_SUCCESS) {
        Erref newerr = LogErrorSource.mkMuiSystem(GetLastError(), EPEM_LOG_EVENT_UNREGISTER_FAIL, guidName_.c_str());
        err_.append(newerr);
    }
    h_ = NULL;
    enabled_ = false;
}

void EtwLogger::logBody(
    __in Erref err,
    __in Severity sev,
    __in_opt std::shared_ptr<LogEntity> entity
    )
{
    ScopeCritical sc(cr_);

    if (sev < minSeverity_ || h_ == NULL)
        return;

    if (enabled_)
    {
        // The backlog cannot be written from the callback when
        // the logger gets enabled, so write on the next message.
        processBacklogL();
        logBodyInternalL(err, sev, entity);
    } else {
        backlog_.push_back(BacklogEntry(err, sev, entity));
        while (backlog_.size() > BACKLOG_LIMIT)
            backlog_.pop_front();
    }
}

void EtwLogger::poll()
{
    ScopeCritical sc(cr_);

    if (h_ == NULL || !enabled_)
        return;

    processBacklogL();
}

void EtwLogger::processBacklogL()
{
    while (!backlog_.empty()) {
        if (h_ != NULL) { 
            BacklogEntry &entry = backlog_.front();
            logBodyInternalL(entry.err_, entry.sev_, entry.entity_);
        }
        backlog_.pop_front();
    }
}

void EtwLogger::logBodyInternalL(
    __in Erref err,
    __in Severity sev,
    __in_opt std::shared_ptr<LogEntity> entity
    )
{
    const wchar_t *entname = L"[general]";
    if (entity && !entity->name_.empty())
        entname = entity->name_.c_str();

    PCEVENT_DESCRIPTOR event;

    switch(sev) {
    case SV_ERROR:
        event = &ETWMSG_LOG_INST_ERROR2;
        break;
    case SV_WARNING:
        event = &ETWMSG_LOG_INST_WARNING2;
        break;
    default:
        // The .man validation doesn't allow to use any other levels for the
        // Admin messages, so just sweep everything else into INFO.
        // Theoretically, VERBOSE and DEBUG can be placed into a separate
        // channel but doing it well will require more thinking.
        event = &ETWMSG_LOG_INST_INFO2;
        break;
    }

    wstring text;
    Erref cur, next;
    for (cur = err; cur; cur = next) {
        text.clear();
        text.push_back(oneLetterSeverity(sev));
        text.push_back(L' ');
        if (next) {
            text.append(L"(continued)\n  ");
        }
        text.append(cur->toLimitedString(STRING_LIMIT, next));
        
        uint32_t intval[MSG_FIELD_LIMIT];
        EVENT_DATA_DESCRIPTOR ddesc[MSG_FIELD_LIMIT];
        int fcount = 0;

        EventDataDescCreate(ddesc + fcount, text.c_str(), (ULONG)(sizeof(WCHAR) * (text.size() + 1)) );
        ++fcount;

        intval[fcount] = (err.getCode() & 0x3FFFFFFF);
        EventDataDescCreate(ddesc + fcount, intval + fcount, (ULONG)(sizeof(uint32_t)) );
        ++fcount;

        // build the message array
        uint32_t *szptr = intval + fcount;
        *szptr = 0; // the value will be updated as the array gets built
        EventDataDescCreate(ddesc + fcount, intval + fcount, (ULONG)(sizeof(uint32_t)) );
        ++fcount;

        // can the whole array be placed in a single ddesc instead?
        for (Erref eit = cur; eit != next && fcount < MSG_FIELD_LIMIT; eit = eit->chain_) {
            intval[fcount] = (eit.getCode() & 0x3FFFFFFF);
            EventDataDescCreate(ddesc + fcount, intval + fcount, (ULONG)(sizeof(uint32_t)) );
            ++fcount;
            ++*szptr;
        }

        NTSTATUS status = EventWrite(h_, event, fcount, ddesc);

        switch (status) {
        case STATUS_SUCCESS:
            break;
        case ERROR_ARITHMETIC_OVERFLOW:
        case ERROR_MORE_DATA:
        case ERROR_NOT_ENOUGH_MEMORY:
        case STATUS_LOG_FILE_FULL:
            // TODO: some better reporting of these non-fatal errors
            break;
        default:
            Erref newerr = LogErrorSource.mkMuiSystem(status, EPEM_LOG_EVENT_WRITE_FAIL, guidName_.c_str());
            err_.append(newerr);
            close(); // and give up
            return;
        }
    }
}

void NTAPI EtwLogger::callback(
    _In_ LPCGUID sourceId,
    _In_ ULONG isEnabled,
    _In_ UCHAR level,
    _In_ ULONGLONG matchAnyKeyword,
    _In_ ULONGLONG matchAllKeywords,
    _In_opt_ PEVENT_FILTER_DESCRIPTOR filterData,
    _In_opt_ PVOID context
)
{
    EtwLogger *logger = (EtwLogger *)context;
    if (logger == NULL)
        return;

    ScopeCritical sc(logger->cr_);

    switch (isEnabled) {
    case EVENT_CONTROL_CODE_DISABLE_PROVIDER:
        logger->enabled_ = false;
        logger->minSeverity_ = SV_NEVER;
        break;
    case EVENT_CONTROL_CODE_ENABLE_PROVIDER:
        logger->enabled_ = true;
        switch(level) {
        case TRACE_LEVEL_CRITICAL:
        case TRACE_LEVEL_ERROR:
            logger->minSeverity_ = SV_ERROR;
            break;
        case TRACE_LEVEL_WARNING:
            logger->minSeverity_ = SV_WARNING;
            break;
        case TRACE_LEVEL_INFORMATION:
            logger->minSeverity_ = SV_INFO;
            break;
        case TRACE_LEVEL_VERBOSE:
            logger->minSeverity_ = SV_VERBOSE;
            break;
        default: // the level value may be any, up to 255
            logger->minSeverity_ = SV_DEBUG;
            break;
        }
        if ((int)logger->origMinSeverity_ > (int)logger->minSeverity_)
            logger->minSeverity_ = logger->origMinSeverity_;
        break;
    default:
        // do nothing
        break;
    }
}

And a couple more helper classes that get used here:

class Critical
{
public:
    Critical()
    {
        InitializeCriticalSection(&cs_);
    }
    ~Critical()
    {
        DeleteCriticalSection(&cs_);
    }

    void enter()
    {
        EnterCriticalSection(&cs_);
    }

    void leave()
    {
        LeaveCriticalSection(&cs_);
    }

public:
    CRITICAL_SECTION cs_;
};

// a scoped enter-leave
class ScopeCritical
{
public:
    ScopeCritical(Critical &cr):
        cr_(cr)
    {
        cr_.enter();
    }

    ~ScopeCritical()
    {
        cr_.leave();
    }

protected:
    Critical &cr_;

private:
    ScopeCritical();
    ScopeCritical(const ScopeCritical &);
    void operator=(const ScopeCritical &);
};

 

<< Part5

 

Comments (0)

Skip to main content