The Trouble With Logoff Events

A lot of you guys are probably using your SEM/SIEM systems to record logon and logoff activity without much of a second thought.

I just thought I'd bring one problem to your attention.

Logoff events are not strictly reliable.

In an engineering sense they are deterministic.  However, like many audit events, if you don't really think about how they work you might make assumptions that aren't correct.  And you know what they say about assumptions...

Anyway, here's the problem.  There is nothing that requires a client to notify you when the client decides to stop using your services.  So for network logon sessions, the "logoff" event often just means "I got tired of waiting with reserved resources allocated for the client, so I reclaimed the resources.  I'll give the client more resources if they come back (and can authenticate)."  In other words: timeout.
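To make the timeout case concrete, here's a minimal sketch of a server-side session table.  It is not how Windows or any particular server does it; the class name, the method names and the injectable clock are all made up for illustration.

```python
import time


class SessionTable:
    """Toy table of network logon sessions with an idle timeout (hypothetical)."""

    def __init__(self, idle_timeout, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock        # injectable so the behaviour can be tested
        self.last_seen = {}       # session id -> time of last client activity
        self.audit_log = []       # (event, session id, reason) tuples

    def logon(self, session_id):
        self.last_seen[session_id] = self.clock()
        self.audit_log.append(("logon", session_id, "authenticated"))

    def touch(self, session_id):
        # Any client request refreshes the idle timer.
        self.last_seen[session_id] = self.clock()

    def logoff(self, session_id):
        # The polite case: the client told us it was done.
        if self.last_seen.pop(session_id, None) is not None:
            self.audit_log.append(("logoff", session_id, "client request"))

    def reap(self):
        # The other case: nobody said goodbye, so reclaim idle sessions.
        now = self.clock()
        expired = [sid for sid, seen in self.last_seen.items()
                   if now - seen >= self.idle_timeout]
        for session_id in expired:
            del self.last_seen[session_id]
            self.audit_log.append(("logoff", session_id, "idle timeout"))
        return expired
```

Note that in the timeout case the "logoff" lands in the audit trail when the server runs reap(), which can be long after the client actually went away.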

You cannot, with protocols, force a client to notify you that they're done.  Nice clients will if they can.  Sometimes they physically can't notify you (this could be what we used to call the "backhoe and a beer" problem in my years in product support, which is completely beyond the client's control).  Sometimes they just choose not to notify you.  I myself have designed software which just tore down the network connection when done or at the first sign of trouble, and started over again from scratch, rather than go through sophisticated "goodbye" or "fault" semantics.  A robust server can handle the situation and will notice fairly quickly, reclaiming any reserved resources and generating any necessary audit trail.  However, I have seen software with such expensive connection set-up that it hangs on to connections for dear life long after everyone else would have turned out the lights and gone home.  The funny thing is that half the time it doesn't realize that its client crashed, lost its state (which the server was depending on for reconnection), rebooted, and has reconnected with a new session.  But I digress.

For interactive logon sessions, there is no guarantee of a logoff event either.  There is no law of physics that forces a logoff audit if I pull the power cord out while I'm logged on.

Plus, I've talked about token leaks before, haven't I?  Maybe not?

Windows logon events technically mean that we have created a data structure called a logon session.  Associated with a logon session are one or more data structures called tokens.  Each token has a number associated with it called a reference count, which is just a count of how many processes are using it at any given time.  The reference count starts at 1, goes up whenever a new process starts with that token, and goes down when that process terminates.  It also goes up when a process specifically asks for a reference to a token and goes down when the process releases that reference.  When the last process (your shell program, Explorer) releases its reference to your token, the token's reference count drops to zero.  When the reference count drops to zero we destroy the token and the logon session associated with the token; the logoff event means the logon session was destroyed.  For network logons we use a thread token that is given back to the service that asked to log you on; that token is usually assigned to a thread that does work on your behalf.  It's all a little more complex than this in real life but this is basically how it works.

Anyway, many applications, particularly server applications, request references to tokens and then forget to release them.  This keeps the reference count from ever dropping to zero, which in turn prevents us from generating the logoff event.
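Here's a toy model of that reference counting, just to show why one leaked reference is enough to swallow the logoff event.  It is a sketch, not the real Windows data structures: the class and function names are hypothetical, and it assumes a single token per logon session, where the real thing can have several.

```python
class LogonSession:
    """Hypothetical stand-in for a logon session; records audit events."""

    def __init__(self, user, audit_log):
        self.user = user
        self.audit_log = audit_log
        audit_log.append(("logon", user))

    def destroy(self):
        # Destroying the session is what produces the logoff event.
        self.audit_log.append(("logoff", self.user))


class Token:
    """Hypothetical token whose reference count starts at 1."""

    def __init__(self, session):
        self.session = session
        self.ref_count = 1          # held by the first process, e.g. the shell

    def add_ref(self):
        self.ref_count += 1

    def release(self):
        if self.ref_count <= 0:
            raise RuntimeError("token released more times than referenced")
        self.ref_count -= 1
        if self.ref_count == 0:
            self.session.destroy()  # last reference gone: logoff is audited


def run_scenario(server_app_leaks):
    audit_log = []
    token = Token(LogonSession("alice", audit_log))
    token.add_ref()                 # a server application asks for a reference
    if not server_app_leaks:
        token.release()             # well-behaved: gives the reference back
    token.release()                 # the shell exits and releases its reference
    return audit_log
```

Calling run_scenario(False) gives a logon followed by a logoff.  Calling run_scenario(True) gives only the logon: the user is long gone, but the count is stuck at 1 and the session never gets destroyed.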

To work around this we added the "Begin logoff" event (551) in Windows Server 2003, which can be interpreted as a logoff event, but this doesn't cover all cases.  There are still some cases where logoff events are not generated due to poorly behaving applications.  We fix all known instances of this in the operating system before we release Windows, and we test it rather thoroughly, but we can't promise that your applications will not leak tokens.  If you encounter this you can troubleshoot by isolating each application until the token leak goes away, and then working with that application vendor.
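On the SEM/SIEM side, one way to cope is to treat either event as the end of a session and to keep track of the sessions that never get one.  The sketch below is only that, a sketch: the event tuple layout is made up, and the logon IDs 528 and 540 and the logoff ID 538 are my assumption for the Windows Server 2003 era (only 551 comes from the text above), so substitute whatever your platform actually emits.

```python
LOGON_EVENT_IDS = {528, 540}    # assumed: interactive and network logon
LOGOFF_EVENT_IDS = {538, 551}   # assumed 538 logoff; 551 is "Begin logoff"


def correlate(events):
    """Pair logons with logoffs.

    events: iterable of (timestamp, event_id, logon_id) in time order
            (a hypothetical layout, not a real log format).
    Returns (closed, still_open), where closed is a list of
    (logon_id, start, end) and still_open maps logon_id -> start.
    """
    still_open = {}
    closed = []
    for timestamp, event_id, logon_id in events:
        if event_id in LOGON_EVENT_IDS:
            still_open[logon_id] = timestamp
        elif event_id in LOGOFF_EVENT_IDS and logon_id in still_open:
            # The first of 551/538 closes the session; a later one is ignored.
            closed.append((logon_id, still_open.pop(logon_id), timestamp))
    return closed, still_open
```

Whatever is left in still_open should be reported as "end unknown" rather than "still logged on", since a missing logoff may just mean a leaked token, a timeout that hasn't fired yet, or a pulled power cord.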

Eric