Posted by: Sue Loh

Hello out there, it’s been a long time since I posted anything real, and I feel sorry about that.  As I began writing this article, I had just come from the first day of TechEd where I saw my colleagues present about CE6 and drivers, and was reminded of a subject I was suddenly inspired to write up for you all.  Today is now the last day of TechEd and I’m back home, but my comments still apply.

I’ll let you in on something – not so much of a secret.  We all make mistakes.  And this is a blog post about one of my own.  You may have already read about the marshalling APIs on this blog, or otherwise learned of them.  When we designed these APIs, we planned them to hide away complexity in the decisions we made for performance and security reasons – so that OEMs and driver writers would not have to thread a maze of difficult details.  With that in mind, consider the CeAllocAsynchronousBuffer API.  The purpose of this API is to marshal a buffer into a driver’s (or server’s or service’s) process space such that the driver/server/service could access the buffer asynchronously.  The work required to do the marshalling depends on the circumstances.  In kernel mode it probably just needs to be aliased (VirtualCopied) into the kernel, while in user mode it must be duplicated (memcpy’d).  The work also depends on what work CeOpenCallerBuffer might have done beforehand – for example if it is already duplicated into the process.  So, CeAllocAsynchronousBuffer hides all of these details.  You can call it and trust the API to make the right choices for security and perf.  We designed it to hide these details while asking the caller to make no assumptions about what’s going on underneath.  Use CeFlushAsynchronousBuffer to guarantee changes have been written back, and CeFreeAsynchronousBuffer to do that plus release any resources.
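To make the alloc / flush / free contract concrete, here is a minimal toy model in plain C. It is a sketch only: the `sim_*` functions and the `sim_async_buffer` type are hypothetical names invented for illustration, they are not the CE APIs and do not share their signatures. The model shows why a caller should make no assumptions about what is underneath: with an alias the driver's writes land in the caller's memory immediately, while with a duplicate they only become visible after a flush (or a free, which flushes first).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: the sim_* names are hypothetical and are NOT the CE APIs. */
typedef struct {
    unsigned char *caller;  /* the caller's original buffer */
    unsigned char *driver;  /* the memory the driver touches asynchronously */
    size_t cb;              /* buffer size in bytes */
    int duplicated;         /* 1 = private copy, 0 = alias of caller memory */
} sim_async_buffer;

/* Marshal: either alias the caller's memory or take a private copy.
   Returns 0 on success, -1 if the copy could not be allocated. */
int sim_alloc_async(sim_async_buffer *b, unsigned char *caller,
                    size_t cb, int duplicate)
{
    assert(b != NULL && caller != NULL && cb > 0);
    b->caller = caller;
    b->cb = cb;
    b->duplicated = duplicate;
    if (!duplicate) {
        b->driver = caller;
        return 0;
    }
    b->driver = malloc(cb);
    if (b->driver == NULL) {
        return -1;
    }
    memcpy(b->driver, caller, cb);
    return 0;
}

/* Write-back: a no-op for an alias, a copy-out for a duplicate. */
void sim_flush_async(sim_async_buffer *b)
{
    if (b->duplicated) {
        memcpy(b->caller, b->driver, b->cb);
    }
}

/* Flush, then release whatever the marshal step allocated. */
void sim_free_async(sim_async_buffer *b)
{
    sim_flush_async(b);
    if (b->duplicated) {
        free(b->driver);
    }
    b->driver = NULL;
}
```

Code written against this contract behaves the same whichever strategy is picked, as long as it flushes before expecting the caller to see results.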

So that’s all well and good.  Enter older ARM CPUs and their virtually-tagged caches.  In the early days of CE6, we hadn’t quite come to terms with how to prevent the cache coherency problems you could get if you aliased/VirtualCopied memory.  Later, we fixed aliasing so that it makes both the source and destination buffers uncached for the duration of the alias.  (Specifically, we fixed VirtualAllocCopyEx, NOT VirtualCopy, since I am a stickler for little details.)  But in the early days, when we built the marshalling APIs, we were concerned about cache coherency.  So at that time, in CeAllocAsynchronousBuffer we made ARM virtually-tagged CPUs duplicate the memory instead of aliasing it.  This, of course, worried us greatly on the performance front, since we knew we’d ship a lot of ARM virtually-tagged devices.  So we added MARSHAL_FORCE_ALIAS with the expectation that callers would use it with caution, and deal with cache coherency problems themselves.  That, at least, could probably win some performance on large buffers, even if it did cost complexity.

Later, we got our heads on right and fixed aliasing to leave memory uncached.  So duplication was no longer as important.  But we also made a discovery: on small buffers, duplication was *faster* than aliasing!  We did some benchmarking and decided that for buffers below 16KB, we’d duplicate, while on larger buffers we’d alias.  But we’d only benchmarked ARM virtually-tagged devices, so we left the code close to its original state: the size-based choice between aliasing and duplication is made only on ARM virtually-tagged devices.  For all other cases, CeAllocAsynchronousBuffer usually aliased.
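The decision described above can be sketched as a small pure function. This is a rough model under stated assumptions, not the kernel's actual code: `choose_strategy`, `marshal_strategy` and `SMALL_BUFFER_THRESHOLD` are hypothetical names, and the sketch covers only the size, cache-type and force-alias inputs. The real API also takes into account things this model ignores, such as kernel versus user mode and what CeOpenCallerBuffer already did.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; none of these are real CE identifiers. */
#define SMALL_BUFFER_THRESHOLD (16u * 1024u)  /* "below 16KB", per the post */

typedef enum { STRATEGY_ALIAS, STRATEGY_DUPLICATE } marshal_strategy;

/* Models only the size / cache-type / flag decision described in the post. */
marshal_strategy choose_strategy(size_t cb, bool arm_virtually_tagged,
                                 bool force_alias)
{
    assert(cb > 0);
    if (force_alias) {
        return STRATEGY_ALIAS;      /* caller overrides the OS's choice */
    }
    if (arm_virtually_tagged && cb < SMALL_BUFFER_THRESHOLD) {
        return STRATEGY_DUPLICATE;  /* benchmarked as faster for small buffers */
    }
    return STRATEGY_ALIAS;          /* the usual outcome everywhere else */
}
```

In this model, forcing an alias changes the outcome in exactly one case: small buffers on ARM virtually-tagged devices, which is the case where duplication was measured to be faster.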

At that point, in my opinion, we should have removed the MARSHAL_FORCE_ALIAS flag.  Instead, we left it, and now we’re in a state where it confuses people.  At TechEd I saw my colleagues recommend it to driver developers for performance reasons – when in my opinion it should never be used.  Let the OS decide what’s best for performance.  The only case where we don’t alias is for small buffers on ARM virtually-tagged caches, where we’ve demonstrated that duplication is faster than aliasing.  I think it’s safe to say you can look forward to this getting cleaned up in the future.  But remember, my recommendation remains: don’t (blindly) use MARSHAL_FORCE_ALIAS!  It won’t break anything, but you’ll potentially be forcing the wrong thing for performance.

Comments (8)

  1. ce_base says:

    For performance reasons, letting the kernel decide to alias or copy is the right thing to do.

    However, I disagree that we should have removed the MARSHAL_FORCE_ALIAS flag altogether.  It allows a way for a driver to use the CeAllocAsynchronousBuffer API, which should be familiar to CE6 driver developers, to make a virtual copy of the user buffer.

    Sometimes it’s important that the buffer be re-filled by the user (WaveAPI is a good example of this).  In these cases, having the driver point at the same physical memory is imperative.  If we didn’t force an alias and the kernel decided to do a copy, your audio driver would play a very small portion of your sample over and over!

    Of course, the same thing can be achieved using VirtualCopy.  However, the standard recommended sequence in drivers is CeOpenCallerBuffer+CeAllocAsynchronousBuffer, so taking care of the aliasing in that sequence is convenient.

    So letting the kernel decide is best in most cases, but if your driver design calls for aliasing, then MARSHAL_FORCE_ALIAS is useful.  Ultimately (like always!) Sue is correct: "don’t (blindly) use MARSHAL_FORCE_ALIAS".

    –Travis Hobrla
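The refill scenario in the comment above can be illustrated with a toy model. It is a sketch under loud assumptions: `sim_playback`, `sim_open_playback`, `sim_next_sample` and `CHUNK` are hypothetical names, ordinary pointers stand in for aliased memory, and nothing here calls WaveAPI or the CE marshalling APIs. The point it shows is only that a one-time copy never sees data the caller writes afterwards, while an alias does.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4  /* hypothetical tiny "audio" chunk size for the demo */

/* Toy stand-in for a driver's view of a caller buffer; not a CE API. */
typedef struct {
    const short *view;      /* where the driver reads samples from */
    short snapshot[CHUNK];  /* private copy, used only when duplicating */
} sim_playback;

/* force_alias != 0: read the caller's memory directly.
   force_alias == 0: read a one-time copy, as a duplicating marshal would. */
void sim_open_playback(sim_playback *p, const short *caller, int force_alias)
{
    assert(p != NULL && caller != NULL);
    if (force_alias) {
        p->view = caller;
    } else {
        memcpy(p->snapshot, caller, sizeof p->snapshot);
        p->view = p->snapshot;
    }
}

/* The driver "plays" by reading the first sample it can currently see. */
short sim_next_sample(const sim_playback *p)
{
    return p->view[0];
}
```

If the caller refills its buffer after the open call, the duplicating view keeps returning the stale sample, which is the "same small portion over and over" symptom described above.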

  2. ce_base says:

    Thanks Travis, that’s a good point.  I was too focused on APIs/drivers that asynchronously pass data back to the caller – not thinking about those that asynchronously receive data from the caller.  You’re right, if the caller is making asynchronous changes, then the buffer DOES need to be aliased.

    Thus proving I didn’t make a mistake after all.  😉  Excellent!  Of course I am just kidding.  Thanks, Travis, for clarifying when users do and don’t need this flag!


  3. RS says:

    My question is about CeCallUserProc.

    In one of your posts, you mentioned that CeCallUserProc does NOT allow embedded pointers.  All arguments must be stored inside the single “in” buffer passed to CeCallUserProc, and return data must be stored in the single “out” buffer. My requirement is to display an error message, and I’m planning to have a UserModeUI.DLL in user mode that gets called via CeCallUserProc() from the kernel driver. Because I have to pass some debug messages to the user-mode driver, I need an array that is passed to UserModeUI.DLL.

    typedef struct {
        WCHAR Textmsg[256];
        WCHAR Caption[255];
    } ErrorMsg;


    So:

    1. Based on your earlier statement, I CANNOT pass this struct as lpInBuffer to this function call. What other method do I have to pass it to the user-mode driver?

  4. ce_base says:

    Hi RS, those are not embedded pointers so you should be fine with that. If that was

    typedef struct {

       WCHAR* pTextmsg;

       WCHAR* pCaption;

    } ErrorMsg;

    Then you would be unable to pass the data because those ARE embedded pointers.
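The difference between the two layouts can be sketched off-device in plain C. The assumptions are stated loudly: `wchar_t` stands in for WCHAR, `ErrorMsgFlat`, `ErrorMsgPtrs`, `fill_error_msg` and `copy_across` are hypothetical names, and a plain `memcpy` models the single in-buffer being carried to another process. Nothing here calls CeCallUserProc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* wchar_t stands in for WCHAR so the sketch builds off-device. */
typedef struct {
    wchar_t Textmsg[256];
    wchar_t Caption[255];
} ErrorMsgFlat;  /* hypothetical name; all the data lives inside the struct */

typedef struct {
    wchar_t *pTextmsg;
    wchar_t *pCaption;
} ErrorMsgPtrs;  /* hypothetical name; the strings live somewhere else */

/* Fill a flat message, truncating so each field stays NUL-terminated. */
void fill_error_msg(ErrorMsgFlat *m, const wchar_t *text,
                    const wchar_t *caption)
{
    assert(m != NULL && text != NULL && caption != NULL);
    memset(m, 0, sizeof *m);
    wcsncpy(m->Textmsg, text, 255);
    wcsncpy(m->Caption, caption, 254);
}

/* Model "crossing a process boundary" as a byte copy of exactly cb bytes. */
void copy_across(void *dst, const void *src, size_t cb)
{
    memcpy(dst, src, cb);
}
```

Copying `sizeof(ErrorMsgFlat)` bytes carries both strings along, because the characters are part of the struct. Copying `sizeof(ErrorMsgPtrs)` bytes would carry only two addresses, which mean nothing in the receiving process.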


  5. RS says:

    Hi Sue,

    What I observed was that if I use the struct definition shown above, UserModeUI.DLL never gets called, whereas if I comment out the arrays and use a DWORD instead, it seems to work.

    Change the struct definition to:

    typedef struct {
        DWORD dwtest;
        //WCHAR Textmsg[256];
        //WCHAR Caption[255];
    } ErrorMsg;


  6. RS says:

    Hi Sue,

    If I retain the Textmsg and Caption member variables in the above structure, do I have to call CeOpenCallerBuffer() on each of them so that the user-mode driver gets an appropriate address for each member variable?

  7. ce_base says:

    I’m sorry, but this is not really a good place to go back and forth discussing a problem unrelated to the above post.  Please post your questions on one of our newsgroups, like microsoft.public.windowsce.platbuilder, and we’ll carry on the conversation there.