Posted by: Sue Loh
Hello out there, it’s been a long time since I posted anything real, and I feel sorry about that. As I began writing this article, I had just come from the first day of TechEd where I saw my colleagues present about CE6 and drivers, and was reminded of a subject I was suddenly inspired to write up for you all. Today is now the last day of TechEd and I’m back home, but my comments still apply.
I’ll let you in on something that’s not much of a secret: we all make mistakes. This is a blog post about one of my own. You may have already read about the marshalling APIs on this blog, or otherwise learned of them. When we designed these APIs, we wanted them to hide the complexity of the decisions we made for performance and security reasons, so that OEMs and driver writers would not have to thread a maze of difficult details.
With that in mind, consider the CeAllocAsynchronousBuffer API. Its purpose is to marshal a buffer into a driver’s (or server’s or service’s) process space so that the driver/server/service can access the buffer asynchronously. The work required to do the marshalling depends on the circumstances. In kernel mode the buffer probably just needs to be aliased (VirtualCopied) into the kernel, while in user mode it must be duplicated (memcpy’d). The work also depends on what CeOpenCallerBuffer might have done beforehand, for example, if the buffer has already been duplicated into the process. CeAllocAsynchronousBuffer hides all of these details. You can call it and trust the API to make the right choices for security and perf, and in return the caller should make no assumptions about what’s going on underneath. Use CeFlushAsynchronousBuffer to guarantee changes have been written back, and CeFreeAsynchronousBuffer to do that plus release any resources.
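To make the "make no assumptions" point concrete, here is a minimal toy model in C of the duplicate path only. It is not the CE API: the names `async_buf_t`, `async_alloc`, `async_flush` and `async_free` are hypothetical stand-ins, and plain `malloc`/`memcpy` stand in for whatever the OS really does. The point it illustrates is that when the buffer has been duplicated, the caller's memory does not see the driver's writes until a flush (or a free, which flushes first).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: hypothetical names, not the Windows CE marshalling API. */
typedef struct {
    unsigned char *caller; /* the caller's original buffer            */
    unsigned char *copy;   /* the driver's private duplicate          */
    size_t         cb;     /* size of both buffers, in bytes          */
} async_buf_t;

/* Duplicate the caller's buffer so it can be used asynchronously.
   Returns 0 on success, -1 on failure. */
int async_alloc(async_buf_t *ab, unsigned char *caller, size_t cb)
{
    if (ab == NULL || caller == NULL || cb == 0)
        return -1;
    ab->copy = (unsigned char *)malloc(cb);
    if (ab->copy == NULL)
        return -1;
    memcpy(ab->copy, caller, cb);
    ab->caller = caller;
    ab->cb = cb;
    return 0;
}

/* Write the driver's changes back to the caller's buffer. */
void async_flush(async_buf_t *ab)
{
    memcpy(ab->caller, ab->copy, ab->cb);
}

/* Flush, then release the duplicate. */
void async_free(async_buf_t *ab)
{
    async_flush(ab);
    free(ab->copy);
    ab->copy = NULL;
}
```

In this model, a write to `ab.copy` is invisible through `ab.caller` until `async_flush` or `async_free` runs. With an aliased buffer the write would be visible immediately, which is exactly why code that relies on either behavior is making an assumption the real API asks you not to make.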
So that’s all well and good. Enter older ARM CPUs and their virtually-tagged caches. In the early days of CE6, we hadn’t quite come to terms with how to prevent the cache coherency problems you could get if you aliased/VirtualCopied memory. Later, we fixed aliasing so that it would make both the source and destination buffers uncached for the duration of the alias. (Specifically, we fixed VirtualAllocCopyEx, NOT VirtualCopy, since I am a stickler for little details.) But in the early days, when we built the marshalling APIs, we were concerned about cache coherency. So at that time, we made CeAllocAsynchronousBuffer duplicate the memory instead of aliasing it on ARM virtually-tagged CPUs. This, of course, worried us greatly on the performance side, and we knew we’d ship a lot of ARM virtually-tagged devices. So we added MARSHAL_FORCE_ALIAS with the expectation that callers would use it with caution and deal with cache coherency problems themselves. That, at least, could probably win some performance on large buffers, even if it did cost complexity.
Later, we got our heads on right and fixed aliasing to leave memory uncached, so duplication was no longer as important. But we also made a discovery: on small buffers, duplication was *faster* than aliasing! We did some benchmarking and decided that for buffers below 16KB we’d duplicate, while for larger buffers we’d alias. But we’d only benchmarked ARM virtually-tagged devices, so we left the code similar to its original state, meaning that we only made the size-based aliasing vs. duplication decision on ARM virtually-tagged devices. In all other cases, CeAllocAsynchronousBuffer usually aliased.
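The policy described above can be sketched in a few lines of C. This is only a model of the decision as the post describes it, not the actual kernel source: `cache_kind_t`, `marshal_strategy_t`, `choose_strategy` and the `force_alias` parameter (standing in for a caller passing MARSHAL_FORCE_ALIAS) are all hypothetical names. It also assumes the simple kernel-mode case, and that a buffer of exactly 16KB counts as "large", which the post does not specify.

```c
#include <stddef.h>

/* Hypothetical model of the alias-vs-duplicate decision; not real OS code. */
typedef enum { CACHE_VIRTUALLY_TAGGED, CACHE_OTHER } cache_kind_t;
typedef enum { STRATEGY_DUPLICATE, STRATEGY_ALIAS } marshal_strategy_t;

/* Threshold from the benchmarking mentioned in the post: 16KB. */
#define SMALL_BUFFER_LIMIT ((size_t)16 * 1024)

marshal_strategy_t choose_strategy(cache_kind_t cache, size_t cb, int force_alias)
{
    /* A caller-supplied force flag overrides the OS's own choice. */
    if (force_alias)
        return STRATEGY_ALIAS;

    /* Small buffers on virtually-tagged caches: duplication benchmarked faster. */
    if (cache == CACHE_VIRTUALLY_TAGGED && cb < SMALL_BUFFER_LIMIT)
        return STRATEGY_DUPLICATE;

    /* Everything else aliases. */
    return STRATEGY_ALIAS;
}
```

In this model, `choose_strategy(CACHE_VIRTUALLY_TAGGED, 4096, 0)` picks duplication, while passing a nonzero `force_alias` for the same buffer picks aliasing, which is the slower option for that case according to the benchmarks described above.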
At that point, in my opinion, we should have removed the MARSHAL_FORCE_ALIAS flag. Instead, we left it in, and now we’re in a state where it confuses people. At TechEd I saw my colleagues recommend it to driver developers for performance reasons, when in my opinion it should never be used. Let the OS decide what’s best for performance. The only case where we don’t alias is for small buffers on ARM virtually-tagged caches, where we’ve demonstrated that duplication is faster than aliasing. I think it’s safe to say you can look forward to this getting cleaned up in the future. But remember, my recommendation remains: don’t (blindly) use MARSHAL_FORCE_ALIAS! It won’t break anything, but you’ll potentially be forcing the wrong thing for performance.