Posted by: Sue Loh
Ha! Some nutshell. This post is rather long — though for me, that’s nothing new. I’ve tried to at least provide a starting point for you (OEMs and ISVs) to begin understanding the details of the CE6 OS that will mean most to you. I and the rest of the CE base team plan to follow up with more in-depth posts on selected topics in the near future.
“Windows Embedded CE 6.0” … it just rolls off the tongue doesn’t it? Ahhh Microsoft marketing.
Difference #1: New virtual memory layout
You probably have already seen the address space drawing. You can’t really escape it. But for some reason I feel a need to duplicate it here.
The main picture you need to internalize is that the kernel process gets the top 2GB of the 4GB (=32bit) virtual memory space, and the bottom 2GB is duplicated anew for every other process. If you are used to thinking of the current application running in “Slot 0,” now imagine “Slot 0” being 2GB in size. To visualize the difference, the tiny yellow sliver from CE5 on the left side of the picture becomes a giant chunk in CE6 on the right side of the picture. Instead of every process’ virtual memory being accessible at all times, now only the kernel process and one other process – the current process – are accessible. This means that accessing another process’ memory, particularly buffer parameters that are passed to your driver, is no longer as simple as mapping a pointer.
The benefits are widely trumpeted: you are no longer limited to 32 processes, and each process is no longer limited to 32MB of virtual memory. Marketing talks about how there are now 32,000 possible processes; in practical terms it’s a little lower since you’ll run out of memory before you make that many. There is still a limit of 512MB of physical RAM, and you can’t make 32,000 processes in 512MB. (The minimum memory per process is ~5 pages: 1 page directory, 1 page table, 1 handle table page, 1 page of stack, 1 page of heap.) Anyway, with a limit of “thousands,” who cares. Additionally, the increase from 32MB to 2GB of virtual memory is a 64x jump. Actually it is probably closer to double that, because in CE5 days all processes loaded DLLs at the same addresses. If a server process like device.exe loaded a driver DLL at a particular spot, an application was not free to load a different DLL in the same spot, even if it didn’t load the driver DLL. This means that in practical terms, CE5 applications were limited to much less than 32MB of VM. In CE6, all of the servers have been moved into the kernel process (see below) so the DLLs they load don’t affect application VM space. So, the practical increase applications see is closer to 128x depending on how much space was taken by DLLs in system server processes.
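To make the arithmetic above concrete, here is a small back-of-the-envelope sketch in C. It assumes 4KB pages (my assumption, the text above only gives page counts) and the ~5 page minimum per process; the function names are made up for illustration and are not part of any CE API.

```c
#include <stdint.h>

/* Assumptions for illustration only: 4 KB pages, and the ~5-page
   per-process minimum quoted in the post. */
enum { PAGE_SIZE_BYTES = 4096, MIN_PAGES_PER_PROCESS = 5 };

/* Upper bound on the process count if every process used only its
   minimum pages and nothing else needed RAM. */
uint64_t max_processes(uint64_t physical_ram_bytes)
{
    uint64_t min_bytes_per_process =
        (uint64_t)PAGE_SIZE_BYTES * MIN_PAGES_PER_PROCESS;
    return physical_ram_bytes / min_bytes_per_process;
}

/* Ratio between two per-process virtual address ranges,
   e.g. 2 GB in CE6 versus 32 MB in CE5. */
uint64_t vm_growth_factor(uint64_t new_vm_bytes, uint64_t old_vm_bytes)
{
    return new_vm_bytes / old_vm_bytes;
}
```

Under those assumptions, `max_processes(512ULL << 20)` comes out to 26,214, comfortably below 32,000 even before the OS itself takes any memory, and `vm_growth_factor(2ULL << 30, 32ULL << 20)` gives the 64x figure.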
What are we giving up? Why was it done the old way? Primarily, interprocess communication and buffer passing get more complicated. In the past, applications could access each others’ address spaces relatively easily. Now they must marshal memory between processes. Note that this is also a security benefit; hacker applications also will have a harder time accessing memory they shouldn’t.
The primary impact of this change is that most drivers will need to change the way they access the buffers that they are passed. This means that most drivers require at least a small code change and a recompile, instead of being compatible straight out of the box. That is pretty unfortunate, but basically unavoidable. Our goal was to make driver porting a very easy process, and we’ll be following up in the near future with more posts about what changes.
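If "marshalling" sounds abstract, the general copy-in/copy-out idea looks roughly like the sketch below. This is a portable simulation, not the CE6 marshalling API: `marshalled_buffer`, `marshal_open`, `marshal_close` and `driver_uppercase` are all hypothetical names, and since everything here runs in one process, a plain `memcpy` stands in for the cross-process access that the real OS has to perform on the driver's behalf.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor for a buffer handed to a driver by a caller. */
typedef struct {
    void  *caller_ptr;  /* address that is only meaningful to the caller */
    size_t size;        /* size of the caller's buffer in bytes */
    void  *local_copy;  /* duplicate owned by the server/driver side */
} marshalled_buffer;

/* Copy-in: duplicate the caller's data into memory the server owns.
   Returns 0 on success, -1 on bad arguments or allocation failure. */
int marshal_open(marshalled_buffer *mb, void *caller_ptr, size_t size)
{
    if (mb == NULL || caller_ptr == NULL || size == 0)
        return -1;
    mb->local_copy = malloc(size);
    if (mb->local_copy == NULL)
        return -1;
    memcpy(mb->local_copy, caller_ptr, size);
    mb->caller_ptr = caller_ptr;
    mb->size = size;
    return 0;
}

/* Copy-out: write the results back to the caller and free the duplicate. */
void marshal_close(marshalled_buffer *mb)
{
    if (mb == NULL || mb->local_copy == NULL)
        return;
    memcpy(mb->caller_ptr, mb->local_copy, mb->size);
    free(mb->local_copy);
    mb->local_copy = NULL;
}

/* Example "driver" entry point: uppercases ASCII letters in the buffer,
   touching only the marshalled copy, never the caller's pointer directly. */
int driver_uppercase(char *caller_buf, size_t size)
{
    marshalled_buffer mb;
    if (marshal_open(&mb, caller_buf, size) != 0)
        return -1;
    char *p = mb.local_copy;
    for (size_t i = 0; i < size; i++) {
        if (p[i] >= 'a' && p[i] <= 'z')
            p[i] = (char)(p[i] - 'a' + 'A');
    }
    marshal_close(&mb);
    return 0;
}
```

The point of the pattern is that the driver never dereferences the caller's pointer in the middle of its work; it opens the buffer, works on memory it is allowed to touch, and closes the buffer when done.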
When it comes to application compatibility, we tried very hard to ensure that well behaved applications would work without any changes. They should run just like they did before. “Well behaved” means they don’t make a lot of assumptions about being able to access each others’ memory, use some APIs we are restricting to kernel mode (they weren’t part of our SDK anyway), or things like that. To test application compatibility, we actually ported most of the Windows Mobile 5 components to the CE6 OS, and then ran a large set of WM5 applications to ensure that they still worked.
Difference #2: The unified kernel
With CE6, many critical OS components moved into the kernel process.
filesys.exe –> filesys.dll
device.exe –> device.dll
gwes.exe –> gwes.dll
Additionally, all of the drivers that used to load into those processes now load into the kernel process by default. This was primarily done for performance reasons; it is faster for applications to call into the kernel process than into other processes, and faster for kernel components to call each other than to call into other processes. In fact, this is a performance gain compared to CE5 since both types of calls are faster than the inter-process calls we had in CE5.
In the old world we spoke of “PSLs” (process server libraries) which were processes that registered an API set. In the new world I don’t like that terminology. Instead I refer to “kernel mode servers,” DLLs that load into the kernel process and register an API set, and “kernel mode drivers,” drivers that load into the kernel process.
Note the care I take with some terminology. First, when I talk about the kernel I try to differentiate between the kernel process where all of these DLLs are loading, and kernel.dll which is the executable for the Windows CE kernel. Sometimes I just say “the kernel” if the context should be relatively clear. Also, the old “kernel mode” that you knew from CE5 does not exist in CE6. SetKMode and ALLKMODE do not exist. There is no equivalent concept anymore. The CE5 “kernel mode” was two things: it was a control over whether a thread had access to kernel memory space, which is now limited solely to code running in the kernel process. And it was an API call performance enhancement which is now replaced by the performance improvements of the unified kernel. Whenever anyone talks about kernel mode in CE6 they’re really talking about running inside the kernel process. And user mode in CE6 refers to running in any other process.
A minor note: the code running inside the kernel process is now supported by a kernel-only version of coredll, k.coredll.dll. Any code linked against coredll.dll that loads into the kernel process is automatically redirected to use k.coredll.dll instead. Another minor detail is that kernel DLLs loaded with LoadKernelLibrary can now link against coredll. Also, to minimize our security risk, we chose not to allow UI components like commctrl.dll to load in kernel mode. Kernel mode code must use new helpers to call into user mode UI routines if UI is required.
What are we giving up? Well, driver bugs can destabilize the system even more than before. Crashes in drivers would crash the kernel process (though, in truth, crashing one of the other system servers in CE5 would be just as bad). Memory usage and memory leaks in drivers and servers can be even harder to trace to their source. Buffer overruns and other bad memory accesses can corrupt kernel memory. If a hacker can manipulate a poorly written driver or server into accessing memory on their behalf, they are more likely to be able to access whatever they want. Before all this gloom-and-doom sounds too scary though, remember that this is pretty much the same as most existing OSes on the market. We must take other steps to make up for these vulnerabilities.
Difference #3: User Mode Services and Drivers
On that note, CE6 introduces new support for user mode servers, drivers and services. Services.exe, which existed on CE5, remains a user mode process. All of the services which loaded into services.exe will load in user mode. Additionally, CE6 introduces a user mode version of our driver manager: udevice.exe. It is now possible to load a driver into user mode, into a unique instance of udevice.exe or into the same instance as other user mode drivers. We’ve taken quite a number of pains to ensure that well written drivers can be binary compatible between kernel mode and user mode: they won’t require a rebuild. “Well written” means that they use the memory marshalling APIs like they’re supposed to, they don’t call any kernel specific APIs or take advantage of the reduced limitations of kernel mode code. I’ll be explaining this in more detail in further blog posts.
This means that OEMs and hardware developers now have more choices available to them. We want to get to a state where OEMs make the final call whether drivers load into user mode for better stability and security, or kernel mode for better performance. We probably still have more work to do. In CE6 we provide a framework for user mode drivers; for the most part, however, we don’t take much advantage of it with our own drivers yet. Many Microsoft supplied drivers still load into kernel mode by default. Also, some drivers will have no choice but to load into kernel mode, because they need to use kernel APIs that are not accessible in user mode. I doubt we’ll ever completely remove that limitation.
Currently, good candidates for user mode drivers would be those that have kernel mode expansion bus drivers helping them out, like USB and SDIO.
Difference #4: Kernel / OAL Separation
In CE6 we took another leap which is somewhat significant to OEMs, but not a big deal to application developers. In CE5 the kernel and OEM Adaptation Layer (OAL) linked together to make nk.exe. In CE6 we separated the OAL and kernel into separate binaries, oal.exe (which ends up getting renamed to nk.exe) and kernel.dll. The Kernel Independent Transport Layer (KITL) was also separated into its own library, kitl.dll. This change was made primarily for improved updateability. In the past, if Microsoft released a kernel update, the OEM would have to take this and link it again with their OAL to produce a new nk.exe. Now the OEM only has to distribute Microsoft’s new kernel.dll. The other benefit of this change is that it formalizes the interface between the kernel, OAL and KITL. These components exchange tables of function pointers and variable values to communicate with each other, and cannot invoke functions other than those in the tables.
We tried to simplify this change as much as possible, to ease porting of CE5 OALs to CE6. We hid the existence of the function pointer tables inside wrapper functions that the OAL can use, so that exactly the same set of functions is available to the OAL. Anecdotal experience from our beta partners said that BSP porting to CE6 mostly took them between one day and one month. Travis Hobrla, a member of our BSP team, developed an awesome demo for MEDC 2006 (Mobile & Embedded DevCon) where he ported the CE5 CEPC OAL to CE6 in about 15 minutes. If you are an OEM, your experiences may vary, but we don’t anticipate it being too painful. It was our goal to make it an easy port. The main area where people will have to make big changes is if their OAL called kernel APIs that we did not intend to expose to the OAL. Since in CE5 the kernel and OAL were linked together, some people found that they could call kernel functions if they knew the function signatures (which they could get from our shared source code). In CE6 those people will have to move that functionality out of the OAL, into a kernel mode driver.
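For readers who haven't seen this pattern before, here is a rough sketch in C of two components exchanging function pointer tables, with a wrapper hiding the table lookup. Every name in it (`kernel_table`, `oal_table`, `oal_exchange_tables`, and so on) is hypothetical and chosen for illustration; it is not the actual CE6 kernel/OAL interface, and in the real system the two sides live in separate binaries rather than one source file.

```c
#include <stddef.h>

/* Hypothetical table of services the kernel exposes to the OAL. */
typedef struct {
    void (*set_last_error)(int code);
    int  (*get_last_error)(void);
} kernel_table;

/* Hypothetical table the OAL hands back: a variable value plus a routine. */
typedef struct {
    unsigned tick_period_ms;
    int (*init_hardware)(void);
} oal_table;

/* ---- "kernel" side ---- */
static int g_last_error = -1;
void kernel_set_last_error(int code) { g_last_error = code; }
int  kernel_get_last_error(void)     { return g_last_error; }

static const kernel_table g_kernel_table = {
    kernel_set_last_error,
    kernel_get_last_error
};

/* ---- "OAL" side ---- */
static const kernel_table *g_kernel;  /* the OAL's only route into the kernel */

/* Wrapper: OAL code calls an ordinary-looking function, and the table
   lookup stays hidden inside it. */
void oal_set_last_error(int code) { g_kernel->set_last_error(code); }

static int oal_init_hardware(void)
{
    oal_set_last_error(0);  /* report success through the wrapper */
    return 1;
}

static const oal_table g_oal_table = { 1, oal_init_hardware };

/* The exchange: the OAL saves the kernel's table and returns its own. */
const oal_table *oal_exchange_tables(const kernel_table *kernel)
{
    g_kernel = kernel;
    return &g_oal_table;
}

/* ---- "kernel" startup ---- */
int kernel_boot(void)
{
    const oal_table *oal = oal_exchange_tables(&g_kernel_table);
    return oal->init_hardware();
}
```

Because the OAL side only ever holds a pointer to the table, it has no way to reach a kernel function that isn't listed there, which is the formalized interface described above. The wrapper is what lets existing OAL code keep calling the same function names.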
As with every release, we made many small security improvements while working on the OS. Here are some of the big ones:
- Per process address space: With the new virtual memory model, it is impossible for one process to access another process’ memory directly. All access must be gated through a small set of APIs (like ReadProcessMemory) that can require privilege.
- Per process handle values: In previous versions of CE, a handle value was global to the whole system. In a lot of cases, handles could be passed verbatim between processes or “guessed” by an attacker. In CE6, handles are unique to each process. Attackers can’t hijack handles owned by other processes, because those handle values have no meaning to the attacker’s process. This is also a stability improvement, because processes can’t accidentally interfere with each others’ handles. For example, it used to be possible for a buggy process to accidentally call CloseHandle multiple times on the same handle value, and end up closing another process’ handle that had been opened in the meantime.
- Secure stack: System calls run on special kernel side stacks, to avoid stack tampering. Applications cannot manipulate stack contents asynchronously during a system call.
- Robust heaps: Our heaps have been completely reimplemented. In CE5, heap meta-data was stored inline with the heap data. In CE6, heap metadata is now completely separated from the data. This protects against system instability caused by heap buffer overruns.
- Safe remote heaps: Remote heaps are a new feature of CE6, a new type of heap that can be shared between a server process and client process. Only those two processes can share the heap. The server can choose whether to make the heap client-writeable or only client-readable. This provides more protection and more functionality than the “shared heaps” supported by previous OS versions. (Shared heaps are still supported in CE6; they are writable only to the kernel and readable by ALL processes.)
- Secure loader: We have ported the Windows Mobile 5 secure loader to the Windows CE embedded release.
- No UI in kernel process: Privileged server code cannot use UI, which means that there is less surface area with which to attack privileged code.
- Kernel mode only APIs: A lot of the “trusted” APIs of past OS versions are now only callable from kernel mode.
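To illustrate the per process handle values item from the list above, here is a toy model in C. It is only a sketch of the general idea, with hypothetical names (`process`, `open_handle`, `resolve_handle`, `close_handle`) and a tiny fixed-size table; it says nothing about how CE6 actually lays out its handle tables.

```c
#include <stddef.h>

enum { MAX_HANDLES = 8 };

/* Toy process: each one owns a private handle table.
   Slot i backs handle value i + 1; NULL means the slot is free. */
typedef struct {
    const char *objects[MAX_HANDLES];
} process;

/* Returns a handle value greater than 0, or 0 if the call fails. */
int open_handle(process *p, const char *object)
{
    if (object == NULL)
        return 0;
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (p->objects[i] == NULL) {
            p->objects[i] = object;
            return i + 1;
        }
    }
    return 0;  /* table full */
}

/* Looks the handle up in the table of the process that is asking. */
const char *resolve_handle(const process *p, int handle)
{
    if (handle < 1 || handle > MAX_HANDLES)
        return NULL;
    return p->objects[handle - 1];
}

/* Returns 1 if the handle was open in this process and is now closed. */
int close_handle(process *p, int handle)
{
    if (resolve_handle(p, handle) == NULL)
        return 0;
    p->objects[handle - 1] = NULL;
    return 1;
}
```

If one process opens an object and gets handle 1, a second process that guesses the value 1 resolves it against its own table and gets nothing back, and a stray `close_handle` in the second process cannot close the first process' object.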
I am sure there are too many small fixes to mention, and maybe even some big fixes that I’ve forgotten or don’t know about.
I cannot in good faith talk about security without mentioning one big aspect of the security of Windows CE 6. With this release, we reduced our trust model support. Instead of supporting a 2-tier trust model like the past (modules can be “trusted,” “untrusted,” or not allowed to run at all), Windows CE 6 only supports a 1-tier trust model (modules can be “trusted,” or not allowed to run at all). This does not reflect a diminished interest in providing security for Windows CE devices; rather, it is a step along our way to bigger and better things. OEMs can still completely lock down their devices, by requiring all code to be trusted before it can run. And the new secure loader makes that process simpler. We are moving toward a more granular security model, based on ACLs and privileges rather than “trust,” like you are used to from desktop Windows. However we’re not there yet.
As I mentioned previously, the unified kernel should provide a system wide performance boost, since API calls get cheaper. Here are some more details about the performance differences between CE6 and previous releases.
- Improved: calls to APIs in device.exe, filesys.exe, gwes.exe, calls between OS services.
- Same: thread switches, memory allocation, calls to nk.exe APIs (because nk.exe APIs have always been cheaper than those from our other servers, since it was loaded in kernel space), real-time performance.
- Worse: inter-process calls. The cost comes from having to marshal data between processes.
As in the past, you can use the CEBENCH tool to measure system call performance. However you should note that CEBENCH had to be expanded to reflect the additional scenarios that are now possible. Instead of just calls to a PSL server process, now we measure calls to a kernel mode server, calls to a user mode server, calls between kernel mode servers, etc. So when comparing between OS versions, you’ll have to be careful to make apples-to-apples comparisons between the same scenarios.
I want to reiterate: the CE6 kernel makes no change to real-time performance. Really, this work didn’t touch any code that affects real-time. You should be able to obtain the same real-time performance you did in the past. And you can still use the ILTIMING tool to measure your real-time response.
Further references, if you’re interested: