Posted By David Campbell
Doug Boling recently hosted one of his regular webcasts on Optimizing performance and power on Windows Embedded Compact 7 and has graciously provided me with a companion article. Thanks Doug! There’s more information later in the article about how you can sign up for these webcasts so please do join us for the monthly sessions.
Embedded hardware is slow. It’s designed that way. Personal computers are sold to customers who are dazzled by high gigahertz numbers and massive hard disks; embedded customers buy a widget that does something. If the widget does that something well, that’s all that matters, so the manufacturer is going to use the slowest (and often least expensive) hardware that can implement it. This is one requirement that makes embedded software so challenging to write: the software must perform well so that the hardware can be as inexpensive as possible.
In this blog post, I will review some of the techniques for system design that can improve performance and, as a consequence, the power consumption of a system. I’ll also cover some lower level application and driver characteristics that can lower power consumption directly.
There is one main attribute for embedded systems that must always be considered when working on performance. These systems are typically limited by the speed of memory accesses, not the core CPU. Typical “System On a Chip” (SOC) CPUs are quite fast internally but have limited instruction and data caches. This issue ripples through the design of the software in many areas.
Of particular influence is the design of the BSP routines that handle maintenance of the L1 and L2 caches. L1 and L2 designate two levels of caching between the CPU and the memory. The L1 cache is typically very small on embedded CPUs, around 16 to 64 KB, while the L2 cache is larger, typically 32 to 256 KB. (By comparison, some desktop CPUs have 12 MB of cache in various L1/2/3 configurations!) Because cache architectures vary among CPU architectures, the kernel calls down to the OEM Abstraction Layer (OAL) in the board support package to flush all or part of the cache as necessary.
Unfortunately, some of the BSPs on the market take a minimalist view of how these cache maintenance routines should work and simply purge the entire cache when flushing a line or two is all that is requested. This has a huge impact on performance because it significantly decreases the effectiveness of the cache. It is critical that companies review the low level OEMCacheRangeFlush routine in their BSPs to ensure it only flushes the data absolutely required when called. This recommendation, more than anything else in this article, will provide the greatest system-wide improvement in performance.
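To make the difference concrete, here is a minimal sketch of range-limited cache maintenance in portable C++. It is not the actual OAL routine: FlushCacheRange and CacheLineOp are hypothetical names, the per-line hook stands in for whatever CPU-specific instruction a real BSP would issue, and the line size is assumed to be a power of two.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical hook standing in for a CPU-specific
// "clean/invalidate one cache line" instruction.
using CacheLineOp = void (*)(std::uintptr_t lineAddress);

// Visits only the cache lines overlapped by [address, address + length).
// Assumes lineSize is a power of two. Returns the number of lines touched.
std::size_t FlushCacheRange(std::uintptr_t address, std::size_t length,
                            std::size_t lineSize, CacheLineOp flushLine) {
    if (length == 0) {
        return 0;  // nothing requested, nothing flushed
    }
    const std::uintptr_t mask = ~(static_cast<std::uintptr_t>(lineSize) - 1);
    const std::uintptr_t first = address & mask;                // line holding the first byte
    const std::uintptr_t last = (address + length - 1) & mask;  // line holding the last byte
    std::size_t touched = 0;
    for (std::uintptr_t line = first;; line += lineSize) {
        if (flushLine != nullptr) {
            flushLine(line);
        }
        ++touched;
        if (line == last) {
            break;
        }
    }
    return touched;
}
```

With 32-byte lines, a 64-byte buffer starting at address 0x1010 overlaps three lines, so only three lines are flushed and the rest of the cache stays warm.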
While data and instruction caches are well known, there is a third type of cache often overlooked by the software developer - overlooked, but critical to performance. The Translation Lookaside Buffer (TLB) is a cache of the virtual to physical address translations kept in the page table. The memory management unit uses the TLB to avoid reading a page table entry every time it needs to calculate an address. The typical TLB is small with perhaps only 32 entries. Each TLB entry references one page in the physical address space.
As the code and data of an application grow, at some point the number of pages referenced exceeds the number of available TLB entries, forcing the CPU to read the page table in RAM to translate the new page. On MIPS and SH systems, TLB activity can be monitored with the CELOG infrastructure because the refill is done in interrupt-driven software. Unfortunately, these accesses are difficult to track on x86 and ARM systems because the fill of a TLB entry is done in hardware. In this case, a JTAG debugger is typically used. As a general rule, keep the number of DLLs low, as each DLL will have unique pages for code and static data. Reduce the number of threads, as each thread has unique stack pages. And in general, reduce the size of the code. Finally, typical COM (Component Object Model) style development can cause problems as the resulting code is large and spread among multiple modules.
There are other memory related issues as well. In general, applications and drivers should pass pointers to data instead of copying the data to new buffers. Generally, this passing of pointers is discouraged by ivory tower computer ‘scientists’ worried about security and some level of imagined stability. However, engineers must consider the demands of a real world embedded system and must often override the musings of theory. Clearly, areas of security exposure must be examined and protections enacted, but not every call between two blocks of code poses a security risk. Passing a pointer from one module to another does require that the developers of both modules work together to manage that buffer, but the payoff is a faster system: by passing pointers instead of copying data, the load on the memory bus is reduced and performance increases.
Another performance bottleneck occurs when software crosses a boundary such as between user mode and kernel mode or from managed code to native code. In these cases, the operating system (or the runtime in the case of .NET CF) has to marshal the data from one domain to the other. Because the simplest way to do this, again for security and stability reasons, is to re-buffer the data into the new domain, these cross-boundary calls are much slower than traditional calls between an .EXE and a .DLL or between .DLLs. To minimize the impact, developers should examine their code and reduce the number of calls. In places where lots of cross-boundary activity does take place, try to use fewer calls with more data. The expression “Think chunky, not chatty” is a good motto. That is, have fewer (less chatty) calls do more (chunky) work.
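As a rough illustration of the two styles, the sketch below contrasts a consumer that takes its input by value (copying the payload on every call) with one that reads the producer's buffer in place. BufferView, MakeView, SumBytesCopy and SumBytesInPlace are hypothetical names made up for this example, and the ownership rule in the comments is exactly the kind of agreement the two modules have to reach.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical non-owning view. The producer keeps ownership and must keep
// the buffer alive and unmodified while the consumer is using it.
struct BufferView {
    const std::uint8_t* data;
    std::size_t length;
};

// Copying style: the parameter is passed by value, so every call
// duplicates the payload before any work is done.
std::uint32_t SumBytesCopy(std::vector<std::uint8_t> payload) {
    std::uint32_t sum = 0;
    for (std::uint8_t b : payload) {
        sum += b;
    }
    return sum;
}

// Pointer-passing style: the consumer reads the producer's buffer in place.
std::uint32_t SumBytesInPlace(BufferView view) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < view.length; ++i) {
        sum += view.data[i];
    }
    return sum;
}

// Builds a view over an existing buffer without copying it.
BufferView MakeView(const std::vector<std::uint8_t>& buffer) {
    return BufferView{buffer.data(), buffer.size()};
}
```

Both functions return the same answer; the difference is that the in-place version never moves the payload across the memory bus a second time.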
For example, when calling native code from managed code, sometimes it is better to put more intelligence on the native side to reduce additional calls from the managed code. Another situation is when an application makes repeated operating system calls. In some cases, it makes sense to create a kernel mode driver to invoke those many operating system calls. Kernel mode drivers don’t experience the performance hit when calling the operating system as they already are running in (obviously) kernel mode.
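The “chunky, not chatty” idea can be sketched with a toy model of a boundary. SensorService below is a hypothetical stand-in for any component that sits on the far side of an expensive transition; it simply counts crossings so the two calling styles can be compared. Nothing here is a real Windows Embedded Compact API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical component on the far side of an expensive boundary
// (kernel transition, managed-to-native call, and so on).
class SensorService {
public:
    explicit SensorService(std::vector<int> samples)
        : samples_(std::move(samples)) {}

    // Chatty: one boundary crossing per sample.
    // The caller must pass a valid index.
    int ReadSample(std::size_t index) {
        ++crossings_;
        return samples_[index];
    }

    // Chunky: one boundary crossing returns up to maxCount samples.
    // Returns the number of samples written to out.
    std::size_t ReadSamples(std::size_t start, int* out, std::size_t maxCount) {
        ++crossings_;
        std::size_t n = 0;
        while (n < maxCount && start + n < samples_.size()) {
            out[n] = samples_[start + n];
            ++n;
        }
        return n;
    }

    std::size_t Crossings() const { return crossings_; }

private:
    std::vector<int> samples_;
    std::size_t crossings_ = 0;
};
```

Reading four samples one at a time costs four crossings; reading them with a single ReadSamples call costs one, and the per-call marshaling overhead is paid only once.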
Some things are just slow
There is a popular but unfortunate practice of using the Windows registry as an interprocess communication medium. The temptation is obvious. The registry is a system wide database accessible from all applications as well as drivers. The calls to access the registry are simple and, if the system implements it, registry data is persistent. However, the registry is designed as a configuration database, not an interprocess communication medium. Accesses to the registry can be slow, especially opening and closing of keys. What’s worse, many utility libraries open and close a key for each access of a value underneath that key.
Developers should avoid using the registry to communicate anything more than basic configuration data between applications and between applications and drivers. When reading that configuration data, open registry keys once and then read and write all the values within the key. Make sure the application only reads the registry data once and then caches those values. I find it surprising how many applications read registry data repeatedly. This simply slows down the application.
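One way to enforce the “read once, then cache” rule is to put it behind a small wrapper. The sketch below is an assumption-laden model rather than registry code: SlowStore is a hypothetical stand-in for the registry that counts how often a key is opened, and CachedSettings is a hypothetical helper that loads every value under the key in a single pass and serves later reads from memory.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a slow configuration database such as the
// registry. It counts how many times the key has been opened.
class SlowStore {
public:
    void Set(const std::string& name, int value) { values_[name] = value; }

    // Models "open the key, read every value under it, close the key"
    // as a single pass.
    std::map<std::string, int> ReadAll() {
        ++opens_;
        return values_;
    }

    std::size_t Opens() const { return opens_; }

private:
    std::map<std::string, int> values_;
    std::size_t opens_ = 0;
};

// Loads the configuration on first use and answers every later request
// from the in-memory copy.
class CachedSettings {
public:
    explicit CachedSettings(SlowStore& store) : store_(store) {}

    int Get(const std::string& name, int fallback) {
        if (!loaded_) {
            cache_ = store_.ReadAll();  // the only trip to the slow store
            loaded_ = true;
        }
        const auto it = cache_.find(name);
        return it == cache_.end() ? fallback : it->second;
    }

private:
    SlowStore& store_;
    std::map<std::string, int> cache_;
    bool loaded_ = false;
};
```

However many settings the application asks for, the slow store is opened exactly once.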
Another area to examine is the handling of Windows Embedded Compact databases. The operating system has a native database that is exposed in two APIs, the CE database (CEDB) and the Embedded database (EDB). While the EDB has more features, for example more flexibility in sorting, the CEDB is smaller and faster. If your application can live within the limits of the CEDB, use it instead of the EDB.
In addition, while opening either a CEDB or EDB database is fairly fast, mounting that database isn’t. Make sure your application only mounts a database once. I’ve seen plenty of large applications written by multiple developers that open a database more than once simply because the various modules within the application don’t share the database instance handle.
Windows Embedded Compact is a multithreaded operating system. Threads can be a great feature to take advantage of in an application particularly on a multicore CPU. However, having multiple threads for the sake of having multiple threads will decrease the speed of an application. Calling a routine is almost instantaneous. Creating a thread to call a routine can take thousands of times longer. Of course, if that thread is necessary because code needs to run asynchronously from the calling routine, by all means use the additional thread but don’t do so unnecessarily.
Another option in places where multithreading is necessary is to use a thread pool. In a thread pool, the code creates a few threads ahead of time that remain blocked. When it comes time to use a thread, the caller simply passes the information to a thread and then signals the thread by triggering an Event. This saves the time needed to create the thread ‘on demand’. It does incur the penalty of having the background thread existing continually, but the performance gain is usually worth the memory cost.
Thread synchronization is typically accomplished using either a Critical Section or a Mutex. These two operating system objects provide similar features. The mutex does have two additional capabilities: it can be shared across multiple applications by naming the object, and a wait on it can specify a timeout. So, why use a Critical Section when a mutex is more powerful? Because in the typical case, entering an unblocked Critical Section is thousands of times faster than acquiring a signaled mutex. A quick rule of thumb is to never use an unnamed mutex: the main reason to choose a mutex is to share it across multiple processes, and that requires a name.
No performance discussion would be complete without a mention of the video system. Even with the rise in hardware accelerated video subsystems, the speed of the video in an embedded system pales in comparison to the video systems on the desktop. While the details of the user interface need to be customized for the hardware there is one area that can be universally tuned to help on a Windows Embedded Compact system.
One of my pet peeves is the “Exploding Rectangle” animation that is displayed when a top level window is created, minimized, or maximized. It slows the system down. This animation is a relic of the old Windows 95 look and feel; it is functionally useless and, frankly, to my taste old fashioned. Fortunately, it can be disabled by setting the following registry value:
; Disable shell rectangle animation
You will be amazed at how removing the animation makes the system seem snappier.
Silverlight for Windows Embedded (SWE) provides an unprecedented way of easily leveraging powerful graphics processors in embedded systems. SWE doesn’t necessarily require fast graphics hardware, but it can take advantage of it. The key to getting the most out of the hardware without ending up with an application that is too slow is to have the user interface designers test their designs on equivalent hardware as the design is being developed. Fortunately, the bifurcated development process of SWE that separates the user interface design (done in Expression Blend) from the business logic design (done in Visual Studio) makes this U/I testing possible. It is critical that the designers test early and often on the hardware. Testing on a Virtual PC or even a CEPC when the final hardware will be a slower system is a direct path to failure. Testing must occur on the end hardware.
[DC: Doug makes a great point here that I’d like to emphasize and more broadly extend. For all development it’s important to develop and test as much as possible on the actual hardware. This is particularly true if you’re developing code that you want shared. Designing on Compact and moving to the desktop results in code that runs well in both places. Designing and developing on the desktop and then moving to Compact often results in design decisions that are hard to undo and performance issues as a result. Unfortunately I’ve heard of way too many cases where the work was done on the more powerful platform first.]
So far, I’ve only talked about performance. What about power? Actually, the two are closely related. If I were talking hardware, performance gained by such strategies as increasing CPU clock speed would have an adverse effect on power consumption. But in software design, performance is gained by more efficient code. Writing efficient code also helps with power by lowering the speed requirements on the hardware. There are other aspects of software design that will help with power consumption however.
Before I dive into some power optimization techniques, a discussion of the Windows Embedded Compact power manager is in order. The power manager provides centralized control of various power states in the system such as “On”, “Screen Off”, “User Idle”, “System Idle”, “Off” and “Reset”. For each of these states, the power manager sets the power state of individual device drivers through a set of IOCTL calls. The correlation of the power manager state and the driver power states is configured through the registry.
[DC: Here’s a link to more detail on the power manager.]
While these states are handy, and some power states clearly cause the system to consume less power than others, this is only the beginning of good power management. Device drivers have a better knowledge of the state of their hardware and how it is being used.
For example, with a serial port, if the driver has not been opened by an application, the port isn’t in use, regardless of the system power state. If the port isn’t in use, there is no reason for the RS-232 level driver chip to be powered. In fact, since most SOC processors can control the clock logic to the various peripherals, there isn’t even a reason for the serial port UART to be enabled and have an active clock. When the application opens the port, the driver can power up the level driver and enable clocks to the UART. The driver should be able to minimize power consumption by its hardware by intelligently disabling the hardware whenever possible.
From both the application and driver perspective, there are a number of coding techniques that improve power consumption. The first and foremost rule is: “Threads always block.” When the Windows Embedded Compact scheduler detects that all threads in the system are blocked, it calls into the OAL to a routine called OEMIdle. In that routine, the OEM typically halts the CPU. In a halted state, the CPU stops executing any code. This reduces the power consumption of the CPU (and of the components that no longer have to react to the CPU) by more than a factor of ten. That’s right, over a 10x reduction in power when all threads are blocked.
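The serial port example boils down to reference counting the open handles. The sketch below models that logic only: SerialHardware and SerialPortDriver are hypothetical types, and the two boolean flags stand in for the real clock-gating and power-control registers a driver would touch in its open and close handlers.

```cpp
#include <cstddef>

// Hypothetical model of the hardware state the driver controls.
struct SerialHardware {
    bool uartClockEnabled = false;
    bool levelDriverPowered = false;
};

// Hypothetical driver skeleton: hardware is powered only while at least
// one client has the port open.
class SerialPortDriver {
public:
    explicit SerialPortDriver(SerialHardware& hardware) : hardware_(hardware) {}

    // First open powers the hardware up; later opens only bump the count.
    void Open() {
        if (openCount_ == 0) {
            hardware_.uartClockEnabled = true;
            hardware_.levelDriverPowered = true;
        }
        ++openCount_;
    }

    // Last close powers the hardware back down. Unbalanced closes are ignored.
    void Close() {
        if (openCount_ == 0) {
            return;
        }
        --openCount_;
        if (openCount_ == 0) {
            hardware_.levelDriverPowered = false;
            hardware_.uartClockEnabled = false;
        }
    }

    std::size_t OpenCount() const { return openCount_; }

private:
    SerialHardware& hardware_;
    std::size_t openCount_ = 0;
};
```

The hardware stays dark from boot until the first open, and goes dark again as soon as the last client closes the port, whatever the system power state happens to be.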
It is never okay to run a thread in a loop, polling for some state to change. Spinning a thread is just about the worst thing a developer can do to hurt both performance and power in a system. Introducing Sleep statements inside the loop may reduce the load on the system a bit but the power consumption will still be dramatically impacted.
I had a client from a large cellphone manufacturer once ask me to contact Microsoft. He was the power engineer on a particular phone design and was complaining about a thread in the Windows Mobile shell code that polled for an event once every 5 seconds or so. He said, “They’re ruining my power budget!” Cellphones need long standby times, and having any thread in the system poll for an event, even with a period of 5 seconds, was enough to noticeably hurt power consumption on the device. For good power AND performance, remember: “Threads always block.”
The great thing about embedded software is that good software design impacts everything in the product. Well-designed software can improve the performance enough to enable lowering the performance of the underlying hardware. This reduces cost, and if battery powered, can even reduce the capacity of the battery and therefore the size of the product. This is what makes embedded software design fun. So have fun, design your software with an eye to performance and power.
Once again I’d like to thank Doug for his time and his insight. For more information on Doug, visit his website at www.bolingconsulting.com. I should also point out that Doug has another webcast scheduled for July. Here are the details:
When: Tuesday July 17, 2012 9:00 AM Pacific Time
Title: Booting x86 systems into Windows Embedded Compact 7
Overview: Windows Embedded Compact 7 is a popular operating system for low cost, x86-based systems. The problem is getting the system up and running. From DOS based bootloaders to the customized BIOSLoader, embedded developers have tried a number of methods to boot the operating system. This webcast will cover the various bootloaders for x86 systems and how to configure and update the loaders to customize the user experience for your system.
Registration link (includes link to previous recordings): http://www.microsoft.com/windowsembedded/en-us/develop/windows-embedded-compact-7-developer-classroom.aspx
Please join us for what will certainly be another great talk.
Windows Embedded Compact 7 supports power management features through system power states that you define for your device. This sample adds support for additional system power states, and the source code is organized so that you can easily extend it to support custom system power states of your own design. The Power Manager Sample platform dependent driver (PDD) can be used with Windows Embedded Compact 7 to add additional system power states to your device.
The Power Manager Sample PDD supports the following system power states:
Another great site
I was recently introduced to the 101 blog (part of his Embedded101 site) by writer Sam Phung, who wrote Professional Windows Embedded Compact 7. According to Sam, the site was set up as a community site to serve the following objectives:
- Provide information to help the general developer community realize opportunities associated with Windows Embedded.
- Provide technical information resources to help developers learn and engage in Windows Embedded development.
- Provide an online platform for developers to collaborate and share Windows Embedded knowledge.
- Provide technical information resources to help student and hobbyist developers learn and develop projects based on Windows Embedded technologies.
The site has a number of interesting topics from high level information to great step by step walkthroughs. Great site Sam!
If folks have similar recommendations they’d like to have called out, by all means let me know.
That’s it for now. Thanks to Doug and Sam for sharing their insights with the community.