Announcing key advances to JavaScript performance in Windows 10 Technical Preview

Article
10/09/2014

The Windows 10 Technical Preview brings key advances to Chakra, the JavaScript engine that powers Internet Explorer and store based Web apps across a whole range of Windows devices – phones, tablets, 2-in-1’s, PC’s and Xbox. As with all previous releases of Chakra in IE9, IE10 and IE11, this release is a significant step forward to create a JavaScript engine that is highly interoperable, spec compliant, secure and delivers great performance. Chakra now has a highly streamlined execution pipeline to deliver faster startup, supports various new and augmented optimizations in Chakra’s Just-in-Time (JIT) compiler to increase script execution throughput, and has an enhanced Garbage Collection (GC) subsystem to deliver better UI responsiveness for apps and sites. This post details some of these key performance improvements.

Chakra’s Multi-tiered Pipeline: Historical background

Since its inception in IE9, Chakra has supported a multi-tiered architecture – one which utilizes an interpreter for very fast startup, a parallel JIT compiler to generate highly optimized code for high throughput speeds, and a concurrent background GC to reduce pauses and deliver great UI responsiveness for apps and sites. Once the JavaScript source code for an app or site hits the JavaScript subsystem, Chakra performs a quick parse pass to check for syntax errors. After that, all other work in Chakra happens on an as-needed-per-function basis. Whenever possible, Chakra defers the parsing and generation of an abstract syntax tree (AST) for functions that are not needed for immediate execution, and pushes work, such as JIT compilation and GC, off the main UI thread, to harness the available power of the underlying hardware while keeping your apps and sites fast and responsive.

When a function is executed for the first time, Chakra’s parser creates an AST representation of the function’s source. The AST is then converted to bytecode, which is immediately executed by Chakra’s interpreter. While the interpreter is executing the bytecode, it collects data such as type information and invocation counts to create a profile of the functions being executed. This profile data is used to generate highly optimized machine code (a.k.a. JIT’ed code) as a part of the JIT compilation of the function. When Chakra notices that a function or loop-body is being invoked multiple times in the interpreter, it queues up the function in Chakra’s background JIT compiler pipeline to generate optimized JIT’ed code for the function. Once the JIT’ed code is ready, Chakra replaces the function or loop entry points such that subsequent calls to the function or the loop start executing the faster JIT’ed code instead of continuing to execute the bytecode via the interpreter.

Chakra’s background JIT compiler generates highly optimized JIT’ed code based upon the data and infers likely usage patterns based on the profile data collected by the interpreter. Given the dynamic nature of JavaScript code, if the code gets executed in a way that breaks the profile assumptions, the JIT’ed code “bails out” to the interpreter where the slower bytecode execution restarts while continuing to collect more profile data. To strike a balance between the amounts of time spent JIT’ing the code vs. the memory footprint of the process, instead of JIT compiling a function every time a bailout happens, Chakra utilizes the stored JIT’ed code for a function or loop body until the time bailouts become excessive and exceed a specific threshold, which forces the code to be re-JIT’ed and the old JIT code to be discarded.

Figure 1 – Chakra’s JavaScript execution pipeline in IE11

Improved Startup Performance: Streamlined execution pipeline

Simple JIT: A new JIT compiling tier

Starting with Windows 10 Technical Preview, Chakra now has an additional JIT compilation tier called Simple JIT, which comes into play in-between the switch over from executing a function in the interpreter to executing the highly optimized JIT code, when the compiled code is ready. As its name implies, Simple JIT avoids generating code with complex optimizations, which is dependent on profile data collection by the interpreter. In most cases, the time to compile the code by the Simple JIT is much smaller than the time needed to compile highly optimized JIT code by the Full JIT compiler. Having a Simple JIT enables Chakra to achieve a faster switchover from bytecode to simple JIT’ed code, which in turn helps Chakra deliver a faster startup for apps and sites. Once the optimized JIT code is generated, Chakra then switches over code execution from the simple JIT’ed code version to the fully optimized JIT’ed code version. The other inherent advantage of having a Simple JIT tier is that in case a bailout happens, the function execution can utilize the faster switchover from interpreter to Simple JIT, till the time the fully optimized re-JIT’ed code is available.

The Simple JIT compiler is essentially a less optimizing version of Chakra’s Full JIT compiler. Similar to Chakra’s Full JIT compiler, the Simple JIT compiler also executes on the concurrent background JIT thread, which is now shared between both JIT compilers. One of the key difference between the two JIT execution tiers is that unlike executing optimized JIT code, the simple JIT’ed code execution pipeline continues to collect profile data which is used by the Full JIT compiler to generate optimized JIT’ed code.

Figure 2 – Chakra’s new Simple JIT tier

Multiple Background JITs: Hardware accelerating your JavaScript

Today, the browser and Web applications are used on a multitude of device configurations – be it phones, tablets, 2-in-1s, PCs or Xbox. While some of these device configurations restrict the availability of the hardware resources, applications running on top of beefy systems often fail to utilize the full power of the underlying hardware. Since inception in IE9, Chakra has used one parallel background thread for JIT compilation. Starting with Windows 10 Technical Preview, Chakra is now even more aware of the hardware it is running on. Whenever Chakra determines that it is running on a potentially underutilized hardware, Chakra now has the ability to spawn multiple concurrent background threads for JIT compilation. For cases where more than one concurrent background JIT thread is spawned, Chakra’s JIT compilation payload for both the Simple JIT and the Full JIT is split and queued for compilation across multiple JIT threads. This architectural change to Chakra’s execution pipeline helps reduce the overall JIT compilation latency – in turn making the switch over from the slower interpreted code to a simple or fully optimized version of JIT’ed code substantially faster at times. This change enables the TypeScript compiler to now run up to 30% faster in Chakra.

Figure 3 – Simple and full JIT compilation, along with garbage collection is performed on multiple background threads, when available

Fast JavaScript Execution: JIT compiler optimizations

Previewing Equivalent Object Type Specialization

The internal representation of an object’s property layout in Chakra is known as a “Type.” Based on the number of properties and layout of an object, Chakra creates either a Fast Type or a Slower Property Bag Type for each different object layout encountered during script execution. As properties are added to an object, its layout changes and a new type is created to represent the updated object layout. Most objects, which have the exact same property layout, share the same internal Fast Type.

Figure 4 – Illustration of Chakra’s internal object types

Despite having different property values, objects `o1` and `o2` in the above example share the same type (Type1) because they have the same properties in the same order, while objects `o3` and ‘o4’ have a different types (Type2 and Type3 respectively) because their layout is not exactly similar to that of `o1` or `o2`.

To improve the performance of repeat property lookups for an internal Fast Type at a given call site, Chakra creates inline caches for the Fast Type to associate a property name with its associated slot in the layout. This enables Chakra to directly access the property slot, when a known object type comes repetitively at a call site. While executing code, if Chakra encounters an object of a different type than what is stored in the inline cache, an inline cache “miss” occurs. When a monomorphic inline cache (one which stores info for only a single type) miss occurs, Chakra needs to find the location of the property by accessing a property dictionary on the new type. This path is slower than getting the location from the inline cache when a match occurs. In IE11, Chakra delivered several type system enhancements, including the ability to create polymorphic inline caches for a given property access. Polymorphic caches provide the ability to store the type information of more than one Fast Type at a given call site, such that if multiple object types come repetitively to a call site, they continue to perform fast by utilizing the property slot information from the inlined cache for that type. The code snippet below is a simplified example that shows polymorphic inline caches in action.

Despite the speedup provided by polymorphic inline caches for multiple types, from a performance perspective, polymorphic caches are somewhat slower than monomorphic (or a single) type cache, as the compiler needs to do a hash lookup for a type match for every access. In Windows 10 Technical Preview, Chakra introduces a new JIT optimization called “Equivalent Object Type Specialization,” which builds on top of “Object Type Specialization” that Chakra has supported since IE10. Object Type Specialization allows the JIT to eliminate redundant type checks against the inline cache when there are multiple property accesses to the same object. Instead of checking the cache for each access, the type is checked only for the first one. If it does not match, a bailout occurs. If it does match, Chakra does not check the type for other accesses as long as it can prove that the type of the object can’t be changed between the accesses. This enables properties to be accessed directly from the slot location that was stored in the profile data for the given type. Equivalent Object Type Specialization extends this concept to multiple types and enables Chakra to directly access property values from the slots, as long as the relative slot location of the properties protected by the given type check matches for all the given types.

The following example showcases how this optimization kicks in for the same code sample as above, but improves the performance of such coding patterns by over 20%.

Code Inlining Enhancements

One of the key optimizations supported by JIT compilers is function inlining. Function inlining occurs when the function body of a called function is inserted, or inlined, into the caller’s body as if it was part of the caller’s source, thereby saving the overhead of function invocation and return (register saving and restore). For dynamic languages like JavaScript, inlining needs to verify that the correct function was inlined, which requires additional tracking to preserve the parameters and allow for stack walks in case code in the inlinee uses .caller or .argument. In Windows 10 Technical Preview, Chakra has eliminated the inlining overhead for most cases by using static data to avoid the dynamic overhead. This provides up to 10% performance boost in certain cases and makes the code generated by Chakra’s optimizing JIT compiler comparable to that of manually inlined code in terms of code execution performance.

The code snippet below is a simplified example of inlining. When the doSomething function is called repetitively, Chakra’s optimizing Full JIT compiler eliminates the call to add by inlining the add function into doSomething. Of course, this inlining doesn’t happen at the JavaScript source level as shown below, but rather happens in the machine code generated by the JIT compiler.

JIT compilers need to strike a balance to inlining. Inlining too much increases the memory overhead, in part from pressure on the register allocator as well as JIT compiler itself because a new copy of the inline function needs to be created in each place it is called. Inlining too little could lead to overall slower performance of the code. Chakra uses several heuristics to make inlining decisions based on data points like the bytecode size of a function, location of a function (leaf or non-leaf function) etc. For example, the smaller a function, the better chance it has of being inlined.

Inlining ECMAScript5 (ES5) Getter and Setter APIs: Enabling and supporting performance optimizations for the latest additions to JavaScript language specifications (ES5, ES6 and beyond) helps ensure that the new language features are more widely adopted by developers. In IE11, Chakra added support for inlining ES5 property getters. Windows 10 Technical Preview extends this by enabling support for inlining of ES5 property setters. In the simplified example below, you can visualize how the getter and setter property functions of o.x are now inlined by Chakra, boosting their performance by 5-10% in specific cases.

call() and apply() inlining: call() and apply() JavaScript methods are used extensively in real world code and JS frameworks like jQuery, at times to create mixins that help with code reuse. The call and apply methods of a target function allow setting the this binding as well as passing it arguments. The call method accepts target function arguments individually, e.g. foo.call(thisArg, arg1, arg2) , while apply accepts them as a collection of arguments, e.g. foo.apply(thisArg, [arg1, arg2]). In IE11, Chakra added support for function call/apply inlining. In the simplified example below, the call invocation is converted into a straight (inlined) function call.

In Windows 10 Technical Preview, Chakra takes this a step further and now inlines the call/apply target. The simplified example below illustrates this optimization, and in some of the patterns we tested, this optimization improves the performance by over 15%.

Auto-typed Array Optimizations

Starting with IE11, Chakra’s backing store and heuristics were optimized to treat numeric JavaScript arrays as typed arrays. This enabled JavaScript arrays with either all integers or all floats, to be auto-detected by Chakra and encoded internally as typed arrays. This change enabled Chakra to optimize array loads, avoid integer tagging and avoid boxing of float values for arrays of such type. In Windows 10 Technical Preview, Chakra further enhances the performance of such arrays by adding optimizations to hoist out array bounds checks, speed up internal array memory loads, and length loads. Apart from usage in real Web, many JS utility libraries that loop over arrays like Lo-Dash and Underscore benefit from these optimizations. In specific patterns we tested, Chakra now performs 40% faster when operating on such arrays.

The code sample below illustrates the hoisting of bounds checks and length loads that is now done by Chakra, to speed up array operations.

Chakra’s bounds check elimination handles many of the typical array access cases:

And the bounds check elimination works outside of loops as well:

Better UI Responsiveness: Garbage Collection Improvements

Chakra has a mark and sweep garbage collector that supports concurrent and partial collections. In Windows 10 Technical Preview, Chakra continues to build upon the GC improvements in IE11 by pushing even more work to the dedicated background GC thread. In IE11, when a full concurrent GC was initiated, Chakra’s background GC would perform an initial marking pass, rescan to find objects that were modified by main thread execution while the background GC thread was marking, and perform a second marking pass to mark objects found during the rescan. Once the second marking pass was complete, the main thread was stopped and a final rescan and marking pass was performed, followed by a sweep performed mostly by the background GC thread to find unreachable objects and add them back to the allocation pool.

Figure 5 – Full concurrent GC life cycle in IE11

In IE11, the final mark pass was performed only on the main thread and could cause delays if there were lots of objects to mark. Those delays contributed to dropped frames or animation stuttering in some cases. In Windows 10 Technical Preview, this final mark pass is now split between the main thread and the dedicated GC thread to reduce the main thread execution pauses even further. With this change, the time that Chakra’s GC spends in the final mark phase on the main thread has reduced up to 48%.

Figure 6 – Full concurrent GC life cycle in Windows 10 Technical Preview

Summary

We are excited to preview the above performance optimizations in Chakra in the Windows 10 Technical Preview. Without making any changes to your app or site code, these optimizations will allow your apps and sites to have a faster startup by utilizing Chakra’s streamlined execution pipeline that now supports a Simple JIT and multiple background JIT threads, have increased execution throughput by utilizing Chakra’s more efficient type, inlining and auto-typed array optimizations, and have improved UI responsiveness due to a more parallelized GC. Given that performance is a never ending pursuit, we remain committed to refining these optimizations, improving Chakra’s performance further and are excited for what’s to come. Stay tuned for more updates and insights as we continue to make progress. In the meanwhile, if you have any feedback on the above or anything related to Chakra, please drop us a note or simply reach out on Twitter @IEDevChat, or on Connect.

— John-David Dalton, Gaurav Seth, Louis Lafreniere

Chakra Team