How CPU cycles are used when encoding

Once in a while on our forum, somebody asks why CPU usage isn't at 100% during offline encoding. I recently took the time to write a thorough answer, which I think is worth posting here as well.

So first, let's explore what happens when encoding a media file with Expression Encoder. The encoding process involves five phases:

  1. Reading the source(s) from storage.
  2. Demuxing and decoding the source(s) into uncompressed frames.
  3. Pre-processing the frames (deinterlacing, resizing, cropping, etc).
  4. Encoding and muxing the frames.
  5. Writing the stream to storage.

If any one of these phases is slower than the others, the whole encoding pipeline suffers, because that phase bottlenecks the flow of frames through it.
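To make that concrete, here is a toy calculation; the stage rates are made-up numbers for illustration, not Encoder measurements:

```python
# Toy illustration: a pipeline runs at the speed of its slowest stage.
# All frames/sec figures below are invented for the example.
stage_fps = {
    "read": 500,         # frames/sec the source storage can feed
    "decode": 120,       # frames/sec the source codec can decode
    "pre-process": 300,  # frames/sec for deinterlace/resize/crop
    "encode": 200,       # frames/sec the encoder can compress
    "write": 800,        # frames/sec the output storage can absorb
}

bottleneck = min(stage_fps, key=stage_fps.get)
print(f"Pipeline throughput: {stage_fps[bottleneck]} fps, limited by '{bottleneck}'")
# Here decoding caps the whole job at 120 fps, so the encode stage
# (and therefore the CPU) sits partially idle waiting for frames.
```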

#1 and #5 are usually the bottleneck when the source files are either not local (e.g., on a slow network share) or are very high-bandwidth and sitting on slow storage. Possible solutions: copy the source files to your fastest available local storage, and write the output to a different local drive; both reduce the bottleneck in these two areas.
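If you script your encodes, staging the source locally first is easy to automate. Here is a minimal sketch; the paths are examples, and the encode command is a placeholder for whatever tool you use, not an actual Expression Encoder command line:

```python
import shutil
import subprocess
from pathlib import Path

# Example paths -- adjust to your environment.
SOURCE = Path(r"\\slow-share\media\input.avi")   # remote source
FAST_LOCAL = Path(r"C:\encode-temp")             # fastest local disk
OUTPUT = Path(r"D:\encoded")                     # a *different* local disk

FAST_LOCAL.mkdir(parents=True, exist_ok=True)
OUTPUT.mkdir(parents=True, exist_ok=True)

# Stage the source locally so phase #1 reads from fast storage.
local_source = FAST_LOCAL / SOURCE.name
shutil.copy2(SOURCE, local_source)

# Encode from the local copy, writing to a separate disk so reads
# and writes (phases #1 and #5) don't compete for the same device.
# "my-encoder" is a placeholder command, not an Encoder CLI.
subprocess.run(["my-encoder", str(local_source), "-o", str(OUTPUT)], check=True)
```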

#2 usually becomes a bottleneck because of the type of sources and the codecs used to decode them. Obviously, complex HD sources use significantly more resources than simple lower-resolution sources. Some third-party codecs are extremely slow, running on only one core. Testing the other codecs available on the PC by disabling some of the codecs listed in the "Tools -> Options -> Compatibility" dialog may help reduce the bottleneck. It's also worth noting that because many codecs are single-threaded, faster cores can help enormously here: a single-threaded decode runs at the speed of one core, which is the main reason a 3.4 GHz 4-core PC is faster than a 2.6 GHz 8-core PC in many cases.

#3 can become the bottleneck when unneeded cropping/rescaling is applied, or when "SuperSampling" resizing and/or "Auto Pixel Adaptive" deinterlacing are used. The last two are our defaults; they were the best choice for "high quality" encodes when we shipped v4 RTM over 18 months ago, and we unfortunately can't change them until our next major version, since doing so would be a breaking SDK change. For a better balance between performance and quality, we highly recommend "Bicubic" resizing and "Auto selective blend" deinterlacing. Both perform 2-3x faster than the defaults and reduce the chances of the pre-processing phase being the bottleneck. Of course, if at all possible, removing the need for a resize altogether (i.e., using the same frame size and pixel aspect ratio as the source) will reduce the load on this phase even further while preserving better output quality.

This leaves phase #4, where most of the CPU cycles should be spent if there are no bottlenecks elsewhere in the pipeline. Ideally, this phase should be your bottleneck, which would very likely max out CPU usage on an 8-core PC for a single-stream encode. Depending on the encode settings, 100% may still not be reached, but usage should be quite high. Here are a few ideas to make this phase faster:

  1. Make sure to use the defaults for number of threads and slices used to encode (where applicable).
  2. If speed is most important, consider applying our "Fastest" Quality preset, which will significantly speed up the encoding process at the cost of some output quality.
  3. Use proper encode settings. For example, trying to encode 480p content at 300 kbps stresses the encode phase and can slow it to a crawl. We recommend about 1.5+ Mbps for 480p and about 3+ Mbps for 720p. Consider reducing the frame size and/or frame rate if you are trying to achieve significantly lower output bitrates than this (see the bits-per-pixel sketch right after this list).
  4. If GPU encoding is enabled on a slow GPU when encoding to H.264, it could make the GPU the bottleneck. Solution: disable GPU encoding or get a better GPU.
  5. On larger PCs (16 cores and above), a single-stream encode can't maximize CPU usage. Parallel encoding is an option to make full use of the PC's resources. On a NUMA-enabled system, setting process affinity is highly recommended to ensure memory is not shared across CPUs between the different encoding processes, which would reduce performance (see the parallel-encoding sketch further below).
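To sanity-check a bitrate target against a frame size and frame rate (tip #3 above), a handy back-of-the-envelope metric is bits per pixel: bitrate ÷ (width × height × frames per second). Here is a minimal sketch; the 0.05 "starved" threshold is a common rule of thumb, not an Encoder specification:

```python
# Rough bits-per-pixel sanity check for a bitrate target.
def bits_per_pixel(bitrate_bps: float, width: int, height: int, fps: float) -> float:
    return bitrate_bps / (width * height * fps)

for label, (bitrate, w, h, fps) in {
    "480p @ 300 kbps (too low)":     (300_000, 640, 480, 30),
    "480p @ 1.5 Mbps (recommended)": (1_500_000, 640, 480, 30),
    "720p @ 3 Mbps (recommended)":   (3_000_000, 1280, 720, 30),
}.items():
    bpp = bits_per_pixel(bitrate, w, h, fps)
    verdict = "starved -- reduce frame size/rate" if bpp < 0.05 else "reasonable"
    print(f"{label}: {bpp:.3f} bpp -> {verdict}")
```

480p at 300 kbps works out to roughly 0.03 bits per pixel, far below what the encoder can realistically work with, which is why dropping the frame size or frame rate helps at very low bitrates.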

In the case where nothing seems to help maximize CPU usage, you could also consider running 2-4 jobs in parallel in separate encoding processes. Using the UI, simply start 2-4 Encoder instances and run the encodes in parallel, bearing in mind that enough system memory must be available. This certainly won't help with a storage bottleneck (i.e., phases #1 and/or #5), but it should greatly speed up the process of encoding multiple jobs.
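If you drive your encodes from scripts rather than the UI, the same idea, including the process-affinity suggestion from tip #5, can be sketched as follows. The encoder command lines are placeholders, and the psutil package (installed separately) is an assumption used here only to set CPU affinity:

```python
import subprocess
import psutil  # pip install psutil; used only for CPU affinity

# Placeholder commands -- substitute your real encoder invocations.
jobs = [
    ["my-encoder", r"C:\src\a.avi", "-o", r"D:\out\a.mp4"],
    ["my-encoder", r"C:\src\b.avi", "-o", r"D:\out\b.mp4"],
]

# Pin each process to its own group of cores so that, on a NUMA system,
# a job's memory stays close to the CPU that runs it.
cores = list(range(psutil.cpu_count(logical=True)))
per_job = max(1, len(cores) // len(jobs))

procs = []
for i, cmd in enumerate(jobs):
    p = subprocess.Popen(cmd)
    psutil.Process(p.pid).cpu_affinity(cores[i * per_job:(i + 1) * per_job])
    procs.append(p)

for p in procs:
    p.wait()  # block until all parallel encodes finish
```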

Finally, if encoding to H.264, GPU encoding can drastically speed up the encoding phase. CUDA is supported in Encoder 4 Pro SP1, and Intel QSV (Sandy Bridge) will be supported in our next SP release, coming soon. With the proper hardware, either (or even both) options can cut encoding time by more than half, and in some cases significantly more.

Hopefully, this gives you some insight into how to isolate and resolve encoding performance issues.