WSL System Calls


This is the third in a series of blog posts on the Windows Subsystem for Linux (WSL). For background information you may want to read the architectural overview and introduction to pico processes.

Posted on behalf of Stephen Hufnagel.

System calls

WSL executes unmodified Linux ELF64 binaries by emulating a Linux kernel interface on top of the Windows NT kernel. One of the kernel interfaces that it exposes are system calls (syscalls). This post will dive into how syscalls are handled in WSL.

Overview

A syscall is a service provided by the kernel that can be called from user mode which typically handle device access requests or other privileged operations. As an example, a nodejs webserver would use syscalls to access files on disk, handle network requests, create processes\threads, and other operations.

How a syscall is made depends on the operating system and the processor architecture, but it comes down to having an explicit contract between user mode and kernel mode called an Application Binary Interface (ABI). For most cases, making a syscall breaks down into 3 steps:

  1. Marshall parameters – user mode puts the syscall parameters and number at locations defined by the ABI.
  2. Special instruction – user mode uses a special processor instruction to transition to kernel mode for the syscall.
  3. Handle the return – after the syscall is serviced, the kernel uses a special processor instruction to return to user mode and user mode checks the return value.

While the Linux kernel and Windows NT kernel follow the steps above, they differ in ABI so they are not compatible. Even if the Linux kernel and Windows NT kernel had the same ABI, they expose different syscalls that do not always map one to one. For example, the Linux kernel includes things like fork, open, and kill while the Windows NT kernel has the comparable NtCreateProcess, NtOpenFile, and NtTerminateProcess. The following sections will go into the specifics of making syscalls in the different environments.

Syscall mechanics on Linux x86_64

The calling convention for syscalls on Linux x86_64 follow the System V x86_64 ABI defined here. As an example, one way to call the getdents64 syscall directly from a C program would be to use the syscall wrapper:

Result = syscall(__NR_getdents64, Fd, Buffer, sizeof(Buffer));

To discuss the syscall calling convention it’s easiest to translate the line of code into pseudo assembly to show the 3 steps above:

  1. mov rax, __NR_getdents64
  2. mov rdi, Fd
  3. mov rsi, Buffer
  4. mov rdx, sizeof(Buffer)
  5. syscall
  6. cmp rax, 0xFFFFFFFFFFFFF001

First, let’s look at how the parameters are marshaled. Steps 1-4 move the syscall parameters into the registers specified by the calling convention. Then on step 5 the special syscall instruction is made to transition to kernel mode. Finally, on step 6, the return value is checked by user mode.

Between steps 5 and 6 is where the Linux kernel handles the getdents syscall. When the syscall instruction is invoked, the processor performs a ring transition to kernel mode and starts executing at a specific function in kernel mode typically called the syscall dispatcher. As part of kernel initialization when the machine is booted, the kernel configures the processor to execute the syscall dispatcher with a specific environment when the syscall instruction is invoked. The first thing the Linux kernel will do in the syscall dispatcher is save off the user mode thread’s register state according to the ABI so that when the syscall returns the user mode program can continue executing with the expected register context. Then it will inspect the rax register to determine which syscall to perform and pass the registers off to the getdents syscall as parameters. Once the getdents syscall completes, the kernel restores the user mode register state, updates rax to contain the return value of the syscall, and uses another special instruction (usually sysret or iretq) that informs the processor to perform the ring transition back to user mode.

Syscall mechanics on NT amd64

The calling convention for syscalls on NT x64 follows the x64 calling convention described here. As an example, when the NtQueryDirectoryFile call is made we’ll have this simplified version of the call so it’s easier to compare against the getdents call above:

Status = NtQueryDirectoryFile(Foo, Bar, Baz);

The real NtQueryDirectoryFile API takes 11 parameters which would be strenuous for this example since it requires pushing arguments to the stack. Now let’s look at the disassembly side by side with the getdents case to see how the three steps are handled on NT compared to Linux:

Step Getdents NtQueryDirectoryFile
1 mov rax, __NR_getdents64 mov rax, #NtQueryDirectoryFile
2 mov rdi, Fd mov rcx, Foo
3 mov rsi, Buffer mov rdx, Bar
4 mov rdx, sizeof(Buffer) mov r8, Baz
5 syscall syscall
6 cmp rax, 0xFFFFFFFFFFFFF001 test eax, eax

 

For the marshalling parameters step, we see that NT also uses rax to hold the syscall number but differs in the registers used to pass the syscall parameters because it has a different ABI. For the special instruction step, we see syscall is also used since it is the preferred method on x64 for making syscalls. For the final step of checking the return, the code is slightly different because NTSTATUS failures are negative values whereas Linux failure codes fall into a specific range.

Just like with the Linux kernel, between steps 5 and 6 is where the NT kernel handles the NtQueryDirectoryFile syscall. This part is pretty much identical to Linux with some small exceptions around the ABI related to which user mode registers are saved and which registers are passed to NtQueryDirectoryFile. Since NT syscalls follow the x64 calling convention, the kernel does not need to save off volatile registers since that was handled by the compiler emitting instructions before the syscall to save off any volatile registers that needed to be preserved.

Syscall mechanics on WSL

After looking at how syscalls are made on NT compared to Linux, we can see that there are only a few minor differences around the calling convention. Next we’ll take a look at how syscalls are made on WSL.

The calling convention for syscalls on WSL follows the Linux description above since unmodified Linux ELF64 binaries are being executed. WSL includes kernel mode pico drivers (lxss.sys and lxcore.sys) that are responsible for handling Linux syscall requests in coordination with the NT kernel. The drivers do not contain code from the Linux kernel but are instead a clean room implementation of Linux-compatible kernel interfaces. Following the original getdents example, when the syscall instruction is made the NT kernel detects that the request came from a pico process by checking state in the process structure. Since the NT kernel does not know how to handle syscalls from pico processes it saves off the register state and forwards the request to the pico driver. The pico driver determines which Linux syscall is being invoked by inspecting the rax register and passes the parameters using the registers defined by the Linux syscall calling convention. After the pico driver has handled the syscall, it returns to NT which restores the register state, places the return value in rax, and invokes the sysret\iretq instruction to return to user mode.

syscall_graphic

WSL syscall examples

Where possible, lxss.sys translates the Linux syscall to the equivalent Windows NT call which in turn does the heavy lifting. Where there is no reasonable mapping lxss.sys must service the request directly. This section will discuss a few syscalls implemented by WSL and the interaction with NT.

The Linux sched_yield syscall is an example that maps one to one with a NT syscall. When a Linux program makes the sched_yield syscall on WSL, the NT kernel hands the request off to lxss.sys which forwards the request directly to ZwYieldExecution which has similar semantics to the sched_yield syscall.

While sched_yield is an example of a syscall that maps nicely to an existing NT syscall, not all syscalls have the same properties even when there is similar functionality on NT. Linux pipes are a good example of this case since NT also has support for pipes. However, the semantics of Linux pipes are different enough from NT pipes, that WSL could not use NT pipes to get a fully featured Linux pipe implementation. Instead, WSL implements Linux pipes directly but still uses NT functionality for primitives like synchronization and data structures.

As a final example, the Linux fork syscall has no documented equivalent for Windows. When a fork syscall is made on WSL, lxss.sys does some of the initial work to prepare for copying the process. It then calls internal NT APIs to create the process with the correct semantics and create a thread in the process with an identical register context. Finally, it does some additional work to complete copying the process and resumes the new process so it can begin executing.

Conclusion

WSL handles Linux syscalls by coordinating between the NT kernel and a pico driver which contains a clean room implementation of the Linux syscall interface. As of this article, lxss.sys has ~235 of the Linux syscalls implemented with varying level of support. This support will continue to improve over time especially with the great feedback we get from the community.

 

Stephen Hufnagel and Seth Juarez explore how WSL redirects system calls.


Comments (37)

  1. Yuhong Bao says:

    I wonder what is the plan for Linux 32-bit binaries when that is supported.

    1. Stephen Hufnagel says:

      No official plans at this point, but we've looked at supporting x86\i386 and x32 emulation on 64bit and native x86\i386. This has come up before from the github community so please give us feedback on the user voice page so we can prioritize - https://wpdev.uservoice.com/forums/266908-command-prompt-console-bash-on-ubuntu-on-windo/suggestions/13377507-please-add-32-bit-elf-support-to-the-kernel.

  2. Alp says:

    Please start explaining what WSL is first for people coming from Hacker News etc.

    1. Stephen Hufnagel says:

      Thanks for the feedback, there are two previously posted entries and the first gives an explanation of WSL:

      https://blogs.msdn.microsoft.com/wsl/2016/04/22/windows-subsystem-for-linux-overview/
      https://blogs.msdn.microsoft.com/wsl/2016/05/23/pico-process-overview/

      1. Alp says:

        Great, I already know it though. My comment was mostly about, when someone outside Microsoft ecosystem clicks on article, you start talking about WSL, but they probably do not know what it is, so if you start it like "WSL (Windows Subsystem for Linux)" they won't be as confused and start looking around what WSL might be. 🙂

        1. Jeroen says:

          Says on top: "Windows Subsystem for Linux
          The underlying technology enabling the Windows Subsystem for Linux"

          Then first sentence links to https://blogs.msdn.microsoft.com/wsl/2016/04/22/windows-subsystem-for-linux-overview/

          I'm outside of Microsoft ecosystem and its perfectly clear what this provides. If I were to describe it to a laymen I'd describe it as something WSL is for Windows (in regards to Linux-x64) akin to what WINE is for *NIX. (I'm sure that's technically inaccurate as a comparison though.)

  3. Milosz Tanski says:

    I'm very curious about some details. Just some minutia that's interesting to me.

    1. How does WSL handle vDSO?
    2. How does WSL emulate futex?
    3. How is fork() and clone() implemented on WSL? Traditional I know Windows didn't have fork() semantics for processes. Additionally, clone() has a complex list of flags that dictate how the process is created.

    1. Stephen Hufnagel says:

      Great questions:

      For #1, check out https://blogs.msdn.microsoft.com/wsl/2016/05/23/pico-process-overview/ for a brief summary. In short, lxcore\lxss loads in a vdso that we generated from scratch during process creation.

      For #2, futex is similar to the Linux pipe discussion above since it doesn't map nicely to existing NT syscalls . For win32 programs, futex is similar to the WaitOnAddress APIs but the semantics are different enough from Linux that a custom implementation was needed.

      For #3, there is a discussion of both fork and clone in the video near the end.

  4. Frank Ch. Eigler says:

    Any chance of making this translation layer open-source?

    1. Stephen Hufnagel says:

      This is one of the top requests from the community, https://wpdev.uservoice.com/forums/266908-command-prompt-console-bash-on-ubuntu-on-windo, please give us feedback if you haven't already.

    2. John Doe says:

      The real issue isn't lxss.sys being "open source" -- the real issue is whether somebody could make changes, rebuild it, and use their version instead. I seriously doubt MS would allow this. Even if you can build a new lxss.sys, if it's not properly signed, Windows won't load it. The kernel mode code signing requirements in Windows 10 are difficult for commercial developers to navigate, and would be almost impossible for individual hackers.

  5. Zac says:

    Was this idea inspired by the previous Windows Services for UNIX (https://en.wikipedia.org/wiki/Windows_Services_for_UNIX)? Was any of the previous stuff re-used?

    1. I was gonna say, pretty sure NT does support fork, because that's how SUA/WSU supported fork. Also discussed at http://stackoverflow.com/a/985525/58074

    2. Steven Edwards says:

      I wish they had Open Sourced Services for Unix rather than killed it and added the Linux syscall support there.

      There was a vibrant community that supported ports of those applications. I can't recall the site now but at one point in time they had almost the completely same collection as Cygwin ported over, except of course those applications ran better because Cygwin has to emulate fork, thread and pipe semantics on top of Win32 while SFUs psx subsystem was able to make a 'native' implementation.

      It would have been possible to add the Linux syscall support to that subsystem which would have allowed the existing SUA/SFU community to grow. I am one of the people that have run Windows with a NT subsystem going all the way back to Softway Systems OpenNT and Interix but after being burned by Microsoft too many times, I won't play the game anymore.

      Yes it is technically interesting, and yes I have played with the new Linux subsystem one of my dev boxes, but I will never use it for anything serious because this is the fourth time we have been through this dance starting with OpenNT then it being killed, then moving on to SFU and it being dropped and coming back as SUA only to be dropped again.

      There was actually some really good tech for porting applications that those systems had, such as the ability in later versions to make cross-subsystem calls so one did not have to port the entire application over in one shot.

      Maybe the new Hipster kids will embrace the Linux Subsystem and keep it from happening again, but it is fairly simple these days to run vagrant to spin up an Ubuntu Virtual Box instance and have it be mostly seamless. There is really no need of it to do 'ports' any longer.

      That being said, once the Linux subsystem guys are able to make it support Docker, I think it will be useful when Visual Studio can tie in to it and be able to test/build/debug Linux Containers. The Visual Studio guys have been adding support for Windows Containers and it is pretty sweet, since you can test and debug locally and then send your newly built container right to Azure all in the same interface. If you could do that with your Linux apps as well, then the subsystem becomes of great value.

      When I can do that with a Linux container from Visual Studio running Windows using the Linux subsystem will be a real value add.

    3. Stephen Hufnagel says:

      Yes, NT has supported different subsystems in the past and the other two videos have a brief discussion:

      https://blogs.msdn.microsoft.com/wsl/2016/04/22/windows-subsystem-for-linux-overview/
      https://blogs.msdn.microsoft.com/wsl/2016/05/23/pico-process-overview/

      WSL supports running native Linux binaries directly by emulating Linux kernel interfaces, whereas the previous Linux subsystems emulated Linux user mode and required recompilation\porting. The codebase couldn't be reused directly for that reason, but it was used for reference\comparison.

  6. Ezhik says:

    I'm curious, how did LXSS come to be? Is it true that it emerged from the work done on the cancelled Project Astoria?

  7. Craig says:

    Any estimate on when WSL will be available in standard (non-insider) windows builds?

  8. AS usual, stealing users from other platforms. What about make an open integration so Linux users can do the same? OFC not. You like to steal and give "crap" in return.

  9. Jose Ricardo Ziviani says:

    Are you guys planning to give something back to the community? Like a Windows compatibility layer to let Linux users run unmodified Portable Executable (help Wine, maybe)?

  10. David Conrad says:

    Any word on when Ubuntu 16.04 will be available on Windows?

  11. Amika de Wit says:

    One of the presentations mentions that when the NT kernel receives a syscall, it first checks if the process in question is a pico process or e.g., a Win32 process. Do you happen to know some internal details behind this step? E.g., which kernel data structure is being examined?
    Thanks!

  12. Dave Kok says:

    Is there a list available of which Linux system calls are supported by the Windows Subsystem for Linux? Or are all Linux system calls supported with the initial release?

    1. Jack Hammons says:

      There is a list of currently supported syscalls available on in our MSDN Documentation. Keep in mind that all though this list currently enables most scenarios we don't guarantee that every flag or option is supported in the call.

  13. Arnout says:

    This is pretty damn cool tech! I rarely read microsoft posts/blogs to be honest as I find myself going through open-source linux stuff most of the time, but hats off to this:)
    Thanks for the detailled post!

  14. Arnout says:

    Maybe it would be interesting to also detail some differences between this technology and e.g. cygwin.
    "Cygwin in Kernelspace" might be a bit short as an explainer otherwise

  15. Adam Milner says:

    As an embedded systems dev who works in both Windows and Linux, I'm wondering about good old fashioned serial ports. Specifically, will I be able to access the Windows COM ports in /dev under WSL?

    1. WoC says:

      Maybe in later versions, but currently; no

      Linux version 3.4.0-Microsoft (Microsoft@Microsoft.com) (gcc version 4.7 (GCC) ) #1 SMP PREEMPT Wed Dec 31 14:42:53 PST 2014
      Linux WOC-PC 3.4.0+ #1 PREEMPT Thu Aug 1 17:06:05 CST 2013 x86_64 x86_64 x86_64 GNU/Linux

  16. Joel Dillon says:

    Out of curiosity, am I correct in thinking WSL can't handle static linked binaries? I'm writing a compiler for fun, which generates ELF (and PE-COFF and MachO) binaries from scratch, but when I try and run the output within WSL I get:

    root@ISIS:/mnt/d/git/Enki# ./a.out
    bash: ./a.out: cannot execute binary file: Exec format error
    root@ISIS:/mnt/d/git/Enki# file ./a.out
    ./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped

    The same binary runs fine on actual Linux, and the PE/COFF version works on actual Windows...

    (There's no glibc or any other runtime library in this either if that makes any difference, it works by doing raw syscalls)

    1. Robert Oeffner says:

      I have encountered the same problem with a statically linked executable I downloaded from elsewhere. This could be a dealbreaker if WSL is to catch on beyond command-line tools such as grep, sed and awk.

  17. Nilesh Miskin says:

    This is one of the most compelling reasons for me to adopt to Windows 10 Pro. Keep it up guys. Looking forward to a mature and robust WSL implementation.

  18. Sean Zhou says:

    I wonder how the linux ioctl() syscall is supported in lxss.sys.

  19. David McCormack says:

    You say that following the transition to Ring 0, the NT syscall dispatcher uses a magic flag in the "process structure" to decide whether to interpret the value in rax as an NT syscall number of a Linux one. My question is whether this process structure flag is modifiable by unprivileged user mode code? I guess I'm wondering if it would be possible to have a dual mode process that could make some syscalls as a Windows binary and others as a Linux one.

  20. Nico says:

    Copying everything at fork()-time is expensive; copy-on-write is expensive too. fork() is not really all that useful in a modern world, and, really, is very tough to get right.

    Much more interesting than fork() would be posix_spawn() and vfork(). posix_spawn() might require an extended CreateProcess() family function.

    With posix_spawn() you'd not need vfork(), naturally, except for compatibility reasons.

    vfork(2) could be implemented natively by the NT kernel. Or it could be emulated in user-land (with much, much care!) by creating a larval process for the subsequent execve() (so you can have its PID immediately available) that waits for instructions, the placing the caller's thread into a special mode (e.b., by setting a thread-local boolean) that records open(2), dup(2), dup2(2), close(2), (performing opens, but not the others) and other such system calls (ideally all functions in the async-signal-safe set!) and internally redirecting I/O in read(2)/write(2) and friends, then when the app does an execve(2) call, you would marshall up all those actions, pass them and any necessary file handles to the waiting child process for it to perform those recorded actions, then closing file descriptors and handles on the parent side that were never meant to be opened in the parent side. Similarly if the "child" exits: close handles opened by it, tell the waiting child to exit with the given code. When the "child" _exit()/execve() completes the thread in the parent would return to normal mode and longjump to trampoline that will "return" the child's PID. Complex, but much simpler and faster than fork(2) to a large degree. (Obviously lots of details are missing there, and it might turn out that emulating vfork(2) is ETOOHARD.)

    https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234

    1. In theory, you're right.

      In practice, there's still lot's of software out there using fork() the traditional way, for either historical or portability reasons. (Do all the BSD and MacOS/iOS variants support vfork() or posix_spawn() by now? Not to speak of propritary Unices, GNU Hurd etc...)

      Additionally, according to https://msdn.microsoft.com/en-us/commandline/wsl/release_notes, vfork() was supported from the beginning. AFAICS, glibc implements posix_spawn based on vfork() or execve() (which is also supported by WSL), depending on the flags given, so I guess it doesn't need a specific syscall, and Linux also doesn't provide one.

      1. Nico says:

        It's true that a lot of software uses fork() because... it's the easy thing to do, vfork() has been unfairly vilified over the years, and posix_spawn() is not as easy to use. Also, some software uses fork() to start multiple worker processes where they don't exec and the parent doesn't exit either, and other software has the parent exit (e.g., to detach from a tty) -- these use-cases can't use anything other than fork() (hmmm, threads + vfork() might work, but I've not tested it), sadly, though I'm proposing an avfork() that would support that without copies nor COW:

        https://gist.github.com/nicowilliams/a8a07b0fc75df05f684c23c18d7db234

        So for now you do need a fork(), even if it's slow because it does full copies.

        Does WSL support mmap() with MAP_PRIVATE? I mean, does it support COW at all? If so you have the building block to implement a Unix-style fork()... Though even so, COW is very expensive, which is why typical implementations of fork() copy the RSS and apply COW for the rest.

        And yes, vfork() and posix_spawn() support are pretty universal across Linux, *BSD, Illumos, and OS X. I imagine AIX and HP-UX also have it, but I don't care to check. NetBSD even has posix_spawn() as a proper system call, which is pretty clever.

      2. Nico says:

        Also, congrats on implementing a vfork() system call. I'm glad you did. It can make a huge difference.

Skip to main content