This post is about hosting FTP in a Linux VM on Windows Azure. And here's a spoiler alert: the catch is that you may need to lower the TCP keepalive time in the Linux kernel to support very long FTP transfers through the Azure load balancer. But I'll get to that.
A few weeks ago, a customer needed to run their FTP server on Windows Azure. Being familiar with Linux and having a pretty complex proftpd configuration, the customer decided to keep this all on Linux.
So let's recall again what's so special about FTP:
- FTP uses two connections, a control connection that you use for sending commands to a server and a data connection that gets set up whenever there is data to be transferred.
- FTP has two ways to set up such a data connection: active and passive. In passive mode, the client opens a second connection to the same server but on a different port. In active mode, the client creates a listening port, then the server opens a connection to this port on the client.
- In a world of client systems behind firewalls and NAT devices, active mode almost inevitably fails, since hardly any client can still open a listening port that is reachable from the public internet.
- Luckily, most off-the-shelf FTP clients, including the ones in web browsers, default to passive mode.
- There are some funny things you can do with FTP, e.g. FXP, where one FTP server in active mode transfers directly to another FTP server in passive mode.
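To make the passive-mode handshake a bit more concrete: the server answers the client's PASV command with a 227 reply that carries the address and port for the data connection as six decimal numbers, where the port is the fifth number times 256 plus the sixth. Here is a minimal shell sketch that decodes such a reply. The function name parse_pasv and the sample reply are made up for illustration.

```shell
# Decode a passive-mode reply such as
#   227 Entering Passive Mode (10,0,0,4,78,32)
# and print "host port". The port is p1*256 + p2.
# parse_pasv is a hypothetical helper name, not part of any FTP tool.
parse_pasv() {
  # Extract the six comma-separated numbers between the parentheses.
  nums=$(printf '%s\n' "$1" | sed -n 's/.*(\([0-9]*,[0-9]*,[0-9]*,[0-9]*,[0-9]*,[0-9]*\)).*/\1/p')
  [ -n "$nums" ] || return 1
  # Split on commas into the positional parameters $1..$6.
  old_ifs=$IFS
  IFS=,
  set -- $nums
  IFS=$old_ifs
  echo "$1.$2.$3.$4 $(( $5 * 256 + $6 ))"
}

parse_pasv "227 Entering Passive Mode (10,0,0,4,78,32)"
```

Note that the address in the sample reply is a private one. That is exactly what a client would get from a server that only knows its internal address.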
And recall what's special about Windows Azure networking:
- Every connection from the outside to an Azure VM goes through a cloud service endpoint. There are no "exposed hosts".
So in order to have the "passive" data connections reach their destination, one has to configure a bunch of endpoints in the Azure configuration and then tell the FTP server to use these endpoints for incoming data connections. One could configure each of those endpoints manually through the Windows Azure portal, but that's time-consuming and error-prone. So let's use a script to do that... (I'm using the Linux command line tools from http://www.windowsazure.com/en-us/downloads/ )
$ azure vm endpoint create contosoftp 21
$ for ((i=20000;i<20020;i++)); do azure vm endpoint create contosoftp $i; done
This creates 20 endpoints for the FTP data connections and the usual port 21 endpoint for the control connection.
Now we need to tell proftpd (or any other FTP daemon of your choice) to use exactly this port range when opening data connection listening sockets.
PassivePorts 20000 20019
As you may know, Windows Azure VMs use local IP addresses that are not public. In order to tell the client what IP address to talk to when opening the data connection, the FTP server needs to know its external, public IP address, i.e., the address of its endpoint. Proftpd has all the required functionality; it just needs to be enabled via the MasqueradeAddress directive.
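For illustration, here is a small shell sketch that prints the two proftpd directives together. The name contosoftp.cloudapp.net is an assumption (it presumes the cloud service carries the same name as the VM in the commands above), so substitute your own public DNS name or IP address.

```shell
# Assumption: the public DNS name of the cloud service. Replace with your own.
public_name="contosoftp.cloudapp.net"

# Build the configuration fragment; lines starting with '#' are proftpd comments.
proftpd_fragment=$(cat <<EOF
# Address that proftpd announces to clients in passive-mode replies
MasqueradeAddress ${public_name}
# Must match the data-connection endpoints opened on the Azure side
PassivePorts 20000 20019
EOF
)

printf '%s\n' "$proftpd_fragment"
```

The printed lines would go into your proftpd configuration file, followed by a restart of the daemon.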
And that's it.
The customer used this configuration, but once in a while one of their own users reported that a very long-running FTP transfer would not go through but break because of a "closed control connection".
After thinking a bit, we suspected a side effect of the Windows Azure load balancer that manages the endpoints. When the load balancer does not see traffic for a while (at the earliest after about 60 seconds), it may "forget about" an established TCP connection. In our case, the control connection was idle while the data connection was happily pumping data.
Luckily, there's a socket option called TCP keepalive (SO_KEEPALIVE) which makes idle but open connections send periodic probe packets, so that everything on the network path sees that the connection is still in use. And proftpd (from version 1.3.5rc1 on) supports a "SocketOptions keepalive on" directive to enable this behavior on its connections. Great!
But even enabling this didn't solve the issue, since there is a default in the Linux kernel for when these keepalive packets are first sent:
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
OK, that's 7200 seconds which is two hours. That's a bit long for our load balancer.
# sysctl -w net.ipv4.tcp_keepalive_time=60
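That's better. But remember this is a runtime setting in the Linux kernel, so in order for it to survive a reboot, add it to /etc/sysctl.conf (or a file under /etc/sysctl.d/ where your distribution supports that), or put the command into a convenient place in /etc/rc*.d/.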
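One common way to make the setting persistent is a sysctl configuration file. Here is a minimal sketch, assuming your distribution reads files from /etc/sysctl.d/ at boot. The helper name write_keepalive_conf and the file name in the usage comment are made up for illustration.

```shell
# Hypothetical helper: write a sysctl file that sets the keepalive time.
#   $1 = target file
#   $2 = seconds of idle time before the first keepalive probe is sent
write_keepalive_conf() {
  printf 'net.ipv4.tcp_keepalive_time = %s\n' "$2" > "$1"
}

# Usage (as root); the file name is an assumption:
#   write_keepalive_conf /etc/sysctl.d/60-tcp-keepalive.conf 60
#   sysctl -p /etc/sysctl.d/60-tcp-keepalive.conf
```

After a reboot, "sysctl net.ipv4.tcp_keepalive_time" should report the new value. The setting affects every socket on the machine that enables keepalive, not just those of proftpd.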
Hope this helps,