One common reason we go for cloud computing is the ability to scale as much as needed. This usually means an increase in the overall performance of our application. However, in certain cases we might face a decrease in performance with cloud computing. I will try to explain the most common reasons I have come across for such unexpected behaviour.
Latency
Also known as ping time, latency is the round-trip network time between the client and the server. (Azure web roles don't support ping by default, for security reasons. But it is quite easy to measure latency without adding ping support. Please check the Internet Explorer Developer Tools section at the bottom.)
Within a country, latency can be around 25 msec (milliseconds).
Latency to a global cloud provider’s closest data centre can range from 25 msec (if you are quite close geographically) to 200 msec (if you are in a remote country like Australia). For most countries we can assume latency will range from 25 to 100 msec.
This means that if we are in Australia and we have a solution that is deployed to a global cloud vendor, the latency can be 8 times higher compared to a local hoster.
Now, does this mean our solution's overall performance will be 8 times worse? Not really. It might actually be much faster, depending on the solution itself.
Let's examine some scenarios. I will take a Turkish company, ACME, as an example, where the latency to the closest Azure data centre is around 100 msec. (I deliberately chose a country that doesn’t have the best or worst latency.)
Scenario 1 - ACME hosts their solution with a local hoster. A web page in their solution does some heavy lifting and takes 10 seconds to complete. When they have more users this can take up to 20 seconds. It's not easy to add more servers on demand with their local hoster, so they face performance issues from time to time. When a user opens this page during a busy period, it can take up to 20,025 msec to show up: 20,000 msec (20 seconds) for the page to generate and 25 msec to transfer.
Then ACME decides to move their solution to Azure. They implement autoscaling so they can add more capacity under heavy load. With the additional capacity we can expect the page to generate in 10 seconds most of the time. Unlike the previous 20,025 msec case, even under heavy load the same page is served in 10,100 msec (10 sec page generation + 100 msec transfer time).
When we have less load, Azure would perform nearly the same as the local hosted solution: 10,025 vs. 10,100 msec. The difference is 75 msec, which is less than one tenth of a second. It would be nearly impossible for the human eye to notice the difference on a page that takes 10 seconds to generate. However, under heavy load Azure would perform nearly 2 times faster: 10,100 vs. 20,025 msec.
The bottom line for this scenario: the Azure solution would perform either the same as or better than the local hosted solution. So we have no issues on this one.
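The arithmetic in Scenario 1 can be written down as a tiny model. This is only a sketch: the function name `perceived_ms` and the constant names are made up for illustration, and the model assumes, as the text does, that a page view costs the page generation time plus a single round trip.

```python
def perceived_ms(page_generation_ms: float, latency_ms: float) -> float:
    """Time until the user sees the page: server work plus one round trip."""
    return page_generation_ms + latency_ms


# Latency figures taken from the ACME example in the text.
LOCAL_LATENCY_MS = 25
AZURE_LATENCY_MS = 100

local_busy = perceived_ms(20_000, LOCAL_LATENCY_MS)      # local hoster, heavy load
local_quiet = perceived_ms(10_000, LOCAL_LATENCY_MS)     # local hoster, light load
azure_any_load = perceived_ms(10_000, AZURE_LATENCY_MS)  # Azure with autoscaling

print(local_busy, local_quiet, azure_any_load)
print("extra cost of Azure under light load (msec):", azure_any_load - local_quiet)
print("speed-up under heavy load:", round(local_busy / azure_any_load, 2))
```

The same function can be reused for any other page by plugging in its generation time.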
Scenario 2 - ACME has a web page that takes only 500 msec to execute. The math is simple: the local hosted solution takes 525 msec for the user to see the page vs. 600 msec on Azure. The difference is still 75 msec, again less than 1/10 of a second, but this time a very careful eye might notice a little bit of a difference.
The bottom line: Azure performs nearly the same as the local hosted version and the difference is negligible.
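Scenario 3 - ACME has a web page that takes 500 msec to execute and then makes a separate AJAX call for each of the 500 cells in a table. Each call does a lookup that takes 5 msec on the server.
In this scenario the local hosted page would take 500 (page execution) + 25 (latency) + 500 (number of cells) x [ 5 (lookup execution) + 25 (latency) ] = 525 + 500 x 30 = 15,525 msec, which is roughly 15 seconds.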
Now let's calculate the Azure hosted version: 500 (page execution) + 100 (latency) + 500 (number of cells) x [ 5 (lookup execution) + 100 (latency) ] = 600 + 500 x 105 = 53,100 msec. This time it is 53 seconds, nearly a minute. Quite slow compared to the local hosted 15 second page, eh?
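The two calculations above follow the same formula, so they can be sketched as one function. The name `chatty_page_ms` is hypothetical, and the model assumes (as the figures in the text do) that the calls run one after another and each pays a full round trip; a real browser may run some of them in parallel.

```python
def chatty_page_ms(page_ms: float, latency_ms: float,
                   calls: int, lookup_ms: float) -> float:
    """Initial page load plus `calls` sequential AJAX calls.

    Each AJAX call costs the server-side lookup plus one network round trip.
    """
    initial_load = page_ms + latency_ms
    per_call = lookup_ms + latency_ms
    return initial_load + calls * per_call


local_ms = chatty_page_ms(page_ms=500, latency_ms=25, calls=500, lookup_ms=5)
azure_ms = chatty_page_ms(page_ms=500, latency_ms=100, calls=500, lookup_ms=5)

print(local_ms)  # 15525
print(azure_ms)  # 53100
```

Note how the latency term is multiplied by the number of calls. That multiplication, not the single extra 75 msec, is what makes the Azure version so much slower here.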
Although this scenario is a bit exaggerated, performance problems caused by wrong AJAX usage are among the most common issues for cloud solutions.
Solution: As you can easily guess, the solution in the above scenario is simply to optimize the page to make one AJAX call that gets all the data needed for the table, instead of making a call for each cell.
With this optimization the Azure version could take 500 + 100 + 5 (lookup) + 100 (a single AJAX call) = 705 msec. That is less than a second, and roughly 22 times faster than the unoptimized local hosted version (15,525 msec)!
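Here is a minimal sketch of the difference between the chatty and the batched approach. Everything in it is hypothetical: `FakeLookupService` stands in for the server and only counts round trips, and `estimated_ms` reuses the figures from the text, including the simplification that one batched lookup still costs 5 msec on the server.

```python
from typing import Dict, List


class FakeLookupService:
    """Stand-in for the server; counts how many round trips the client makes."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = data
        self.round_trips = 0

    def get_one(self, key: str) -> str:
        self.round_trips += 1  # one AJAX call per cell
        return self._data[key]

    def get_many(self, keys: List[str]) -> Dict[str, str]:
        self.round_trips += 1  # one AJAX call for the whole table
        return {key: self._data[key] for key in keys}


def fill_table_chatty(service: FakeLookupService, keys: List[str]) -> Dict[str, str]:
    return {key: service.get_one(key) for key in keys}


def fill_table_batched(service: FakeLookupService, keys: List[str]) -> Dict[str, str]:
    return service.get_many(keys)


def estimated_ms(round_trips: int, page_ms: int = 500,
                 latency_ms: int = 100, lookup_ms: int = 5) -> int:
    """Azure figures from the text: page load plus the AJAX round trips."""
    return page_ms + latency_ms + round_trips * (lookup_ms + latency_ms)


keys = [f"cell-{i}" for i in range(500)]
data = {key: f"value for {key}" for key in keys}

chatty = FakeLookupService(data)
batched = FakeLookupService(data)

# Both approaches fill the table with exactly the same values...
assert fill_table_chatty(chatty, keys) == fill_table_batched(batched, keys)

# ...but with very different numbers of round trips.
print(chatty.round_trips, estimated_ms(chatty.round_trips))    # 500 53100
print(batched.round_trips, estimated_ms(batched.round_trips))  # 1 705
```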
Scenario 4 - ACME has a web page built in ASP.NET, and they use UpdatePanel and many other ASP.NET AJAX controls. ACME loves the way they built it, simply dragging and dropping controls while all the "magic" happens behind the scenes.
ACME feels the local hosted version works "just fine", but when they move to Azure they "feel" that the page is just "slower". It is hard for ACME to pinpoint which part is slow, since the ASP.NET AJAX controls do their magic to continuously load "stuff" and ACME doesn't really know what happens when.
Solution: Unfortunately, tackling issues like this one is a bit trickier than the one in Scenario 3. (And Murphy is never wrong: we face more of this type.) So what do we do?
Scale Up issues
All cloud vendors, including Azure, have a hard upper limit for scaling up. As an example (and as of May 2012), a single Azure compute instance (VM) has an upper limit of 8 CPU cores and 14 GB RAM. This is as much as you can scale up today.
Solution: It's quite easy to guess, right? Your application must be designed to scale out, not up. Regarding compute, you should simply be able to add more instances to your deployment. And the good news is that most Azure services are designed with automatic scale-out features out of the box. For instance, Azure storage (BLOB, Table, Queue) scales out to more and more VMs behind the scenes if needed.
Relational databases are a bit tricky in this regard, since it’s not very easy to scale them out. SQL Azure scales up automatically to a point, but as of May 2012 it can't really compare to the 32 core physical box that you can buy for your on-premise data centre. The solution can be partitioning the DB to scale out (never that easy), using multiple DBs (never easy either), or using non-relational stores like Azure BLOBs and tables (which is also not very easy if you really need relations).
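To give a flavour of what partitioning means in practice, here is a minimal sketch of hash-based routing, one common way to decide which database holds a given row. It is not a SQL Azure feature or API: the database names, `shard_for` and `database_for` are all hypothetical, and the sketch assumes every query can be routed by a single key such as a customer id.

```python
import hashlib
from typing import List

# Hypothetical database names; in a real deployment these would be connection strings.
SHARDS: List[str] = ["acme-db-0", "acme-db-1", "acme-db-2", "acme-db-3"]


def shard_for(customer_id: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash.

    hashlib is used instead of the built-in hash(), which is randomised
    per process for strings and would route the same key differently
    after a restart.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def database_for(customer_id: str) -> str:
    return SHARDS[shard_for(customer_id, len(SHARDS))]


# The same customer always lands on the same database.
print(database_for("customer-42"))
print(database_for("customer-42"))
```

The hard parts are exactly the ones this sketch leaves out: queries that span several shards, and moving data when the number of shards changes.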
Scaling out relational databases is beyond the scope of this blog but after a quick search you should be able to find plenty of resources on this topic.
Bandwidth
Nowadays bandwidth capacity is quite high in most countries. After all, more and more users want to watch online videos, and the internet is simply replacing TV. However, we might still face bandwidth issues, either because a certain country has permanent or temporary international bandwidth problems, or because the amount of content we deliver is simply massive and downloading it takes longer than desired (e.g. broadcasting a full HD video in real time).
Solution: We might consider using Azure CDN (Content Delivery Network) or a CDN from another vendor.
CDNs are servers distributed worldwide that cache content, so that clients receive it from a node that is closer from a “network topology” point of view.
So what is network topology? A shorter geographic distance doesn’t always mean better bandwidth or lower latency. Two cities (in different countries) might be only 100 kilometers apart. One of them might have a CDN node serving its whole country, while the other city might have a very poor connection to the neighbouring country. This can easily mean that users from the second city will skip the CDN node that is 100 km away and download content directly from the main datacentre, or from a CDN node that is 500 kilometers away.
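The point of the two-cities example is that the best source is the one with the best measured network path, not the shortest distance on the map. A minimal sketch, with made-up node names and latency figures; in practice the CDN's own routing makes this decision, not your application code:

```python
from typing import Dict


def pick_source(measured_latency_ms: Dict[str, float]) -> str:
    """Return the source with the lowest measured latency."""
    return min(measured_latency_ms, key=lambda name: measured_latency_ms[name])


# Hypothetical measurements taken from the second city in the example.
candidates = {
    "cdn-node-100km": 180.0,    # geographically closest, but a poor cross-border link
    "cdn-node-500km": 40.0,     # further away, but well connected
    "origin-datacentre": 100.0,
}

print(pick_source(candidates))  # cdn-node-500km
```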
Optimizing your solution for CDN usage and understanding internet network topology are again beyond the scope of this blog post.
Compute Instance (VM) Performance
The bigger your instance is, the more bandwidth it has (on top of CPU cores and memory). Please check the page below for more information.
In certain cases, increasing your VM size alone can quickly boost performance to a desirable level. Still, it would be best to collect some performance counters to see what is going on behind the scenes. For that we have:
Azure Diagnostics
For many performance issues, if we can’t measure the current state then we can’t really optimize. Simple as that. Thankfully, Azure has an extensive set of tools and performance counters for this purpose, and they are quite easy to use.
You can check the page below for more information on Azure diagnostics.
Internet Explorer Developer Tools
Pressing F12 in IE opens the developer tools, which have plenty of goodies. The last tab, Network, can be your first stop for detecting many Azure performance issues (especially latency).
A very easy way to test your latency is to deploy a one-line HelloWorld.html, then go to the Network tab in the developer tools and click “Start Capturing”. Then you can hit F5 to refresh the page and see how long it takes to load. Since HelloWorld.html is extremely simple, I would assume most of what you see (80%?) is network related and therefore latency.
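If you prefer a script to a browser, the same measurement can be sketched in a few lines. The helper names `fetch_url` and `median_round_trip_ms` and the URL in the comment are assumptions for illustration. Note that `urllib.request.urlopen` opens a new connection for every request, so the figure includes DNS lookup and connection setup and will be somewhat higher than the pure ping time.

```python
import statistics
import time
import urllib.request
from typing import Callable, List


def fetch_url(url: str) -> None:
    """Download a page and discard the body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()


def median_round_trip_ms(fetch: Callable[[], None], samples: int = 5) -> float:
    """Call `fetch` several times and return the median elapsed time in msec.

    The median is used so that a single slow request does not skew the result.
    """
    timings: List[float] = []
    for _ in range(samples):
        start = time.perf_counter()
        fetch()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Example usage (hypothetical URL, replace with your own deployment):
# print(median_round_trip_ms(lambda: fetch_url("http://yourapp.example/HelloWorld.html")))
```

Passing the fetch step in as a function keeps the timing loop independent of how the request is made, so it can be tried out without a network connection.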
UPDATE: I posted another blog post where I talk about a few other performance related topics. You can find it at Tuning Windows Azure Performance, more thoughts