Windows Azure for Social Applications

One of the projects I’m working on during my day job is pulling together information on how Windows Azure can be used to host social applications (i.e., social games). It’s an interesting topic, but I think I’ve managed to distill it down to the basics and wanted to put it out here for feedback. This post is just going to talk about some high-level concepts, and isn't going to drill into any implementation details.

Note: this post won’t go into details of client implementation, but will only examine the server side technologies and concerns.

Communication

The basic requirement for any social interaction is communication. The client sends a message to the server, which sends the message to other users. This can be accomplished internally in the web application if both clients are connected to the same instance, but what about when we scale this out to multiple servers?

Once we scale out, there are a couple of options:

  • Direct communication between instances
  • Queues
  • Blobs
  • Tables
  • Database
  • Caching

While direct communication is probably the fastest way to do inter-instance communication, it’s not the best solution in the cloud. Multi-instance direct communication would normally involve building and maintaining a map of which users are on which instances, then directing traffic between instances based on which users are interacting. Instances in the cloud may fail over to different hardware if the node they are running on encounters a problem, if the instance needs more resources, and so on. There's a variety of reasons, but what it boils down to is that instances are going to fail over, which will cause subsequent communications from the browser to hit a different instance. Because of this, you should never rely on the user-to-instance relationship being constant.

It may make more sense to use the Windows Azure Queue service instead, as this allows you to provide guaranteed delivery of messages in a pull fashion. The sender puts messages in, the receiver pulls them off. Queues can be created on the fly, so it would be fairly easy to create one per game instance. The only requirement of the server in this case is that it can correctly determine the queue based on information provided by the client, such as a value stored in a cookie.
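To illustrate the pull model described above, here’s a minimal sketch using Python’s standard library. Note the assumptions: a plain in-memory `queue.Queue` stands in for the Windows Azure Queue service, and the session IDs and function names are hypothetical; with the real service you’d create a named queue per game session through the storage API instead.

```python
import queue

# In-memory stand-in for a queue service: one queue per game session.
# (With the real Windows Azure Queue service you'd create a named queue
# per session; everything here is an illustrative stand-in.)
session_queues = {}

def get_session_queue(session_id):
    """Create the session's queue on the fly if it doesn't exist yet."""
    return session_queues.setdefault(session_id, queue.Queue())

def send_message(session_id, message):
    # Sender puts messages in...
    get_session_queue(session_id).put(message)

def receive_messages(session_id):
    # ...receiver pulls them off until the queue is drained.
    q = get_session_queue(session_id)
    messages = []
    while not q.empty():
        messages.append(q.get())
    return messages

send_message("game-42", "player1: poke!")
send_message("game-42", "player2: ouch")
print(receive_messages("game-42"))
```

The key design point carries over to the real service: because queues can be created on the fly, the server only needs enough information from the client (such as a session ID in a cookie) to derive the queue name.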

Beyond queues, other options include the Windows Azure Blob service and Table service. Blobs are definitely useful for storing static assets like graphics and audio, but they can be used to store any type of information. You can also use blob storage with the Content Delivery Network, which makes it a shoo-in for any data that needs to be read directly by the client. Tables can't be exposed directly to the client, but they are still useful in that they provide semi-structured key/value pair storage. There is a limit on the amount of data they can store per entity/row (a total of 1 MB), however they provide fast lookup if the lookup can be performed using the partition key and row key values. Tables would probably be a good place to store session-specific information that is needed by the server, but not necessarily by the client.
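The partition key/row key access pattern is the important part, so here’s a hedged sketch of it. A Python dict keyed on the (partition key, row key) pair stands in for the Table service, and the entity shapes and key names are made up for illustration; the real Table service API differs.

```python
# Illustrative key/value session store mimicking Table-style lookups:
# fast when you know both the partition key and the row key.
# (A dict stands in for the real Table service; keys are hypothetical.)
table = {}

def insert_entity(partition_key, row_key, entity):
    table[(partition_key, row_key)] = entity

def get_entity(partition_key, row_key):
    # A point lookup on (partition, row) is the fast path; anything
    # else would mean scanning, which is what you want to avoid.
    return table.get((partition_key, row_key))

# Session-specific state the server needs but the client doesn't:
insert_entity("session-7", "player-1", {"score": 120, "x": 10, "y": 4})
print(get_entity("session-7", "player-1"))
```

A sensible convention, matching the note above, is to use the game session ID as the partition key and the player ID as the row key, so all of a session's state shares a partition.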

SQL Azure is more tailored to storing relational data and performing queries across it. For example, if your game has persistent elements such as personal items that a player retains across sessions, you might store those into SQL Azure. During play this information might be cached in Tables or Blobs for fast access and to avoid excessive queries against SQL Azure.
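That caching pattern (read from the fast store first, fall back to the relational store, then populate the cache) can be sketched as follows. The function and item names are hypothetical stand-ins, not real SQL Azure or Table service calls.

```python
# Read-through cache sketch: check the fast store first, fall back to
# the relational store, and populate the cache on a miss.
cache = {}

def load_player_items_from_sql(player_id):
    # Stand-in for a SQL Azure query fetching a player's persistent
    # items; in practice this is the slow, query-capable store.
    return ["sharp stick", "watering can"]

def get_player_items(player_id):
    if player_id not in cache:
        # Cache miss: hit SQL Azure once, then serve from cache.
        cache[player_id] = load_player_items_from_sql(player_id)
    return cache[player_id]

print(get_player_items("player-1"))
```

During a play session only the first lookup pays the cost of the relational query; subsequent reads avoid hitting SQL Azure entirely.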

Windows Azure also provides a fast, distributed cache that could be used to store shared data; however, it’s relatively small (4 GB max) and relatively expensive ($325 for 4 GB as of November 7, 2011). Also, it currently can only be used by .NET applications.

Latency

I mentioned latency earlier; I’ll skip the technical explanation and just say that latency is the delay between one user sending a message and other users receiving it. The message may be anything from an in-game e-mail, which tolerates high latency well, to trying to poke another player with a sharp stick, which doesn’t.

Latency is usually measured in milliseconds (ms), and the lower, the better. The link to AzureScope can provide some general expectations of latency within the Azure network; however, there’s also the latency of the connection between Azure and the client. This is something that’s not as easy to control or estimate ahead of time.

Expectations of Immediacy

When thinking about latency, you need to consider how immediate a user expects a social interaction to be. I tend to categorize expectations into ‘shared’ and ‘non-shared’ experience categories. In general, the expectation of immediacy is much higher for shared experiences. Here are some examples of both:

Non-Shared Experience

  • Mail – Most people expect mail to take seconds, if not tens of seconds, to reach the recipient.
  • Chat – While there is an expectation of immediacy when you send a message (that once you hit Enter, the people on the other end see your message), this is offset by the receiver’s lack of expectation of immediacy: the receiver knows that people often type slowly, or that you may have had to step away to answer the phone.
  • Leaderboards and achievements – Similar to chat, the person achieving the score or reward expects it to be reflected immediately on their screen; however, most people don’t expect their screens to be instantly updated with other people’s achievements.

Shared Experience

  • Avatar play – If your game allows players to move avatars around a shared world, there is a high expectation of immediacy. Even if the only interaction between players is chat based, you expect others to see your character at the same location on their screen as you see yourself.
  • Competitive interactions tend to have high expectations of immediacy; however, this is modified by the type of competition:
    • If you’re competing for a shared resource, such as harvesting a limited number of vegetables from a shared garden, then expectations are high. You must ensure that when one user harvests a vegetable, it immediately disappears from other users’ screens.
    • If you’re competing for a non-shared resource, such as seeing who can harvest the most vegetables from their own gardens in a set period of time, then expectations shift focus to the clock and the resources you interact with. You don’t have to worry as much about synchronizing the disappearance of vegetables with other players.

Working with Latency

The most basic thing you can do to ensure low latency is host your application in a datacenter that is geographically close to your users. This will generally ensure good latency between the client and the server, but there are still a lot of unknowns in the connection that you can’t control.

Internally in the Azure network, you want to perform load testing to ensure that the services you use (queues, tables, blobs, etc.) maintain a consistent latency at scale. Architect the solution with scaling in mind; don’t assume that one queue will always be enough. Instead, allocate queues and other services dynamically as needed. For example, allocate one queue per game session and use it for all communication between users in the session.
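As a starting point for that kind of load testing, here’s a hedged sketch of a timing loop that averages the latency of an operation in milliseconds. The placeholder operation is just busywork; in a real test you’d pass in a queue put/get or table lookup against your actual services.

```python
import time

def measure_latency_ms(operation, samples=10):
    """Run an operation repeatedly and report its average latency in ms."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        operation()  # e.g., a queue put/get or a table point lookup
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)

# Placeholder operation standing in for a real storage service call.
avg_ms = measure_latency_ms(lambda: sum(range(1000)))
print("average latency: %.3f ms" % avg_ms)
```

Running the same loop at different levels of concurrency and queue counts is what tells you whether latency stays consistent as the game scales.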

Bandwidth

Another concern is at what point the data being passed exceeds your available bandwidth. A user connecting over dial-up has much less bandwidth than one connecting over fiber, so obviously you need to control how much data is sent to the client. However, you also need to consider how much bandwidth your web application can handle.

According to billing information on the various role VM sizes at https://msdn.microsoft.com/en-us/library/dd163896.aspx#bk_Billing (What is a Compute Instance section), different roles have different peak bandwidth limitations. For the largest bandwidth of 800 Mbps, you would need an ExtraLarge VM size for your web role. Another possibility would be to go with more instances with less bandwidth each and spread the load out. For example, eight Small VMs have the same aggregate bandwidth as one ExtraLarge VM.

Working with Bandwidth

Compressing data, caching objects on the client, and limiting the size of data structures are all things that you should be doing to reduce the bandwidth requirements of your application. That's really about all you can do.
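Compression in particular is cheap to add. Here’s a sketch using Python’s standard library `gzip` module on a made-up game-state payload; actual savings depend on your data, though repetitive structures like the one below compress well.

```python
import gzip
import json

# Hypothetical game-state payload: 50 player records with repetitive keys.
state = {"players": [{"id": i, "x": 0, "y": 0} for i in range(50)]}

raw = json.dumps(state).encode("utf-8")
compressed = gzip.compress(raw)

# Compare wire sizes; the round trip must reproduce the original bytes.
print(len(raw), len(compressed))
assert gzip.decompress(compressed) == raw
```

On the web, the same effect usually comes for free by enabling gzip in the server's HTTP response compression rather than compressing by hand.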

Summary

While this has focused on Windows Azure technologies, the core information presented should be usable on any cloud; the expectations of immediacy in social interactions provide the baseline that you want to meet in passing messages, while latency and bandwidth act as the limiting factors. But I don’t think I’m done here; what am I missing? Is there another concern beyond communication, or another limiting factor beyond bandwidth and latency?

As always, you can leave feedback in the comments below or send it to @larry_franks on twitter.