Microsoft Windows Azure in Academia and what help can you get from Microsoft Research

windowsAzureLogo

What is Windows Azure?

Windows Azure is a platform for building scalable, highly reliable, multi-tiered web service applications. It is hosted on Microsoft’s large data centers in the United States, Europe, and Asia. Windows Azure has both compute and data resources. The compute resources are designed to allow applications to scale to thousands of servers and data resources. For more information on Windows Azure keep an eye on the Windows Azure team blog.

Windows Azure as a Platform for Research in Cloud Computing

What types of research projects are well suited to Azure?

Windows Azure can be an excellent research platform for many types of research. However, it is designed to support scalable web services, so projects that play to this strength will have the most success. One area of particular interest is computational models and techniques that augment the capabilities of client devices, ranging from feature rich desktop and laptop computers to cell phones of other mobile devices with data and computation resources in the cloud. How can we make the cloud into a transparent extension and experience amplifier of our client-based research tools?

Others interesting areas include:

  • Evaluation of cloud computing for a spectrum of algorithms, computational models, and programming patterns.

  • Programming abstractions for cloud platforms that offer higher level abstractions and guarantees to application developers.

  • Scaling conventional algorithms to cloud scale, coping with load, and dealing with faults.

  • Research into scientific data management, data analysis, and new services in the cloud to support data intensive research.

  • Data parallel program frameworks, going beyond the map reduce model.

  • Experimental platforms for distributed agent technology.

  • Compiling declarative and/or functional languages for cloud platforms.

  • Sustainable science gateways and virtual organizations in the cloud for web-based application and collaboration frameworks.

  • Research to support digital data libraries in the cloud that contain scientific data, not just the metadata, and support integration with published literature.

  • New techniques for data visualization of cloud scale data sets, including visual analytics.

  • Frameworks to support sensor-based science in the cloud, from services to process data streaming and from sensors to the storage and curation of raw and derived data products.

  • Machine learning techniques for data understanding and data generation.

  • Research to support intelligent interactions leveraging web data and domain knowledge.

Will Hadoop or Dryad/LINQ be available on Azure?

There is no port of Hadoop or Dryad/LINQ currently available. However, Windows Azure is an excellent platform for experimenting with new variations on large-scale map-reduce algorithms, as these patterns are easily coded as worker role networks.

Can I run my MPI HPC applications on Windows Azure?

Windows Azure is not designed to replace the traditional HPC supercomputer. In its current data center configuration it does not have the high-bandwidth, low-latency communication model that is appropriate for tightly-coupled MPI jobs. However, Windows Azure can be used to host large parallel computations that do not require MPI messaging, such as ensemble or parameter sweep studies.

Can Azure be useful as an experimental host for distributed computing research?

Yes. Windows Azure worker roles have access to standard TCP/IP sockets on each virtual machine (VM) in which they run. Hence it is possible to use a large number of worker roles to experiment with distributed computing algorithms and protocols.

Can Azure be used to support collaborations and “science gateways”?

Yes. Windows Azure is an excellent platform for sharing “community” data and data analysis tools. Most science gateways are built as web portals and Windows Azure is ideally suited for this task.

What data collections will be made available?

We will be very interested in suggestions from researchers about important community data collections and tools that can be hosted. We currently have data collections from the NCBI genome databases, oceanographic instrument data, and some MODIS satellite data. We also are providing access to web scale n-grams via a service. However, our goal is to let the research community help us define a sustainable collection of shared resources and analysis tools.

Web N-gram Services

The Web N-gram Services, provided by Microsoft Research in partnership with Microsoft Bing, will provide researchers access to large scale real-world datasets and benchmarks. Access to the Web N-gram Services will be made available to NSF awardees, with the following properties:

  • Content types: Document Body, Document Title, Anchor Texts
  • Model types: smoothed models
  • Highest order N: 5 (N=5 in N-gram)
  • Training size (Body): over 1.3 trillion
  • #of 1-gram (Body): 1 billion
  • #of 5-gram (Body): 237 billion
  • Availability: Hosted Services by Microsoft
  • Refresh: 2 data updates

N-gram models can advance research in areas such as document representation and content analysis (for example, clustering, classification, and information extraction), query analysis (for example, query suggestion and query reformulation), retrieval models and ranking, spelling, and machine translation. They can also improve intelligent interactions with better dialogue modeling (for example, semantic relations and summarization).

Windows Azure Software and Architecture

What is the programming model?

A Windows Azure program is a scalable, multi-tiered web service. The service consists of one or more “web roles,” which are standard web service processes, and “worker roles,” which are computational and data management processes. Roles communicate by passing messages through queues or sockets. The number of instances of each type of role is determined by the developer when the application is deployed and each role is assigned by Windows Azure to a unique Windows Server virtual machine (VM) instance. (Currently no more than one VM instance runs on an individual core.)

What types compute instances are available on Windows Azure?

Each Windows Azure compute instance (web role or worker role) represents a virtual server. Although many resources are dedicated to a particular instance, some resources associated to I/O performance, such as network bandwidth and disk subsystem, are shared among the compute instances on the same physical host. During periods when a shared resource is not fully utilized, you can utilize a higher share of that resource. Each Windows Azure data center server currently has 8 cores, 14 GB of memory, and 2 TB of disk space. An instance can be mapped to one or more cores with the memory and resources divided evenly. The table below describes the way the resources are partitioned on each server.

Compute Instance Size

CPU

Memory

Instance Storage

I/O Performance

Small

1.6 GHz

1.75 GB

225 GB

Moderate

Medium

2 x 1.6 GHz

3.5 GB

490 GB

High

Large

4 x 1.6 GHz

7 GB

1,000 GB

High

Extra large

8 x 1.6 GHz

14 GB

2,040 GB

High

Windows Azure Pricing Calculator

What virtual machine (VM) types are available? Can I configure my own VM?
Windows Azure automatically configures and manages Windows Server VM instances for your application. In the current version of Windows Azure, you cannot remotely connect and run a remote desktop on this VM instance. The VM instances are managed and deployed by the Windows Azure Fabric Controller and you interact with the Fabric Controller though the Windows Azure web interface.

What types of data storage are available on Windows Azure?
There are basically five storage systems. Blob storage is for long-term data. Blobs are binary objects together with <name, value> pair metadata. Each blob can be up to 50 GB and blobs are grouped into logical containers. Blobs are replicated three times in the data center for reliability purposes and they can be accessed from any server or by a URL over the Internet. Table storage is another type of persistent storage. A table can be very large (millions of rows and columns) and is partitioned by rows and distributed over the storage nodes in Windows Azure. It is also triply replicated. Tables are not full SQL tables because there is no join operator. Within the compute node there are two types of storage. Local disk is available to each Windows Azure role, but this is not persistent. If your role process goes down it may be restarted on another node, so the local disk is not for persistent data. However, XDrives are virtual drives that can be mounted on a Windows Azure VM instance and they are backed by the blob storage system so that they are persistent. The queue system is also part of the persistent Windows Azure storage model.

Can I run the Windows Azure software stack on my own private cluster?
Not currently. Windows Azure is a public cloud service and it is not available as a software product.

Developing Applications for Windows Azure

What languages/compilers are available? What IDEs can be used with Azure?
Applications on Windows Azure are designed and debugged completely on the programmer’s local machine. So any compiler that generates a Windows binary can be used. Microsoft Visual Studio has a “plug-in” for Windows Azure that makes the construction of Windows Azure applications extremely simple. The plug-in allows the programmer to test the application on a local Windows Azure emulator. When the programmer is ready to deploy the application on Windows Azure, the binaries are uploaded through a web interface. This interface also controls deployment parameters such as the number of server instances to be used. Visual Studio is not required.

The application program can also use a plug-in for the Eclipse software framework if that is preferred. Programmers can use Java, Ruby, Python, and C++. We have examples that illustrate how to deploy the Apache Tomcat server on Windows Azure for the web role.

What scientific libraries will be available?
We are working on a list of these currently.

Can I run Matlab on Azure?

Matlab can be used to “compile” a Matlab application and it is possible to upload this compiled code and libraries to Windows Azure. We have not yet installed a complete Matlab instance on Windows Azure, but it is a project currently under study.

Can I run arbitrary applications as Azure workers?

In general, any Windows binary that does not require modifications to the Windows operating system or registry can be loaded and run as part of a Windows Azure role. This includes compiled C, C++, or Fortran programs as well as Java, Python, PHP, and Ruby applications.

How can I use custom libraries with my Azure applications?

When you use the IDE to build the application, you simply include the libraries with the application. The IDE will roll these up into the binary that is uploaded to Windows Azure when you deploy your application.

Can I manage collections of tasks (workflows) that involve local data and activities as well as Azure resident data and tasks?

Yes. We have a sample Windows Azure service that can manage many concurrent tasks that run on Windows Azure as well as local compute resources. The tasks are arbitrary Windows executables that are wrapped as Windows Azure worker roles. The data that is used or produced can be in the Windows Azure replicated persistent storage or the local disk on each machine. This local storage is not persistent across deployments of your application; however, it is possible to mount a virtual disk that is part of the persistent storage.

Are there hooks in Azure to enable systems-level research on scheduling and resource allocation?

Unfortunately, no. Windows Azure applications are controlled by the Azure Fabric controller, which has the responsibility of resource allocation and quality of service for all the currently running applications. Consequently, Windows Azure is not well suited to many systems-level research projects.

What dynamic scaling models are supported by Azure?

The programmer must specify the number of each type of role to instantiate at deployment time. These numbers remain fixed until the application is de-deployed. However, it is possible to experiment with dynamic deployment in the application logic.

What performance guarantees can I expect from my applications? What is the interference between workers?

Performance will be variable depending on the load on the data center. We will provide basic benchmark tools to help guide application designers to optimize performance. At any given time, Windows Azure will be running many applications in the data center and many of these are commercial customers who demand high levels of service. The research engagement project will receive the same level of support as these commercial users.

Will I be able to instrument my applications?

Because you have no direct access to the Windows Server virtual machine instance, you will not be able to access instrumentation that requires administrator level authorization. However, you do have access to application logs and some performance counters, and application-level instrumentation is possible. A complete benchmark suite will be available that you can use to experiment with many performance features of Windows Azure.

How can I get my data into the Azure storage?

There are several ways to move data. The most direct method is through the web API to the Windows Azure blob storage. It is also possible to take data stored in Microsoft Office Excel or Matlab and, by using plug-ins we provide, save your data directly in Windows Azure tables. There are also free GUI tools to manage Windows Azure storage from your desktop, such as the Cloud Storage Studio. OpenDAP services will be available for loading and serving data and an FTP application can be used to pull data from FTP sites.

Are there ways to connect desktop or mobile applications to Azure services?

Windows Azure is a web service platform, so any web-based protocol can be used to build applications that fully integrate the cloud. This is how the plug-ins for Matlab and Office Excel were built.

Are there data visualization tools available for Azure resident data collections?

Both Excel and Matlab running on the desktop can be used to visualize data stored on Windows Azure. More sophisticated local and remote visualization tools will be made available.

What will the Microsoft Research Cloud Research Engagement Team provide to the research community?
The engagement team will support the community of researchers through the following:

  • Develop tutorials and white papers for a general overview of Windows Azure, identify best practices, and provide a benchmark suite as a guide for application architects and developers.

  • Host reference data sets in Windows Azure, selected based on research value/interest.

  • Host an internal kickoff workshop and annual all-hands meeting (invitation only) at Microsoft Research.

  • Field questions and provide technical consultation through our engagement team.

  • Regularly update a community web site with technical content, blogs, source codes, and examples—along with community-supplied content, questions and answers, and more.

  • Offer common services and tools that the research community finds useful.