Windows Azure Storage Overview

I am at the Azure Firestarter event in Redmond today and just heard Brad Calder give a quick overview of Windows Azure storage.  Here are my notes; the slides and sample code are to be posted later, and I will update this post when they are available.

  • Blobs
    • REST APIs
    • Can have a lease on the blob - allows for limiting access to the blob (used by drives)
    • To create a blob (see the sketch after this list)…
      • Use StorageCredentialsAccountAndKey to create the authentication object
      • Use CloudBlobClient to establish a connection using the authentication object and a URI to the blob store (from the portal)
      • Use CloudBlobContainer to create/access a container
      • Use CloudBlob to access/create a blob
    • Two types of blobs
      • Block blob - up to 200 GB
        • Targeted at streaming workloads (e.g. photos, images)
        • Can update blocks in any order (e.g. potentially multiple streams)
      • Page blob - up to 1 TB
        • Targeted at random read/write workloads
        • Used for drives
        • Pages not stored are effectively initialized to all zeros.
          • Only charged for pages you actually store.
          • Can create a 100 GB blob, but write 1 MB to it - only charged for 1 MB of pages.
        • Page size == 512 bytes
          • Updates must be 512 byte aligned (up to 4 MB at a time)
          • Can read from any offset
        • ClearPages removes the content - not charged for cleared pages.
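Roughly what the blob-creation steps above look like in code - a minimal sketch against the .NET StorageClient library (Microsoft.WindowsAzure.StorageClient); the account name, key, endpoint URI, and container/blob names are placeholders:

```csharp
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlobSample
{
    static void Main()
    {
        // Placeholder credentials - use your own account name/key from the portal.
        var credentials = new StorageCredentialsAccountAndKey("myaccount", "base64key==");

        // Blob endpoint URI comes from the portal, e.g. http://myaccount.blob.core.windows.net
        var blobClient = new CloudBlobClient(
            "http://myaccount.blob.core.windows.net", credentials);

        // Create (or get) a container, then a block blob inside it.
        CloudBlobContainer container = blobClient.GetContainerReference("photos");
        container.CreateIfNotExist();

        CloudBlockBlob blockBlob = container.GetBlockBlobReference("hello.txt");
        blockBlob.UploadText("Hello, blob storage");    // streaming-style block blob

        // Page blob: pre-sized for random read/write; only written pages are billed.
        CloudPageBlob pageBlob = container.GetPageBlobReference("disk.vhd");
        pageBlob.Create(100 * 1024 * 1024 * 1024L);     // 100 GB page blob, no pages stored yet
    }
}
```

Note the page blob is created at 100 GB but stores no pages yet, so (per the talk) you are only charged once pages are actually written.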
  • CDN
    • Storage account can be enabled for CDN.
    • Will get back a domain name to access blobs - can register a custom domain name via CDN.
    • Different from the base domain used to access blobs directly - if you use the main storage account URL, the blob is retrieved directly from blob storage, not through the CDN.
    • To use CDN
      • Create a blob
      • When creating a blob, specify "TTL" - time to live in the CDN in seconds.
      • Reference the blob using the CDN URL and it will cache it in the nearest CDN store.
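As far as I can tell, the CDN "TTL" is expressed as a Cache-Control max-age header on the blob; a rough sketch under that assumption (same StorageClient library as above, with placeholder account details and a placeholder CDN endpoint in the comment):

```csharp
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class CdnSample
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=base64key==");
        CloudBlobContainer container =
            account.CreateCloudBlobClient().GetContainerReference("images");
        container.CreateIfNotExist();

        CloudBlob blob = container.GetBlobReference("hello.txt");
        blob.UploadText("cached at the edge");

        // TTL in the CDN, in seconds, via the Cache-Control header on the blob.
        blob.Properties.CacheControl = "max-age=3600";   // one hour at the edge
        blob.SetProperties();

        // Then reference the blob through the CDN endpoint (from the portal), not the
        // storage endpoint, e.g. http://azXXXX.vo.msecnd.net/images/hello.txt (placeholder).
    }
}
```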
  • Signed URLs (Shared Access Signatures) for Blobs
    • Can give limited access to blobs without giving out your secret key.
    • Create a Shared Access Signature (SAS) that gives time-based access to the blob.
      • Specify start-time and end-time.
      • Specify the resource granularity (all blobs in a container, or just one blob in the container)
      • Read/write/delete access permissions.
    • Give out URL with signature.
    • The signature is validated against a signed identifier; you can instantaneously revoke an issued signature by removing the signed identifier.
      • Can also store time range and permissions with the signed identifier rather than in the URL.
      • You can change them after issuing the URL, and the signature in the URL remains valid.
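A sketch of the Shared Access Signature flow described above, using a container-level signed identifier so the time range and permissions can be changed or revoked after the URL has been handed out (v1.x StorageClient again; account details, container, blob, and policy names are placeholders):

```csharp
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class SasSample
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=base64key==");
        CloudBlobContainer container =
            account.CreateCloudBlobClient().GetContainerReference("docs");
        container.CreateIfNotExist();

        // Store a named access policy (signed identifier) on the container.
        // The time range and permissions live server-side, so they can be changed later.
        var permissions = new BlobContainerPermissions();
        permissions.SharedAccessPolicies.Add("partner-read", new SharedAccessPolicy
        {
            SharedAccessStartTime = DateTime.UtcNow,
            SharedAccessExpiryTime = DateTime.UtcNow.AddHours(1),
            Permissions = SharedAccessPermissions.Read
        });
        container.SetPermissions(permissions);

        // Build a signature that references the signed identifier and hand out the URL.
        CloudBlob blob = container.GetBlobReference("report.pdf");
        string sas = blob.GetSharedAccessSignature(new SharedAccessPolicy(), "partner-read");
        Console.WriteLine(blob.Uri.AbsoluteUri + sas);

        // To revoke instantly: remove "partner-read" from SharedAccessPolicies and call
        // SetPermissions again - URLs signed against that identifier stop working.
    }
}
```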
  • Windows Azure Drive
    • Provides a durable NTFS drive using page blob storage.
      • Actually a formatted single-volume NTFS VHD up to 1 TB in size (same limit as page blob)
      • Can only be mounted by one VM instance at a time.
        • Note that each role instance runs on a VM, so only one role instance can mount a drive read/write
        • Could not have both a worker role and a web role mounting the same drive read/write
      • One VM instance can mount up to 16 drives.
    • Because a drive is just a page blob, can upload your VHD from a client.
    • An Azure instance mounts the blob
      • Obtains a lease
      • Specifies how much local disk storage to use for caching the page blob
      • APIs (see the sketch after this section)
        • CloudDrive.InitializeCache - initialize how much local cache to use for drives
      • CloudStorageAccount - to access the blob
      • Create a CloudDrive object using CreateCloudDrive specifying the URI to the page blob
      • Against CloudDrive…
        • Create to initialize it.
        • Mount to mount it - returns path on local file system and then access using normal NTFS APIs
        • Snapshot to create backups
          • Can mount snapshots as read-only VHDs
        • Unmount to unmount it.
    • The driver for mounting blobs is only available in the cloud - not in the development fabric.
      • Instead, just use VHDs
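A sketch of the drive APIs above as I understand them, assuming code running inside a role with a local storage resource named "DriveCache" declared in the service definition; the connection string and page blob URI are placeholders:

```csharp
using System;
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

class DriveSample
{
    static void MountDrive()
    {
        // Tell the drive layer how much local disk to use for caching page blob data.
        LocalResource cache = RoleEnvironment.GetLocalResource("DriveCache");
        CloudDrive.InitializeCache(cache.RootPath, cache.MaximumSizeInMegabytes);

        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=base64key==");

        // The drive is just a page blob holding a single-volume NTFS VHD.
        CloudDrive drive = account.CreateCloudDrive(
            "http://myaccount.blob.core.windows.net/drives/mydata.vhd");
        try
        {
            drive.Create(64);                // 64 MB VHD, created and formatted for you
        }
        catch (CloudDriveException)
        {
            // Already created - fine.
        }

        // Mount takes a lease and returns a local NTFS path; use normal file APIs from here.
        string path = drive.Mount(cache.MaximumSizeInMegabytes, DriveMountOptions.None);
        File.WriteAllText(Path.Combine(path, "hello.txt"), "hello");

        // drive.Snapshot() creates a read-only backup; Unmount releases the lease.
        drive.Unmount();
    }
}
```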
  • Tables
    • Table can have billions of entities and terabytes of data.
    • Highly scalable.
    • WCF Data Services - LINQ or REST APIs
    • Table row has a partition key and a row key
      • Partition key:
        • controls granularity of locality (all entities with same partition key will be stored and cached together)
        • provides entity group transactions - as long as entities have same partition key, can do up to 100 insert/update/delete operations as a batch and will be atomic.
        • enable scalability - monitor usage patterns and use partition key to scale out across different servers based on partition keys
          • More granularity of partition key = better scalability options
          • Less granularity of partition key = better ability to do atomic operations across multiple rows (because all must have same partition key)
    • To create / use an entity (see the sketch at the end of this section)
      • Create a .NET class modeling an entity
      • Specify the DataServiceKey attribute to tell WCF Data Services the primary keys (PartitionKey, RowKey)
      • APIs
        • CloudTableClient - establish URI and account to access table store
        • TableServiceContext - get from CloudTableClient
        • Add entities using the context AddObject method specifying the table name and the class with the data for the new entity
          • SaveChangesWithRetries against context to save the object.
        • To query… use LINQ with AsTableServiceQuery<T>, where T is the .NET class modeling the entity.
          • Manages continuation tokens for you
        • Then iterate with foreach and use UpdateObject to update objects as you enumerate the query results.
        • Use SaveChangesWithRetries
          • Add SaveChangesOptions.Batch if there are 100 or fewer records and all have the same partition key - saves them as one batch.
          • If not, sends a transaction for each object.
    • Table tips
      • Set ServicePointManager.DefaultConnectionLimit = x (the default number of .NET HTTP connections per endpoint is 2)
      • Use SaveChangesWithRetries and AsTableServiceQuery to get best performance
      • Handle Conflict errors on inserts and NotFound errors on Delete
        • Can happen because of retries
      • Avoid append-only write patterns based on partition key values
        • Can happen if the partition key is based on a timestamp.
        • If you keep appending to the same partition range, you defeat Azure's scale-out strategy.
        • Make partition keys distributed rather than concentrated in one range.
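A sketch pulling together the table steps and tips above (v1.x StorageClient plus WCF Data Services; the account details, table name, and entity class are placeholders of my own):

```csharp
using System;
using System.Data.Services.Client;
using System.Linq;
using System.Net;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// TableServiceEntity supplies PartitionKey, RowKey, and Timestamp, and already carries
// the [DataServiceKey("PartitionKey", "RowKey")] attribute mentioned in the notes.
public class CustomerEntity : TableServiceEntity
{
    public CustomerEntity() { }
    public CustomerEntity(string region, string id) : base(region, id) { }
    public string Name { get; set; }
}

class TableSample
{
    static void Main()
    {
        ServicePointManager.DefaultConnectionLimit = 12;   // default of 2 throttles parallel calls

        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=base64key==");
        CloudTableClient tableClient = account.CreateCloudTableClient();
        tableClient.CreateTableIfNotExist("Customers");

        // Insert: same partition key, so up to 100 of these can be batched atomically.
        TableServiceContext context = tableClient.GetDataServiceContext();
        context.AddObject("Customers", new CustomerEntity("europe", "cust-001") { Name = "Ada" });
        context.AddObject("Customers", new CustomerEntity("europe", "cust-002") { Name = "Brad" });
        context.SaveChangesWithRetries(SaveChangesOptions.Batch);   // one entity group transaction

        // Query: AsTableServiceQuery handles continuation tokens for you.
        var query = (from c in context.CreateQuery<CustomerEntity>("Customers")
                     where c.PartitionKey == "europe"
                     select c).AsTableServiceQuery();

        foreach (CustomerEntity customer in query)
        {
            customer.Name = customer.Name.ToUpperInvariant();
            context.UpdateObject(customer);
        }
        context.SaveChangesWithRetries();   // one request per entity unless batched as above
    }
}
```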
  • Queues
    • Provide reliable delivery of messages
    • Allow loosely coupled workflow between roles
      • Work gets loaded into a queue
      • Multiple workers consume the queue
    • When dequeuing a message, specify an "invisibility time" which leaves the message in the queue but makes it temporarily invisible to other workers
      • Allows for reliability.
    • APIs (see the sketch at the end of the post)
      • Create a CloudQueueClient using account and credentials
      • Create a CloudQueue using the client and GetQueueReference - queue name
      • CreateIfNotExist to create it if not there
      • Create a CloudQueueMessage with content
      • Use CloudQueue.AddMessage to add it to the queue
      • Use CloudQueue.GetMessage to get it out (passing invisibility time)
    • Tips on Queues
      • Messages up to 8 KB in size
        • Put larger payloads in a blob and send the blob pointer as the message
      • Remember that a message can be processed more than once.
        • Make processing idempotent so duplicates are not a problem
      • Assume messages are not processed in any particular order
      • Queues can handle up to about 500 messages/second.
        • For higher throughput - batch items into a blob and send a message with reference to blob containing 10 work items.
        • Worker does 10 items at a time.
        • Increased throughput by 10x.
      • Use DequeueCount to remove "poison messages" that seem to be repeatedly crashing workers.
      • Monitor message count to increase/decrease worker instances using service management APIs.
    • Q: can you set priorities on queue messages?
      • A: No - would have to create different queues
    • Q: Are blobs stored within the EU complying with the EU privacy policies?
      • A: Microsoft has a standard privacy policy which we adhere to.
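Finally, a sketch of the queue flow and tips above (placeholders for the account and queue name; the poison-message threshold of 3 is an arbitrary choice of mine):

```csharp
using System;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class QueueSample
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=http;AccountName=myaccount;AccountKey=base64key==");
        CloudQueueClient queueClient = account.CreateCloudQueueClient();
        CloudQueue queue = queueClient.GetQueueReference("workitems");
        queue.CreateIfNotExist();

        // Producer: messages are limited to 8 KB, so put big payloads in a blob
        // and enqueue a pointer instead.
        queue.AddMessage(new CloudQueueMessage("process blob: workitems/batch-42"));

        // Consumer: the message stays in the queue but is invisible for 30 seconds.
        CloudQueueMessage msg = queue.GetMessage(TimeSpan.FromSeconds(30));
        if (msg != null)
        {
            if (msg.DequeueCount > 3)
            {
                // Poison message - it keeps coming back, so get it out of the way.
                queue.DeleteMessage(msg);
                return;
            }

            DoWork(msg.AsString);        // must be idempotent: it may run more than once
            queue.DeleteMessage(msg);    // delete only after the work succeeds
        }
    }

    static void DoWork(string payload) { Console.WriteLine(payload); }
}
```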