I am attending the Web 2.0 Expo in New York this week and using the blog to share notes from sessions that may interest API users.
Geir Magnusson Jr., VP of Engineering at 10gen Inc., presented "The Sequel to SQL: Why You Won't Find Your RDBMS in the Clouds".
Notes from the session:
The cloud has intrinsic capabilities that differ from those of conventional computing, and this will change how we approach software development even for ordinary problems.
Geir defined the cloud as a composition of SaaS (Salesforce), PaaS (Platform as a Service: 10gen, AppEngine), TaaS (Tools as a Service: Amazon SimpleDB), and HaaS (Hardware as a Service: Amazon EC2). He polled the audience on EC2 usage and was surprised by how few were using it.
He mentioned that startups no longer need to spend their own money or seed funding on hardware and similar infrastructure.
Advantages:
no need to own and maintain equipment
3. Geo Diversity
Provide local access
4. Elastic resource availability
Get service more quickly
Disadvantage:
1. Data has to be duplicated (or partitioned) and distributed for safety and geo-availability.
For the problem of scaling when joins must span machines, the speaker proposed "Plate Spinning" on EC2: treat the cloud from the HaaS perspective and run multiple VMs of MySQL, one master and many slaves. When the master instance goes away, designate a new master from the remaining instances and add another instance. In effect, the cluster is moved from local hardware to Amazon's cloud.
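To make the failover step concrete, here is a minimal Python sketch of my own (not from the talk). It only models the bookkeeping: the Cluster class, the vm-N instance names, and the handle_master_loss method are all hypothetical, and nothing here talks to MySQL or EC2.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Cluster:
    """Toy model of one master plus several slaves (no real MySQL or EC2 calls)."""
    size: int
    master: str = ""
    slaves: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def __post_init__(self):
        self.master = self._launch()
        self.slaves = [self._launch() for _ in range(self.size - 1)]

    def _launch(self):
        # Stand-in for starting a new VM; returns a hypothetical instance name.
        return f"vm-{next(self._ids)}"

    def handle_master_loss(self):
        # Promote one of the remaining slaves to master ...
        self.master = self.slaves.pop(0)
        # ... then start a fresh instance so the cluster is back at full size.
        self.slaves.append(self._launch())


cluster = Cluster(size=3)        # vm-1 is master, vm-2 and vm-3 are slaves
cluster.handle_master_loss()     # vm-1 goes away
print(cluster.master, cluster.slaves)
```

A real setup would also have to detect the failure, repoint replication, and catch up the new slave, none of which is shown here.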
The second option covered was BigTable via Google's AppEngine, which enables large-scale storage of data objects called entities (each a set of name/value pairs called properties). A member of the AppEngine team was present and clarified that writes are always consistent.
The third option covered was Amazon's SimpleDB, a tabular store with domains (like tables) that contain items (like rows). An item is a set of attribute/value pairs. It was probably meant as a metadata store for S3 (up to 256 attributes, values <= 1k). It is eventually consistent. A query is limited to 250 items. SimpleDB treats everything as a string, and comparisons are lexicographical, so they recommend comparing positive numbers only. It has a RESTish API in which queries are expressed as strings.
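The lexicographical comparison is easy to trip over, so here is a small Python illustration of my own. It uses plain Python strings rather than the SimpleDB API, and the encode helper and its width of 10 are assumptions made for the example.

```python
# Compared as strings, "10" sorts before "9" because "1" < "9".
print("10" < "9")   # True


def encode(n, width=10):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if n < 0:
        raise ValueError("plain zero-padding only orders non-negative numbers")
    return str(n).zfill(width)


values = [9, 10, 250, 1024]
encoded = [encode(v) for v in values]

# Once padded to a fixed width, sorting the strings preserves numeric order.
assert sorted(encoded) == encoded
print(encode(9) < encode(10))   # True
```

Negative numbers would need an extra step, such as adding a fixed offset before padding, which is presumably why sticking to positive numbers is the recommendation.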
The fourth option is 10gen's Mongo, the database for the 10gen platform (which has a JVM). It is schemaless: a database is a set of collections, and a collection is a bag of objects. Language bindings can be native; the examples shown were in JavaScript, Python, and Ruby. Unlike the other options above, it also provides a cursor, along with features such as indexing, count, and skip.
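To picture "a collection is a bag of objects" with a cursor on top, here is an in-memory Python sketch of my own. It is not the Mongo API or any real driver: the Collection and Cursor classes and their save, find, skip, limit, and count methods are hypothetical names chosen to mirror the notes, and indexing is left out.

```python
from itertools import islice


class Cursor:
    """Lazy iterator over matching objects; skip/limit are applied on iteration."""

    def __init__(self, objects, query):
        self._objects = objects
        self._query = query
        self._skip = 0
        self._limit = None

    def skip(self, n):
        self._skip = n
        return self

    def limit(self, n):
        self._limit = n
        return self

    def _matches(self):
        # An object matches when every queried field has the requested value.
        for obj in self._objects:
            if all(obj.get(k) == v for k, v in self._query.items()):
                yield obj

    def count(self):
        # Counts every match, ignoring skip/limit (a choice made for this sketch).
        return sum(1 for _ in self._matches())

    def __iter__(self):
        stop = None if self._limit is None else self._skip + self._limit
        return islice(self._matches(), self._skip, stop)


class Collection:
    """Schemaless bag of objects: each saved dict may have different fields."""

    def __init__(self):
        self._objects = []

    def save(self, obj):
        self._objects.append(dict(obj))

    def find(self, query=None):
        return Cursor(self._objects, query or {})


users = Collection()
users.save({"name": "ann", "lang": "python"})
users.save({"name": "bob", "lang": "ruby", "age": 30})
users.save({"name": "cy", "lang": "python", "tags": ["admin"]})

cursor = users.find({"lang": "python"})
print(cursor.count())
print([u["name"] for u in cursor.skip(1)])
```

The point of the cursor is that results are pulled one at a time as you iterate, rather than returned as one capped batch.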
The fifth option was AppJet's persistent object database; AppJet is a JavaScript-based app server in the sky. You can save arbitrary objects and collections. It provides sort, query, limit, and skip.
Takeaway: no one is doing relational, and data is treated in a clustered model. He also covered eventual consistency and how its pitfalls are prevented by providing atomic updates.
Version based locking was considered but not used.
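The notes don't record how version-based locking would have worked, so the following is only my own Python sketch of the general technique (often called optimistic locking). The VersionedStore class, its read and write methods, and the VersionConflict exception are hypothetical and not taken from any of the products above.

```python
class VersionConflict(Exception):
    """Raised when a writer's version is older than the stored one."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            # Someone else wrote since this client last read; reject the write.
            raise VersionConflict(f"{key}: expected {expected_version}, found {current_version}")
        self._data[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
version, _ = store.read("counter")      # two clients both read version 0
store.write("counter", 1, version)      # the first writer succeeds

try:
    store.write("counter", 5, version)  # the second writer holds a stale version
    conflict = False
except VersionConflict:
    conflict = True

print(conflict, store.read("counter"))
```

With this scheme the losing client has to re-read and retry, whereas an atomic update lets the store apply the change in a single step.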
Geir said another project to watch is Drizzle, a fork of MySQL for cloud purposes that throws out stored procedures and views. Other projects to watch are CouchDB and Hadoop (distributed file system + Map/Reduce engine).