Trip Report – HPTS (High Performance Transaction Systems) Workshop – Oct 25th-28th, Asilomar, Pacific Grove, CA. (www.hpts.ws). This is Part 1 of the trip report… I am chiseling my way through my notes and expect this to take a couple more blog entries before I can describe everything I plan to cover. Believe it or not, this 11 page “Part 1” is only taking us through Monday morning – there’s Monday afternoon, all Tuesday, and Wednesday morning yet to cover!
HPTS was founded by Jim Gray, Dieter Gawlick, Andreas Reuter, and some other industry luminaries in 1985. It is a workshop designed to bring together practitioners and academics working in super large, scalable, high throughput systems. There is a lot of overlap between this community and the folks doing DBMS systems. In the early days, the biggest challenges included getting 1000 transactions per second through a mainframe. At the time, that felt like getting a man to land on the moon and come back safely. Now, this includes huge web based sites like Amazon, eBay, Google, and Bing. Our area of interest also includes integrating massively complex heterogeneous Enterprise and Internet (B2B) systems. Of course, we are fascinated with hardware trends, networking, and all sorts of geeky stuff.
I first attended HPTS in 1985. It was a magic opportunity to be with legends in the industry. Gene Amdahl drove up in his Rolls Royce and parked next to my beat up old van. Mike Blasgen(System R), Mike Stonebraker, Bruce Lindsay, Mohan, Andy Heller, Dieter Gawlick, Hector Garcia-Molina, Irv Traiger, Phil Bernstein, Jim Gray, and many more folks who had become well known to me through their papers were there in the flesh and treating me (a 29 year old punk) with kindness and respect. It was a magical 3 day gathering.
Let me explain the Asilomar Conference Center (www.visitasilomar.com). Asilomar sits right by the Pacific Ocean between Pebble Beach and Monterey in the Pine Trees and includes dunes leading right to the beach. It was built as a YWCA around 1910 or so and contains a number of beautiful historic buildings in the arts and crafts style built by architect Julia Morgan. In the 1950s, it was sold to the state of California which has operated it as a conference center (and built more buildings in the 1960s). The rooms are rustic (no TVs or phones – crappy mattresses), the food is served in a large dining hall when the bell rings on the roof and it has a magic feel. It is normal to walk out of your room, hear the surf, and see deer walking around. I absolutely love the evenings by the real wood fireplaces.
In 1985, Gray decided we could all go cheap and lined us up to sleep four in a room. I shared a room with Dieter Gawlick (inventor of IMS/FastPath among other things), Hector Garcia-Molina, and Theo Haerderer (a wonderful professor from Germany). After the first night, Hector informed us we ALL snored too much and moved to stay in a hotel down the street. For the next 8 or 10 gatherings, we were two to a room but now most of us pay a bit more to have a private room.
I’ve attended HPTS every two years (for three days) since 1985. I was program and general chair in 1989 and general chair in 2007. Since Jim Gray’s disappearance, I’ve assumed the role of local arrangements chair and “godfather” – kinda’ watching out for the whole workshop. This year, James Hamilton (now of Amazon) was general chair and Margo Seltzer (of Harvard) was the program chair. We had a wonderful (but exhausting) three days and I will tell you about them soon.
One of the traditions is that the folks leading HPTS wander around during the first two days trying to figure out who gets suckered into doing the work as the next workshop happens two years later. I am thrilled to report that Mohan (from IBM) will be the general chair in 2011 and Mike Carey (now of UC Irvine) will be the program chair. I will remain as local arrangements chair and “godfather”.
So, every HPTS starts with Sunday dinner and a casual beer/wine afterwards. Monday is all day presentations and after dinner we have poster sessions (volunteer 10 minute talks – we had 15 this year). Tuesday is all day presentations and this time we had a 90+ minute panel after dinner. Finally, we have sessions Wednesday morning and adjourn after lunch. It is a long, exhausting, and exhilarating 3 to 4 days.
Of course, the most important part of the workshop is the breaks and the opportunity to talk with people working in the same area as you are working. In addition to the grizzled old luminaries, this year we went out of our way to get more graduate students and invited 16 of the brightest and best from Berkeley, Stanford, MIT, Harvard, Brown, and more. I spent each of the three evenings staying up late talking by the fire with these wonderful young folks (and some other diehard old farts). As I type this the following weekend, I am still tired from the sleep shortage!
Session 1) The New Memory Hierarchy –
Session Chair: David Cheriton (Stanford)
Steve Kleiman – Chief Scientist at NetApp: “Flash on Compute Servers”
So, Steve Kleiman is the Chief Scientist at NetApp so you would expect him to be focused on network attached storage and we were not disappointed.
Steve started with a discussion of hardware trends (not all of which I was able to capture in my notes). He observed that with multi-core and VMs, more bandwidth (BW) and IOs per Second (IOPS) are needed for each node. Also, as a trend, Steve pointed out that flash memory has more than 5X the IOPS and less than 100 microseconds access time.
The original case for network storage lies around the benefits of aggregating disk which include more random IOPS and bandwidth as well as the management simplicity of concentrating data into a centralized place. It has been the case that a single application per host has somewhat limited ability to consume bandwidth.
Considering flash in a networked storage environment, it does not have the same benefits of sharing. Since there are enough IOPS in a flash drive to placate most apps, the equation changes. Is network storage fast when the network latency is noticeable compared to flash? We can do 5 SSDs to get 40,000 IOPS but will the latency be a problem?
We will see flash on compute servers. Servers can already accommodate flash and it is cheap enough with 250GB < 10% of the server cost.
A new tier in the storage hierarchy: Steve argues we will see storage split into two tiers: the IOPS tier and the capacity tier. The IOPS-Tier will offer the best $/IOPS with low latency and will have sufficient capacity for the active data in most typical enterprise systems. It will be the new “primary storage”. The Capacity-Tier will offer the best $/GB – Deduplication, compression, and good serial performance.
Steve then outlined three broad categories of Host-Based Flash usage. DAS (which I think means “Direct Access Storage but my notes failed me), Primary Storage Service, and Cache.
· Host-Base Flash: DAS: Single point of failure and so you would use normal backup or application based replication like Exchange or Oracle). You can have boot files, VM, temp files.
· Host-Based Flash: Primary Storage Service: File or Block. Requires mirroring or RAID to peers. Allows capacity sharing. Steve believes it is likely that Microsoft and VMWare are likely to compete here – the integration of this is complex. Making reliable primary storage is hard and this may impact the roles of folks in IT.
· Host-Based Flash: Cache –File or block. This will be a writethrough cache. If you do a write-back you will occasionally lose stuff. Using a cache can allow for a centralized data management (in a remote network storage server). This can help with keeping the current IT storage management roles by providing an automatic “tiering”.
So, data management spans the IOPS tier and the Capacity tier. You need to have automated data movement and global access. It is impractical to have manual placement at huge scale. It is going to be important for us to have a technology independent language to specify the desired properties for the data. Say what properties you want and not how to get them. New SLO (Service Level Objectives) for Max Latency, Bandwidth, Availability, etc.
In summary, the new host-based flash is the new IOPS-Tier. It will replace high-performance primary storage in the cloud. This IOPS-Tier s is not going to be economical as a secondary or as archival since the $/GB is higher than spinning media. The Capacity Tier will be implemented as network storage. It is optimized for the best $/GB. Data management between these two tiers will be important.
There was a lot of discussion around the SLO (Service Level Objectives). Steve indicated that their vision was not super-flexible set of knobs but more a small set of options for defining the data characteristics. Margo offered that one aspect of what was important was the “value” of different parts of data which can, in turn, tell the system how hard to work to make the data remain available.
Someone asked if there was any performance analysis for 3-tier storage (memory, IOPS-Tier, Capacity-Tier) over more traditional 2-tier storage (Memory and Disk). Steve answered that there had not been an explicit performance analysis.
Stonebraker observed that enterprises seem to like NAS/SAN but that the Cloud guys are using direct attached storage. Are the forward looking guys walking from NAS/SAN? Steve replied there are people very interested in NAS/SAN because it gives them a way to control the data and make servers stateless. Adding in the IOPS-Tier is a simple extension.
This was a fascinating and engaging presentation!
Memory Technologies for Data Intensive Computing –Andy Bechtolshein (Sun Computers)
What an honor to get Andy to come and present to us. Unfortunately, I didn’t get a chance to chat with him personally as he was only at the workshop for a short while.
Andy is a well respected and talented hardware engineer who was a founder of both Sun Microsystems and of Arista Network. His slides are not going to be posted and I typed as aggressively as I could during his lightning fast presentation. I would recommend you look both at my friend James Hamilton’s blog post of Andy’s talk at: http://perspectives.mvdirona.com/2009/10/26/AndyBechtolsheimAtHPTS2009.aspx and then read my notes attached below. I have not attempted to coalesce the information between James’ notes and mine and there may be overlap. James has done a nice job cleaning his up and I will focus on some of the other sessions in my editing.
Here’s my rough notes:
è Andy speaks VERY fast and I lost a lot of what was covered – The slides were very dense – Also, I apologize for some inaccuracies… These are RAW notes that I made during the fast paced presentation and some of them don’t completely make sense to me.
Semiconductor Technology Roadmap
Number of devices on the chip still doubling
Flash is increasing faster than logic and DRAM – Flash faster than Moore’s law
Between 2010 and 2022, Flash should increase by 100X – After that (10nm) optical problems may become severe – don’t know after 2022
<more content lost…>
· Carbon-nano tubes
· Graphene Nanoribbons
· … <more that I lost…>
Silicon Roadmap Predictions
· 128X increase in transistors by 2022
· 1000 cores per chip by 2022
CPU Module Roadmap
· IO bandwidth will not allow the same bandwidth per core. IO going a little faster but number of pins hasn’t increased in 10 years.
· 1000 cores would need several terabytes of bandwidth to memory. Of course, this would need power and cooling
· <more lost…>
· Power per core --
· Clock rates not increasing – need lower power level per core
· <more lost…>
Power efficiency – Increasing frequency – a Square effect on the power
· CPU power strategy –
· Clock rates look flat
· High clock rates are less power efficient
· To reduce power, simplify architecture
· <more lost…>
Multi-Chip 3D packaging
· Thru-Si via Stacking
· Need to combine CPU and Memory on one module
High Density 3D Multi-Chip Module
Benefits of MCM Packaging
· Enables much higher memory bandwidth
· More channels, wider interfaces, faster I/o
· Greatly reduces power
Challenges with MCM packaging
· Total Memory – 64 device <…??...>
· With 64 GNB/device – 4TB
· Assuming 1K cores – 4BG/core
MCM enables Fabric I/O
· 2010 –1 32 Gb/direction
· 2020 1.73Tbps/direction
· Mesh or higher radix fabric topologies
Expected Link Data Rates
· 10GBps shipping today
· 2012 – 25 GBps
· 2016 – 50 GBPS
· 2020 – 100 GBps
· Forget Hard Disks – Disk are not getting faster
Flash Experience so far
· Not easy to get to 1000000 IOPS
· Writes are a problem
· SAS/SATA interface not optimal
· Direct PCI-Express interface looks more promising. – Move FLASH closer to the CPU -- Lower latency and more throughput
Flash in the Memory Hierarchy
· Flash is not random access more like a block device -- random access about 1000 slower than DRAM
· Flash can be used as a stable storage device -- writes are committed – Supercap magic behind the scenes.
· Tremendous throughput and size – terabytes of capacity
· Latency expected to drop by a factor of 10
· Flash transfer rates will double each year for the next several years. – 250MBS/sec 2012
· 2004 -- $100 GB
· 2009 $1 – GB
· 2012 -- $.25 / GB
· Prices are dropping dramatically –
· FLASH will be very expensive, very dense,
· Costs falling by 50% per year
· Access times falling by 50% per year
· Moore’s Law for at least 10 more years
· Frequency Gains not happening
· Flash will be interesting
· Limits of App Parallelism
· Need to exploit Parallelism
Session 2) Using the New Memory Hierarch
Session Chair: David Cheriton (Stanford)
Scaling-Out without Partitioning – Phil Bernstein and Colin Reid (Microsoft)
This session was, in many ways, one of the most interesting at HPTS. It was not new to me since I had been privileged to spend a number of hours talking to Colin (and a bit to Phil) about Hyder back at Microsoft. Phil did a great job explaining a radical new approach to scale-out and serializable database!
Hyder is a software stack for transactional record management. It can offer full database functionality and is designed to take advantage of flash in a novel way.
Most approaches to scale-out use partitioning and spread the data across multiple machines leaving the application responsible for consistency.
In Hyder, the database is the log, no partitioning is required, and the database is multi-versioned. Hyder runs in the App process with a simple high-performance programming model and no need for client server. This avoids the expense of RPC.
Hyder leverages some new hardware assumptions. I/Os are now cheap and abundant. Raw flash (not SSDs – raw flash) offers at least 10^4 more IOPS/GB than HDD. This allows for dramatic changes in usage patterns. We have cheap and high performance data center networks. Large and cheap 64-bit addressable memories are available. Also, with many-core servers, computation can be squandered and Hyder leverages that abundant computation to keep a consistent view of the data as it changes.
The Hyder system has individual nodes and a shared flash storage which holds a log. Appending a record to the log involves a send to the log controller and a response with the location in the log into which the record was appended. In this fashion, many servers can be pushing records into the log and they are allocated a location by the log controller. It turns out that this simple centralized function of assigning a log location on append will adjudicate any conflicts (as we shall see later).
The Hyder stack comprises a persistent programming language like LING or SQL, an optimistic transaction protocol, and a multi-versioned binary search tree to represent the database state. The Hyder database is stored in a log but it IS a binary tree. So you can think of the database as a binary tree that is kept in the log and you find data by climbing the tree through the log.
The Binary Tree is multi-versioned. You do a copy-on-write creating new nodes and replace nodes up to the root. The transaction commits when the copy-on-write makes it up to the root of the tree.
For transaction execution, each server has a cache of the last committed state. That cache is going to be close to the latest and greatest state since each server is constantly replaying the log to keep the local state accurate [recall the assumption that there are lots of cores per server and it’s OK to spend cycles from the extra cores]. So, each transaction running in a single server reads a snapshot and generates an intention log record. The transaction gets a pointer to the snapshot and generates an intention log record. The server generates updates locally appending them to the log (recall that an append is sent to the log controller which returns the log-id with its placement in the log). Updates are copy-on-write climbing up the binary tree to the root.
Log updates get broadcast to all the servers – everyone sees the log. Changes to the log are only done by appending to the log. New records and their addresses are broadcast to all servers. In this fashion, each server can reconstruct the tail of the log.
Now we get to transaction commit. All machines are marching ahead rolling the log forward. When an intention record is generated (indicating a proposed new root for the log), each machine in the cluster checks for a conflict. If any of the records updated conflicts with previously committed work, the transaction is aborted. Conflict in this case is traditional read/write conflict using all the classic set of rules for various degrees of consistency. If there are no conflicts, the transaction is committed. If there are conflicts, the transaction is aborted.
Because there is an append in the log controller which assigns order, each intention record will race to the log and the controller will pick a winner. Because log ordering is deterministic, each of the server nodes can reliably come to the same conclusion about which transaction won the race and, hence commits rather than aborts in the face of conflict. Conflict detection and transaction commit is done redundantly and predictably at each node.
So, each node generates an intention log for a transaction it wants to do. Every server checks for conflicts and then either aborts or commits. The log order provides determinism.
Performance of Hyder: The system scales without partitioning. The system-wide throughput of update transactions is bounded by the update pipeline. It is estimated this can perform 15K update transactions per second over a 1GB Ethernet and 150K update transactions per second over a 10GB Ethernet. Conflict detection and merge can do about 200K txs per second.
The abort rate on write-hot data is bounded by the transaction’s conflict zone. This is determined by the transactions end-to-end latency. This conflict zone is about 200 microseconds in the prototype. It is estimated that this would handle about 1500 TPS (assuming all the transaction have conflicts).
Major Technologies: This is using flash as append only (and it is not using SSD flash but raw flash with a special controller). The custom controller over raw flash has mechanisms for sync and fault tolerance. Storage is striped with a self-adaptive algorithm for storage allocation. It includes a fault tolerant algorithm for a totally ordered log. Also, it has a fast algorithm for conflict detection and merging of records.
Status: Most of Hyder has been prototyped but there is a long way to go. Phil Bernstein and Colin Reid are working on a paper and HPTS is the first time Hyder has been discussed publicly.
Again, this was a wonderful talk with some fascinating and different ideas. Great work!
A Performance Puzzle: B-Tree Insertions are Slow on SSDs (Bradley Kszmaul – Tokutek – Research Scientist at MIT)
So, this talk was about some surprises in SSD performance in supporting MySQL. Much of the discussion was about the datasheets for some of the drives not matching the measured reality.
Bradley had measured the value of running MySQL against SSDs and expected the SSDs (with higher IOPS) to get better Btree performance. Here’s my rough and raw notes:
Intel X25E Specs
· Read BW up to 250 MB/s
· Write BW up to 170 MB/s
· Random 4KB read rate – 35 KIO/s
· Random 4KB
· <lost some…>
Measure Berkeley DB
· Trending to 4500 write per second
Performance Model --
· Startup cost plus bandwidth cost
· Looking at read performance as a function of block size
· As block size gets bigger, bandwidth goes up
Discussion à Margo points out that many Filesystems see large blocks and do read ahead. That can perturb the results…
When you read and write large blocks, Bradley sees Filesystems doing the right thing and keeping contiguous blocks adjacent.
The datasheet for the drive is not matching the reality for larger block sizes
Mixing reads and writes seems to impact the performance quite a bit…
What Block Size to Use?
· For point-queries, BTrees are insensitive to block size.
· For range queries, block size if important
· Tension: Large block sizes make range queries faster; Small block sizes make exact match faster
· Set block size so half of the time is accounted by seek, half by bandwidth
· Reads: 50KB for SSD but 500KB-1000KB for rotating disk
· Writes 10KB SSD , 500KB-1000KB for HDD
· Mixed read/write 21KB for SSD
· Use data structures that are fast for any block size
· Also can speed insertions without slowing searches
· Tokutek’s MySQL storage engine uses these ideas.
Discussion: James Hamilton layering of software and release by release changes makes it hard to get predictable behavior. When all these releases are separately evolved, the performance behavior is very unpredictable. Bradley: fragmentation starts messing with this, too. James: all the layers interact and we need to measure by the workload and adapt. The complexity makes performance very hard to predict.
Implications of Storage Class Memories (SCMs) on Software Architectures (C. Mohan – IBM)
Storage Class Memory (SCM)
Mohan gave a very thought provoking talk on the implications of Storage Class Memories. Where do we need them? What good will they do?
If we have non-volatile memory that is as easy to access as DRAM, will we have corruption problems? Will we see a running system trash the non-volatile memory like it trashes DRAM? One of the advantages of our current system design is that we have to pass a “sanity test” of an IO driver or messaging stack to get nastiness outside of a running system. That dramatically reduces the chance of this happening.
Anyway, Mohan’s talk was really thought provoking and enjoyable! It was good to see my old friend back at HPTS after taking a stint in India for a few years.
· Blurs the distinction between DRAM and Disk
Industry SCM Activities
· Intel/ST spun out of Numonyx (FLASH and PCM)
· Samsung, Numonyx sample PCM chips –
· <lost som>
· DRAM plus SWCM – Fast and persistent
· Persistent storage will not lose data
Competing technologies for SCM
· Phase Change RAM (most promising)
SCM as a part of the storage stack
· The distinction between RAM and Disk – the line in between is blurring
· Memory-like and also Storage-Like
SCM Design Triangle
· Speed – endurance – Cost/bit
· Speed with Write endurance is more memory-like
· When density increases, the failure rate goes up as bits flip more often
· SCM is on Moore’s law
PCM is fast
· 100-1000 ns
· SCM may be used as Memory
· SCM may, alternatively be used as a storage (through IO controllers)
· Discussion à Pat: Bugs in non-volatile memory can sometimes be preserved
PCM Use Cases
· PCM as Disk
· PCM as Staging device
· PCM as memory
· PCM as extended memory
Exploring DBMS as an exploiter of PCM
· Should log records be written directly to PCM?
· Should log be more like Disk (first to buffer and then forced to disk)?
· PCM is not as reliable as disk à do you need to offload to a disk
· If you use Group Commit, it is unclear that PCM helps a DBMS
PCM replaces DRAM – Buffer Pool in PC
· The PCM Buffer Pool access will be slower than DRAM
· Writes will suffer more than reads!
· Should we instead have a DRAM BP backed by a PCM BP?
· Similar to DB2 z in parallel sysplex environment with BPs in coupling facility (CF)
· But DB2 situation has well defined rules on when pages a move from DRAM CP to CF BP
· Variation was used in SafeRAM work at MCC in 1989
Assume the whole DB fits in PCM
· Apply old main memory DB concepts design directly
· Shouldn’t we leverage persistence specially?
· Every bit change persisting may not be good!
· Today’s failure semantics has flexibility on tracking changes to DB pages.
· Memory overwrites will cause more damage
· If every write assumed to be persistent as soon as write completes, then L1 and L2 caching can’t be leveraged.
Assume whole DB fits in memory
· Need to externalize periodically since PCM won’t have good endurance
· If DB spans both DRAM and PCM, hard to figure out what goes where!
What about Logging?
· If PCM is persistent and whole DB is in PCM, do we need logging?
The end of Monday morning! More soon…