Big Data, Big Guidance Problem

You'd think that, after all the years I've been writing guidance for Microsoft technologies and tools, I'd have at least grasped how to organize the structure of a guide ready to start pouring content into it. But, just as we're getting into our stride on the Windows Azure HDInsight project here at p&p, it turns out that Big Data = big problem.

Let me explain. When I worked on the Enterprise Library projects, it was easy to find the basic structure for the guides. The main subdivision was around the individual application blocks, and for each one it seemed obvious that all you needed to do was break it down into the typical scenarios, the solutions for each one, the practical implementation details, and a guide to good practices.

In the more recent guide for migrating existing applications to Windows Azure (see Moving Applications to the Cloud) we identified the typical stages for moving each part of the application to the cloud (virtual machines, hosted services, cloud database, federated authentication, Windows Azure storage, etc.) and built an example for each stage. So the obvious subdivision for the guide was these migration stages. Again, for each one, we document the typical scenarios, the solutions for each one, the practical implementation details, and a guide to good practices.

In the cases of our other Windows Azure cloud guides (Developing Multi-tenant Applications and Building Hybrid Applications) we designed and built a full reference implementation (RI) sample that showcases the technologies and services we want to cover. So it made sense to subdivide the guides around the separate elements of the technologies we are demonstrating - the user interface, the data model, the security mechanism, the communication patterns, deployment and administration, etc.

But none of these approaches seems to work for Big Data and HDInsight. At first I thought I'd just lost the knack of seeing an obvious structure appear as I investigated the technology. I couldn't figure out why there seemed to be no instantly recognizable subdivisions on which to build the chapter and content structure. And, of course, I wasn't alone in struggling to see where to go. The developers on the team were suddenly faced with a situation where they couldn't provide the usual type of samples or RI (or, to use the awful marketing terminology, "an F5 experience").

The guidance structure problem, once we finally recognized it, arises because Big Data is one of those topics that - unlike designing and building an application - doesn't have an underlying linear form. Yes, there is a lifecycle - though I hesitate to use the term "ALM" because what most Big Data users do, and what we want to document, is not actually building an application. It's more about getting the most from a humungous mass of tools, frameworks, scenarios, use cases, practices, and techniques. Not to mention politics, and maybe even superstition.

So do we subdivide the guide based on the ethereal lifecycle stages? After collecting feedback from experts and advisors it looks as though nobody can actually agree what these stages are, or what order you would do them in even if you did know what they are. The only thing they seem to agree on is that there really isn't anything concrete you can put into a "boxes-and-arrows" Visio diagram.

What about subdividing the guide on the individual parts of the overall technology? Perhaps a chapter on Hive, one on custom Map/Reduce component theory and design, one on configuring the cluster and measuring performance, and one on visualizing the results. But then we could easily end up with an implementation guide and documentation of the features, rather than a guide that helps you to understand the technology and make the right choices for your own scenario.

Another approach might be to subdivide the guide across the actual use cases for Big Data solutions. We spent quite some time trying to identify all of these and then categorize them into groups, but by the time we'd got past fifteen (and more were still appearing) it seemed like the wrong approach as well. Perhaps what's really big about Big Data is the amount of paper you need to keep scrawling a variety of topic trees and ever-changing content lists.

What becomes increasingly clear is that you need to keep coming back to thinking about what the readers actually want to know, and how best you can present this as a series of topics that flow naturally and build on each other. In most previous guides we could take some obvious subdivision of content and use it to define the separate chapters, then define a series of flowing topics within each chapter. But with the whole dollop of stuff that is Big Data, the "establishing a topic flow" thing needs to be done at the top level rather than at individual chapter level. Once we figured that out, all the other sections fell naturally into place in the appropriate chapters.

So where did we actually end up after all this mental gyration? At the moment we're going with a progression of topics based on "What is it and what does it do?", "Where and why would I use it?", "What decisions must I make first?", "OK, so basically how do I do this?" and "So now I know how to use it, how does it fit in with my business?" Then we'll have four or five chapters that walk through implementing different scenarios for Big Data such as simple querying and reporting, sentiment analysis, trend prediction, and handling streaming data. Plus some Hands-on Labs and maybe a couple of appendices describing the tools and the Map/Reduce patterns.

Though that's only today's plan...