Data Science and the Cloud

More than perhaps any other computing discipline, Data Science lends itself to Cloud Computing in general, and Windows Azure in particular. That's a big claim, but before I offer some evidence, I need to explain what I mean by "Data Science". I've written before on Data Science, but since it's an evolving field, here's what I've observed as the areas that a Data Scientist focuses on:

  • Research - Standard researching techniques such as domain knowledge, data sources and impact analysis
  • Statistics - Focused on probability and descriptive statistics
  • Programming - At least one functional or object-oriented language, often Python, F#, LISP, Haskell, Java or JavaScript
  • Sources of data - Internal organizational data as well as external sources such as weather, economics, spatial, geo-political sources and more
  • Data movement - Traditional Extract, Transform and Load (ETL), along with ingress or referencing external data sources
  • Complex Event Processing (CEP) - Analyzing or triggering computing as data moves through a source
  • Data storage - Storage systems including distributed storage and remote storage
  • Data processing - Both single-node and distributed processing systems, RDBMS, NoSQL (Hadoop, Key/Value Pair, Document Store, Graph databases, etc)
  • Machine learning - Data-instructive programming as well as Artificial Intelligence and Natural Language Processing
  • Decision analysis - Interpreting processed data to identify a pattern or make a prediction, including data mining
  • Business Intelligence - Design of exploratory data, visualizations, business and organization impacts and communication to the stakeholders of the use of data and visualization tools

There are of course other aspects of data science, but I believe this list covers the majority of skills I've seen in individuals with the Data Scientist title. And it is normally an individual, or at least a very limited group of people. As you examine the list above, you can see this person requires a fairly extensive technical background, and in the domain knowledge area in particular, there's a pretty large time element. That isn't to say a very bright person couldn't ramp up on these areas, just that having all of that in your portfolio takes time.

Given that these are the skillsets, why is cloud computing well suited to assisting in the data science function?

It's obvious that a researcher needs good Internet skills, beyond simply referencing a Wikipedia article - although that's certainly a good thing to include from time to time. While searching isn't specific to Windows Azure, there are platform components that allow the programming function to call out to the web for data access. Windows Azure includes a platform that supports languages from Python to F#, JavaScript (including Node.js), Java and more.

Cloud computing allows the data scientist to access data stored in Windows Azure (Blobs, Tables, Queues, and RDBMSs as a service such as SQL Server and MySQL) as well as IaaS systems that can run full RDBMS systems such as SQL Server, Oracle, PostgreSQL and others. In addition, the Windows Azure Marketplace contains "Data as a Service", which offers free and fee-based data to include in a single application.

The Windows Azure Service Bus allows you to architect a CEP system, and SQL Server provides the StreamInsight feature for the same purpose. These can communicate with on-premises systems, Windows Azure IaaS and PaaS, and other data sources.
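To make the CEP idea concrete, here is a minimal sketch in plain Python. It does not use the Service Bus or StreamInsight APIs; the function name `detect_spikes`, the window size, the threshold and the sample readings are all made up for illustration. It only shows the general pattern of triggering on data as it moves through a stream, here by watching a sliding-window average.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def detect_spikes(
    readings: Iterable[float],
    window: int = 3,
    threshold: float = 30.0,
) -> Iterator[Tuple[int, float]]:
    """Yield (index, window_mean) whenever the mean of the last
    `window` readings exceeds `threshold`.

    Hypothetical helper for illustration only; a real CEP engine
    would also handle event time, ordering and persistence.
    """
    recent = deque(maxlen=window)  # holds only the newest `window` readings
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:  # wait until the window is full
            mean = sum(recent) / window
            if mean > threshold:
                yield index, mean


# Made-up sensor readings standing in for a live event stream.
stream = [20.0, 22.0, 35.0, 40.0, 41.0, 25.0, 18.0]
alerts = list(detect_spikes(stream))

for index, mean in alerts:
    print(f"alert at event {index}: window mean {mean:.2f}")
```

Because the function is a generator, it reacts to each event as it arrives rather than waiting for the whole data set, which is the essential difference between CEP and batch processing.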

For data storage and computing, Windows Azure allows everything from traditional RDBMSs as described to any NoSQL in IaaS, on both Windows and Linux operating systems. Statistical packages such as "R" are also supported. The elasticity allows the data scientist to spin up huge clusters, such as Hadoop or other NoSQL offerings, perform some analysis, and then stop the process when complete, saving cost and bypassing the internal IT systems (which may carry its own dangers, to be sure). Windows Azure also offers the High Performance Computing (HPC) version of Windows Server on Windows Azure, for large-scale massively parallel data processing, in constant and "burst" modes.
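The cost argument for elasticity is simple arithmetic, and a short Python sketch shows it. Every number here is hypothetical (the node count, the hourly rate, the number and length of analysis runs), and none of them reflects actual Windows Azure pricing; the point is only the shape of the comparison between a cluster that runs all month and one that runs only during analysis.

```python
def cluster_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """Total cost of running `nodes` machines for `hours` hours."""
    return nodes * hours * rate_per_node_hour


HOURS_PER_MONTH = 730   # roughly 24 hours * 365 days / 12 months
NODES = 20              # hypothetical cluster size
RATE = 0.5              # hypothetical price per node-hour, not a real Azure rate

# Cluster left running for the whole month.
always_on = cluster_cost(NODES, HOURS_PER_MONTH, RATE)

# Cluster started for six analysis runs of four hours each, then stopped.
on_demand = cluster_cost(NODES, 6 * 4, RATE)

savings = always_on - on_demand
print(f"always on: {always_on:.2f}")
print(f"on demand: {on_demand:.2f}")
print(f"savings:   {savings:.2f}")
```

With these assumed figures the on-demand cluster costs a small fraction of the always-on one, and the gap widens as the cluster gets larger or the analysis runs get shorter.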

In addition, Windows Azure has many services, such as the HDInsight Service (Hadoop on demand) and other analysis offerings, that don't even require the data scientist to stand up and manage a Virtual Machine in IaaS. For visualization, Microsoft has included the ability to use Excel with the HDInsight Service, and of course that works with all Microsoft Business Intelligence functions, and there are several other data visualization tools such as Power View. Microsoft also provides a tool where you can enter the tools you have in the Microsoft stack to learn more about the visualization options you have. The data scientist can also build visualizations in web pages, on iPhone, Android or Windows mobile devices, or in full client-code installations.

Because of the need for elasticity, multiple operating systems, and changing landscapes for data and processing, data science is well served by cloud computing - and by Windows Azure in particular because of the services and features offered, not only on Microsoft Windows but also on Open Source platforms.


Comments (1)
  1. K.Borne says:

    Great article! I would add one more area that a Data Scientist should focus on: Semantics — which includes ontologies, taxonomies, folksonomies, context metadata, RDF triplestores, and SPARQL — for knowledge representation, preservation, sharing, and reuse. Being able to extract context (and knowledge) from Big Data, then representing and storing that information for later reuse, and finally applying context in recommendation engines (and in other "Learning From Data" applications) is really hot stuff for cool Data Scientists. This is especially true in social data as well as scientific data applications.
