Data Science in a Box using IPython: Scipy and Scikit-Learn (3/4)

In the first two blogs of this series, we installed the IPython notebook with the minimum requirements: Creating a Linux VM on Windows Azure (1/4) and Installing IPython notebook (2/4). This third blog post will walk you through some of the common packages used for Data Science.  The SciPy and NumPy packages are usually mentioned together.  At this point,…

Data Science in a Box using IPython: Installing IPython notebook (2/4)

In the previous blog, we demonstrated in detail how to create a Windows Azure Linux VM. We will continue the installation process for the IPython notebook and related packages. Python 2.7 or 3.3? One of the discussions that came up at the Python in Finance conference was which version of Python you should use.  My personally…

Data Science in a Box using IPython: Creating a Linux VM on Windows Azure (1/4)

I just returned from the Python in Finance Conference in New York. I would like to thank Bank of America and Andrew Shepped for organizing the event.  It was not difficult to see the popularity of Python in the financial community; the event quickly sold out, with over 400 attendees.  I gave a 35-minute…

Enter the Big Data Matrix: analyzing meanings and relations of everything (2/2)

Running the Python example step by step: We explained the basic idea behind LSA, or latent semantic analysis, in the first part of this blog. We built a matrix by counting words in each document.  The set of document vectors is then sorted by the words they appear in. Then we applied SVD (singular value decomposition)…
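The steps the excerpt describes — build a term-document count matrix, then apply SVD — can be sketched in a few lines of NumPy. The toy documents below are hypothetical (not from the original post), and this is a minimal illustration of the idea, not the blog's actual example:

```python
import numpy as np

# Toy corpus (hypothetical documents, for illustration only)
docs = [
    "data science with python",
    "python for data analysis",
    "cooking pasta with tomato sauce",
]

# Term-document count matrix: one row per word, one column per document
vocab = sorted({w for d in docs for w in d.split()})
counts = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Apply SVD (singular value decomposition) to uncover latent structure
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep the top k singular values for a rank-k "semantic" approximation
k = 2
doc_vecs = Vt[:k, :].T  # each row is a document in the reduced space

# Cosine similarity between documents in the reduced space
norms = np.linalg.norm(doc_vecs, axis=1)
sim = (doc_vecs @ doc_vecs.T) / (norms[:, None] * norms[None, :])
print(np.round(sim, 2))  # the two Python docs land closer together than the cooking one
```

The point of truncating to `k` dimensions is that documents sharing vocabulary end up near each other even when their raw word-count vectors overlap only partially.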

Enter the Big Data Matrix: analyzing meanings and relations of everything (1/2)

Data Science is compute and labor intensive. In the previous blogs, we showed you how to find a dataset, clean it, and run a simple MapReduce sort on the dataset.  It was meant to give you a flavor of what data science is all about, and I also wanted to expose Big Data's rather labor-intensive…

New Breakthrough in Big Data Technologies: the NullSQL Paradigm shift

Mammoth, the NullSQL tool. Most of us by now understand the properties of big data.  Many of us are already working with big data tools, or NoSQL tools such as Hadoop.  I've spent a bit of my spare time over the last two months working on prototypes of a new set of tools that can help the…

Make another small step with the JavaScript Console Pig in HDInsight

Our previous blog, MapReduce on 27,000 books using multiple storage accounts and HDInsight, showed you how to run the Java version of the MapReduce code against the Gutenberg dataset we uploaded to blob storage.  We also explained how you can add multiple storage accounts and access them from your HDInsight cluster.  In this blog,…

MapReduce on 27,000 books using multiple storage accounts and HDInsight

In our previous blog, Preparing and uploading datasets for HDInsight, we showed you some of the important utilities used on the Unix platform for data processing, including GNU Parallel, find, split, and AzCopy for uploading large amounts of data reliably.  In this blog, we'll use an HDInsight cluster to operate on the…
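The Java MapReduce code the excerpt refers to isn't reproduced here, but the word-count pattern it follows can be sketched as a small Python simulation of the map and reduce phases (sample lines and function names are my own, not HDInsight-specific code):

```python
from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word, like a Hadoop streaming mapper
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    # Pairs arrive grouped by key; sum the counts per word, like the reduce phase
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Tiny stand-in for a corpus shard
sample = ["the quick brown fox", "the lazy dog"]
counts = dict(reducer(mapper(sample)))
print(counts)  # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 2, 'the': 2} minus any miscount
```

On a real cluster the framework handles the sort-and-shuffle between the two phases; `sorted()` stands in for that step here.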

Preparing and uploading datasets for HDInsight

In the previous blog, http://blogs.msdn.com/b/hpctrekker/archive/2013/03/30/finding-and-pre-processing-datasets-for-use-with-hdinsight.aspx, we went over how to get English-only documents from the Gutenberg DVD.  We showed you the Cygwin Unix emulation environment and also some simple Python code.  The script takes about 15 minutes to run.  Async or some simple task scheduling would probably have saved us some time in copying…
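The copying the excerpt mentions is I/O-bound, so simple task scheduling really can overlap it. As a hedged sketch (the function name and directory layout are my own, not from the original post), a thread pool version looks like:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_all(src_dir, dst_dir, workers=8):
    # Copy every file in src_dir to dst_dir concurrently; threads overlap
    # nicely for I/O-bound work like disk or network copies
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = [p for p in Path(src_dir).iterdir() if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # copy2 preserves timestamps; list() forces completion of all copies
        list(pool.map(lambda p: shutil.copy2(p, dst / p.name), files))
    return len(files)
```

For thousands of small Gutenberg files, this kind of overlap typically hides much of the per-file latency compared to a sequential loop.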

Finding and pre-processing datasets for use with HDInsight

Free datasets. There are many difficult aspects associated with Big Data; getting a good, clean, well-tagged dataset is the first barrier.  After all, you cannot really do much data processing without data!  Many companies have yet to realize and discover the value of their data.  For those of you who want to…
