If you want to explore machine learning, sometimes the hardest part is finding an interesting data set to play with.
In this post, I want to tackle how to find and use interesting data in Azure Machine Learning (AzureML). I’ll discuss using the sample datasets, importing data, and finding some interesting data sources.
If you don’t have an Azure account yet, you can sign up for a Microsoft Azure Free Trial.
If you are simply trying out Azure’s machine learning capabilities, or you are participating in a hackathon with limited time, start with the sample datasets that are included in Microsoft Azure Machine Learning Studio. There are currently 39 sample datasets available. They include adult census income binary classification, automobile price data, blood donation data, an RGB image of Bill Gates, book reviews from Amazon, breast cancer data, CRM data, flight delay data, forest fires data, German credit card data, IMDB movie titles, movie ratings and tweets, MNIST data, restaurant data, telescope data, weather data, and a dataset from Wikipedia. There are descriptions of the various sample datasets at http://azure.microsoft.com/en-us/documentation/articles/machine-learning-use-sample-datasets/.
To use the sample datasets in a machine learning project, open the Azure portal. Click on the “+NEW” button at the bottom, then select “Data Services”, “Machine Learning”, and “Quick Create”. Give your workspace a name. You may change the owner email if you want, but it must be a Microsoft account since you will sign in using that account. For now, Machine Learning is only available in the location “South Central US” (there is a table of which Azure features are available in which regions at http://azure.microsoft.com/en-us/regions/#services). Create a storage account or use an existing one.
After you have created the ML workspace, click on it, then open ML Studio and sign in. Click on the “Experiments” tab in the left-hand sidebar, click “NEW”, and then choose “Blank Experiment”. In the left sidebar, click on “Saved Datasets” to expand it, then drag and drop a sample dataset into the experiment. (There is a full walkthrough of how to create your first predictive solution at http://azure.microsoft.com/en-us/documentation/articles/machine-learning-walkthrough-develop-predictive-solution/.)
The sample datasets are nice when you are getting started, but if you want to use machine learning on your own data or on one of the data sources that I share below, you will need to import data. There is additional information on importing data at http://azure.microsoft.com/en-us/documentation/articles/machine-learning-import-data/.
There are multiple ways to import data.
1. Upload a new dataset from a local file. If you already have a batch of data stored on your local hard drive that you want to process, this is a good option. In ML Studio, click “+NEW” at the bottom. In the left sidebar, click “DATASET” and then “FROM LOCAL FILE”. In the dialog, provide the path to the file on your hard drive, give it a new name if you want (this is the name that will be displayed under “Saved Datasets”), choose the type of the dataset if it is not filled in automatically, and optionally provide a description. Click the check mark to complete the upload. The dataset should now be available in the left sidebar under “Saved Datasets”, and you can drag and drop it into an experiment.
2. Enter data. If you want to type in a quick dataset yourself, you can use the “Enter Data” module, which can store the data as CSV, TSV, ARFF, or SVMLight. In the ML Studio left sidebar, expand “Data Input and Output” to find the “Enter Data” module. Drag and drop “Enter Data” into the experiment. For more information, see the documentation page at https://msdn.microsoft.com/library/azure/4fbef0ab-2c8e-4a25-b5c4-7be76eac33d6.
3. Use the Reader. If you need to load data from the web or from Azure, use the “Reader” module. It can read from a web URL via HTTP, a Hive query, an Azure SQL database, an Azure table, an Azure blob, or a data feed provider. In the ML Studio left sidebar, expand “Data Input and Output” to find the “Reader” module. Drag and drop “Reader” into the experiment. The properties that you must configure for the Reader depend on your data source. For more information, see the documentation page at https://msdn.microsoft.com/library/azure/4e1b0fe6-aded-4b3f-a36f-39b8862b9004.
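Whichever import route you pick, it can save time to confirm that a file parses cleanly as CSV or TSV before you upload it or point the Reader at it. Here is a minimal Python sketch of such a check. It runs on your own machine and is not part of ML Studio; the function name summarize_csv and the shape of its result are my own choices for illustration.

```python
import csv
import io


def summarize_csv(text, delimiter=","):
    """Summarize delimited text: column names, data row count, ragged rows.

    Hypothetical helper for a local sanity check; not an AzureML API.
    The first non-blank row is treated as the header.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # drop blank lines
    if not rows:
        return {"columns": [], "row_count": 0, "ragged_rows": []}

    header, data = rows[0], rows[1:]
    # Positions (1-based, counted among data rows) whose field count
    # does not match the header.
    ragged = [
        position
        for position, row in enumerate(data, start=1)
        if len(row) != len(header)
    ]
    return {"columns": header, "row_count": len(data), "ragged_rows": ragged}
```

To check a file on disk, read it into a string and pass it in; use delimiter="\t" for TSV. A non-empty ragged_rows list points at rows worth fixing before you import.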
Now that you know how to import data, you need interesting datasets to analyze. Here are some great places to find them:
Data.gov – http://www.data.gov/
This is the home of the U.S. government’s open data. You can search for datasets or browse by topic: Agriculture, Business, Climate, Consumer, Ecosystems, Education, Energy, Finance, Health, Local Government, Manufacturing, Ocean, Public Safety, and Science & Research.
Kaggle – http://www.kaggle.com/
This is the “home of Data Science”. Kaggle regularly runs competitions utilizing machine learning and has interesting datasets available. At the moment, there are challenges and tutorials around identifying signs of diabetic retinopathy in eye images, using telematics data to identify a driver of a car, predicting the 2015 NCAA basketball tournament (March Madness), classifying malware into families based on file content and characteristics, building a digit recognizer to classify handwritten digits, predicting who will survive on the Titanic, detecting the location of keypoints on facial images, and more…
UCI Machine Learning Repository – http://archive.ics.uci.edu/ml/
The University of California Irvine Machine Learning Repository maintains 311 datasets as a service to the machine learning community. They include Iris, Car Evaluation, Breast Cancer Wisconsin (Diagnostic), Wine Quality, Poker Hand, Heart Disease, Forest Fires, Internet Advertisements, Human Activity Recognition Using Smartphones, and many more.
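If you want to peek at one of these datasets before importing it, a few lines of Python are enough. The sketch below parses the Iris data, assuming the file is headerless comma-separated text with four measurements followed by a class label. The IRIS_URL path and the column names are my assumptions, not something defined by the repository page above, so check them against the dataset's description before relying on them.

```python
import csv
import io
import urllib.request

# Assumed location of the raw Iris file; verify it on the repository site.
IRIS_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Column names chosen for illustration; the raw file has no header row.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def parse_iris(text):
    """Parse Iris-style CSV text into a list of dictionaries."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        *measurements, species = row
        record = dict(zip(COLUMNS[:-1], (float(value) for value in measurements)))
        record["species"] = species
        records.append(record)
    return records


def fetch_iris(url=IRIS_URL):
    """Download the file over HTTP and parse it (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_iris(response.read().decode("utf-8"))
```

Calling fetch_iris() returns one dictionary per flower. The same URL could also be given to the Reader module's web URL option, as long as you tell it that the file has no header row.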
Please leave a comment to share other interesting data sources.