Free data sets for Azure Machine Learning


 


One of the key things students need when learning how to use Microsoft Azure Machine Learning is access to sample data sets and experiments.

At Microsoft we have made a number of sample data sets available. These data sets are used by the sample models in the Azure Cortana Intelligence Gallery.

Some of these data sets are available in Azure Blob storage, so they can be linked directly to Azure ML experiments, whilst others are available in CSV format.

For these data sets, the list below provides a direct link.

You can use these data sets in your experiments by using the Import Data module.

The rest of these sample data sets are listed under Saved Datasets in the module palette to the left of the experiment canvas when you open or create a new experiment in ML Studio. You can use any of these data sets in your own experiment by dragging it to your experiment canvas.

Try Azure Machine Learning for free

No credit card or Azure subscription needed. Get started now >

Here are some of the FREE Data sets available to use

Adult Census Income Binary Classification dataset
A subset of the 1994 Census database, using working adults over the age of 16 with an adjusted income index of > 100.

Usage: Classify people using demographics to predict whether a person earns over 50K a year.

Related Research: Kohavi, R., Becker, B., (1996). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

Airport Codes Dataset
U.S. airport codes.

This dataset contains one row for each U.S. airport, providing the airport ID number and name along with the location city and state.

Automobile price data (Raw)
Information about automobiles by make and model, including the price, features such as the number of cylinders and MPG, as well as an insurance risk score.

The risk score is initially associated with auto price and then adjusted for actual risk in a process known to actuaries as symboling. A value of +3 indicates that the auto is risky, and a value of -3 that it is probably pretty safe.

Usage: Predict the risk score by features, using regression or multivariate classification.

Related Research: Schlimmer, J.C. (1987). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

Bike Rental UCI dataset
UCI Bike Rental dataset, based on real data from Capital Bikeshare, the company that maintains a bike rental network in Washington DC.

The dataset has one row for each hour of each day in 2011 and 2012, for a total of 17,379 rows. The range of hourly bike rentals is from 1 to 977.

Bill Gates RGB Image
Publicly-available image file converted to CSV data.

The code for converting the image is provided in the Color quantization using K-Means clustering model detail page.

Blood donation data
A subset of data from the blood donor database of the Blood Transfusion Service Center of Hsin-Chu City, Taiwan.

Donor data includes the months since last donation, the frequency (the total number of donations), the time since first donation, and the amount of blood donated.

Usage: The goal is to predict via classification whether the donor donated blood in March 2007, where 1 indicates a donor during the target period, and 0 a non-donor.

Related Research: Yeh, I.C., (2008). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, “Knowledge discovery on RFM model using Bernoulli sequence,” Expert Systems with Applications, 2008, http://dx.doi.org/10.1016/j.eswa.2008.07.018

Book Reviews from Amazon
Reviews of books in Amazon, taken from the amazon.com website by University of Pennsylvania researchers (sentiment). See the research paper, “Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification” by John Blitzer, Mark Dredze, and Fernando Pereira; Association of Computational Linguistics (ACL), 2007.

The original dataset has 975K reviews with rankings 1, 2, 3, 4, or 5. The reviews were written in English and are from the time period 1997-2007. This dataset has been down-sampled to 10K reviews.

Breast cancer data
One of three cancer-related datasets provided by the Oncology Institute that appears frequently in machine learning literature. Combines diagnostic information with features from laboratory analysis of about 300 tissue samples.

Usage: Classify the type of cancer, based on 9 attributes, some of which are linear and some are categorical.

Related Research: Wolberg, W.H., Street, W.N., & Mangasarian, O.L. (1995). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

Breast Cancer Features
The dataset contains information for 102K suspicious regions (candidates) of X-ray images, each described by 117 features. The features are proprietary and their meaning is not revealed by the dataset creators (Siemens Healthcare).

Breast Cancer Info
The dataset contains additional information for each suspicious region of an X-ray image. Each example provides information (e.g., label, patient ID, coordinates of patch relative to the whole image) about the corresponding row number in the Breast Cancer Features dataset. Each patient has a number of examples. For patients who have a cancer, some examples are positive and some are negative. For patients who don’t have a cancer, all examples are negative. The dataset has 102K examples. The dataset is biased: 0.6% of the points are positive and the rest are negative. The dataset was made available by Siemens Healthcare.
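
With so few positive examples, plain accuracy says little about a model trained on this data. The following is a minimal sketch of that point, using only the approximate figures quoted above (102K examples, 0.6% positive); the variable names are illustrative and the "always negative" baseline is a hypothetical classifier, not part of the dataset.

```python
# Illustrative counts only, derived from the approximate figures in the text.
total_examples = 102_000
positive_rate = 0.006  # 0.6% of the examples are positive

positives = round(total_examples * positive_rate)
negatives = total_examples - positives

# A trivial (hypothetical) baseline that labels every region as negative.
true_negatives = negatives
true_positives = 0

accuracy = (true_positives + true_negatives) / total_examples
recall = true_positives / positives

print(f"positives: {positives}, negatives: {negatives}")
print(f"baseline accuracy: {accuracy:.3f}, recall: {recall:.3f}")
```

The baseline scores very high accuracy while detecting no positive regions at all, which is why metrics such as recall or ROC curves tend to be more informative on data this skewed.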

CRM Appetency Labels Shared
Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train_appetency.labels).

CRM Churn Labels Shared
Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train_churn.labels).

CRM Dataset Shared
This data comes from the KDD Cup 2009 customer relationship prediction challenge (orange_small_train.data.zip).

The dataset contains 50K customers from the French Telecom company Orange. Each customer has 230 anonymized features, 190 of which are numeric and 40 are categorical. The features are very sparse.

CRM Upselling Labels Shared
Labels from the KDD Cup 2009 customer relationship prediction challenge (orange_large_train_upselling.labels).

Energy Efficiency Regression data
A collection of simulated energy profiles, based on 12 different building shapes. The buildings are differentiated by 8 features, such as glazing area, glazing area distribution, and orientation.

Usage: Use either regression or classification to predict the energy efficiency rating, based on one of two real-valued responses. For multi-class classification, round the response variable to the nearest integer.
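
As a rough sketch of the rounding step for multi-class classification, the snippet below turns real-valued responses into integer class labels. The response values are made up for illustration, and the helper name `to_class_label` is hypothetical. Note that Python's built-in `round` rounds halves to the nearest even integer, so this sketch rounds halves up explicitly; which convention the sample experiments use is not stated in the source.

```python
import math


def to_class_label(response: float) -> int:
    """Round a real-valued response to the nearest integer class (halves round up)."""
    return math.floor(response + 0.5)


# Hypothetical response values, for illustration only.
responses = [15.55, 20.84, 21.46, 28.5, 32.49]
labels = [to_class_label(r) for r in responses]
print(labels)
```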

Related Research: Xifara, A. & Tsanas, A. (2012). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

Flight Delays Data
Passenger flight on-time performance data taken from the TranStats data collection of the U.S. Department of Transportation (On-Time).

The dataset covers the time period April-October 2013. Before uploading to Azure ML Studio, the dataset was processed as follows:

  • The dataset was filtered to cover only the 70 busiest airports in the continental US
  • Cancelled flights were labeled as delayed by more than 15 minutes
  • Diverted flights were filtered out
  • The following columns were selected: Year, Month, DayofMonth, DayOfWeek, Carrier, OriginAirportID, DestAirportID, CRSDepTime, DepDelay, DepDel15, CRSArrTime, ArrDelay, ArrDel15, Cancelled
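
The processing steps above can be sketched roughly as follows. This is not the code that was used to prepare the dataset; it is a minimal illustration that assumes rows are plain dictionaries, that a `Diverted` flag column exists in the raw data, that a flight is kept only when both its origin and destination are in the busiest-airport set, and that "labeled as delayed" means setting both `DepDel15` and `ArrDel15` to 1. The airport IDs and sample rows are invented.

```python
SELECTED_COLUMNS = [
    "Year", "Month", "DayofMonth", "DayOfWeek", "Carrier",
    "OriginAirportID", "DestAirportID", "CRSDepTime", "DepDelay",
    "DepDel15", "CRSArrTime", "ArrDelay", "ArrDel15", "Cancelled",
]


def preprocess_flights(rows, busiest_airports):
    """Apply the four listed steps to an iterable of row dictionaries."""
    result = []
    for row in rows:
        # Keep only flights between the busiest airports (assumption: both ends).
        if row["OriginAirportID"] not in busiest_airports:
            continue
        if row["DestAirportID"] not in busiest_airports:
            continue
        # Filter out diverted flights ("Diverted" is an assumed column name).
        if row.get("Diverted") == 1:
            continue
        row = dict(row)  # copy so the caller's data is not modified
        # Label cancelled flights as delayed by more than 15 minutes.
        if row["Cancelled"] == 1:
            row["DepDel15"] = 1
            row["ArrDel15"] = 1
        # Keep only the selected columns, in the listed order.
        result.append({name: row[name] for name in SELECTED_COLUMNS})
    return result


def make_row(**overrides):
    """Build an invented sample row; override fields as needed."""
    row = {
        "Year": 2013, "Month": 4, "DayofMonth": 19, "DayOfWeek": 5,
        "Carrier": "DL", "OriginAirportID": 11433, "DestAirportID": 13303,
        "CRSDepTime": 837, "DepDelay": -3, "DepDel15": 0,
        "CRSArrTime": 1138, "ArrDelay": 1, "ArrDel15": 0,
        "Cancelled": 0, "Diverted": 0,
    }
    row.update(overrides)
    return row


busiest = {11433, 13303}  # hypothetical stand-in for the 70 busiest airports
flights = [
    make_row(),
    make_row(Cancelled=1, DepDelay=None, ArrDelay=None),
    make_row(Diverted=1),
    make_row(OriginAirportID=99999),
]
cleaned = preprocess_flights(flights, busiest)
print(len(cleaned), "rows kept")
```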

Flight on-time performance (Raw)
Records of airplane flight arrivals and departures within the United States from October 2011.

Usage: Predict flight delays.

Related Research: From US Dept. of Transportation http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time.

Forest fires data
Contains weather data, such as temperature and humidity indices and wind speed, from an area of northeast Portugal, combined with records of forest fires.

Usage: This is a difficult regression task, where the aim is to predict the burned area of forest fires.

Related Research: Cortez, P., & Morais, A. (2008). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

[Cortez and Morais, 2007] P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 – Portuguese Conference on Artificial Intelligence, December, Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf.

German Credit Card UCI dataset
The UCI Statlog (German Credit Card) dataset (Statlog+German+Credit+Data), using the german.data file.

The dataset classifies people, described by a set of attributes, as low or high credit risks. Each example represents a person. There are 20 features, both numerical and categorical, and a binary label (the credit risk value). High credit risk entries have label = 2, low credit risk entries have label = 1. The cost of misclassifying a low risk example as high is 1, whereas the cost of misclassifying a high risk example as low is 5.
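
Because the two kinds of error carry different costs, a model for this dataset is usually better judged by total misclassification cost than by raw accuracy. A minimal sketch, assuming the labels and costs exactly as described above; the example predictions are invented and `total_cost` is a hypothetical helper.

```python
LOW_RISK, HIGH_RISK = 1, 2

# Misclassification costs from the description: (actual, predicted) -> cost.
COST = {
    (LOW_RISK, HIGH_RISK): 1,  # low risk misclassified as high
    (HIGH_RISK, LOW_RISK): 5,  # high risk misclassified as low
}


def total_cost(actual, predicted):
    """Sum the cost of every misclassified example; correct predictions cost 0."""
    return sum(COST.get((a, p), 0) for a, p in zip(actual, predicted))


# Invented labels and predictions, for illustration only.
actual = [1, 1, 2, 2, 1, 2]
predicted = [1, 2, 2, 1, 1, 1]
print(total_cost(actual, predicted))
```

In this toy example the two high-risk applicants predicted as low risk dominate the total, which reflects the asymmetry in the cost matrix.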

IMDB Movie Titles
The dataset contains information about movies that were rated in Twitter tweets: IMDB movie ID, movie name and genre, production year. There are 17K movies in the dataset. The dataset was introduced in the paper “S. Dooms, T. De Pessemier and L. Martens. MovieTweetings: a Movie Rating Dataset Collected From Twitter. Workshop on Crowdsourcing and Human Computation for Recommender Systems, CrowdRec at RecSys 2013.”

Iris two class data
This is perhaps the best known database to be found in the pattern recognition literature. The data set is relatively small, containing 50 examples each of petal measurements from three iris varieties.

Usage: Predict the iris type from the measurements.

Related Research: Fisher, R.A. (1988). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

Movie Tweets
The dataset is an extended version of the Movie Tweetings dataset. The dataset has 170K ratings for movies, extracted from well-structured tweets on Twitter. Each instance represents a tweet and is a tuple: user ID, IMDB movie ID, rating, timestamp, number of favorites for this tweet, and number of retweets of this tweet. The dataset was made available by A. Said, S. Dooms, B. Loni and D. Tikk for Recommender Systems Challenge 2014.

MPG data for various automobiles
This dataset is a slightly modified version of the dataset provided by the StatLib library of Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.

The data lists fuel consumption for various automobiles in miles per gallon, along with information such as the number of cylinders, engine displacement, horsepower, total weight, and acceleration.

Usage: Predict fuel economy based on 3 multivalued discrete attributes and 5 continuous attributes.

Related Research: StatLib, Carnegie Mellon University, (1993). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

Pima Indians Diabetes Binary Classification dataset
A subset of data from the National Institute of Diabetes and Digestive and Kidney Diseases database. The dataset was filtered to focus on female patients of Pima Indian heritage. The data includes medical data such as glucose and insulin levels, as well as lifestyle factors.

Usage: Predict whether the subject has diabetes (binary classification).

Related Research: Sigillito, V. (1990). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

Restaurant customer data
A set of metadata about customers, including demographics and preferences.

Usage: Use this dataset, in combination with the other two restaurant data sets, to train and test a recommender system.

Related Research: Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Restaurant feature data
A set of metadata about restaurants and their features, such as food type, dining style, and location.

Usage: Use this dataset, in combination with the other two restaurant data sets, to train and test a recommender system.

Related Research: Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Restaurant ratings
Contains ratings given by users to restaurants on a scale from 0 to 2.

Usage: Use this dataset, in combination with the other two restaurant data sets, to train and test a recommender system.

Related Research: Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

Steel Annealing multi-class dataset
This dataset contains a series of records from steel annealing trials with the physical attributes (width, thickness, type (coil, sheet, etc.)) of the resulting steel types.

Usage: Predict either of two numeric class attributes: hardness or strength. You might also analyze correlations among attributes.

Steel grades follow a set standard, defined by SAE and other organizations. You are looking for a specific ‘grade’ (the class variable) and want to understand the values needed.

Related Research: Sterling, D. & Buntine, W., (NA). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

A useful guide to steel grades can be found here: http://www.outokumpu.com/SiteCollectionDocuments/Outokumpu-steel-grades-properties-global-standards.pdf

Telescope data
Records of high energy gamma particle bursts along with background noise, both simulated using a Monte Carlo process.

The intent of the simulation was to improve the accuracy of ground-based atmospheric Cherenkov gamma telescopes, using statistical methods to differentiate between the desired signal (Cherenkov radiation showers) and background noise (hadronic showers initiated by cosmic rays in the upper atmosphere).

The data has been pre-processed to create an elongated cluster with the long axis oriented towards the camera center. The characteristics of this ellipse (often called Hillas parameters) are among the image parameters that can be used for discrimination.

Usage: Predict whether the image of a shower represents signal or background noise.

Notes: Simple classification accuracy is not meaningful for this data, since classifying a background event as signal is worse than classifying a signal event as background. For comparison of different classifiers the ROC graph should be used. The probability of accepting a background event as signal must be below one of the following thresholds: 0.01, 0.02, 0.05, 0.1, or 0.2.

Also, note that the number of background events (h, for hadronic showers) is underestimated, whereas in real measurements, the h or noise class represents the majority of events.
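
One way to read the ROC-based comparison described in the notes is to report, for each allowed false positive rate, the best true positive rate a classifier can reach. The sketch below does this for a list of classifier scores. It assumes higher scores mean "more signal-like", that scores are distinct, that signal events are labeled `g` and background events `h` (only `h` is named in the text), and that a rate exactly equal to a threshold is acceptable. The function name and the toy scores are hypothetical.

```python
def tpr_at_fpr_limits(scores, labels, limits=(0.01, 0.02, 0.05, 0.1, 0.2)):
    """For each false positive rate limit, return the best true positive rate
    reachable by thresholding the scores."""
    n_signal = sum(1 for y in labels if y == "g")
    n_background = len(labels) - n_signal
    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best = {limit: 0.0 for limit in limits}
    tp = fp = 0
    for _, label in ranked:
        if label == "g":
            tp += 1
        else:
            fp += 1
        fpr = fp / n_background
        tpr = tp / n_signal
        for limit in limits:
            if fpr <= limit and tpr > best[limit]:
                best[limit] = tpr
    return best


# Toy data: 5 signal events and 10 background events with invented scores.
signal_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
background_scores = [0.75, 0.5, 0.3, 0.25, 0.15, 0.1, 0.08, 0.06, 0.04, 0.02]
scores = signal_scores + background_scores
labels = ["g"] * len(signal_scores) + ["h"] * len(background_scores)

result = tpr_at_fpr_limits(scores, labels)
for limit, tpr in result.items():
    print(f"FPR <= {limit}: best TPR = {tpr:.2f}")
```

Comparing two classifiers then amounts to comparing these per-threshold true positive rates rather than a single accuracy figure.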

Related Research: Bock, R.K. (1995). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science

Weather Dataset
Hourly land-based weather observations from NOAA (merged data from 201304 to 201310).

The weather data covers observations made from airport weather stations, covering the time period April-October 2013. Before uploading to Azure ML Studio, the dataset was processed as follows:

  • Weather station IDs were mapped to corresponding airport IDs
  • Weather stations not associated with the 70 busiest airports were filtered out
  • The Date column was split into separate Year, Month, and Day columns
  • The following columns were selected: AirportID, Year, Month, Day, Time, TimeZone, SkyCondition, Visibility, WeatherType, DryBulbFarenheit, DryBulbCelsius, WetBulbFarenheit, WetBulbCelsius, DewPointFarenheit, DewPointCelsius, RelativeHumidity, WindSpeed, WindDirection, ValueForWindCharacter, StationPressure, PressureTendency, PressureChange, SeaLevelPressure, RecordType, HourlyPrecip, Altimeter
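
The date-splitting step above might look something like the following. This is a sketch only: it assumes the raw date is stored as a `YYYYMMDD` value in a column named `Date`, which matches the `201304` style used in the description but is not confirmed by the source. The sample observation is invented.

```python
def split_date(row, date_key="Date"):
    """Replace a YYYYMMDD date field with separate Year, Month, and Day fields."""
    row = dict(row)  # copy so the caller's data is not modified
    date = str(row.pop(date_key))
    row["Year"] = int(date[0:4])
    row["Month"] = int(date[4:6])
    row["Day"] = int(date[6:8])
    return row


# Invented observation, for illustration only.
observation = {"AirportID": 14843, "Date": "20130401", "Time": 56}
converted = split_date(observation)
print(converted)
```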

Wikipedia SP 500 Dataset
Data is derived from Wikipedia (http://www.wikipedia.org/) based on articles of each S&P 500 company, stored as XML data.

Before uploading to Azure ML Studio, the dataset was processed as follows:

  • Text content was extracted for each specific company
  • Wiki formatting was removed
  • Non-alphanumeric characters were removed
  • All text was converted to lowercase
  • Known company categories were added
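
The text-cleaning steps in this list can be approximated with a few regular expressions. The sketch below is an assumption about how such cleaning could be done, not the code used to build the dataset: it handles only simple, non-nested templates and links, which is far from complete wiki markup handling, and the function name and sample sentence are hypothetical.

```python
import re


def clean_wiki_text(text: str) -> str:
    """Strip simple wiki markup, drop non-alphanumeric characters, and lowercase."""
    # Drop simple, non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)
    # Keep the display text of links: [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Replace anything that is not an ASCII letter, digit, or whitespace with a space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


sample = "'''Microsoft Corporation''' is an [[United States|American]] company. {{citation needed}}"
print(clean_wiki_text(sample))
```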

Note that for some companies an article could not be found, so the number of records is less than 500.

Downloadable Data Sets in CSV Format

direct_marketing.csv
The dataset contains customer data and indications about their response to a direct mailing campaign. Each row represents a customer. The dataset contains 9 features about user demographics and past behavior, and 3 label columns (visit, conversion, and spend). Visit is a binary column that indicates that a customer visited after the marketing campaign, conversion indicates a customer purchased something, and spend is the amount that was spent. The dataset was made available by Kevin Hillstrom for MineThatData E-Mail Analytics And Data Mining Challenge.

lyrl2004_tokens_test.csv
Features of test examples in the RCV1-V2 Reuters news dataset. The dataset has 781K news articles along with their IDs (first column of the dataset). Each article is tokenized, stopworded, and stemmed. The dataset was made available by David D. Lewis.

lyrl2004_tokens_train.csv
Features of training examples in the RCV1-V2 Reuters news dataset. The dataset has 23K news articles along with their IDs (first column of the dataset). Each article is tokenized, stopworded, and stemmed. The dataset was made available by David D. Lewis.

network_intrusion_detection.csv
Dataset from the KDD Cup 1999 Knowledge Discovery and Data Mining Tools Competition (kddcup99.html).

The dataset was downloaded and stored in Azure Blob storage (network_intrusion_detection.csv) and includes both training and testing datasets. The training dataset has approximately 126K rows and 43 columns, including the labels; 3 columns are part of the label information, and 40 columns, consisting of numeric and string/categorical features, are available for training the model. The test data has approximately 22.5K test examples with the same 43 columns as in the training data.

rcv1-v2.topics.qrels.csv
Topic assignments for news articles in the RCV1-V2 Reuters news dataset. A news article can be assigned to several topics. The format of each row is “<topic name> <document id> 1”. The dataset contains 2.6M topic assignments. The dataset was made available by David D. Lewis.

student_performance.txt
This data comes from the KDD Cup 2010 Student performance evaluation challenge (student performance evaluation). The data used is the Algebra_2008_2009 training set (Stamper, J., Niculescu-Mizil, A., Ritter, S., Gordon, G.J., & Koedinger, K.R. (2010). Algebra I 2008-2009. Challenge data set from KDD Cup 2010 Educational Data Mining Challenge). Find it at downloads.jsp or algebra_2008_2009.zip.

The dataset was downloaded and stored in Azure Blob storage (student_performance.txt) and contains log files from a student tutoring system. The supplied features include problem ID and its brief description, student ID, timestamp, and how many attempts the student made before solving the problem correctly. The original dataset has 8.9M records; this dataset has been down-sampled to the first 100K rows. The dataset has 23 tab-separated columns of various types: numeric, categorical, and timestamp.
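
Down-sampling to the first N rows of a tab-separated file, as described above, can be sketched with the Python standard library. The example assumes the file starts with a header row, which the source does not state; the column names and rows in the in-memory sample are invented, and `read_first_rows` is a hypothetical helper.

```python
import csv
import io
from itertools import islice


def read_first_rows(lines, limit):
    """Return the header and up to `limit` data rows from tab-separated text."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)  # assumption: the first line holds column names
    return header, list(islice(reader, limit))


# Tiny in-memory stand-in for the real file; column names are hypothetical.
sample = io.StringIO(
    "StudentID\tProblemID\tAttempts\n"
    "s1\tp1\t1\n"
    "s1\tp2\t3\n"
    "s2\tp1\t2\n"
)
header, rows = read_first_rows(sample, limit=2)
print(header, len(rows))
```

For the real file, the same function could be given an open file object and `limit=100_000`; because `islice` stops reading once the limit is reached, the rest of a large file is never loaded.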
