Hadoop for .NET Developers: Obtaining the Sample Data Sets

NOTE This post is one in a series on Hadoop for .NET Developers.

In the exercises that follow, we will work with two sample data files. Both are included in the ZIP file associated with this blog post.

The first sample data file, integers.txt, contains a simple list of the integers from 1 to 10,000. In later exercises, we will programmatically upload that file to a Hadoop cluster and then write a simple MapReduce job against it.
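If the ZIP file is not at hand, a stand-in for integers.txt is easy to produce. The sketch below is not part of the original download. It assumes the file holds one integer per line, which the post does not state explicitly, and the function names write_integers and read_integers are made up for illustration.

```python
from pathlib import Path


def write_integers(path, start=1, stop=10_000):
    """Write the integers start..stop (inclusive) to path.

    Assumption: one integer per line; the real integers.txt may differ.
    """
    lines = (f"{n}\n" for n in range(start, stop + 1))
    Path(path).write_text("".join(lines), encoding="ascii")


def read_integers(path):
    """Read the file back as a list of ints, ignoring blank lines."""
    with open(path, encoding="ascii") as handle:
        return [int(line) for line in handle if line.strip()]


if __name__ == "__main__":
    write_integers("integers.txt")
    values = read_integers("integers.txt")
    print(len(values), values[0], values[-1])
```

Reading the file back and checking the count and the end points is a cheap sanity check to run before uploading anything to a cluster.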

The second data file, ufo_awesome.tsv, was obtained from InfoChimps, a clearing house for some really interesting data sets that is well worth checking out. This specific file is an extract of a UFO sightings database maintained by the National UFO Reporting Center. It provides observation dates, reporting dates, locations, UFO shapes, sighting durations, and sighting descriptions in a tab-delimited format. In later exercises, we will programmatically upload this file to a Hadoop cluster and then write a more complex MapReduce job against it. This file is also interesting for working with Pig and Hive. More info on this data set along with alternative data file formats is found here.
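To get a feel for the tab-delimited layout before any Hadoop work, a record can be split into named fields locally. This is a rough sketch, not the parsing used in the later exercises. It assumes the six columns appear in the order listed above and that the file has no header row, neither of which has been checked against the actual file. The Sighting type and parse_sightings function are hypothetical names.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class Sighting(NamedTuple):
    # Assumed column order, taken from the description in the post.
    sighted_at: str
    reported_at: str
    location: str
    shape: str
    duration: str
    description: str


def parse_sightings(lines: Iterable[str]) -> Iterator[Sighting]:
    """Yield one Sighting per well-formed tab-delimited line.

    Fields are kept as raw strings; lines that do not have exactly
    six fields are skipped rather than guessed at.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(Sighting._fields):
            continue
        yield Sighting(*(field.strip() for field in fields))


if __name__ == "__main__":
    with open("ufo_awesome.tsv", encoding="utf-8", errors="replace", newline="") as handle:
        for count, sighting in enumerate(parse_sightings(handle)):
            if count == 5:
                break
            print(sighting.shape, "|", sighting.location)
```

Leaving every field as a string avoids guessing at the date and duration formats, which are not described here. Skipping rows with the wrong field count is one simple way to tolerate stray tabs in the free-text descriptions.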