Processing the Million Song Dataset with Pig Scripts on Apache Hadoop on Windows Azure

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes

  • To provide a reference dataset for evaluating research

  • To offer a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)

  • To help new researchers get started in the MIR field

Full Info:

Download the full 300 GB dataset:

MillionSongSubset (1.8 GB dataset):
To let you get a feel for the dataset without committing to a full download, we also provide a subset consisting of 10,000 songs (1%, 1.8 GB) selected at random:

It contains "additional files" (SQLite databases) in the same format as those for the full set, but referring only to the 10K song subset. Therefore, you can develop code on the subset, then port it to the full dataset.

To download the 5 GB, 10,000-song subset, use the link below:

To download any single-letter slice, use the link below:

Once the download finishes, you can copy the data directly to HDFS using:

>hadoop fs -copyFromLocal <Path_to_Local_Zip_File> <Folder_At_HDFS>

Once the file is in HDFS, you can verify it from the Grunt shell:
grunt> ls /user/Avkash/

Now you can run the following Pig scripts on the Million Song Dataset subset:

grunt> songs = LOAD 'Z.tsv.m' USING PigStorage('\t') AS (
    track_id:chararray, analysis_sample_rate:chararray, artist_7digitalid:chararray,
    artist_familiarity:chararray, artist_hotttnesss:chararray, artist_id:chararray,
    artist_latitude:chararray, artist_location:chararray, artist_longitude:chararray,
    artist_mbid:chararray, artist_mbtags:chararray, artist_mbtags_count:chararray,
    artist_name:chararray, artist_playmeid:chararray, artist_terms:chararray,
    artist_terms_freq:chararray, artist_terms_weight:chararray, audio_md5:chararray,
    bars_confidence:chararray, bars_start:chararray, beats_confidence:chararray,
    beats_start:chararray, danceability:chararray, duration:chararray,
    end_of_fade_in:chararray, energy:chararray, key:chararray,
    key_confidence:chararray, loudness:chararray, mode:chararray,
    mode_confidence:chararray, release:chararray, release_7digitalid:chararray,
    sections_confidence:chararray, sections_start:chararray, segments_confidence:chararray,
    segment_loudness_max:chararray, segment_loudness_max_time:chararray, segment_loudness_max_start:chararray,
    segment_pitches:chararray, segment_start:chararray, segment_timbre:chararray,
    similar_artists:chararray, song_hotttnesss:chararray, song_id:chararray,
    start_of_fade_out:chararray, tatums_confidence:chararray, tatums_start:chararray,
    tempo:chararray, time_signature:chararray, time_signature_confidence:chararray,
    title:chararray, track_7digitalid:chararray, year:int);

grunt> filteredsongs = FILTER songs BY year == 0 ;

grunt> selectedsong = FOREACH filteredsongs GENERATE title, year;

grunt> STORE selectedsong INTO 'year_0_songs' ;

grunt> ls year_0_songs
hdfs:// <dir>
hdfs://<r 3> 15013
hdfs://<r 3> 12772
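
To sanity-check the filter-and-project logic outside Pig, the steps above can be mimicked in a few lines of Python. This is only a sketch: the sample rows and the shortened four-column layout are made up for illustration, and the function name filter_and_project is hypothetical. The only thing taken from the schema above is that title is the third field from the end and year is the last field. A year of 0 appears to mark tracks whose release year is unknown.

```python
def filter_and_project(lines, wanted_year):
    """Mimic FILTER ... BY year == wanted_year, then FOREACH ... GENERATE title, year."""
    selected = []
    for line in lines:
        # PigStorage('\t') splits each record on tab characters.
        fields = line.rstrip("\n").split("\t")
        # In the schema above, title is third from the end and year is last.
        title, year_text = fields[-3], fields[-1]
        try:
            year = int(year_text)
        except ValueError:
            # A year that is not an integer can never equal wanted_year, so skip it.
            continue
        if year == wanted_year:
            selected.append((title, year))
    return selected


# Made-up sample rows: track_id, title, track_7digitalid, year (illustration only).
sample_lines = [
    "TRAAA01\tBurn It Down\t111\t0",
    "TRAAA02\tNice Girls\t222\t1980",
    "TRAAA03\tTongue Tied\t333\t0",
    "TRAAA04\tNo Escape\t444\tunknown",
]

print(filter_and_project(sample_lines, 0))
# [('Burn It Down', 0), ('Tongue Tied', 0)]
```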

grunt> songs1980 = FILTER songs BY year == 1980 ;

grunt> selectedsongs1980 = FOREACH songs1980 GENERATE title, year;

grunt> dump selectedsongs1980;
(Nice Girls,1980)
(Burn It Down,1980)
(No Escape,1980)
(Lost In Space,1980)
(The Affectionate Punch,1980)
(Good Tradition,1980)

Now join the two results, selectedsong and selectedsongs1980, on title (an inner join is the default):

grunt> final = JOIN selectedsong BY $0, selectedsongs1980 BY $0;

grunt> dump final;

(Burn It Down,0,Burn It Down,1980)
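
The inner join above keeps only titles that occur in both relations. As a rough illustration of that behavior, here is a plain-Python sketch of a hash join on the first field. The function name inner_join is hypothetical, and the input lists hold just a handful of titles taken from the output shown in this post.

```python
from collections import defaultdict


def inner_join(left, right):
    """Join two lists of (title, year) tuples on title, keeping matches only."""
    # Index the right-hand rows by their join key (the title, field $0).
    right_by_title = defaultdict(list)
    for row in right:
        right_by_title[row[0]].append(row)

    joined = []
    for row in left:
        # Emit one concatenated tuple per matching right-hand row.
        for match in right_by_title.get(row[0], []):
            joined.append(row + match)
    return joined


year_0 = [("Burn It Down", 0), ("Tongue Tied", 0)]
year_1980 = [("Nice Girls", 1980), ("Burn It Down", 1980)]

print(inner_join(year_0, year_1980))
# [('Burn It Down', 0, 'Burn It Down', 1980)]
```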

Now join the same two results with a full outer join:


grunt> finalouter = JOIN selectedsong BY $0 FULL, selectedsongs1980 BY $0;

grunt> dump finalouter;

(Tongue Tied,0,,)
(Vuelve A Mi,0,,)
(Blutige Welt,0,,)
(Burn It Down,0,Burn It Down,1980)
(Fine Weather,0,,)
(Ghost Dub 91,0,,)
(Hanky Church,0,,)
(I Don't Know,0,,)
(If I Had You,0,,)
(The Eternal - [University of London Union Live 8] (Encore),0,,)
(44 Duos Sz98 (1996 Digital Remaster): No. 5_ Slovak Song No. 1,0,,)
(Boogie Shoes (2007 Remastered Saturday Night Fever LP Version),0,,)
(Roc Ya Body "Mic Check 1_ 2" (Robi-Rob's Roc Da Jeep Vocal Mix),0,,)
(Phil T. McNasty's Blues (24-Bit Mastering) (2002 Digital Remaster),0,,)
(When Love Takes Over (as made famous by David Guetta & Kelly Rowland),0,,)
(Symphony No. 102 in B flat major (1990 Digital Remaster): II. Adagio,0,,)
(Indagine Su Un Cittadino Al Di Sopra Di Ogni Sospetto - Kid Sundance Remix,0,,)
(Piano Sonata No. 21 in B flat major_ Op.posth. (D960): IV. Allegro non troppo,0,,)
(Córtame Con Unas Tijeras Pero No Se Te Olvide El Resistol Para Volverme A Pegar,0,,)
(Frank's Rapp (Live) (24-Bit Remastered 02) (2003 Digital Remaster) (Feat. Frankie Beverly),0,,)
(Groove On Medley: Loving You / When Love Comes Knocking / Slowly / Glorious Time / Rock and Roll,0,,)
(Breaking News Per Netiquettish Cyberscrieber's False Relationships In A Big Country (Album Version),0,,)
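
A full outer join keeps every row from both sides and pads the missing side with nulls, which dump prints as empty fields. The Python sketch below illustrates this on a few titles from this post; the function name full_outer_join is hypothetical. Note that rows found only in the 1980 relation would come out with the first two fields empty, so the listing above is presumably an excerpt of the full output.

```python
def full_outer_join(left, right):
    """Full outer join of two lists of (title, year) tuples on title."""
    # Index the right-hand rows by their join key (the title, field $0).
    right_by_title = {}
    for row in right:
        right_by_title.setdefault(row[0], []).append(row)

    joined = []
    matched_titles = set()
    for row in left:
        matches = right_by_title.get(row[0])
        if matches:
            matched_titles.add(row[0])
            for match in matches:
                joined.append(row + match)
        else:
            # No partner on the right: pad the right-hand fields with nulls.
            joined.append(row + (None, None))

    # Right-hand rows that never matched are kept too, padded on the left.
    for row in right:
        if row[0] not in matched_titles:
            joined.append((None, None) + row)
    return joined


year_0 = [("Burn It Down", 0), ("Tongue Tied", 0)]
year_1980 = [("Nice Girls", 1980), ("Burn It Down", 1980)]

for row in full_outer_join(year_0, year_1980):
    print(row)
# ('Burn It Down', 0, 'Burn It Down', 1980)
# ('Tongue Tied', 0, None, None)
# (None, None, 'Nice Girls', 1980)
```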

Using LIMIT:

grunt> finalouter10 = LIMIT finalouter 10;

grunt> dump finalouter10;
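
LIMIT simply caps the number of tuples returned. Pig does not guarantee which tuples those are unless an ORDER precedes the LIMIT, so the ten rows can differ between runs. A minimal Python analogue, with a hypothetical helper named limit and a few titles from this post, might look like this:

```python
from itertools import islice


def limit(rows, n):
    """Return at most n rows from any iterable, in whatever order it yields them."""
    return list(islice(rows, n))


rows = [("Tongue Tied", 0), ("Vuelve A Mi", 0), ("Blutige Welt", 0), ("Fine Weather", 0)]

# Without sorting, the result depends on the incoming order.
print(limit(rows, 2))
# [('Tongue Tied', 0), ('Vuelve A Mi', 0)]

# Sorting first (the analogue of ORDER ... BY before LIMIT) makes it deterministic.
print(limit(sorted(rows), 2))
# [('Blutige Welt', 0), ('Fine Weather', 0)]
```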


Comments (6)

  1. ajay says:

    hi avkash, how did u get millionsongs data in .tsv format, as it is available in .h5 format??

  2. justin says:

    I would also like to know where you can get the .tsv version.

  3. how do we get the tsv.m file?

  4. Walson says:

    Here are wrappers in various programming languages which allow you to parse and read data from an hdf5 file. You can take any one of them and modify the code to store the data in another form such as TSV, CSV, etc.…/code

  5. Lavanya says:

    Hi Avkash, how to validate big data using hive

  6. too says:

    Hello, avkash. How did you get Z.tsv.m as the millionsongsubset only available in .h5 ?

    BTW, this post is awesome. Thank you.
