[artist ID image]

Music Artist Identification:
artist20 Baseline System in Matlab


There is a growing body of research on classifying music audio according to the statistics of some set of features. One persistent issue in this work, however, is the scarcity of common data sets, although the annual MIREX evaluations have helped significantly. Even so, it would be useful to have a dataset and task that anyone can download and run on their own machine. That is what we aim to provide with the artist20 set.

artist20 is a database of six albums by each of 20 artists, making a total of 1,413 tracks. It grew out of our work on artist identification, in which we identified 18 artists with five or more albums in our uspop2002 dataset. That data was used for artist identification experiments with albums kept disjoint between training and test sets (to avoid gross features related to mastering). There were, however, a number of problems with that data, including repeated tracks, live recordings, and other anomalies.

artist20 was assembled to resolve these problems. The largest part of the set is drawn from the same uspop2002 set, but we have expanded it with several artists and albums not in uspop2002 in order to get a temporally-sequential series of six regular studio albums from each artist. We've made some effort to avoid major changes in style, where possible. All music is drawn from the personal CD collections of lab members or our friends.

We are also distributing list files that define various cuts of the data. We define a canonical training set (three albums per artist), validation set (one album), and test set (two albums). We also define a canonical 6-fold jackknife train/test scheme: each fold trains on five albums per artist and tests on the remaining one, and the final result is the average over all the folds, giving an effective test set of all 1,413 tracks.
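The fold-pooling described above can be sketched in a few lines of Matlab. This is only an illustrative outline (run_fold is a hypothetical stand-in for one train/test pass; the package's actual routine is trntest_folds): because each album appears as test material in exactly one fold, pooling the raw confusion counts tests every track exactly once.

```matlab
% Illustrative sketch of the 6-fold jackknife pooling.
% run_fold is hypothetical: one train/test pass returning a
% 20x20 confusion matrix for that fold's held-out albums.
nArtists = 20;
Ctotal = zeros(nArtists);
for f = 1:6
  Cf = run_fold(f);          % hypothetical per-fold confusion matrix
  Ctotal = Ctotal + Cf;      % pool raw counts across folds
end
acc = sum(diag(Ctotal)) / sum(Ctotal(:));   % overall accuracy
```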

The package includes Matlab code to evaluate the 6-fold train/test, which should be easy to modify to work with your own feature types. We are distributing the data as precalculated MFCCs and beat-chroma matrices, but also as 32 kbps mono MP3s (16 kHz sample rate, bandlimited to 7.2 kHz). For the MFCC and chroma features we use, this data gives results that are no different from starting with the original 44.1 kHz stereo recordings, so we hope it will be adequate for other groups. At the same time, this format is comparable to AM radio quality, so we do not believe distributing it will harm the interests of the copyright holders.


Data

We are making the data set freely available for download, but we would like to keep track of who has downloaded it. If you would like to get a copy, simply send an email to Dan Ellis <dpwe@ee.columbia.edu> giving your name, institution/affiliation, and a sentence describing what you plan to use the data for. Then we'll send you the download links.

The different data sets are:


Code

The complete Matlab code to run the baseline, 6-fold, 20-way artist identification task based on the MFCC features can be downloaded here: artist20-baseline.tgz. Here is its README file.

You will also need to download and install Kevin Murphy's HMM Toolbox (we use only its Gaussian mixture routines) and make sure it is on your Matlab path.
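Setting up the path might look like the following. This is only an example: the install location and the routine name checked are assumptions, so adjust them to wherever you unpacked the toolbox.

```matlab
% Example setup, assuming the toolbox was unpacked under
% $HOME/matlab/HMMall (an illustrative path, not a requirement):
addpath(genpath(fullfile(getenv('HOME'), 'matlab', 'HMMall')));
% Confirm the Gaussian-mixture routines are now visible:
which mixgauss_prob
```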

Main routines

Example Usage

This is the default full test over all 1,413 tracks, using a single full-covariance Gaussian model for each artist, based on 1000 randomly-chosen frames (for both training and test).


>> % Run the 6-fold train/test experiment
>> [ac,co,lh,mo] = trntest_folds('6fold.list',1);
09:23:39 acc=0.54494 durn=696.1604
>> sum(diag(co))/sum(sum(co))  % co is confusion matrix

ans =

    0.5449

>> % i.e. overall accuracy is around 54%
>> % (timing is 696 sec, on a MacBook Pro 2 GHz)
[confusion matrix image]
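The single full-covariance Gaussian classifier at the heart of this baseline can be sketched roughly as follows. The variable names and data layout are illustrative, not the actual routines in the package: allFrames{a} is assumed to hold the pooled [nDim x N] MFCC frames for artist a, and T the frames sampled from one test track.

```matlab
% Training: fit one full-covariance Gaussian to 1000 randomly chosen
% MFCC frames per artist (illustrative sketch, not the package code).
nFrames = 1000;
for a = 1:nArtists
  F = allFrames{a};                          % [nDim x N] frames, artist a
  idx = randperm(size(F,2));
  X = F(:, idx(1:min(nFrames, size(F,2)))); % random subset of frames
  mu{a} = mean(X, 2);
  Sigma{a} = cov(X');                        % full covariance
end

% Testing: score a track's sampled frames T ([nDim x nT]) under every
% artist model and pick the largest total log-likelihood.
[nDim, nT] = size(T);
for a = 1:nArtists
  d = T - repmat(mu{a}, 1, nT);
  R = chol(Sigma{a});                        % Sigma = R'*R
  z = R' \ d;                                % whitened differences
  ll(a) = -0.5*sum(z(:).^2) ...
          - nT*(sum(log(diag(R))) + 0.5*nDim*log(2*pi));
end
[best, artist] = max(ll);                    % predicted artist index
```

Using a Cholesky factor avoids forming the explicit inverse covariance, and the log-determinant falls out of the factor's diagonal.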

Papers

You can reference this dataset with the following paper:

D. Ellis (2007). Classifying Music Audio with Timbral and Chroma Features,
Proc. Int. Conf. on Music Information Retrieval ISMIR-07, Vienna, Austria, Sep. 2007.

Acknowledgment

This material is based in part upon work supported by the National Science Foundation under Grant Nos. IIS-0238301 and IIS-0713334. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

This work was also supported by the Columbia Academic Quality Fund.

This work makes use of the HMM Toolbox by Kevin Murphy, and includes the HTK-format file reading code from Mike Brookes' VOICEBOX. Their work is gratefully acknowledged.


Last updated: 2007/11/07
Dan Ellis <dpwe@ee.columbia.edu>