The authors are members of the University of Michigan Artificial Intelligence Lab and can be contacted at http://musen.engin.umich.edu/.
In the not so distant past, audio music collections (CDs, LPs, cassette tapes, and maybe a few reel-to-reel tapes) were easy to manage. Most people could fit their entire collection in a single bookcase and easily find the album or track they wanted to play. In addition, getting music usually meant going to a local music store, ordering through a catalog, or checking out recordings from libraries. Computers, inexpensive high-capacity hard drives, and the Internet have changed all that.
Today, music collections are stored digitally on computers in increasing numbers. CDs are stored on computers by ripping tracks from CDs to MP3 files. Online sources such as EMusic (http://www.emusic.com/) and peer-to-peer-based services like Gnutella (http://www.gnutella.com/) make music available to anyone with a computer connected to the Internet. It is common for music collections to contain hundreds of ripped CDs and thousands of downloaded MP3 files, translating into gigabytes of stored music. While this may seem like a lot for most users, we believe the size of music collections will continue to grow, using both local and remote storage.
As music collections increase in size, the task of managing them becomes more difficult. Researchers in the area of Music Information Retrieval (MIR) are designing retrieval mechanisms and systems to aid in searching and organizing music.
MIR systems are concerned with techniques and mechanisms for representing music queries, searching collections of music for possible matches, and retrieving results that best match user queries. In the general case, users enter an artist's name or song title into the MIR system, and it returns a list of matching musical works. What gets returned varies from system to system, but might include a list of music sound files (such as MP3s or MIDI files), musical scores, links to online services where the album can be purchased, or extended information on the work with information on its location in a library or online. Today, commercial services like EMusic let users search databases of available recordings by artist, album, track, or label.
However, most systems are unable to search on musical content, thereby excluding the actual music as a searchable entity. As you can imagine, letting users search on musical content adds richness to searches, as well as a completely new dimension to the type of searches users can perform. Without this, many searches can be time consuming, or even impossible.
For example, imagine you wish to search your music collection or online database for all Mozart symphonies in the key of A major. Overall, this is a simple operation since most music catalogs include the key as a searchable element. However, how would you find all Mozart symphonies in the key of A major that also contain some specific melody? With current systems, one technique is to first find all Mozart symphonies in the key of A major, then listen to each for the target melody. Alas, this approach is error prone, time consuming, and tedious.
Another approach is to use a printed music index of themes for Mozart symphonies and find the symphony by key and theme. Music literature includes numerous thematic indexes of well-known works, making it easy for you to associate important musical themes with works.
Our group at the University of Michigan (http://musen.engin.umich.edu/) is exploring ways to design and build useful, efficient, and powerful MIR systems using stochastic methods and concomitant retrieval mechanisms. We currently have two software systems that implement retrieval mechanisms for searching on musical content. One system is a server-based implementation deployed at the University of Michigan's School of Music library. The other is a standalone system that runs on Mac OS X using Apple's iTunes software.
In this article, we discuss the issues involved in developing an MIR system that incorporates musical content as a searchable element, discuss our approach to solving this problem, and show an implementation of a prototype system running under iTunes, concentrating on audio-based queries only.
Music Information Retrieval
Generally, MIR systems that use musical content in searches operate as follows: Users present the system with a query. This request originates from one of two sources: a MIDI device, such as a keyboard connected to a computer on which users play a melody, or an audio query, where users sing or hum a melody into a microphone connected to a computer. In both cases, the MIR system transforms the request into some internal representation, then processes the request by searching its musical database for matching material, and returns matching records to users.
However, there are innate problems in these MIR systems. For example, most users are not trained musicians and can have difficulty accurately singing the intended melody. These errors are notoriously difficult to address in audio-based query systems. In addition, correctly translating the sounds that users play or sing into actual musical notes that the MIR system can understand is fundamental to returning correct results. This transcription process generally works well for monophonic melodies, but is difficult when there are many notes sounding simultaneously, as in most musical recordings.
One technique for performing music retrieval is to use text strings to represent user queries and string-based search techniques for finding matches. In this model, strings represent the pitch contour and intervallic distances of the sung queries. For example, suppose a user sings a melody composed of the following notes: A A E C D E F D E. This melody is converted to the representation S U D U U U D U, where S represents a repeated note, U indicates the current note is higher than the previous one, and D indicates the current note is lower than the previous one. This is commonly called the "SUD representation." We can also introduce a small u or d, indicating a jump less than some amount (typically, a minor third), leaving the uppercase letters to represent larger jumps.
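The conversion just described is easy to sketch in code. The following Python function (our own illustration, not code from any of the systems discussed) maps a sequence of MIDI pitch numbers to a SUD contour string, with an optional fine-grained mode that uses lowercase u/d for jumps smaller than a minor third (3 semitones). The octave assignments for the example melody are our assumption, chosen to reproduce the contour given in the text.

```python
def sud_contour(pitches, fine=False):
    """Return the SUD contour string for a sequence of MIDI pitches.

    With fine=True, intervals smaller than a minor third (3 semitones)
    become lowercase 'u'/'d'; larger jumps stay uppercase 'U'/'D'.
    """
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        diff = cur - prev
        if diff == 0:
            out.append("S")          # repeated note
        elif diff > 0:
            out.append("u" if fine and diff < 3 else "U")   # upward motion
        else:
            out.append("d" if fine and -diff < 3 else "D")  # downward motion
    return " ".join(out)

# The melody A A E C D E F D E from the text, with assumed octaves
# (A4=69, E5=76, C5=72, D5=74, F5=77):
melody = [69, 69, 76, 72, 74, 76, 77, 74, 76]
print(sud_contour(melody))  # S U D U U U D U
```

Because the representation records only relative motion, the same contour string results no matter what key the user sings in.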
While string-based techniques perform well, they do have limitations:
- They perform well on music where melodic contour is the distinguishing feature of the music, but do not work on other classes of music that do not contain strong melodic signatures and where rhythm is a significant characteristic (such as drum music).
- Many works may contain the same melodic signatures, producing many false positives.
- They have difficulty with works containing multiple concurrent independent melodies.
Stochastic Models & Music Retrieval
We base our approach to MIR on the belief that melodic content, if used effectively, can provide extended information and clues to guide the search process. Think of your favorite pop song or classical symphony: You probably remember some melody. Because many musical works can be characterized by their distinctive melodies, you can use this information to guide your search. Further, people are generally able to recall some melody, so we can use melody as a query subject. In "query-by-humming" (QBH) systems like ours, users hum or sing the melody, or a fragment of it, into the search engine. Figure 1 shows the main components of this system.
Themes: The Musical Database
The first task in building our system was to create a database of musical material, which constitutes our search space. This database is composed of a set of themes derived from musical works. For example, a Mozart symphony has four movements. We treat each movement as a separate work, which has a set of derived themes. Thus, a single work may map onto multiple themes, while a theme maps only to a single work.
Automatically extracting the right themes from a work is a complex process. Our extraction program, called "Melodic Motive Extractor" (MME), first extracts repeated sequences of notes from a MIDI file representing the work, then evaluates each sequence against some criteria (such as intervallic variety or rhythmic consistency) to determine its likelihood as a theme. Our experimental results show that MME identifies the principal themes listed in Barlow's thematic catalog in 97 percent of cases (see "Thematic Extractor," by C. Meek and W. Birmingham, International Symposium on Music Information Retrieval, 2001; http://musen.engin.umich.edu/). At this writing, MME is undergoing a patent investigation, so we can't distribute it. However, the algorithm is based on fully documented papers listed at the end of this article.
A Markov Model (MM) is a probabilistic model of a system that transitions through a sequence of states. A distinguishing characteristic of the model is the Markov property, which states that the probability distribution across next states is based only on the current state and nothing else in the state history. Many domains use Markov Models; for example, they have achieved high performance in the area of speech recognition. In a sense, you can think of an MM as a generative model that can describe an underlying structure able to generate a sequence of observed events, called an "observation sequence."
An extension of the Markov Model is the Hidden Markov Model (HMM) (see "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," by L. Rabiner; Proceedings of the IEEE, Vol. 77, No. 2, 1989). The difference between a Markov Model and a Hidden Markov Model is in the mapping from states to observations. Markov Models assume a one-to-one correspondence between states in the model and observed events. Thus, if the model is in state "A," you expect to see the same observation (such as the note "A") every time. Hidden Markov Models generalize this to assume a fixed-probability distribution over the set of possible observations, given a state in the model. For example, if the model is in state "A," you might expect to observe "A" 70 percent of the time; "A sharp" 10 percent of the time; and "A flat" 20 percent of the time. This probability distribution allows modeling of uncertainties such as a singer's inability to reliably sing the exact pitch or rhythm intended.
Our system represents each theme as an HMM and each sung query as an observation sequence. A theme HMM consists of state, transition-probability, and observation-probability information about pitch and rhythm interval transitions. An observation sequence is a sequence of pitch and rhythm intervals.
Figure 2 characterizes how themes are converted into Markov Models. In this example, a state is defined by the number of semitones between successive notes, the intervallic distance. This is similar to using SUD for a state description, but provides more information about the contour of the theme. The intervallic representation works well because it represents the relative pitch of a note, so users are not confined to singing in any particular key. The probability of transitioning between two states in an HMM is the probability of a state's occurrence given the previous state.
The HMMs we use have a more complex state definition. This definition includes a notion of rhythm based on the relative temporal distance between the start of successive notes. We have simplified the representation description in this article. Adding more attributes to a state definition is a straightforward extension.
We model singer error through observation probability distributions. This requires the construction of a large training set of sung examples of known pieces. These must be pitch-tracked and labeled with note names, and the results compared to the original pieces. As this is done for many pieces, a statistical profile of singer error emerges and may be used as the basis for the observation probability distribution of an HMM. Ideally, each singer should have an individual observation probability profile. In practice, this is a time-consuming process, so we have done this for a few representative singers of different ability and use the observation probability data of the singer whose profile most closely matches the current user. For details on our state definitions and observation probability models, see "HMM-Based Musical Query Retrieval," by J. Shifrin, B. Pardo, C. Meek, and W. Birmingham; Joint Conference on Digital Libraries, 2002.
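The statistical profile described above amounts to a confusion distribution: for each intended interval, how often singers actually produce each sung interval. The sketch below illustrates the counting step with invented (intended, sung) pairs; in practice the pairs come from pitch-tracked, hand-labeled training queries as described in the text.

```python
from collections import Counter, defaultdict

def observation_profile(aligned_pairs):
    """Estimate P(sung interval | intended interval) from aligned pairs."""
    counts = defaultdict(Counter)
    for intended, sung in aligned_pairs:
        counts[intended][sung] += 1
    return {state: {obs: n / sum(c.values()) for obs, n in c.items()}
            for state, c in counts.items()}

# Invented training data: mostly accurate, with occasional off-by-a-semitone
# errors of the kind real singers make.
pairs = [(2, 2), (2, 2), (2, 1), (2, 3), (-3, -3), (-3, -2),
         (0, 0), (0, 0), (0, 1)]
profile = observation_profile(pairs)
print(profile[2])
```

Each inner dictionary can then serve as the observation-probability distribution for the corresponding HMM state.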
Processing a Query
Once we build a database of themes and Markov Models, the system can process user search requests. The first step is to acquire a query from a user via the recording module. This part of the system is responsible for taking a user's audio query and saving it to a WAV file. The WAV file is then passed to the pitch tracker. This component first identifies the fundamental frequency across the audio signal, then segments the frequency signal into a sequence of notes. Each note is a pitch, onset, and duration triple, where pitch is the MIDI pitch value of the note being played, onset is the time that the note starts, and duration is the length of the note. A sequence of note triples describes the query. This information is stored in a text file that has a ".notes" extension.
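The article does not give the exact syntax of a .notes file, so as an illustration we assume a plain-text layout of one "pitch onset duration" triple per line. Under that assumption, a minimal parser looks like this:

```python
def parse_notes(text):
    """Parse note triples from text, one 'pitch onset duration' per line.

    Returns a list of (midi_pitch, onset_seconds, duration_seconds) tuples.
    The file layout is an assumption for illustration, not the actual
    .notes format.
    """
    notes = []
    for line in text.strip().splitlines():
        pitch, onset, duration = line.split()
        notes.append((int(pitch), float(onset), float(duration)))
    return notes

sample = """\
69 0.00 0.45
69 0.50 0.40
76 1.00 0.90
"""
print(parse_notes(sample))
```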
Once the sung query is in .notes format, we can perform the actual search. To accomplish this, we use a program called "Hanoj," which first constructs in-memory HMMs for each theme in the database; see Figure 3. Hanoj treats the user query as an observation sequence generated by one of the HMMs. The program determines the likelihood each HMM had of generating the observation sequence using the Forward Algorithm. Themes are then ranked by the likelihood their HMMs generated the query. Like MME, Hanoj is undergoing a patent search.
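The Forward Algorithm itself is compact. The toy model below (our own sketch with invented probabilities, not the Hanoj data structures) represents states as interval values and uses dictionaries for the initial, transition, and observation distributions; the function returns the probability that the model generated the observation sequence, summed over all state paths:

```python
def forward_likelihood(obs, init, trans, emit):
    """P(observation sequence | model) via the Forward Algorithm."""
    # alpha[s] = P(observations so far AND current state = s)
    alpha = {s: p * emit[s].get(obs[0], 0.0) for s, p in init.items()}
    for o in obs[1:]:
        # Sum over all ways of reaching state s, then emit observation o.
        alpha = {s: sum(alpha[r] * trans[r].get(s, 0.0) for r in alpha)
                    * emit[s].get(o, 0.0)
                 for s in init}
    return sum(alpha.values())

# Invented two-state model: states are the intervals +2 and -2 semitones.
init  = {2: 0.6, -2: 0.4}
trans = {2: {2: 0.7, -2: 0.3}, -2: {2: 0.4, -2: 0.6}}
emit  = {2: {2: 0.8, 1: 0.2}, -2: {-2: 0.9, -1: 0.1}}

query = [2, 2, -2]   # interval sequence extracted from a sung query
print(forward_likelihood(query, init, trans, emit))
```

Running this scoring function once per theme HMM and sorting the results by likelihood yields the ranked list the text describes.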
The Macintosh implementation of the system, called "VocalSearch," implements our general system as a standalone personal-media-management tool to efficiently search and manage digital music collections. VocalSearch operates on any Apple computer running Mac OS X (10.2) that has sound input capabilities and Apple's iTunes program installed. iTunes (http://www.apple.com/itunes/) is a freely available program from Apple that supports playing, organizing, and recording digital music. iTunes supports running user-created programs written in AppleScript, which automate many iTunes tasks and extend the functionality of programs such as VocalSearch.
We structure the iTunes implementation as a collection of command-line programs controlled by an AppleScript. The VocalSearch program functions are as follows: A user launches iTunes and selects the VocalSearch script from the Scripts menu. This starts up a voice-recorder program. The user sings a melody into the computer and the recording program records the sung melody and saves it to disk. Next, VocalSearch runs the pitch tracker, which extracts the notes from the user query and writes a new file containing the extracted notes. The Hanoj search program compares the notes file against the musical database of themes, and produces an ordered list of works that best match the user's query. Finally, the VocalSearch script takes the list of matched songs and creates a named Playlist within iTunes, which contains the matched songs. Within iTunes, users can listen to a matched song by highlighting the song name in the Playlist and selecting Play. Figure 4 shows an example iTunes Playlist resulting from singing a portion of the Beatles' "Sgt. Pepper's Lonely Hearts Club Band." For the script to construct the tracks in the Playlist, the songs need to be in the iTunes Library.
In Listing One, the controlling script calls a command-line program and acquires its output. Each of our command-line programs returns information in a strict format, permitting the script to determine if the program was successful or not, as well as collect the result of the run:
[program-name] [error-status] [error message | result]
Hanoj error "There was an error processing" or
Hanoj no-error Playlist-name track0 ... trackN
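A caller in any language can split on this format to separate success from failure. The sketch below (our own illustration; the actual VocalSearch script does this in AppleScript) parses a result line under the simplifying assumption that fields are whitespace-separated:

```python
def parse_result(line):
    """Parse '[program-name] [error-status] [error message | result]'.

    Raises on an error status; otherwise returns (playlist_name, tracks).
    Assumes whitespace-separated fields, a simplification for illustration.
    """
    parts = line.split()
    name, status, rest = parts[0], parts[1], parts[2:]
    if status == "error":
        raise RuntimeError(f"{name}: {' '.join(rest)}")
    playlist, tracks = rest[0], rest[1:]
    return playlist, tracks

print(parse_result("Hanoj no-error MyPlaylist track0 track1"))
```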
The complete AppleScript implementation of VocalSearch is available electronically; see "Resource Center," page 5.
As more and more music becomes available in digital form, people need better ways to manage their collections and to find the music they like efficiently. The techniques discussed here are implemented in a working, audio-based MIR system for managing real music collections.
For More Information
McNab, R.J., L.A. Smith, et al. "Towards the Digital Music Library: Tune Retrieval From Acoustic Input;" Digital Libraries, ACM, 1996.
Hoos, H., K. Renz, and M. Görg. "GUIDO/MIR: An Experimental Music Information Retrieval System Based on GUIDO Music Notation;" Proceedings of ISMIR, 2001.
Mazzoni, D. and R.D. Dannenberg. "Melody Matching Directly from Audio;" Proceedings of ISMIR, 2001.
Pickens, J., J.P. Bello, G. Monti, T. Crawford, M. Dovey, M. Sandler, and D. Byrd. "Polyphonic Score Retrieval Using Polyphonic Audio Queries: A Harmonic Modeling Approach;" Proceedings of ISMIR, 2002.
Meek, C. and W. Birmingham. "Thematic Extractor;" International Symposium on Music Information Retrieval, 2001.
Rabiner, L. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition;" Proceedings of the IEEE. Vol. 77, No. 2, 1989.
Shifrin, J., B. Pardo, C. Meek, and W. Birmingham. "HMM-Based Musical Query Retrieval;" Joint Conference on Digital Libraries, 2002.
Listing One

on run_viterbi()
    set args to UNIX_VS_PATH & UNIX_FILE_DELIM & PT_FILE & " " & ¬
        VETERBI_UNIX_PATH & UNIX_FILE_DELIM & "dirBeatles.txt" & " " & ¬
        DEFAULT_PLAYLIST & " " & VETERBI_UNIX_PATH
    set cmd to VETERBI_UNIX_PATH & UNIX_FILE_DELIM & VETERBI_APP & " " & args
    my my_log(cmd)
    tell application "Terminal"
        set theResult to do shell script cmd
    end tell
    my my_log(theResult)
    if the 1st word of theResult is equal to "error" then
        handle_error(theResult, VITERBI)
    end if
    set aList to theResult
    return aList
end run_viterbi