Managing Your MP3 Library in Perl


The Perl Journal July 2003

By Luis E. Muñoz

Luis is an Open Source and Perl advocate at a nationwide ISP in Venezuela. He can be contacted at [email protected].


These days, the practice of backing up one's music CDs as MP3 (or OGG) files has become widespread. I own a few hundred CDs that I've purchased over the years, and a few of them have become coasters through a complex process that involves the passage of time, rough surfaces, and dirt. However, keeping a few gigabytes worth of MP3 files on my laptop's hard drive is not the best way to back up this material. It is also not cheap. When I decided to write the tool I shall describe in this article, I had about 6 GB of MP3 files. I know people who have much more than that, though.

The solution, of course, was to burn a few CD-ROMs with the MP3 files to make room on my laptop's drive. After all, there's no point in carrying eight days of nonstop music with you unless you're a DJ.

Overview

I wanted a database to hold information about the ID3 tag of each song as well as its location in my backup library. This would allow me to quickly locate, say, all the songs from a given artist. Sometimes I find errors in the tags, so I would like my fixes to be incorporated in the library automatically.

I also wanted the process to be simple and, ideally, integrated with the Mac OS X environment on my PowerBook, which is where I most wanted to use this tool. I wanted to simply insert a blank CD-ROM in my burner, fire up a command, and wait for the CD-ROM with my MP3s to be ejected. As you'll see later, I came really close.

My solution is based on the excellent File::Find module, which simplifies the task of traversing a file tree such as the one typically associated with an MP3 library. For managing operations with file names, paths, and the like, I used File::Spec and File::Path, respectively. This helps me ensure portability for my new tools. I also used MP3::Tag for reading the ID3 tags in the MP3 files. I handle the detection of changes in already archived files with Digest::MD5. The database is maintained with DB_File, Storable, and MLDBM, which allow me to conveniently store a Perl data structure in a file that can be accessed very quickly. Thanks to all these wonderful and free modules, easily found through a search in CPAN, my task became much simpler.

I decided to make multiple command-line tools for more or less specific tasks. This helped me to keep the interface simple enough to be easy to remember. At the time of this writing, there are two tools: mp3cat for copying files, adding them to the database or finding out which files are or are not archived; and mp3dump, a database reporting and backup utility that I hope will make it easier to find a particular song in the archive. I will visit each of those programs in turn and explain their inner workings. (Both utilities are available in their entirety online at http://www.tpj.com/source/.)

Access to the database is encapsulated through the use of a tied hash. Tied hashes provide a very simple model for manipulating data. Normally, all hash operations (especially keys, values, and each) work as expected with a tied hash. MLDBM allows for the storage of serialized Perl data structures, which can be easily read back when needed. To do this, MLDBM uses the help of a database module such as DB_File and a serialization module such as Storable.

mp3cat: Keeping Track of the Database

The first part of the script (Listing 1) specifies the modules to use and loads warnings and strict, which should always be used because they help trap hard-to-find errors such as misspelled variable names. After this come the declaration of variables, the specification of command-line options, some error checking of those options with proper error responses courtesy of Pod::Usage, and the specification of the default database name.

Because the database will be accessed through a tied hash, at line 132, I declare the hash and make sure it is empty to begin with. To prevent abrupt interruptions from corrupting the database, lines 134 to 138 show a signal handler that kicks in when the user interrupts the script in the middle of its execution. This handler calls untie on the hash that is tied to the database at lines 140 and 141, to force it to close gracefully, preventing corruption. To be safe, I assume this might not be enough protection, and I regularly back up the database just in case. In the worst case, rebuilding the database is a matter of reinserting all the library CD-ROMs, but it is so much faster to simply copy a file to another directory.

132   my %db = ();
133   
134   $SIG{INT} = sub 
135   { 
136       untie %db; 
137       die "User requested interruption\n";
138   };
139   
140   tie %db, 'MLDBM', $opt_d, O_CREAT | O_RDWR, 0666
141       or die "Failed to tie database $opt_d: $!\n"; 

Next comes the part of the code that traverses the directories specified in the command line, at lines 145 through 156. This code is quite simple, thanks to File::Find. Basically, we're requesting a recursive traversal of each path in the command line, stored in $dir. For each filesystem object found, the subroutine analyze will be called. Lines 152 and 153 show some customization we've asked for, namely following symbolic links and not changing directories during the directory traversal.

145   for my $dir (@ARGV)
146   {
147       find( 
148             {
149                 wanted    => \&analyze,
150                    # Follow symlinks and don't chdir() into
151                               # each subdir
152                 follow    => 1,
153                 no_chdir  => 1,
154             }, $dir
155             );
156   } 
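
As a minimal, self-contained sketch of the same kind of traversal (not the author's mp3cat code), the following collects every .mp3 file below the directories given on the command line. The helper name find_mp3s is hypothetical, and the follow option used in the article is left out here to keep the example small.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the sorted list of paths under @dirs ending in .mp3.
sub find_mp3s {
    my @dirs = @_;
    my @found;
    find(
        {
            # With no_chdir set, $File::Find::name holds a path that is
            # usable from the current directory, so -f can test it directly.
            wanted => sub {
                push @found, $File::Find::name
                    if -f $File::Find::name
                    && $File::Find::name =~ /\.mp3$/i;
            },
            no_chdir => 1,
        },
        @dirs
    );
    my @sorted = sort @found;
    return @sorted;
}

print "$_\n" for find_mp3s(@ARGV);
```

Collecting the paths in a list, rather than acting on each file inside the callback, is only one possible design; mp3cat processes each file as it is found.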

The subroutine analyze (lines 238 to 285) is responsible for extracting the suitable data from each MP3 file. Note how we restrict the file's extension with a simple regexp at line 240, to avoid unnecessary work. However, you could remove this and use this utility to perform incremental backups of your files. The hash reference $song will be used to store all the information we can get from the song we're processing. The first information element we have is its filename, which is passed by File::Find via the $File::Find::name scalar. I store this information at line 242.

238   sub analyze
239   {
240       return unless $File::Find::name =~ qr/\.mp3$/i;
241   
242       my $song = { path => $File::Find::name };
243   
244       if ($opt_s)
245       {
246           my $mp3 = MP3::Tag->new($File::Find::name);
247   
248           ($song->{name}, 
249            $song->{track}, 
250            $song->{artist}, 
251            $song->{album}) = $mp3->autoinfo;
252   
253           $mp3 = undef;                # Free any resources
254           
255           unless ($song->{name} or $song->{artist} or $song->{album})
256           {
257               warn "$File::Find::name contains no understandable tags\n";
258           }
259   
260           $song->{$_} ||= '?' for qw(name track artist album);
261       }
262 

When the -s option is specified, the if block on lines 244 through 261 decodes the ID3 tag information that may lie inside the MP3 file. Otherwise, this task is skipped to save time and resources. The information is decoded via a call to the autoinfo method that MP3::Tag provides, at line 251. This will even try to derive information from the filename if no tags are found. Since no more tag-related operations will be done, I request the destruction of the MP3::Tag object at line 253 by assigning undef to the object reference. Line 260 stores a placeholder for attributes in case the data is not available.

263       $song->{size} = -s $File::Find::name;
264   
265       my $fh = new IO::File $File::Find::name, "r";
266   
267       unless ($fh)
268       {
269           warn "Failed to open $File::Find::name: $!\n";
270           return;
271       }
272   
273       binmode($fh);
274   
275       $song->{md5} = Digest::MD5->new->addfile($fh)->hexdigest;
276   
277       $fh->close;
278   
279       $song->{file} = (File::Spec->splitpath($File::Find::name))[2]; 

In line 263, I store the file length in $song->{size} using the -s operator. With the code at lines 265 to 277, I calculate the MD5 signature of the MP3 file. An MD5 signature or "Message Digest" is actually a 128-bit number that is assigned to a sequence of bytes by a series of mathematical operations. I use this to recognize when a file has been changed because of two nice properties of message digests:

  • Any change in the file produces a completely different digest.
  • Finding two files with the same digest is really difficult.

Note that I read the file in binary mode, as requested at line 273, to prevent differences in the treatment of newlines among different operating systems from causing unnecessary duplication.

Because of the uniqueness of the MD5 signature, this is what we'll use as the key to the database, storing the whole $song hash reference. (If someone ever reports a collision, I promise to expand the key with some other data to avoid it.)
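
The idea of keying on file content can be sketched with core modules only. This is an illustrative example under stated assumptions, not code from mp3cat: the helper file_md5 and the in-memory %seen hash are hypothetical stand-ins for the digest step and the database.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: hex MD5 digest of a file's raw bytes, or undef on failure.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;    # raw bytes, so newline translation cannot alter the digest
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Index files by content: two paths with identical bytes share one key.
my %seen;
for my $path (@ARGV) {
    my $md5 = file_md5($path);
    unless (defined $md5) {
        warn "Failed to open $path: $!\n";
        next;
    }
    if (exists $seen{$md5}) {
        print "$path duplicates $seen{$md5}\n";
    }
    else {
        $seen{$md5} = $path;
    }
}
```

Because the key depends only on the bytes of the file, renaming or moving a song does not create a second entry, while editing its tags does.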

At line 279, I use the services of File::Spec to find the filename in the path name that File::Find gave us through the $File::Find::name scalar. This helps to ensure the portability of the code to operating systems that use different separators in the path names.
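
A small sketch of that use of File::Spec follows. The helper name base_name and the sample path are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: the filename component of a path.
sub base_name {
    my ($path) = @_;
    # splitpath returns the volume, the directory portion, and the file name
    my ($volume, $dirs, $file) = File::Spec->splitpath($path);
    return $file;
}

# Build the sample path with catfile so that the separator matches the host OS.
my $path = File::Spec->catfile('Music', 'Enya', 'Watermark', '01.mp3');
print base_name($path), "\n";
```

On a Unix-like system such as Mac OS X this prints 01.mp3. Letting File::Spec build and split the path avoids hard-coding the "/" separator.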

284       _perform $song; 

Once the data has been collected, a call is made to _perform at line 284 in order to decide what to do with this particular song before going to the next in turn.

194       $song->{vol} = $opt_V || 
195           (File::Spec->splitdir
196            ((File::Spec->splitpath
197             (File::Spec->canonpath($song->{path})))[1]))[1] || '?'; 

The first part of _perform, on lines 194 to 197, attempts to provide a volume name, the name chosen for each CD-ROM in the library, based on the mount point. This code selects the second component of the pathname given as the destination, which tends to work well when mounting out of /Volumes, the default place where Mac OS X mounts new volumes. Of course, the volume specified through the -V command line option takes precedence.

At lines 199 through 231 (Listing 2), an if...elsif block is used to perform slightly different actions depending on the command-line options that were specified. These actions might be attempting to copy the file to a given destination through a call to _copy, adding or replacing an entry in the database with a statement such as $db{$song->{md5}} = $song, verifying whether this song is already archived with a statement such as exists $db{$song->{md5}}, or printing all the song data, such as in line 229. It is vital to use exists in these checks, to prevent autovivification from adding empty entries to the database.
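
Autovivification itself is easy to demonstrate with an ordinary, untied hash. The sketch below is only an illustration of the general Perl behavior; it makes no claim about how a tied MLDBM hash reacts, and the key name is a placeholder.

```perl
use strict;
use warnings;

my %plain;

# Reading through a missing key to a nested field creates the outer entry.
my $title    = $plain{'no-such-md5'}{title};
my $vivified = exists $plain{'no-such-md5'} ? 1 : 0;

# Testing the top-level key with exists creates nothing.
my %clean;
my $present = exists $clean{'no-such-md5'} ? 1 : 0;
my $count   = keys %clean;

print "vivified=$vivified present=$present count=$count\n";
```

This prints vivified=1 present=0 count=0: the innocent-looking read added an entry to %plain, while the exists test left %clean empty.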

Whenever a reference is assigned to the tied hash %db, MLDBM will use Storable as requested at line 102, to serialize the Perl data structure referenced. The concept of serialization is very important, because it allows for data structures to be stored for later use. This is sometimes referred to as "persistence." In essence, serializing a data structure means to translate it to a representation that lacks things like pointers and references to a process's data. This is often called a flat or serial representation, thus the name. By storing the reference to a hash with all the song data, we can later access this information for other purposes.

Our serialized data is then stored in the database by the DB_File module. This process occurs in the opposite order whenever data is read from the tied hash. It is very important to keep in mind that there is no way to track accesses to the nested structures that might live within the reference. For instance, the following code won't usually work as expected:

# wrong
$db{$my_md5}->{title} = "The lost song"; 

It won't work because the tie interface will fetch the entry corresponding to $my_md5 from the underlying DB_File database and Storable will make sure that a hash reference is reconstructed from the stored data. However, that referenced data, which is being modified in your process, is never being stored back in the database. MLDBM has no way to tell that the referenced data has been altered. A correct alternative is shown below. It works because the reference is fetched from stable storage, modified and then explicitly restored.

# longer but correct
my $song = $db{$my_md5};
$song->{title} = 'The lost song';
$db{$my_md5} = $song; 
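
Both forms can be compared side by side in a throwaway database. This is a hedged sketch that assumes MLDBM, DB_File, and Storable are installed, as the article requires; the key 'demo-key' and the file name demo_db are placeholders, not values used by mp3cat.

```perl
use strict;
use warnings;
use Fcntl;
use File::Spec;
use File::Temp qw(tempdir);
use MLDBM qw(DB_File Storable);

# Throwaway database in a temporary directory (the file name is arbitrary).
my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, 'demo_db');

tie my %db, 'MLDBM', $file, O_CREAT | O_RDWR, 0666
    or die "Failed to tie database $file: $!\n";

my $key = 'demo-key';    # stands in for a real MD5 digest
$db{$key} = { title => 'Untitled' };

# Direct nested assignment changes a temporary copy; nothing is stored back.
$db{$key}->{title} = 'The lost song';
my $after_direct = $db{$key}->{title};

# Fetch, modify, store: the change reaches the database.
my $song = $db{$key};
$song->{title} = 'The lost song';
$db{$key} = $song;
my $after_store = $db{$key}->{title};

print "direct assignment:  $after_direct\n";
print "fetch/modify/store: $after_store\n";

untie %db;
```

The first line of output should still show Untitled, while the second shows The lost song.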

Note that in the calls to _copy, a die follows in case of a false return value, as shown at lines 206, 215, and 223. This is useful to stop the process when a CD-R image is full, which helps make the backup process straightforward.

161   sub _copy
162   {
163       my $song = shift;
164       return 1 unless defined $opt_c;
165   
166       my $dest = File::Spec->canonpath(File::Spec->catfile($opt_c, 
167                                               $song->{path}));
168       my $dp = (File::Spec->splitpath($dest))[1]; 

I begin _copy at line 164 by checking whether a file copy destination was specified in the command line with the -c option, returning success otherwise. In lines 166 through 168, I obtain a destination path by combining whatever the user supplied in the command line with the source file path name. This is done with File::Spec in order to achieve portability to other operating systems with different file naming conventions.

170       mkpath([$dp]);
171   
172       unless (copy($song->{path}, $dest))
173       {
174           unlink $dest;
175           warn "Copy error: $!\n";
176           return;
177       }
178   
179       warn "$song->{path} transferred\n" if $opt_v;
180       return 1;
181   } 

Once a destination path has been established, the mkpath function from File::Path is used on line 170 to ensure the existence of all the required directory components. Next, I invoke copy from File::Copy at line 172 to attempt the copying of the MP3 file from its original location to the CD image. In case of error, I remove the possibly half-copied file, log the error, and return a false value on lines 174 to 177. If the copy is successful, true is returned.
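
The same sequence of steps (build the destination, create its directories, copy, clean up on failure) can be sketched as a standalone function. The name copy_below and its return convention are assumptions made for this illustration; it is not the author's _copy.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(mkpath);
use File::Spec;

# Hypothetical helper: copy $src to the same path below $root.
# Returns the destination path on success and nothing on failure.
sub copy_below {
    my ($src, $root) = @_;
    my $dest = File::Spec->canonpath(File::Spec->catfile($root, $src));
    my ($vol, $dirs) = File::Spec->splitpath($dest);

    # Create any missing directory components of the destination.
    mkpath([ File::Spec->catpath($vol, $dirs, '') ]);

    unless (copy($src, $dest)) {
        my $err = $!;       # save the error before unlink can overwrite it
        unlink $dest;       # drop a possibly half-written file
        warn "Copy error: $err\n";
        return;
    }
    return $dest;
}

# Usage: perl copy_below.pl SOURCE_FILE DESTINATION_ROOT
if (@ARGV == 2) {
    copy_below(@ARGV) or die "Terminating due to copy failure\n";
}
```

Returning a false value instead of dying inside the helper leaves the decision to the caller, which is what lets a full CD image stop the whole run.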

mp3dump: Simple Reporting

In order to support a simple mechanism for backup, restore, reporting, and general manipulation of the database, I wrote mp3dump. I will omit the explanation of the beginning of the script, as it has a lot in common with mp3cat.

138   if ($opt_r)
139   {
140       my $csv = new Text::CSV_XS;
141       while (my $line = <>)
142       {
143           last unless $csv->parse($line);
144           my @col = $csv->fields;
145           my $song = {};
146           for my $k (@keys)
147           {
148               $song->{$k} = shift @col;
149           }
150           if (! exists $db{$song->{md5}} or $opt_F)
151           {
152               $db{$song->{md5}} = $song;
153           }
154       }
155   } 

The first interesting bit of code, the restore operation, appears at lines 138 to 155. Here, I read lines from STDIN or files specified in the command line, using the diamond operator at line 141, and then use the parse method from Text::CSV_XS at line 143 to obtain the columns from a comma-separated file. A $song hash reference is populated with the columns found at lines 146 to 149, which is added if missing or forced through the -F option, at lines 150 to 153.

156   else
157   {
158       for my $song (values %db)
159       {
160           if ($opt_c)
161           {
162               no warnings;
163               my $csv = new Text::CSV_XS { always_quote => 1 };
164               $csv->combine(map { $song->{$_} } @keys);
165               print $csv->string, "\n";
166           }
167           elsif ($opt_l)
168           {
169               if (@keys)
170               {
171                   no warnings;
172                   print join(', ', map { "$_=$song->{$_}" } @keys), "\n";
173               }
174               else
175               {
176                   print join(', ', map { "$_=$song->{$_}" } keys %$song), "\n";
177               }
178           }
179       }
180   } 

All the other command-line options specify a report to be generated from the database, so they appear grouped together in an else statement between lines 156 and 180.

When -c is specified in the command line, the code on lines 162 to 165 generates a comma-separated report. The no warnings at line 162 is necessary to prevent warnings when a given song does not contain a column specified in @keys. The -l option triggers a simpler report, friendlier for grep, which is generated at lines 169 to 177. As you see, managing a database tied to a hash is a trivial task in Perl.
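
The dump and restore directions can be sketched together as a round trip through Text::CSV_XS. This assumes the module is installed; the column list in @keys, the helper names, and the sample record are placeholders rather than mp3dump's actual defaults.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my @keys = qw(md5 vol artist album name);    # column order is an assumption
my $csv  = Text::CSV_XS->new({ binary => 1, always_quote => 1 });

# Dump: one hash reference becomes one CSV line.
sub song_to_line {
    my ($song) = @_;
    # Missing columns become empty strings instead of triggering warnings.
    $csv->combine(map { defined $song->{$_} ? $song->{$_} : '' } @keys)
        or die "Cannot combine fields\n";
    return $csv->string;
}

# Restore: one CSV line becomes a hash reference, or nothing if unparsable.
sub line_to_song {
    my ($line) = @_;
    $csv->parse($line) or return;
    my %song;
    @song{@keys} = $csv->fields;
    return \%song;
}

my $line = song_to_line({
    md5    => 'abc123',
    vol    => 'mp3-cd-008',
    artist => 'Enya',
    album  => 'Watermark',
    name   => 'Orinoco Flow, "Sail Away"',
});
my $back = line_to_song($line);
print "$line\n$back->{name}\n";
```

Quoting every field means that commas and double quotes inside a title survive the trip, which is what makes the CSV report usable as a backup.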

Using the Tools

As an example, here are the commands I used the last time I created an incremental backup of my CD collection:

bash-2.05a$ cp mp3db mp3db.backup
bash-2.05a$ ./mp3cat -c /Volumes/MP3_014 -s -V mp3-cd-014 ~/Music 

The cp statement makes a simple backup of the database. I always do this before starting, just in case the CD-ROM burning fails or something simply goes wrong. This saves me from having to alter the database in a more complex way.

The invocation of ./mp3cat requests that songs not in the database be copied (-c) to a directory below /Volumes/MP3_014, that new songs have their information stored in the database (-s), and that a volume name of mp3-cd-014 (-V) be used. The songs to be processed are in subdirectories of ~/Music, which is where my MP3 library resides. A few minutes later, the command terminates and I happily burn my CD-ROM. If this were to fail, I could simply issue the command:

bash-2.05a$ cp mp3db.backup mp3db 

to restore my old version of the database, and start again.

Let's say that I want to load a copy of my database in a spreadsheet program to do some fancy formatting. I could do so with a command such as this:

bash-2.05a$ ./mp3dump -c > my_music.txt 

I could then simply import the CSV file. This file also happens to be a backup of the database, which could be easily restored by a command such as:

bash-2.05a$ ./mp3dump -r -d restored_db my_music.txt 

Another useful trick is restoring songs from my library. The following command shows how I would go about finding out where certain songs I want are backed up. Note that it would be a very good idea to install agrep alongside these tools, for tasks such as this.

bash-2.05a$ ./mp3dump -C vol,artist -l | egrep -i 'enya' | cut -f1 -d, | sort | uniq -c 
  21 vol=mp3-cd-008
   2 vol=mp3-cd-012 

As you can easily see, these tools are very simple yet powerful enough to handle the task. I hope this discussion of these tools gives you a valuable start in using these techniques.

TPJ

Listing 1

1   #!/usr/bin/perl
2   
3   # This is free software restricted by the same terms as Perl itself.
4   # (c) 2003 Luis E. Muñoz, All rights reserved.
5   

87   use Fcntl;
88   use strict;
89   use warnings;
90   use IO::File;
91   use MP3::Tag;
92   use Storable;
93   use File::Find;
94   use File::Spec;
95   use File::Copy;
96   use File::Path;
97   use Pod::Usage;
98   use DB_File;
99   use Getopt::Std;
100   use Digest::MD5;
101   
102   use MLDBM qw(DB_File Storable);
103   
104   use vars qw($opt_c $opt_d $opt_F $opt_h $opt_l $opt_n $opt_s $opt_v $opt_V);
105   
106   our $VERSION = do { my @r = (q$Revision: 1.6 $ =~ /\d+/g); sprintf " %d."."%03d" x $#r, @r };
107   
108   getopts('c:d:FhlnsvV:');
109   
110   if ($opt_h)
111   {
112       pod2usage
113       {
114           -verbose => 2,
115           -exitval => 255,
116           -message => "\n*** This is $0 version $VERSION ***\n\n",
117       };
118   }
119   
120   if (defined($opt_s) + defined($opt_n) + defined($opt_l) > 1)
121   {
122       pod2usage
123       {
124           -verbose => 1,
125           -exitval => 255,
126           -message => "Only one of -s, -l and -n can be specified at the same time",
127       }
128   }
129   
130   $opt_d ||= './mp3db'; 

Listing 2

199       if ($opt_s)
200       {
201                                # If the song is not in the DB or the
202                                   # -F option is given, store it
203   
204           if (! exists $db{$song->{md5}} or $opt_F)
205           {
206               _copy($song) || die "Terminating due to copy failure\n";
207               print $song->{path}, " stored\n" if $opt_v;
208               $db{$song->{md5}} = $song;
209           }
210       }
211       elsif ($opt_n)
212       {
213           unless (exists $db{$song->{md5}})
214           {
215               _copy($song) || die "Terminating due to copy failure\n";
216               print $song->{path}, "\n";
217           }
218       }
219       elsif ($opt_l)
220       {
221           if (exists $db{$song->{md5}})
222           {
223               _copy($song) || die "Terminating due to copy failure\n";
224               print $song->{path}, "\n";
225           }
226       }
227       elsif ($opt_v)
228       {
229           print join(', ', map { "$_=$song->{$_}" } keys %$song), "\n";
230       }
231   } 
