Cleaning Up a Symlink Mess


July, 2004: Cleaning Up a Symlink Mess

Randal is a coauthor of Programming Perl, Learning Perl, Learning Perl for Win32 Systems and Effective Perl Programming, as well as a founding board member of the Perl Mongers (perl.org). Randal can be reached at merlyn@stonehenge.com.


The box that hosts http://www.stonehenge.com/ also takes care of http://www.geekcruises.com/, the company web site for my buddy, "Captain" Neil. As such, I have a dual role: I'm not only a frequent Geek Cruise attendee—I'm also the webmaster!

Recently, I noticed that Neil had moved a few pages around on his site to reorganize some of the information on past cruises. As quite a few links have been announced and bookmarked to the old location for a given page, he didn't want to break those. So he naively placed a symbolic link from the old location to the new location. This means that a reference to the old location such as:

http://www.geekcruises.com/cruises/2003/perlwhirl3.html

would be the same as:

http://www.geekcruises.com/past_cruises/perlwhirl3.html

because he had moved the page as follows:

$ cd /data/web/geekcruises # the DocumentRoot for his server
$ cd cruises/2003
$ mv perlwhirl3.html ../../past_cruises
$ ln -s ../../past_cruises/perlwhirl3.html .

Now, at first glance, this appears to work. When either page is referenced, the same material is delivered by the server. However, there's no way for anyone outside my server to know that these two pages are absolutely identical. This means that any cache (including browser caches, outward border caches at large organizations, or even our own reverse proxy cache) would now have two copies of the same material, having fetched the material needlessly twice.

Worse, some of the relative URLs are now broken. From the original location, getting back up to the index page requires ../../index.html, but from the new location it is merely ../index.html, so a relative link that is correct under one URL points somewhere else under the other. In fact, this is how I noticed the symlinks in the first place: a badly constructed web crawler was sucking down multiple copies of the web site, thinking that each index.html at the top level was different as well.

The correct way to move such a page that might have been bookmarked or indexed is to have Apache issue an HTTP redirect when the old URL is referenced. For example, in the configuration file for the Geek Cruises website, we can add:

Redirect /cruises/2003/perlwhirl3.html http://www.geekcruises.com/past_cruises/perlwhirl3.html

With this line in the configuration, a browser requesting the old URL will be asked to fetch the new URL instead. This redirect (also called an external redirect) is sufficient to ensure that caches will cache only one version (at the new URL) and indexers such as Google will invalidate the old URL over time.

Now, Neil doesn't have direct access to the web server's master configuration file, but he can add .htaccess files in the various affected directories. That same Redirect line can be placed in an .htaccess file in the cruises/2003 subdirectory, and it would have the same result.

When I saw a few dozen of these symlinks all over the document tree for Neil's server, I explained this to him and then said it'd actually be a small matter of programming to automatically replace all of those symlinks with updated .htaccess files. When all I heard was silence at the other end of the connection, I recognized that I'd need to write the program myself, since I'd now claimed it could be done. And that program is in Listing 1.

Lines 1 through 3 start nearly every program I write, enabling warnings for development, compiler restrictions (forbidding undeclared variables, symbolic references, and barewords), and turning off the pesky output buffering.

Lines 5 through 9 define my configuration parameters. The $URL is needed because an external redirect has to include the hostname and there's no easy way to get at that from inside the .htaccess file. The $USER and $GROUP are the owner and group for the newly created or updated .htaccess file; I'm running this as root, so I have to set them correctly for Neil to be able to edit the file later. And $MODE gives the permissions for the new .htaccess file.

Line 11 pulls in the abs_path routine from the core module Cwd. Lines 13 and 14 use my File::Finder module (available on CPAN) to easily get a list of all directories at or below the document root.
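File::Finder is not part of the core distribution. Where installing it is not an option, a rough equivalent of lines 13 and 14 can be sketched with the core File::Find module. This is a substitute technique, not what Listing 1 does, and dirs_under is a hypothetical helper name; the scratch directory layout is made up for the demo.

```perl
use strict;
use warnings;
use File::Find qw(find);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# List $top and every real directory beneath it (symlinks to
# directories are excluded, as with find's "-type d").
sub dirs_under {
    my ($top) = @_;
    my @dirs;
    # Inside the callback, $_ is the entry's basename and
    # $File::Find::name is its full path.
    find(sub { push @dirs, $File::Find::name if -d $_ && ! -l $_ }, $top);
    return @dirs;
}

# Demo in a scratch tree (all names here are assumptions).
my $top = tempdir(CLEANUP => 1);
make_path("$top/a/b");
symlink 'a', "$top/alias" or die "Cannot symlink $top/alias: $!";

print "$_\n" for dirs_under($top);
```

By default File::Find does not descend through symbolic links, which matches what a cleanup tool like this wants.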

Lines 17 to 60 iterate over each of those directories, which can be considered completely separately. Line 18 finds all the symbolic links within that directory using a simple grep over the result of a glob. Note that I'm presuming UNIX file syntax here, but that's safe because I know my server box is not likely to ever be anything but UNIX. Had I wanted this a bit more portable, I'd use File::Spec to construct the path.
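The grep-over-glob idiom from line 18 can be tried in isolation. The following is a minimal sketch run against a scratch directory; symlinks_in is a hypothetical helper name and the file names are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Return the symbolic links that sit directly inside $dir.
# Like the listing, this assumes UNIX path syntax.
sub symlinks_in {
    my ($dir) = @_;
    return grep { -l $_ } glob("$dir/*");
}

# Demo: one real file and one symlink pointing at it.
my $dir = tempdir(CLEANUP => 1);
open my $fh, '>', "$dir/real.html" or die "Cannot create $dir/real.html: $!";
close $fh or die "Cannot close $dir/real.html: $!";
symlink 'real.html', "$dir/alias.html" or die "Cannot symlink: $!";

print "$_\n" for symlinks_in($dir);    # only the alias.html path
```

Two limits of this idiom are worth keeping in mind: the * pattern does not match names that begin with a dot, and the built-in glob splits its pattern on whitespace, so a directory name containing spaces would need different handling.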

Line 20 sets up the list of @deletes. These are the candidate symbolic links that are being replaced with .htaccess redirects, and can be deleted once the updated .htaccess file is in place. Line 21 computes the name of the .htaccess file for this particular directory.

Lines 23 to 45 process each symbolic link that was found in the directory separately. First, the target of the symbolic link is read in lines 24 and 25. If the $path is not defined, it's either not a symbolic link or something went horribly wrong, and we ignore it.

Next, lines 26 and 27 ignore absolute symbolic links. I'm not sure why this code is in there, but it seemed to be the safest thing to do, since I only wanted to fix relative symbolic links. I've learned over the years that when you have root power, and you're mucking around with stuff and deleting and replacing a lot of files, it's safest to try to ignore everything that doesn't precisely fit your desired goal.

Line 28 uses abs_path to compute the resulting absolute path of the symbolic link target. Line 29 is left over from debugging, where I wanted to see if my calculations were correct for all of the existing links.
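The checks in lines 24 through 28 can be pulled together into one small routine. This is a sketch under assumptions, not a replacement for the listing: resolve_relative_link is a hypothetical name, and the tree is rebuilt under a scratch directory rather than the real document root.

```perl
use strict;
use warnings;
use Cwd qw(abs_path);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Resolve a relative symlink to the absolute path of its target.
# Returns undef for anything that is not a relative symlink.
sub resolve_relative_link {
    my ($dir, $symlink) = @_;
    my $path = readlink $symlink;
    return unless defined $path;      # not a symlink, or unreadable
    return if $path =~ m{^/};         # absolute target: leave it alone
    return abs_path("$dir/$path");    # collapses ".." components
}

# Demo tree mirroring the article's example layout.
my $root = tempdir(CLEANUP => 1);
make_path("$root/cruises/2003", "$root/past_cruises");
open my $fh, '>', "$root/past_cruises/perlwhirl3.html"
    or die "Cannot create page: $!";
close $fh or die "Cannot close page: $!";

my $dir  = "$root/cruises/2003";
my $link = "$dir/perlwhirl3.html";
symlink '../../past_cruises/perlwhirl3.html', $link
    or die "Cannot symlink $link: $!";

my $target = resolve_relative_link($dir, $link);
print "$link => $target\n";
```

Note that abs_path also resolves any symbolic links in the leading directories, so the result can differ from a purely textual cleanup of the path.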

Lines 30 and 31 strip off the document root path from the source of the symbolic link. I need to do this to ensure that my Redirect command is framed in terms of URLs and not UNIX pathnames. The \Q quotes any metacharacters in the pathname. Again, this is a safe thing to do, even though I know there are no metacharacters in the particular paths I've configured at the top. Always be very conservative with root power.

Line 32 takes the matched tail part of the symbolic link source and builds the source URL for the Redirect. I'm presuming the $1 here has been properly set from the previous match and there's no possible way I'm using the value from a stale match.

Lines 33 through 35 repeat the stripping and building for the destination path, although I now have to create a full scheme-based URL for the path. Without the http prefix, Apache would have treated this operation as an internal redirect, with all the same problems as a simple symlink, because no indication would be sent to the client that something had moved.
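The path-to-URL conversion in lines 30 through 35 is plain string work, so it can be shown without touching the filesystem. In this sketch url_path_for is a hypothetical helper; the root and site values are the ones from the top of Listing 1.

```perl
use strict;
use warnings;

# Turn a filesystem path under $root into a URL path, or return
# undef if the path does not live under $root.
sub url_path_for {
    my ($root, $file) = @_;
    # \Q...\E quotes regex metacharacters in $root;
    # /s lets "." match a newline, so nothing is silently cut off.
    $file =~ m{^\Q$root\E/(.*)}s or return;
    return "/$1";
}

my $root = "/data/web/geekcruises";
my $site = "http://www.geekcruises.com";

my $from = url_path_for($root, "$root/cruises/2003/perlwhirl3.html");
my $to   = url_path_for($root, "$root/past_cruises/perlwhirl3.html");

# Prints:
# Redirect /cruises/2003/perlwhirl3.html http://www.geekcruises.com/past_cruises/perlwhirl3.html
print "Redirect $from $site$to\n";
```

Because the helper returns before touching $1 when the match fails, there is no way to pick up a capture left over from an earlier match, which is the same guarantee the "or ... next" in the listing provides.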

Lines 36 to 42 create the NEW handle to which the new .htaccess file is written. This happens only once per directory because, after the first time, the @deletes array contains some previous entry. The existing .htaccess file (if any) is also copied to the beginning of the new .htaccess file. Note that we're using the OLD filehandle in a list context, so it gets slurped in as a list of lines, then immediately dumped to the new filehandle.
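The copy-then-append step can be sketched with lexical filehandles, which behave the same way in list context as the bareword handles in the listing. The seeded file contents and the example.com redirect are made up for the demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $old = "$dir/.htaccess";

# Seed an existing .htaccess so there is something to carry over.
open my $seed, '>', $old or die "Cannot create $old: $!";
print {$seed} "Options -Indexes\n";
close $seed or die "Cannot close $old: $!";

open my $new, '>', "$old.NEW" or die "Cannot create $old.NEW: $!";
if (open my $in, '<', $old) {
    # print supplies list context, so <$in> returns every line at once
    print {$new} <$in>;
    close $in;
}
print {$new} "Redirect /old.html http://www.example.com/new.html\n";
close $new or die "Cannot close $old.NEW: $!";
```

The braces around the output handle matter here: with a lexical handle, print $new <$in> is ambiguous to the parser, while print {$new} <$in> is not.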

Line 43 writes the proper Redirect command to the new .htaccess file. Line 44 marks the symbolic link as one to be deleted once the .htaccess file is in place. Lines 46 to 59 are executed once per directory but only if some symbolic link was found that was eligible for removal.

First, line 47 closes the output file handle to ensure that the data is completely flushed. Then, lines 48 through 51 set the ownership and permissions on the new file to the values defined by the configuration parameters at the top of the program. Lines 52 and 53 try to rename the existing .htaccess file out of the way to end in .OLD. Again, with Great Power comes Great Responsibility, including the mandate to have a Great Undo when one makes a Great Mistake. So, after each run of this program, I can verify that I have not completely mangled the .htaccess files, and then delete the .OLD files manually. And carefully.

Lines 54 and 55 move the newly created .htaccess file into place. Note that, because I've created the completed file in a separate location and then renamed it atomically into place, there's no chance that the live Apache process will read a partially written .htaccess file. This is a very important principle when dealing with live production activity. Finally, lines 56 to 58 delete the existing useless symbolic links one at a time. And that's all there is!
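The write-elsewhere-then-rename pattern from lines 47 through 55 is worth having as a reusable routine. What follows is a minimal sketch, assuming the temporary file lives in the same directory as the target; replace_with_backup is a hypothetical name, and the chown and chmod steps are left out so the demo runs without root.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Replace $file with $content, keeping any previous version as "$file.OLD".
sub replace_with_backup {
    my ($file, $content) = @_;
    my $tmp = "$file.NEW";    # same directory, so rename stays on one filesystem

    open my $out, '>', $tmp or die "Cannot create $tmp: $!";
    print {$out} $content;
    close $out or die "Cannot close $tmp: $!";    # flush before renaming

    if (-e $file) {
        rename $file, "$file.OLD" or die "Cannot mv $file $file.OLD: $!";
    }
    rename $tmp, $file or die "Cannot mv $tmp $file: $!";
    return;
}

# Demo: the second call leaves the first version behind as .OLD.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/.htaccess";
replace_with_backup($file, "first version\n");
replace_with_backup($file, "second version\n");
```

A reader of the file sees either the old contents or the new contents, never a partial write. There is still a brief moment between the two renames when no file exists under the final name, which is harmless for an .htaccess file but might matter elsewhere.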

Oddly enough, as I was writing this article, I discovered yet another symlink in the tree. So I ran the code once again and, sure enough, it got replaced with the right redirect. Good thing I've kept this code around. Looks like it's time to send Neil another refresher message, or maybe just disable symbolic links in his tree. Until next time, enjoy!

TPJ



Listing 1

=1=     #!/usr/bin/perl -w
=2=     use strict;
=3=     $|++;
=4=     
=5=     my $URL = "http://www.geekcruises.com";
=6=     my $DIR = "/data/web/geekcruises";
=7=     my $USER = 2100;
=8=     my $GROUP = 2100;
=9=     my $MODE = 0644;
=10=    
=11=    use Cwd qw(abs_path);
=12=    
=13=    use File::Finder;
=14=    my @dirs = File::Finder->type('d')->in($DIR);
=15=    
=16=    # print "$_\n" for @dirs;
=17=    for my $dir (@dirs) {
=18=      my @symlinks = grep -l, glob "$dir/*";
=19=      # print "$dir: @symlinks\n";
=20=      my @deletes;
=21=      my $htaccess = "$dir/.htaccess";
=22=    
=23=      for my $symlink (@symlinks) {
=24=        defined(my $path = readlink($symlink)) or
=25=          warn("Cannot read $symlink: $!"), next;
=26=        $path =~ m{^/} and
=27=          warn("skipping absolute $path for $symlink\n"), next;
=28=        my $abs_path = abs_path("$dir/$path");
=29=        # print "$symlink -> $path => $abs_path\n";
=30=        $symlink =~ m{^\Q$DIR\E/(.*)}s or
=31=          warn("$symlink doesn't begin with $DIR"), next;
=32=        my $original_url = "/$1";
=33=        $abs_path =~ m{^\Q$DIR\E/(.*)}s or
=34=          warn("$abs_path doesn't begin with $DIR"), next;
=35=        my $redirect_url = "$URL/$1";
=36=        unless (@deletes) {
=37=          ## print "in $dir...\n";
=38=          open NEW, ">$htaccess.NEW" or die "Cannot create $htaccess.NEW: $!";
=39=          if (open OLD, $htaccess) {
=40=            print NEW <OLD>;
=41=          }
=42=        }
=43=        print NEW "Redirect $original_url $redirect_url\n";
=44=        push @deletes, $symlink;
=45=      }
=46=      if (@deletes) {
=47=        close NEW;
=48=        chown $USER, $GROUP, "$htaccess.NEW"
=49=          or die "Cannot chown $htaccess.NEW: $!";
=50=        chmod $MODE, "$htaccess.NEW"
=51=          or die "Cannot chmod $htaccess.NEW: $!";
=52=        ! -e $htaccess or rename $htaccess, "$htaccess.OLD"
=53=          or die "Cannot mv $htaccess $htaccess.OLD: $!";
=54=        rename "$htaccess.NEW", $htaccess
=55=          or die "Cannot mv $htaccess.NEW $htaccess: $!";
=56=        for (@deletes) {
=57=          unlink $_ or warn "Cannot unlink $_: $!";
=58=        }
=59=      }
=60=    }