Dr. Dobb's is part of the Informa Tech Division of Informa PLC

Ten Things You (Probably) Didn't Know About Perl


Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at [email protected].


In my last article, I wrote about how important it was to keep on learning new technologies and broadening your horizons; this time, we're going to look at some things you might have missed inside Perl itself. I've just finished writing the new edition of Advanced Perl Programming, which covers a lot of the more practical things about writing Perl applications - useful CPAN modules and techniques. But what about the more hidden and less obviously practical things? Here is a collection of ten facts to intrigue, delight and inspire...

You can write scripts in UTF-8

Someone mentioned to me the other day that Perl 6 was the only language they knew of which allowed Unicode identifiers. Funny, I thought - I'm sure you can do that in Perl 5 as well. Well, you can, so long as you remember to turn on the "utf8" pragma:

        use utf8;
        my $ = Dog->new();

Similarly, subroutines and comments can be in your native language and script. If you want to go further and start translating the actual language constructs ("if", "for", etc.) into your native language, then look at things like the "JCode" module on CPAN, and be aware that you're creating a maintenance nightmare for yourself.
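As a minimal sketch of the idea, here is a complete script with a non-ASCII variable name and subroutine name. The German identifiers are my own illustrative choices, not anything Perl provides, and the file must be saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;    # tells Perl that this source file is encoded as UTF-8

# Hypothetical names, chosen only to show non-ASCII identifiers.
my $größe = 21;

sub verdopple_größe {
    my ($wert) = @_;
    return 2 * $wert;
}

my $doubled = verdopple_größe($größe);

# Encode output so the non-ASCII text in the string prints cleanly.
binmode STDOUT, ':encoding(UTF-8)';
print "Größe verdoppelt: $doubled\n";
```

Without the "use utf8" line, Perl treats the bytes of the source as Latin-1 and refuses the identifiers.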

There's a new way of trapping signals

I did not even know about this one until I was wandering around the Perl source trying to find interesting tidbits for this article (which may reflect more on my ability to read the manual than anything else...), but it looks like a very useful tip. Normally, I'd trap signals in Perl with something like this:

        $SIG{ALRM} = \&handler;

However, there's a pragma in the Perl core called "sigtrap" which makes it easier to handle signals. With "sigtrap" you'd write the above statement as:

        use sigtrap handler => \&handler, "ALRM";

Why is this any better? Well, with "sigtrap," you can set special built-in handlers, either to die or to give a stack trace, and apply them to built-in sets of signals. For instance, to give a stack trace for any of the more "serious" error signals (ABRT, BUS, EMT, FPE, ILL, QUIT, SEGV, SYS and TRAP) that you haven't specified another handler for, you would say:

        use sigtrap 'stack-trace', 'untrapped', 'error-signals';

This makes it much easier to work out what your signal handling code is actually trying to do.
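To see the handler form in action, here is a small sketch, assuming a Unix-like system where a process can signal itself. The handler name on_signal and the @caught array are made up for the example:

```perl
use strict;
use warnings;

my @caught;    # names of the signals seen so far

# Hypothetical handler; Perl passes the signal name as the first argument.
sub on_signal { push @caught, $_[0] }

# Install the handler for ALRM only, much like $SIG{ALRM} = \&on_signal.
use sigtrap 'handler' => \&on_signal, 'ALRM';

kill 'ALRM', $$;    # send the signal to ourselves
print "caught: @caught\n";
```

Swapping 'ALRM' for a named set such as 'normal-signals' installs the same handler for the whole set in one line.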

Appending from a file to a scalar is special-cased

Try these two little programs:

        perl -we '$_ = <A>'
        perl -we '$_.= <A>'

Both will complain about "A" being used only once, and both will complain about reading an unopened filehandle, but look at how they do so:

        readline() on unopened filehandle A at -e line 1.
        append I/O operator() on unopened filehandle A at -e line 1.

"Append I/O operator()"? What's happening is that appending from a filehandle to the end of a scalar $x .= <..> is optimized by Perl into a special operation in its own right.

In fact, it turns out that readline, or the "<FH>" operator, is really three different operators, as you can see from running "perl -MO=Terse,-exec" on the following three lines:

        print <A>;
        $_ = <A>;
        $_ .= <A>;

The first one creates two operations, as you would expect: one reads the line, and the other prints it. The second line does something I didn't expect until I tried it - I would have expected one op to fetch $_, one to fetch "A", one "readline" and one to assign them, but there's no "sassign" op at the end. This is because "= <..>" is a special case of the readline op which knows to store its return value not onto the stack but into the scalar variable referenced by the next value on the stack, eliminating an assign op. And, as we've already seen, ".= <..>" is a special "rcatline" operation that reads and concatenates at the same time. Useful for micro-optimizers and internals hackers!
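If you'd rather check this from inside a program than from the command line, the following sketch walks the ops of three anonymous subroutines with the core B module. The helper op_names is my own invention, and I've used STDIN instead of "A" only to avoid the "used only once" warning; the exact op lists can vary between Perl versions:

```perl
use strict;
use warnings;
use B ();

# Return the op names of a code reference in execution order,
# roughly the names that "perl -MO=Terse,-exec" would list.
sub op_names {
    my ($code) = @_;
    my @names;
    for (my $op = B::svref_2object($code)->START; $$op; $op = $op->next) {
        push @names, $op->name;
    }
    return @names;
}

# The subs are only compiled, never called, so nothing is read from STDIN.
my @print_ops  = op_names(sub { print <STDIN> });
my @assign_ops = op_names(sub { $_ = <STDIN> });
my @append_ops = op_names(sub { $_ .= <STDIN> });

print "print:  @print_ops\n";
print "assign: @assign_ops\n";
print "append: @append_ops\n";
```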

Subclassing Exporter

The usual way to write modules which provide subroutines is to optionally export them into the caller's package. So, for instance:

        package My::Application;
        use URI::Escape qw(uri_escape);
        print uri_escape("Hello there");

Here URI::Escape has provided the uri_escape subroutine, which our package has imported and used. It does this by subclassing the standard Exporter module, like so:

        package URI::Escape;
        use vars qw(@ISA @EXPORT @EXPORT_OK $VERSION);
        use vars qw(%escapes);

        require Exporter;
        @ISA = qw(Exporter);
        @EXPORT = qw(uri_escape uri_unescape);
        @EXPORT_OK = qw(%escapes uri_escape_utf8);

The heavy work of moving subroutines from the URI::Escape package into the My::Application package is done by Exporter's import method, and use Foo actually calls require Foo; Foo->import so the uri_escape subroutine comes our way rather behind the scenes.

Now, suppose you're writing your own module which exports subroutines this way, but you also want to run some additional set-up code when the module is loaded. The import routine is a sensible place to do this set-up, so you say something like this:

        sub import {
            my ($self, @stuff) = @_;
            $self->setup();
            $self->SUPER::import(@stuff);
        }

Unfortunately, this doesn't work. Exporter's import works by looking at caller to determine who's calling it; since the caller is now your own import routine, it ends up trying to export your module's symbols into your module's own package, instead of into the client code. The way around this is to use the export_to_level routine instead, which gives you a parameter to control how far back up the stack to perform the exporting:

        sub import {
            my ($self, @stuff) = @_;
            $self->setup();
            $self->export_to_level(1, $self, @stuff);
        }

The first parameter, 1, says that we're going back one level from the point of view of Exporter; the next is the package name, and then this is followed by the import tags.
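Here is a self-contained sketch of the working version. The module name My::Greeter, its greet subroutine and the $setup_done flag are hypothetical; because everything lives in one file, the script calls import by hand, which is what "use My::Greeter qw(greet)" would do after loading a real module file:

```perl
use strict;
use warnings;

package My::Greeter;    # hypothetical module, defined inline for the demo
use Exporter ();
our @ISA       = ('Exporter');
our @EXPORT_OK = ('greet');
our $setup_done = 0;

sub setup { $setup_done = 1 }         # stands in for the real set-up work
sub greet { return "Hello, $_[0]" }

sub import {
    my ($self, @stuff) = @_;
    $self->setup();
    # Level 1 is the package that called this import routine.
    $self->export_to_level(1, $self, @stuff);
}

package main;

My::Greeter->import('greet');
print greet("world"), "\n";
```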

You can lock up hashes

Anyone who's done any development of "perl" itself over the past five years or so will cringe when I mention the word "pseudohashes". They were a great idea, to be honest, but the implementation was a little unfortunate.

The idea of a pseudohash goes like this. When you're using a hash as an object, you'll generally only have a fixed set of keys, known in advance, that you're interested in using:

        package Person;
        sub new {
            my $class = shift;
            bless {
                name => ...,
                address => ....,
                job => ...,
                date_of_birth => ....
            }, $class
        }

Now if at some point I say

        $person->{dateofbirth}

then this is going to cause a bug. Perl should "know" that I meant to say "date_of_birth" and complain. Additionally, if we have a fixed set of keys, we can essentially turn these keys into constant indexes, and use an array instead of a hash, making access faster. So a pseudohash was something that behaved like an array and looked like a hash, and was implemented like the evil hybrid monster that this implies.

As a compromise, the functions in Hash::Util were implemented. These take an ordinary hash and lock it down in various ways: they can prevent you from adding new keys, or from changing certain values, and so on.

        use Hash::Util qw(lock_keys);

        sub new {
            my $class = shift;
            my $self = bless {
                name => ...,
                address => ....,
                job => ...,
                date_of_birth => ....
            }, $class;
            lock_keys(%$self);
            return $self;
        }

Now if at some point I say $self->{dateofbirth} = "1978-05-29", Perl will die because I tried to add a new key to a locked hash.

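It's important to note that exists and delete still work on a locked hash, but in released versions of Perl a plain read of a key outside the allowed set is caught too: $person->{dateofbirth} dies with an "Attempt to access disallowed key" error rather than slipping by unnoticed. The check happens at run time, though, so Hash::Util isn't a complete replacement for pseudohashes; try looking at Dave Cross's Tie::Hash::FixedKeys for a slower, tie-based implementation of this idea.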
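The following sketch shows the locked hash rejecting a misspelled key while still allowing updates to the existing ones. The data and the misspelling are invented for the example:

```perl
use strict;
use warnings;
use Hash::Util qw(lock_keys);

# Illustrative data; the set of keys is fixed once lock_keys runs.
my %person = (name => 'Ada', date_of_birth => '1815-12-10');
lock_keys(%person);

# Changing the value of an existing key is still allowed.
my $update_ok = eval { $person{name} = 'Grace'; 1 } ? 1 : 0;

# Storing under a misspelled key dies; capture the error text.
my $store_error = eval { $person{dateofbirth} = '1906-12-09'; 1 } ? '' : $@;

print "update ok: $update_ok\n";
print "store error: $store_error";
```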

DBM Filters

DBM files are essentially a hash on disk - they're random-access files which allow you to associate a key with a value. My irrational favourite is the Berkeley DB, DB_File:

        tie %hash, "DB_File", "test.db";
        $hash{"Larry Wall"} = "555-112-3581";

The only slight problem with DBMs is that they're generally implemented by an external C library which knows nothing about Perl, so you can't store complex Perl data structures as DBM values. The usual way around this is the "MLDBM" module, which sits in front of the DBM, marshalling the data that gets stored and retrieved. If it comes across an attempt to store a reference, it will use either "Storable" or "Data::Dumper" to serialize that reference into a string; similarly, if you're retrieving a string like that, it'll perform the appropriate inverse process ("Storable" again, or "eval") to turn the string back into a reference.

        use MLDBM qw(DB_File);
        tie %hash, "MLDBM", "test.db";
        $hash{"Larry Wall"} = Person->new(...);

"MLDBM" works the slow, stupid way; it implements the "tie" interface itself, does the serializing and deserializing, and then passes on the request to another, underlying tied hash:

        sub FETCH {
            my ($s, $k) = @_;
            my $ret = $s->{DB}->FETCH($k);
            $s->{SR}->deserialize($ret);
        }

        sub STORE {
            my ($s, $k, $v) = @_;
            $v = $s->{SR}->serialize($v);
            $s->{DB}->STORE($k, $v);
        }

        sub DELETE  { my $s = shift; $s->{DB}->DELETE(@_); }
        sub FIRSTKEY    { my $s = shift; $s->{DB}->FIRSTKEY(@_); }
        sub NEXTKEY { my $s = shift; $s->{DB}->NEXTKEY(@_); }
        ...

There's actually a better way to do things nowadays, which hopefully "MLDBM" will move to behind the scenes. You can now add filters onto a DBM, so that when something is stored or retrieved, a subroutine of your choice gets called. So we can implement the same functionality as "MLDBM" with just a few lines of code:

        my $db = tie %hash, "DB_File", "test.db";

        use Storable qw(freeze thaw);
        $db->filter_store_value(sub { $_ = freeze($_) });
        $db->filter_fetch_value(sub { $_ = thaw($_)   });

        $hash{"Larry Wall"} = Person->new(...);

When we store the "Person" object, it goes through the filter we registered with filter_store_value, and the value is transformed via the freeze subroutine we got from Storable - this turns it into a scalar value suitable for storing in the DBM. Similarly, when we retrieve it again, it goes through Storable::thaw which turns it back into a reference.

For more about what you can do with DBM filters, see "perldoc perldbmfilter".
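Here is a runnable sketch of the same filter pair. I've swapped in SDBM_File for DB_File purely because it ships with every Perl, and a plain hash reference stands in for the Person object; the temporary directory and file name are arbitrary:

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;    # assumption: used here instead of DB_File, as it is in core
use Storable qw(freeze thaw);

my $dir = tempdir(CLEANUP => 1);
my $db  = tie(my %hash, 'SDBM_File', "$dir/test", O_RDWR | O_CREAT, 0666)
    or die "Cannot tie DBM file: $!";

# Serialize references on the way in, rebuild them on the way out.
$db->filter_store_value(sub { $_ = freeze($_) });
$db->filter_fetch_value(sub { $_ = thaw($_) });

$hash{"Larry Wall"} = { phone => "555-112-3581" };
my $entry = $hash{"Larry Wall"};
print "phone: $entry->{phone}\n";

undef $db;      # drop our reference before untying
untie %hash;
```

Note that these filters expect every value to be a reference; storing a plain string would make freeze die.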

File handles with a _< in them?

Here's something that I was asked about the other day: why does my program contain globs which start "_<" followed by a filename? To see this for yourself, run this code:

        perl -le 'print for keys %main::'  

You'll see, amongst the rest of the keys in the symbol table, "_<universal.c". To make things more interesting, run the same code in the debugger:

        perl -d -le 'print for keys %main::'

This time you'll see a few more: I got

        _</usr/share/perl/5.8/Term/ReadLine.pm
        _</usr/share/perl/5.8/Carp/Heavy.pm
        _</usr/share/perl/5.8/strict.pm
        _</usr/share/perl/5.8/AutoLoader.pm

amongst others. Where do these come from? Well, there are actually two kinds of globs named like this. The first, usually non-Perl files like the "universal.c" that we saw earlier, are used as the globs attached to the XS subroutines that they contain. The second kind are provided by the debugger whenever a program file is loaded. From the "perldebguts" documentation:

  • Each array @{"_<$filename"} holds the lines of $filename for a file compiled by Perl... Values in this array are magical in numeric context: they compare equal to zero only if the line is not breakable.
  • Each hash %{"_<$filename"} contains breakpoints and actions keyed by line number.
  • Each scalar ${"_<$filename"} contains _<$filename.

You can find the name of a subroutine with Devel::Peek

This is another one which came up on IRC the other day. You have a subroutine reference, and you want to know what it's called, either so that you can report it in your debugging output, or so that you can do some dirty tricks. How do you know where the subroutine reference came from?

Let's say we want to make some method "private", in the sense that it can only be called by the class which created it. Here's as much of the "make_private" routine as we can do:

        package UNIVERSAL;
        sub make_private {
            my ($class, $method) = @_;
            # Find out where the method was actually defined
            my $orig = $class->can($method) || return;
            my $subname = some_magic($orig);
            *{$subname} = sub {
                my $class = shift;
                die "$method is private" unless $class eq caller;
                $class->$orig(@_);
            }
        }

In this routine, $orig is a code reference. Normally, the fully-qualified subroutine name for this method would be $class."::".$method, but since inheritance is in play, the method might actually come from somewhere else; that's why we use can() to find out where it came from.

Of course now we need to go back from the code reference to the subroutine name, so we can write our "guard" subroutine into the appropriate glob. The guard subroutine, the inner one in our code sample, checks that we're calling this from inside the appropriate class, and then dispatches to the original subroutine.

The key thing we're missing is getting the subroutine name from the code reference, and the answer to this is Devel::Peek::CvGV. Devel::Peek is better known for dumping out the internal details of Perl variables, but in this case it comes to the rescue by looking at the code reference's glob pointer, which tells us where in the symbol table it lives.
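As a sketch of how the missing piece might look, the helper below turns a code reference into a fully-qualified name. The name sub_name and the Animal and Dog classes are made up for illustration; only CvGV itself comes from Devel::Peek:

```perl
use strict;
use warnings;
use Devel::Peek qw(CvGV);

# Hypothetical helper: map a code reference back to "Package::name"
# by asking the glob it lives in for its package and name.
sub sub_name {
    my ($code) = @_;
    my $glob = CvGV($code) or return;
    return join '::', *{$glob}{PACKAGE}, *{$glob}{NAME};
}

{ package Animal; sub speak { return "..." } }
{ package Dog;    our @ISA = ('Animal'); }

# Dog inherits speak, so can() hands back Animal's subroutine.
my $found = sub_name(Dog->can('speak'));
print "$found\n";
```

In make_private, a helper like this could stand in for some_magic, though installing the guard with *{$subname} then needs "no strict 'refs'" in that scope.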

A "debugger" doesn't have to debug

You may well be familiar with the Perl debugger, invoked via "perl -d". As it happens, that's not "the" Perl debugger; it's just "a" Perl debugger, albeit the standard one which comes with Perl. A debugger is just something that sits in the "DB" package and implements a few subroutines in there. The DB::DB subroutine, for instance, is called by Perl for every statement in your program. DB::sub gets called for every subroutine run. "perldebguts" describes the variables available in these and other DB:: subroutines. Modules like Devel::Trace demonstrate how to write your own debugger:

        # This is the important part.  The rest is just fluff.
        sub DB::DB {
          return unless $TRACE;
          my ($p, $f, $l) = caller;
          my $code = \@{"::_<$f"};
          print STDERR ">> $f:$l: $code->[$l]";
        }

Notice that this uses the special glob for a Perl code file we discovered earlier, in order to extract the line of code currently being run. For a debugger to run neatly, it should be named "Devel::...". This is because Perl turns "perl -d:Foo" into the equivalent of "use Devel::Foo;".

Devel::DProf, Devel::Coverage, and other modules in the Devel:: namespace show what can be done with customized debuggers.

The Internals package

Our final tip is the somewhat underdocumented "Internals" package. Like UNIVERSAL, this is a built-in package provided by the Perl core, which contains a few handy functions. The first is SvREADONLY. This is what is actually used to lock hashes, as mentioned above. It takes any Perl SV container - a scalar, an array or hash, or an array or hash reference - and gets or sets the read-only flag on that container:

         % perl -le '@a=(0..10); Internals::SvREADONLY($a[5], 1); $a[5]++'
         Modification of a read-only value attempted at -e line 1.

Another function gets and sets the reference count of an SV:

        my $count = Internals::SvREFCNT($obj);
        Internals::SvREFCNT($obj, $count+1); # Make immortal

Still other functions can get and set the internal seed of the hashing algorithm for a hash, or fiddle with the placeholders set up when a locked hash is used.
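To round off the SvREADONLY example, this sketch sets the flag, reads it back, traps the resulting error and then clears the flag again. The variable names are arbitrary:

```perl
use strict;
use warnings;

my @a = (0 .. 10);

Internals::SvREADONLY($a[5], 1);                  # set the read-only flag
my $is_locked = Internals::SvREADONLY($a[5]);     # read the flag back

# Modifying the element now dies; capture the message.
my $error = eval { $a[5]++; 1 } ? '' : $@;

Internals::SvREADONLY($a[5], 0);                  # clear the flag again
$a[5]++;                                          # succeeds this time

print "locked: ", ($is_locked ? "yes" : "no"), "\n";
print "error: $error";
print "final value: $a[5]\n";
```

As the package name suggests, these functions are not a stable public interface, so treat this as exploration rather than something to build on.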

Conclusion

There are still things I'm discovering even about the internals of Perl itself, and many techniques which are only now being exploited to give interesting results. Source filters sat in the Perl core for a few years before people realised what sort of things could be done with them. We've taken a look at ten of the lesser known corners of Perl... it's up to you to do interesting things with them!

TPJ

