Dr. Dobb's is part of the Informa Tech Division of Informa PLC

Improving Template::Extract


brian has been a Perl user since 1994. He is founder of the first Perl Users Group, NY.pm, and Perl Mongers, the Perl advocacy organization. He has been teaching Perl through Stonehenge Consulting for the past five years, and has been a featured speaker at The Perl Conference, Perl University, YAPC, COMDEX, and Builder.com. Contact brian at [email protected].


Template::Extract looks very cool. It has three (out of three) 5-star ratings on CPAN Ratings, several effusive blog entries, and a hack in Spidering Hacks. However, it really isn't as cool as all of the hype. At least not yet. I ran into some problems with its parsing, but fixed it up so I can specify my own parser instead of its default one. It doesn't solve all of the module's problems, but it got me through today's project.

I started with something simple. I have colon-separated fields and values, with one pair per line. I just want to get those values back. The actual template I wanted to use was a bit fancier, but the other parts were working fine.

	#!/usr/bin/perl
	
	use Data::Dumper;
	use Template::Extract;
	
	my $tte = Template::Extract->new;

	my $template = <<'TEMPLATE';
	foo: [% foo %]
	bar: [% bar %]
	TEMPLATE
	
	my $document = <<"MAIL";
	foo: fred
	bar: barney
	MAIL
	
	my $data = $tte->extract( $template, $document );
	
	print Dumper( $data );

I think this is pretty simple, but it comes out wrong. The values have newlines after them even though the newlines come after the template directives. Somehow Template::Extract thinks they belong to the values rather than the template.

	$VAR1 = {
			  'bar' => 'barney
	',
			  'foo' => 'fred
	'
			};

If I change the template and document so that two template directives sit next to each other, the output is even further from what I want. Both values show up in one directive, leaving the other one empty.

	#!/usr/bin/perl
	
	use Data::Dumper;
	use Template::Extract;
	
	my $tte = Template::Extract->new;

	my $template = <<'TEMPLATE';
	foo: [% foo %]
	bar: [% bar %] [% baz %]
	TEMPLATE
	
	my $document = <<"MAIL";
	foo: fred
	bar: barney rubble
	MAIL
	
	my $data = $tte->extract( $template, $document );
	
	print Dumper( $data );

I expected that in the "bar" line, Template::Extract would come up with some regex that said something like "b a r : space something space something newline". Instead, it gave me "b a r : space something".

	
	$VAR1 = {
			  'bar' => '',
			  'baz' => 'barney rubble
	',
			  'foo' => 'fred
	'
			};

This might not seem strange at first, and I was a bit forgiving too, but why should the process ignore literal characters in the template? The newline, and now the space, from the template show up in the extracted values instead of being treated as the literal, hard-coded text they are. Notice, however, that it correctly leaves out the whitespace that comes after the colon.

People like this module because they're using it in templates that have a lot of other text, and some template directives. That is, there are a lot of non-whitespace hints that Template::Extract can use in those cases. Not all templates are so simple though, or have extra hints to help the parser.

If I change my template to make it look like a bit of HTML (invalid as it may be), Template::Extract does a fine job:

	#!/usr/bin/perl
	
	use Data::Dumper;
	use Template::Extract;
	
	my $tte = Template::Extract->new;                                     
			   
	my $template = <<'TEMPLATE';
	foo: [% foo %]<br/>
	bar: [% bar %] [% baz %]<br/>
	TEMPLATE
	
	my $document = <<"MAIL";
	foo: fred<br/>
	bar: barney rubble<br/>
	MAIL
	
	my $data = $tte->extract( $template, $document );
	
	print Dumper( $data );

It correctly picks out all the values because it doesn't have a chance to react to whitespace. It uses other hints for where things end or begin.

	$VAR1 = {
			  'bar' => 'barney',
			  'baz' => 'rubble',
			  'foo' => 'fred'
			};

I investigated this a bit more because I really want to use this module. I need to look at the template and use more hints than it is using. First, I just want to solve the newline problem. I want to see the regexen that the module uses. If I can figure out how to change the regex, maybe I can work backwards to a code fix. I get the regexen as the return value of compile(), and then use Data::Dumper to inspect them.

	#!/usr/bin/perl
	
	use Data::Dumper;
	use Template::Extract;
		
	my $tte = Template::Extract->new;                                     
			   
	my $template = <<'TEMPLATE';
	foo: [% foo %]
	bar: [% bar %] [% baz %]
	TEMPLATE
	
	my $regexen = $tte->compile( $template );
	
	print Dumper( $regexen );

I can now see the regular expressions that Template::Extract will use. The (?{}) part of the regular expression allows perl to run a bit of code in that spot. That code is the _ext() function from Template::Extract::Run. Other than that, it's not that scary. Notice that it looks for "f o o : space stuff" followed immediately by "b a r : space stuff" and so on. I don't see my literal newline in there.

	$VAR1 = 'foo\\:\\ (.*?)(?{
		_ext(([\'foo\'], $1, 1))
	})bar\\:\\ (.*?)(?{
		_ext(([\'bar\'], $2, 2))
	})(.*)(?{
		_ext(([\'baz\'], $3, 3))
	})';
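To see why 'bar' comes back empty, it can help to run a regex of the same shape in plain Perl. What follows is only a sketch: `%got` and the inline assignments are hypothetical stand-ins for whatever `_ext()` really records, and the `/s` flag is an assumption, inferred from the fact that the extracted values contain newlines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $document = "foo: fred\nbar: barney rubble\n";

# %got is a hypothetical stand-in for what _ext() would record.
# A package variable keeps the (?{ }) code blocks simple.
our %got;

# Same shape as the dumped regex: no literal newline after the foo
# value, and no literal space between the bar and baz captures.
# /s is assumed so that . can also match a newline.
$document =~ m/foo\:\ (.*?)(?{ $got{foo} = $1 })bar\:\ (.*?)(?{ $got{bar} = $2 })(.*)(?{ $got{baz} = $3 })/s;

# Brackets make empty values and trailing newlines visible.
printf "%s => [%s]\n", $_, $got{$_} for sort keys %got;
```

With nothing literal between the last two captures, the non-greedy `(.*?)` is satisfied with an empty match and the greedy `(.*)` takes everything that is left, which is the same pattern of results as the extraction above.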

That's about all I can do at the user level, so I need to look under the hood in Template::Extract::Compile. (Note, an easy way to look at the source of a module is to find the file with `perldoc -l`, then load that path into your favorite editor.) It's pretty easy to find the culprit. The module tells Template::Parser to pre- and post-chomp the directives so the whitespace around the directives disappears. It's not Template::Extract that's doing it, at least not directly.

	my $parser = Template::Parser->new(
		{
			PRE_CHOMP  => 1,
			POST_CHOMP => 1,
		}
	);	

What happens if I modify that so I have my own Template::Parser setup? Just to try it, I change the values so it doesn't chomp anything. I simply change the 1s to 0s.

	my $parser = Template::Parser->new(
		{
			PRE_CHOMP  => 0,
			POST_CHOMP => 0,
		}
	);

Now the regexen from my last script show the whitespace and thus can use the literal whitespace as hints to decode the cooked template. On the line above "bar", there really is a newline at the end, and that's what the \\ escapes (the backslash shows up twice because the regex is dumped as a string, and one backslash needs to survive and stay in the regex).

	$VAR1 = 'foo\\:\\ (.*?)(?{
		_ext(([\'foo\'], $1, 1))
	})\\
	bar\\:\\ (.*?)(?{
		_ext(([\'bar\'], $2, 2))
	})\\ (.*)(?{
		_ext(([\'baz\'], $3, 3))
	})';

That still doesn't solve my newline problem. With this change, I still get a newline after 'rubble'. Although there is a literal newline after the 'foo' line in the regex, there isn't one at the end of the 'bar' line.

	$VAR1 = {
			  'bar' => 'barney',
			  'baz' => 'rubble
	
	',
			  'foo' => 'fred'
			};

That little bit isn't from Template::Parser though. After Template::Extract::Compile::compile() sets up the parser, it modifies the template a bit before it passes it on to the parser. It takes off all of the newlines at the end of the string.

	$template =~ s/\n+$//;
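A quick way to check what that substitution touches is to run it on a template-like string in isolation. This is a minimal sketch: the variable names are mine, the extra trailing newline is added for illustration, and the second substitution is a variant that keeps one newline, included only for comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A template-like string with an interior newline and two trailing ones.
my $template = "foo: [% foo %]\nbar: [% bar %] [% baz %]\n\n";

# The substitution from the module: drop every newline at the end of the string.
( my $stripped = $template ) =~ s/\n+$//;

# A variant for comparison: collapse the trailing run to a single newline.
( my $kept_one = $template ) =~ s/\n+$/\n/;

# tr/// with an empty replacement list counts without modifying the string.
printf "original: %d newlines\n", ( $template =~ tr/\n// );
printf "stripped: %d newlines\n", ( $stripped =~ tr/\n// );
printf "kept one: %d newlines\n", ( $kept_one =~ tr/\n// );
```

The newline between the two template lines survives both substitutions; only the run at the very end of the string changes.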

As a side note, I thought this was the culprit of the first problem until I realized it doesn't use the /m modifier at the end (meaning it doesn't remove newlines from the end of 'lines', just the end of the string). I comment out that line and everything works how I want it. The values don't have trailing newlines.

	$VAR1 = {
			  'bar' => 'barney',
			  'baz' => 'rubble',
			  'foo' => 'fred'
			};
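The effect of that trailing newline can be reproduced with two hand-written regexes. This is a sketch under assumptions: the patterns only approximate the compiled regex with literal whitespace shown earlier (the code blocks are left out), `/s` is assumed, and `%baz_for` is a name I made up for the results.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $document = "foo: fred\nbar: barney rubble\n";

# Approximation of the regex when the template's final newline is stripped.
my $stripped = qr/foo\:\ (.*?)\nbar\:\ (.*?)\ (.*)/s;

# The same regex when the template keeps its final newline.
my $kept = qr/foo\:\ (.*?)\nbar\:\ (.*?)\ (.*)\n/s;

# %baz_for is a hypothetical holder for the last captured value.
my %baz_for;
for my $case ( [ stripped => $stripped ], [ kept => $kept ] ) {
    my ( $label, $regex ) = @$case;
    my ( $foo, $bar, $baz ) = $document =~ $regex;
    $baz_for{$label} = $baz;
    printf "%-8s baz ends with a newline: %s\n", $label,
        ( $baz =~ /\n\z/ ? 'yes' : 'no' );
}
```

When the regex ends in a literal newline, the greedy `(.*)` has to give one character back, so the last value comes out clean.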

Now that I know how I want it to work, like a good open source programmer I need to make a patch that makes my knowledge useful to the rest of the world. Just like I don't want to use Template::Extract's idea of what the parser should do, other people probably don't want my idea. I can change the compile() function to take a parser as an optional second argument. If I don't supply that second argument, or the second argument isn't a Template::Parser (or one of its subclasses), it makes its parser like it did before. And, instead of trimming all the newlines at the end of the file, I keep one.

	sub compile {
		my ( $self, $template, $parser ) = @_;
	
		$parser = undef unless UNIVERSAL::isa( $parser, 'Template::Parser' );
		
		$self->_init();
	
		if ( defined $template ) {
			$parser ||= Template::Parser->new(
				{
					PRE_CHOMP  => 1,
					POST_CHOMP => 1,
				}
			);
	
			$parser->{FACTORY} = ref($self);
			$template = $$template if UNIVERSAL::isa( $template, 'SCALAR' );
			$template =~ s/\n+$/\n/;
			$template =~ s/\[%\s*(?:\.\.\.|_|__)\s*%\]/[% \/.*?\/ %]/g;
			$template =~ s/\[%\s*(\/.*?\/)\s*%\]/'[% "' . quotemeta($1) . '" %]'/eg;
	
			return $parser->parse($template)->{BLOCK};
		}
		return undef;
	}
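The validate-then-fall-back step at the top of compile() can be tried on its own. The sketch below does not use Template Toolkit at all: `My::Parser`, `My::SubParser`, `My::Other`, and `pick_parser()` are hypothetical stand-ins, there only so the check can run anywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical classes standing in for Template::Parser, a subclass of
# it, and some unrelated class.
package My::Parser    { sub new { bless {}, shift } }
package My::SubParser { our @ISA = ('My::Parser') }
package My::Other     { sub new { bless {}, shift } }

package main;

# Same idea as the patched compile(): discard anything that is not the
# expected kind of parser, then fall back to a default one.
sub pick_parser {
    my ($candidate) = @_;
    $candidate = undef unless UNIVERSAL::isa( $candidate, 'My::Parser' );
    return $candidate || My::Parser->new;
}

for my $try ( undef, 'junk', My::Other->new, My::SubParser->new ) {
    printf "%-14s -> %s\n",
        ( defined $try ? ( ref($try) || $try ) : 'undef' ),
        ref pick_parser($try);
}
```

A subclass instance passes through untouched, while undef, a plain string, and an unrelated object all end up with the default parser.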

This isn't the only subroutine I have to change, since most people won't call compile() directly. Most people will use Template::Extract::extract(), which then calls compile() for them. That function needs to know about the extra argument too. The extract() method takes its arguments and rearranges them for its call to run().

	sub extract {
		my( $self, $template, $document, $values, $parser ) = @_;
		
		$self->run( $self->compile($template, $parser), $document, $values );
	}

Now everything works for me. All I have to do is send in the patch.

TPJ

