Date Created: 2024-11-13
By: 16BitMiker
Parsing strings is a fundamental task in many programming languages, and Perl has long been a powerhouse in this domain thanks to its rich regular expression capabilities. But every now and then, it’s worth revisiting how we structure our tools. Inspired by Ruby’s StringScanner, this post walks through a Perl implementation that leverages closures to build a stateful string parser.
Let’s explore the mechanics of our custom strscan package and how it offers a clean, functional interface for scanning and matching patterns in strings.
Ruby’s StringScanner provides a way to step through a string, matching patterns and maintaining your position as you go. Our Perl version aims to do the same:
Track position within a string
Match regex patterns from the current position
Advance position after each match
Provide match context info (start, length, value)
Here’s the core of the strscan package:
#!/usr/bin/env perl
use strict;
use warnings;

package strscan;

sub create {
    my $string = qq|@_|;          # Combine input args into a single string
    @_ = ();                      # Clear @_ to free memory
    my $pos = 0;                  # Start position in the string
    my $eos = length($string);    # End-of-string index

    return {
        # Closure that returns the current position
        pos => sub { return $pos },

        # Closure that allows updating the current position
        mod_pos => sub { $pos = shift },

        # Closure that returns 1 while input remains, 0 at end of string
        eos_check => sub { return $eos == $pos ? 0 : 1; },

        # Closure that attempts to match a regex from the current position
        find => sub {
            my $regex = shift;

            # Match regex against the substring starting at the current position
            if (my ($found) = substr($string, $pos) =~ m~($regex)~) {
                # $-[0] is the character offset of the match within that substring
                my ($start, $length) = ($-[0], length($found));

                # Advance position by offset + length of match
                $pos += $start + $length;

                # Return detailed match information
                return {
                    pos    => $pos,
                    start  => $start,
                    length => $length,
                    match  => $found,
                };
            } else {
                # If no match, jump to end of string
                $pos = $eos;
                return undef;
            }
        }
    };
}

1;
__END__
The create function returns a hashref containing four closures. Each closure has access to the lexical $pos, $eos, and $string variables via Perl’s closure mechanism. This encapsulation ensures state is preserved across calls without exposing internal variables globally.
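To see that each set of closures really carries its own private state, here is a minimal sketch that is separate from strscan. The make_counter helper and its inc and value keys are hypothetical names chosen for illustration; the point is only that two calls produce two independent lexicals.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);

# make_counter is a hypothetical helper, not part of strscan.
# Each call creates a fresh lexical $count that only the
# returned closures can reach.
sub make_counter {
    my $count = @_ ? $_[0] : 0;
    return {
        inc   => sub { return ++$count },   # increment, then return
        value => sub { return $count },     # read without changing
    };
}

my $first  = make_counter();
my $second = make_counter(10);

$first->{inc}() for 1 .. 3;    # advance only the first counter
$second->{inc}();              # advance the second one once

say "first:  ", $first->{value}();    # 3
say "second: ", $second->{value}();   # 11
```

The scanner works the same way: every call to create gets its own $string, $pos, and $eos, so several scanners can run side by side without interfering.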
Let’s break down the closures:
pos: Returns the current scanner position.
mod_pos: Allows external code to set the scanner position manually.
eos_check: Returns true (1) if not at the end of the string; false (0) otherwise.
find: Accepts a regex, tries to match it from the current position, and returns detailed match info if successful. On failure it moves the position to the end of the string and returns undef.
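The sketch below exercises all four closures together, including a manual rewind through mod_pos. To keep it runnable on its own it uses a condensed copy of the scanner under the stand-in name make_scanner, and the input string is invented for the example. Note that start is measured from where the search began, not from the beginning of the whole string.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);

# Condensed copy of the scanner, kept in one file so the sketch runs
# by itself. make_scanner is a stand-in name used for illustration.
sub make_scanner {
    my $string = "@_";
    my ($pos, $eos) = (0, length $string);
    return {
        pos       => sub { $pos },
        mod_pos   => sub { $pos = shift },
        eos_check => sub { $eos == $pos ? 0 : 1 },
        find      => sub {
            my $regex = shift;
            if (my ($found) = substr($string, $pos) =~ m~($regex)~) {
                my ($start, $length) = ($-[0], length $found);
                $pos += $start + $length;
                return { pos => $pos, start => $start,
                         length => $length, match => $found };
            }
            $pos = $eos;      # no match: jump to end of string
            return undef;
        },
    };
}

my $scan = make_scanner('id=42; name=perl');

my $num = $scan->{find}('\d+');        # first run of digits
say "match:  $num->{match}";           # 42
say "start:  $num->{start}";           # 3, relative to the search start
say "length: $num->{length}";          # 2
say "pos:    ", $scan->{pos}();        # 5

$scan->{mod_pos}(0);                   # rewind to the beginning
my $word = $scan->{find}('[a-z]+');    # scan again with another pattern
say "after rewind: $word->{match}";    # id
say "more input?   ", $scan->{eos_check}();   # 1
```

Because mod_pos accepts any value, a caller could also set a position past the end of the string; a sturdier version might validate the argument before assigning it.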
Closures in Perl are a clean alternative to creating full-blown classes when all you need is localized state and behavior. This approach:
✅ Keeps encapsulation tight
✅ Avoids polluting the global namespace
✅ Mimics object-like behavior without the overhead
Let’s see how this scanner behaves in action:
# Demo script
package main;

# Assumes the strscan package above has already been loaded
use feature qw|say|;

my $scan = strscan::create('This is just a test!');

say "start position: ", $scan->{pos}(), "\n";

# Loop until we reach end of string
while ($scan->{eos_check}()) {
    if (my $match = $scan->{find}('\w+')) {
        say "match: ", $match->{match};
        say "pos: ", $match->{pos};
        say "";
    }
}

say "end position: ", $scan->{pos}();
Initializes a new scanner with the string "This is just a test!".
Prints the initial position (should be 0).
Enters a loop that continues until eos_check returns false.
Each iteration:
Calls find with a regex for word characters (\w+).
If a match is found, prints the matched word and the updated position.
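Finally, prints the ending position. The last find call has nothing left to match in the trailing "!", so the scanner jumps to the end of the string (position 20).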
start position: 0
match: This
pos: 4
match: is
pos: 7
match: just
pos: 12
match: a
pos: 14
match: test
pos: 19
end position: 20
Notice how the scanner seamlessly moves through the string, match by match, updating its position and reporting exactly what it found.
You might wonder: "Why not just use a while loop with global regex matches?"
The answer comes down to control and flexibility:
With a scanner, you can pause, rewind, or jump to arbitrary positions.
You can match different patterns in each iteration.
You can track positions and match metadata manually.
You avoid side effects from Perl’s special regex variables ($1, $&, etc.).
This is especially helpful in building lexers, tokenizers, or custom parsers.
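For comparison, the plain loop that the question refers to might look roughly like this. It relies on Perl's built-in pos() function and the /g modifier in scalar context; the variable names $text and @seen are illustrative only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);

my $text = 'This is just a test!';
my @seen;    # collected [word, position] pairs

# In scalar context, /g resumes from pos($text) on each iteration,
# so Perl itself keeps the cursor on the scalar being matched.
while ($text =~ /(\w+)/g) {
    push @seen, [ $1, pos($text) ];
    say "match: $1, pos: ", pos($text);
}

# After the final failed match, pos($text) is reset to undef
# (the /c modifier would preserve it instead).
say "pos after loop: ", defined pos($text) ? pos($text) : 'undef';
```

This finds the same five words at the same positions, but the cursor lives on $text itself and the results arrive through $1 and pos(). The scanner instead keeps its position behind an interface you control and hands back each match as an ordinary hashref.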
By borrowing a concept from Ruby and applying Perl’s powerful closures and regex tools, we’ve constructed a lightweight, stateful string scanner. It’s modular, easy to extend, and neatly encapsulates internal state without requiring a full object-oriented design.
This scanner is a great foundation if you're:
Writing a simple tokenizer
Parsing domain-specific languages
Validating structured input
Exploring functional-style Perl
Perl remains a remarkably expressive language for string processing. With techniques like these, you can elevate your parsing logic to be both elegant and flexible.
Happy scanning! 🧵