Date Created: 2023-09-17
By: 16BitMiker
If you've been working with Perl for any length of time, you're likely familiar with its powerful regular expression engine. Perl 5.10 extended that engine with named captures and recursive patterns, and the CPAN module Regexp::Grammars builds something even more powerful on top of them: grammars. The module lets you write readable, structured parsers using regular expression syntax augmented with rule-based semantics. It bridges the gap between raw regexes and full-fledged parser generators like Parse::RecDescent or Marpa.
In this post, we'll explore the core concepts of Regexp::Grammars, walk through a practical example, and unravel the mechanics that make it such a flexible tool for parsing structured or semi-structured text.
What is Regexp::Grammars?
Regexp::Grammars is a Perl module that extends the regex syntax to support recursive, rule-based grammars. Instead of matching text with monolithic regex blobs, you can build named, reusable, and hierarchical parsing rules. This makes it easier to write complex parsers using familiar Perl idioms.
📦 CPAN: https://metacpan.org/pod/Regexp::Grammars
Here’s a condensed cheat-sheet to get your bearings:

    use Regexp::Grammars;       # Activates grammar parsing features in regexes

    %/                          # A special hash populated with named captures from grammar rules

    <grammar: NAME>             # Declares a grammar by name
    <extends: NAME>             # Inherit rules from another grammar

    <rule: NAME>                # Rule with automatic whitespace support
    <token: NAME>               # Raw rule, no whitespace handling
    <objrule: CLASS=NAME>       # Bless result into a class, with named rule
    <objrule: CLASS>            # Shortcut, rule name derived from class

    <RuleName>                  # Invoke subrule; result => $MATCH{RuleName}
    <RuleName(...)>             # Call with arguments
    <!RuleName>                 # Fail if subrule matches (negative lookahead)
    <:IDENT>                    # Match the contents of $ARG{IDENT} as a pattern
    <%HASH>                     # Match the longest matching key of %HASH
    <ALIAS=RuleName>            # Save match under alias
    <[RuleName]>                # Append result to array instead of overwriting

    <require: (?{ CODE })>      # Fail if condition is false
    <timeout: 2>                # Kill match after 2 seconds
    <debug: on>                 # Enable match-time debugging
    <ws: \s*>                   # Override auto-whitespace
    <minimize:>                 # Simplify (prune) the result tree

These constructs allow you to build modular grammars that can scale in complexity while remaining readable.
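To see a few of these constructs working together, here is a minimal sketch of a key = value parser. The grammar, the rule names (Pair, Key, Value) and the sample input are made up for illustration. It relies only on token rules, plain subrule calls, and the list form `<[...]>`, and it compiles the grammar inside a do block so that Regexp::Grammars affects just that one regex.

```perl
use strict;
use warnings;

# Hypothetical grammar for "key = value" lines (names are illustrative).
my $config_parser = do {
    use Regexp::Grammars;    # only active inside this block
    qr{
        <nocontext:>
        \A (?: <[Pair]> \n )+ \z

        <token: Pair>
            <Key> \h* = \h* <Value>

        <token: Key>
            \w+

        <token: Value>
            [^\n]*
    }x;
};

my $config = "name = demo\nretries = 3\n";

$config =~ $config_parser or die "config did not parse\n";
my %result = %/;    # copy right away: the next grammar match overwrites %/

# Each Pair is a hash holding the Key and Value subrule results.
for my $pair (@{ $result{Pair} }) {
    print "$pair->{Key} => $pair->{Value}\n";
}
```

Because Pair is called as `<[Pair]>`, every match is appended to an array under the Pair key rather than replacing the previous one.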
Let’s look at a real-world use case—parsing paragraphs from a block of text. Paragraphs are separated by one or more blank lines (possibly containing spaces).

    my $txt = <<'EOF';
    start text

    a second paragraph
    that has two lines

    A third paragraph
        with indented 2nd line.
    EOF

The grammar:

    use Regexp::Grammars;

    my $parser = qr{
        <nocontext:>                  # Skip storing matched substrings in the result tree
        <Text>                        # Start rule

        # --- Rules ---
        <rule: Text>                  # Root rule: match multiple paragraphs
            \A <[Paragraph]>+         # Start of text, 1+ paragraphs collected
            % <_Sep=(\n\s*\n)>        # Separated by blank-ish lines
            \Z                        # End of text

        <token: Paragraph>            # Define a paragraph as minimal text block
            ^.*?$                     # Lazily match from a line start to a line end
    }xsm;

Running the match and dumping the result:

    if ($txt =~ $parser) {
        use Data::Dumper;
        print Dumper \%/;
    }

Output (with Data::Dumper's leading $VAR1 = and deep indentation trimmed):

    {
      'Text' => {
        'Paragraph' => [
          'start text',
          'a second paragraph
    that has two lines',
          'A third paragraph
        with indented 2nd line.'
        ]
      }
    }

Let’s break down the key elements:
<rule: Text>
Acts as the entry point.
Uses \A and \Z to anchor the match at the beginning and end of the string (\Z also tolerates one trailing newline).
<[Paragraph]>+ means match one or more Paragraph entries and collect them in an array.
% <_Sep=...> says the repeated Paragraph matches must be separated by the pattern aliased as _Sep. Because the alias starts with an underscore, the separator text is discarded when the rule finishes and never appears in the final match tree.
<token: Paragraph>
A token does no automatic whitespace skipping, so the pattern matches exactly what it says.
It uses ^.*?$, which lazily matches from the start of a line to the nearest line end that lets the rest of the grammar succeed.
The m modifier makes ^ and $ match at line boundaries, not just string boundaries.
The s modifier lets . match newline characters, so a single Paragraph can span several lines.
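Once the match succeeds, the paragraphs are ordinary Perl data. The sketch below repeats the grammar from this post, minus the comments, and walks the resulting array. The sample string and the variable names are assumptions for illustration. The grammar is compiled inside a do block so that the split regex further down is untouched by Regexp::Grammars.

```perl
use strict;
use warnings;

# Same grammar as in the walkthrough, compiled in its own scope.
my $parser = do {
    use Regexp::Grammars;
    qr{
        <nocontext:>
        <Text>

        <rule: Text>
            \A <[Paragraph]>+ % <_Sep=(\n\s*\n)> \Z

        <token: Paragraph>
            ^.*?$
    }xsm;
};

my $txt = "start text\n\n"
        . "a second paragraph\nthat has two lines\n\n"
        . "A third paragraph\n    with indented 2nd line.\n";

$txt =~ $parser or die "text did not parse\n";

# <[Paragraph]> stored every match in an array under the Paragraph key.
my @paragraphs = @{ $/{Text}{Paragraph} };

for my $i (0 .. $#paragraphs) {
    my @lines = split /\n/, $paragraphs[$i];
    printf "Paragraph %d has %d line(s)\n", $i + 1, scalar @lines;
}
```

Note that there is no _Sep entry anywhere in the tree: the underscore alias matched the blank lines but was filtered out when the Text rule returned.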
👥 Readable: Grammar rules are more descriptive than raw regexes.
🔄 Modular: You can compartmentalize logic into reusable rules.
📦 Structured Output: The %/ hash provides a tree-like data structure, great for further processing or JSON conversion.
⚡ Efficient: With careful design and token rules, performance is acceptable even for moderately large data sets.
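As a rough illustration of the structured-output point, the result tree can be handed straight to a JSON encoder. This is a minimal sketch that assumes the core JSON::PP module. The date grammar and its rule names (Date, Year, Month, Day) are invented for the example.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical grammar for an ISO-style date such as 2023-09-17.
my $date_parser = do {
    use Regexp::Grammars;
    qr{
        <nocontext:>
        \A <Date> \z

        <token: Date>
            <Year> - <Month> - <Day>

        <token: Year>
            \d{4}

        <token: Month>
            \d{2}

        <token: Day>
            \d{2}
    }x;
};

'2023-09-17' =~ $date_parser or die "date did not parse\n";
my %tree = %/;    # plain nested hashes and strings, nothing blessed

# canonical() sorts the keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->pretty->encode(\%tree);
print $json;
```

With `<nocontext:>` in place the tree holds only the named subrule results, which keeps the JSON free of the extra matched-text entries.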
Regexp::Grammars is a great fit when:
You're parsing semi-structured data (like logs, config files, or DSLs).
You need to extract nested or recursive patterns.
You want a parser without the overhead of a separate parser generator.
For flat, single-line regex tasks, stick with Perl’s built-in regex. But when your parsing needs start resembling a context-free grammar, it’s time to level up.
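To make the nested case concrete, here is a small sketch of a recursive grammar for parenthesised lists such as (a (b c) d). The grammar, its rule names (List, Item, Atom) and the helper list_to_array are hypothetical and written for this post. All rules are tokens with explicit \s*, so nothing depends on automatic whitespace handling.

```perl
use strict;
use warnings;

# A List contains Items; an Item is either another List or an Atom.
my $sexpr_parser = do {
    use Regexp::Grammars;
    qr{
        \A <List> \z

        <token: List>
            \( \s* (?: <[Item]> \s* )* \)

        <token: Item>
            <List> | <Atom>

        <token: Atom>
            [^\s()]+
    }x;
};

# Convert a List node from the result tree into nested Perl arrays.
sub list_to_array {
    my ($list) = @_;

    # A List with no Items may come back as a plain string, so guard on ref.
    my @items = ref $list eq 'HASH' ? @{ $list->{Item} || [] } : ();

    my @out;
    for my $item (@items) {
        push @out, exists $item->{List}
            ? list_to_array($item->{List})    # recurse into the nested list
            : $item->{Atom};
    }
    return \@out;
}

'(a (b c) d)' =~ $sexpr_parser or die "input did not parse\n";
my %tree   = %/;
my $nested = list_to_array($tree{List});

print scalar(@$nested), " top-level items\n";
```

The List rule calls Item, which can call List again, and each level of nesting gets its own Item array in the result tree. That per-call bookkeeping is the part that is awkward to do with a single flat regex.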
🧰 Parse::RecDescent – for comparison
🚀 Marpa::R2 – for performance-heavy parsing
Regexp::Grammars empowers Perl developers to write expressive, maintainable parsers using the language they already know. Whether you're building a DSL interpreter or wrangling messy input, it's a tool worth having in your Perl toolbox.
Happy parsing! 🧵