Parsing Expression Grammars (Part 2 of N)

In post #1 you took a quick dive into a calculator PEG example destined for parsing with TCL. Now it is time to look closer at what each grammar expression has to offer.

Literals (a.k.a. Terminals)

Unlike the LR parsers YACC and BISON which have separate lexical analysis front ends (Lex & Flex respectively) as a necessary evil to support Context-Free Grammars (CFG), PEG descriptions contain explicit recognition rules for characters making tokenization unnecessary. This simplifies the problem by making parsing expressions identical for tokens, terminals and non-terminals. Since there are no tokens, the grammar writer must specify literal characters using apostrophes and quotation marks. A good example comes from the definition of the parsing expression grammar written in its own language!

void:   APOSTROPH      <- "'"                ;
void:   DAPOSTROPH     <- '"'                ;

You specify keywords by a sequence of characters inside quotation marks.

void:   PEG            <- "PEG"   WHITESPACE ;

To specify any arbitrary character utilize the dot. In the example below the expression says look for any single character that is not a backslash. This expression utilizes one of the syntactic predicates (the !) which I will introduce later on. Note the literal (single) character inside the quotation marks. This is a backslashed "backslash" which represents the single literal backslash character.

leaf:   CharUnescaped   <- !"\\" . ;

From time to time you will find a need to represent a range of characters. TCL has built-in string match operations so the grammar for the TCL parser of PEGs implements an extension that permits character ranges like those used in regular expressions. For example following parsing expression matches exactly one character that is a digit, a lower case alpha character or an upper case alpha character:

void:   HexDigit       <- [0-9a-fA-F] ;

Some of these ranges have special names that are mostly self explanatory. A few of them are:

leaf: ALNUM <- "<alnum>" WHITESPACE ;
leaf: ALPHA <- "<alpha>" WHITESPACE ;
leaf: XDIGIT <- "<xdigit>" WHITESPACE ;

Note that XDIGIT and HexDigit would match the same sequence of characters. TCL can match characters outside of the expected (ASCII) range in the unicode character set so be careful when using them.


Operations to construct parsing expressions with nonterminals and literals are represented pretty naturally with the PEG grammar. The available operations are: sequence, ordered choice, zero or more, one or more, zero or more (option), AND predicate, and NOT predicate. The sequence operation is almost pathological in its simplicity. It looks just like a list of things separated by spaces.

EXAMPLE1 <- e1 e2 e3 e4;/code>

In EXAMPLE1 the expression succeeds if first, e1 is found, followed by e2 (found), e3 (found) and e4 (found) otherwise it fails. Along the way, parsing of each expression consumes the input text before the next nonterminal is examined. More interestingly, the ordered choice operator using the '/' character only proceeds to any succeeding expression elements if earlier ones fail to match. So, as another way to recognize a set of characters that are octal digits you could write the following:

OCTAL <- '0' / '1' / '2' / '3' / '4' / '5' / '6' / '7';

When presented with "102" the parser will grab the next input character, see that it is not '1' and then backtrack. After the backtrack, the parser is ready to check to see if the next expression matches. In this case the expression following the '0' / is a literal '1'. It does match so the OCTAL terminal immediately succeeds. This seems like a great way to slow down the parser (if you ask me) when character ranges exist for the same purpose but it makes for a simple example. The next three operations, * + ?, have a conventional meaning like they do in regular expressions. So for the following expression:

GORT <- Klaatu* Barada+ Nikto?;

the text must include in sequence zero or more Klaatu terminals, at least one but possibly more Barada terminals followed last by zero or one Nikto terminals. Unfortunately, Gort may well destroy the earth if this syntax is followed!

Look Ahead with Syntactic Predicates

The last two operators, &e !e, perform tests on the expression e that act like look ahead. Using &e says you want to know if e exists next in the input stream but if you find it do not consume the text matching ejust continue to the next expression. Likewise, using !e says you want to know if e is not present next in the input stream and if you do not find it then do not consume the text matching e then proceed. So the behavior permits a look ahead in the input stream to make a decision about how to proceed.

Nesting Expressions

While not really an operation the open and close parenthesis characters allow for nesting expressions.

See Also:

Formal Grammar Definition Copyright (C) 2014, Aspen Logic, Inc. All Rights Reserved.