Parsing Expression Grammars (Part 1 of N)

Aspen Logic Knowledge Cave Article

Looking for a little simplicity, found a lot of power

I recently cracked open the help section of Active State's TCL help guide to learn about the parsing tools available to me in the TCL library. The obtuse documentation for the API led me to the conclusion they must be really powerful or why else would the documentation be so unhelpful. After one day of fiddling around I was able to create a script to parse

-357 *
# Put in an expression
(6 + 2)

and get a very nice looking Abstract Syntax Tree (AST) using the Parsing Expression Grammar (PEG) below


PEG calculator (Expression) ## Lexing constructs void:   WHITESPACE      <- (" " / "\t" / EOL / COMMENT)* ; void:   COMMENT         <- '#' (!EOL .)* EOL ; void:   EOL             <- "\n\r" / "\n" / "\r" ; void:   EOF             <- !. ; ## Syntactical constructs void: Digits            <- [0-9]+ ; void: Sign              <- '-' / '+' ; Number                  <- Sign? Digits WHITESPACE ; Expression              <- Term (AddOp Term)* ; MulOp                   <- ('*' / '/') WHITESPACE ; Term                    <- Factor (MulOp Factor)* ; AddOp                   <- ('+'/'-') WHITESPACE ; Factor                  <- ('(' Expression ')' / Number) WHITESPACE ; END; 


Simple? There were just a few things I had to learn along the way to bring my understand up slightly above the caveman level to make this simple.

  • How do I specify whitespace and in particular newlines?
  • Packrat parsing
  • Why doesn't my TCL 8.5 installation have the appropriate packages loaded?
  • Where do I get the right packages?
  • What are the right packages for that matter! (Sorry, I cannot remember the English language syntax precedence for ! over ?)
  • How do I provide input text to the parser?
  • What good is the abstract syntax tree?
  • PEG serialization
  • What is a TCL Param? (Not what you think!)
  • ...


Backing up a little might be helpful. I want a simple, easy to maintain way to parse expressions that a user types in (the expressions will not be standard arithmetic types). LEX and YACC, ANTLR, FLEX and BISON are all big name tools for parsing but they require an Einstein to understand and a Feynman to implement. The TCL library offered parsing tools but I had only glanced at them briefly in the past. After diving in I learned about parsing expression grammars and the packrat parsers that make them practical to deploy. Bryan Ford developed the Parsing Expression Grammar at MIT and talks about it in a short presentation you will want to read. A PEG combines the lexical description (for tokenization) and syntax analysis (for understanding structure)  into one language. It allows you to get around the shift/reduce conflicts in YACC/BISON and brings you beyond the assembly language level of regular expressions. Each PEG contains a list of parsing expressions that describe matching rules. The rule

Number <- Sign? Digits WHITESPACE ;

rule from the calculator PEG above consists of a terminal named Number, an arrow, and a PE. Once the parser has matched the expression on the right, the AST will contain a reference to Number and the character range that envelopes the matching text. How do you read the parsing expression? The terminal Number matches zero or exactly one Sign followed (in sequence) by Digits then WHITESPACE. In a simple sense, the terminal Number is a sub-routine in the parsing world. If a match in the text can be found according to the parsing expression on the right hand side then the parse of the terminal is successful and the parser can move on. Notice there is also a rule in the grammar for Sign and for Digits illustrating the hierarchical nature of the grammar description. At the root of the PE hierarchy you find the start terminal. The matching process the parser uses begins at the start terminal and works its way down to leaf terminals which are in some respect tokens. In the calculator PEG example above the parser will look for an Expression and while doing so it will look for a Term and while doing so it will look for a Factor and while doing so it will look for a '(' and lacking that it will then look for a Number. The miracle of PEGs lies in the simple '/' character between '(' Expression ')'and Number. The slash indicates a priority. "If you can't find '(' then backtrack and look for Number." That makes all the difference in the world to solving real world and very practical parsing problems.


The abstract syntax tree resulting from a parse of the string example at the very top of the post looks like this:

<Expression> :: 0 38
    <Term> :: 0 38
        <Factor> :: 0 4
            <Number> :: 0 4
        <MulOp> :: 5 30
        <Factor> :: 31 38
            <Expression> :: 32 36
                <Term> :: 32 33
                    <Factor> :: 32 33
                        <Number> :: 32 33
                <AddOp> :: 34 35
                <Term> :: 36 36
                    <Factor> :: 36 36
                        <Number> :: 36 36

The name of the terminal discovered in the text shows between angle brackets. The range of text characters covered by the terminal appears after the double colon as a start and end index (from 0). There are three things to note immediately about this AST. The Expression range covers the entire piece of text! Second, no references appear to Sign or Digits terminals. The ranges encompass the entire set of characters that matched for that terminal and its associated parsing expression. This may not be particularly useful depending on the application of any given parser but it allows for the extraction of any relevant portion of the structure. The Sign and Digits terminals are explicitly removed from the AST as a result of the void: mode specified for the terminals. This mode says that upon recognition of the specific terminal's expression to continue parsing but do not make a reference to the discovery in the AST. There is very little point in calling out every last bit of whitespace found in the text or for any comments that are found. This mode capability helps the PEG designer to clean up the resulting AST by only including important syntactical structures. And lastly you might notice the MulOp terminal covers the range 5 to 30. Huh? The way the grammar was laid out the parser expects to find WHITESPACE after a '*' or '/'. In the text being parsed a Comment appears in the middle of the whole expression right after the '*'. So, the MulOp will encompass the comment.

Up Next

In the next post, I will break down some of the useful things to know about constructing the PEG. Copyright (C) 2014, Aspen Logic, Inc. All Rights Reserved.