pyPEG – a PEG Parser-Interpreter in Python

pyPEG 1.5 of Fr Dec 30 2011 – Copyleft 2009-2011, Volker Birk

for Python version 3.x see pyPEG version 2

Introduction

Python is a nice scripting language. It even gives you access to its own parser and compiler. It also gives you access to different other parsers for special purposes like XML and string templates.

But sometimes you may want to have your own parser. This is what's pyPEG for. And pyPEG supports Unicode.

To get a quick view on what's happening, please read this article on how to parse an arbitrary language to XML with pyPEG on my blog.

What is PEG?

PEG means Parsing Expression Grammar. It's something like the idea of Regular Expressions for context free languages; a very clear explanation you'll find in the Wikipedia article about PEG.

With PEGs you can describe the same languages like with BNF (and they're even similar).

What is a Parser-Interpreter?

Common parsers are not using PEGs and top-down parsing, but LR(n) or LL(n) and bottom-up parsing. This results in the idea of implementing parser generators.

Because with LR(n) or LL(n) parsers you need to calculate out a DFA first, usually you let the parser generator do this for you. The result is a parser implementation for your BNF grammar, which was the input. One could call a parser generator a compiler from BNF to a parser implementation.

A Parser-Interpreter does work as an interpreter instead of being such a compiler. Just give your grammar as input, and it parses the described language out of text. There will be no program generated.

Using pyPEG

That means: using pyPEG is very easy ;-) If you know regular expressions already, you will learn to use pyPEG quickly.

A small sample

An example: think of a simple language like this one:

function fak(n) {
    if (n==0) { // 0! is 1 by definition
        return 1;
    } else {
        return n * fak(n - 1);
    };
}

A pyPEG for that language looks like this (see also the sample script):

def comment():          return [re.compile(r"//.*"), re.compile("/\*.*?\*/", re.S)]
def literal():          return re.compile(r'\d*\.\d*|\d+|".*?"')
def symbol():           return re.compile(r"\w+")
def operator():         return re.compile(r"\+|\-|\*|\/|\=\=")
def operation():        return symbol, operator, [literal, functioncall]
def expression():       return [literal, operation, functioncall]
def expressionlist():   return expression, -1, (",", expression)
def returnstatement():  return keyword("return"), expression
def ifstatement():      return keyword("if"), "(", expression, ")", block, keyword("else"), block
def statement():        return [ifstatement, returnstatement], ";"
def block():            return "{", -2, statement, "}"
def parameterlist():    return "(", symbol, -1, (",", symbol), ")"
def functioncall():     return symbol, "(", expressionlist, ")"
def function():         return keyword("function"), symbol, parameterlist, block
def simpleLanguage():   return function

How does this work?

pyPEG Expression Syntax

The pyPEG syntax is very easy; we're using Python objects to express a grammar:

Terminal Symbols using Python strings

A Python string defines a terminal symbol to parse. The result is not being output into the resulting pyAST.

Sample:

def block(): return "{", -2, statement, "}"

Terminal Symbols using the keyword() class

keyword(str) also defines a terminal symbol. This symbol is not being output into the pyAST, too. The difference is if you're using the skipws option of the parser keyword() terminal symbols are parsed only, if they're separated by whitespace correctly.

Sample:

def returnstatement(): return keyword("return"), expression

Terminal Symbols using Python RegEx objects

If you're defining a Python Regular Expression using re.compile(regex), then the matched characters are parsed and output into the pyAST. This is for parsing names of variables or functions of your language, for example.

Sample:

def symbol(): return re.compile(r"\w+")

Terminal Symbols using the ignore() class

ignore(regex) matches regex like re.compile(regex), but the result is not added to the parse tree.

Non-Terminal Symbols using Functions

If you define a Python function, which is returning a pyPEG, then this function represents a non-terminal symbol. The result is output into the pyAST under the name of this function.

Sample:

def parameterlist(): return "(", symbol, -1, (",", symbol), ")"

If you want to suppress the output of the name into the pyAST, then start the name with an underscore.

Production Sequences using Tuples

If you're using a tuple, then this represents a production sequence.

Sample:

def functioncall(): return symbol, "(", expressionlist, ")"

Precede an element with an integer for optionals and repetitions; 0 means optional (what is the ? in a RegEx), -1 means optional and can be repeated as often as you want (what is the * in a RegEx), -2 means optional and has to be there, but can be repeated as often as you want (what is the + in a RegEx), and any positive number means to repeat the following object this amount of times (whats the {m} in a RegEx).

Samples:

def block():         return "{", -2, statement, "}"
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"

Options using Lists

If you give a Python list of objects in your pyPEG, then this means that one of these options has to match.

Sample:

def expression(): return [literal, operation, functioncall]

With PEGs, the sequence is important, because we're not using BNF here ;-) So the first list element is checked first, and only if this does not match the second one is checked and so on. This is exactly like the behaviour of | separated groups in a RegEx.

Look ahead with _not() and _and()

Sometimes its the easiest to describe, what should not be true – for this case, the _not() class is there. The _and() class does the opposite – the things inside have to be matched by the same text to parse as the following pyPEG.

Using pyPEG in your program

The simple way is:

import re, fileinput
from pyPEG import parse
from pyPEG import keyword, _and, _not

# ...

files = fileinput.input()
result = parse(simpleLanguage(), files, True, comment)
print result

Parsing with parse()

parse() parses the content of a file.

Syntax:

parse(language, lineSource, skipWS = True, skipComments = None, packrat = False, lineCount = True)

The parse() function reads lineSource and outputs the resulting pyAST.

the parameters are:

languagepyPEG language description
linesourcea fileinput.Fileinput object
skipwsflag if whitespace should be skipped (default: True)
skipcommentsfunction which matches comments for cutting out by the parser
packratif True use memoization (faster with unhandy grammars)
lineCountdecorate pyAST with line numbers (default: True)

The parse() function raises a SyntaxError, if it finds one and gives the line number in the msg parameter of this exception in a description string.

It also raises a SyntaxError, if the language description is invalid. parse() returns the resulting pyAST, if no error occured.

Using parseLine()

If you want to parse a string and not the content of a file, there is the parseLine() function.

Syntax:

parseLine(textline, pattern, resultSoFar = [], skipWS = True, skipComments = None, packrat = False)

The parseLine() function parses textline using pyPEG pattern, then adds the result to resultSoFar.

The parameters are:

textlinetext to parse
patternpyPEG language description
resultSoFarpyAST expression with the parsing result
skipwsflag if whitespace should be skipped (default: True)
skipcommentsfunction which matches comments for cutting out by the parser
packratif True use memoization (faster with unhandy grammars)

The parseLine() function raises SyntaxError if the beginning of textline is not in your language and if your pyPEG grammar is invalid. It returns a tuple of the pyAST result and a string with the unparsable rest of your textline, if it is not completely in your language.

textline may contain one or more line feed characters.

Tracing the grammar rules

You can trace the grammar rules, which are applied, by setting pyPEG.print_trace to True. Then pyPEG will output a trace to stderr.

This feature will be deactivated, if you run your Python shell with optimizations switched on.

pyAST – the resulting output

Also the output is plain and simple: just a list of parsed symbols. If it's a terminal symbol, then the output is unicode text, if it's a non-terminal symbol, then it's a named list of class pyPEG.Symbol, which is a list of a string giving the name of the non-terminal symbol as pyPEG.Name followed by a list of contents of this symbol.

pyPEG.Symbol is derived from list, the Python list class. That means, you can handle it like any Python list. It has two advantages: it has a .__name__ attribute, where you can find the name of the symbol, and it has a .what attribute, where you can find the contents of a symbol. So if somesym is of class pyPEG.Symbol, the terms somesym.__name__ and somesym[0] are equivalent. Analogically, the two terms somesym.what and somesym[1] are equivalent.

pyPEG.Name is derived from unicode, the Python unicode string class. That means, you can handle it like any Python text.

The reason, why it's not the original Python string class, is, that pyPEG.Name has an extra feature: there are line numbers in the objects, so your compiler can decorate its error messages with them for semantic errors, if you want to. The line numbers you're getting with the attribute .line of any pyPEG.Name object.

Our sample above gives you the following pyAST:

[Symbol(u'symbol', u'fak'), Symbol(u'parameterlist', [Symbol(u'symbol', u'n')]),
Symbol(u'block', [Symbol(u'statement', [Symbol(u'ifstatement', [Symbol(u'expression',
[Symbol(u'operation', [Symbol(u'symbol', u'n'), Symbol(u'operator', u'=='),
Symbol(u'literal', u'0')])]), Symbol(u'block', [Symbol(u'statement',
[Symbol(u'returnstatement', [Symbol(u'expression', [Symbol(u'literal', u'1')])])])]),
Symbol(u'block', [Symbol(u'statement', [Symbol(u'returnstatement', [Symbol(u'expression',
[Symbol(u'operation', [Symbol(u'symbol', u'n'), Symbol(u'operator', u'*'),
Symbol(u'functioncall', [Symbol(u'symbol', u'fak'), Symbol(u'expressionlist',
[Symbol(u'expression', [Symbol(u'operation', [Symbol(u'symbol', u'n'),
Symbol(u'operator', u'-'), Symbol(u'literal', u'1')])])])])])])])])])])])])]

XML output with the XML backend

Together with the XML backend, pyAST can be easily translated into XML, see this sample:

vb@bayhorse:~/pyPEG % python xmlsample.py sample.simple | xmlstarlet fo
<?xml version="1.0"?>
<simpleLanguage>
  <function>
    <symbol>fak</symbol>
    <parameterlist>
      <symbol>n</symbol>
    </parameterlist>
    <block>
      <ifstatement>
        <expression>
          <operation>
            <symbol>n</symbol>
            <operator>==</operator>
            <literal>0</literal>
          </operation>
        </expression>
        <block>
          <returnstatement>
            <expression>
              <literal>1</literal>
            </expression>
          </returnstatement>
        </block>
        <block>
          <returnstatement>
            <expression>
              <operation>
                <symbol>n</symbol>
                <operator>*</operator>
                <functioncall>
                  <symbol>fak</symbol>
                  <expressionlist>
                    <expression>
                      <operation>
                        <symbol>n</symbol>
                        <operator>-</operator>
                        <literal>1</literal>
                      </operation>
                    </expression>
                  </expressionlist>
                </functioncall>
              </operation>
            </expression>
          </returnstatement>
        </block>
      </ifstatement>
    </block>
  </function>
</simpleLanguage>
vb@bayhorse:~/pyPEG % 

To process that, you can use YML. This sample YSLT script selects the defined symbols:

include yslt.yml2

tstylesheet {
    template "//function" | function «symbol» has parameters `apply "parameterlist"`
    template "parameterlist" {
        > «symbol»
        if "position()!=last()" > , 
    }
}

You can execute this using:

% python xmlsample.py sample.simple | yml2proc -I ~/yml2 -y symbols.ysl2 -x -
function fak has parameters n
% _

For more information, please read the YML homepage.

Want to download? Go to the ^Top^ and look to the right ;-)