pyPEG – a PEG Parser-Interpreter in Python
for Python version 3.x see pyPEG version 2
Python is a nice scripting language. It even gives you access to its own parser and compiler. It also gives you access to different other parsers for special purposes like XML and string templates.
But sometimes you may want to have your own parser. This is what's pyPEG for. And pyPEG supports Unicode.
To get a quick view on what's happening, please read this article on how to parse an arbitrary language to XML with pyPEG on my blog.
PEG means Parsing Expression Grammar. It's something like the idea of Regular Expressions for context free languages; a very clear explanation you'll find in the Wikipedia article about PEG.
With PEGs you can describe the same languages like with BNF (and they're even similar).
Common parsers are not using PEGs and top-down parsing, but LR(n) or LL(n) and bottom-up parsing. This results in the idea of implementing parser generators.
Because with LR(n) or LL(n) parsers you need to calculate out a DFA first, usually you let the parser generator do this for you. The result is a parser implementation for your BNF grammar, which was the input. One could call a parser generator a compiler from BNF to a parser implementation.
A Parser-Interpreter does work as an interpreter instead of being such a compiler. Just give your grammar as input, and it parses the described language out of text. There will be no program generated.
That means: using pyPEG is very easy ;-) If you know regular expressions already, you will learn to use pyPEG quickly.
An example: think of a simple language like this one:
function fak(n) {
if (n==0) { // 0! is 1 by definition
return 1;
} else {
return n * fak(n - 1);
};
}
A pyPEG for that language looks like this (see also the sample script):
def comment(): return [re.compile(r"//.*"), re.compile("/\*.*?\*/", re.S)]
def literal(): return re.compile(r'\d*\.\d*|\d+|".*?"')
def symbol(): return re.compile(r"\w+")
def operator(): return re.compile(r"\+|\-|\*|\/|\=\=")
def operation(): return symbol, operator, [literal, functioncall]
def expression(): return [literal, operation, functioncall]
def expressionlist(): return expression, -1, (",", expression)
def returnstatement(): return keyword("return"), expression
def ifstatement(): return keyword("if"), "(", expression, ")", block, keyword("else"), block
def statement(): return [ifstatement, returnstatement], ";"
def block(): return "{", -2, statement, "}"
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"
def functioncall(): return symbol, "(", expressionlist, ")"
def function(): return keyword("function"), symbol, parameterlist, block
def simpleLanguage(): return function
How does this work?
The pyPEG syntax is very easy; we're using Python objects to express a grammar:
A Python string defines a terminal symbol to parse. The result is not being output into the resulting pyAST.
Sample:
def block(): return "{", -2, statement, "}"
keyword(str)
also defines a terminal symbol. This symbol is not being output into
the pyAST, too. The difference is if you're using the skipws
option of
the parser keyword()
terminal symbols are parsed only, if they're separated by
whitespace correctly.
Sample:
def returnstatement(): return keyword("return"), expression
If you're defining a Python Regular Expression
using re.compile(regex)
, then the matched characters are parsed and output into the
pyAST. This is for parsing names of
variables or
functions of your language, for example.
Sample:
def symbol(): return re.compile(r"\w+")
ignore(regex)
matches regex like re.compile(regex)
, but the result is not added
to the parse tree.
If you define a Python function, which is returning a pyPEG, then this function represents a non-terminal symbol. The result is output into the pyAST under the name of this function.
Sample:
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"
If you want to suppress the output of the name into the pyAST, then start the name with an underscore.
If you're using a tuple, then this represents a production sequence.
Sample:
def functioncall(): return symbol, "(", expressionlist, ")"
Precede an element with an integer
for optionals and repetitions; 0
means optional (what is the ?
in a RegEx), -1
means
optional and can be repeated as often as you want (what is the *
in a RegEx), -2
means
optional and has to be there, but can be repeated as often as you want (what is the +
in
a RegEx), and any positive number means to repeat the following object this amount of times
(whats the {m}
in a RegEx).
Samples:
def block(): return "{", -2, statement, "}"
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"
If you give a Python list of objects in your pyPEG, then this means that one of these options has to match.
Sample:
def expression(): return [literal, operation, functioncall]
With PEGs, the sequence is important, because we're not using BNF here ;-) So the first
list element is checked first, and only if this does not match the second one is checked
and so on. This is exactly like the behaviour of |
separated groups in a RegEx.
Sometimes its the easiest to describe, what should not be true – for this case, the _not()
class is there. The _and()
class does the opposite – the things inside have to be matched
by the same text to parse as the following pyPEG.
The simple way is:
import re, fileinput
from pyPEG import parse
from pyPEG import keyword, _and, _not
# ...
files = fileinput.input()
result = parse(simpleLanguage(), files, True, comment)
print result
parse()
parses the content of a file.
Syntax:
parse(language, lineSource, skipWS = True, skipComments = None, packrat = False, lineCount = True)
The parse()
function reads lineSource
and outputs the resulting pyAST.
the parameters are:
language | pyPEG language description |
linesource | a fileinput.Fileinput object |
skipws | flag if whitespace should be skipped (default: True ) |
skipcomments | function which matches comments for cutting out by the parser |
packrat | if True use memoization (faster with unhandy grammars) |
lineCount | decorate pyAST with line numbers (default: True ) |
The parse()
function raises a SyntaxError
, if it finds one and gives the line number
in the msg
parameter of this exception in a description string.
It also raises a SyntaxError
, if the language description is invalid. parse()
returns the resulting pyAST, if no error occured.
If you want to parse a string and not the content of a file, there is the parseLine()
function.
Syntax:
parseLine(textline, pattern, resultSoFar = [], skipWS = True, skipComments = None, packrat = False)
The parseLine()
function parses textline
using pyPEG pattern
, then adds the
result to resultSoFar
.
The parameters are:
textline | text to parse |
pattern | pyPEG language description |
resultSoFar | pyAST expression with the parsing result |
skipws | flag if whitespace should be skipped (default: True ) |
skipcomments | function which matches comments for cutting out by the parser |
packrat | if True use memoization (faster with unhandy grammars) |
The parseLine()
function raises SyntaxError
if the beginning of textline
is not in
your language and if your pyPEG grammar is invalid. It returns a tuple of the
pyAST result and a string with the unparsable rest of your textline
, if it is
not completely in your language.
textline
may contain one or more line feed characters.
You can trace the grammar rules, which are applied, by setting pyPEG.print_trace
to
True
. Then pyPEG will output a trace to stderr
.
This feature will be deactivated, if you run your Python shell with optimizations switched on.
Also the output is plain and simple: just a list of parsed symbols. If it's a terminal
symbol, then the output is unicode text, if it's a non-terminal symbol, then it's a named list
of class pyPEG.Symbol
, which is a list of a string giving the name of the non-terminal
symbol as pyPEG.Name
followed by a list of contents of this symbol.
pyPEG.Symbol
is derived from list
, the Python list class. That means, you can handle
it like any Python list. It has two advantages: it has a .__name__
attribute, where you
can find the name of the symbol, and it has a .what
attribute, where you can find the
contents of a symbol. So if somesym
is of class pyPEG.Symbol
, the terms
somesym.__name__
and somesym[0]
are equivalent. Analogically, the two terms
somesym.what
and somesym[1]
are equivalent.
pyPEG.Name
is derived from unicode
, the Python unicode string class. That means, you can
handle it like any Python text.
The reason, why it's not the original Python string class, is, that pyPEG.Name
has an extra
feature: there are line numbers in the objects, so your compiler can decorate its error
messages with them for semantic errors, if you want to. The line numbers you're getting
with the attribute .line
of any pyPEG.Name
object.
Our sample above gives you the following pyAST:
[Symbol(u'symbol', u'fak'), Symbol(u'parameterlist', [Symbol(u'symbol', u'n')]),
Symbol(u'block', [Symbol(u'statement', [Symbol(u'ifstatement', [Symbol(u'expression',
[Symbol(u'operation', [Symbol(u'symbol', u'n'), Symbol(u'operator', u'=='),
Symbol(u'literal', u'0')])]), Symbol(u'block', [Symbol(u'statement',
[Symbol(u'returnstatement', [Symbol(u'expression', [Symbol(u'literal', u'1')])])])]),
Symbol(u'block', [Symbol(u'statement', [Symbol(u'returnstatement', [Symbol(u'expression',
[Symbol(u'operation', [Symbol(u'symbol', u'n'), Symbol(u'operator', u'*'),
Symbol(u'functioncall', [Symbol(u'symbol', u'fak'), Symbol(u'expressionlist',
[Symbol(u'expression', [Symbol(u'operation', [Symbol(u'symbol', u'n'),
Symbol(u'operator', u'-'), Symbol(u'literal', u'1')])])])])])])])])])])])])]
Together with the XML backend, pyAST can be easily translated into XML, see this sample:
vb@bayhorse:~/pyPEG % python xmlsample.py sample.simple | xmlstarlet fo
<?xml version="1.0"?>
<simpleLanguage>
<function>
<symbol>fak</symbol>
<parameterlist>
<symbol>n</symbol>
</parameterlist>
<block>
<ifstatement>
<expression>
<operation>
<symbol>n</symbol>
<operator>==</operator>
<literal>0</literal>
</operation>
</expression>
<block>
<returnstatement>
<expression>
<literal>1</literal>
</expression>
</returnstatement>
</block>
<block>
<returnstatement>
<expression>
<operation>
<symbol>n</symbol>
<operator>*</operator>
<functioncall>
<symbol>fak</symbol>
<expressionlist>
<expression>
<operation>
<symbol>n</symbol>
<operator>-</operator>
<literal>1</literal>
</operation>
</expression>
</expressionlist>
</functioncall>
</operation>
</expression>
</returnstatement>
</block>
</ifstatement>
</block>
</function>
</simpleLanguage>
vb@bayhorse:~/pyPEG %
To process that, you can use YML. This sample YSLT script selects the defined symbols:
include yslt.yml2
tstylesheet {
template "//function" | function «symbol» has parameters `apply "parameterlist"`
template "parameterlist" {
> «symbol»
if "position()!=last()" > ,
}
}
You can execute this using:
% python xmlsample.py sample.simple | yml2proc -I ~/yml2 -y symbols.ysl2 -x -
function fak has parameters n
% _
For more information, please read the YML homepage.