pyPEG - a PEG Parser-Interpreter in Python
Python is a nice scripting language. It even gives you access to it's own parser and compiler. It also gives you access to different other parsers for special purposes like XML and string templates.
But sometimes you may want to have your own parser. This is what's pyPEG for.
To get a quick view on what's happening, please read this article on how to parse an arbitrary language to XML with pyPEG on my blog.
PEG means Parsing Expression Grammar. It's something like the idea of Regular Expressions for context free languages; a very clear explanation you'll find in the Wikipedia article about PEG.
With PEGs you can describe the same languages like with BNF (and they're even similar).
Common parsers are not using PEGs and top-down parsing, but LR(n) or LL(n) and bottom-up parsing. This results in the idea of implementing parser generators.
Because with LR(n) or LL(n) parsers you need to calculate out a DFA first, usually you let the parser generator do this for you. The result is a parser implementation for your BNF grammar, which was the input. One could call a parser generator a compiler from BNF to a parser implementation.
A Parser-Interpreter does work as an interpreter instead of being such a compiler. Just give your grammar as input, and it parses the described language out of text. There will be no program generated.
That means: using pyPEG is very easy ;-) If you know regular expressions already, you will learn to use pyPEG quickly.
An example: think of a simple language like this one:
function fak(n) {
if (n==0) { // 0! is 1 by definition
return 1;
} else {
return n * fak(n - 1);
};
}
A pyPEG for that language looks like this (see also the sample script):
def comment(): return [re.compile(r"//.*"), re.compile("/\*.*?\*/", re.S)]
def literal(): return re.compile(r'\d*\.\d*|\d+|".*?"')
def symbol(): return re.compile(r"\w+")
def operator(): return re.compile(r"\+|\-|\*|\/|\=\=")
def operation(): return symbol, operator, [literal, functioncall]
def expression(): return [literal, operation, functioncall]
def expressionlist(): return expression, -1, (",", expression)
def returnstatement(): return keyword("return"), expression
def ifstatement(): return keyword("if"), "(", expression, ")", block, keyword("else"), block
def statement(): return [ifstatement, returnstatement], ";"
def block(): return "{", -2, statement, "}"
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"
def functioncall(): return symbol, "(", expressionlist, ")"
def function(): return keyword("function"), symbol, parameterlist, block
def simpleLanguage(): return function
How does this work?
The pyPEG syntax is very easy; we're using Python objects to express a grammar:
A Python string defines a terminal symbol to parse. The result is not being output into the resulting pyAST.
Sample:
def block(): return "{", -2, statement, "}"
keyword(str) also defines a terminal symbol. This symbol is not being output into the pyAST, too. The difference is if you're using the skipws option of the parser keyword() terminal symbols are parsed only, if they're separated by whitespace correctly.
Sample:
def returnstatement(): return keyword("return"), expression
If you're defining a Python Regular Expression using re.compile(regex), then the matched characters are parsed and output into the pyAST. This is for parsing names of variables or functions of your language, for example.
Sample:
def symbol(): return re.compile(r"\w+")
ignore(regex) matches regex like re.compile(regex), but the result is not added to the parse tree.
If you define a Python function, which is returning a pyPEG, then this function represents a non-terminal symbol. The result is output into the pyAST under the name of this function.
Sample:
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"
If you want to suppress the output of the name into the pyAST, then start the name with an underscore.
If you're using a tuple, then this represents a production sequence.
Sample:
def functioncall(): return symbol, "(", expressionlist, ")"
Precede an element with an integer for optionals and repetitions; 0 means optional (what is the ? in a RegEx), -1 means optional and can be repeated as often as you want (what is the * in a RegEx), -2 means optional and has to be there, but can be repeated as often as you want (what is the + in a RegEx), and any positive number means to repeat the following object this amount of times (whats the {m} in a RegEx).
Samples:
def block(): return "{", -2, statement, "}"
def parameterlist(): return "(", symbol, -1, (",", symbol), ")"
If you give a Python list of objects in your pyPEG, then this means that one of these options has to match.
Sample:
def expression(): return [literal, operation, functioncall]
With PEGs, the sequence is important, because we're not using BNF here ;-) So the first list element is checked first, and only if this does not match the second one is checked and so on. This is exactly like the behaviour of | separated groups in a RegEx.
Sometimes its the easiest to describe, what should not be true - for this case, the _not() class is there. The _and() class does the opposite - the things inside have to be matched by the same text to parse as the following pyPEG.
The simple way is:
import re, fileinput
from pyPEG import parse
from pyPEG import keyword, _and, _not
# ...
files = fileinput.input()
result = parse(simpleLanguage(), files, True, comment)
print result
parse() parses the content of a file.
Syntax:
parse(language, lineSource, skipWS = True, skipComments = None, outputPos = False)
The parse() function reads lineSource and outputs the resulting pyAST.
the parameters are:
language | pyPEG language description |
linesource | a fileinput.Fileinput object |
skipws | flag if whitespace should be skipped (default: True) |
skipcomments | function which matches comments for cutting out by the parser |
outputPos | if True output integers with character positions into the pyAST |
The parse() function raises a SyntaxError, if it finds one and gives the line number in the msg parameter of this exception in a description string.
It also raises a SyntaxError, if the language description is invalid. parse() returns the resulting pyAST, if no error occured.
If you want to parse a string and not the content of a file, there is the parseLine() function.
Syntax:
parseLine(textline, pattern, resultSoFar = [], skipWS = True, skipComments = None, outputPos = False)
The parseLine() function parses textline using pyPEG pattern, then adds the result to resultSoFar.
The parameters are:
textline | text to parse |
pattern | pyPEG language description |
resultSoFar | pyAST expression with the parsing result |
skipws | flag if whitespace should be skipped (default: True) |
skipcomments | function which matches comments for cutting out by the parser |
outputPos | if True output integers with character positions into the pyAST |
The parseLine() function raises SyntaxError if the beginning of textline is not in your language and if your pyPEG grammar is invalid. It returns a tuple of the pyAST result and a string with the unparsable rest of your textline, if it is not completely in your language.
textline may contain one or more line feed characters.
Also the output is plain and simple: just a list of parsed symbols. If it's a terminal symbol, then the output is text, if it's a non-terminal symbol, then it's a tuple of a string giving the name of the non-terminal symbol followed by a list of contents of this symbol.
Our sample above gives you the following pyAST:
[('symbol', 'fak'), ('parameterlist', [('symbol', 'n')]),
('block', [('statement', [('ifstatement', [('expression',
[('operation', [('symbol', 'n'), ('operator', '=='),
('literal', '0')])]), ('block', [('statement',
[('returnstatement', [('expression', [('literal', '1')])])
)]), ('block', [('statement', [('returnstatement',
[('expression', [('operation', [('symbol', 'n'), ('operator',
'*'), ('functioncall', [('symbol', 'fak'), ('expressionlist',
[('expression', [('operation', [('symbol', 'n'), ('operator',
'-'), ('literal', '1')])])])])])])])])])])])])]
Together with the XML backend, pyAST can be easily translated into XML, see this sample:
vb@bayhorse:~/pyPEG % python xmlsample.py sample.simple | xmlstarlet fo
<?xml version="1.0"?>
<simpleLanguage>
<function>
<symbol>fak</symbol>
<parameterlist>
<symbol>n</symbol>
</parameterlist>
<block>
<ifstatement>
<expression>
<operation>
<symbol>n</symbol>
<operator>==</operator>
<literal>0</literal>
</operation>
</expression>
<block>
<returnstatement>
<expression>
<literal>1</literal>
</expression>
</returnstatement>
</block>
<block>
<returnstatement>
<expression>
<operation>
<symbol>n</symbol>
<operator>*</operator>
<functioncall>
<symbol>fak</symbol>
<expressionlist>
<expression>
<operation>
<symbol>n</symbol>
<operator>-</operator>
<literal>1</literal>
</operation>
</expression>
</expressionlist>
</functioncall>
</operation>
</expression>
</returnstatement>
</block>
</ifstatement>
</block>
</function>
</simpleLanguage>
vb@bayhorse:~/pyPEG %
To process that, you can use YML. This sample YSLT script selects the defined symbols:
include yslt.yml2
tstylesheet {
template "//function" | function «symbol» has parameters `apply "parameterlist"`
template "parameterlist" {
> «symbol»
if "position()!=last()" > ,
}
}
You can execute this using:
% python xmlsample.py sample.simple | yml2proc -I ~/yml2 -y symbols.ysl2 -x -
function fak has parameters n
% _
For more information, please read the YML homepage.