pyPEG – a PEG Parser-Interpreter in Python
Requires Python 3.x or 2.7
Older versions: pyPEG 1.x
Python is a nice scripting language. It even gives you access to its own parser and compiler. It also gives you access to different other parsers for special purposes like XML and string templates.
But sometimes you may want to have your own parser. This is what's pyPEG for. And pyPEG supports Unicode.
pyPEG is a plain and simple intrinsic parser interpreter framework for Python version 2.7 and 3.x. It is based on Parsing Expression Grammar, PEG. With pyPEG you can parse many formal languages in a very easy way. How does that work?
You can install a 2.x
series pyPEG release from PyPY with:
pip install pypeg2
PEG is something like Regular Expressions with recursion. The grammars are like templates. Let's make an example. Let's say, you want to parse a function declaration in a C like language. Such a function declaration consists of:
type declaration | |
name | |
parameters | |
block with instructions |
int f(int a, long b)
{ do_this; do_that; }
With pyPEG you're declaring a Python class for each object type you want to parse. This class is then instanciated for each parsed object. This class gets an attribute grammar
with a description what should be parsed in what way. In our simple example, we are supporting two different things declared as keywords in our language: int
and long
. So we're writing a class declaration for the typing, which supports an Enum
of the two possible keywords as its grammar
:
class Type(Keyword):
grammar = Enum( K("int"), K("long") )
Common parsing tasks are included in the pyPEG framework. In this example, we're using the Keyword
class because the result will be a keyword, and we're using Keyword
objects (with the abbreviation K
), because what we parse will be one of the enlisted keywords.
The total result will be a Function
. So we're declaring a Function
class:
class Function:
grammar = Type, …
The next thing will be the name of the Function
to parse. Names are somewhat special in pyPEG. But they're easy to handle: to parse a name, there is a ready made name()
function you can call in your grammar to generate a .name
Attribute
:
class Function:
grammar = Type, name(), …
Now for the Parameters
part. First let's declare a class for the parameters. Parameters
has to be a collection, because there may be many of them. pyPEG has some ready made collections. For the case of the Parameters
, the Namespace
collection will fit. It provides indexed access by name, and Parameters
have names (in our example: a
and b
). We write it like this:
class Parameters(Namespace):
grammar = …
A single Parameter
has a structure itself. It has a Type
and a name()
. So let's define:
class Parameter:
grammar = Type, name()
class Parameters(Namespace):
grammar = …
pyPEG will instantiate the Parameter
class for each parsed parameter. Where will the Type
go to? The name()
function will generate a .name
Attribute
, but the Type
object? Well, let's move it to an Attribute
, too, named .typing
. To generate an Attribute
, pyPEG offers the attr()
function:
class Parameter:
grammar = attr("typing", Type), name()
class Parameters(Namespace):
grammar = …
By the way: name()
is just a shortcut for attr("name", Symbol)
. It generates a Symbol
.
How can we fill our Namespace
collection named Parameters
? Well, we have to declare, how a list of Parameter
objects will look like in our source text. An easy way is offered by pyPEG with the cardinality functions. In this case we can use maybe_some()
. This function represents the asterisk cardinality, *
class Parameter:
grammar = attr("typing", Type), name()
class Parameters(Namespace):
grammar = Parameter, maybe_some(",", Parameter)
This is how we express a comma separated list. Because this task is so common, there is a shortcut generator function again, csl()
. The code below will do the same as the code above:
class Parameter:
grammar = attr("typing", Type), name()
class Parameters(Namespace):
grammar = csl(Parameter)
Maybe a function has no parameters. This is a case we have to consider. What should happen then? In our example, then the Parameters
Namespace
should be empty. We're using another cardinality function for that case, optional()
. It represents the question mark cardinality, ?
class Parameter:
grammar = attr("typing", Type), name()
class Parameters(Namespace):
grammar = optional(csl(Parameter))
We can continue with our Function
class. The Parameters
will be in parantheses, we just put that into the grammar
:
class Function:
grammar = Type, name(), "(", Parameters, ")", …
Now for the block of instructions. We could declare another collection for the Instructions. But the function itself can be seen as a list of instructions. So let us declare it this way. First we make the Function
class itself a List
:
class Function(List):
grammar = Type, name(), "(", Parameters, ")", …
If a class is a List
, pyPEG will put everything inside this list, which will be parsed and does not generate an Attribute
. So with that modification, our Parameters
now will be put into that List, too. And so will be the Type
. This is an option, but in our example, it is not what we want. So let's move them to an Attribute
.typing
and an Attribute
.parms
respectively:
class Function(List):
grammar = attr("typing", Type), name(), \
"(", attr("parms", Parameters), ")", …
Now we can define what a block
will look like, and put it just behind into the grammar
of a Function
. The Instruction
class we have plain and simple. Of course, in a real world example, it can be pretty complex ;-) Here we just have it as a word
. A word
is a predefined RegEx
; it is re.compile(r"\w+")
.
class Instruction(str):
grammar = word, ";"
block = "{", maybe_some(Instruction), "}"
Now let's put that to the tail of our Function.grammar
:
class Function(List):
grammar = attr("typing", Type), name(), \
"(", attr("parms", Parameters), ")", block
Caveat: pyPEG 2.x is written for Python 3. You can use it with Python 2.7 with the following import (you don't need that for Python 3):
from __future__ import unicode_literals, print_function
Well, that looks pretty good now. Let's try it out using the parse()
function:
>>> from pypeg2 import *
>>> class Type(Keyword):
... grammar = Enum( K("int"), K("long") )
...
>>> class Parameter:
... grammar = attr("typing", Type), name()
...
>>> class Parameters(Namespace):
... grammar = optional(csl(Parameter))
...
>>> class Instruction(str):
... grammar = word, ";"
...
>>> block = "{", maybe_some(Instruction), "}"
>>> class Function(List):
... grammar = attr("typing", Type), name(), \
... "(", attr("parms", Parameters), ")", block
...
>>> f = parse("int f(int a, long b) { do_this; do_that; }",
... Function)
>>> f.name
Symbol('f')
>>> f.typing
Symbol('int')
>>> f.parms["b"].typing
Symbol('long')
>>> f[0]
'do_this'
>>> f[1]
'do_that'
pyPEG can do more. It is not only a framework for parsing text, it can compose source code, too. A pyPEG grammar
is not only “just like” a template, it can actually be used as a template for composing text. Just call the compose()
function:
>>> compose(f, autoblank=False)
'intf(inta, longb){do_this;do_that;}'
As you can see, for composing first there is a lack of whitespace. This is because we used the automated whitespace removing functionality of pyPEG while parsing (which is enabled by default) but we disabled the automated adding of blanks if violating syntax otherwise. To improve on that we have to extend our grammar
templates a little bit. For that case, there are callback function objects in pyPEG. They're only executed by compose()
and ignored by parse()
. And as usual, there are predefined ones for the common cases. Let's try that out. First let's add blank
between things which should be separated:
class Parameter:
grammar = attr("typing", Type), blank, name()
class Function(List):
grammar = attr("typing", Type), blank, name(), \
"(", attr("parms", Parameters), ")", block
After resetting everything, this will lead to the output:
>>> compose(f, autoblank=False)
'int f(int a, long b){do_this;do_that;}'
The blank
after the comma int a, long b
was generated by the csl()
function; csl(Parameter)
generates:
Parameter, maybe_some(",", blank, Parameter)
In C like languages (like our example) we like to indent blocks. Indention is something, which is relative to a current position. If something is inside a block already, and should be indented, it has to be indented two times (and so on). For that case pyPEG has an indention system.
The indention system basically is using the generating function indent()
and the callback function object endl
. With indent we can mark what should be indented, sending endl
means here should start the next line of the source code being output. We can use this for our block
:
class Instruction(str):
grammar = word, ";", endl
block = "{", endl, maybe_some(indent(Instruction)), "}", endl
class Function(List):
grammar = attr("typing", Type), blank, name(), \
"(", attr("parms", Parameters), ")", endl, block
This changes the output to:
>>> print(compose(f))
int f(int a, long b)
{
do_this;
do_that;
}
With User defined Callback Functions pyPEG offers the needed flexibility to be useful as a general purpose template system for code generation. In our simple example let's say we want to have processing information in comments in the Function
declaration, i.e. the indention level in a comment bevor each Instruction
. For that we can define our own Callback Function:
class Instruction(str):
def heading(self, parser):
return "/* on level " + str(parser.indention_level) \
+ " */", endl
Such a Callback Function is called with two arguments. The first argument is the object to output. The second argument is the parser object to get state information of the composing process. Because this fits the convention for Python methods, you can write it as a method of the class where it belongs to.
The return value of such a Callback Function must be the resulting text. In our example, a C comment shell be generated with notes. We can put this now into the grammar
.
class Instruction(str):
def heading(self, parser):
return "/* on level " + str(parser.indention_level) \
+ " */", endl
grammar = heading, word, ";", endl
The result is corresponding:
>>> print(compose(f))
int f(int a, long b)
{
/* on level 1 */
do_this;
/* on level 1 */
do_that;
}
Sometimes you want to process what you parsed with the XML toolchain, or with the YML toolchain. Because of that, pyPEG has an XML backend. Just call the thing2xml()
function to get bytes
with encoded XML:
>>> from pypeg2.xmlast import thing2xml
>>> print(thing2xml(f, pretty=True).decode())
<Function typing="int" name="f">
<Parameters>
<Parameter typing="int" name="a"/>
<Parameter typing="long" name="b"/>
</Parameters>
<Instruction>do_this</Instruction>
<Instruction>do_that</Instruction>
</Function>
The complete sample code you can download here.