pyPEG – a PEG Parser-Interpreter in Python
Requires Python 3.x or 2.7
Older versions: pyPEG 1.x
Caveat: pyPEG 2.x is written for Python 3, so it accepts Unicode strings only. To use it with Python 2.7, write u'string' instead of 'string', or add the following import (not needed for Python 3):
from __future__ import unicode_literals
The samples in this documentation are written for Python 3, too. To execute them with Python 2.7, you'll need this import:
from __future__ import print_function
pyPEG 2.x supports new-style classes only.
A str instance, as well as an instance of pypeg2.Literal, is parsed in the source text as a Terminal Symbol. It is removed, and no result is put into the Abstract Syntax Tree (AST). If it does not exist at the correct position in the source text, a SyntaxError is raised.
Example:
>>> from pypeg2 import *
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = parse("this=something", Key)
>>> k.name
Symbol('this')
>>> k
'something'
str instances and pypeg2.Literal instances are output literally.
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = Key("a value")
>>> k.name = Symbol("give me")
>>> compose(k)
'give me=a value\n'
pyPEG uses Python's re module. You can use Python regular expression objects directly, or use the pypeg2.RegEx encapsulation. Regular expressions are parsed as Terminal Symbols. The matching result is put into the AST. If no match can be achieved, a SyntaxError is raised.
pyPEG predefines the following RegEx objects:
| word | Regular expression for scanning a word. |
| restline | Regular expression for rest of line. |
| whitespace | Regular expression for scanning whitespace. |
| comment_sh | Shell script style comment. |
| comment_cpp | C++ style comment. |
| comment_c | C style comment without nesting. |
| comment_pas | Pascal style comment without nesting. |
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = parse("this=something", Key)
>>> k.name
Symbol('this')
>>> k
'something'
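The behaviour described above (match a regular expression at the current position, keep the matched text, raise SyntaxError otherwise) can be sketched with the standard library alone. This is only an illustration of the idea, not pyPEG's implementation; the helper name scan is hypothetical, and word here is a local stand-in for the predefined pattern.

```python
import re

# Local stand-in for a "word" pattern; an assumption for this sketch.
word = re.compile(r"\w+")

def scan(pattern, text):
    """Match pattern at the start of text and return (matched, rest).

    Raises SyntaxError if the pattern does not match at that position.
    """
    m = pattern.match(text)
    if m is None:
        raise SyntaxError("expecting match on " + pattern.pattern)
    return m.group(0), text[m.end():]

matched, rest = scan(word, "hello world")
```

The matched text is what would end up in the AST; the remainder is what is left for the rest of the grammar.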
For RegEx objects, their corresponding value in the AST is output. If this value does not match the RegEx, a ValueError is raised.
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = Key("a value")
>>> k.name = Symbol("give me")
>>> compose(k)
'give me=a value\n'
A tuple or an instance of pypeg2.Concat specifies that different things have to be parsed one after another. If not all of them parse in sequence, a SyntaxError is raised.
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = parse("this=something", Key)
>>> k.name
Symbol('this')
>>> k
'something'
In a tuple, integers may precede another thing in the tuple. These integers represent a cardinality. For example, to parse a word three times, you can use this grammar:
grammar = word, word, word
or the equivalent:
grammar = 3, word
There are special cardinality values:
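| 0 | thing is optional (see optional()) |
| -1 | thing may occur zero or more times (see maybe_some()) |
| -2 | thing must occur at least once (see some()) |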
The special cardinality values can be generated with the Cardinality Functions. Other negative values are reserved and may not be used.
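To make the cardinality rule concrete, here is a minimal standard-library sketch of a matcher for a flat tuple of compiled patterns, where an integer applies to the pattern that follows it. It is a simplified model under stated assumptions (whitespace is skipped between items; the values 0, -1 and -2 are taken from the descriptions of optional(), maybe_some() and some() in this document). The function name parse_sequence is hypothetical and this is not how pyPEG is implemented.

```python
import re

word = re.compile(r"\w+")  # local stand-in pattern for this sketch

def parse_sequence(text, grammar):
    """Parse text against a flat tuple of compiled patterns.

    An int before a pattern is its cardinality:
    n > 0 means exactly n times, 0 optional, -1 zero or more, -2 one or more.
    """
    result = []
    card = 1
    for item in grammar:
        if isinstance(item, int):
            card = item          # remember cardinality for the next pattern
            continue
        minimum = {0: 0, -1: 0, -2: 1}.get(card, card)
        maximum = {0: 1, -1: None, -2: None}.get(card, card)
        count = 0
        while maximum is None or count < maximum:
            stripped = text.lstrip()
            m = item.match(stripped)
            if m is None:
                break
            text = stripped[m.end():]
            result.append(m.group(0))
            count += 1
        if count < minimum:
            raise SyntaxError("expecting match on " + item.pattern)
        card = 1                 # cardinality applies to one pattern only
    return result

three_a = parse_sequence("one two three", (word, word, word))
three_b = parse_sequence("one two three", (3, word))
```

Both grammars yield the same list, which is the equivalence stated above.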
For tuple instances and instances of pypeg2.Concat, all attributes of the corresponding thing (and elements of the corresponding collection, if that applies) in the AST are composed, and the results are concatenated.
Example:
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = Key("a value")
>>> k.name = Symbol("give me")
>>> compose(k)
'give me=a value\n'
A list instance which is not derived from pypeg2.Concat represents different options. They are tested in sequence. The first option which parses is chosen; the others are not tested any more. If none matches, a SyntaxError is raised.
Example:
>>> import re
>>> number = re.compile(r"\d+")
>>> parse("hello", [number, word])
'hello'
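Because the first option that parses wins, the order of the options matters. The following standard-library sketch illustrates that ordered-choice rule; the helper first_match is hypothetical and only models the behaviour described above, it is not part of pyPEG.

```python
import re

number = re.compile(r"\d+")
word = re.compile(r"\w+")

def first_match(text, options):
    """Try each option in order; return the text matched by the first hit."""
    for option in options:
        m = option.match(text)
        if m is not None:
            return m.group(0)      # later options are not tested any more
    raise SyntaxError("no option matches " + repr(text))

digits_first = first_match("2nd", [number, word])
word_first = first_match("2nd", [word, number])
```

With number listed first only the leading digit is consumed; with word listed first the whole token is consumed.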
The elements of the list are tried in sequence until one of them can be composed. If none can, a ValueError is raised.
Example:
>>> letters = re.compile(r"[a-zA-Z]")
>>> number = re.compile(r"\d+")
>>> compose(23, [letters, number])
'23'
None parses to nothing, and it composes to nothing. It represents the no-operation value.
Symbol(str)
Used to scan a Symbol.
If you put a Symbol somewhere in your grammar, then Symbol.regex is used to scan while parsing. The result will be a Symbol instance. Optionally, it is possible to check that a Symbol instance is not identical to any Keyword instance. This can be helpful if the source language forbids that.
A class derived from Symbol may only have an Enum as its grammar. Other values for its grammar are forbidden and will raise a TypeError. If such an Enum is specified, each parsed value is checked for membership in this Enum, in addition to the RegEx matching.
| regular expression to scan, default |
| flag if a |
| name of the |
__init__(self, name, namespace=None)
Construct a Symbol with that name in namespace.
| if |
| if |
Parsing a Symbol is done by scanning with Symbol.regex. In our example we're using the name() function, which is often used to parse a Symbol. name() is equivalent to attr("name", Symbol).
Example:
>>> Symbol.regex = re.compile(r"[\w\s]+")
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = parse("this one=foo bar", Key)
>>> k.name
Symbol('this one')
>>> k
'foo bar'
Composing a Symbol is done by converting it to text.
Example:
>>> k.name = Symbol("that one")
>>> compose(k)
'that one=foo bar\n'
Keyword(Symbol)
Used to access the keyword table.
The Keyword class is meant to be instantiated for each Keyword of the source language. The class holds the keyword table as a Namespace instance. K is an abbreviation for Keyword, which is useful for instantiating keywords.
| regular expression to scan; default |
|
|
| name of the |
__init__(self, keyword)
Adds keyword to the keyword table.
When a Keyword instance is parsed, it is removed and nothing is put into the resulting AST. When a Keyword class is parsed, an instance is created and put into the AST.
Example:
>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
...
>>> k = parse("long", Type)
>>> k.name
'long'
When a Keyword instance is in a grammar, it is converted into a str instance, and the resulting text is added to the result. When a Keyword class is in the grammar, the corresponding instance in the AST is converted into a str instance and added to the result.
Example:
>>> k = K("do")
>>> compose(k)
'do'
List(list)
A List of things.
A List is a collection for parsed things. It can be used as a base class for collections in the grammar. If a List class has no class variable grammar, grammar = csl(Symbol) is assumed.
__init__(self, L=[], **kwargs)
Construct a List, and construct its attributes from keyword arguments.
A List is parsed by following its grammar. If a List is parsed, then all things which are parsed and which are not attributes are appended to the List.
Example:
>>> class Instruction(str): pass
...
>>> class Block(List):
...     grammar = "{", maybe_some(Instruction), "}"
...
>>> b = parse("{ hello world }", Block)
>>> b[0]
'hello'
>>> b[1]
'world'
>>>
If a List is composed, then its grammar is followed and composed.
Example:
>>> class Instruction(str): pass
...
>>> class Block(List):
...     grammar = "{", blank, csl(Instruction), blank, "}"
...
>>> b = Block()
>>> b.append(Instruction("hello"))
>>> b.append(Instruction("world"))
>>> compose(b)
'{ hello, world }'
Namespace(_UserDict)
A dictionary of things, indexed by their name.
A Namespace holds an OrderedDict mapping the name attributes of the collected things to their respective representation instance. Unnamed things cannot be collected with a Namespace.
__init__(self, *args, **kwargs)
Initialize an OrderedDict containing the data of the Namespace. Arguments are put into the Namespace, keyword arguments give the attributes of the Namespace.
A Namespace is parsed by following its grammar. If a Namespace is parsed, then all things which are parsed and which are not attributes are added to the Namespace and indexed by their name attribute.
Example:
>>> Symbol.regex = re.compile(r"[\w\s]+")
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> class Section(Namespace):
...     grammar = "[", name(), "]", endl, maybe_some(Key)
...
>>> class IniFile(Namespace):
...     grammar = some(Section)
...
>>> ini_file_text = """[Number 1]
... this=something
... that=something else
... [Number 2]
... once=anything
... twice=goes
... """
>>> ini_file = parse(ini_file_text, IniFile)
>>> ini_file["Number 2"]["once"]
'anything'
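The "indexed by their name attribute" idea can be illustrated with a plain OrderedDict from the standard library. This is only a sketch of the data structure; the class Named and the function collect are hypothetical helpers, not part of pyPEG.

```python
from collections import OrderedDict

class Named(str):
    """Hypothetical named thing: a string value carrying a name attribute."""
    def __new__(cls, value, name):
        obj = super().__new__(cls, value)
        obj.name = name
        return obj

def collect(things):
    """Index things by their name attribute, preserving insertion order."""
    namespace = OrderedDict()
    for thing in things:
        namespace[thing.name] = thing
    return namespace

section = collect([Named("something", "this"),
                   Named("something else", "that")])
```

Lookup then works by name, as in the ini_file example above, and iteration returns the names in the order in which the things were collected.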
If a Namespace is composed, then its grammar is followed and composed.
Example:
>>> ini_file["Number 1"]["that"] = Key("new one")
>>> ini_file["Number 3"] = Section()
>>> print(compose(ini_file))
[Number 1]
this=something
that=new one
[Number 2]
once=anything
twice=goes
[Number 3]
Enum(Namespace)
A Namespace which is treated as an Enum. Enums can only contain Keyword or Symbol instances. An Enum cannot be modified after creation. An Enum is allowed only as the grammar of a Symbol.
__init__(self, *things)
Construct an Enum using a tuple of things.
An Enum is parsed as a selection of possible values for a Symbol. If a value is parsed which is not a member of the Enum, a SyntaxError is raised.
Example:
>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
...
>>> parse("int", Type)
Type('int')
>>> parse("string", Type)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pypeg2/__init__.py", line 382, in parse
    t, r = parser.parse(text, thing)
  File "pypeg2/__init__.py", line 469, in parse
    raise r
  File "<string>", line 1
    string
    ^
SyntaxError: 'string' is not a member of Enum([Keyword('int'),
Keyword('long')])
>>>
When a Symbol which has an Enum as its grammar is composed, the composed value is checked for membership in the Enum. If it is not a member, a ValueError is raised.
Example:
>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
...
>>> t = Type("int")
>>> compose(t)
'int'
>>> t = Type("string")
>>> compose(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pypeg2/__init__.py", line 403, in compose
    return parser.compose(thing, grammar)
  File "pypeg2/__init__.py", line 819, in compose
    raise ValueError(repr(thing) + " is not in " + repr(grammar))
ValueError: Type('string') is not in Enum([Keyword('int'),
Keyword('long')])
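The membership check performed while composing can be pictured with a small standard-library model. The names EnumSketch and checked are hypothetical and the class is a deliberately simplified stand-in; it does not reproduce pyPEG's Enum.

```python
class EnumSketch:
    """Hypothetical, simplified stand-in for a fixed set of allowed names."""

    def __init__(self, *things):
        # Stored as a tuple so the members cannot be changed afterwards.
        self._members = tuple(things)

    def __contains__(self, value):
        return value in self._members

    def __repr__(self):
        return "EnumSketch(" + repr(list(self._members)) + ")"

def checked(value, enum):
    """Return value if it is a member of enum, otherwise raise ValueError."""
    if value not in enum:
        raise ValueError(repr(value) + " is not in " + repr(enum))
    return value

types = EnumSketch("int", "long")
ok = checked("int", types)
```

Calling checked("string", types) raises a ValueError, mirroring the traceback above.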
Grammar generator functions generate a piece of a grammar. They're meant to be used in a grammar directly.
some(*thing)
At least one occurrence of thing, + operator. Inserts -2 as cardinality before thing.
Parsing some() parses at least one occurrence of thing, or as many as there are. If there are none, a SyntaxError is raised.
Example:
>>> w = parse("hello world", some(word))
>>> w
['hello', 'world']
>>> w = parse("", some(word))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pypeg2/__init__.py", line 390, in parse
    t, r = parser.parse(text, thing)
  File "pypeg2/__init__.py", line 477, in parse
    raise r
  File "<string>", line 1
    ^
SyntaxError: expecting match on \w+
Composing some() composes as many things as there are, but at least one. If there is no matching thing, a ValueError is raised.
Example:
>>> class Words(List):
...     grammar = some(word, blank)
...
>>> compose(Words("hello", "world"))
'hello world '
>>> compose(Words())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pypeg2/__init__.py", line 414, in compose
    return parser.compose(thing, grammar)
  File "pypeg2/__init__.py", line 931, in compose
    result = compose_tuple(thing, thing[:], grammar)
  File "pypeg2/__init__.py", line 886, in compose_tuple
    raise ValueError("not enough things to compose")
ValueError: not enough things to compose
>>>
maybe_some(*thing)
No thing or some of them, * operator. Inserts -1 as cardinality before thing.
Parsing maybe_some() parses all occurrences of thing. If there are none, the result is empty.
Example:
>>> parse("hello world", maybe_some(word))
['hello', 'world']
>>> parse("", maybe_some(word))
[]
Composing maybe_some() composes as many things as there are.
Example:
>>> class Words(List):
...     grammar = maybe_some(word, blank)
...
>>> compose(Words("hello", "world"))
'hello world '
>>> compose(Words())
''
optional(*thing)
Thing or no thing, ? operator. Inserts 0 as cardinality before thing.
Parsing optional() parses one occurrence of thing. If there is none, the result is empty.
Example:
>>> parse("hello", optional(word))
['hello']
>>> parse("", optional(word))
[]
>>> number = re.compile(r"[-+]?\d+")
>>> parse("-23 world", (optional(word), number, word))
['-23', 'world']
Composing optional() composes one thing if there is any.
Example:
>>> class OptionalWord(str):
...     grammar = optional(word)
...
>>> compose(OptionalWord("hello"))
'hello'
>>> compose(OptionalWord())
''
csl(*thing, separator=",")
csl(*thing)
Generate a grammar for a simple comma separated list.
csl(Something) generates Something, maybe_some(",", blank, Something)
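The shape "one item, then zero or more of separator plus item" can be sketched with the standard library. The helpers parse_csl and compose_csl below are hypothetical and only model the pattern described above, assuming whitespace around separators is skipped while parsing and a single blank follows each separator while composing.

```python
import re

def parse_csl(text, item=re.compile(r"\w+"), separator=","):
    """Parse 'item (separator item)*' and return the list of items."""
    text = text.lstrip()
    m = item.match(text)
    if m is None:
        raise SyntaxError("expecting match on " + item.pattern)
    items = [m.group(0)]
    text = text[m.end():].lstrip()
    while text.startswith(separator):
        rest = text[len(separator):].lstrip()
        m = item.match(rest)
        if m is None:
            break                      # a trailing separator ends the list
        items.append(m.group(0))
        text = rest[m.end():].lstrip()
    return items

def compose_csl(items, separator=","):
    """Compose items with the separator followed by one blank."""
    return (separator + " ").join(items)

items = parse_csl("hello, world,again")
text = compose_csl(items)
```

Composing normalises the spacing, which matches the '{ hello, world }' output shown for List earlier.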
attr(name, thing=word, subtype=None)
Generate an Attribute with that name, referencing the thing. An Attribute is a namedtuple("Attribute", ("name", "thing")).
| reference to |
An Attribute is parsed following its grammar in thing. The result is not put into another thing directly; instead, the result is added as an attribute to the containing thing.
Example:
>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
...
>>> class Parameter:
...     grammar = attr("typing", Type), blank, name()
...
>>> p = parse("int a", Parameter)
>>> p.typing
Type('int')
An Attribute is composed following its grammar in thing.
Example:
>>> p = Parameter()
>>> p.typing = K("int")
>>> p.name = "x"
>>> compose(p)
'int x'
flag(name, thing=None)
Generate an Attribute with that name which is valued True or False. If no thing is given, Keyword(name) is assumed.
A flag is usually a Keyword which can be there or not. If it is there, the resulting value is True. If it is not there, the resulting value is False.
Example:
>>> class BoolLiteral(Symbol):
...     grammar = Enum( K("True"), K("False") )
...
>>> class Fact:
...     grammar = name(), K("is"), flag("negated", K("not")), \
...         attr("value", BoolLiteral)
...
>>> f1 = parse("a is not True", Fact)
>>> f2 = parse("b is False", Fact)
>>> f1.name
Symbol('a')
>>> f1.value
BoolLiteral('True')
>>> f1.negated
True
>>> f2.negated
False
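The "there or not there" behaviour of a flag can be modelled in a few lines of standard-library Python. The functions parse_flag and compose_flag are hypothetical illustrations under the assumption that the flag is a whole-word keyword; they are not pyPEG's implementation.

```python
import re

def parse_flag(text, keyword):
    """Return (True, rest) if keyword starts text as a whole word,
    otherwise (False, text) with nothing consumed."""
    m = re.match(r"\s*" + re.escape(keyword) + r"\b", text)
    if m is None:
        return False, text
    return True, text[m.end():]

def compose_flag(value, keyword):
    """Emit the keyword only when the flag is set."""
    return keyword if value else ""

negated, rest = parse_flag("not True", "not")
plain, untouched = parse_flag("False", "not")
```

The word-boundary check keeps a longer word such as "nothing" from being mistaken for the keyword.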
If the flag is True, the grammar is composed. If the flag is False, nothing is composed.
Example:
>>> class ValidSign:
...     grammar = flag("invalid", K("not")), blank, "valid"
...
>>> v = ValidSign()
>>> v.invalid = True
>>> compose(v)
'not valid'
name()
Generate a grammar for a Symbol with a name. This is a shortcut for attr("name", Symbol).
ignore(*grammar)
Ignore what matches the grammar.
While parsing, ignore() parses what is to be ignored. The result is added to an attribute named "_ignore" + str(i), with i as a serial number.
While composing, the result is composed as with any attr().
indent(*thing)
Indent thing by one level.
The indent function has no meaning while parsing. The parameters are parsed as if they were in a tuple.
While composing, the indent function increases the level of indentation.
Example:
>>> class Instruction(str):
...     grammar = word, ";", endl
...
>>> class Block(List):
...     grammar = "{", endl, maybe_some(indent(Instruction)), "}"
...
>>> print(compose(Block(Instruction("first"), \
...     Instruction("second"))))
{
    first;
    second;
}
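The effect of an indentation level on composed output can be sketched without pyPEG. The function compose_block is a hypothetical helper, and the four-space indent width is an assumption of this sketch.

```python
def compose_block(instructions, level=0, indent="    "):
    """Compose a brace-delimited block whose body is one level deeper.

    level is the indentation level of the braces themselves.
    """
    lines = [indent * level + "{"]
    for instruction in instructions:
        # Body lines are written one level deeper than the braces.
        lines.append(indent * (level + 1) + instruction + ";")
    lines.append(indent * level + "}")
    return "\n".join(lines)

text = compose_block(["first", "second"])
nested = compose_block(["inner"], level=1)
```

Raising level by one shifts the whole block to the right, which is what nesting one indent inside another amounts to.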
contiguous(*thing)
Temporarily disable automated whitespace removal while parsing thing.
While parsing, whitespace removal is disabled. That means that if whitespace is not part of the grammar, whitespace found between the parsed objects leads to a SyntaxError.
Example:
class Path(List):
    grammar = flag("relative", "."), maybe_some(Symbol, ".")
class Reference(GrammarElement):
    grammar = contiguous(attr("path", Path), name())
While composing, the contiguous function has no effect.
separated(*thing)
Temporarily enable automated whitespace removal while parsing thing. Whitespace removal is enabled by default. This function is for temporarily re-enabling whitespace removal after it was disabled with the contiguous function.
While parsing, whitespace removal is enabled again. That means that if whitespace is not part of the grammar, whitespace found between parsed objects is omitted.
While composing, the separated function has no effect.
omit(*thing)
Omit what matches the grammar. This function cuts out thing and throws it away.
While parsing, omit() cuts out what matches the grammar thing and throws it away.
Example:
>>> p = parse("hello", omit(Symbol))
>>> print(p)
None
While composing, omit() does not compose text for what matches the grammar thing.
Example:
>>> compose(Symbol('hello'), omit(Symbol))
''
Callback functions are called while composing only. They're ignored while parsing.
blank(thing, parser)
Space marker for composing text.
blank outputs a space character (ASCII 32) when called.
endl(thing, parser)
End of line marker for composing text.
endl outputs a linefeed character (ASCII 10) when called. The indentation system reacts when reading endl while composing.
callback_function(thing, parser)
Arbitrary callback functions can be defined and put into the grammar. They will be called while composing.
Example:
>>> class Instruction(str):
...     def heading(self, parser):
...         return "/* on level " + str(parser.indention_level) \
...             + " */", endl
...     grammar = heading, word, ";", endl
...
>>> print(compose(Instruction("do_this")))
/* on level 0 */
do_this;
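The rule that callbacks run while composing and are ignored while parsing can be modelled with the standard library. Everything in this sketch (compose_sketch, parse_view, and the simplified blank, endl and heading functions that return plain strings) is a hypothetical stand-in and much simpler than pyPEG's real composing machinery.

```python
def blank(thing, parser):
    return " "

def endl(thing, parser):
    return "\n"

def heading(thing, parser):
    return "/* composing " + repr(thing) + " */\n"

def compose_sketch(thing, grammar, parser=None):
    """Walk a flat grammar tuple: call callbacks, copy string literals."""
    parts = []
    for item in grammar:
        if callable(item):
            parts.append(item(thing, parser))   # callback: called on compose
        elif isinstance(item, str):
            parts.append(item)                  # literal: output as-is
    return "".join(parts)

def parse_view(grammar):
    """What a parser would consider: callbacks are skipped while parsing."""
    return tuple(item for item in grammar if not callable(item))

grammar = (heading, "do_this", ";", endl)
text = compose_sketch("do_this", grammar)
```

The same grammar tuple therefore serves both directions: composing uses every element, while parsing only sees the non-callback elements.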
If one of the following methods is present in a grammar element, it overrides the standard behaviour.
parse(cls, parser, text, pos)
Overrides the parsing behaviour. If present, this class method is called, instead of automatic parsing, at each place where the grammar references the grammar element.
| class object of the grammar element |
| parser object which is calling |
| text to be parsed |
|
|
compose(cls, parser)
Overrides the composing behaviour. If present, this class method is called, instead of automatic composing, at each place where the grammar references the grammar element.
| class object of the grammar element |
| parser object which is calling |