pyPEG – a PEG Parser-Interpreter in Python

pyPEG 2.15.3 of Wed May 05 2021 – Copyleft 2009-2021, Volker Birk

Requires Python 3.x or 2.7
Older versions: pyPEG 1.x

Grammar Elements

Caveat: pyPEG 2.x is written for Python 3, so it accepts Unicode strings only. To use it with Python 2.7, either write u'string' instead of 'string' or add the following import (not needed for Python 3):

from __future__ import unicode_literals

The samples in this documentation are written for Python 3, too. To execute them with Python 2.7, you'll need this import:

from __future__ import print_function

pyPEG 2.x supports new-style classes only.

Basic Grammar Elements

str instances and Literal

Parsing

A str instance, like an instance of pypeg2.Literal, is parsed as a Terminal Symbol. It is consumed from the source text and no result is put into the Abstract Syntax Tree (AST). If it does not occur at the expected position in the source text, a SyntaxError is raised.

Example:

>>> class Key(str):
...     grammar = name(), "=", restline, endl
... 
>>> k = parse("this=something", Key)
>>> k.name
Symbol('this')
>>> k
'something'

Composing

str instances and pypeg2.Literal instances are output literally.

Example:

>>> class Key(str):
...     grammar = name(), "=", restline, endl
... 
>>> k = Key("a value")
>>> k.name = Symbol("give me")
>>> compose(k)
'give me=a value\n'

Regular Expressions

Parsing

pyPEG uses Python's re module. You can use Python regular expression objects directly, or use the pypeg2.RegEx encapsulation. Regular Expressions are parsed as Terminal Symbols. The matching result is put into the AST. If no match can be achieved, a SyntaxError is raised.

pyPEG predefines different RegEx objects:

word = re.compile(r"\w+")

Regular expression for scanning a word.

restline = re.compile(r".*")

Regular expression for rest of line.

whitespace = re.compile(r"(?m)\s+")

Regular expression for scanning whitespace.

comment_sh = re.compile(r"\#.*")

Shell script style comment.

comment_cpp = re.compile(r"//.*")

C++ style comment.

comment_c = re.compile(r"(?m)/\*.*?\*/")

C style comment without nesting.

comment_pas = re.compile(r"(?m)\(\*.*?\*\)")

Pascal style comment without nesting.
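
To get a feel for what these predefined expressions match, the patterns can be tried directly with Python's re module. The following is only a sketch: it copies the patterns from the list above instead of importing them from pypeg2, and the helper scan() and the sample strings are made up for illustration.

```python
import re

# Patterns copied from the list above (not imported from pypeg2).
word = re.compile(r"\w+")
restline = re.compile(r".*")
whitespace = re.compile(r"(?m)\s+")
comment_sh = re.compile(r"\#.*")
comment_cpp = re.compile(r"//.*")
comment_c = re.compile(r"(?m)/\*.*?\*/")
comment_pas = re.compile(r"(?m)\(\*.*?\*\)")

def scan(regex, text):
    """Hypothetical helper: return what regex matches at the start of text, or None."""
    m = regex.match(text)
    return m.group(0) if m else None

print(repr(scan(word, "hello world")))             # 'hello'
print(repr(scan(restline, "foo bar\nnext")))       # 'foo bar'
print(repr(scan(whitespace, " \n\t x")))           # ' \n\t '
print(repr(scan(comment_sh, "# note\ncode")))      # '# note'
print(repr(scan(comment_cpp, "// note\ncode")))    # '// note'
print(repr(scan(comment_c, "/* a */ b /* c */")))  # '/* a */'
print(repr(scan(comment_pas, "(* a *) b")))        # '(* a *)'
```

Note that restline and the line comment patterns stop at the end of the line, because "." does not match a newline by default.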

Example:

>>> class Key(str):
...     grammar = name(), "=", restline, endl
... 
>>> k = parse("this=something", Key)
>>> k.name
Symbol('this')
>>> k
'something'

Composing

For RegEx objects, the corresponding value in the AST is output. If this value does not match the RegEx, a ValueError is raised.

Example:

>>> class Key(str):
...     grammar = name(), "=", restline, endl
... 
>>> k = Key("a value")
>>> k.name = Symbol("give me")
>>> compose(k)
'give me=a value\n'

tuple instances and Concat

Parsing

A tuple or an instance of pypeg2.Concat specifies that different things have to be parsed one after another. If they do not all parse in that sequence, a SyntaxError is raised.

Example:

>>> class Key(str):
...     grammar = name(), "=", restline, endl
... 
>>> k = parse("this=something", Key)
>>> k.name
Symbol('this')
>>> k
'something'

In a tuple, an integer may precede another thing. Such an integer represents a cardinality. For example, to parse a word three times, you can write the grammar as:

grammar = word, word, word

or:

grammar = 3, word

which is equivalent. There are special cardinality values:

-2, thing

some(thing); this represents the plus cardinality, +

-1, thing

maybe_some(thing); this represents the asterisk cardinality, *

0, thing

optional(thing); this represents the question mark cardinality, ?

The special cardinality values can be generated with the Cardinality Functions. Other negative values are reserved and may not be used.
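
The cardinality rules above can be modelled with a few lines of plain Python. This is a sketch using only the re module, not pyPEG's implementation; the function name parse_cardinality and the whitespace skipping between matches are assumptions made for the illustration.

```python
import re

word = re.compile(r"\w+")
_ws = re.compile(r"\s*")

def parse_cardinality(text, card, regex):
    """Model of `card, thing` for a single regex thing.

    Returns (remaining_text, list_of_matches) or raises SyntaxError.
    """
    if card < -2:
        raise ValueError("other negative values are reserved")
    if card == -2:       # some(): one or more, +
        minimum, maximum = 1, None
    elif card == -1:     # maybe_some(): zero or more, *
        minimum, maximum = 0, None
    elif card == 0:      # optional(): zero or one, ?
        minimum, maximum = 0, 1
    else:                # exactly card times
        minimum, maximum = card, card
    found = []
    while maximum is None or len(found) < maximum:
        rest = text[_ws.match(text).end():]   # skip leading whitespace
        m = regex.match(rest)
        if not m or not m.group(0):           # no match (or empty match): stop
            break
        found.append(m.group(0))
        text = rest[m.end():]
    if len(found) < minimum:
        raise SyntaxError("expecting match on " + regex.pattern)
    return text, found

print(parse_cardinality("a b c", 3, word))   # ('', ['a', 'b', 'c'])
print(parse_cardinality("a b", -2, word))    # ('', ['a', 'b'])
print(parse_cardinality("", -1, word))       # ('', [])
print(parse_cardinality("a b", 0, word))     # (' b', ['a'])
```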

Composing

For tuple instances and instances of pypeg2.Concat all attributes of the corresponding thing (and elements of the corresponding collection if that applies) in the AST will be composed and the result is concatenated.

Example:

>>> class Key(str):
...     grammar = name(), "=", restline, endl
... 
>>> k = Key("a value")
>>> k.name = Symbol("give me")
>>> compose(k)
'give me=a value\n'

list instances

Parsing

A list instance which is not derived from pypeg2.Concat represents different options. They're tested in their sequence. The first option which parses is chosen, the others are not tested any more. If none matches, a SyntaxError is raised.

Example:

>>> number = re.compile(r"\d+")
>>> parse("hello", [number, word])
'hello'
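
Because the options are tested in sequence and the first one that parses wins, the order of the list matters. The following sketch models this ordered choice with plain re objects; parse_choice is a hypothetical helper for illustration, not part of pypeg2.

```python
import re

number = re.compile(r"\d+")
word = re.compile(r"\w+")

def parse_choice(text, options):
    """Try each regex in order; the first one that matches is chosen."""
    for regex in options:
        m = regex.match(text)
        if m:
            return m.group(0)
    raise SyntaxError("no option matches " + repr(text))

print(parse_choice("hello", [number, word]))  # hello
print(parse_choice("12ab", [number, word]))   # 12   (number is tried first)
print(parse_choice("12ab", [word, number]))   # 12ab (word is tried first)
```

In the last call number is never tested, because word already matches.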

Composing

The elements of the list are tried in sequence until one of them can be composed. If none can, a ValueError is raised.

Example:

>>> letters = re.compile(r"[a-zA-Z]")
>>> number = re.compile(r"\d+")
>>> compose(23, [letters, number])
'23'
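
The composing rule for a list of RegEx objects can be sketched in the same way: the value is converted to text and the first option it matches is used. This is a simplified model, not pyPEG's code; compose_choice is a hypothetical name, and the sketch assumes that a match at the start of the text (re.match) is sufficient, which the description above does not specify.

```python
import re

letters = re.compile(r"[a-zA-Z]")
number = re.compile(r"\d+")

def compose_choice(value, options):
    """Return the text for value if one of the regex options accepts it."""
    text = str(value)
    for regex in options:
        if regex.match(text):   # assumption: a match at the start is enough
            return text
    raise ValueError(repr(value) + " cannot be composed")

print(compose_choice(23, [letters, number]))   # 23
print(compose_choice("x", [letters, number]))  # x
```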

Constant None

None parses to nothing. And it composes to nothing. It represents the no-operation value.

Grammar Element Classes

Class Symbol

Class definition

Symbol(str)

Used to scan a Symbol.

If you're putting a Symbol somewhere in your grammar, then Symbol.regex is used to scan while parsing. The result will be a Symbol instance. Optionally it is possible to check that a Symbol instance will not be identical to any Keyword instance. This can be helpful if the source language forbids that.

A class derived from Symbol can have only an Enum as its grammar. Other values for its grammar are forbidden and raise a TypeError. If such an Enum is specified, each parsed value is checked for membership in this Enum in addition to the RegEx match.

Class variables

regex

regular expression to scan, default re.compile(r"\w+")

check_keywords

flag if a Symbol has to be checked for not being a Keyword; default: False

Instance variables

name

name of the Symbol as str instance

Method __init__(self, name, namespace=None)

Construct a Symbol with that name in namespace.

Raises:

ValueError

if check_keywords is True and name is identical to a Keyword

TypeError

if namespace is given and not an instance of Namespace

Parsing

Parsing a Symbol is done by scanning with Symbol.regex. In our example we're using the name() function, which is often used to parse a Symbol. name() is equivalent to attr("name", Symbol).

Example:

>>> Symbol.regex = re.compile(r"[\w\s]+")
>>> class Key(str):
...     grammar = name(), "=", restline, endl
...
>>> k = parse("this one=foo bar", Key)
>>> k.name
Symbol('this one')
>>> k
'foo bar'

Composing

Composing a Symbol is done by converting it to text.

Example:

>>> k.name = Symbol("that one")
>>> compose(k)
'that one=foo bar\n'

Class Keyword

Class definition

Keyword(Symbol)

Used to access the keyword table.

The Keyword class is meant to be instantiated for each keyword of the source language. The class holds the keyword table as a Namespace instance. K is an abbreviation for Keyword, which is handy when instantiating keywords.

Class variables

regex

regular expression to scan; default re.compile(r"\w+")

table

Namespace with keyword table

Instance variables

name

name of the Keyword as str instance

Method __init__(self, keyword)

Adds keyword to the keyword table.

Parsing

When a Keyword instance is parsed, it is removed and nothing is put into the resulting AST. When a Keyword class is parsed, an instance is created and put into the AST.

Example:

>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
... 
>>> k = parse("long", Type)
>>> k.name
'long'

Composing

When a Keyword instance is in a grammar, it is converted into a str instance, and the resulting text is added to the result. When a Keyword class is in the grammar, the corresponding instance in the AST is converted into a str instance and added to the result.

Example:

>>> k = K("do")
>>> compose(k)
'do'

Class List

Class definition

List(list)

A List of things.

A List is a collection for parsed things. It can be used as a base class for collections in the grammar. If a List class has no class variable grammar, grammar = csl(Symbol) is assumed.

Method __init__(self, L=[], **kwargs)

Construct a List, and construct its attributes from keyword arguments.

Parsing

A List is parsed by following its grammar. If a List is parsed, then all things which are parsed and which are not attributes are appended to the List.

Example:

>>> class Instruction(str): pass
...
>>> class Block(List):
...     grammar = "{", maybe_some(Instruction), "}"
... 
>>> b = parse("{ hello world }", Block)
>>> b[0]
'hello'
>>> b[1]
'world'

Composing

If a List is composed, then its grammar is followed and composed.

Example:

>>> class Instruction(str): pass
... 
>>> class Block(List):
...     grammar = "{", blank, csl(Instruction), blank, "}"
... 
>>> b = Block()
>>> b.append(Instruction("hello"))
>>> b.append(Instruction("world"))
>>> compose(b)
'{ hello, world }'

Class Namespace

Class definition

Namespace(_UserDict)

A dictionary of things, indexed by their name.

A Namespace holds an OrderedDict mapping the name attributes of the collected things to their respective representation instance. Unnamed things cannot be collected with a Namespace.

Method __init__(self, *args, **kwargs)

Initialize an OrderedDict containing the data of the Namespace. Arguments are put into the Namespace, keyword arguments give the attributes of the Namespace.

Parsing

A Namespace is parsed by following its grammar. If a Namespace is parsed, then all things which are parsed and which are not attributes are appended to the Namespace and indexed by their name attribute.

Example:

>>> Symbol.regex = re.compile(r"[\w\s]+")
>>> class Key(str):
...     grammar = name(), "=", restline, endl
... 
>>> class Section(Namespace):
...     grammar = "[", name(), "]", endl, maybe_some(Key)
... 
>>> class IniFile(Namespace):
...     grammar = some(Section)
... 
>>> ini_file_text = """[Number 1]
... this=something
... that=something else
... [Number 2]
... once=anything
... twice=goes
... """
>>> ini_file = parse(ini_file_text, IniFile)
>>> ini_file["Number 2"]["once"]
'anything'

Composing

If a Namespace is composed, then its grammar is followed and composed.

Example:

>>> ini_file["Number 1"]["that"] = Key("new one")
>>> ini_file["Number 3"] = Section()
>>> print(compose(ini_file))
[Number 1]
this=something
that=new one
[Number 2]
once=anything
twice=goes
[Number 3]
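
To illustrate the idea of an ordered, name-indexed collection independently of pyPEG, the INI example can be modelled with nested OrderedDict objects. This is only a sketch of the resulting data structure; parse_ini and compose_ini are hypothetical helpers written with the re module and do not use or reproduce pypeg2's Namespace class.

```python
import re
from collections import OrderedDict

section_re = re.compile(r"\[([\w\s]+)\]$")
key_re = re.compile(r"([\w\s]+)=(.*)$")

def parse_ini(text):
    """Collect sections and keys, each indexed by its name, in source order."""
    ini = OrderedDict()
    current = None
    for line in text.splitlines():
        m = section_re.match(line)
        if m:
            current = ini[m.group(1)] = OrderedDict()
            continue
        m = key_re.match(line)
        if m and current is not None:
            current[m.group(1)] = m.group(2)
        elif line.strip():
            raise SyntaxError("unexpected line: " + repr(line))
    return ini

def compose_ini(ini):
    """Write the sections and keys back in the order they were collected."""
    lines = []
    for name, section in ini.items():
        lines.append("[" + name + "]")
        for key, value in section.items():
            lines.append(key + "=" + value)
    return "\n".join(lines) + "\n"

text = "[Number 1]\nthis=something\n[Number 2]\nonce=anything\n"
ini = parse_ini(text)
print(ini["Number 2"]["once"])      # anything
print(compose_ini(ini) == text)     # True
```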

Class Enum

Class definition

Enum(Namespace)

A Namespace which is treated as an Enum. Enums can only contain Keyword or Symbol instances. An Enum cannot be modified after creation. An Enum is allowed as the grammar of a Symbol only.

Method __init__(self, *things)

Construct an Enum using a tuple of things.

Parsing

An Enum is parsed as a selection of possible values for a Symbol. If a value is parsed which is not a member of the Enum, a SyntaxError is raised.

Example:

>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
... 
>>> parse("int", Type)
Type('int')
>>> parse("string", Type)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pypeg2/__init__.py", line 382, in parse
    t, r = parser.parse(text, thing)
  File "pypeg2/__init__.py", line 469, in parse
    raise r
  File "<string>", line 1
    string
    ^
SyntaxError: 'string' is not a member of Enum([Keyword('int'), Keyword('long')])

Composing

When a Symbol is composed which has an Enum as its grammar, the composed value is checked if it is a member of the Enum. If not, a ValueError is raised.

>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
... 
>>> t = Type("int")
>>> compose(t)
'int'
>>> t = Type("string")
>>> compose(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pypeg2/__init__.py", line 403, in compose
    return parser.compose(thing, grammar)
  File "pypeg2/__init__.py", line 819, in compose
    raise ValueError(repr(thing) + " is not in " + repr(grammar))
ValueError: Type('string') is not in Enum([Keyword('int'), Keyword('long')])

Grammar generator functions

Grammar generator functions generate a piece of a grammar. They're meant to be used in a grammar directly.

Function some()

Synopsis

some(*thing)

At least one occurrence of thing, + operator. Inserts -2 as cardinality before thing.

Parsing

Parsing some() parses at least one occurrence of thing, and as many as there are. If there is none, a SyntaxError is raised.

Example:

>>> w = parse("hello world", some(word))
>>> w
['hello', 'world']
>>> w = parse("", some(word))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pypeg2/__init__.py", line 390, in parse
    t, r = parser.parse(text, thing)
  File "pypeg2/__init__.py", line 477, in parse
    raise r
  File "<string>", line 1
    
    ^
SyntaxError: expecting match on \w+

Composing

Composing some() composes as many things as there are, but at least one. If there is no matching thing, a ValueError is raised.

Example:

>>> class Words(List):
...     grammar = some(word, blank)
... 
>>> compose(Words("hello", "world"))
'hello world '
>>> compose(Words())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pypeg2/__init__.py", line 414, in compose
    return parser.compose(thing, grammar)
  File "pypeg2/__init__.py", line 931, in compose
    result = compose_tuple(thing, thing[:], grammar)
  File "pypeg2/__init__.py", line 886, in compose_tuple
    raise ValueError("not enough things to compose")
ValueError: not enough things to compose

Function maybe_some()

Synopsis

maybe_some(*thing)

No thing or some of them, * operator. Inserts -1 as cardinality before thing.

Parsing

Parsing maybe_some() parses all occurrences of thing. If there are none, the result is empty.

Example:

>>> parse("hello world", maybe_some(word))
['hello', 'world']
>>> parse("", maybe_some(word))
[]

Composing

Composing maybe_some() composes as many things as there are.

>>> class Words(List):
...     grammar = maybe_some(word, blank)
... 
>>> compose(Words("hello", "world"))
'hello world '
>>> compose(Words())
''

Function optional()

Synopsis

optional(*thing)

Thing or no thing, ? operator. Inserts 0 as cardinality before thing.

Parsing

Parsing optional() parses one occurrence of thing. If there is none, the result is empty.

Example:

>>> parse("hello", optional(word))
['hello']
>>> parse("", optional(word))
[]
>>> number = re.compile(r"[-+]?\d+")
>>> parse("-23 world", (optional(word), number, word))
['-23', 'world']

Composing

Composing optional() composes one thing if there is any.

Example:

>>> class OptionalWord(str):
...     grammar = optional(word)
... 
>>> compose(OptionalWord("hello"))
'hello'
>>> compose(OptionalWord())
''

Function csl()

Synopsis

Python 3.x:

csl(*thing, separator=",")

Python 2.7:

csl(*thing)

Generate a grammar for a simple comma separated list.

csl(Something) generates Something, maybe_some(",", blank, Something)
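
The expansion above can be modelled in plain Python to show what such a list grammar accepts and produces. This is a sketch with the re module, not pyPEG's code: parse_csl and compose_csl are hypothetical helpers, and the whitespace skipping between elements is an assumption of the model.

```python
import re

word = re.compile(r"\w+")
_ws = re.compile(r"\s*")

def _skip(text):
    """Drop leading whitespace."""
    return text[_ws.match(text).end():]

def parse_csl(text, item, separator=","):
    """Model of: item, maybe_some(separator, item)."""
    text = _skip(text)
    m = item.match(text)
    if not m:
        raise SyntaxError("expecting match on " + item.pattern)
    items, text = [m.group(0)], text[m.end():]
    while True:
        rest = _skip(text)
        if not rest.startswith(separator):
            break
        rest = _skip(rest[len(separator):])
        m = item.match(rest)
        if not m:
            break          # back off: a dangling separator stays unparsed
        items.append(m.group(0))
        text = rest[m.end():]
    return text, items

def compose_csl(items, separator=","):
    """Model of composing: separator and a blank between the items."""
    return (separator + " ").join(items)

print(parse_csl("a, b,c", word))          # ('', ['a', 'b', 'c'])
print(compose_csl(["hello", "world"]))    # hello, world
```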

Function attr()

Synopsis

attr(name, thing=word, subtype=None)

Generate an Attribute with that name, referencing the thing. An Attribute is a namedtuple("Attribute", ("name", "thing")).

Instance variables

Class

reference to Attribute class generated by namedtuple()

Parsing

An Attribute is parsed following its grammar in thing. The result is not put into another thing directly; instead the result is added as an attribute to the containing thing.

Example:

>>> class Type(Keyword):
...     grammar = Enum( K("int"), K("long") )
... 
>>> class Parameter:
...     grammar = attr("typing", Type), blank, name()
... 
>>> p = parse("int a", Parameter)
>>> p.typing
Type('int')

Composing

An Attribute is composed following its grammar in thing.

Example:

>>> p = Parameter()
>>> p.typing = K("int")
>>> p.name = "x"
>>> compose(p)
'int x'

Function flag()

Synopsis

flag(name, thing=None)

Generate an Attribute with that name which is valued True or False. If no thing is given, Keyword(name) is assumed.

Parsing

A flag is usually a Keyword which can be there or not. If it is there, the resulting value is True. If it is not there, the resulting value is False.

Example:

>>> class BoolLiteral(Symbol):
...     grammar = Enum( K("True"), K("False") )
... 
>>> class Fact:
...     grammar = name(), K("is"), flag("negated", K("not")), \
...             attr("value", BoolLiteral)
... 
>>> f1 = parse("a is not True", Fact)
>>> f2 = parse("b is False", Fact)
>>> f1.name
Symbol('a')
>>> f1.value
BoolLiteral('True')
>>> f1.negated
True
>>> f2.negated
False

Composing

If the flag is True, the grammar is composed. If the flag is False, nothing is composed.

Example:

>>> class ValidSign:
...     grammar = flag("invalid", K("not")), blank, "valid"
... 
>>> v = ValidSign()
>>> v.invalid = True
>>> compose(v)
'not valid'
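
The behaviour of a flag in both directions can be summarised in a small model. This is a sketch in plain Python and not pyPEG's implementation; parse_flag and compose_flag are hypothetical names, and the keyword is matched here as a whole word using a regular expression.

```python
import re

def parse_flag(text, keyword):
    """Return (remaining_text, True) if keyword comes next, else (text, False)."""
    m = re.match(r"\s*" + re.escape(keyword) + r"\b", text)
    if m:
        return text[m.end():], True
    return text, False

def compose_flag(value, keyword):
    """Emit the keyword only when the flag is set."""
    return keyword if value else ""

print(parse_flag(" not True", "not"))   # (' True', True)
print(parse_flag(" False", "not"))      # (' False', False)
print(repr(compose_flag(True, "not")))  # 'not'
print(repr(compose_flag(False, "not"))) # ''
```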

Function name()

Synopsis

name()

Generate a grammar for a Symbol with a name. This is a shortcut for attr("name", Symbol).

Function ignore()

Synopsis

ignore(*grammar)

Ignore what matches to the grammar.

Parsing

Parse what's to be ignored. The result is added to an attribute named "_ignore" + str(i) with i as a serial number.

Composing

Compose the result as with any attr().

Function indent()

Synopsis

indent(*thing)

Indent thing by one level.

Parsing

The indent function has no meaning while parsing. The parameters are parsed as if they were in a tuple.

Composing

While composing the indent function increases the level of indention.

Example:

>>> class Instruction(str):
...     grammar = word, ";", endl
... 
>>> class Block(List):
...     grammar = "{", endl, maybe_some(indent(Instruction)), "}"
... 
>>> print(compose(Block(Instruction("first"), \
...         Instruction("second"))))
{
    first;
    second;
}
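
How an indention level can turn into leading spaces is easiest to see in a small model. The following sketch is not pyPEG's composer; the class Composer, its attributes and the width of four spaces per level are assumptions chosen to reproduce the output shown above. Indentation is written at the start of each new line, that is, after an end of line marker has been emitted.

```python
class Composer:
    """Hypothetical mini composer that tracks an indention level."""

    def __init__(self, indent="    "):
        self.indent = indent
        self.level = 0
        self.parts = []
        self.at_line_start = True

    def write(self, text):
        # Indentation is only inserted in front of the first text of a line.
        if self.at_line_start and text:
            self.parts.append(self.indent * self.level)
            self.at_line_start = False
        self.parts.append(text)

    def endl(self):
        self.parts.append("\n")
        self.at_line_start = True

c = Composer()
c.write("{")
c.endl()
c.level += 1                  # what indent(...) does in this model
for instruction in ("first", "second"):
    c.write(instruction)
    c.write(";")
    c.endl()
c.level -= 1
c.write("}")
result = "".join(c.parts)
print(result)
```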

Function contiguous()

Synopsis

contiguous(*thing)

Temporarily disable automatic whitespace removal while parsing thing.

Parsing

While parsing, whitespace removal is disabled. If whitespace occurs between the parsed objects and is not part of the grammar, a SyntaxError is raised.

Example:

class Path(List):
    grammar = flag("relative", "."), maybe_some(Symbol, ".")

class Reference(GrammarElement):
    grammar = contiguous(attr("path", Path), name())
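
The difference between parsing with and without whitespace removal can be shown with a small stand-alone model. This sketch uses only the re module and is not pyPEG's parser; parse_sequence and its skip_whitespace parameter are hypothetical and stand in for the default mode and for contiguous() respectively.

```python
import re

symbol = re.compile(r"\w+")

def parse_sequence(text, parts, skip_whitespace=True):
    """Match parts (regexes or literal strings) one after another.

    Literal strings are consumed without producing a result.
    """
    result = []
    for part in parts:
        if skip_whitespace:
            text = text.lstrip()
        if isinstance(part, str):
            if not text.startswith(part):
                raise SyntaxError("expecting " + repr(part))
            text = text[len(part):]
        else:
            m = part.match(text)
            if not m:
                raise SyntaxError("expecting match on " + part.pattern)
            result.append(m.group(0))
            text = text[m.end():]
    return text, result

dotted = [symbol, ".", symbol]
print(parse_sequence("a . b", dotted))                       # ('', ['a', 'b'])
print(parse_sequence("a.b", dotted, skip_whitespace=False))  # ('', ['a', 'b'])
try:
    parse_sequence("a . b", dotted, skip_whitespace=False)
except SyntaxError as e:
    print("SyntaxError:", e)   # whitespace is not part of the grammar
```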

Composing

While composing the contiguous function has no effect.

Function separated()

Synopsis

separated(*thing)

Temporarily enable automatic whitespace removal while parsing thing. Whitespace removal is enabled by default; this function re-enables it after it was disabled with the contiguous function.

Parsing

While parsing, whitespace removal is enabled again. Whitespace found between parsed objects is skipped unless it is part of the grammar.

Composing

While composing the separated function has no effect.

Function omit()

Synopsis

omit(*thing)

Omit what matches the grammar. This function cuts out thing and throws it away.

Parsing

While parsing, omit() cuts out what matches the grammar thing and throws it away.

Example:

>>> p = parse("hello", omit(Symbol))
>>> print(p)
None

Composing

While composing, omit() does not compose text for what matches the grammar thing.

Example:

>>> compose(Symbol('hello'), omit(Symbol))
''

Callback functions

Callback functions are called while composing only. They're ignored while parsing.

Callback function blank()

Synopsis

blank(thing, parser)

Space marker for composing text.

blank outputs a space character (ASCII 32) when called.

Callback function endl()

Synopsis

endl(thing, parser)

End of line marker for composing text.

endl outputs a linefeed character (ASCII 10) when called. The indention system reacts when reading endl while composing.

User defined callback functions

Synopsis

callback_function(thing, parser)

Arbitrary callback functions can be defined and put into the grammar. They will be called while composing.

Example:

>>> class Instruction(str):
...     def heading(self, parser):
...         return "/* on level " + str(parser.indention_level) \
...                 + " */", endl
...     grammar = heading, word, ";", endl
... 
>>> print(compose(Instruction("do_this")))
/* on level 0 */
do_this;

Common class methods for grammar elements

If one of the following methods is present in a grammar element, it overrides the standard behaviour.

parse() class method of a grammar element

Synopsis

parse(cls, parser, text, pos)

Overrides the parsing behaviour. If present, this class method is called, instead of automatic parsing, at each place where the grammar references the grammar element.

cls

class object of the grammar element

parser

parser object which is calling

text

text to be parsed

pos

(lineNo, charInText) with positioning information

compose() method of a grammar element

Synopsis

compose(cls, parser)

Overrides the composing behaviour. If present, this class method is called, instead of automatic composing, at each place where the grammar references the grammar element.

cls

class object of the grammar element

parser

parser object which is calling
