Pseudocode Parser

The vc2_pseudocode_parser.parser module contains a parser and associated Abstract Syntax Tree (AST) representation for the pseudocode language used within the VC-2 specifications [VC2].

A quick-start example illustrating much of the pseudocode syntax and basic usage of this module is given below:

>>> from vc2_pseudocode_parser.parser import parse

>>> source = '''
...     some_function(arg1, arg2, arg3):
...         # Assignments
...         hex_foo = 0xF00
...         sum = arg1 + arg2 + arg3
...
...         # If statements
...         if (sum > 0):
...             sign = 1
...         else if (sum < 0):
...             sign = -1
...         else:
...             sign = 0
...
...         # For-each: loop over fixed set of values
...         sum2 = 0
...         for each value in arg1, arg2, arg3:
...             sum2 += value
...
...         # For: loop over range of integers
...         sum_1_to_100 = 0
...         for n = 1 to 100:
...             sum_1_to_100 += n
...
...         # While loop
...         count = 0
...         while (sum > 0):
...             sum //= 2
...             count += 1
...
...         # Maps (like Python's dictionaries
...         map = {}
...
...         # Maps subscripted with labels
...         map[foo] = 123
...         map[bar] = 321
...
...         # Labels are first-class values (and are defined by their first use)
...         label = baz
...         map[label] = 999
...
...         # Function calls
...         foo(map)
...
...         # Return from functions
...         return count
... '''

>>> ast = parse(source)

>>> ast.functions[0].name
'some_function'

>>> assignment = ast.functions[0].body[0]
>>> assignment.variable
Variable(offset=68, name='hex_foo')
>>> assignment.value
NumberExpr(offset=78, offset_end=83, value=3840, display_base=16, display_digits=3)

Parser

A pseudocode snippet may be parsed into an Abstract Syntax Tree (AST) using the parse() function:

parse(string)

Parse a pseudocode listing into an abstract syntax tree (Listing).

May raise a ParseError or ASTConstructionError exception on failure.

Parsing failures will result of one of the exceptions below being raised. In all cases, the str representation of these errors produces a user-friendly description of the problem.

For example:

>>> from vc2_pseudocode_parser.parser import ParseError, ASTConstructionError

>>> try:
...     ast = parse("foo(): return (a + 3")
... except (ParseError, ASTConstructionError) as e:
...     print(str(e))
At line 1 column 21:
    foo(): return (a + 3
                        ^
Expected ')' or <operator>
exception ParseError

Re-exported from peggie.ParseError.

exception ASTConstructionError(line, column, snippet)

Exceptions thrown during construction of an AST.

exception LabelUsedAsVariableNameError(line, column, snippet, variable_name)

Bases: vc2_pseudocode_parser.parser.ast.ASTConstructionError

Thrown when a name previously used as a label is assigned to like a variable.

exception CannotSubscriptLabelError(line, column, snippet, label_name)

Bases: vc2_pseudocode_parser.parser.ast.ASTConstructionError

Thrown when name which has previously used as a label is subscripted like a variable.

Abstract Syntax Tree (AST)

The parser generates a fairly detailed AST containing both semantic and some non-semantic features of the source such as explicit uses of parentheses, comments and vertical whitespace. Nodes in the AST also include character indices of the corresponding source code enabling the construction of helpful error messages.

Every node in the AST is a subclass of the ASTNode base class:

class ASTNode(offset: int, offset_end: int)
offset: int

Index of first character in the source related to this node.

offset_end: int

Index of the character after the final character related to this node.

AST Root (Listing)

The root element of a complete AST is Listing:

class Listing(functions, leading_empty_lines=<factory>)

The root of a pseudocode AST.

functions: List[vc2_pseudocode_parser.parser.ast.Function]

List of Function trees for each function defined in the tree.

leading_empty_lines: List[vc2_pseudocode_parser.parser.ast.EmptyLine]

List of EmptyLine trees relating to empty (or comment-only) lines at the start of the source listing.

This in turn is principally made up of a list of Function nodes:

class Function(offset, name, arguments, body, eol=None)

A function definition:

<name>(<arguments[0]>, <arguments[1]>, <...>): <eol>
    <body>
name: str

The name of the function.

arguments: List[vc2_pseudocode_parser.parser.ast.Variable]

The Variable objects corresponding with the arguments to this function.

body: List[vc2_pseudocode_parser.parser.ast.Stmt]

The list of Stmt which make up this function.

eol: Optional[vc2_pseudocode_parser.parser.ast.EOL] = None

EOL at the end of the function heading.

Statements

AST nodes representing statements in the AST are subclasses of Stmt.

class Stmt(offset, offset_end)

Base-class for all statement AST nodes.

If-else-if-else statements are defined as follows:

class IfElseStmt(if_branches, else_branch=None)

An if statement with an arbitrary number of else if clauses and optional else clause.

A if statement is broken as illustrated by the following example:

if (condition0):       # \ if_branches[0]
    body0              # /
else if (condition1):  # \ if_branches[1]
    body1              # /
else if (condition2):  # \ if_branches[2]
    body2              # /
else:                  # \ else_branch
    body3              # /
if_branches: List[vc2_pseudocode_parser.parser.ast.IfBranch]

The opening if clause followed by any else if clauses are represented as a series of IfBranch.

else_branch: Optional[vc2_pseudocode_parser.parser.ast.ElseBranch] = None

If an else clause is present, a ElseBranch giving its contents.

class IfBranch(offset, condition, body, eol=None)

An if or else if clause in an if-else-if-else statement:

if (<condition>): <eol>
    <body>

Or:

else if (<condition>): <eol>
    <body>
condition: vc2_pseudocode_parser.parser.ast.Expr

The Expr representing the condition for the branch condition.

body: List[vc2_pseudocode_parser.parser.ast.Stmt]

The list of Stmt to execute when the condition is True.

eol: Optional[vc2_pseudocode_parser.parser.ast.EOL] = None

EOL following the if or else if clause heading.

class ElseBranch(offset, body, eol=None)

An else clause in an if-else-if-else statement:

else: <eol>
    <body>
body: List[vc2_pseudocode_parser.parser.ast.Stmt]

The list of Stmt to execute when the else branch is reached.

eol: Optional[vc2_pseudocode_parser.parser.ast.EOL] = None

EOL following the else clause heading.

For-each loops are defined as follows:

class ForEachStmt(offset, variable, values, body, eol=None)

A for each loop:

for each <variable> in <values[0]>, <values[1]>, <...>: <eol>
    <body>
variable: vc2_pseudocode_parser.parser.ast.Variable

The loop Variable.

values: List[vc2_pseudocode_parser.parser.ast.Expr]

The Expr giving the set of values the loop will iterate over.

body: List[vc2_pseudocode_parser.parser.ast.Stmt]

The list of Stmt to execute in each iteration.

eol: Optional[vc2_pseudocode_parser.parser.ast.EOL] = None

EOL following the for each heading.

For loops are defined as follows:

class ForStmt(offset, variable, start, end, body, eol=None)

A for loop:

for <variable> = <start> to <end>: <eol>
    <body>
variable: vc2_pseudocode_parser.parser.ast.Variable

The loop Variable.

start: vc2_pseudocode_parser.parser.ast.Expr

The Expr giving the loop starting value.

end: vc2_pseudocode_parser.parser.ast.Expr

The Expr giving the (inclusive) loop ending value.

body: List[vc2_pseudocode_parser.parser.ast.Stmt]

The list of Stmt to execute in each iteration.

eol: Optional[vc2_pseudocode_parser.parser.ast.EOL] = None

EOL following the for heading.

While loops are defined as follows:

class WhileStmt(offset, condition, body, eol=None)

A while loop:

while (<condition>): <eol>
    <body>
condition: vc2_pseudocode_parser.parser.ast.Expr

The Expr representing the loop condition.

body: List[vc2_pseudocode_parser.parser.ast.Stmt]

The list of Stmt to execute in each iteration.

eol: Optional[vc2_pseudocode_parser.parser.ast.EOL] = None

EOL following the while heading.

Function call statements are defined as follows:

class FunctionCallStmt(call, eol)

A statement which represents a call to a function.

call: vc2_pseudocode_parser.parser.ast.FunctionCallExpr

The FunctionCallExpr which defines the function call.

eol: vc2_pseudocode_parser.parser.ast.EOL

EOL following the function call.

Return statements are defined as follows:

class ReturnStmt(offset, value, eol)

A return statement:

return <value> <eol>
value: vc2_pseudocode_parser.parser.ast.Expr

A Expr giving the value to be returned.

eol: vc2_pseudocode_parser.parser.ast.EOL

EOL following the return statement.

Assignment statements are defined as follows:

class AssignmentStmt(variable, op, value, eol)

A simple or compound assignment statement:

<variable> <op> <value> <eol>  # e.g. x += 1 + 2
variable: Union[vc2_pseudocode_parser.parser.ast.Variable, vc2_pseudocode_parser.parser.ast.Subscript]

The Variable or Subscript being assigned to.

op: vc2_pseudocode_parser.parser.operators.AssignmentOp

The type of assignment being performed (AssignmentOp).

value: vc2_pseudocode_parser.parser.ast.Expr

The Expr giving the value being assigned.

eol: vc2_pseudocode_parser.parser.ast.EOL

EOL following the assignment statement.

The following assignment operators are defined:

class AssignmentOp(value)

An enumeration.

assign = '='
add_assign = '+='
sub_assign = '-='
mul_assign = '*='
idiv_assign = '//='
pow_assign = '**='
and_assign = '&='
xor_assign = '^='
or_assign = '|='
lsh_assign = '<<='
rsh_assign = '>>='

Expressions

AST nodes representing expressions in the AST are subclasses of Expr.

class Expr(offset: int, offset_end: int)

Unary expressions are defined as follows:

class UnaryExpr(offset, op, value)

A unary expression, e.g. -foo.

op: vc2_pseudocode_parser.parser.operators.UnaryOp

The operator (UnaryOp)

value: vc2_pseudocode_parser.parser.ast.Expr

The Expr the operator applies to.

With unary operators enumerated as:

class UnaryOp(value)

An enumeration.

plus = '+'
minus = '-'
bitwise_not = '~'
logical_not = 'not'

Binary expressions are defined as follows:

class BinaryExpr(lhs, op, rhs)

A binary expression, e.g. a + 1.

lhs: vc2_pseudocode_parser.parser.ast.Expr

The Expr on the left-hand side of the expression.

op: vc2_pseudocode_parser.parser.operators.BinaryOp

The operator (BinaryOp)

rhs: vc2_pseudocode_parser.parser.ast.Expr

The Expr on the right-hand side of the expression.

With binary operators enumerated as:

class BinaryOp(value)

An enumeration.

logical_or = 'or'
logical_and = 'and'
eq = '=='
ne = '!='
lt = '<'
le = '<='
gt = '>'
ge = '>='
bitwise_or_ = '|'
bitwise_xor = '^'
bitwise_and_ = '&'
lsh = '<<'
rsh = '>>'
add = '+'
sub = '-'
mul = '*'
idiv = '//'
mod = '%'
pow = '**'

Calls to functions are defined as follows:

class FunctionCallExpr(offset, offset_end, name, arguments)

A call to a function:

<name>(<arguments[0]>, <arguments[1]>, <...>)  # e.g. foo(1, 2, 3)
name: str

The name of the function to be called.

arguments: List[vc2_pseudocode_parser.parser.ast.Expr]

The list of Expr giving the arguments to the call.

Uses of variables and subscripted variables are defined as follows:

class VariableExpr(variable)

A use of a variable or subscripted variable.

variable: Union[vc2_pseudocode_parser.parser.ast.Variable, vc2_pseudocode_parser.parser.ast.Subscript]

The Variable or Subscript used.

Value literals are defined as follows:

class BooleanExpr(offset, value)

A boolean literal, i.e. True and False.

value: bool

The boolean value.

class NumberExpr(offset, offset_end, value, display_base=10, display_digits=1)

A numerical literal integer, e.g. 123 or 0xF00.

value: int

The (parsed) integer value.

display_base: int = 10

The base which the literal was encoded in.

display_digits: int = 1

The number of digits used in the literal, including leading zeros but excluding any prefix (e.g. 0x or 0b).

class EmptyMapExpr(offset, offset_end)

An empty map literal (e.g. {}).

class LabelExpr(label)

A label literal.

label: vc2_pseudocode_parser.parser.ast.Label

The Label used.

For parenthesised expressions, e.g. (1 + 2), the presence of the parentheses is explicitly marked in the AST. While this is not semantically important (since evaluation order is explicit in an AST) it may be helpful in retaining parentheses added for human legibility when translating the pseudocode into other forms.

class ParenExpr(offset, offset_end, value)

A parenthesised expression.

value: vc2_pseudocode_parser.parser.ast.Expr

The parenthesised Expr

Variables and subscripts

A Variable is defined as follows:

class Variable(offset, name)

A use of a variable.

name: str

The name of the variable.

Variables may be subscripted (multiple times) and this is represented by a nesting of Subscript objects:

class Subscript(offset_end, variable, subscript)

A subscripted variable (e.g. x[1]).

variable: Union[vc2_pseudocode_parser.parser.ast.Subscript, vc2_pseudocode_parser.parser.ast.Variable]

The Variable being subscripted (or Subscript in the case of a multiply subscripted variable.

subscript: vc2_pseudocode_parser.parser.ast.Expr

The Expr giving the subscript value.

Labels

Labels are defined as follows:

class Label(offset, name)

A label value.

name: str

The label name.

Note

Labels and variables are syntactically ambiguous in the pseudocode language but are disambiguated in the AST. Names in the pseudocode are deemed to be variables if they correspond with function arguments, loop variables or assignment targets within a function’s namespace. All other names are deemed to be labels.

Comments and vertical whitespace

All comments and vertical whitespace (i.e. blank lines) are captured by the AST. This enables these non-semantic components to be retained in language translations.

class Comment(offset, string)

An end-of-line comment.

string: str

The comment string, including leading ‘#’ but not trailing newline.

class EmptyLine(offset, offset_end, comment=None)

Represents an empty (or comment-only) line in the source.

comment: Optional[vc2_pseudocode_parser.parser.ast.Comment] = None

The Comment on this line, if present.

Simple statements are syntactically terminated by an optional comment followed by a newline and then a number of empty or comment-only lines. These details are captured by the EOL node.

class EOL(offset, offset_end, comment=None, empty_lines=<factory>)

The end-of-line terminator following a statement.

comment: Optional[vc2_pseudocode_parser.parser.ast.Comment] = None

A Comment on the same line as the statement, if present.

empty_lines: List[vc2_pseudocode_parser.parser.ast.EmptyLine]

Any trailing EmptyLine after this statement.

For example, given a snippet as follows:

foo() # Comment 0

bar() # Comment 1
# Comment 2

# Comment 3

baz()  # Comment 4

Here the function call bar() would be captured by a FunctionCallStmt. The FunctionCallStmt.eol would contain an EOL with EOL.comment set to a Comment containing # Comment 1 and EOL.empty_lines the following four lines (and their comments).

Function declarations, if, else if, else, for each, for and while headings are separated from their bodies by a : and optional EOL. When the : is followed by a newline, an EOL object will be given, but for in-line definitions, the EOL will be omitted. For example:

func1():
    foo()

func2(): foo()

Here the Function for func1 will have Function.eol set to an EOL node for func2 it will be None.

Grammar

The pseudocode language grammar can be described by its peggie grammar:

# Grammar for the VC-2 specification pseudocode language
start <- any_ws @=function+ eof

# Function definition
function           <- identifier ws function_arguments ws stmt_block
function_arguments <- "(" ws (identifier ws ("," ws identifier ws)* ","?)?  ws ")"

# A series of statements
stmt_block <- ":" ws single_line_stmt
            / ":" eol @>((@=stmt)+)

# Statements (all end with an eol)
stmt <- if_else_stmt
      / for_each_stmt
      / for_stmt
      / while_stmt
      / function_call_stmt
      / return_stmt
      / assignment_stmt

single_line_stmt <- function_call_stmt
                  / return_stmt
                  / assignment_stmt

function_call_stmt <- function_call eol
if_else_stmt       <- @=("if" ws condition ws stmt_block)
                      @=(@=("else" ws_ "if" ws condition ws stmt_block)*)
                      @=(("else" ws stmt_block)?)
for_each_stmt      <- "for" ws_ "each" ws_ identifier ws_ "in" ws_ for_each_list ws stmt_block
for_stmt           <- "for" ws_ identifier ws "=" ws expr ws_ "to" ws_ expr ws stmt_block
while_stmt         <- "while" ws condition ws stmt_block
assignment_stmt    <- variable ws assignment_op ws expr eol
return_stmt        <- "return" ws_ expr eol

condition      <- "(" ws expr ws ")"
for_each_list  <- expr (ws "," ws expr)*
assignment_op  <- r"(\+|-|\*|//|\*\*|&|\^|\||<<|>>)?="

# Expressions (defined below in ascending order of precidence)
expr <- maybe_log_or_expr
maybe_log_or_expr  <- maybe_log_and_expr (ws_ "or"              ws_ maybe_log_and_expr)*
maybe_log_and_expr <- maybe_log_not_expr (ws_ "and"             ws_ maybe_log_not_expr)*
maybe_log_not_expr <- "not" ws_ maybe_log_not_expr / maybe_cmp_expr
maybe_cmp_expr     <- maybe_or_expr      (ws r"==|!=|<=|>=|<|>" ws maybe_or_expr)*
maybe_or_expr      <- maybe_xor_expr     (ws "|"                ws maybe_xor_expr)*
maybe_xor_expr     <- maybe_and_expr     (ws "^"                ws maybe_and_expr)*
maybe_and_expr     <- maybe_shift_expr   (ws "&"                ws maybe_shift_expr)*
maybe_shift_expr   <- maybe_arith_expr   (ws r"<<|>>"           ws maybe_arith_expr)*
maybe_arith_expr   <- maybe_prod_expr    (ws r"\+|-"            ws maybe_prod_expr)*
maybe_prod_expr    <- maybe_unary_expr   (ws r"\*|//|%"         ws maybe_unary_expr)*
maybe_unary_expr   <- r"\+|-|~" ws maybe_unary_expr / maybe_pow_expr
maybe_pow_expr     <- maybe_paren_expr   (ws "**"               ws maybe_unary_expr)*
maybe_paren_expr   <- "(" ws expr ws ")" / atom

# Atoms
atom <- function_call
      / variable
      / empty_map
      / boolean
      / number

variable  <- identifier (ws subscript)*
subscript <- "[" ws expr ws "]"

function_call           <- identifier ws function_call_arguments
function_call_arguments <- "(" ws (expr ws ("," ws expr ws)* ","?)? ws ")"

# Literals
identifier    <- !reserved_word r"[a-zA-Z_][a-zA-Z0-9_]*"
reserved_word <- r"(if|else|elif|elseif|for|each|foreach|in|to|while|return|[Tt]rue|[Ff]alse|and|or|not)(?![a-zA-Z0-9_])"
boolean       <- ("True" / "False")
number        <- r"(0[bB][01]+)|(0[xX][0-9a-fA-F]+)|([0-9]+)"
empty_map     <- "{" ws "}"

# Whitespace and comments
comment  <- r"#((?![\n\r]).)*(\n|\r\n|\r|(?!.))"
ws       <- h_space?
ws_      <- h_space
any_ws   <- (comment / h_space / v_space)*
eol      <- h_space? (comment / v_space / eof) any_ws
h_space  <- r"[ \t]+"
v_space  <- "\n" / "\r\n" / "\r"
eof      <- !.

Operator precedence and associativity tables

A table of operator precedence and associativities is also provided which may be useful for, for example, producing pretty-printed outputs (with excess parentheses removed).

OPERATOR_PRECEDENCE_TABLE = {operator: int, ...}

BinaryOp and UnaryOp operator precedence scores. Higher scores mean higher precedence.

OPERATOR_ASSOCIATIVITY_TABLE = {operator: associativity, ...}

BinaryOp and UnaryOp operator associativities.

class Associativity(value)

Operator associativity types.

left = 'left'
right = 'right'