vc2_conformance.symbol_re: Regular expressions for VC-2 sequences

The vc2_conformance.symbol_re module contains a regular expression matching system for sequences of data unit types in VC-2 bitstreams.

This module has two main applications: checking the order of data units during validation and generating valid sequences during bitstream generation.

During bitstream validation (see vc2_conformance.decoder) the validator must check that sequences contain the right pattern of data unit types. For example, some levels might require bitstreams to include a sequence header between each picture while others may require fragmented and non-fragmented picture types are not used in the same stream. These rules are described by regular expressions which are evaluated by this module.

During bitstream generation, the encoder (see vc2_conformance.encoder) must produce sequences conforming to the above patterns. To facilitate this, this module can also generate sequence orderings which fulfil the rules described by regular expressions.

Regular expressions of symbols

Unlike string-matching regular expression libraries (e.g. re) which match sequences of characters, this module is designed to match sequences of ‘symbols’ where each symbol is just a string like "sequence_header" or "high_quality_picture".

As an example, we might write the following regular expression to describe the rule ‘all sequences must start with a sequence header and end with an end of sequence data unit’:

sequence_header .* end_of_sequence

In the regular expression syntax used by this module, whitespace is ignored but otherwise follows the typical form for regular expressions. For example, . is a wildcard which matches any symbol and * means ‘zero or more occurrences of the previous pattern’.

This regular expression would match the following sequence of symbols (data unit type names):

"sequence_header"
"high_quality_picture"
"sequence_header"
"high_quality_picture"
"end_of_sequence"

But would not match the following sequence (because it does not start with a sequence header):

"high_quality_picture"
"sequence_header"
"high_quality_picture"
"end_of_sequence"

Matching VC-2 sequences

The Matcher class implements a regular expression matching state machine, taking the regular expression to be matched as its argument. By convention, we use the parse code names defined in ParseCodes as symbols to represent each type of data unit in a VC-2 sequence.

Data unit parse code

Symbol

0

sequence_header

16

end_of_sequence

32

auxiliary_data

48

padding_data

200

low_delay_picture

232

high_quality_picture

204

low_delay_picture_fragment

236

high_quality_picture_fragment

In this example below we’ll create a Matcher which checks if a sequence consists of alternating sequence headers and high quality picture data units:

>>> from vc2_conformance.symbol_re import Matcher

>>> m = Matcher("(sequence_header high_quality_picture)* end_of_sequence")

This Matcher instance is then used to check if a sequence matches the required pattern by feeding it symbols one at a time via the match_symbol() method. This returns True so long as the sequence matches to expression:

>>> m.match_symbol("sequence_header")
True
>>> m.match_symbol("high_quality_picture")
True
>>> m.match_symbol("sequence_header")
True
>>> m.match_symbol("high_quality_picture")
True

When we reach the end of the sequence, the is_complete() method will check to see if the regular expression has completely matched the sequence (i.e. it isn’t expecting any other data units):

>>> # Missing the required end_of_sequence!
>>> m.is_complete()
False

>>> # Now complete
>>> m.match_symbol("end_of_sequence")
True
>>> m.is_complete()
True

When a non-matching sequence is encountered, the Matcher.valid_next_symbols() method may be used to enumerate which symbols the regular expression matcher was expecting. For example:

>>> # NB: Each Matcher can only be used once so we must create a new one
>>> # for this new sequence!
>>> m = Matcher("(sequence_header high_quality_picture)* end_of_sequence")

>>> m.match_symbol("sequence_header")
True
>>> m.match_symbol("high_quality_picture")
True
>>> m.match_symbol("high_quality_picture")  # Not allowed!
False
>>> m.valid_next_symbols()
{"sequence_header", "end_of_sequence"}

Generating VC-2 sequences

This module provides the make_matching_sequence() function which can generate minimal sequences matching a set of regular expressions. For example, say we wish to generate a sequence containing three pictures matching the regular expression we used in our previous example:

>>> from vc2_conformance.symbol_re import make_matching_sequence

>>> from pprint import pprint
>>> pprint(make_matching_sequence(
...     ["high_quality_picture"]*3,
...     "(sequence_header high_quality_picture)* end_of_sequence",
... ))
['sequence_header',
 'high_quality_picture',
 'sequence_header',
 'high_quality_picture',
 'sequence_header',
 'high_quality_picture',
 'end_of_sequence']

Here, the sequence headers and the final end of sequence data units have been added automatically.

API

class Matcher(pattern)

Test whether a sequence of symbols (alpha-numeric strings with underscores, e.g. "foo" or "bar_123") conforms to a pattern described by a regular expression.

match_symbol() should be called for each symbol in the sequence. If False is returned, the sequence does not match the specified regular expression. valid_next_symbols() may be used to list what symbols would have been allowed at this stage of the sequence. Once the entire sequence has been passed to match_symbol(), is_complete() should be used to check that a complete pattern has been matched.

Note

Each instance of Matcher can only be used to match a single sequence. To match another sequence a new Matcher must be created.

Parameters
patternstr

The regular expression describing the pattern this Matcher should match.

Regular expressions may be specified with a syntax similar (but different to) that used by common string-matching regular expression libraries.

  • An alpha-numeric-with-underscore expression matches a single instance of the specified symbol. For example foo_123 will match the symbol "foo_123".

  • A dot (.) will match any (singular) symbol.

  • A dollar ($) will match only at the end of the sequence.

  • Two expressions (separated by any amount of whitespace) will match the first expression followed by the second. For example . foo will match a sequence "anything", "foo".

  • Two expressions separated by a bar (|) will match either the first expression or the second expression.

  • A question mark (?) suffix to an expression will match zero or one instances of that expression. For example foo? will match an empty sequence or a single "foo".

  • An asterisk (*) suffix to an expression will match zero or more instances of that expression. For example foo* will match an empty sequence, a sequence of a single "foo" symbol or a sequence of many "foo" symbols.

  • A plus (+) suffix to an expression will match one or more instances of that expression. For example foo+ will match a sequence of a single "foo" symbol or a sequence of many "foo" symbols.

  • Parentheses (( and )) may be used to group expressions together into a single logical expression.

The expression suffixes bind tightly to their left-hand expression. Beyond this, consider operator precedence undefined: be explicit to help readability!

match_symbol(symbol)

Attempt to match the next symbol in the sequence.

Returns True if the symbol matched and False otherwise.

If no symbol was matched, the state machine will not be advanced (i.e. you can try again with a different symbol as if nothing happened).

is_complete()

Is it valid for the sequence to terminate at this point?

valid_next_symbols()

Return the set of valid next symbols in the sequence.

If a wildcard is allowed, WILDCARD will be returned as one of the symbols in addition to any concretely allowed symbols.

If it is valid for the sequence to end at this point, END_OF_SEQUENCE will be in the returned set.

make_matching_sequence(initial_sequence, *patterns, **kwargs)

Given a sequence of symbols, returns a new sequence based on this which matches the supplied set of patterns. The new sequence will be a copy of the supplied sequence with additional symbols inserted where necessary.

Parameters
initial_sequence[symbol, …]

The minimal set of entries which must be included in the sequence, in the order they are required to appear.

patternsstr

A series of one or more regular expression specifications (as accepted by Matcher) which the generated sequence must simultaneously satisfy.

depth_limitint

Keyword-only argument specifying the maximum number of consecutive symbols to try inserting before giving up on finding a matching sequence. Defaults to 4.

symbol_priority[symbol, …]

Keyword-only argument. If supplied, orders possible symbols from most to least preferable. Though this function will always return a sequence of the shortest possible length, when several equal-length sequences would be valid, this argument may be used to influence which is returned. Any symbols not appearing in the list will be given the lowest priority, and sorted in alphabetical order. If this argument is not supplied (or is empty), whenever ‘wildcard’ value (.) is required by a regular expression, the WILDCARD sentinel will be inserted into the output sequence rather than a concrete symbol.

Returns
matching_sequence[symbol, …]

The shortest sequence of symbols which satisfies all of the supplied regular expression patterns. This will contain a superset of the sequence in initial_sequence, that is one where additional symbols may have been inserted but none are removed or reordered.

Raises
ImpossibleSequenceError

Thrown if no sequence of symbols could be found which matches all of the supplied patterns.

WILDCARD = '.'

A constant representing a wildcard match.

END_OF_SEQUENCE = ''

A constant representing the end-of-sequence.

exception SymbolRegexSyntaxError

Thrown when a regular expression string is provided which could not be parsed.

exception ImpossibleSequenceError

Thrown when make_matching_sequence() is unable to find a suitable sequence of symbols.

Internals

Beyond the parts exposed by the public API above, this module internally is built on top of two main parts:

  • A parser which parses the regular expression syntax accepted by Matcher into an Abstract Syntax Tree (AST)

  • A Non-deterministic Finite-state Automaton (NFA) representation which is constructed from the AST using Thompson’s constructions.

The parser is broken into two stages: a simple tokenizer/lexer (tokenize_regex()) and a recursive descent parser (parse_expression()). A utility function combining these steps is provided by parse_regex().

tokenize_regex(regex_string)

A generator which tokenizes a sequence regular expression specification into (token_type, token_value, offset) 3-tuples.

Token types are:

  • "string" (value is the string)

  • "modifier" (value is one of ?*+)

  • "wildcard" (value is .)

  • "end_of_sequence" (value is $)

  • "bar" (value is |)

  • "parenthesis" (value is one of ())

Throws a SymbolRegexSyntaxError if an invalid character is encountered.

parse_expression(tokens)

A recursive-descent parser which parses an Abstract Syntax Tree (AST) from the regex specification.

The parsed tokens will be removed from the provided token list. Tokens are consumed right-to-left making implementing tight binding of modifiers (i.e. ‘?’, ‘*’ and ‘+’) easy.

This function will return as soon as it runs out of tokens or reaches an unmatched opening parenthesis.

Returns an AST node: one of Symbol, Star, Concatenation, Union or None.

parse_regex(regex_string)

Parse a sequence regular expression specification into an Abstract Syntax Tree (AST) consisting of None (empty), Symbol, Star Concatenation and Union objects.

The parser outputs an AST constructed from the following elements:

class Symbol(symbol)

Leaf AST node for a symbol.

class Star(expr)

AST node for a Kleene Star pattern (*).

class Concatenation(a, b)

AST node for a concatenation of two expressions.

class Union(a, b)

AST node for a union (|) of two expressions.

An NFA can be constructed from an AST using the NFA.from_ast() class method.

class NFA(start=None, final=None)

A Non-deterministic Finite-state Automaton (NFA) with a labelled ‘start’ and ‘final’ state.

Attributes
startNFANode
finalNFANode
classmethod from_ast(ast)

Convert a regular expression AST node into a new NFA object using Thompson’s constructions.

class NFANode

A node (i.e. state) in a Non-deterministic Finite-state Automaton (NFA).

Attributes
transitions{symbol: set([NFANode, …]), …}

The transition rules from this node.

Empty transitions are listed under the symbol None and are always bidirectional.

add_transition(dest_node, symbol=None)

Add a transition rule from this node to the specified destination.

If no symbols are specified, a (bidirectional) empty transition between the two nodes will be added.

equivalent_nodes()

Iterate over the set of NFANode nodes connected to this one by only empty transitions (includes this node).

follow(symbol)

Iterate over the NFANodes reachable from this node following the given symbol.