vc2_conformance.symbol_re
: Regular expressions for VC-2 sequences¶
The vc2_conformance.symbol_re
module contains a regular expression
matching system for sequences of data unit types in VC-2 bitstreams.
This module has two main applications: checking the order of data units during validation and generating valid sequences during bitstream generation.
During bitstream validation (see vc2_conformance.decoder
) the
validator must check that sequences contain the right pattern of data unit
types. For example, some levels might require bitstreams to include a sequence
header between each picture while others may require fragmented and
non-fragmented picture types are not used in the same stream. These rules are
described by regular expressions which are evaluated by this module.
During bitstream generation, the encoder (see
vc2_conformance.encoder
) must produce sequences conforming to the
above patterns. To facilitate this, this module can also generate sequence
orderings which fulfil the rules described by regular expressions.
Regular expressions of symbols¶
Unlike string-matching regular expression libraries (e.g. re
) which
match sequences of characters, this module is designed to match sequences of
‘symbols’ where each symbol is just a string like "sequence_header"
or
"high_quality_picture"
.
As an example, we might write the following regular expression to describe the rule ‘all sequences must start with a sequence header and end with an end of sequence data unit’:
sequence_header .* end_of_sequence
In the regular expression syntax used by this module, whitespace is ignored but
otherwise follows the typical form for regular expressions. For example, .
is a wildcard which matches any symbol and *
means ‘zero or more
occurrences of the previous pattern’.
This regular expression would match the following sequence of symbols (data unit type names):
"sequence_header"
"high_quality_picture"
"sequence_header"
"high_quality_picture"
"end_of_sequence"
But would not match the following sequence (because it does not start with a sequence header):
"high_quality_picture"
"sequence_header"
"high_quality_picture"
"end_of_sequence"
Matching VC-2 sequences¶
The Matcher
class implements a regular expression matching state
machine, taking the regular expression to be matched as its argument. By
convention, we use the parse code names defined in
ParseCodes
as symbols to represent each type of
data unit in a VC-2 sequence.
Data unit parse code |
Symbol |
---|---|
0 |
sequence_header |
16 |
end_of_sequence |
32 |
auxiliary_data |
48 |
padding_data |
200 |
low_delay_picture |
232 |
high_quality_picture |
204 |
low_delay_picture_fragment |
236 |
high_quality_picture_fragment |
In this example below we’ll create a Matcher
which checks if a
sequence consists of alternating sequence headers and high quality picture data
units:
>>> from vc2_conformance.symbol_re import Matcher
>>> m = Matcher("(sequence_header high_quality_picture)* end_of_sequence")
This Matcher
instance is then used to check if a sequence matches
the required pattern by feeding it symbols one at a time via the
match_symbol()
method. This returns True so long as the
sequence matches to expression:
>>> m.match_symbol("sequence_header")
True
>>> m.match_symbol("high_quality_picture")
True
>>> m.match_symbol("sequence_header")
True
>>> m.match_symbol("high_quality_picture")
True
When we reach the end of the sequence, the is_complete()
method will check to see if the regular expression has completely matched the
sequence (i.e. it isn’t expecting any other data units):
>>> # Missing the required end_of_sequence!
>>> m.is_complete()
False
>>> # Now complete
>>> m.match_symbol("end_of_sequence")
True
>>> m.is_complete()
True
When a non-matching sequence is encountered, the
Matcher.valid_next_symbols()
method may be used to enumerate which
symbols the regular expression matcher was expecting. For example:
>>> # NB: Each Matcher can only be used once so we must create a new one
>>> # for this new sequence!
>>> m = Matcher("(sequence_header high_quality_picture)* end_of_sequence")
>>> m.match_symbol("sequence_header")
True
>>> m.match_symbol("high_quality_picture")
True
>>> m.match_symbol("high_quality_picture") # Not allowed!
False
>>> m.valid_next_symbols()
{"sequence_header", "end_of_sequence"}
Generating VC-2 sequences¶
This module provides the make_matching_sequence()
function which can
generate minimal sequences matching a set of regular expressions. For example,
say we wish to generate a sequence containing three pictures matching the
regular expression we used in our previous example:
>>> from vc2_conformance.symbol_re import make_matching_sequence
>>> from pprint import pprint
>>> pprint(make_matching_sequence(
... ["high_quality_picture"]*3,
... "(sequence_header high_quality_picture)* end_of_sequence",
... ))
['sequence_header',
'high_quality_picture',
'sequence_header',
'high_quality_picture',
'sequence_header',
'high_quality_picture',
'end_of_sequence']
Here, the sequence headers and the final end of sequence data units have been added automatically.
API¶
-
class
Matcher
(pattern)¶ Test whether a sequence of symbols (alpha-numeric strings with underscores, e.g.
"foo"
or"bar_123"
) conforms to a pattern described by a regular expression.match_symbol()
should be called for each symbol in the sequence. IfFalse
is returned, the sequence does not match the specified regular expression.valid_next_symbols()
may be used to list what symbols would have been allowed at this stage of the sequence. Once the entire sequence has been passed tomatch_symbol()
,is_complete()
should be used to check that a complete pattern has been matched.Note
Each instance of
Matcher
can only be used to match a single sequence. To match another sequence a newMatcher
must be created.- Parameters
- patternstr
The regular expression describing the pattern this
Matcher
should match.Regular expressions may be specified with a syntax similar (but different to) that used by common string-matching regular expression libraries.
An alpha-numeric-with-underscore expression matches a single instance of the specified symbol. For example
foo_123
will match the symbol"foo_123"
.A dot (
.
) will match any (singular) symbol.A dollar (
$
) will match only at the end of the sequence.Two expressions (separated by any amount of whitespace) will match the first expression followed by the second. For example
. foo
will match a sequence"anything"
,"foo"
.Two expressions separated by a bar (
|
) will match either the first expression or the second expression.A question mark (
?
) suffix to an expression will match zero or one instances of that expression. For examplefoo?
will match an empty sequence or a single"foo"
.An asterisk (
*
) suffix to an expression will match zero or more instances of that expression. For examplefoo*
will match an empty sequence, a sequence of a single"foo"
symbol or a sequence of many"foo"
symbols.A plus (
+
) suffix to an expression will match one or more instances of that expression. For examplefoo+
will match a sequence of a single"foo"
symbol or a sequence of many"foo"
symbols.Parentheses (
(
and)
) may be used to group expressions together into a single logical expression.
The expression suffixes bind tightly to their left-hand expression. Beyond this, consider operator precedence undefined: be explicit to help readability!
-
match_symbol
(symbol)¶ Attempt to match the next symbol in the sequence.
Returns True if the symbol matched and False otherwise.
If no symbol was matched, the state machine will not be advanced (i.e. you can try again with a different symbol as if nothing happened).
-
is_complete
()¶ Is it valid for the sequence to terminate at this point?
-
valid_next_symbols
()¶ Return the
set
of valid next symbols in the sequence.If a wildcard is allowed,
WILDCARD
will be returned as one of the symbols in addition to any concretely allowed symbols.If it is valid for the sequence to end at this point,
END_OF_SEQUENCE
will be in the returned set.
-
make_matching_sequence
(initial_sequence, *patterns, **kwargs)¶ Given a sequence of symbols, returns a new sequence based on this which matches the supplied set of patterns. The new sequence will be a copy of the supplied sequence with additional symbols inserted where necessary.
- Parameters
- initial_sequence[symbol, …]
The minimal set of entries which must be included in the sequence, in the order they are required to appear.
- patternsstr
A series of one or more regular expression specifications (as accepted by
Matcher
) which the generated sequence must simultaneously satisfy.- depth_limitint
Keyword-only argument specifying the maximum number of consecutive symbols to try inserting before giving up on finding a matching sequence. Defaults to 4.
- symbol_priority[symbol, …]
Keyword-only argument. If supplied, orders possible symbols from most to least preferable. Though this function will always return a sequence of the shortest possible length, when several equal-length sequences would be valid, this argument may be used to influence which is returned. Any symbols not appearing in the list will be given the lowest priority, and sorted in alphabetical order. If this argument is not supplied (or is empty), whenever ‘wildcard’ value (
.
) is required by a regular expression, theWILDCARD
sentinel will be inserted into the output sequence rather than a concrete symbol.
- Returns
- matching_sequence[symbol, …]
The shortest sequence of symbols which satisfies all of the supplied regular expression patterns. This will contain a superset of the sequence in
initial_sequence
, that is one where additional symbols may have been inserted but none are removed or reordered.
- Raises
- ImpossibleSequenceError
Thrown if no sequence of symbols could be found which matches all of the supplied patterns.
-
WILDCARD
= '.'¶ A constant representing a wildcard match.
-
END_OF_SEQUENCE
= ''¶ A constant representing the end-of-sequence.
-
exception
SymbolRegexSyntaxError
¶ Thrown when a regular expression string is provided which could not be parsed.
-
exception
ImpossibleSequenceError
¶ Thrown when
make_matching_sequence()
is unable to find a suitable sequence of symbols.
Internals¶
Beyond the parts exposed by the public API above, this module internally is built on top of two main parts:
A parser which parses the regular expression syntax accepted by
Matcher
into an Abstract Syntax Tree (AST)A Non-deterministic Finite-state Automaton (NFA) representation which is constructed from the AST using Thompson’s constructions.
The parser is broken into two stages: a simple tokenizer/lexer
(tokenize_regex()
) and a recursive descent parser
(parse_expression()
). A utility function combining these steps is
provided by parse_regex()
.
-
tokenize_regex
(regex_string)¶ A generator which tokenizes a sequence regular expression specification into (token_type, token_value, offset) 3-tuples.
Token types are:
"string"
(value is the string)"modifier"
(value is one of?*+
)"wildcard"
(value is.
)"end_of_sequence"
(value is$
)"bar"
(value is|
)"parenthesis"
(value is one of()
)
Throws a
SymbolRegexSyntaxError
if an invalid character is encountered.
-
parse_expression
(tokens)¶ A recursive-descent parser which parses an Abstract Syntax Tree (AST) from the regex specification.
The parsed tokens will be removed from the provided token list. Tokens are consumed right-to-left making implementing tight binding of modifiers (i.e. ‘?’, ‘*’ and ‘+’) easy.
This function will return as soon as it runs out of tokens or reaches an unmatched opening parenthesis.
Returns an AST node: one of
Symbol
,Star
,Concatenation
,Union
orNone
.
-
parse_regex
(regex_string)¶ Parse a sequence regular expression specification into an Abstract Syntax Tree (AST) consisting of None (empty),
Symbol
,Star
Concatenation
andUnion
objects.
The parser outputs an AST constructed from the following elements:
-
class
Symbol
(symbol)¶ Leaf AST node for a symbol.
-
class
Star
(expr)¶ AST node for a Kleene Star pattern (
*
).
-
class
Concatenation
(a, b)¶ AST node for a concatenation of two expressions.
-
class
Union
(a, b)¶ AST node for a union (
|
) of two expressions.
An NFA can be constructed from an AST using the NFA.from_ast()
class
method.
-
class
NFA
(start=None, final=None)¶ A Non-deterministic Finite-state Automaton (NFA) with a labelled ‘start’ and ‘final’ state.
-
classmethod
from_ast
(ast)¶ Convert a regular expression AST node into a new
NFA
object using Thompson’s constructions.
-
classmethod
-
class
NFANode
¶ A node (i.e. state) in a Non-deterministic Finite-state Automaton (NFA).
- Attributes
- transitions{symbol: set([
NFANode
, …]), …} The transition rules from this node.
Empty transitions are listed under the symbol
None
and are always bidirectional.
- transitions{symbol: set([
-
add_transition
(dest_node, symbol=None)¶ Add a transition rule from this node to the specified destination.
If no symbols are specified, a (bidirectional) empty transition between the two nodes will be added.