Pre-parse Checks
The pre-parse checks allow issues that can affect XML parsing to be identified and potentially fixed before an attempt is made to parse the document.
Character encoding issues
Both DAPT and the IMSC Text profiles of TTML require that the document is encoded as UTF-8. Real-world problems that we have observed include:
the presence of null bytes
the presence of byte sequences that are not legal in UTF-8
the presence of byte sequences that are legal but highly unlikely, and that are symptomatic of encoding issues such as double encoding: UTF-8 bytes misread as “latin-1” and then encoded as UTF-8 again, so that decoding the bytes as “utf-8” and re-encoding the result as “latin-1” yields the desired UTF-8 byte sequence.
strings encoded in some scheme that is not UTF-8, in some cases also claiming to be UTF-8.
Byte Order Marks (BOMs) that were prepended to a string before encoding that string as UTF-8, thus corrupting the BOM.
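The double-encoding and corrupted-BOM cases above can be reproduced in a few lines of Python. This is an illustrative sketch only: the sample character is arbitrary, and the corruption path shown (UTF-8 bytes misread as Latin-1 and encoded as UTF-8 a second time) is an assumption about how such input typically arises, not a description of any particular tool.

```python
import codecs

# Desired UTF-8 bytes for an arbitrary sample character (e with acute accent).
desired = "\u00e9".encode("utf-8")

# Assumed corruption path: the UTF-8 bytes are misread as Latin-1
# and the resulting text is encoded as UTF-8 a second time.
corrupted = desired.decode("latin-1").encode("utf-8")

# The corrupted bytes are still legal UTF-8, just very unlikely text.
# Decoding as UTF-8 and re-encoding as Latin-1 recovers the desired bytes.
repaired = corrupted.decode("utf-8").encode("latin-1")

# The same path applied to the UTF-8 BOM gives a six-byte sequence
# that an XML parser will no longer recognise as a BOM.
corrupted_bom = codecs.BOM_UTF8.decode("latin-1").encode("utf-8")
```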
Three checks are provided to look for these issues and resolve them:
NullByteCheck looks for and removes null bytes;
ByteOrderMarkCheck looks for byte order marks and, if it finds one, removes a UTF-8 encoded UTF-8 BOM;
BadEncodingCheck attempts to identify the encoding and, if it is not UTF-8, re-encodes the input as UTF-8 after decoding it using the most likely encoding found.
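As a rough illustration of what these three checks do, the following minimal sketch uses only the Python standard library. The function names, the order in which the steps run, and the fallback encoding list are all hypothetical; in particular, the real BadEncodingCheck may identify the most likely encoding in a different way, for example with a detection library.

```python
import codecs

# A UTF-8 BOM that was misread as Latin-1 and then encoded as UTF-8 again.
DOUBLE_ENCODED_BOM = codecs.BOM_UTF8.decode("latin-1").encode("utf-8")

# Hypothetical fallback candidates, tried in order when the input is not
# valid UTF-8. "latin-1" accepts any byte sequence, so it goes last.
FALLBACK_ENCODINGS = ("cp1252", "latin-1")


def remove_null_bytes(data: bytes) -> bytes:
    """Drop every null byte from the input."""
    return data.replace(b"\x00", b"")


def strip_double_encoded_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 encoded UTF-8 BOM, if present."""
    if data.startswith(DOUBLE_ENCODED_BOM):
        return data[len(DOUBLE_ENCODED_BOM):]
    return data


def reencode_as_utf8(data: bytes) -> bytes:
    """Return the input unchanged if it is valid UTF-8; otherwise decode it
    with the first fallback encoding that works and re-encode as UTF-8."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        pass
    for encoding in FALLBACK_ENCODINGS:
        try:
            return data.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    return data


def pre_parse(data: bytes) -> bytes:
    """Apply the three steps in the order they are listed above (assumed)."""
    for step in (remove_null_bytes, strip_double_encoded_bom, reencode_as_utf8):
        data = step(data)
    return data
```

Calling `pre_parse` on the raw file contents before handing them to an XML parser mirrors the intent of the checks; the guessing step is deliberately simplistic and will misidentify many encodings.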
XML structure issues
The XmlStructureCheck checks for the presence of an XML document type declaration
and any entity declarations, as well as any non-UTF-8 document encoding claim.
If it finds any, it logs the error and then continues. It does not modify the
input bytes, for example to replace the encoding declaration. Leaving a
non-UTF-8 encoding declaration in place could conceivably produce unexpected
results if the XML parser attempts to decode the contents as something other
than UTF-8 after the BadEncodingCheck has re-encoded the document as UTF-8.
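A check of this kind can be approximated by scanning the raw bytes with regular expressions. The sketch below is an assumption about one way to do it, not the XmlStructureCheck implementation: the function name, the patterns and the message wording are hypothetical. Like the check described above, it only reports what it finds and leaves the input untouched.

```python
import logging
import re
from typing import List

logger = logging.getLogger(__name__)

# Hypothetical patterns for the three conditions described above.
DOCTYPE_RE = re.compile(rb"<!DOCTYPE\b")
ENTITY_RE = re.compile(rb"<!ENTITY\b")
XML_DECL_ENCODING_RE = re.compile(
    rb"<\?xml\s[^>]*?\bencoding\s*=\s*[\"']([^\"']+)[\"']"
)


def find_structure_issues(data: bytes) -> List[str]:
    """Return a message for each issue found; the input is not modified."""
    issues = []
    if DOCTYPE_RE.search(data):
        issues.append("document type declaration found")
    if ENTITY_RE.search(data):
        issues.append("entity declaration found")
    match = XML_DECL_ENCODING_RE.search(data)
    if match:
        claimed = match.group(1).decode("ascii", "replace")
        if claimed.lower() != "utf-8":
            issues.append("non-UTF-8 encoding declared: " + claimed)
    for issue in issues:
        # Log and carry on, rather than raising or rewriting the bytes.
        logger.error(issue)
    return issues
```

For example, a document beginning `<?xml version="1.0" encoding="ISO-8859-1"?>` followed by a DOCTYPE with an internal entity declaration would yield three messages, while a document declaring UTF-8 and containing neither construct would yield none.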