Document that the block magic sequence is invalid UTF-8 #275

eslavich · 2020-10-06T19:25:41Z

Unless I misunderstand the YAML spec's section on characters, all the bytes in our current block identifier sequence are valid in a YAML document:

d3 42 4c 4b

If this is true, then should we consider changing one of these characters to be outside of the YAML valid set? Doing so would allow us to seek through the ASDF file to find the first block without first parsing the YAML section.


eslavich · 2021-06-09T21:11:54Z

The ASDF Standard requires that the tree be encoded in UTF-8:

ASDF is a hybrid text and binary format. The header, tree and block index are text, (specifically, in UTF-8 with DOS or UNIX-style newlines), while the blocks are raw binary.

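and the block identifier sequence is in fact invalid UTF-8: 0xD3 is the lead byte of a two-byte sequence, so it must be followed by a byte in the range 0x80..0xBF, but here it is followed by 0x42 (see Table 3-7, "Well-Formed UTF-8 Byte Sequences", in the Unicode Standard).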
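A quick way to check this claim is to ask a strict UTF-8 decoder. This is a minimal Python sketch; the helper name `is_valid_utf8` is made up for illustration and is not part of any ASDF library.

```python
# Block identifier sequence from this issue: d3 42 4c 4b
BLOCK_MAGIC = bytes.fromhex("d3424c4b")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xD3 followed by 0x42 ('B') is rejected, because 0x42 is not a
# continuation byte in the range 0x80..0xBF.
magic_is_valid = is_valid_utf8(BLOCK_MAGIC)

# For contrast, 0xD3 followed by a continuation byte is well formed.
two_byte_is_valid = is_valid_utf8(b"\xd3\x80")
```

Here `magic_is_valid` comes out `False` and `two_byte_is_valid` comes out `True`, which matches the reading of the table.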

So it should be possible to seek to the first block by looking for this sequence, but maybe we need to better document that fact. I'll change the title of this issue accordingly.
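The seek could look something like the following Python sketch. It assumes the file object is opened in binary mode and positioned at the start of the file, and that the text before the first block really is valid UTF-8 as the standard requires. The function name `find_first_block` and the `chunk_size` parameter are hypothetical, not an existing ASDF API.

```python
import io

BLOCK_MAGIC = b"\xd3BLK"  # d3 42 4c 4b


def find_first_block(fileobj, chunk_size=65536):
    """Hypothetical helper: offset of the first block magic, or -1.

    Offsets are relative to the position of `fileobj` when called.
    """
    overlap = len(BLOCK_MAGIC) - 1
    offset = 0  # offset (relative to the start) of the first byte in `buf`
    buf = b""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return -1  # reached end of file without finding a block
        buf += chunk
        pos = buf.find(BLOCK_MAGIC)
        if pos != -1:
            return offset + pos
        # Keep the tail so a magic split across two chunks is still found.
        keep = buf[-overlap:]
        offset += len(buf) - len(keep)
        buf = keep


# Usage with an in-memory stand-in for an ASDF file (contents are illustrative).
text = b"#ASDF 1.0.0\n%YAML 1.1\n---\nfoo: bar\n...\n"
first_block_offset = find_first_block(io.BytesIO(text + BLOCK_MAGIC + b"\x00\x30"))
```

Only the first match is meaningful here: the blocks themselves are raw binary, so the same four bytes could appear inside block data later in the file.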

eslavich added the ASDF Standard 2.0.0 label Oct 6, 2020

eslavich changed the title ~~More convenient block magic sequence~~ Document that the block magic sequence is invalid UTF-8 Jun 9, 2021
