Document that the block magic sequence is invalid UTF-8 #275

Open
eslavich opened this issue Oct 6, 2020 · 1 comment

eslavich commented Oct 6, 2020

Unless I misunderstand the YAML spec's section on characters, all the bytes in our current block identifier sequence are valid in a YAML document:

d3 42 4c 4b

If this is true, then should we consider changing one of these characters to be outside of the YAML valid set? Doing so would allow us to seek through the ASDF file to find the first block without first parsing the YAML section.

eslavich commented Jun 9, 2021

The ASDF Standard requires that the tree be encoded in UTF-8:

ASDF is a hybrid text and binary format. The header, tree and block index are text, (specifically, in UTF-8 with DOS or UNIX-style newlines), while the blocks are raw binary.

and the block identifier sequence is in fact invalid UTF-8: 0xD3 is the lead byte of a two-byte sequence, so it must be followed by a byte in the range 80..BF, but here it is followed by 0x42 (see Table 3-7 in the Unicode Standard).
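
A quick way to check this claim is to ask a strict UTF-8 decoder. The snippet below is only a minimal sketch: `is_valid_utf8` is a hypothetical helper name, not part of any ASDF library, and it relies on Python's built-in `utf-8` codec rejecting ill-formed sequences.

```python
# The block identifier sequence from this issue: d3 42 4c 4b
BLOCK_MAGIC = bytes([0xD3, 0x42, 0x4C, 0x4B])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xD3 followed by 0x42: 0x42 is not a continuation byte, so this is ill-formed.
print(is_valid_utf8(BLOCK_MAGIC))    # False

# 0xD3 followed by a byte in 80..BF is a well-formed two-byte sequence.
print(is_valid_utf8(b"\xd3\x82"))    # True
```

Because the sequence is ill-formed no matter what comes before it, it should not be able to appear anywhere inside text that is valid UTF-8.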

So it should be possible to seek to the first block by looking for this sequence, but maybe we need to better document that fact. I'll change the title of this issue accordingly.
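
As a rough illustration of what seeking to the first block could look like, here is a sketch that scans a binary stream for the sequence without parsing any YAML. `find_first_block` and the `chunk_size` parameter are hypothetical names, the sample bytes are synthetic rather than a real ASDF file, and the approach assumes the text portion really is valid UTF-8 as the Standard requires.

```python
import io

# The block identifier sequence from this issue: d3 42 4c 4b
BLOCK_MAGIC = b"\xd3BLK"


def find_first_block(fd, chunk_size=65536):
    """Return the offset of the first BLOCK_MAGIC in fd, or -1 if absent.

    Hypothetical helper: reads in chunks and carries over the last few bytes
    so that a match straddling a chunk boundary is still found.
    """
    offset = 0   # file position where the current chunk starts
    tail = b""   # trailing bytes of the previous buffer
    while True:
        chunk = fd.read(chunk_size)
        if not chunk:
            return -1
        buf = tail + chunk
        idx = buf.find(BLOCK_MAGIC)
        if idx != -1:
            return offset - len(tail) + idx
        # Keep len(BLOCK_MAGIC) - 1 bytes: enough to complete a split match,
        # too few to contain a whole one.
        tail = buf[-(len(BLOCK_MAGIC) - 1):]
        offset += len(chunk)


# Synthetic stand-in for a header and tree, followed by the magic sequence.
text_part = "#ASDF 1.0.0\n%YAML 1.1\n--- {}\n...\n".encode("utf-8")
sample = io.BytesIO(text_part + BLOCK_MAGIC + b"\x00" * 8)
print(find_first_block(sample) == len(text_part))   # True
```

Only the first match is meaningful here, since the raw binary data inside a block could itself contain these four bytes.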

@eslavich eslavich changed the title More convenient block magic sequence Document that the block magic sequence is invalid UTF-8 Jun 9, 2021