Return binary git paths, not potentially invalid UTF8 #936
Git paths have no inherent encoding; they are opaque binary strings that can be in any encoding. The macro `rb_str_new_utf8` is used in places to convert raw C strings representing paths (and other things) into Ruby strings tagged with UTF-8 encoding.

This is not safe, since the underlying function simply copies bytes and tags the string with the specified encoding; it does not do any validity checking or transcoding. Thus it can easily create an invalid string (i.e. one for which `String#valid_encoding?` is false) if the repo contains files whose paths are multibyte strings in encodings other than UTF-8. These strings are poisoned and difficult to work with: they can't be compared safely because of the semantics of Ruby strings, and they often can't be concatenated to a larger output buffer for display (which will attempt to transcode to the output buffer's native encoding, usually UTF-8). The comparison issue hit us a few times at GitHub.

More detail: in Ruby, string equality for multibyte strings is defined as bytewise equality plus encoding. For convenience, this is unfortunately not enforced for ASCII strings, so it often becomes a bomb that only gets triggered when you pass multibyte data through your application.
So even though a binary-encoded "hello" is equal to its UTF-8 equivalent, the same is not true for non-ASCII text: the bytes can be identical and the two strings still compare as unequal.
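The two comparisons can be reproduced with a short sketch. The strings here are illustrative and not taken from this PR; `String#b` returns a copy of the string tagged as binary (ASCII-8BIT).

```ruby
# Illustrative values only; not taken from the PR.
ascii_utf8   = "hello"        # string literals default to UTF-8
ascii_binary = ascii_utf8.b   # same bytes, tagged ASCII-8BIT (binary)

ascii_equal = (ascii_utf8 == ascii_binary)
puts ascii_equal              # true: ASCII-only strings are compared bytewise

multi_utf8   = "h\u00e9llo"   # "héllo" in UTF-8
multi_binary = multi_utf8.b

multi_equal = (multi_utf8 == multi_binary)
same_bytes  = (multi_utf8.bytes == multi_binary.bytes)
puts multi_equal              # false: the encodings differ
puts same_bytes               # true: the underlying bytes are identical
```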
One possible fix is to transcode to UTF-8, but this is a bad idea, since the domain of possible git paths includes many binary strings that are not representable in UTF-8. Better to acknowledge reality and use an encoding that matches the actual characteristics of this data, so clients can handle it however they choose.
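As a minimal sketch of the difference, assume a path stored in ISO-8859-1 (a hypothetical example, not from this PR). Tagging its bytes as UTF-8 yields an invalid string, while tagging them as binary is always valid and leaves the decoding decision to the caller.

```ruby
# Hypothetical path bytes: "café.txt" in ISO-8859-1 (0xE9 is "é").
raw = "caf\xE9.txt".b

as_utf8   = raw.dup.force_encoding(Encoding::UTF_8)
as_binary = raw.dup.force_encoding(Encoding::BINARY)

utf8_valid   = as_utf8.valid_encoding?    # false: 0xE9 followed by "." is not valid UTF-8
binary_valid = as_binary.valid_encoding?  # true: any byte sequence is valid as binary

# A caller that knows the real encoding can decode explicitly.
decoded    = as_binary.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
decoded_ok = (decoded == "caf\u00e9.txt")
```

The binary-tagged string carries the same bytes either way; only the caller, who may know the repository's conventions, is in a position to pick the right decoding.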
Note that this PR could go farther and do the same thing for some other uses of this macro which are similarly invalid (refs), but I've opted to keep the fix relatively narrow and focus on paths.