Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return binary git paths, not potentially invalid UTF8 #936

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

brasic
Copy link

@brasic brasic commented Aug 17, 2022

Git paths have no inherent encoding, they are opaque binary strings that can be in any encoding. The macro rb_str_new_utf8 is used in places to convert raw C strings representing paths (and other things) into ruby strings tagged with UTF8 encoding.

This is not safe, since the underlying function simply copies bytes and tags the string with the specified encoding; it does not do any validity checking or transcoding. Thus it can easily create an invalid string (i.e. one for which String#valid_encoding? is false) if the repo contains files whose paths are multibyte strings in encodings other than UTF8. These strings are poisoned and difficult to work with: they can't be compared safely because of the semantics of ruby strings and they often can't be concatenated to a larger output buffer for display (which will attempt to transcode to the output buffer's native encoding, usually UTF8). The comparison issue hit us a few times at GitHub.

More detail: in ruby, string equality for multibyte strings is defined as bytewise equality plus encoding. For convenience, this is unfortunately not enforced for ASCII strings, so it often becomes a bomb that only gets triggered when you pass multibyte data through your
application.

So even though a binary-encoded "hello" is equal to the UTF8 equivalent,

"\x68\x65\x6C\x6C\x6F".b.bytes == "hello".bytes
# => true
"\x68\x65\x6C\x6C\x6F".b == "hello"
# => true

the same is not true for non-ASCII text:

"\xE6\x97\xA5\xE6\x9C\xAC".b.bytes == "日本".bytes
# => true
"\xE6\x^C\xA5\xE6\x9C\xAC".b == "日本"
# => false

One possible fix is to transcode to UTF-8. But this is a bad idea since the domain of possible git paths includes many binary strings that are not representable in UTF8 encoding. Better to acknowledge reality and use an encoding which matches the actual characteristics of this data so clients can handle it how they choose.

Note that this PR could go farther and do the same thing for some other uses of this macro which are similarly invalid (refs), but I've opted to keep the fix relatively narrow and focus on paths.

Git paths have no inherent encoding, they are opaque binary strings that
can be in any encoding. The macro `rb_str_new_utf8` is used in places to
convert raw C strings representing paths (and other things) into ruby
strings tagged with UTF8 encoding.

This is not safe, since the macro simply copies bytes and tags the
string with the specified encoding; it does not do any validity checking
or transcoding. Thus it can easily create an invalid string (i.e. one
for which `String#valid_encoding?` is false) if the repo contains files
whose paths are multibyte strings in encodings other than UTF8. These
strings are poisoned and difficult to work with: they can't be compared
safely because of the semantics of ruby strings and they often can't be
concatenated to a larger output buffer for display (which will attempt
to transcode to the output buffer's native encoding, usually UTF8). The
comparison issue hit us a few times at GitHub.

More detail: in ruby, string equality for multibyte strings is defined
as bytewise equality *plus* encoding. For convenience, this is
unfortunately not enforced for ASCII strings, so it often becomes a bomb
that only gets triggered when you pass multibyte data through your
application.

So even though a binary-encoded "hello" is equal to the UTF8 equivalent,

    "\x68\x65\x6C\x6C\x6F".b.bytes == "hello".bytes
    # => true
    "\x68\x65\x6C\x6C\x6F".b == "hello"
    # => true

the same is not true for non-ASCII text:

    "\xE6\x97\xA5\xE6\x9C\xAC".b.bytes == "日本".bytes
    # => true
    "\xE6\x^C\xA5\xE6\x9C\xAC".b == "日本"
    # => false

One possible fix is to transcode to UTF-8. But this is a bad idea since
the domain of possible git paths includes many binary strings that are
not representable in UTF8 encoding. Better to acknowledge reality and
use an encoding which matches the actual characteristics of this data so
clients can handle it how they choose.

Note that this PR could go farther and do the same thing for some other
uses of this macro which are similarly invalid (refs), but I've opted to
keep the fix relatively narrow and focus on paths.
@wincent
Copy link

wincent commented Aug 18, 2022

Thanks for doing this @brasic! To reflect here what we talked about on Zoom, I would note that at least some of this (the bits dealing with reporting conflicted paths) may be made redundant by the merge-ort work, as we won't be using Rugged for that bit any more in the future.

@richardkmichael
Copy link

Note, this overlaps with the work in PR #619. There is relevant discussion on that PR too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants