RFC: Revise char primitive #7028
Comments
I like it! "Support range patterns for regular integers" especially. But I'm not very read up on these things.
Looks good in principle. Just wondering what kind of functionality one would want. Is there any way to use former chars, now ints, to build things, as opposed to just taking things apart?
If ints, formerly chars, are not used for building things, then what else should be used? Always strings?
In the current version:
// '𐀀' = 0x10000
let _ = String.get("𐀀", n) // behaves like JS's String.prototype.codePointAt
let _ = String.make(10, '𐀀')
// ^ This has the unexpected value "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
// Wrong on both the JS and OCaml sides
The proposal "fixes" the current behavior without breaking it. After removing the OCaml modules in #6984, I expect it to be unused, but it shouldn't cause any problems even in code that still uses it.
The BMP is the boundary beyond which a character is represented in UTF-16 as a surrogate pair. If the codepoint is in the BMP, it can be represented in JS as a string with a single UTF-16 code unit. I think it can be easily checked in the scanner.
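To make the BMP boundary concrete, here is a minimal JavaScript sketch of the kind of check a scanner could apply to a char literal. The helper name isBmpCodePoint is an assumption made for illustration; it is not part of the ReScript compiler or of this proposal.

```javascript
// Hypothetical helper (not from the ReScript codebase): true when the code
// point fits in a single UTF-16 code unit, i.e. lies in U+0000..U+FFFF.
function isBmpCodePoint(codePoint) {
  return codePoint >= 0x0000 && codePoint <= 0xffff;
}

const bmpChar = "a";            // U+0061, inside the BMP
const astralChar = "\u{10000}"; // U+10000, the first code point outside the BMP

console.log(bmpChar.length);                         // 1
console.log(astralChar.length);                      // 2 (a surrogate pair)
console.log(astralChar.charCodeAt(0).toString(16));  // d800 (high surrogate only)
console.log(astralChar.codePointAt(0).toString(16)); // 10000 (full code point)
console.log(isBmpCodePoint(astralChar.codePointAt(0))); // false
```

A literal that fails this check would be the case the proposal wants to reject or report.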
When writing tokenizers in ReScript, I would still find something like this useful:
switch String.codePointAt(input, cursor) {
| Some(ch) => switch ch {
  | '\r' | '\n' => Whitespace
  | 'a' .. 'z' | 'A' .. 'Z' => ...
  }
| None => ...
}
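For comparison, a rough JavaScript sketch of what such a match amounts to once char literals are plain integers: each literal becomes its code point and each range becomes a pair of numeric comparisons. This is hand-written for illustration, not actual compiler output, and the token names "Letter", "EndOfInput", and "Other" are assumptions that fill in the elided branches.

```javascript
// Hypothetical classifier mirroring the switch above with integer comparisons.
function classify(input, cursor) {
  // codePointAt returns undefined when cursor is out of range (the None case).
  const ch = input.codePointAt(cursor);
  if (ch === undefined) return "EndOfInput";

  // '\r' | '\n'
  if (ch === 0x0d || ch === 0x0a) return "Whitespace";

  // 'a' .. 'z' | 'A' .. 'Z'
  if ((ch >= 0x61 && ch <= 0x7a) || (ch >= 0x41 && ch <= 0x5a)) return "Letter";

  return "Other";
}

console.log(classify("a\n", 0)); // Letter
console.log(classify("a\n", 1)); // Whitespace
console.log(classify("a\n", 2)); // EndOfInput
```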
That's how one takes things apart.
One observation is that the internal representation as integers might have to wait until after parsing, as pretty printing will restore the original.
State of char type
ReScript has the char primitive type, which is rarely used. (I was one of those who used char to handle ASCII keycodes.)
https://rescript-lang.org/docs/manual/latest/primitive-types#char
The char type doesn't support full Unicode, only a UTF-16 codepoint.
compiles to
Its value is the same as the result of '👋'.codePointAt(0) in JavaScript, which means that in the value representation, char is equivalent to int (16-bit subset).
Then why don't we just use int instead of char?
char literals are automatically compiled to codepoints. This is much more efficient than a string representation when dealing with Unicode data tables.
char supports range patterns (e.g. 'a' .. 'z') in pattern matching. This is very useful when writing parsers.
However, a char literal is not really useful for representing a Unicode character because it doesn't cover the entire Unicode sequence. It only returns the first codepoint value and discards the rest of the character segment.
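One way to read the "first codepoint only" problem, shown on the JavaScript side: a single visible character can consist of several code points, and codePointAt(0) sees only the first. The emoji sequence below is an example chosen for illustration; it does not appear in the original issue.

```javascript
// "👋🏽" is one visible character made of two code points:
// U+1F44B (waving hand) followed by U+1F3FD (medium skin tone modifier).
const wave = "\u{1F44B}\u{1F3FD}";

// Iterating a string yields one element per code point, not per code unit.
const codePoints = Array.from(wave, (c) => c.codePointAt(0));

// Only the first code point is visible here; the modifier is dropped.
const first = wave.codePointAt(0);

console.log(wave.length); // 4 (UTF-16 code units)
console.log(codePoints);  // [ 128075, 127997 ]
console.log(first);       // 128075
```

Any integer representation of a single char necessarily loses the second code point, which is why restricting the literal to a known range is attractive.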
To avoid problems, we should limit the value range of char literals to the BMP (Basic Multilingual Plane, U+0000~U+FFFF).
Suggestion
I suggest some changes that would keep the useful parts of char but remove its confusion.
char type or make it an alias of int
Char module.