-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: defer the allocation of matchValue. #61
Conversation
matchValue isn't needed on the fast path, so don't allocate it until it's actually needed. Additionally, use a Span-based conversion to UTF-8 bytes to avoid the creation of a temporary char array.
WalkthroughThe changes in this pull request primarily enhance the Changes
Possibly related PRs
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
private static byte[] GetUtf8Bytes(ReadOnlySpan<char> text, Span<byte> scratch) | ||
{ | ||
// check if text can be decoded into the buffer; each UTF-16 char can become at most 3 UTF-8 bytes | ||
if (text.Length * 3 < scratch.Length) | ||
{ | ||
return scratch[..System.Text.Encoding.UTF8.GetBytes(text, scratch)].ToArray(); | ||
} | ||
else | ||
{ | ||
return System.Text.Encoding.UTF8.GetBytes(text.ToArray()); | ||
} | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect buffer size estimation in GetUtf8Bytes
leading to potential overrun
In the GetUtf8Bytes
method, the buffer size estimation assumes each UTF-16 character can expand to a maximum of 3 bytes in UTF-8:
if (text.Length * 3 < scratch.Length)
However, UTF-8 encoding of certain Unicode characters can require up to 4 bytes per character. This miscalculation might cause the buffer scratch
to be too small, potentially leading to buffer overruns.
Suggested Fix:
Update the buffer size estimation to account for the maximum of 4 bytes per character and adjust the condition accordingly:
-if (text.Length * 3 < scratch.Length)
+if (text.Length * 4 <= scratch.Length)
This change ensures the buffer is large enough for all possible UTF-8 encodings of the input text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, UTF-8 encoding of certain Unicode characters can require up to 4 bytes per character.
This is wrong. A single UTF-16 code unit can decode into at most three UTF-8 bytes. (Two UTF-16 code units representing a surrogate pair could decode into four bytes, but that would only be two per char
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bgrainger, thank you for the clarification. You are correct; a single UTF-16 code unit can encode into at most three bytes in UTF-8, and your code accurately reflects this. I appreciate you pointing this out.
(・⊝・)?
✏️ Learnings added
Learnt from: bgrainger
PR: tryAGI/Tiktoken#61
File: src/libs/Tiktoken.Core/CoreBPE.cs:558-571
Timestamp: 2024-11-09T17:47:46.917Z
Learning: In the `GetUtf8Bytes` method in `CoreBPE.cs`, when converting from `ReadOnlySpan<char>` to `Span<byte>` using UTF-8 encoding in C#, a single UTF-16 code unit (`char`) can expand to a maximum of three bytes in UTF-8. Although surrogate pairs (two UTF-16 code units) can encode into four bytes, the buffer size estimation using `text.Length * 3` is appropriate since it considers the maximum expansion per `char`.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
#if NET7_0_OR_GREATER | ||
var piece = GetUtf8Bytes(textSpan.Slice(match.Index, match.Length), pieceBytes); | ||
#else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Potential misuse of shared pieceBytes
buffer in loops
In lines 115 and 209, GetUtf8Bytes
is called within loops using the shared pieceBytes
buffer:
var piece = GetUtf8Bytes(textSpan.Slice(match.Index, match.Length), pieceBytes);
If the textSpan
slices exceed the size of pieceBytes
, this could lead to incorrect encoding or runtime exceptions, especially since the buffer size is fixed.
Recommendation:
- Ensure that
pieceBytes
is sufficiently sized for the largest expected input within the loops. - Alternatively, allocate
pieceBytes
inside the loop or adjustGetUtf8Bytes
to handle cases where the buffer isn't large enough.
This will prevent potential errors due to buffer overflow or insufficient buffer capacity.
Also applies to: 209-210
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bgrainger, Understood.
It's just a bot, probably using gpt4o-mini or worse models, so don't expect any insight into such matters. But sometimes it works |
I've heard of it before, but never seen it enabled on a project. I know to take its suggestions with a (large) grain of salt, but couldn't resist defending myself against the incorrect AI 😂 (If anything it seems worthwhile to note the objections to its review for future maintainers/contributors who are looking into the history of the code.) |
matchValue
isn't needed on the fast path, so don't allocate it until it's actually needed. Additionally, use a Span-based conversion to UTF-8 bytes to avoid the creation of a temporary char array.Summary by CodeRabbit
New Features
Bug Fixes
Documentation