feat: Moved encodings to separate packages. Added o200k_base support.
HavenDV committed May 15, 2024
1 parent 98ed17d commit 5ee22aa
Showing 43 changed files with 210,607 additions and 432 deletions.
85 changes: 43 additions & 42 deletions README.md
@@ -10,22 +10,23 @@ There's also a benchmark console app here for easy tracking of this.
We will be happy to accept any PR.

### Implemented encodings
- `o200k_base`
- `cl100k_base`
- `r50k_base`
- `p50k_base`
- `p50k_edit`

### Usage
```csharp
-var encoding = Tiktoken.Encoding.ForModel("gpt-4");
-var tokens = encoding.Encode("hello world"); // [15339, 1917]
-var text = encoding.Decode(tokens); // hello world
-var numberOfTokens = encoding.CountTokens(text); // 2
-var stringTokens = encoding.Explore(text); // ["hello", " world"]
+using Tiktoken.Encodings;
+using Tiktoken;

-var encoding = Tiktoken.Encoding.Get(Encodings.P50KBase);
-var tokens = encoding.Encode("hello world"); // [31373, 995]
-var text = encoding.Decode(tokens); // hello world
+var encoding = new O200KBase();
+var encoder = new Encoder(encoding);
+var tokens = encoder.Encode("hello world"); // [15339, 1917]
+var text = encoder.Decode(tokens); // hello world
+var numberOfTokens = encoder.CountTokens(text); // 2
+var stringTokens = encoder.Explore(text); // ["hello", " world"]
```

### Benchmarks
@@ -36,43 +37,43 @@ You can view the reports for each version [here](benchmarks)
BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M1 Pro, 1 CPU, 10 logical and 10 physical cores
-.NET SDK 8.0.203
-[Host] : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD
-DefaultJob : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD
+.NET SDK 8.0.204
+[Host] : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
+DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
```
| Method | Categories | Data | Mean | Median | Ratio | Gen0 | Gen1 | Allocated | Alloc Ratio |
|--------------------------- |------------ |-------------------- |---------------:|---------------:|------:|---------:|---------:|----------:|------------:|
| **SharpTokenV2_0_1_** | **CountTokens** | **1. (...)57. [19866]** | **659,050.8 ns** | **663,888.3 ns** | **1.00** | **2.9297** | **0.9766** | **20116 B** | **1.00** |
| TiktokenSharpV1_0_9_ | CountTokens | 1. (...)57. [19866] | 951,380.1 ns | 939,690.6 ns | 1.45 | 250.0000 | 125.0000 | 1570772 B | 78.09 |
| TokenizerLibV1_3_3_ | CountTokens | 1. (...)57. [19866] | 1,049,794.0 ns | 1,032,725.9 ns | 1.61 | 246.0938 | 89.8438 | 1547675 B | 76.94 |
| Tiktoken_ | CountTokens | 1. (...)57. [19866] | 325,631.7 ns | 324,920.4 ns | 0.49 | 49.3164 | - | 309449 B | 15.38 |
| | | | | | | | | | |
| **SharpTokenV2_0_1_** | **CountTokens** | **Hello, World!** | **431.0 ns** | **430.5 ns** | **1.00** | **0.0405** | **-** | **256 B** | **1.00** |
| TiktokenSharpV1_0_9_ | CountTokens | Hello, World! | 5,826.4 ns | 5,826.7 ns | 13.52 | 2.1210 | 0.0305 | 13344 B | 52.12 |
| TokenizerLibV1_3_3_ | CountTokens | Hello, World! | 774.3 ns | 771.0 ns | 1.80 | 0.2356 | - | 1480 B | 5.78 |
| Tiktoken_ | CountTokens | Hello, World! | 214.2 ns | 212.9 ns | 0.50 | 0.0420 | - | 264 B | 1.03 |
| | | | | | | | | | |
| **SharpTokenV2_0_1_** | **CountTokens** | **King(...)edy. [275]** | **6,643.3 ns** | **6,645.0 ns** | **1.00** | **0.0763** | **-** | **520 B** | **1.00** |
| TiktokenSharpV1_0_9_ | CountTokens | King(...)edy. [275] | 13,319.5 ns | 13,318.8 ns | 2.00 | 5.0507 | 0.1678 | 31712 B | 60.98 |
| TokenizerLibV1_3_3_ | CountTokens | King(...)edy. [275] | 7,342.0 ns | 7,349.4 ns | 1.10 | 3.0823 | 0.1373 | 19344 B | 37.20 |
| Tiktoken_ | CountTokens | King(...)edy. [275] | 3,306.1 ns | 3,289.0 ns | 0.50 | 0.6447 | - | 4064 B | 7.82 |
| | | | | | | | | | |
| **SharpTokenV2_0_1_Encode** | **Encode** | **1. (...)57. [19866]** | **616,768.0 ns** | **615,247.0 ns** | **1.00** | **2.9297** | **-** | **20115 B** | **1.00** |
| TiktokenSharpV1_0_9_Encode | Encode | 1. (...)57. [19866] | 929,080.6 ns | 926,978.2 ns | 1.51 | 250.0000 | 125.0000 | 1570770 B | 78.09 |
| TokenizerLibV1_3_3_Encode | Encode | 1. (...)57. [19866] | 793,069.4 ns | 791,800.6 ns | 1.29 | 246.0938 | 85.9375 | 1547673 B | 76.94 |
| Tiktoken_Encode | Encode | 1. (...)57. [19866] | 340,412.3 ns | 339,821.0 ns | 0.55 | 59.5703 | 2.4414 | 375601 B | 18.67 |
| | | | | | | | | | |
| **SharpTokenV2_0_1_Encode** | **Encode** | **Hello, World!** | **443.7 ns** | **443.7 ns** | **1.00** | **0.0405** | **-** | **256 B** | **1.00** |
| TiktokenSharpV1_0_9_Encode | Encode | Hello, World! | 5,783.7 ns | 5,778.7 ns | 13.04 | 2.1210 | 0.0305 | 13344 B | 52.12 |
| TokenizerLibV1_3_3_Encode | Encode | Hello, World! | 491.2 ns | 491.0 ns | 1.11 | 0.2356 | - | 1480 B | 5.78 |
| Tiktoken_Encode | Encode | Hello, World! | 264.8 ns | 264.3 ns | 0.60 | 0.1030 | - | 648 B | 2.53 |
| | | | | | | | | | |
| **SharpTokenV2_0_1_Encode** | **Encode** | **King(...)edy. [275]** | **6,620.0 ns** | **6,618.1 ns** | **1.00** | **0.0763** | **-** | **520 B** | **1.00** |
| TiktokenSharpV1_0_9_Encode | Encode | King(...)edy. [275] | 13,205.7 ns | 13,217.4 ns | 1.99 | 5.0507 | 0.1678 | 31712 B | 60.98 |
| TokenizerLibV1_3_3_Encode | Encode | King(...)edy. [275] | 7,312.6 ns | 7,307.4 ns | 1.10 | 3.0823 | 0.1373 | 19344 B | 37.20 |
| Tiktoken_Encode | Encode | King(...)edy. [275] | 3,599.7 ns | 3,596.8 ns | 0.54 | 0.7973 | - | 5024 B | 9.66 |
| Method | Categories | Data | Mean | Median | Ratio | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|--------------------------- |------------ |-------------------- |-------------:|-------------:|------:|---------:|--------:|-------:|----------:|------------:|
| **SharpTokenV2_0_1_** | **CountTokens** | **1. (...)57. [19866]** | **632,817.1 ns** | **632,257.2 ns** | **1.00** | **2.9297** | **-** | **-** | **20115 B** | **1.00** |
| TiktokenSharpV1_0_9_ | CountTokens | 1. (...)57. [19866] | 463,840.3 ns | 458,851.3 ns | 0.74 | 64.4531 | 3.4180 | - | 404649 B | 20.12 |
| TokenizerLibV1_3_3_ | CountTokens | 1. (...)57. [19866] | 801,796.0 ns | 806,271.8 ns | 1.27 | 247.0703 | 98.6328 | 0.9766 | 1547675 B | 76.94 |
| Tiktoken_ | CountTokens | 1. (...)57. [19866] | 319,697.2 ns | 319,475.1 ns | 0.50 | 49.3164 | - | - | 309449 B | 15.38 |
| | | | | | | | | | | |
| **SharpTokenV2_0_1_** | **CountTokens** | **Hello, World!** | **478.1 ns** | **478.1 ns** | **1.00** | **0.0401** | **-** | **-** | **256 B** | **1.00** |
| TiktokenSharpV1_0_9_ | CountTokens | Hello, World! | 275.2 ns | 275.1 ns | 0.58 | 0.0505 | - | - | 320 B | 1.25 |
| TokenizerLibV1_3_3_ | CountTokens | Hello, World! | 498.1 ns | 497.4 ns | 1.04 | 0.2356 | - | - | 1480 B | 5.78 |
| Tiktoken_ | CountTokens | Hello, World! | 212.9 ns | 212.8 ns | 0.45 | 0.0420 | - | - | 264 B | 1.03 |
| | | | | | | | | | | |
| **SharpTokenV2_0_1_** | **CountTokens** | **King(...)edy. [275]** | **6,652.5 ns** | **6,651.9 ns** | **1.00** | **0.0763** | **-** | **-** | **520 B** | **1.00** |
| TiktokenSharpV1_0_9_ | CountTokens | King(...)edy. [275] | 4,774.2 ns | 4,781.1 ns | 0.72 | 0.8011 | - | - | 5064 B | 9.74 |
| TokenizerLibV1_3_3_ | CountTokens | King(...)edy. [275] | 7,261.6 ns | 7,241.6 ns | 1.09 | 3.0899 | 0.1450 | 0.0076 | 19344 B | 37.20 |
| Tiktoken_ | CountTokens | King(...)edy. [275] | 3,216.1 ns | 3,189.9 ns | 0.49 | 0.6447 | - | - | 4064 B | 7.82 |
| | | | | | | | | | | |
| **SharpTokenV2_0_1_Encode** | **Encode** | **1. (...)57. [19866]** | **613,700.9 ns** | **612,821.4 ns** | **1.00** | **2.9297** | **-** | **-** | **20115 B** | **1.00** |
| TiktokenSharpV1_0_9_Encode | Encode | 1. (...)57. [19866] | 444,436.3 ns | 444,298.4 ns | 0.72 | 64.4531 | 3.4180 | - | 404649 B | 20.12 |
| TokenizerLibV1_3_3_Encode | Encode | 1. (...)57. [19866] | 773,882.5 ns | 774,314.3 ns | 1.26 | 246.0938 | 85.9375 | - | 1547673 B | 76.94 |
| Tiktoken_Encode | Encode | 1. (...)57. [19866] | 335,482.3 ns | 333,936.4 ns | 0.55 | 59.5703 | 2.4414 | - | 375601 B | 18.67 |
| | | | | | | | | | | |
| **SharpTokenV2_0_1_Encode** | **Encode** | **Hello, World!** | **443.7 ns** | **436.8 ns** | **1.00** | **0.0405** | **-** | **-** | **256 B** | **1.00** |
| TiktokenSharpV1_0_9_Encode | Encode | Hello, World! | 300.4 ns | 299.4 ns | 0.67 | 0.0505 | - | - | 320 B | 1.25 |
| TokenizerLibV1_3_3_Encode | Encode | Hello, World! | 504.7 ns | 498.5 ns | 1.15 | 0.2356 | 0.0010 | - | 1480 B | 5.78 |
| Tiktoken_Encode | Encode | Hello, World! | 262.4 ns | 262.6 ns | 0.58 | 0.1030 | - | - | 648 B | 2.53 |
| | | | | | | | | | | |
| **SharpTokenV2_0_1_Encode** | **Encode** | **King(...)edy. [275]** | **6,784.3 ns** | **6,714.1 ns** | **1.00** | **0.0763** | **-** | **-** | **520 B** | **1.00** |
| TiktokenSharpV1_0_9_Encode | Encode | King(...)edy. [275] | 4,691.2 ns | 4,690.7 ns | 0.69 | 0.8011 | - | - | 5064 B | 9.74 |
| TokenizerLibV1_3_3_Encode | Encode | King(...)edy. [275] | 7,287.9 ns | 7,290.9 ns | 1.08 | 3.0823 | 0.1373 | - | 19344 B | 37.20 |
| Tiktoken_Encode | Encode | King(...)edy. [275] | 3,606.2 ns | 3,607.4 ns | 0.53 | 0.7973 | - | - | 5024 B | 9.66 |

<!--BENCHMARKS_END-->

44 changes: 43 additions & 1 deletion Tiktoken.sln
@@ -8,7 +8,7 @@ Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "libs", "libs", "{3B092566-9
src\libs\Directory.Build.props = src\libs\Directory.Build.props
EndProjectSection
EndProject
-Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Tiktoken", "src\libs\Tiktoken\Tiktoken.csproj", "{722606E4-DB7E-42DE-9627-A70459ABED8A}"
+Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Tiktoken.Core", "src\libs\Tiktoken.Core\Tiktoken.Core.csproj", "{722606E4-DB7E-42DE-9627-A70459ABED8A}"
EndProject
Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "tests", "tests", "{BC3A0A95-4B22-4523-A1C5-94699F586D1B}"
EndProject
@@ -29,6 +29,18 @@ Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "benchmarks", "benchmarks",
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Tiktoken.Benchmarks", "src\benchmarks\Tiktoken.Benchmarks\Tiktoken.Benchmarks.csproj", "{83457B34-2F66-4566-97A8-EB5CA2FAFC3E}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Tiktoken", "src\libs\Tiktoken\Tiktoken.csproj", "{7F1803DA-299D-4839-876D-19B4031C1246}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Tiktoken.Encodings.cl100k", "src\libs\Tiktoken.Encodings.cl100k\Tiktoken.Encodings.cl100k.csproj", "{937EC883-68C6-4F97-904E-4EC7149FEC72}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Tiktoken.Encodings.p50k", "src\libs\Tiktoken.Encodings.p50k\Tiktoken.Encodings.p50k.csproj", "{96A3E018-461F-4466-9086-8FFBA89DDE35}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Tiktoken.Encodings.o200k", "src\libs\Tiktoken.Encodings.o200k\Tiktoken.Encodings.o200k.csproj", "{40F00E7F-84D7-4886-95F4-DC96BDE90965}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Tiktoken.Encodings.Abstractions", "src\libs\Tiktoken.Encodings.Abstractions\Tiktoken.Encodings.Abstractions.csproj", "{3EDC6F81-2C49-4B8D-9AAC-5B7C40D5A1CF}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Tiktoken.Encodings.r50k", "src\libs\Tiktoken.Encodings.r50k\Tiktoken.Encodings.r50k.csproj", "{633C9C98-0782-4CFC-9D26-F27E77FA11EC}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
@@ -47,6 +59,30 @@ Global
{83457B34-2F66-4566-97A8-EB5CA2FAFC3E}.Debug|Any CPU.Build.0 = Debug|Any CPU
{83457B34-2F66-4566-97A8-EB5CA2FAFC3E}.Release|Any CPU.ActiveCfg = Release|Any CPU
{83457B34-2F66-4566-97A8-EB5CA2FAFC3E}.Release|Any CPU.Build.0 = Release|Any CPU
{7F1803DA-299D-4839-876D-19B4031C1246}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{7F1803DA-299D-4839-876D-19B4031C1246}.Debug|Any CPU.Build.0 = Debug|Any CPU
{7F1803DA-299D-4839-876D-19B4031C1246}.Release|Any CPU.ActiveCfg = Release|Any CPU
{7F1803DA-299D-4839-876D-19B4031C1246}.Release|Any CPU.Build.0 = Release|Any CPU
{937EC883-68C6-4F97-904E-4EC7149FEC72}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{937EC883-68C6-4F97-904E-4EC7149FEC72}.Debug|Any CPU.Build.0 = Debug|Any CPU
{937EC883-68C6-4F97-904E-4EC7149FEC72}.Release|Any CPU.ActiveCfg = Release|Any CPU
{937EC883-68C6-4F97-904E-4EC7149FEC72}.Release|Any CPU.Build.0 = Release|Any CPU
{96A3E018-461F-4466-9086-8FFBA89DDE35}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{96A3E018-461F-4466-9086-8FFBA89DDE35}.Debug|Any CPU.Build.0 = Debug|Any CPU
{96A3E018-461F-4466-9086-8FFBA89DDE35}.Release|Any CPU.ActiveCfg = Release|Any CPU
{96A3E018-461F-4466-9086-8FFBA89DDE35}.Release|Any CPU.Build.0 = Release|Any CPU
{40F00E7F-84D7-4886-95F4-DC96BDE90965}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{40F00E7F-84D7-4886-95F4-DC96BDE90965}.Debug|Any CPU.Build.0 = Debug|Any CPU
{40F00E7F-84D7-4886-95F4-DC96BDE90965}.Release|Any CPU.ActiveCfg = Release|Any CPU
{40F00E7F-84D7-4886-95F4-DC96BDE90965}.Release|Any CPU.Build.0 = Release|Any CPU
{3EDC6F81-2C49-4B8D-9AAC-5B7C40D5A1CF}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{3EDC6F81-2C49-4B8D-9AAC-5B7C40D5A1CF}.Debug|Any CPU.Build.0 = Debug|Any CPU
{3EDC6F81-2C49-4B8D-9AAC-5B7C40D5A1CF}.Release|Any CPU.ActiveCfg = Release|Any CPU
{3EDC6F81-2C49-4B8D-9AAC-5B7C40D5A1CF}.Release|Any CPU.Build.0 = Release|Any CPU
{633C9C98-0782-4CFC-9D26-F27E77FA11EC}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{633C9C98-0782-4CFC-9D26-F27E77FA11EC}.Debug|Any CPU.Build.0 = Debug|Any CPU
{633C9C98-0782-4CFC-9D26-F27E77FA11EC}.Release|Any CPU.ActiveCfg = Release|Any CPU
{633C9C98-0782-4CFC-9D26-F27E77FA11EC}.Release|Any CPU.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
@@ -58,5 +94,11 @@ Global
{722606E4-DB7E-42DE-9627-A70459ABED8A} = {3B092566-9A2F-4C00-BFF1-90C0A6BE8C62}
{DD0E1D72-6861-41E8-9737-E5875550B39D} = {BC3A0A95-4B22-4523-A1C5-94699F586D1B}
{83457B34-2F66-4566-97A8-EB5CA2FAFC3E} = {15EDB083-E0AC-4CDC-98EC-A67EFE143C29}
{7F1803DA-299D-4839-876D-19B4031C1246} = {3B092566-9A2F-4C00-BFF1-90C0A6BE8C62}
{937EC883-68C6-4F97-904E-4EC7149FEC72} = {3B092566-9A2F-4C00-BFF1-90C0A6BE8C62}
{96A3E018-461F-4466-9086-8FFBA89DDE35} = {3B092566-9A2F-4C00-BFF1-90C0A6BE8C62}
{40F00E7F-84D7-4886-95F4-DC96BDE90965} = {3B092566-9A2F-4C00-BFF1-90C0A6BE8C62}
{3EDC6F81-2C49-4B8D-9AAC-5B7C40D5A1CF} = {3B092566-9A2F-4C00-BFF1-90C0A6BE8C62}
{633C9C98-0782-4CFC-9D26-F27E77FA11EC} = {3B092566-9A2F-4C00-BFF1-90C0A6BE8C62}
EndGlobalSection
EndGlobal