- utf8everywhere.cpp on Linux
- escaping Unicode characters
- Homoglyphs
- Modifier and Combining Marks
- Questions?
There are two relevant code pages:
- the OEM code page for use by legacy console applications,
- the ANSI code page for use by legacy GUI applications
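To see why the distinction matters, the following Python sketch decodes the same byte under OEM and ANSI code pages. The byte 0xE4 and the code pages 437, 850 and 1250 are assumptions chosen for illustration, and Python's codecs cp437, cp850 and cp1250 merely stand in for the Windows code pages.

```python
# One byte, three code pages: the byte value alone does not identify a character.
raw = bytes([0xE4])

decoded = {cp: raw.decode(cp) for cp in ("cp437", "cp850", "cp1250")}
for cp, ch in decoded.items():
    print(f"{cp}: {ch!r} U+{ord(ch):04X}")

# UTF-8 bytes read back through an OEM code page turn into mojibake.
mojibake = "ä".encode("utf-8").decode("cp437")
print(repr(mojibake))
```

Each code page maps the byte to a different character, which is why text written by a GUI application can look wrong in a legacy console and vice versa.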
There are several ways to get the active code pages:
- chcp: Displays or sets the active code page number. Some common code pages:
  - 437 IBM437 OEM United States
  - 850 ibm850 OEM Multilingual Latin 1; Western European (DOS)
  - 1250 windows-1250 ANSI Central European; Central European (Windows)
  - 65001 utf-8 Unicode (UTF-8)
  - 1200 utf-16 Unicode UTF-16, little endian byte order (BMP of ISO 10646)
- PowerShell: Get-WinSystemLocale
- PowerShell via registry: Get-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage | Select-Object OEMCP, ACP
- Windows C API: GetACP(), GetOEMCP(), GetConsoleOutputCP()
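The Windows C API functions listed above can also be reached from a script. A minimal Python sketch using ctypes follows; the helper name active_code_pages is hypothetical, and the function returns None on any platform other than Windows.

```python
import ctypes
import sys


def active_code_pages():
    """Return the active code pages on Windows, or None elsewhere."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return {
        "ACP": kernel32.GetACP(),
        "OEMCP": kernel32.GetOEMCP(),
        # May be 0 if the process has no console attached.
        "ConsoleOutputCP": kernel32.GetConsoleOutputCP(),
    }


print(active_code_pages())
```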
As of Windows version 1903, an application can set its own process code page to UTF-8 through its application manifest, as described in "Use UTF-8 code pages in Windows apps".
If you find it hard to believe that Greek letters are treated case-insensitively as well, you can enable case sensitivity for a specific folder and see for yourself:
fsutil.exe file setCaseSensitiveInfo <path> enable
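Greek letters have upper- and lowercase pairs just like Latin ones. The Python check below is offered as an analogy only: Python applies the Unicode case mappings, whereas the file system uses its own upcase table, so the two need not agree for every character.

```python
# Written as escapes because the Greek capitals Α, Β and Μ look like Latin A, B and M.
upper = "\u0391\u0392\u0393\u039c\u03a9"  # ΑΒΓΜΩ
lower = "\u03b1\u03b2\u03b3\u03bc\u03c9"  # αβγμω

print(upper.lower() == lower)
print(lower.upper() == upper)
# casefold() is the mapping intended for case-insensitive comparison.
print(upper.casefold() == lower.casefold())
```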
"🐱".encode("utf-8")
returns the UTF-8 encoding b'\xf0\x9f\x90\xb1' in Python, while hex(ord("🐱")) returns the code point as a hexadecimal value, '0x1f431'.
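How the code point 0x1f431 turns into those four bytes can be shown by encoding it by hand. The function below is an illustrative sketch (the name utf8_encode is hypothetical); in practice str.encode does this work.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x1F431))  # same bytes as "🐱".encode("utf-8")
```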
Converting a Unicode string to C++23 delimited escape sequences (\u{...}):
print("".join([f"\\u{{{ord(x):x}}}" for x in "A äöüß μ ‰ ఠ 😆🐱"]))
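For compilers that do not yet accept the C++23 delimited form, the same idea can emit classic universal character names (\uXXXX and \UXXXXXXXX). This is a sketch for printable text under that assumption; the function name is hypothetical, and printable ASCII is passed through unchanged apart from escaping quotes and backslashes.

```python
def to_universal_character_names(text: str) -> str:
    """Escape non-ASCII characters as \\uXXXX or \\UXXXXXXXX."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if ch in '"\\':
            parts.append("\\" + ch)        # keep the literal well-formed
        elif 0x20 <= cp < 0x7F:
            parts.append(ch)               # printable ASCII stays readable
        elif cp <= 0xFFFF:
            parts.append(f"\\u{cp:04X}")   # exactly four hex digits
        else:
            parts.append(f"\\U{cp:08X}")   # exactly eight hex digits
    return "".join(parts)


print(to_universal_character_names("A äöüß μ ‰ ఠ 😆🐱"))
```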
MSVC's /utf-8 is shorthand for /source-charset:utf-8 and /execution-charset:utf-8, where /source-charset can be omitted if every source file carries a BOM.
/execution-charset affects string literals in the binary. Runtime file I/O is not affected.
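What "affects string literals in the binary" means can be mimicked in Python by encoding the same literal under two charsets. The choice of windows-1252 (cp1252) as the non-UTF-8 example is an assumption; Python's codecs only imitate what the compiler would store.

```python
literal = "ä"

# The same source character ends up as different bytes in the binary,
# depending on the execution charset.
for charset in ("utf-8", "cp1252"):
    stored = literal.encode(charset)
    print(f"{charset}: {stored.hex(' ')} ({len(stored)} byte(s))")
```

A program that prints such a literal only shows the right character if the consumer of the bytes, for example the console, assumes the same encoding.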
We use an .editorconfig file to tell Visual Studio to save files without a BOM, and use the compiler option to set the encoding.
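Whether files really are saved without a BOM can be checked with a short script. The sketch below is not part of the project; the helper names and the file patterns *.cpp and *.h are assumptions.

```python
import codecs
from pathlib import Path


def has_utf8_bom(path) -> bool:
    """True if the file starts with the UTF-8 byte order mark EF BB BF."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


def files_with_bom(root, patterns=("*.cpp", "*.h")):
    """List source files below root that still carry a UTF-8 BOM."""
    root = Path(root)
    return sorted(p for pattern in patterns
                  for p in root.rglob(pattern) if has_utf8_bom(p))
```

Calling files_with_bom(".") from the repository root lists any offending files; an empty list means every matched file is BOM-free.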
- Code Page Identifiers
- How can I manually determine the CodePage and Locale of the current OS
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Use UTF-8 code pages in Windows apps
- UTF-8 Everywhere
- Unicode: Going Down the Rabbit Hole - Peter Bindels - CppCon 2019
- Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics - Bob Steagall - CppCon 2018
- Wingdings and Webdings Symbols
- Change the case sensitivity of files and directories
- String and character literals
- cppreference: String literal
- Escape sequences
- Stack Overflow: Properly print utf8 characters in windows console
- Byte order mark
- Introduction to UTF Series' Articles, Paweł bbkr Pabian
- Stack Overflow: What is the difference between "combining characters" and "modifier letters"?
- Microsoft Documentation on setlocale: UTF-8 support
- /source-charset
- /execution-charset
- Wikipedia: Mojibake
- Combining character
- The Unicode® Standard, Version 15.0 – Core Specification
- µ U+00B5 Micro Sign
- μ U+03BC Greek Small Letter Mu
- U+0301 Combining Acute Accent
- U+02CA Modifier Letter Acute Accent
- 🐱 U+1F431 Cat Face
- 💩 U+1F4A9 Pile of Poo
- ☕ U+2615 Hot Beverage
- 🍛 U+1F35B Curry and Rice
- U+0338 Combining Long Solidus Overlay
- ; U+037E Greek Question Mark
- 🖖 U+1F596 Raised Hand with Part Between Middle and Ring Fingers, "Vulcan Salute"
- U+1F3FB Emoji Modifier Fitzpatrick Type-1-2
- U+1F3FF Emoji Modifier Fitzpatrick Type-6
- 👍 U+1F44D Thumbs Up Sign
- 🧙 U+1F9D9 Mage
- 🙋 U+1F64B Happy Person Raising One Hand
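The difference between the combining marks, modifier letters and emoji modifiers in this list can be inspected with Python's unicodedata module. This is a small sketch; the printed names come from the Unicode database bundled with the Python version in use.

```python
import unicodedata

# Combining acute accent, modifier letter acute accent, emoji skin tone modifier.
for ch in ("\u0301", "\u02ca", "\U0001F3FB"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category {unicodedata.category(ch)}, "
          f"combining class {unicodedata.combining(ch)}")

# A combining mark attaches to the preceding character and may compose under NFC.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# An emoji modifier follows its base emoji as a separate code point.
vulcan = "\U0001F596\U0001F3FB"
print(len(vulcan))
```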