ZnUTF8Encoder filters out any U+FEFF #134

Open
Rinzwind opened this issue Mar 7, 2024 · 8 comments


Rinzwind commented Mar 7, 2024

ZnUTF8Encoder filters out any U+FEFF. For example, the following answers '42' rather than 'FEFF':

(ZnUTF8Encoder new decodeBytes: #[16r41 16rEF 16rBB 16rBF 16r42])
	second codePoint printStringBase: 16

This doesn’t quite seem to be in accordance with the Unicode standard, which says (in section ‘23.8 Specials’ of ‘The Unicode Standard, Version 15.0 – Core Specification’):

For historical reasons, the character U+FEFF used for the byte order mark is named ZERO WIDTH NO-BREAK SPACE. […] Because the byte-swapped version U+FFFE is a noncharacter, when an interpreting process finds U+FFFE as the first character, it signals either that the process has encountered text that is of the incorrect byte order or that the file is not valid Unicode text. […] In UTF-8, the BOM corresponds to the byte sequence <EF₁₆ BB₁₆ BF₁₆>. Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. […] For compatibility with versions of the Unicode Standard prior to Version 3.2, the code point U+FEFF has the word-joining semantics of zero width no-break space when it is not used as a BOM. […] Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space. […] If U+FEFF had only the semantics of a signature code point, it could be freely deleted from text without affecting the interpretation of the rest of the text. […] Unfortunately, U+FEFF also has significance as a character. As a zero width no-break space, it indicates that line breaks are not allowed between the adjoining characters. Thus U+FEFF affects the interpretation of text and cannot be freely deleted.

Owner

svenvc commented Mar 7, 2024

Well, I am sure you can see that this was done by design: the occurrence of a BOM is a no-op for UTF-8, should not occur and is not recommended. Hence this pragmatic decision.

I know that it is used on Windows, that is why there are some provisions for writing a BOM.

Should version 3.2 compatibility be a thing, since that is 20+ years old?

So far there have been no complaints about this behaviour.

What is the practical issue that you encountered?

What do you think should happen or change?

Author

Rinzwind commented Mar 7, 2024

Shouldn’t it at most filter out an initial U+FEFF? The example for ucnv_detectUnicodeSignature in the ICU libraries documentation might help. If ‘input’ is changed to:

char input[] = { '\xEF','\xBB','\xBF', '\xEF','\xBB','\xBF', '\x41', '\xEF','\xBB','\xBF', '\x42' };

The output is as follows, so only the initial U+FEFF is filtered out:

feff 0041 feff 0042
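
For comparison, a minimal Python sketch of the same distinction, using only the standard codecs. The helper name `code_points` is made up for illustration. The `utf-8` codec keeps every U+FEFF, while `utf-8-sig` treats only a leading one as a signature and drops it:

```python
# Same bytes as the ICU example above: BOM, BOM, 'A', BOM, 'B'.
data = bytes([0xEF, 0xBB, 0xBF, 0xEF, 0xBB, 0xBF, 0x41, 0xEF, 0xBB, 0xBF, 0x42])


def code_points(text):
    """Return each character as a 4-digit lowercase hex code point (illustrative helper)."""
    return [format(ord(ch), "04x") for ch in text]


# Plain UTF-8 decoding keeps every U+FEFF, including the leading one.
plain = code_points(data.decode("utf-8"))

# 'utf-8-sig' skips a single leading signature and leaves later U+FEFF alone.
signed = code_points(data.decode("utf-8-sig"))

print(" ".join(plain))
print(" ".join(signed))
```

The second line printed matches the ICU output quoted above.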

Owner

svenvc commented Mar 12, 2024

In Python I see this:

$ python3
Python 3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> bytes = b'\xEF\xBB\xBF\xEF\xBB\xBF\x41\xEF\xBB\xBF\x42'
>>> bytes.decode("utf-8")
'\ufeff\ufeffA\ufeffB'

Owner

svenvc commented Mar 12, 2024

In Swift I see this:

$ swift repl
Welcome to Apple Swift version 5.10 (swiftlang-5.10.0.13 clang-1500.3.9.4).
Type :help for assistance.
  1> import Foundation 
  2> let byteArray: [UInt8] = [239, 187, 191, 239, 187, 191, 65, 239, 187, 191, 66] 
byteArray: [UInt8] = 11 values {
  [0] = 239
  [1] = 187
  [2] = 191
  [3] = 239
  [4] = 187
  [5] = 191
  [6] = 65
  [7] = 239
  [8] = 187
  [9] = 191
  [10] = 66
}
  3> String(data: Data(bytes: byteArray), encoding: .utf8) 
$R0: String? = "AB"

Owner

svenvc commented Mar 12, 2024

We could look at Java and JavaScript as well.

I am rather a fan of Swift's Unicode approach (in general not specifically this), it is quite modern.

Like I said, it was a design decision to skip BOM sequences like that.

And my question remains: what practical issue did you encounter?

I guess it would not be impossible to add an option to UTFEncoder (the superclass of all UTF encoders) to control this behaviour but then the next question is what the default should be. (There is already a strict option for byte encoders).

Still we would be adding complexity and I currently do not see why.
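
To make the trade-off concrete, here is a rough Python sketch of what such an option could look like. It is not Zinc code: the function `decode_utf8` and the flag name `ignore_byte_order_mark` are hypothetical, and the default simply mirrors the current "skip every BOM sequence" behaviour described in this thread.

```python
ZWNBSP = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE / byte order mark


def decode_utf8(data: bytes, ignore_byte_order_mark: bool = True) -> str:
    """Decode UTF-8, optionally dropping every U+FEFF (hypothetical helper).

    With the flag set, all occurrences are removed, not just a leading one,
    which is the behaviour this issue is about. With the flag cleared, the
    decoded text is returned untouched.
    """
    text = data.decode("utf-8")
    if ignore_byte_order_mark:
        return text.replace(ZWNBSP, "")
    return text


sample = bytes([0x41, 0xEF, 0xBB, 0xBF, 0x42])  # 'A', U+FEFF, 'B'
filtered = decode_utf8(sample)
preserved = decode_utf8(sample, ignore_byte_order_mark=False)
```

The extra complexity is one branch and one parameter. The harder question, as noted above, is which default is right.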

Author

Rinzwind commented Mar 14, 2024

Note that the Swift REPL doesn’t escape U+FEFF when printing a string. This is clearer:

  4> String(data: Data(bytes: byteArray), encoding: .utf8)!.unicodeScalars.map{ String($0.value, radix: 16) }
$R1: [String] = 5 values {
  [0] = "feff"
  [1] = "feff"
  [2] = "41"
  [3] = "feff"
  [4] = "42"
}
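
The same effect can be checked in Python, as a small sketch: U+FEFF has no visible glyph, so the decoded string may look like "AB" when displayed, but its length and code points show that nothing was dropped.

```python
byte_array = bytes([239, 187, 191, 239, 187, 191, 65, 239, 187, 191, 66])
text = byte_array.decode("utf-8")

# ascii() escapes every non-ASCII character, so the U+FEFF occurrences become visible.
escaped = ascii(text)

# List the scalar values in hex, like the Swift expression above.
scalars = [format(ord(ch), "x") for ch in text]

print(len(text), escaped, scalars)
```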

In the following, the string is different depending on whether ‘utf16BigEndian’ or ‘utf16’ is used:

  5> let byteArray2: [UInt8] = [254, 255, 254, 255, 0, 65, 254, 255, 0, 66]
byteArray2: [UInt8] = 10 values {
[]
}
  6> String(data: Data(bytes: byteArray2), encoding: .utf16BigEndian)!.unicodeScalars.map{ String($0.value, radix: 16) } 
$R2: [String] = 5 values {
  [0] = "feff"
  [1] = "feff"
  [2] = "41"
  [3] = "feff"
  [4] = "42"
}
  7> String(data: Data(bytes: byteArray2), encoding: .utf16)!.unicodeScalars.map{ String($0.value, radix: 16) }
$R3: [String] = 4 values {
  [0] = "feff"
  [1] = "41"
  [2] = "feff"
  [3] = "42"
}

Section ‘3.10 Unicode Encoding Schemes’ in the Core Specification says that “in UTF-16BE, an initial byte sequence <FE FF> is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE”, while “in the UTF-16 encoding scheme, an initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark”.
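
Python's standard codecs appear to follow the same rule, which gives a quick cross-check of the Swift results above. This is only a sketch of observed codec behaviour, not a statement about how any encoder should be implemented:

```python
byte_array2 = bytes([254, 255, 254, 255, 0, 65, 254, 255, 0, 66])

# UTF-16BE: the byte order is fixed, so an initial <FE FF> is an ordinary U+FEFF.
big_endian = [format(ord(ch), "x") for ch in byte_array2.decode("utf-16-be")]

# UTF-16: an initial <FE FF> is a byte order mark and is consumed.
# Later occurrences are kept as characters.
with_bom = [format(ord(ch), "x") for ch in byte_array2.decode("utf-16")]

print(big_endian)
print(with_bom)
```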

Deletion of U+FEFF when it is not used as a BOM may lead to a security problem as discussed in section ‘3.5 Deletion of Code Points’ in Unicode Technical Report #36 ‘Unicode Security Considerations’, which is referred to in the section ‘Modification’ on conformance clause C7 within section ‘3.2 Conformance Requirements’ in the Core Specification.
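
A small Python sketch of the kind of problem that report describes. The filter `looks_safe` and the cleanup step `strip_all_feff` are invented for illustration: a check runs on the text first, and a later stage silently deletes U+FEFF, which joins two harmless fragments into the string the check was meant to block.

```python
def looks_safe(text: str) -> bool:
    """Naive, hypothetical filter: reject text containing '<script'."""
    return "<script" not in text.lower()


def strip_all_feff(text: str) -> str:
    """Stand-in for a decoder or cleanup step that deletes every U+FEFF."""
    return text.replace("\ufeff", "")


payload = "<scr\ufeffipt>"

passes_filter = looks_safe(payload)                         # checked before deletion
safe_after_deletion = looks_safe(strip_all_feff(payload))   # checked after deletion

print(passes_filter, safe_after_deletion)
```

The payload passes the check as written, but no longer passes once the U+FEFF has been deleted.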

svenvc pushed a commit that referenced this issue Apr 19, 2024
Owner

svenvc commented Apr 19, 2024

fc59039

Maybe not 100% what you want, but at least there is now the option to see BOM occurrences.

Thx for the discussion.

Author

Rinzwind commented May 1, 2024

OK, thanks! It would seem better, though, for the option to also disable the endianness swapping after the U+FFFE in the following. I would expect the first, third and fifth characters to be the same, but they are not:

((ZnCharacterEncoder newForEncoding: 'utf16be')
	ignoreByteOrderMark: false;
	decodeBytes: (ByteArray readHexFrom: ('00FB FEFF 00FB FFFE 00FB' reject: #isSeparator)))
		asArray collect: [ :character |
			character codePoint printStringBase: 16 nDigits: 4 ]
"⇒ #('00FB' 'FEFF' '00FB' 'FFFE' 'FB00')"

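For reference, a Python sketch of the behaviour I would expect, using the standard `utf-16-be` codec on the same bytes. With an explicitly big-endian encoding, U+FFFE is decoded as an ordinary code point and the byte order does not change afterwards:

```python
data = bytes.fromhex("00FB FEFF 00FB FFFE 00FB")

# Decode with a fixed byte order and print each code point as 4 hex digits.
decoded = [format(ord(ch), "04X") for ch in data.decode("utf-16-be")]

print(decoded)
```

Here the first, third and fifth characters are all U+00FB.
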
syrel pushed a commit to feenkcom/gtoolkit that referenced this issue May 10, 2024
feenkcom/zinc@2b61aa by svenvc
Merge 3f60e6d55077d9c601d4245d4c060a5d522f54de

feenkcom/zinc@9891ac by svenvc
added ZnLossyUTF8Encoder

feenkcom/zinc@3f60e6 by Sven Van Caekenberghe
Merge pull request #14 from svenvc/master

Update Zinc: Add an option #ignoreByteOrderMark to ZnUTFEncoders (true by default)

feenkcom/zinc@fc5903 by svenvc
Add an option #ignoreByteOrderMark to ZnUTFEncoders (true by default)

Add #testUTF8ByteOrderMarkSignificant

svenvc/zinc#134
