ZnUTF8Encoder filters out any U+FEFF #134

Rinzwind · 2024-03-07T16:04:38Z

ZnUTF8Encoder filters out any U+FEFF. For example, the following answers '42' rather than 'FEFF':

(ZnUTF8Encoder new decodeBytes: #[16r41 16rEF 16rBB 16rBF 16r42])
	second codePoint printStringBase: 16

This doesn’t quite seem to be in accordance with the Unicode standard, which says (in section ‘23.8 Specials’ of ‘The Unicode Standard, Version 15.0 – Core Specification’):

For historical reasons, the character U+FEFF used for the byte order mark is named ZERO WIDTH NO-BREAK SPACE. […] Because the byte-swapped version U+FFFE is a noncharacter, when an interpreting process finds U+FFFE as the first character, it signals either that the process has encountered text that is of the incorrect byte order or that the file is not valid Unicode text. […] In UTF-8, the BOM corresponds to the byte sequence <EF₁₆ BB₁₆ BF₁₆>. Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. […] For compatibility with versions of the Unicode Standard prior to Version 3.2, the code point U+FEFF has the word-joining semantics of zero width no-break space when it is not used as a BOM. […] Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space. […] If U+FEFF had only the semantics of a signature code point, it could be freely deleted from text without affecting the interpretation of the rest of the text. […] Unfortunately, U+FEFF also has significance as a character. As a zero width no-break space, it indicates that line breaks are not allowed between the adjoining characters. Thus U+FEFF affects the interpretation of text and cannot be freely deleted.


svenvc · 2024-03-07T18:40:40Z

Well, I am sure you can see that this was done by design: the occurrence of a BOM is a no-op for UTF-8, should not occur and is not recommended. Hence this pragmatic decision.

I know that it is used on Windows, that is why there are some provisions for writing a BOM.

Should version 3.2 compatibility be a thing, given that it is 20+ years old?

Up to now there have not been any complaints about this behaviour.

What is the practical issue that you encountered?

What do you think should happen/change?

Rinzwind · 2024-03-07T20:19:18Z

Shouldn’t it at most filter out an initial U+FEFF? The example for ucnv_detectUnicodeSignature in the ICU libraries documentation might help. If ‘input’ is changed to:

char input[] = { '\xEF','\xBB','\xBF', '\xEF','\xBB','\xBF', '\x41', '\xEF','\xBB','\xBF', '\x42' };

The output is as follows, so only the initial U+FEFF is filtered out:

feff 0041 feff 0042

svenvc · 2024-03-12T10:53:28Z

In Python I see this:

$ python3
Python 3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> bytes = b'\xEF\xBB\xBF\xEF\xBB\xBF\x41\xEF\xBB\xBF\x42'
>>> bytes.decode("utf-8")
'\ufeff\ufeffA\ufeffB'
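
For comparison, Python's standard library also has a "utf-8-sig" codec, which treats only a leading U+FEFF as a signature. The following is a minimal sketch on the same input; the variable names are illustrative and not taken from the thread.

```python
# Same byte sequence as in the transcript above.
data = b'\xEF\xBB\xBF\xEF\xBB\xBF\x41\xEF\xBB\xBF\x42'

# Plain "utf-8" keeps every U+FEFF, including the leading one.
plain = data.decode("utf-8")

# "utf-8-sig" drops a single leading BOM and keeps all later U+FEFF.
signed = data.decode("utf-8-sig")

print([f"{ord(c):04x}" for c in plain])
print([f"{ord(c):04x}" for c in signed])
```

The second result has one U+FEFF fewer than the first, which matches the ICU output quoted earlier (feff 0041 feff 0042).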

svenvc · 2024-03-12T10:54:07Z

In Swift I see this:

$ swift repl
Welcome to Apple Swift version 5.10 (swiftlang-5.10.0.13 clang-1500.3.9.4).
Type :help for assistance.
  1> import Foundation 
  2> let byteArray: [UInt8] = [239, 187, 191, 239, 187, 191, 65, 239, 187, 191, 66] 
byteArray: [UInt8] = 11 values {
  [0] = 239
  [1] = 187
  [2] = 191
  [3] = 239
  [4] = 187
  [5] = 191
  [6] = 65
  [7] = 239
  [8] = 187
  [9] = 191
  [10] = 66
}
  3> String(data: Data(bytes: byteArray), encoding: .utf8) 
$R0: String? = "AB"

svenvc · 2024-03-12T11:02:02Z

We could look at Java and JavaScript as well.

I am rather a fan of Swift's Unicode approach (in general, not specifically this); it is quite modern.

Like I said, it was a design decision to skip BOM sequences like that.

And my question remains: what practical issue did you encounter?

I guess it would not be impossible to add an option to ZnUTFEncoder (the superclass of all UTF encoders) to control this behaviour, but then the next question is what the default should be. (There is already a strict option for byte encoders.)

Still we would be adding complexity and I currently do not see why.

Rinzwind · 2024-03-14T17:51:47Z

Note that the Swift REPL doesn’t escape U+FEFF when printing a string. This is clearer:

  4> String(data: Data(bytes: byteArray), encoding: .utf8)!.unicodeScalars.map{ String($0.value, radix: 16) }
$R1: [String] = 5 values {
  [0] = "feff"
  [1] = "feff"
  [2] = "41"
  [3] = "feff"
  [4] = "42"
}

In the following, the string is different depending on whether ‘utf16BigEndian’ or ‘utf16’ is used:

  5> let byteArray2: [UInt8] = [254, 255, 254, 255, 0, 65, 254, 255, 0, 66]
byteArray2: [UInt8] = 10 values {
[…]
}
  6> String(data: Data(bytes: byteArray2), encoding: .utf16BigEndian)!.unicodeScalars.map{ String($0.value, radix: 16) } 
$R2: [String] = 5 values {
  [0] = "feff"
  [1] = "feff"
  [2] = "41"
  [3] = "feff"
  [4] = "42"
}
  7> String(data: Data(bytes: byteArray2), encoding: .utf16)!.unicodeScalars.map{ String($0.value, radix: 16) }
$R3: [String] = 4 values {
  [0] = "feff"
  [1] = "41"
  [2] = "feff"
  [3] = "42"
}

Section ‘3.10 Unicode Encoding Schemes’ in the Core Specification says that “in UTF-16BE, an initial byte sequence <FE FF> is interpreted as U+FEFF ZERO WIDTH NO-BREAK SPACE”, while “in the UTF-16 encoding scheme, an initial byte sequence corresponding to U+FEFF is interpreted as a byte order mark”.
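
The same distinction can be reproduced with Python's "utf-16-be" and "utf-16" codecs. This is only a sketch on the byte values used in the Swift session above; the variable names are illustrative.

```python
# Same bytes as byteArray2: FE FF, FE FF, 00 41, FE FF, 00 42.
data = bytes([254, 255, 254, 255, 0, 65, 254, 255, 0, 66])

# UTF-16BE: the byte order is fixed, so an initial FE FF is an ordinary U+FEFF.
as_utf16be = data.decode("utf-16-be")

# UTF-16: an initial FE FF is consumed as the byte order mark.
as_utf16 = data.decode("utf-16")

print([f"{ord(c):x}" for c in as_utf16be])
print([f"{ord(c):x}" for c in as_utf16])
```

As in the Swift output, only the "utf-16" decoding loses the first U+FEFF; every later occurrence is kept in both cases.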

Deletion of U+FEFF when it is not used as a BOM may lead to a security problem as discussed in section ‘3.5 Deletion of Code Points’ in Unicode Technical Report #36 ‘Unicode Security Considerations’, which is referred to in the section ‘Modification’ on conformance clause C7 within section ‘3.2 Conformance Requirements’ in the Core Specification.
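
A small sketch of the kind of problem that report describes: a filter checks the text first, and a later step silently deletes U+FEFF. Both functions here are hypothetical and only stand in for a real filter and a real decoder.

```python
def looks_safe(text: str) -> bool:
    """Hypothetical filter: reject text that contains '<script'."""
    return "<script" not in text.lower()


def strip_feff(text: str) -> str:
    """Hypothetical later step that deletes every U+FEFF."""
    return text.replace("\ufeff", "")


# U+FEFF hidden inside the tag name lets the text pass the filter.
payload = "<scr\ufeffipt>alert(1)</scr\ufeffipt>"

passed_filter = looks_safe(payload)
after_deletion = strip_feff(payload)

print(passed_filter)               # the filter accepts the payload
print(looks_safe(after_deletion))  # the same filter rejects it after deletion
```

The text that passed the check is not the text that is eventually processed, which is why deleting code points after validation is risky.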


svenvc · 2024-04-19T14:19:01Z

fc59039

Maybe not 100% what you want, but at least there is now the option to see BOM occurrences.

Thx for the discussion.

Rinzwind · 2024-05-01T11:18:28Z

OK, thanks! It would seem better, though, for the option to also disable the endianness swapping after the U+FFFE in the following. I would expect the first, third and fifth characters to be the same, but they are not:

((ZnCharacterEncoder newForEncoding: 'utf16be')
	ignoreByteOrderMark: false;
	decodeBytes: (ByteArray readHexFrom: ('00FB FEFF 00FB FFFE 00FB' reject: #isSeparator)))
		asArray collect: [ :character |
			character codePoint printStringBase: 16 nDigits: 4 ]
"⇒ #('00FB' 'FEFF' '00FB' 'FFFE' 'FB00')"

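For comparison, here is a sketch of the same input under Python's "utf-16-be" codec, which does no byte-order switching. It only illustrates the result expected above and says nothing about how Zinc implements decoding.

```python
# Same hex input as in the Smalltalk snippet; fromhex skips the spaces.
data = bytes.fromhex("00FB FEFF 00FB FFFE 00FB")

# With a fixed big-endian byte order, FF FE is decoded as U+FFFE
# and has no effect on the code units that follow it.
decoded = data.decode("utf-16-be")
code_points = [f"{ord(c):04X}" for c in decoded]

print(code_points)
```

Here the first, third and fifth code points are all 00FB.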

svenvc pushed a commit that referenced this issue Apr 19, 2024

Add an option #ignoreByteOrderMark to ZnUTFEncoders (true by default)

fc59039

Add #testUTF8ByteOrderMarkSignificant #134
