Support for non-UTF encoding in xml parser #332
Thanks for reporting. My parser indeed works on decoded characters, so you need to decode it in the right way before going through the XML event pipe. This can be done with fs2-data by using the appropriate import which brings the decoder in scope from The parser operates then on these bytes without changing the decoding if it does not match the one provided. I remember thinking about implementing this behavior but I decided to stay with an implementation that is not aware of the encoding. It trusts blindly the strings that were decoded for it. In the case of your issue, I am not entirely sure how the string is decoded from bytes, I should have a closer look at it. |
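To make the "decode it in the right way before parsing" point concrete, here is a minimal sketch that uses only the JDK. It deliberately shows no fs2-data API; the object name `EncodingMismatch` and its fields are hypothetical and exist only for illustration. It assumes the source file itself is saved as UTF-8.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical illustration: the same bytes decoded with two different charsets.
object EncodingMismatch {
  val xml: String =
    """<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>"""

  // The wire representation: 'ü' is the single byte 0xFC in ISO-8859-1.
  val bytes: Array[Byte] = xml.getBytes(StandardCharsets.ISO_8859_1)

  // Decoding as UTF-8 cannot interpret the lone 0xFC byte, so the JDK
  // substitutes the replacement character U+FFFD.
  val asUtf8: String = new String(bytes, StandardCharsets.UTF_8)

  // Decoding with the charset the bytes were written in round-trips cleanly.
  val asLatin1: String = new String(bytes, StandardCharsets.ISO_8859_1)

  def main(args: Array[String]): Unit = {
    println(asLatin1.contains("Günther"))
    println(asUtf8.contains("Günther"))
  }
}
```

A parser that only ever sees `asUtf8` has no way to recover the original name, which is why the decoding step has to be right before the characters reach the XML event pipe.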
Thanks for the response! I'm digging into this again since it's all rather confusing. Currently looking at https://www.w3.org/TR/2008/REC-xml-20081126/#charencoding Which says:
Which again suggests to me, that XML-parsing should operate directly on byte-streams, rather than decoded |
Parsed entities are DTD related (you have internal and external ones). What this says is that every externally defined entity (i.e. in a DTD that is physically in another file) might use a different encoding. |
Basically the approach taken in fs2-data is:
In your case, I would expect |
Right. I think this situation is testing when there is no declared charset, except for the Still, I think you are right that this might be a problem to solve in http4s. Further reading:
|
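One way the "no charset in the Content-Type, only in the document" case could be handled on the caller's side is to peek at the raw bytes before decoding. The following is only a sketch under stated assumptions, not what fs2-data or http4s actually do: `PrologSniffer`, `sniffCharset` and `decode` are hypothetical names, the UTF-8 fallback is an assumption, and the approach only works for ASCII-compatible encodings without a byte order mark (so it would not detect UTF-16).

```scala
import java.nio.charset.{Charset, StandardCharsets}
import scala.util.Try

// Hypothetical helper: read the encoding pseudo-attribute from the XML
// declaration by looking at the leading bytes, then decode accordingly.
object PrologSniffer {
  // Matches an XML declaration at the very start of the input and captures
  // the value of its encoding pseudo-attribute.
  private val EncodingDecl =
    """^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""".r

  def sniffCharset(
      bytes: Array[Byte],
      default: Charset = StandardCharsets.UTF_8 // assumed fallback
  ): Charset = {
    // ISO-8859-1 maps every byte to exactly one char, so this never fails
    // and leaves the ASCII-only declaration readable.
    val head = new String(bytes.take(256), StandardCharsets.ISO_8859_1)
    EncodingDecl
      .findFirstMatchIn(head)
      .flatMap(m => Try(Charset.forName(m.group(1))).toOption)
      .getOrElse(default)
  }

  // Decode the whole payload with the sniffed charset.
  def decode(bytes: Array[Byte]): String =
    new String(bytes, sniffCharset(bytes))
}
```

With something like this, the decoded string can be handed to a character-based parser unchanged, at the cost of buffering enough of the input to see the declaration.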
In the test, can you try making an implicit ISO-8859-1 charset available in scope of the test and see if it solves it? |
Right, so the test is currently declared like this:

  test("parse omitted charset and 8-Bit MIME Entity") {
    // https://datatracker.ietf.org/doc/html/rfc7303#section-8.3
    encodingTest(
      Chunk.array(
        """<?xml version="1.0" encoding="iso-8859-1"?><hello name="Günther"/>""".getBytes(
          StandardCharsets.ISO_8859_1
        )
      ),
      "application/xml",
      "Günther",
    )
  }

That fails. However, if I specify the charset like this, then it passes:

diff --git a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
index 8739d94..cfd9622 100644
--- a/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
+++ b/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala
@@ -203,7 +203,7 @@ class ScalaXmlSuite extends CatsEffectSuite with ScalaCheckSuite {
           StandardCharsets.ISO_8859_1
         )
       ),
-      "application/xml",
+      "application/xml; charset=iso-8859-1",
       "Günther",
     )
   }

But the whole point of that test is to pass without specifying the charset. test("parse omitted charset and 8-Bit MIME Entity") |
No I meant something adding |
Sorry, I guess I'm confused 😕 I understand the code changes you are suggesting (and I expect it will work), but I don't understand what it will demonstrate? Since in practice there would be no way to know what the |
@rossabaker if you have a moment to weigh in here it would be appreciated 🙏 |
@satabin I solved the problem with a small hack in Would you see this as something we could pull into If we don't pull it in or explore alternatives, I would close this issue as http4s has a way forward. |
I am not a fan of what I did with the The current approach in fs2-data 1.x is to have characters decoded outside of fs2-data, which does not work well with this kind of format. I would rather not integrate it currently and think about a better approach for this problem and abstraction and integrate it in 2.0. WDYT? |
Very excited about the enhanced XML support in 1.4.0 :) I've been experimenting with it in http4s/http4s-scala-xml#25 and running into trouble with non-UTF encodings. FTR I'm no expert in these things :)
For example this request:
as used in this test:
https://github.com/http4s/http4s-scala-xml/blob/1ca64f2ab7ef500d384d2ec5f8caf88df600e6a6/scala-xml/src/test/scala/org/http4s/scalaxml/ScalaXmlSuite.scala#L198-L209
Furthermore the RFC specifies:
https://datatracker.ietf.org/doc/html/rfc7303#section-8.3
I'm not sure if there is a way to support this without an XML parser that operates directly on bytes instead of chars/strings 😕 any thoughts? Thanks!