Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better handling of percent-encoded image references #605

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

benoit-pierre
Copy link
Contributor

@benoit-pierre benoit-pierre commented Oct 19, 2024

  • fix DecodeHTMLUrlString so it does not mangle non-ASCII characters
  • fix lunasvgDrawImageHelper to properly handle a percent-encoded image URL

Cf. koreader/koreader#12656.


This change is Reviewable

Copy link
Member

@Frenzie Frenzie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

IMAGE_SOURCE_FROM_BYTES sounds somewhat useful. Nothing like that is used anywhere?

Comment on lines -1762 to -1764
if ( stream.isNull() ) {
// Try again in case cover_image_path is percent-encoded
cover_image_path = LVCombinePaths(codeBase, DecodeHTMLUrlString(cover_image_href));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we don't need this "try again" ? We're still doing it elsewhere.
The thing is that we may get crappy EPUBs, with sometimes the %escapes literally in the zip item names (that is, the correctly %encoded attribute values in some XML, once decoded, won't be found, because the zip contains the original value with %), so these "try again" were added to handle these crappy cases. Also, zip item names don't have any encoding, so we meet stuff as bytes, which may be or not be utf8.

Some possibly related issues that I managed to find again:
koreader/koreader#7661
Bottomest item in #326.
Search for "try again" in ef95bcc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we don't need this "try again" ? We're still doing it elsewhere. The thing is that we may get crappy EPUBs, with sometimes the %escapes literally in the zip item names (that is, the correctly %encoded attribute values in some XML, once decoded, won't be found, because the zip contains the original value with %), so these "try again" were added to handle these crappy cases.

In theory, but are there actually instances of this?

Also, zip item names don't have any encoding, so we meet stuff as bytes, which may be or not be utf8.

Some possibly related issues that I managed to find again: koreader/koreader#7661 Bottomest item in #326. Search for "try again" in ef95bcc.

At least the current code in ldomNode::getObjectImageSource does it in the right order: correct behavior first (percent-encoded), ugly workaround second…

But that code in ldomDocumentFragmentWriter::convertHref seems icky:

// Depending on what's calling us, href may or may not have
// gone thru DecodeHTMLUrlString() to decode %-encoded bits.
// We'll need to try again with DecodeHTMLUrlString() if not
// initially found in "pathSubstitutions" (whose filenames went
// thru DecodeHTMLUrlString(), and so did 'codeBase').

Possibly percent-decoding a string 2 times…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, but are there actually instances of this?

I think so, if the code "try again with" is there :) (May be - don't remember - the original code was bad, and I added the first chunk to do it right, keeping the "bad" chunk as a fallbadk (as it has most often never hurt us) just in case.

At least the current code in ldomNode::getObjectImageSource does it in the right order: correct behavior first (percent-encoded), ugly workaround second…

So, re-order existing "try again" code in the right order - rather than removing the "try again".

I don't remember all this and don't want to dig in - so trusting you. Just remember crengine doesn't enforce/require strictly-perfect-EPUB, but was made (with such "try again" branches) to handle the crappy ones we've met over the years.

@benoit-pierre
Copy link
Contributor Author

Makes sense to me

IMAGE_SOURCE_FROM_BYTES sounds somewhat useful. Nothing like that is used anywhere?

Nope.

@benoit-pierre benoit-pierre marked this pull request as draft November 11, 2024 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants