Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RequestURL and ResponseURL #42

Merged
merged 11 commits into from
Jun 9, 2022
Merged

RequestURL and ResponseURL #42

merged 11 commits into from
Jun 9, 2022

Conversation

kmike
Copy link
Member

@kmike kmike commented May 21, 2022

This is a basic version of URL dependency; it'd allow Page Objects to request just the URL, without the response.
RequestURL is especially useful, as it allows Page Objects to skip the initial download.

My concerns about this PR: I really wanted to have some "smart" object for URL, not just a str subclass. For this, I considered 3 options:

  1. https://github.com/aio-libs/yarl (a top contender).
  2. https://github.com/python-hyper/hyperlink
  3. roll out a new implementation, based on w3lib

For yarl the issue is that it declares yarl.URL as "final", and also prohibits subclassing at runtime. So it's not possible to have a "marker" subclass like RequestURL. To overcome this, I tried using https://docs.python.org/3/library/typing.html#newtype; it works at runtime, but I was unable to make typing work well for attr converters; e.g. creating HttpRequest with anything but RequestURL instance was problematic.

Next, I looked into hyperlink, but the API is not perfect. For example, they provide url.click method, which is essentially a re-implementation of urljoin. I think the method name is misleading, especially in web-poet context, because to click properly you need to take base url in account, not just response url.

Also, if I'm not mistaken, these libraries don't handle all the corner cases (e.g. a case when path encoding is different from query encoding, which could be the case for non-utf8 pages).

The third option is also something I don't like :) I'd much prefer to use something existing, something like yarl, and contribute to it if needed. But it is still an option to roll out a new impementation. In this case, I think we should create w3lib.URL object, not an object in web-poet. w3lib.URL could expose functions from w3lib.url in a nice way; it doesn't have to be very complete (minimal is fine), but it should expose features needed for web scraping. I think that it'd be fine even if this object only exposes some methods to work with query parameters, and nothing more; it'd be enough for a first version.

This PR contains a very basic implementation. It may be sufficient for web-poet needs, but I'm worried about backwards compatibility if we move to an URL object in future.

Thoughts?

@codecov
Copy link

codecov bot commented May 21, 2022

Codecov Report

Merging #42 (237fab6) into master (bab41be) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #42   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           15        15           
  Lines          330       344   +14     
=========================================
+ Hits           330       344   +14     
Impacted Files Coverage Δ
web_poet/__init__.py 100.00% <ø> (ø)
web_poet/page_inputs/__init__.py 100.00% <ø> (ø)
web_poet/_base.py 100.00% <100.00%> (ø)
web_poet/mixins.py 100.00% <100.00%> (ø)
web_poet/page_inputs/client.py 100.00% <100.00%> (ø)
web_poet/page_inputs/http.py 100.00% <100.00%> (ø)

@kmike kmike mentioned this pull request May 22, 2022
@BurnzZ
Copy link
Contributor

BurnzZ commented May 23, 2022

+1 on using yarl as a lot of developers that use Scrapy has at least interacted or used it in the past. This leads to better familiarity with the new Smart URL's API. It's also well-tested and has been battle-tested in many applications. This makes option 3 of creating a new implementation from scratch more time consuming and would have possibly more bugs.

For yarl the issue is that it declares yarl.URL as "final", and also prohibits subclassing at runtime. So it's not possible to have a "marker" subclass like RequestURL.

We could do something like this:

import yarl

# As suggested in https://github.com/aio-libs/yarl/issues/312
PatchedURL = yarl.URL
del PatchedURL.__init_subclass__

class RequestURL(PatchedURL):
    pass

class ResponseURL(PatchedURL):
    pass

@kmike
Copy link
Member Author

kmike commented May 24, 2022

The issue with yarl workaround is that typing still doesn't work properly; yarl hardcodes various methods as returning and receiving yarl.URL. If I recall properly, this is what was breaking it usage in HttpResponse.

@Gallaecio
Copy link
Member

I'm worried about backwards compatibility if we move to an URL object in future.

I believe this is technically a valid concern, e.g. URL("foo")[0] works here and may not work in the future. Maybe we could instead use a custom class with an internal string, and only define what we expect not no change in a backward-incompatible way: __init__ and __str__.

@kmike
Copy link
Member Author

kmike commented May 27, 2022

I believe this is technically a valid concern, e.g. URL("foo")[0] works here and may not work in the future. Maybe we could instead use a custom class with an internal string, and only define what we expect not no change in a backward-incompatible way: init and str.

It's a bit worse than that: neither yarl.URL nor hyperlink.URL are string subclasses.

@Gallaecio
Copy link
Member

Gallaecio commented May 27, 2022

It's a bit worse than that: neither yarl.URL nor hyperlink.URL are string subclasses.

Yes, I am suggesting we make it so that ours is not one either:

class RequestUrl:
    def __init__(self, url: str):
        self._url = url
    def __str__(self):
        return self._url

That way str(url) works (I presume) the same as with yarl and hyperlink, and our interface is as minimal as possible (as opposed to exposing the whole str API), so that we can grow it as needed putting thought into every API change (including a potential switch to yarl, hyperlink, or something else).

@BurnzZ
Copy link
Contributor

BurnzZ commented May 31, 2022

I am suggesting we make it so that ours is not one either

What about wrapping yarl.URL into a class?

class RequestURL:
    def __init__(self, url: str):
        self._url = yarl.URL(url)

    def __str__(self) -> str:
        return str(self._url)

This solves:

  • yarl's the typing issue
  • yarl's subclassing issue
  • I'm worried about backwards compatibility if we move to an URL object in future.

We don't need to match every dunder method to the equivalent yarl.URL functionality if we won't need it. For example, if we find the .path property handy, then we could simply add it as:

class RequestURL: 
    ... # same code as above

    @property
    def path(self) -> str:
        return self._url.path

url = RequestURL("https://example.com/category")
url.path  # /category

We're also free to define our own interface and not be tied directly to yarl.

Also, if I'm not mistaken, these libraries don't handle all the corner cases (e.g. a case when path encoding is different from query encoding, which could be the case for non-utf8 pages).

More specifically, we can solve this on our own.

@Gallaecio
Copy link
Member

That approach to extending these classes in the future makes sense to me.

Also, it probably makes sense to create a common parent class, even if private (_URL).

Finally, I thought I had mentioned it already somewhere, so sorry if I did and I am repeating myself, but for consistency with existing classes the names should probably be Url and not URL. Which I find unpythonic, but I find consistency more important.

@Gallaecio
Copy link
Member

Gallaecio commented May 31, 2022

We are definitely moving into own-class, w3lib territory, though. Should we implement these classes here and move them to w3lib later, or create them in w3lib directly first?

@BurnzZ
Copy link
Contributor

BurnzZ commented May 31, 2022

Should we implement these classes here and move them to w3lib later, or create them in w3lib directly first?

Perhaps we could implement the _Url class that you've mentioned here and have RequestUrl(_Url) and Response(_Url) use it.

We could then move _Url to w3lib later on. Perhaps this option would prevent this PR from being unnecessarily blocked.

@kmike
Copy link
Member Author

kmike commented May 31, 2022

Finally, I thought I had mentioned it already somewhere, so sorry if I did and I am repeating myself, but for consistency with existing classes the names should probably be Url and not URL. Which I find unpythonic, but I find consistency more important.

I was wondering if anyone would notice that :)


class _Url:
""" Base URL class.
"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @BurnzZ @Gallaecio! I tried to follow @Gallaecio's suggestion, and created a minimal version of the Url base class. To keep it minimal, I haven't added anything useful to the _Url class; that's the difference with #45

On one hand, it allows us to improve _Url class in future almost without any restrictions.

On the other hand, it kind-of gets in a way now. With this implementation, users need to cast to str almost always: for example, HttpClient backends would always need to do str(request.url).

So, for the users the current implementation is more cumbersome than just using str, or having URL a str subclass, without any benefits. It may be ok.

"""Shortcut to HTML Response's URL."""
return self.response.url
"""Shortcut to HTML Response's URL, as a string."""
return str(self.response.url)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for backwards compatibility, and also for consistency with .base_url and .urljoin properties, where urls are strings.

Overall, I don't like ResponseShortcutsMixin, it seems using HttpResponse methods is better :)

web_poet/page_inputs/client.py Outdated Show resolved Hide resolved
@@ -17,7 +17,7 @@ def async_mock():
"""Workaround since python 3.7 doesn't ship with asyncmock."""

async def async_test(req):
return HttpResponse(req.url, body=b"")
return HttpResponse(str(req.url), body=b"")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be casting this to str as the tests that rely on this mock are expecting ResponseUrl. We'd need to update the downstream tests instead to validate their expectations.

Copy link
Member Author

@kmike kmike Jun 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must admit I failed to make tests working, and applied this hack instead :) Mocks are not my friends.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it's harmless, because tests should be getting ResponseUrl, as HttpResponse should be converting str to ResponseUrl; without having mocks in mind the change doesn't really change the behavior. But I'm likely missing some interactions with mocks which are in effect here.

@kmike kmike marked this pull request as ready for review June 7, 2022 07:58
tests/test_page_inputs.py Show resolved Hide resolved
req = HttpRequest(resp.url)
assert isinstance(req.url, RequestUrl)
assert str(req.url) == str(resp.url)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gallaecio has raised a good point if HttpResponse("url") and HttpRequest("url") are equal. We should add a test for such.

Copy link
Member Author

@kmike kmike Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test which shows that they're not equal.

But I'm not sure this behavior is useful. What kind of errors would it prevent?

I wonder if it'd better to leave it "undefined", instead of solidifying this behavior in tests (if that's even a reasonable approach :)

pathlib does some magic when comparing paths, it's not a simple string match. E.g. on Windows path components are lower-cased before comparison.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test simply solidifies the expectation to users in case they're unsure. As a user, I might expect when initially using it for the first time that the expression response.url == request.url holds True. However, I need to cast it to its string value to bear True.

I also expect such test to conveniently break when we override __eq__ to another behavior. It helps future PRs denote that an equality expectation will change.

Copy link
Contributor

@BurnzZ BurnzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @kmike ! 🙌

@BurnzZ
Copy link
Contributor

BurnzZ commented Jun 8, 2022

It just occurred to me after hitting approve that the docs needed to be updated as well. I took the liberty of creating a PR for it here: #48

@kmike
Copy link
Member Author

kmike commented Jun 9, 2022

Thanks @BurnzZ and @Gallaecio! I'm merging it, to unblock the releases.

@kmike kmike merged commit 778f27e into master Jun 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants