RequestURL and ResponseURL #42

kmike · 2022-05-21T21:11:13Z

This is a basic version of URL dependency; it'd allow Page Objects to request just the URL, without the response.
RequestURL is especially useful, as it allows Page Objects to skip the initial download.

My concerns about this PR: I really wanted to have some "smart" object for URL, not just a str subclass. For this, I considered 3 options:

https://github.com/aio-libs/yarl (a top contender).
https://github.com/python-hyper/hyperlink
roll out a new implementation, based on w3lib

For yarl the issue is that it declares yarl.URL as "final", and also prohibits subclassing at runtime. So it's not possible to have a "marker" subclass like RequestURL. To overcome this, I tried using https://docs.python.org/3/library/typing.html#newtype; it works at runtime, but I was unable to make typing work well for attr converters; e.g. creating HttpRequest with anything but RequestURL instance was problematic.

Next, I looked into hyperlink, but the API is not perfect. For example, they provide url.click method, which is essentially a re-implementation of urljoin. I think the method name is misleading, especially in web-poet context, because to click properly you need to take base url in account, not just response url.

Also, if I'm not mistaken, these libraries don't handle all the corner cases (e.g. a case when path encoding is different from query encoding, which could be the case for non-utf8 pages).

The third option is also something I don't like :) I'd much prefer to use something existing, something like yarl, and contribute to it if needed. But it is still an option to roll out a new impementation. In this case, I think we should create w3lib.URL object, not an object in web-poet. w3lib.URL could expose functions from w3lib.url in a nice way; it doesn't have to be very complete (minimal is fine), but it should expose features needed for web scraping. I think that it'd be fine even if this object only exposes some methods to work with query parameters, and nothing more; it'd be enough for a first version.

This PR contains a very basic implementation. It may be sufficient for web-poet needs, but I'm worried about backwards compatibility if we move to an URL object in future.

Thoughts?

codecov · 2022-05-21T21:11:59Z

Codecov Report

Merging #42 (237fab6) into master (bab41be) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #42   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           15        15           
  Lines          330       344   +14     
=========================================
+ Hits           330       344   +14

Impacted Files	Coverage Δ
web_poet/__init__.py	`100.00% <ø> (ø)`
web_poet/page_inputs/__init__.py	`100.00% <ø> (ø)`
web_poet/_base.py	`100.00% <100.00%> (ø)`
web_poet/mixins.py	`100.00% <100.00%> (ø)`
web_poet/page_inputs/client.py	`100.00% <100.00%> (ø)`
web_poet/page_inputs/http.py	`100.00% <100.00%> (ø)`

BurnzZ · 2022-05-23T05:35:34Z

+1 on using yarl as a lot of developers that use Scrapy has at least interacted or used it in the past. This leads to better familiarity with the new Smart URL's API. It's also well-tested and has been battle-tested in many applications. This makes option 3 of creating a new implementation from scratch more time consuming and would have possibly more bugs.

For yarl the issue is that it declares yarl.URL as "final", and also prohibits subclassing at runtime. So it's not possible to have a "marker" subclass like RequestURL.

We could do something like this:

import yarl

# As suggested in https://github.com/aio-libs/yarl/issues/312
PatchedURL = yarl.URL
del PatchedURL.__init_subclass__

class RequestURL(PatchedURL):
    pass

class ResponseURL(PatchedURL):
    pass

kmike · 2022-05-24T08:45:02Z

The issue with yarl workaround is that typing still doesn't work properly; yarl hardcodes various methods as returning and receiving yarl.URL. If I recall properly, this is what was breaking it usage in HttpResponse.

Gallaecio · 2022-05-26T19:24:05Z

I'm worried about backwards compatibility if we move to an URL object in future.

I believe this is technically a valid concern, e.g. URL("foo")[0] works here and may not work in the future. Maybe we could instead use a custom class with an internal string, and only define what we expect not no change in a backward-incompatible way: __init__ and __str__.

kmike · 2022-05-27T07:53:24Z

I believe this is technically a valid concern, e.g. URL("foo")[0] works here and may not work in the future. Maybe we could instead use a custom class with an internal string, and only define what we expect not no change in a backward-incompatible way: init and str.

It's a bit worse than that: neither yarl.URL nor hyperlink.URL are string subclasses.

Gallaecio · 2022-05-27T08:07:22Z

It's a bit worse than that: neither yarl.URL nor hyperlink.URL are string subclasses.

Yes, I am suggesting we make it so that ours is not one either:

class RequestUrl:
    def __init__(self, url: str):
        self._url = url
    def __str__(self):
        return self._url

That way str(url) works (I presume) the same as with yarl and hyperlink, and our interface is as minimal as possible (as opposed to exposing the whole str API), so that we can grow it as needed putting thought into every API change (including a potential switch to yarl, hyperlink, or something else).

BurnzZ · 2022-05-31T08:26:53Z

I am suggesting we make it so that ours is not one either

What about wrapping yarl.URL into a class?

class RequestURL:
    def __init__(self, url: str):
        self._url = yarl.URL(url)

    def __str__(self) -> str:
        return str(self._url)

This solves:

yarl's the typing issue
yarl's subclassing issue
I'm worried about backwards compatibility if we move to an URL object in future.

We don't need to match every dunder method to the equivalent yarl.URL functionality if we won't need it. For example, if we find the .path property handy, then we could simply add it as:

class RequestURL: 
    ... # same code as above

    @property
    def path(self) -> str:
        return self._url.path

url = RequestURL("https://example.com/category")
url.path  # /category

We're also free to define our own interface and not be tied directly to yarl.

Also, if I'm not mistaken, these libraries don't handle all the corner cases (e.g. a case when path encoding is different from query encoding, which could be the case for non-utf8 pages).

More specifically, we can solve this on our own.

Gallaecio · 2022-05-31T08:33:08Z

That approach to extending these classes in the future makes sense to me.

Also, it probably makes sense to create a common parent class, even if private (_URL).

Finally, I thought I had mentioned it already somewhere, so sorry if I did and I am repeating myself, but for consistency with existing classes the names should probably be Url and not URL. Which I find unpythonic, but I find consistency more important.

Gallaecio · 2022-05-31T08:35:21Z

We are definitely moving into own-class, w3lib territory, though. Should we implement these classes here and move them to w3lib later, or create them in w3lib directly first?

BurnzZ · 2022-05-31T08:38:22Z

Should we implement these classes here and move them to w3lib later, or create them in w3lib directly first?

Perhaps we could implement the _Url class that you've mentioned here and have RequestUrl(_Url) and Response(_Url) use it.

We could then move _Url to w3lib later on. Perhaps this option would prevent this PR from being unnecessarily blocked.

kmike · 2022-05-31T10:31:53Z

Finally, I thought I had mentioned it already somewhere, so sorry if I did and I am repeating myself, but for consistency with existing classes the names should probably be Url and not URL. Which I find unpythonic, but I find consistency more important.

I was wondering if anyone would notice that :)

kmike · 2022-06-02T18:15:00Z

web_poet/_base.py

+
+class _Url:
+    """ Base URL class.
+    """


hey @BurnzZ @Gallaecio! I tried to follow @Gallaecio's suggestion, and created a minimal version of the Url base class. To keep it minimal, I haven't added anything useful to the _Url class; that's the difference with #45

On one hand, it allows us to improve _Url class in future almost without any restrictions.

On the other hand, it kind-of gets in a way now. With this implementation, users need to cast to str almost always: for example, HttpClient backends would always need to do str(request.url).

So, for the users the current implementation is more cumbersome than just using str, or having URL a str subclass, without any benefits. It may be ok.

kmike · 2022-06-02T18:17:20Z

web_poet/mixins.py

-        """Shortcut to HTML Response's URL."""
-        return self.response.url
+        """Shortcut to HTML Response's URL, as a string."""
+        return str(self.response.url)


This is for backwards compatibility, and also for consistency with .base_url and .urljoin properties, where urls are strings.

Overall, I don't like ResponseShortcutsMixin, it seems using HttpResponse methods is better :)

web_poet/page_inputs/client.py

BurnzZ · 2022-06-03T06:40:40Z

tests/test_requests.py

@@ -17,7 +17,7 @@ def async_mock():
    """Workaround since python 3.7 doesn't ship with asyncmock."""

    async def async_test(req):
-        return HttpResponse(req.url, body=b"")
+        return HttpResponse(str(req.url), body=b"")


I don't think we should be casting this to str as the tests that rely on this mock are expecting ResponseUrl. We'd need to update the downstream tests instead to validate their expectations.

I must admit I failed to make tests working, and applied this hack instead :) Mocks are not my friends.

I was thinking it's harmless, because tests should be getting ResponseUrl, as HttpResponse should be converting str to ResponseUrl; without having mocks in mind the change doesn't really change the behavior. But I'm likely missing some interactions with mocks which are in effect here.

tests/test_page_inputs.py

BurnzZ · 2022-06-07T08:15:48Z

tests/test_page_inputs.py

+    req = HttpRequest(resp.url)
+    assert isinstance(req.url, RequestUrl)
+    assert str(req.url) == str(resp.url)
+


@Gallaecio has raised a good point if HttpResponse("url") and HttpRequest("url") are equal. We should add a test for such.

I've added a test which shows that they're not equal.

But I'm not sure this behavior is useful. What kind of errors would it prevent?

I wonder if it'd better to leave it "undefined", instead of solidifying this behavior in tests (if that's even a reasonable approach :)

pathlib does some magic when comparing paths, it's not a simple string match. E.g. on Windows path components are lower-cased before comparison.

The test simply solidifies the expectation to users in case they're unsure. As a user, I might expect when initially using it for the first time that the expression response.url == request.url holds True. However, I need to cast it to its string value to bear True.

I also expect such test to conveniently break when we override __eq__ to another behavior. It helps future PRs denote that an equality expectation will change.

# Conflicts: # web_poet/__init__.py

@Gallaecio

thanks @Gallaecio for the catch

BurnzZ

LGTM @kmike ! 🙌

BurnzZ · 2022-06-08T04:07:07Z

It just occurred to me after hitting approve that the docs needed to be updated as well. I took the liberty of creating a PR for it here: #48

update doc examples for RequestUrl/ResponseUrl

kmike · 2022-06-09T08:50:19Z

Thanks @BurnzZ and @Gallaecio! I'm merging it, to unblock the releases.

RequestURL and ResponseURL

b2c665e

kmike mentioned this pull request May 22, 2022

BrowserHtml #43

Merged

Expose RequestURL and ResponseURL in higher-level imports

10a3883

Gallaecio approved these changes May 26, 2022

View reviewed changes

Merge branch 'master' into url-page-inputs

f56da89

BurnzZ mentioned this pull request Jun 1, 2022

RequestUrl and ResponseUrl using yarl #45

Open

kmike added 2 commits June 2, 2022 22:12

rename classes, for consistency

890a238

base _Url class

5b0e889

kmike commented Jun 2, 2022

View reviewed changes

Gallaecio approved these changes Jun 3, 2022

View reviewed changes

web_poet/page_inputs/client.py Outdated Show resolved Hide resolved

BurnzZ reviewed Jun 3, 2022

View reviewed changes

kmike marked this pull request as ready for review June 7, 2022 07:58

BurnzZ reviewed Jun 7, 2022

View reviewed changes

kmike added 3 commits June 7, 2022 23:14

Merge branch 'master' into url-page-inputs

2884d2f

# Conflicts: # web_poet/__init__.py

remove useless format specifier

ea11f3d

thanks @Gallaecio for the catch

Move URL tests to a separate file; a few test for equality / inequality

ba4e615

BurnzZ approved these changes Jun 8, 2022

View reviewed changes

update doc examples for RequestUrl/ResponseUrl

aad8ae6

BurnzZ mentioned this pull request Jun 8, 2022

update doc examples for RequestUrl/ResponseUrl #48

Merged

BurnzZ and others added 2 commits June 8, 2022 15:12

fix docs on how RequestUrl/ResponseUrl is portrayed

9287744

Merge pull request #48 from scrapinghub/url-page-inputs-docs

237fab6

update doc examples for RequestUrl/ResponseUrl

kmike merged commit 778f27e into master Jun 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RequestURL and ResponseURL #42

RequestURL and ResponseURL #42

kmike commented May 21, 2022 •

edited

Loading

codecov bot commented May 21, 2022 •

edited

Loading

BurnzZ commented May 23, 2022

kmike commented May 24, 2022

Gallaecio commented May 26, 2022

kmike commented May 27, 2022

Gallaecio commented May 27, 2022 •

edited

Loading

BurnzZ commented May 31, 2022

Gallaecio commented May 31, 2022

Gallaecio commented May 31, 2022 •

edited

Loading

BurnzZ commented May 31, 2022

kmike commented May 31, 2022

kmike Jun 2, 2022

kmike Jun 2, 2022

BurnzZ Jun 3, 2022

kmike Jun 3, 2022 •

edited

Loading

kmike Jun 3, 2022

BurnzZ Jun 7, 2022

kmike Jun 7, 2022 •

edited

Loading

BurnzZ Jun 8, 2022

BurnzZ left a comment

BurnzZ commented Jun 8, 2022

kmike commented Jun 9, 2022

RequestURL and ResponseURL #42

RequestURL and ResponseURL #42

Conversation

kmike commented May 21, 2022 • edited Loading

codecov bot commented May 21, 2022 • edited Loading

Codecov Report

BurnzZ commented May 23, 2022

kmike commented May 24, 2022

Gallaecio commented May 26, 2022

kmike commented May 27, 2022

Gallaecio commented May 27, 2022 • edited Loading

BurnzZ commented May 31, 2022

Gallaecio commented May 31, 2022

Gallaecio commented May 31, 2022 • edited Loading

BurnzZ commented May 31, 2022

kmike commented May 31, 2022

kmike Jun 2, 2022

Choose a reason for hiding this comment

kmike Jun 2, 2022

Choose a reason for hiding this comment

BurnzZ Jun 3, 2022

Choose a reason for hiding this comment

kmike Jun 3, 2022 • edited Loading

Choose a reason for hiding this comment

kmike Jun 3, 2022

Choose a reason for hiding this comment

BurnzZ Jun 7, 2022

Choose a reason for hiding this comment

kmike Jun 7, 2022 • edited Loading

Choose a reason for hiding this comment

BurnzZ Jun 8, 2022

Choose a reason for hiding this comment

BurnzZ left a comment

Choose a reason for hiding this comment

BurnzZ commented Jun 8, 2022

kmike commented Jun 9, 2022

kmike commented May 21, 2022 •

edited

Loading

codecov bot commented May 21, 2022 •

edited

Loading

Gallaecio commented May 27, 2022 •

edited

Loading

Gallaecio commented May 31, 2022 •

edited

Loading

kmike Jun 3, 2022 •

edited

Loading

kmike Jun 7, 2022 •

edited

Loading