-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RequestURL and ResponseURL #42
Conversation
Codecov Report
@@ Coverage Diff @@
## master #42 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 15 15
Lines 330 344 +14
=========================================
+ Hits 330 344 +14
|
+1 on using
We could do something like this: import yarl
# As suggested in https://github.com/aio-libs/yarl/issues/312
PatchedURL = yarl.URL
del PatchedURL.__init_subclass__
class RequestURL(PatchedURL):
pass
class ResponseURL(PatchedURL):
pass |
The issue with yarl workaround is that typing still doesn't work properly; yarl hardcodes various methods as returning and receiving yarl.URL. If I recall properly, this is what was breaking it usage in HttpResponse. |
I believe this is technically a valid concern, e.g. |
It's a bit worse than that: neither yarl.URL nor hyperlink.URL are string subclasses. |
Yes, I am suggesting we make it so that ours is not one either: class RequestUrl:
def __init__(self, url: str):
self._url = url
def __str__(self):
return self._url That way |
What about wrapping class RequestURL:
def __init__(self, url: str):
self._url = yarl.URL(url)
def __str__(self) -> str:
return str(self._url) This solves:
We don't need to match every dunder method to the equivalent class RequestURL:
... # same code as above
@property
def path(self) -> str:
return self._url.path
url = RequestURL("https://example.com/category")
url.path # /category We're also free to define our own interface and not be tied directly to
More specifically, we can solve this on our own. |
That approach to extending these classes in the future makes sense to me. Also, it probably makes sense to create a common parent class, even if private ( Finally, I thought I had mentioned it already somewhere, so sorry if I did and I am repeating myself, but for consistency with existing classes the names should probably be |
We are definitely moving into own-class, w3lib territory, though. Should we implement these classes here and move them to w3lib later, or create them in w3lib directly first? |
Perhaps we could implement the We could then move |
I was wondering if anyone would notice that :) |
|
||
class _Url: | ||
""" Base URL class. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @BurnzZ @Gallaecio! I tried to follow @Gallaecio's suggestion, and created a minimal version of the Url base class. To keep it minimal, I haven't added anything useful to the _Url
class; that's the difference with #45
On one hand, it allows us to improve _Url class in future almost without any restrictions.
On the other hand, it kind-of gets in a way now. With this implementation, users need to cast to str almost always: for example, HttpClient backends would always need to do str(request.url)
.
So, for the users the current implementation is more cumbersome than just using str
, or having URL a str subclass, without any benefits. It may be ok.
"""Shortcut to HTML Response's URL.""" | ||
return self.response.url | ||
"""Shortcut to HTML Response's URL, as a string.""" | ||
return str(self.response.url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for backwards compatibility, and also for consistency with .base_url and .urljoin properties, where urls are strings.
Overall, I don't like ResponseShortcutsMixin, it seems using HttpResponse methods is better :)
@@ -17,7 +17,7 @@ def async_mock(): | |||
"""Workaround since python 3.7 doesn't ship with asyncmock.""" | |||
|
|||
async def async_test(req): | |||
return HttpResponse(req.url, body=b"") | |||
return HttpResponse(str(req.url), body=b"") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should be casting this to str
as the tests that rely on this mock are expecting ResponseUrl
. We'd need to update the downstream tests instead to validate their expectations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must admit I failed to make tests working, and applied this hack instead :) Mocks are not my friends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking it's harmless, because tests should be getting ResponseUrl, as HttpResponse should be converting str to ResponseUrl; without having mocks in mind the change doesn't really change the behavior. But I'm likely missing some interactions with mocks which are in effect here.
req = HttpRequest(resp.url) | ||
assert isinstance(req.url, RequestUrl) | ||
assert str(req.url) == str(resp.url) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gallaecio has raised a good point if HttpResponse("url")
and HttpRequest("url")
are equal. We should add a test for such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a test which shows that they're not equal.
But I'm not sure this behavior is useful. What kind of errors would it prevent?
I wonder if it'd better to leave it "undefined", instead of solidifying this behavior in tests (if that's even a reasonable approach :)
pathlib does some magic when comparing paths, it's not a simple string match. E.g. on Windows path components are lower-cased before comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test simply solidifies the expectation to users in case they're unsure. As a user, I might expect when initially using it for the first time that the expression response.url == request.url
holds True. However, I need to cast it to its string value to bear True.
I also expect such test to conveniently break when we override __eq__
to another behavior. It helps future PRs denote that an equality expectation will change.
# Conflicts: # web_poet/__init__.py
thanks @Gallaecio for the catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @kmike ! 🙌
It just occurred to me after hitting approve that the docs needed to be updated as well. I took the liberty of creating a PR for it here: #48 |
update doc examples for RequestUrl/ResponseUrl
Thanks @BurnzZ and @Gallaecio! I'm merging it, to unblock the releases. |
This is a basic version of URL dependency; it'd allow Page Objects to request just the URL, without the response.
RequestURL is especially useful, as it allows Page Objects to skip the initial download.
My concerns about this PR: I really wanted to have some "smart" object for URL, not just a str subclass. For this, I considered 3 options:
For yarl the issue is that it declares yarl.URL as "final", and also prohibits subclassing at runtime. So it's not possible to have a "marker" subclass like RequestURL. To overcome this, I tried using https://docs.python.org/3/library/typing.html#newtype; it works at runtime, but I was unable to make typing work well for attr converters; e.g. creating HttpRequest with anything but RequestURL instance was problematic.
Next, I looked into hyperlink, but the API is not perfect. For example, they provide url.click method, which is essentially a re-implementation of urljoin. I think the method name is misleading, especially in web-poet context, because to click properly you need to take base url in account, not just response url.
Also, if I'm not mistaken, these libraries don't handle all the corner cases (e.g. a case when path encoding is different from query encoding, which could be the case for non-utf8 pages).
The third option is also something I don't like :) I'd much prefer to use something existing, something like yarl, and contribute to it if needed. But it is still an option to roll out a new impementation. In this case, I think we should create w3lib.URL object, not an object in web-poet. w3lib.URL could expose functions from w3lib.url in a nice way; it doesn't have to be very complete (minimal is fine), but it should expose features needed for web scraping. I think that it'd be fine even if this object only exposes some methods to work with query parameters, and nothing more; it'd be enough for a first version.
This PR contains a very basic implementation. It may be sufficient for web-poet needs, but I'm worried about backwards compatibility if we move to an URL object in future.
Thoughts?