RequestUrl and ResponseUrl using yarl #45
base: url-page-inputs
Conversation
def __eq__(self, other) -> bool:
    return str(self._url) == str(other)
The property methods below could be mapped dynamically from yarl's. However, we lose the benefit of defining docstrings within them.
We would also lose API governance, so +1 to manual definition.
However, I wonder if we should define them at all for the initial implementation. We want to make sure we get the API right encoding-wise, and if we expose a part of the Yarl interface as is, I imagine we are introducing the encoding issue into our implementation, with the caveat of not supporting encoded=True in __init__ to at least prevent Yarl from messing things up.
Maybe the initial implementation should use a string internally instead, and we can convert it into Yarl later.
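For illustration, the two approaches being weighed might look roughly like this (a sketch only; the class names and the property set are assumptions, not decisions made in this PR):

```python
import yarl

# Option A: manual definitions - explicit docstrings and full control over
# what the public API exposes.
class _ManualUrl:
    def __init__(self, url: str):
        self._url = yarl.URL(url)

    @property
    def host(self):
        """Host part of the URL, e.g. "example.com"."""
        return self._url.host


# Option B: map yarl's read-only properties dynamically - less boilerplate,
# but no per-property docstrings, and whatever yarl exposes leaks through.
class _DynamicUrl:
    def __init__(self, url: str):
        self._url = yarl.URL(url)

for _name in ("scheme", "host", "path", "query_string"):
    setattr(
        _DynamicUrl,
        _name,
        property(lambda self, _name=_name: getattr(self._url, _name)),
    )
```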
A good point about handling the encoding. What do you think about setting encoded=False by default to prevent yarl from messing things up due to incorrect encoding? be37f39. This would be equivalent to having a str internally, aside from the "smart" helper methods.
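For reference, a quick illustration of what yarl's encoded flag controls (outputs shown in comments; exact behavior may vary slightly across yarl versions):

```python
import yarl

# encoded=False (yarl's default): yarl percent-encodes/normalizes the input.
print(yarl.URL("https://example.com/a b?q=ä", encoded=False))
# https://example.com/a%20b?q=%C3%A4

# encoded=True: yarl trusts the input as already encoded and leaves it as is.
print(yarl.URL("https://example.com/a%20b?q=%C3%A4", encoded=True))
# https://example.com/a%20b?q=%C3%A4
```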
The thing is, Yarl exposes encoded in __init__ as a workaround for proper encoding handling, which they opted not to implement. But I believe what @kmike has in mind is for us to have a URL class that does proper encoding handling, in which case we should probably not expose encoded at all (maybe encoding instead, defaulting to "utf8").
I would wait for feedback from @kmike before making more API decisions. I am personally not sure of the best approach here, or which parts of w3lib.url we want to apply and how.
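If the API ended up taking an encoding argument instead of encoded, one hypothetical shape could be the following (w3lib.url.safe_url_string is a real helper; the _Url wrapper and its signature here are assumptions):

```python
import yarl
from w3lib.url import safe_url_string

class _Url:
    def __init__(self, url: str, encoding: str = "utf-8"):
        # Percent-encode unsafe characters first, using the given encoding
        # for non-ASCII bytes, then let yarl store the already-safe result.
        self._url = yarl.URL(safe_url_string(url, encoding=encoding), encoded=True)
```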
I think the goal shouldn't be to implement a general URL class; the goal is to implement a URL class useful for web scraping.
If it's hard to use yarl or another library's URL class directly, and we're defining the API anyway, we should probably think about it from the API point of view: what's the API we want, and what are the features commonly used in web scraping? After figuring out how we'd like the API to look, we can see what's the best way to implement it - wrap yarl, wrap w3lib, do something else.
Based on our previous discussions, I think a scraping-ready URL class should have:
- a way to manipulate the query string: add/remove/get/update query parameters
- some kind of urljoin method, probably via the / operator
- probably a way to extract the domain name?
- anything else?
In addition to this, there is a whole bunch of questions about encoding, normalization, converting URLs to ascii-only encoded strings suitable for downloading, etc. The best API to handle all that might require some thought. I wonder if we can side-step it for now somehow.
At the same time, I'm not sure properties like .scheme are that essential. They're essential for a general-purpose URL class, but do people who write scraping code commonly parse URLs to get their scheme? We can add such methods and properties for sure, but we can do it later. These methods are probably useful for authors of web scraping frameworks / HTTP clients, but less so for people who write web scraping code.
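For concreteness, plain yarl already covers a fair part of that feature list today; this is just an illustration of existing yarl calls, not a proposed web-poet API:

```python
import yarl

url = yarl.URL("https://example.com/category?page=2")

# query-string manipulation
url = url.update_query(page="3", sort="price")        # add/update parameters
page = url.query["page"]                              # read a parameter
url = url.with_query(                                 # remove a parameter
    {k: v for k, v in url.query.items() if k != "sort"}
)

# urljoin-style composition via the / operator
product_url = yarl.URL("https://example.com") / "products" / "123"

# domain name: yarl only exposes .host; extracting the registered domain
# would need extra logic (e.g. tldextract), so a dedicated helper is the gap.
host = url.host                                       # "example.com"
```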
The property methods below could be mapped dynamically from yarl's. However, we lose the benefit of defining docstrings within them.

It's also about return types - some yarl.URL methods are going to return yarl.URL objects, while here it would make more sense to return _Url objects.
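A minimal sketch of what that re-wrapping could look like (the join name and the _Url internals are illustrative, not taken from this PR):

```python
import yarl

class _Url:
    def __init__(self, url):
        self._url = yarl.URL(str(url))

    def __str__(self) -> str:
        return str(self._url)

    def join(self, other) -> "_Url":
        # yarl.URL.join() returns a yarl.URL; re-wrap it so yarl stays an
        # implementation detail and callers keep getting _Url objects back.
        return _Url(self._url.join(yarl.URL(str(other))))
```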
web_poet/page_inputs/http.py
Outdated
@@ -18,12 +19,46 @@
 _AnyStrDict = Dict[AnyStr, Union[AnyStr, List[AnyStr], Tuple[AnyStr, ...]]]


-class ResponseURL(str):
+class _URL:
We could move this to w3lib later, once we're comfortable with the "smart" URL functionalities that we like.
Force-pushed from e9a6ecf to bb2a8f2
Force-pushed from bb2a8f2 to b3c7a0a
web_poet/page_inputs/http.py
Outdated
def __eq__(self, other) -> bool:
    return str(self._url) == str(other)
We should probably make sure we implement this in line with our API expectations.
Mind the following:
>>> example_url_1 = yarl.URL("https://example.com")
>>> example_url_2 = yarl.URL("https://example.com/")
>>> example_url_1 == example_url_2
True
>>> str(example_url_1) == str(example_url_2)
False
That's a good point. Although I'm not quite sure the presence of a trailing / should dictate equality.
For example, some sites redirect URLs like "https://example.com/" to "https://example.com". For efficiency, the trailing "/" needs to be stripped to prevent an extra request from being wasted on the redirection.
Because of the different behaviors/expectations, such URLs cannot really be equal, as they don't point at the same resource (without the redirections).
While that is true for https://example.com/foo and https://example.com/foo/, the case of https://example.com and https://example.com/ is special, since the path of a URL, when not specified, is /.
>>> URL('https://example.com').path
'/'
>>> URL('https://example.com/').path
'/'
>>> URL('https://example.com/foo').path
'/foo'
>>> URL('https://example.com/foo/').path
'/foo/'
Ahhh thanks for making the distinction clear! Handled this on 292a3b4.
web_poet/page_inputs/http.py
Outdated
def __eq__(self, other) -> bool:
    if self._url.path == "/":
        if isinstance(other, str):
            other = _Url(other)
If we follow the pathlib route, there probably shouldn't be magic like this:
In [1]: import pathlib
In [2]: p = pathlib.Path("/etc")
In [3]: p == "/etc"
Out[3]: False
You're right. Initially, I was trying to solve the tiresome casting with str() every time, similar to the concern noted in #42 (comment). One thing I realized now is the problem it presents to end users later on. They might be expecting "https://example.com" but miss that it's actually ResponseUrl("https://example.com") due to this equality "magic".
Updated this in 912ba77.
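For reference, a stricter __eq__ along those pathlib-like lines might look like this (a sketch only, not necessarily what 912ba77 does):

```python
def __eq__(self, other) -> bool:
    # No implicit str coercion: only another _Url can compare equal,
    # mirroring how pathlib.Path("/etc") != "/etc".
    if not isinstance(other, _Url):
        return NotImplemented
    return str(self._url) == str(other._url)
```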
web_poet/page_inputs/http.py
Outdated
-class ResponseURL(str):
-    """ URL of the response """
+class _Url:
+    def __init__(self, url: Union[str, yarl.URL], encoded=True):
url: Union[str, yarl.URL] - this type annotation doesn't allow creating a RequestUrl from a RequestUrl or from a ResponseUrl.
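One possible way to broaden the annotation (illustrative only; the actual fix in d869024 may differ):

```python
from typing import Union

import yarl

class _Url:
    def __init__(self, url: Union[str, yarl.URL, "_Url"], encoded: bool = True):
        # str() accepts plain strings, yarl.URL and _Url instances alike,
        # so RequestUrl(ResponseUrl("https://example.com")) can round-trip.
        self._url = yarl.URL(str(url), encoded=encoded)

    def __str__(self) -> str:
        return str(self._url)
```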
Nice catch! Fixed it in d869024.
I forgot to annotate the other methods in the other modules. In any case, we'd rebase onto #42 if we proceed with this route. This would include annotations like 5b0e889#diff-e91005d8dfc7a61689963aafad8b742554243e973f4943ec394cab425bbb61f6R81.
This PR attempts to scrutinize the idea noted in #42 (comment).
It's built on top of this PR's branch: