-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RequestUrl and ResponseUrl using yarl #45
base: url-page-inputs
Are you sure you want to change the base?
Changes from 7 commits
71bb150
b3c7a0a
2ebf7a3
b658ab5
be37f39
4e0e103
292a3b4
912ba77
d869024
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,6 +9,7 @@ | |
http_content_type_encoding | ||
) | ||
|
||
import yarl | ||
from web_poet._base import _HttpHeaders | ||
from web_poet.utils import memoizemethod_noargs | ||
from web_poet.mixins import SelectableMixin | ||
|
@@ -18,13 +19,64 @@ | |
_AnyStrDict = Dict[AnyStr, Union[AnyStr, List[AnyStr], Tuple[AnyStr, ...]]] | ||
|
||
|
||
class ResponseURL(str): | ||
""" URL of the response """ | ||
class _Url: | ||
def __init__(self, url: Union[str, yarl.URL], encoded=True): | ||
self._url = yarl.URL(str(url), encoded=encoded) | ||
|
||
def __str__(self) -> str: | ||
return str(self._url) | ||
|
||
def __repr__(self) -> str: | ||
return f'{type(self).__name__}({str(self._url)!r})' | ||
|
||
def __eq__(self, other) -> bool: | ||
if self._url.path == "/": | ||
if isinstance(other, str): | ||
other = _Url(other) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we follow pathlib route, probably there shouldn't be magic like this: In [1]: import pathlib
In [2]: p = pathlib.Path("/etc")
In [3]: p == "/etc"
Out[3]: False There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You're right. Initially, I was trying to solve the tiresome casting of Updated this on: 912ba77 |
||
if self._url.path == other.path: | ||
return True | ||
return str(self._url) == str(other) | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The property methods below can be added mapped dynamically with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We would also lose API governance, so +1 to manual definition. However, I wonder if we should define them at all for the initial implementation. We want to make sure we get the API right encoding-wise, and if we expose a part of the Yarl interface already as is, I imagine we are introducing the encoding issue in our implementation, with the caveat of not supporting Maybe the initial implementation should use a string internally instead, and we can convert it into Yarl later. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A good point about handling the encoding. What do you think about setting There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The thing is, Yarl exposes I would wait for feedback from @kmike before making more API decisions. I am personally not sure of the best approach here, what parts of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the goal shouldn't be to implement a general URL class; the goal is to implement URL class useful for web scraping. If that's hard to use yarl or other library's URL class directly, and we're defining the API anyways, we probably should think about it from the API point of view: what's the API we want, what are the features commonly used in web scraping? After figuring out how we'd like API to look like, we can see what's the best way to implement it - wrap yarl, wrap w3lib, do something else. Based on our previous discussions, I think a scraping-ready URL class should have:
In addition to this, there is whole bunch of questions about encoding, normalization, converting URLs to ascii-only encoded strings suitable for downloading, etc. The best API to handle all that might require some thought. I wonder if we can side-step it for now somehow. At the same time, I'm not sure properties like .scheme are that essential. They're are essential for a general-purpose URL class, but are people who write scraping code commonly parse URLs to get their scheme? We can add such methods and properties for sure, but we can do it later. These methods are probably useful for authors of web scraping frameworks / http clients, but less so for people who write web scraping code. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It's also about return types - some yarl.URL methods are going to return yarl.URL objects, while here it would make more sense to return |
||
@property | ||
def scheme(self) -> str: | ||
return self._url.scheme | ||
|
||
@property | ||
def host(self) -> Optional[str]: | ||
return self._url.host | ||
|
||
@property | ||
def path(self) -> str: | ||
return self._url.path | ||
|
||
@property | ||
def query_string(self) -> str: | ||
return self._url.query_string | ||
|
||
@property | ||
def fragment(self) -> str: | ||
return self._url.fragment | ||
|
||
|
||
class ResponseUrl(_Url): | ||
""" URL of the response | ||
|
||
:param url: a string representation of a URL. | ||
:param encoded: If set to False, the given ``url`` would be auto-encoded. | ||
However, there's no guarantee that correct encoding is used. Thus, | ||
it's recommended to set this in the *default* ``False`` value. | ||
""" | ||
pass | ||
|
||
|
||
class RequestURL(str): | ||
""" URL of the request """ | ||
class RequestUrl(_Url): | ||
""" URL of the request | ||
|
||
:param url: a string representation of a URL. | ||
:param encoded: If set to False, the given ``url`` would be auto-encoded. | ||
However, there's no guarantee that correct encoding is used. Thus, | ||
it's recommended to set this in the *default* ``False`` value. | ||
""" | ||
pass | ||
|
||
|
||
|
@@ -162,7 +214,7 @@ class HttpRequest: | |
**web-poet** like :class:`~.HttpClient`. | ||
""" | ||
|
||
url: RequestURL = attrs.field(converter=RequestURL) | ||
url: RequestUrl = attrs.field(converter=RequestUrl) | ||
method: str = attrs.field(default="GET", kw_only=True) | ||
headers: HttpRequestHeaders = attrs.field( | ||
factory=HttpRequestHeaders, converter=HttpRequestHeaders, kw_only=True | ||
|
@@ -195,7 +247,7 @@ class HttpResponse(SelectableMixin): | |
is auto-detected from headers and body content. | ||
""" | ||
|
||
url: ResponseURL = attrs.field(converter=ResponseURL) | ||
url: ResponseUrl = attrs.field(converter=ResponseUrl) | ||
body: HttpResponseBody = attrs.field(converter=HttpResponseBody) | ||
status: Optional[int] = attrs.field(default=None, kw_only=True) | ||
headers: HttpResponseHeaders = attrs.field(factory=HttpResponseHeaders, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
url: Union[str, yarl.URL]
- this type annotation doesn't allow to create RequestUrl from RequestUrl or from ResponseUrlThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! Fixed it in d869024.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to annotate the other methods in other modules. In any case, we'd be rebasing to #42 if we should proceed with this route. This would include the annotations like 5b0e889#diff-e91005d8dfc7a61689963aafad8b742554243e973f4943ec394cab425bbb61f6R81.