_Url to inherit from str #187

Open
BurnzZ opened this issue Aug 28, 2023 · 0 comments

BurnzZ commented Aug 28, 2023

This was discussed previously in one of the PRs.

I'm re-opening this for tracking because the type check in w3lib.util.to_unicode rejects ResponseUrl, which is not a str: https://github.com/scrapy/w3lib/blob/master/w3lib/util.py#L46-L49

In particular, doing something like:

from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor()
link_extractor.extract_links(response) 

where response is a web_poet.page_inputs.http.HttpResponse instance and not scrapy.http.Response.

The full stack trace is:

  File "/usr/local/lib/python3.10/site-packages/scrapy/linkextractors/lxmlhtml.py", line 239, in extract_links
    base_url = get_base_url(response)
  File "/usr/local/lib/python3.10/site-packages/scrapy/utils/response.py", line 27, in get_base_url
    _baseurl_cache[response] = html.get_base_url(
  File "/usr/local/lib/python3.10/site-packages/w3lib/html.py", line 323, in get_base_url
    return safe_url_string(baseurl)
  File "/usr/local/lib/python3.10/site-packages/w3lib/url.py", line 141, in safe_url_string
    decoded = to_unicode(url, encoding=encoding, errors="percentencode")
  File "/usr/local/lib/python3.10/site-packages/w3lib/util.py", line 47, in to_unicode
    raise TypeError(
TypeError: to_unicode must receive bytes or str, got ResponseUrl
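
To make the failure mode concrete, here is a minimal, self-contained sketch. It does not import w3lib or web-poet: `to_unicode_like` is a simplified stand-in that only mirrors the type check suggested by the traceback, and `PlainUrl` and `StrUrl` are hypothetical classes invented for illustration. It shows that an `isinstance` check of this kind rejects a wrapper class but accepts a `str` subclass.

```python
class PlainUrl:
    """Hypothetical URL wrapper that does NOT inherit from str."""

    def __init__(self, url: str) -> None:
        self._url = url

    def __str__(self) -> str:
        return self._url


class StrUrl(str):
    """Hypothetical URL class that inherits from str."""


def to_unicode_like(text, encoding="utf-8", errors="strict"):
    """Simplified stand-in for the type check seen in the traceback (assumption)."""
    if isinstance(text, str):
        return text
    if not isinstance(text, bytes):
        raise TypeError(
            f"to_unicode must receive bytes or str, got {type(text).__name__}"
        )
    return text.decode(encoding, errors)


# A str subclass passes the isinstance check unchanged.
print(to_unicode_like(StrUrl("https://example.com")))

# A plain wrapper is rejected, even though str() on it would work.
try:
    to_unicode_like(PlainUrl("https://example.com"))
except TypeError as exc:
    print(exc)
```

If the real check behaves like this stand-in, making the URL class a `str` subclass would let it pass through without changes on the Scrapy or w3lib side.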

An alternative would be to adjust the Scrapy code instead, casting with str(response.url) at every use.
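
For comparison, the casting alternative could look roughly like the sketch below. All names here (`ResponseUrlLike`, `FakeResponse`, `str_only_helper`) are hypothetical stand-ins, not Scrapy or web-poet APIs; the helper only imitates a callee that accepts `str` or `bytes`.

```python
class ResponseUrlLike:
    """Hypothetical URL wrapper that is not a str subclass."""

    def __init__(self, url: str) -> None:
        self._url = url

    def __str__(self) -> str:
        return self._url


class FakeResponse:
    """Hypothetical response object exposing a non-str .url attribute."""

    def __init__(self, url: str) -> None:
        self.url = ResponseUrlLike(url)


def str_only_helper(url):
    """Hypothetical callee that accepts only str or bytes (assumption)."""
    if not isinstance(url, (str, bytes)):
        raise TypeError(f"expected bytes or str, got {type(url).__name__}")
    return url


response = FakeResponse("https://example.com/page")

# The cast has to be repeated at every call site that forwards the URL.
base_url = str_only_helper(str(response.url))
print(base_url)
```

This works, but any call site that forgets the cast would fail in the same way as the traceback above.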

BurnzZ added the discuss label Aug 28, 2023