Discussion for supporting Scrapy's LinkExtractor #10
web-poet works independently of Scrapy, but you can use scrapy-poet to integrate web-poet into Scrapy projects. scrapy-poet offers an injector middleware that builds and injects dependencies. It has some "Scrapy-provided" classes that are automatically injected without the need of a provider. You should then be able to create a Page Object that looks like this:

```python
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage


class SomePage(ItemWebPage):
    scrapy_response: Response

    def to_item(self):
        return {
            'links': LinkExtractor(
                allow=r'some-website\.com/product/tt\d+/$',
                process_value=some_processor,  # placeholder for a link-processing callable
                restrict_xpaths='//div[@id="products"]//span',
            ).extract_links(self.scrapy_response)  # expects a Scrapy Response instance
        }
```

Note that, currently, this approach makes it mandatory for the request to be made through Scrapy's downloader middlewares. So, for example, if you're using AutoExtract providers that ignore Scrapy requests, you'll end up making an additional request just to build this Scrapy response. The solution in this case would be to mock a Scrapy response from the AutoExtract response's HTML.
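For reference, here's a rough sketch of that mocking idea (the function name and parameters below are placeholders, not an existing API): build a Scrapy `HtmlResponse` from the HTML returned by AutoExtract and feed it to the link extractor.

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor


def extract_product_links(page_url: str, autoextract_html: str):
    """Build a mock Scrapy response from AutoExtract HTML and run a link extractor on it."""
    fake_response = HtmlResponse(url=page_url, body=autoextract_html, encoding='utf-8')
    extractor = LinkExtractor(allow=r'some-website\.com/product/tt\d+/$')
    return extractor.extract_links(fake_response)
```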
Hi @victor-torres, thanks for taking the time to explain this feature! This actually opens up a lot of opportunities for us, like having a neat way to get the Scrapy Response inside Page Objects. However, I've run into a peculiar bug with this approach. I've created a new issue in #11 to discuss it further.
As I've said in #11, the example I gave you is wrong in two ways. That means you should create another type (`ScrapyResponse`, for example) that's initialized with a `Response`, and then create a provider that makes use of `Response` to build this `ScrapyResponse` class. Does that make sense? Confusing, right? It should be something like this:

```python
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy_poet.page_input_providers import PageObjectInputProvider, provides
from web_poet.pages import ItemWebPage


class ScrapyResponse:
    def __init__(self, response):
        self.response = response


@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):
    def __init__(self, response: Response):
        self.response = response

    def __call__(self):
        return ScrapyResponse(self.response)


class QuotesListingPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a'
            ).extract_links(self.scrapy_response.response),
        }
```

Here comes the problem: I've just tested this code and it's not working 😞 I'm not sure what's happening, because there's a test case specifically designed to make sure Scrapy dependencies are injected into providers. In any case, I'd like to leave the question here for @kmike, @ivanprado, @ejulio and others: I know that we've previously discussed this, but maybe we could rethink making Scrapy dependencies available for Page Objects as well. What do you think?
We've discussed this issue today and @ivanprado was able to spot the problem with my snippet: it's missing the `@attr.s(auto_attribs=True)` decorator on the Page Object (and the corresponding `import attr`):

```python
import attr

from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy_poet.page_input_providers import PageObjectInputProvider, provides
from web_poet.pages import ItemWebPage


class ScrapyResponse:
    def __init__(self, response):
        self.response = response


@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):
    def __init__(self, response: Response):
        self.response = response

    def __call__(self):
        return ScrapyResponse(self.response)


@attr.s(auto_attribs=True)
class QuotesListingPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a'
            ).extract_links(self.scrapy_response.response),
        }
```

Although it's possible to receive Scrapy dependencies when initializing providers, to check some runtime conditions or settings, it doesn't look correct to create a provider specifically for Scrapy responses. It looks like we should contribute a pull request to Scrapy making LinkExtractor work with generic HTML data. @ivanprado is currently using one of its private methods to achieve this behavior; we should aim at providing the same feature through a public interface. This way, you can just pass ResponseData to LinkExtractor and it should work as expected, @BurnzZ.
One point on PageObjects is that they should be portable. That is, they shouldn't depend on Scrapy. There is a workaround to use the link extractor in the following way:

```python
from w3lib.html import get_base_url


class QuotesListingPage(ItemWebPage):
    def to_item(self):
        link_extractor = LinkExtractor()
        base_url = get_base_url(self.html, self.url)
        author_links = [
            link
            for sel in self.css('.author + a')
            for link in link_extractor._extract_links(sel, self.url, "utf-8", base_url)
        ]
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': author_links,
        }
```

The problem is that it is using the private method `_extract_links`. Ideas about what can be done to improve the situation: for example, Scrapy's `LinkExtractor` could be extended so that `extract_links()` accepts anything exposing web-poet's response shortcuts:
```python
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.utils.python import unique as unique_list
from w3lib.html import get_base_url
from web_poet.mixins import ResponseShortcutsMixin


class LinkExtractor(LxmlLinkExtractor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def extract_links(self, response: ResponseShortcutsMixin):
        # `response` only needs .html, .url, .selector and .xpath(), so a
        # Page Object (which mixes in ResponseShortcutsMixin) works here too.
        base_url = get_base_url(response.html, response.url)
        if self.restrict_xpaths:
            docs = [
                subdoc
                for x in self.restrict_xpaths
                for subdoc in response.xpath(x)
            ]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, "utf-8", base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)
```

Thoughts?
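If something along those lines were available, the earlier workaround could shrink to a single `extract_links()` call on the Page Object itself. A hypothetical usage sketch, assuming the `LinkExtractor` subclass above:

```python
class QuotesListingPage(ItemWebPage):
    def to_item(self):
        # The Page Object itself exposes .html, .url, .selector and .xpath()
        # through web-poet's response shortcuts, so it can be passed directly.
        author_links = LinkExtractor(restrict_css='.author + a').extract_links(self)
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': author_links,
        }
```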
I think the way to go is to refactor Scrapy's LinkExtractor, and likely move it into a separate package.
Thanks for all the help, everyone! I was able to make it work using the providers. 🎉 I attached the full reproducible code below for posterity.

However, I tried to take it a step further by also exposing the response's `meta` through `ScrapyResponse`, and reading `response.meta` inside it errors out. In order to simplify my example, I've added a commented line in the snippet below that reproduces this problem. I have to note that accessing the `meta` from within `to_item()` (as in `QuotesAuthorPage` below) works fine.

```python
import attr

from scrapy import Spider
from scrapy.http import Response, Request
from scrapy.linkextractors import LinkExtractor
from web_poet.pages import ItemWebPage
from scrapy_poet.page_input_providers import PageObjectInputProvider, provides


class ScrapyResponse:
    def __init__(self, response):
        self.response = response
        # self.meta = response.meta  # uncomment this and it'll error out.


@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):
    def __init__(self, response: Response):
        self.response = response

    def __call__(self):
        return ScrapyResponse(self.response)


@attr.s(auto_attribs=True)
class QuotesListingPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        return {
            'site_name': self.css('h1 a::text').get(),
            'author_links': LinkExtractor(
                restrict_css='.author + a'
            ).extract_links(self.scrapy_response.response),
        }


@attr.s(auto_attribs=True)
class QuotesAuthorPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        assert self.scrapy_response.response.meta['some_other_meta_field']
        base_item = self.scrapy_response.response.meta['item'].copy()
        base_item.update({
            'author': self.css('.author-title ::text').get('').strip(),
            'born': self.css('.author-born-date::text').get('').strip(),
        })
        return base_item


class QuotesBaseSpider(Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response, page: QuotesListingPage):
        meta = {
            'item': {'field': 'value'},
            'some_other_meta_field': True,
        }
        data = page.to_item()
        for link in data['author_links']:
            yield Request(link.url, self.parse_author, meta=meta)

    def parse_author(self, response, page: QuotesAuthorPage):
        return page.to_item()
```

Cheers!
So, `Response.meta` is just a shortcut for `response.request.meta`, and the response the provider receives isn't tied to a request at that point, which is why that commented line errors out. However, you can access meta if you provide the request object as a parameter to your provider:

```python
class ScrapyResponse:
    def __init__(self, response, meta):
        self.response = response
        self.meta = meta


@provides(ScrapyResponse)
class ScrapyResponseProvider(PageObjectInputProvider):
    def __init__(self, response: Response, request: Request):
        self.response = response
        self.meta = request.meta

    def __call__(self):
        return ScrapyResponse(self.response, self.meta)
```

Hope this helps.
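With that version of `ScrapyResponse`, the Page Object can then read meta off the injected dependency instead of the raw response. A small sketch, reusing the classes from the snippets above:

```python
@attr.s(auto_attribs=True)
class QuotesAuthorPage(ItemWebPage):
    scrapy_response: ScrapyResponse

    def to_item(self):
        # meta is exposed by ScrapyResponse itself, so there's no need to go
        # through response.meta (which fails when no request is attached).
        base_item = self.scrapy_response.meta['item'].copy()
        base_item['author'] = self.css('.author-title ::text').get('').strip()
        return base_item
```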
With the current API, I guess we could add
More options: parsel, separate library :)
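For the parsel option, here's a rough sketch of what a Scrapy-free helper could look like (the function name and signature are made up for illustration):

```python
from typing import List, Optional
from urllib.parse import urljoin

import parsel


def extract_links(html: str, base_url: str, restrict_css: Optional[str] = None) -> List[str]:
    """Return absolute URLs of <a href="..."> links, optionally limited to a CSS selection."""
    selector = parsel.Selector(text=html)
    scope = selector.css(restrict_css) if restrict_css else selector
    # descendant-or-self also matches the selected <a> elements themselves.
    hrefs = scope.xpath('descendant-or-self::a/@href').getall()
    return [urljoin(base_url, href) for href in hrefs]
```

A Page Object could then call something like `extract_links(self.html, self.url, restrict_css='.author + a')` without touching Scrapy at all.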
One neat feature inside Scrapy is its LinkExtractor functionality. We usually try to use this whenever we want links to be extracted from a given page. Inside web-poet, we could attempt to use it the same way.

The problem lies in the `extract_links()` method, since it actually expects a Scrapy Response instance; in the current scope we only have access to web-poet's ResponseData instead. At the moment, we could simply rework the logic to avoid using LinkExtractors altogether. However, there might be some cases wherein it's a much better option. With this in mind, this issue attempts to be a starting point to open up these discussion points:

- With LinkExtractors and web-poet being decoupled away from Scrapy itself, is it worth supporting LinkExtractors?
- Could we adapt LinkExtractors itself so it would be compatible with web-poet?