
A typo(?) causing crashes in newer versions of apify + crawlee[parsel] #324

Closed
Rigos0 opened this issue Nov 13, 2024 · 4 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.


Rigos0 commented Nov 13, 2024

I am running an Actor using Crawlee's ParselCrawler.

After starting, the Actor immediately crashes with this error:

TypeError: Requested global configuration object of type <class 'apify._configuration.Configuration'>, but <class 'crawlee.configuration.Configuration'> was found
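For context, this error pattern is what a global service locator produces when it stores whichever `Configuration` class is requested first and a later caller requests an incompatible one. The following is an editorial sketch in plain Python with hypothetical names, not the actual apify/crawlee source:

```python
class ServiceLocator:
    """Hypothetical global registry illustrating the conflict."""

    def __init__(self):
        self._configuration = None

    def get_configuration(self, config_cls):
        # The first caller's configuration instance is stored globally.
        if self._configuration is None:
            self._configuration = config_cls()
        # A later caller requesting an incompatible class gets a TypeError.
        if not isinstance(self._configuration, config_cls):
            raise TypeError(
                f'Requested global configuration object of type {config_cls}, '
                f'but {type(self._configuration)} was found'
            )
        return self._configuration


class CrawleeConfiguration:
    pass


class ApifyConfiguration(CrawleeConfiguration):
    pass


locator = ServiceLocator()

# Creating the crawler at import time requests the base class first...
locator.get_configuration(CrawleeConfiguration)

# ...so a later request for the apify subclass (e.g. from `async with Actor`)
# finds a stored instance that no longer matches the requested type.
try:
    locator.get_configuration(ApifyConfiguration)
except TypeError as exc:
    print(f'reproduced: {exc}')
```

In this sketch, an `ApifyConfiguration` would satisfy both requests, but a `CrawleeConfiguration` created first cannot satisfy a later request for the subclass, which matches the direction of the error message above.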


This is my requirements file:

apify
beautifulsoup4[lxml]
httpx
types-beautifulsoup4
crawlee[parsel]

I managed to fix the error by pinning apify==1.3.0.
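The full workaround, using the requirements file above with only the apify line changed to a pin:

```
apify==1.3.0
beautifulsoup4[lxml]
httpx
types-beautifulsoup4
crawlee[parsel]
```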

github-actions bot added the t-tooling label Nov 13, 2024
janbuchar (Contributor) commented

Hello, and thank you for your interest in Crawlee! Could you give us an executable code snippet that reproduces the error?

Rigos0 (Author) commented Nov 13, 2024

Hi, you should be able to replicate it with this snippet and the requirements file above. Let me know if you need any further info.

import asyncio
from apify import Actor
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext

# The crawler is instantiated at import time, outside `async with Actor`.
crawler = ParselCrawler()

@crawler.router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    context.log.info(f"Processing URL: {context.request.url}...")
    # Handle the request logic here (e.g., parsing, extracting data, etc.)
    ...

async def main() -> None:
    async with Actor:
        input_data = await Actor.get_input()
        start_urls = input_data.get("startUrls", [])
        
        url_to_crawl = start_urls[0]["url"]
        await crawler.run([url_to_crawl])

if __name__ == '__main__':
    asyncio.run(main())

vdusek (Contributor) commented Nov 14, 2024

Hi @Rigos0, the crawler initialization has to be wrapped inside async with Actor. If you'd like to define your request handlers at the top level, you can use the following structure:

import asyncio
from apify import Actor
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


@router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    context.log.info(f'Processing URL: {context.request.url}...')
    ...


async def main() -> None:
    async with Actor:
        input_data = await Actor.get_input()
        start_urls = input_data.get('startUrls', [])
        url_to_crawl = start_urls[0]['url']

        crawler = ParselCrawler(request_handler=router)
        await crawler.run([url_to_crawl])


if __name__ == '__main__':
    asyncio.run(main())

I believe this should resolve your issue, so I'll close it. Feel free to re-open if this doesn't help.

Btw. I moved this to SDK as it belongs here.

Rigos0 (Author) commented Nov 14, 2024

Yes, that fixes the issue. Thanks!
