Bug: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon #2152

opendeluxe · 2023-10-26T21:07:00Z

opendeluxe
Oct 26, 2023

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/cheerio (CheerioCrawler)

Issue description

Cheerio crawler is not crawling when set maxRequestPerCrawl to 1.
Even when I set maxRequestPerCrawl to 10 or 100, after the 10th or 100th request nothing will be crawled again anymore.

I use a new instance of Cheerio for any single request, no parallel requests necessary in my usecases.
However, it counts requests on a global basis, no matter if I use a new instance of Cheerio for every request or if I use a shared instance.

Once the count of all requests is reaching the value of maxRequestPerCrawl, it will deny all further requests. The only solution is to shutdown the full process and start it again.

Code sample

const crawler = new CheerioCrawler({
      minConcurrency: 1,
      maxConcurrency: 1,

      //      proxyConfiguration: {},

      // On error, retry each page at most once.
      maxRequestRetries: 1,

      // Increase the timeout for processing of each page.
      requestHandlerTimeoutSecs: 30,

      // Limit to 10 requests per one crawl
      maxRequestsPerCrawl: 1,

     async requestHandler({ request, $, proxyInfo }) {
            // ...
     }
  )}

    await crawler.run([url]);
    await crawler.teardown();

Package version

3.4.0

Node.js version

18.17.1

Operating system

MacOS

Apify platform

Tick me if you encountered this issue on the Apify platform

I have tested this on the `next` release

No response

Other context

Log:

INFO  CheerioCrawler: Starting the crawl
INFO  CheerioCrawler: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon. Requests that are in progress will be allowed to finish.
INFO  CheerioCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 1 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 1 requests and will shut down.
INFO  CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":190}

vladfrangu · 2023-10-26T21:10:58Z

vladfrangu
Oct 26, 2023
Maintainer

Can you try following the steps defined here: #2031 (comment), specifically providing a new instance of Configuration per crawler you make, with persistStorage set to false?

0 replies

opendeluxe · 2023-10-26T21:20:44Z

opendeluxe
Oct 26, 2023
Author

Hi @vladfrangu

Can you try following the steps defined here: #2031 (comment), specifically providing a new instance of Configuration per crawler you make, with persistStorage set to false?

I'm happy to try it out. However, there is now Configuration available when importing CheerioCrawler with these imports:

import * as cheerio from 'cheerio';
import { Cheerio } from 'cheerio/lib/cheerio';

How can I import the Configuration-Object?

0 replies

vladfrangu · 2023-10-26T21:21:31Z

vladfrangu
Oct 26, 2023
Maintainer

You have to import things from crawlee, not from Cheerio!

0 replies

opendeluxe · 2023-10-26T21:36:03Z

opendeluxe
Oct 26, 2023
Author

@vladfrangu
Using new Configuration({ persistStorage: false }); will not change the behaviour. It stays the same.

1 reply

B4nan Oct 30, 2023
Maintainer

It surely changes the behaviour, can you provide a complete code for what you are doing, including that?

Check out this guide, it describes how to create stateless crawler instance:

https://crawlee.dev/docs/deployment/aws-cheerio

And be sure to upgrade crawlee to latest, I think v3.4.0 does not contain all the fixes needed for this. In latest version, you can also use purgeDefaultStorage helper explicitly to clean up the storages (and unlike in older versions, it should work as expected).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon #2152

{{title}}

Replies: 4 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Bug: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon #2152

opendeluxe Oct 26, 2023

Which package is this bug report for? If unsure which one to select, leave blank

Issue description

Code sample

Package version

Node.js version

Operating system

Apify platform

I have tested this on the next release

Other context

Replies: 4 comments · 1 reply

vladfrangu Oct 26, 2023 Maintainer

opendeluxe Oct 26, 2023 Author

vladfrangu Oct 26, 2023 Maintainer

opendeluxe Oct 26, 2023 Author

B4nan Oct 30, 2023 Maintainer

opendeluxe
Oct 26, 2023

I have tested this on the `next` release

Replies: 4 comments 1 reply

vladfrangu
Oct 26, 2023
Maintainer

opendeluxe
Oct 26, 2023
Author

vladfrangu
Oct 26, 2023
Maintainer

opendeluxe
Oct 26, 2023
Author

B4nan Oct 30, 2023
Maintainer