[Improve concurrent downloading capacity of script. #18] #23
Pull Request: Improve Concurrent Downloading Capacity of Script #18
Description
This update boosts the number of PDFs we can download per minute with scraping/download_and_extract_scripts/downloader.py, all while keeping the server happy. I’ve made some changes to handle more downloads at once and added some smart pauses to avoid overwhelming the site. If we hit any blocks, I’ve also considered using multiple downloaders with torify as a backup plan.
Changes Made
More Concurrent Downloads: Increased the batch size and adjusted semaphore limits for better throughput.
Smart Sleep Intervals: Added dynamic sleep times to manage request rates and keep things running smoothly.
Better Error Handling: Improved how we handle proxy connection issues to make downloads more resilient.
Enhanced Logging: Updated the logs for better visibility on what’s happening during downloads.
Plans for Multiple Downloaders: Thought about using multiple downloaders with torify if we encounter blocks.
This should make our downloading process faster without stressing the server. Let me know what you think!
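For reviewers who want a picture of the approach, here is a rough sketch of how a semaphore limit, a dynamic sleep, and a retry on connection errors can fit together. It is not the code in downloader.py: the names `download_all`, `download_one`, `next_delay`, and the values `MAX_CONCURRENT` and `BASE_DELAY` are all made up for illustration, and `fetch` stands in for whatever coroutine actually performs the HTTP request.

```python
import asyncio
import random

# Hypothetical values; the real limits used in the PR are not stated here.
MAX_CONCURRENT = 8
BASE_DELAY = 0.01  # seconds


def next_delay(base, failures, cap=5.0):
    """Exponential backoff with jitter, never longer than `cap` seconds."""
    return min(cap, base * (2 ** failures)) * random.uniform(0.5, 1.0)


async def download_one(url, fetch, semaphore, retries=3, base_delay=BASE_DELAY):
    """Fetch one URL, retrying on ConnectionError; returns None if all attempts fail."""
    for attempt in range(retries + 1):
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            try:
                return await fetch(url)
            except ConnectionError:
                if attempt == retries:
                    return None
        # Sleep outside the semaphore so the slot is free for other downloads.
        await asyncio.sleep(next_delay(base_delay, attempt))
    return None


async def download_all(urls, fetch, max_concurrent=MAX_CONCURRENT):
    """Download every URL concurrently, bounded by `max_concurrent`."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [download_one(url, fetch, semaphore) for url in urls]
    return await asyncio.gather(*tasks)
```

Calling `asyncio.run(download_all(urls, fetch))` returns the results in the same order as `urls`, with `None` for any URL that failed on every attempt. The pause grows with each failed attempt, so a struggling server or proxy sees fewer requests from us rather than more.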