[Improve concurrent downloading capacity of script. #18] #23

Closed
wants to merge 1 commit into from


Sadique982

Pull Request: Improve Concurrent Downloading Capacity of Script #18

Description

This update increases the number of PDFs that scraping/download_and_extract_scripts/downloader.py can download per minute without overloading the server. I’ve changed the script to run more downloads at once and added pauses between requests to avoid overwhelming the site. If we get blocked, I’ve also considered running multiple downloaders with torify as a backup plan.

Changes Made

  1. More Concurrent Downloads: Increased the batch size and adjusted semaphore limits for better throughput.

  2. Smart Sleep Intervals: Added dynamic sleep times to manage request rates and keep things running smoothly.

  3. Better Error Handling: Improved how we handle proxy connection issues to make downloads more resilient.

  4. Enhanced Logging: Updated the logs for better visibility on what’s happening during downloads.

  5. Plans for Multiple Downloaders: Thought about using multiple downloaders with torify if we encounter blocks.
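
To make items 1 to 3 concrete, here is a minimal sketch of semaphore-limited downloads with a randomized pause before each request. It is not the code from this PR (the diff is not shown here); `bounded_download`, `fetch`, `max_concurrent`, `base_delay`, and `jitter` are hypothetical names chosen for illustration, and `fetch` stands in for whatever coroutine actually retrieves a PDF.

```python
import asyncio
import random


async def bounded_download(urls, fetch, max_concurrent=5, base_delay=0.0, jitter=0.0):
    """Call fetch(url) for every URL with at most max_concurrent in flight.

    Hypothetical helper: `fetch` is any coroutine function taking a URL.
    Returns a list of (url, result_or_exception) in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url):
        async with semaphore:
            # Pause while holding the slot, so the request rate is capped
            # by both the semaphore size and the delay.
            await asyncio.sleep(base_delay + random.uniform(0, jitter))
            try:
                return url, await fetch(url)
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                return url, exc

    # gather() keeps results in the order the coroutines were passed in.
    return await asyncio.gather(*(worker(u) for u in urls))
```

Raising `max_concurrent` only helps while the server and the network are not the bottleneck, so measuring PDFs per minute before and after a change is the only way to tell whether it made the downloader faster.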

This should make our downloading process faster without stressing the server. Let me know what you think!

In this update, I have increased the number of PDFs downloaded per minute while ensuring that we do not overwhelm the server. The changes can be found in `scraping/download_and_extract_scripts/downloader.py`. I’ve implemented methods to manage concurrent downloads more effectively, including utilizing semaphore limits and adding sleep intervals. Additionally, I have considered strategies to avoid getting blocked, such as implementing multiple downloaders with `torify` if necessary.
@Sadique982
Author

@fffoivos @zvr Check the PR.

@fffoivos
Collaborator

fffoivos commented Nov 5, 2024

@Sadique982 this doesn't seem to make the downloader any faster, so I will close it now. Thanks for your effort.

@fffoivos fffoivos closed this Nov 5, 2024

2 participants