[Improve concurrent downloading capacity of script. #18] #23

Sadique982 · 2024-11-02T10:19:52Z

Pull Request: Improve Concurrent Downloading Capacity of Script #18

Description

This update increases the number of PDFs that `scraping/download_and_extract_scripts/downloader.py` can download per minute without overloading the server. I’ve changed the script to handle more downloads at once and added adaptive pauses so the site is not overwhelmed. If we still get blocked, I’ve also considered running multiple downloaders with `torify` as a backup plan.

Changes Made

More Concurrent Downloads: Increased the batch size and adjusted semaphore limits for better throughput.
Smart Sleep Intervals: Added dynamic sleep times to manage request rates and keep things running smoothly.
Better Error Handling: Improved how we handle proxy connection issues to make downloads more resilient.
Enhanced Logging: Updated the logs for better visibility on what’s happening during downloads.
Plans for Multiple Downloaders: Thought about using multiple downloaders with torify if we encounter blocks.
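For readers who want a concrete picture of the semaphore-limited batching and connection-error handling listed above, here is a minimal sketch. It is not the code from `downloader.py`: the names `download_all`, `fetch`, and `max_concurrent` are hypothetical, and `fetch` stands in for whatever coroutine actually performs the HTTP request.

```python
import asyncio


async def download_all(urls, fetch, max_concurrent=5):
    """Run fetch(url) for every URL with at most max_concurrent in flight.

    `fetch` is a hypothetical coroutine supplied by the caller; this sketch
    does not assume any particular HTTP library.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    results = {}

    async def worker(url):
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            try:
                results[url] = await fetch(url)
            except (ConnectionError, asyncio.TimeoutError) as exc:
                # Record the failure instead of aborting the whole batch,
                # so one broken proxy connection does not stop the others.
                results[url] = exc

    await asyncio.gather(*(worker(url) for url in urls))
    return results
```

Calling `asyncio.run(download_all(urls, fetch, max_concurrent=10))` returns a dictionary mapping each URL to either its result or the exception it raised, which makes failed downloads easy to log or retry. Raising `max_concurrent` only helps if the server and network are the bottleneck, so any speed-up would need to be measured.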

This should make our downloading process faster without stressing the server. Let me know what you think!

In this update, I have increased the number of PDFs downloaded per minute while ensuring that we do not overwhelm the server. The changes can be found in `scraping/download_and_extract_scripts/downloader.py`. I’ve implemented methods to manage concurrent downloads more effectively, including utilizing semaphore limits and adding sleep intervals. Additionally, I have considered strategies to avoid getting blocked, such as implementing multiple downloaders with `torify` if necessary.
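The description does not say how the sleep intervals are computed. One common scheme, shown here purely as an assumption, is to lengthen the pause after a failed or throttled request and shorten it gradually after successes. The class name `AdaptiveDelay` and all of its parameters are hypothetical.

```python
class AdaptiveDelay:
    """Track a sleep interval that grows on failure and shrinks on success.

    Hypothetical helper for illustration; the defaults are arbitrary.
    """

    def __init__(self, initial=1.0, minimum=0.5, maximum=60.0,
                 backoff=2.0, recovery=0.9):
        self.current = initial
        self.minimum = minimum
        self.maximum = maximum
        self.backoff = backoff      # multiplier applied after a failure
        self.recovery = recovery    # multiplier applied after a success

    def record(self, success):
        """Update the interval from the latest outcome and return it."""
        if success:
            # Ease back toward the minimum while requests keep succeeding.
            self.current = max(self.minimum, self.current * self.recovery)
        else:
            # Back off quickly, but never beyond the configured maximum.
            self.current = min(self.maximum, self.current * self.backoff)
        return self.current
```

A downloader loop could call `record(...)` after each request and sleep for the returned number of seconds before the next one.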

Sadique982 · 2024-11-02T10:20:59Z

@fffoivos @zvr Check the PR.

fffoivos · 2024-11-05T12:23:22Z

@Sadique982 this doesn't seem to make the downloader any faster, so I will close it now. Thanks for your effort.

fffoivos closed this Nov 5, 2024
