
Allow to follow news sites not providing RSS/Atom feed or news sitemap #41

Open
sebastian-nagel opened this issue Jul 24, 2020 · 2 comments


@sebastian-nagel
Collaborator

The news crawler currently relies exclusively on RSS/Atom feeds and news sitemaps to find links to news articles. However, some news sites provide neither feeds nor sitemaps. To follow these sites, the crawler should be able to monitor HTML pages that have been manually marked as seeds and extract links from them:

  • add a parser class to the topology which
    • exclusively parses URLs marked as verified HTML seeds (e.g. by a metadata key isHtmlSeed)
    • extracts links from the HTML and sends them to the status index as DISCOVERED
    • (optionally) filters outlinks: same host or domain, or configurable URL patterns stored in the status index for the HTML seed
  • the (adaptive) scheduler must be configured to schedule the refetch of HTML seeds
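
A minimal sketch of the parsing and filtering steps listed above, written in Python only to illustrate the logic and not tied to the crawler's actual topology classes. The function name `extract_seed_outlinks`, the string value `"true"` for the isHtmlSeed key, and passing the URL patterns as a list of regular expressions are all assumptions made for this example. Only the same-host filter is shown, and refetch scheduling is not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
import re


class _AnchorCollector(HTMLParser):
    """Collects the raw href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_seed_outlinks(url, html, metadata, url_patterns=None):
    """Return (outlink, "DISCOVERED") pairs for a verified HTML seed.

    Hypothetical helper: pages whose metadata lacks isHtmlSeed == "true"
    (an assumed encoding of the marker) are not parsed at all.
    """
    if metadata.get("isHtmlSeed") != "true":
        return []

    collector = _AnchorCollector()
    collector.feed(html)

    seed_host = urlsplit(url).hostname
    compiled = [re.compile(p) for p in (url_patterns or [])]
    seen = set()
    results = []
    for href in collector.hrefs:
        # Resolve relative links against the seed URL and drop fragments.
        link = urljoin(url, href).split("#", 1)[0]
        parts = urlsplit(link)
        if parts.scheme not in ("http", "https"):
            continue
        # Same-host filter; a same-domain filter would compare
        # registered domains instead.
        if parts.hostname != seed_host:
            continue
        # Optional per-seed URL patterns: keep links matching at least one.
        if compiled and not any(p.search(link) for p in compiled):
            continue
        if link == url or link in seen:
            continue
        seen.add(link)
        results.append((link, "DISCOVERED"))
    return results
```

Each returned pair stands in for a status update that would be sent to the status index; how that update is actually emitted depends on the topology.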
@vladignatyev

I'm currently working on very similar software. How could I contribute to the project?

@wumpus
Member

wumpus commented Dec 19, 2023

Vlad, this project is not currently a high priority for us. This enhancement is a good idea, and it's an idea that the search engine I founded a long time ago used successfully for our news crawl.
