Allow to follow news sites not providing RSS/Atom feed or news sitemap #41

sebastian-nagel · 2020-07-24T11:16:12Z

The news crawler (as of now) relies exclusively on RSS/Atom feeds and news sitemaps to find links to news articles. However, some news sites do not provide feeds or sitemaps. In order to follow these news sites, the crawler should be able to monitor HTML pages manually marked as seeds and extract links from them:

- add a parser class to the topology which
  - exclusively parses URLs marked as verified HTML seeds (e.g. by a metadata key isHtmlSeed)
  - extracts links from the HTML and sends them to the status index as DISCOVERED
  - (optionally) filters outlinks: same host or domain, configurable URL patterns stored in the status index for the HTML seed
- the (adaptive) scheduler must be configured to schedule the refetch of HTML seeds
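
To make the proposed parsing step concrete, here is a rough, standalone Python sketch of the link extraction and filtering logic. It is not code from the crawler and does not use its topology API. The function name `extract_seed_outlinks`, the class `AnchorCollector`, the metadata value `"true"`, and the `url_pattern` parameter are all assumptions made for illustration. Only the `isHtmlSeed` key and the `DISCOVERED` status come from the description above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AnchorCollector(HTMLParser):
    """Collects raw href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_seed_outlinks(seed_url, html, metadata, url_pattern=None):
    """Return (url, status) pairs for outlinks of a page marked as HTML seed.

    Assumes metadata is a dict and that a seed carries isHtmlSeed == "true";
    the value format is an assumption, only the key name is from the issue.
    """
    # Pages that are not marked as HTML seeds are not parsed at all.
    if metadata.get("isHtmlSeed") != "true":
        return []

    collector = AnchorCollector()
    collector.feed(html)

    seed_host = urlsplit(seed_url).hostname
    pattern = re.compile(url_pattern) if url_pattern else None
    seen = set()
    outlinks = []
    for href in collector.hrefs:
        # Resolve relative links against the seed and drop fragments.
        url = urljoin(seed_url, href).split("#", 1)[0]
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue
        # Same-host filter; a same-domain filter would need a suffix list.
        if parts.hostname != seed_host:
            continue
        # Optional per-seed URL pattern filter.
        if pattern and not pattern.search(url):
            continue
        if url == seed_url or url in seen:
            continue
        seen.add(url)
        outlinks.append((url, "DISCOVERED"))
    return outlinks
```

In a real topology the returned pairs would be emitted to the status stream rather than returned as a list, and the per-seed pattern would be read from the seed's metadata in the status index.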

vladignatyev · 2023-12-17T14:00:04Z

Currently I'm working on very similar software. How could I contribute to the project?

wumpus · 2023-12-19T06:13:06Z

Vlad, this project is not currently a high priority for us. This enhancement is a good idea, and it's an idea that the search engine I founded a long time ago used successfully for our news crawl.

sebastian-nagel added the enhancement label Jul 24, 2020
