[metascraper-description] add some fuzzy rules #194

osdiab · 2019-07-26T09:12:46Z

Prerequisites

Irrelevant

I'm using the latest version.
My Node version is the same as declared in package.json.

Subject of the issue

The metascraper-description package is fairly useful, but for pages that lack proper Open Graph or other social meta tags, headings could serve as a fallback. I'm imagining a scraper that lists up to n headings on a page, like so:

// Output of the proposed metascraper-headings rule
type HeadingsOutput = {
  level: number;  // 1 for h1, 2 for h2, etc.
  content: string;
}[];
// The array is emitted in document order. If it were instead ordered by a
// heuristic "significance", each entry could carry an extra "position" field.
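As a rough illustration of how that output shape might be produced, here is a minimal sketch. The `extractHeadings` helper and its `limit` parameter are hypothetical and not part of metascraper. It scans raw HTML with a regular expression, which is only good enough for a demonstration; a real rule would query the parsed DOM.

```typescript
// Hypothetical helper, not part of metascraper: collects up to `limit`
// headings from raw HTML, in document order.
type Heading = { level: number; content: string };

function extractHeadings(html: string, limit = 5): Heading[] {
  // Naive regex matching; a real rule would query the parsed DOM instead.
  const pattern = /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1\s*>/gi;
  const headings: Heading[] = [];
  let match: RegExpExecArray | null;
  while (headings.length < limit && (match = pattern.exec(html)) !== null) {
    const content = match[2]
      .replace(/<[^>]+>/g, "") // drop nested tags such as <span> or <a>
      .replace(/\s+/g, " ") // collapse runs of whitespace
      .trim();
    // Skip headings that are empty once markup is removed.
    if (content) headings.push({ level: Number(match[1]), content });
  }
  return headings;
}
```

The returned array has the same shape as `HeadingsOutput` above, so a client could take the first entry with `level === 1` as a candidate description.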

I noticed this because metascraper-description returns null for https://www.hekimaplace.org, whereas if you paste the URL into Facebook, the card body is populated by default with the h1 element at the top of the page, which turns out to make a lot of sense as a description.

Steps to reproduce

Run the sample code from the metascraper front page, but change the URL to the one mentioned above. The description comes back as null.

Then open Facebook or Messenger and paste the URL there. The card shows a sensible description, which suggests they are using some other heuristic. Since the text matches the h1, I bet that's it.

Expected behaviour

Either metascraper-description can try to use other heuristics, or a separate rule can be made that executes those heuristics separately and the client can choose which one they'd prefer.

Actual behaviour

No clear way to use that heuristic... besides making a rule yourself :)
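For reference, a do-it-yourself rule could look roughly like the sketch below. It assumes the rule-bundle shape described in the metascraper README (a function returning an object that maps a field name to an array of rule functions, each receiving a cheerio-like `htmlDom`). The structural types and the name `headingDescription` are made up for illustration and are not the library's own typings.

```typescript
// Minimal structural types standing in for the cheerio-like `htmlDom`
// that metascraper passes to rules (an assumption, not the real typings).
type Selection = { first(): Selection; text(): string };
type RuleInput = { htmlDom: (selector: string) => Selection; url: string };
type Rule = (input: RuleInput) => string | undefined;

// Hypothetical rule bundle: use the first <h1> as the description.
const headingDescription = (): { description: Rule[] } => ({
  description: [
    ({ htmlDom: $ }) => {
      const text = $("h1").first().text().replace(/\s+/g, " ").trim();
      // Return undefined for empty headings so another rule can take over.
      return text.length > 0 ? text : undefined;
    },
  ],
});
```

Assuming rules for the same field are tried in the order the bundles are passed, listing this bundle after metascraper-description would make the h1 a last resort rather than the default.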

Kikobeats · 2019-07-27T06:36:05Z

This kind of rule can be added and enabled through package options. Personally, I don't like this approach, because HTML markup is used completely differently on every website: even if you think it would improve data detection here, in other cases it will make it worse.

Happy to accept a PR that adds some suggested conditional rules 🙂
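The opt-in approach described above might be sketched as follows. The `fuzzy` option name, the `Page` shape, and both functions are hypothetical; metascraper-description does not define them. The idea is only that heuristic rules stay off unless requested and always run after the strict ones.

```typescript
// Simplified stand-in for whatever a rule would read from the page.
type Page = { metaDescription?: string; firstHeading?: string };
type Rule = (page: Page) => string | undefined;

// Hypothetical option name; metascraper-description does not define `fuzzy`.
type Options = { fuzzy?: boolean };

function descriptionRules({ fuzzy = false }: Options = {}): Rule[] {
  const strict: Rule[] = [(page) => page.metaDescription];
  const heuristic: Rule[] = [(page) => page.firstHeading];
  // Heuristic rules go last so explicit metadata always wins.
  return fuzzy ? [...strict, ...heuristic] : strict;
}

// Return the first non-empty value produced by the rules, else null.
function resolve(rules: Rule[], page: Page): string | null {
  for (const rule of rules) {
    const value = rule(page)?.trim();
    if (value) return value;
  }
  return null;
}
```

With the default options a page that has only a heading still resolves to null, which keeps the current behaviour for sites where headings would make a poor description.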

osdiab changed the title ~~New rule: Getting most significant headings from page~~ New rule: most significant headings from page Jul 26, 2019

Kikobeats changed the title ~~New rule: most significant headings from page~~ [metascraper-description] add some fuzzy rules Jul 27, 2019

Kikobeats added the enhancement label Jul 27, 2019
