HTML strip normalizer on the tokenizer #474

qdequele · 2022-05-20T14:01:09Z

qdequele
May 20, 2022
Maintainer

I sometimes help users who send data containing HTML without realizing that it dramatically degrades search relevancy, especially when words are directly adjacent to HTML tags.

Would it be possible to implement a normalizer that removes all HTML tags in document collections? Would it be possible to activate/deactivate it via settings or to activate it by default?

I found a crate to do it 🙂.

gmourier · 2022-06-02T12:19:21Z

gmourier
Jun 2, 2022
Maintainer

Hey @qdequele 👋

This could be a first addition in the direction of having a transformation pipeline.

Note: there is a possible workaround, which is to strip the HTML on the user side before sending the documents (this only works for people who are able to manipulate the documents).

It's not obvious that solving this is a priority right now, but the normalizer idea could be explored quickly.

@ManyTheFish What do you think?

3 replies

ManyTheFish Jun 2, 2022
Collaborator

Handling this in the tokenizer is possible, but it is harder than just adding a Normalizer, because normalization runs after the Segmenter.
That means the text <span><a href="#">Summer</a> is nice</span> would already be segmented as ["<", "span", ">", "<", "a", " ", "href", "=", "\"", "#", "\"", ">", "Summer", "<", "/", "a", ">", " ", "is", " ", "nice", "<", "/", "span", ">"] before the normalization phase.
However, it would be possible to pre-segment the text in order to isolate HTML tags and treat them as separators 🤔
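
To make the pre-segmentation idea concrete, here is a minimal sketch in plain Rust. It is not Charabia code: `Piece`, `pre_segment`, and `find_tag` are hypothetical names, and the tag detection is deliberately naive (a `<` followed by a letter or `/`, up to the next `>`). The point is only to show each tag isolated as a single piece before the regular Segmenter would see the text.

```rust
/// A piece of pre-segmented input (hypothetical type, not part of Charabia).
#[derive(Debug, PartialEq)]
enum Piece<'a> {
    /// Plain text that would be handed to the regular Segmenter.
    Text(&'a str),
    /// A whole HTML tag, to be treated as a single separator.
    Tag(&'a str),
}

/// Split `input` into text runs and HTML tags without copying.
fn pre_segment(input: &str) -> Vec<Piece<'_>> {
    let mut pieces = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        match find_tag(rest) {
            Some((start, end)) => {
                if start > 0 {
                    pieces.push(Piece::Text(&rest[..start]));
                }
                pieces.push(Piece::Tag(&rest[start..end]));
                rest = &rest[end..];
            }
            None => {
                // No further tag: everything left is text.
                pieces.push(Piece::Text(rest));
                break;
            }
        }
    }
    pieces
}

/// Byte range of the first substring that looks like a tag.
/// Naive assumption: a tag is '<' followed by an ASCII letter or '/',
/// up to the next '>'. A '>' inside an attribute value would end it early.
fn find_tag(s: &str) -> Option<(usize, usize)> {
    let mut search_from = 0;
    while let Some(rel) = s[search_from..].find('<') {
        let start = search_from + rel;
        let after = &s[start + 1..];
        let looks_like_tag =
            matches!(after.chars().next(), Some(c) if c.is_ascii_alphabetic() || c == '/');
        if looks_like_tag {
            // '<' and '>' are one byte each, so the arithmetic stays on char boundaries.
            return after.find('>').map(|close| (start, start + close + 2));
        }
        search_from = start + 1;
    }
    None
}
```

With the example above, `pre_segment` would yield `<span>`, `<a href="#">`, `Summer`, `</a>`, ` is nice`, and `</span>` as six pieces, so only `Summer` and ` is nice` would need real segmentation. Because the pieces borrow from the original string, their byte offsets in the source text can still be recovered.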

qdequele Jun 27, 2022
Maintainer Author

This crate strips all HTML tags completely and keeps only the text.

ManyTheFish Jul 4, 2022
Collaborator

Yes @qdequele! But using this library as a preprocessing step would completely break highlighting.
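
One plausible reading of the highlighting concern, sketched in plain Rust: if tags are stripped before indexing, match offsets refer to the stripped text and no longer point at the same word in the original document. `strip_tags` below is a hypothetical, deliberately naive stand-in, not the crate mentioned in this thread, and the offset explanation is an assumption about why highlighting would break.

```rust
/// Naive tag stripper, for illustration only (hypothetical helper).
/// It drops everything between '<' and the next '>', so a bare '<'
/// in ordinary text would be mishandled.
fn strip_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for c in input.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

/// Byte offset of `word` in the stripped text and in the original text.
fn offsets(original: &str, word: &str) -> (Option<usize>, Option<usize>) {
    let stripped = strip_tags(original);
    (stripped.find(word), original.find(word))
}
```

For `<span><a href="#">Summer</a> is nice</span>`, the stripped text is `Summer is nice`, where `nice` starts at byte 10. In the original it starts at byte 32, and bytes 10 to 14 of the original are `ref=`, so a highlighter applying the stripped offsets to the original document would mark the wrong span.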
