HTML strip normalizer on the tokenizer #474
qdequele
started this conversation in
Feedback & Feature Proposal
Replies: 1 comment 3 replies
-
Hey @qdequele 👋 This could be a first step toward having a transformation pipeline. Note that there is a possible workaround: strip the HTML on the user side before sending the documents, although this only helps people who are able to modify their documents. It's not obvious that solving this is a priority right now, but the normalizer idea could be explored quickly. @ManyTheFish What do you think?
-
Sometimes I help users whose data contains HTML; they often don't realize that this dramatically hurts search relevancy, especially when words touch HTML tags.
Would it be possible to implement a normalizer that removes all HTML tags from the documents in a collection? Could it be enabled or disabled via settings, or even enabled by default?
I found a crate to do it 🙂.
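To make the idea concrete, here is a minimal sketch of what such a normalizer could do, using only the Rust standard library. It is not the crate mentioned above and not the tokenizer's actual API: `strip_html_tags`, `NormalizerSettings`, and the `strip_html` flag are hypothetical names chosen for illustration. The sketch replaces each tag with a space, so words that touch a tag stay separate tokens. It is deliberately naive: it does not decode entities, skip `<script>` contents, or handle a stray `<` in plain text.

```rust
/// Hypothetical helper, not part of any real tokenizer API.
/// Replaces every `<...>` span with a space so that words touching
/// a tag are not glued together, then collapses whitespace.
fn strip_html_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for ch in input.chars() {
        match ch {
            // Start of a tag: emit a separator and start skipping.
            '<' if !in_tag => {
                in_tag = true;
                out.push(' ');
            }
            // End of a tag: resume copying characters.
            '>' if in_tag => in_tag = false,
            // Anything inside a tag is dropped.
            _ if in_tag => {}
            // Regular text is kept as-is.
            _ => out.push(ch),
        }
    }
    // Collapse the runs of whitespace left behind by removed tags.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hypothetical setting that turns the normalizer on or off.
struct NormalizerSettings {
    strip_html: bool,
}

/// Applies the HTML stripping only when the (assumed) setting is enabled.
fn normalize(input: &str, settings: &NormalizerSettings) -> String {
    if settings.strip_html {
        strip_html_tags(input)
    } else {
        input.to_string()
    }
}
```

With `strip_html: true`, `normalize("<p>Hello<br>world</p>", &settings)` returns `"Hello world"`; with `strip_html: false` the input is returned unchanged. Inserting a space for every tag is one possible design choice: it also splits words around inline tags such as `<b>`, which may or may not be wanted.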