fix: Bring in fix from custom nodes #8539

Merged (5 commits) on Nov 14, 2024

sjrl (Contributor) commented Nov 12, 2024

Related Issues

Brings in the fix first introduced in https://github.com/deepset-ai/deepset-cloud-custom-nodes/pull/367

Proposed Changes:

  • To avoid fully subsuming the previous chunk, we ignore the first sentence of that chunk when calculating sentence overlap, i.e. we want to avoid cases like Doc1 = [s1, s2], Doc2 = [s1, s2, s3].
  • This only applies when splitting by word and respecting sentence boundaries.
  • This mirrors the implementation in v1's PreProcessor.
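
The overlap rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `count_overlap_sentences` and the exact stopping conditions are assumptions made for the example. It counts how many trailing sentences of the previous chunk get repeated in the next one, and never considers the first sentence.

```python
from typing import List


def count_overlap_sentences(sentences: List[str], split_length: int, split_overlap: int) -> int:
    """Return how many trailing sentences of a chunk to repeat in the next chunk.

    Hypothetical helper for illustration; word counts use whitespace splitting.
    """
    if split_overlap == 0:
        return 0

    kept = 0
    words = 0
    # Skip sentences[0]: otherwise a short chunk could be carried over whole,
    # giving Doc1 = [s1, s2] and Doc2 = [s1, s2, s3].
    for sentence in reversed(sentences[1:]):
        words += len(sentence.split())
        # Stop if the overlap alone would exceed the chunk size.
        if words > split_length:
            break
        kept += 1
        # Stop once the requested overlap has been reached.
        if words > split_overlap:
            break
    return kept


previous_chunk = ["First sentence here.", "Second sentence here."]
overlap = count_overlap_sentences(previous_chunk, split_length=10, split_overlap=8)
```

With these inputs only the second sentence is carried over, so the next chunk starts at s2 rather than repeating the whole previous chunk.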

Since this component was a port of the functionality of Haystack v1, I view this as a bug fix.

A few other changes:

  • While I was here, I also finished adding function support for this component by updating the _split_into_units function and adding the splitting_function init parameter.
  • Also fixed another bug introduced when function support added a to_dict method to the underlying DocumentSplitter. Because this component did not define its own to_dict method, a number of its parameters were ignored when saving it to YAML. So I added a to_dict method for this component and a corresponding test.
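
To illustrate the serialization bug in general terms, here is a small self-contained sketch. The classes `BaseSplitter` and `SentenceAwareSplitter` are hypothetical stand-ins, not the Haystack components, and the dictionary layout is an assumption. The point is only that a subclass inheriting its parent's to_dict silently loses any init parameters the parent does not know about.

```python
from typing import Any, Dict


class BaseSplitter:
    """Hypothetical parent component that serializes only its own init parameters."""

    def __init__(self, split_by: str = "word", split_length: int = 200) -> None:
        self.split_by = split_by
        self.split_length = split_length

    def to_dict(self) -> Dict[str, Any]:
        return {
            "type": type(self).__name__,
            "init_parameters": {
                "split_by": self.split_by,
                "split_length": self.split_length,
            },
        }


class SentenceAwareSplitter(BaseSplitter):
    """Hypothetical subclass that adds one extra init parameter."""

    def __init__(
        self,
        split_by: str = "word",
        split_length: int = 200,
        respect_sentence_boundary: bool = False,
    ) -> None:
        super().__init__(split_by=split_by, split_length=split_length)
        self.respect_sentence_boundary = respect_sentence_boundary

    def to_dict(self) -> Dict[str, Any]:
        # Without this override, the inherited to_dict would drop
        # respect_sentence_boundary from the serialized output.
        data = super().to_dict()
        data["init_parameters"]["respect_sentence_boundary"] = self.respect_sentence_boundary
        return data


splitter = SentenceAwareSplitter(respect_sentence_boundary=True)
serialized = splitter.to_dict()
# Calling the parent method directly shows what was lost before the override existed.
inherited_only = BaseSplitter.to_dict(splitter)
```

A round-trip test that compares the serialized parameters against the values passed to the constructor is one way to catch this kind of omission.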

How did you test it?

Copied over tests

Notes for the reviewer

Checklist

@sjrl requested a review from a team as a code owner November 12, 2024 12:53
@sjrl requested review from silvanocerza and removed request for a team November 12, 2024 12:53
@github-actions bot added the topic:tests and type:documentation labels Nov 12, 2024
coveralls (Collaborator) commented Nov 12, 2024

Pull Request Test Coverage Report for Build 11812877811

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 11 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.02%) to 90.159%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/preprocessors/nltk_document_splitter.py | 11 | 94.3% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 11799929241 | -0.02% |
| Covered Lines | 7787 |
| Relevant Lines | 8637 |

💛 - Coveralls

@sjrl requested a review from a team as a code owner November 12, 2024 15:59
@sjrl requested review from dfokina and removed request for a team November 12, 2024 15:59
@sjrl merged commit 0c11c7b into main Nov 14, 2024
18 checks passed
@sjrl deleted the fix-nltk-doc-splitter branch November 14, 2024 12:00