Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed TitleNormalizer for Chinese characters #52

Merged
merged 2 commits into from
Nov 10, 2024

Conversation

damien-git
Copy link

This PR for TitleNormalizer fixes the order for Chinese characters by using ICUCollatorNormalizer. Without it, a search using Chinese characters will return a position before A but these titles are appearing after Z. The current issue might also affect other unicode characters.

@todolson
Copy link

todolson commented Nov 7, 2024

It seems we might want a TitleNormalizerTest to demonstrate the desired sort order with respect to how Roman, CJK, and potentially other code points sort. It could be based closely on or extended from ICUCollatorNormalizerTest.

@damien-git
Copy link
Author

@todolson The current ICUCollatorNormalizerTest does not seem hard enough to properly test this particular issue (the current implementation for TitleNormalizer works fine with most scripts), but I don't feel qualified to write good tests for that.

@demiankatz
Copy link
Member

@damien-git, would it be better than nothing to copy ICUCollatorNormalizerTest to TitleNormalizerTest and include something similar to the current diacritics test, only include a CJK string in the middle of the list and ensure that it sorts to the correct end? I probably don't know my way around the test framework here much better than you do, but if you need help, let me know and I'm willing to take a stab at it when time permits.

@todolson
Copy link

todolson commented Nov 7, 2024

@damien-git, along those lines, it would be useful to see an example of maybe a half-dozen strings that illustrate the desired sorting. I'm far removed from the test framework these days myself, but between the three of us we can probably figure it out without too much effort.

@damien-git
Copy link
Author

Yes, I guess the ICUCollatorNormalizerTest tests with an added example using Chinese characters would be better than nothing. I will prepare something.

Copy link
Member

@demiankatz demiankatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @damien-git. This seems reasonable to me, but I won't merge immediately just in case @todolson has any further suggestions!

@todolson
Copy link

I think this looks good. It gives a baseline expectation, and can be build upon should the need arise.

@demiankatz
Copy link
Member

Thanks, everyone; merging now! @damien-git, do you have any more browse handler changes coming soon? If not, let me know if you'd like me to build this and update the dev branch of the main VuFind repo to use it. (Or you can open a PR to do that if you like!).

@demiankatz demiankatz merged commit 3236d56 into vufind-org:dev Nov 10, 2024
1 check passed
@damien-git
Copy link
Author

@demiankatz I don't have additional changes coming. I was planning to prepare a VuFind PR with the jars once this PR is merged. I will do it this week.

@demiankatz
Copy link
Member

@demiankatz I don't have additional changes coming. I was planning to prepare a VuFind PR with the jars once this PR is merged. I will do it this week.

Thanks, @damien-git, sounds good -- please let me know if I can do anything more to help; otherwise, I'll watch for your PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants