lt-proc -b explodes on a-zA-Z regexes + long input #167
Comments
Each time you step over an uppercase letter, it also steps over the lowercase version, so for a word of N uppercase letters you have 2^N states by the time you reach the end.
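To make the blow-up concrete, here is a toy simulation in Python. It is not lttoolbox code: the single looping node, the `(node, output)` pairs and the names `step` and `count_states` are all assumptions made for illustration, standing in for an `[a-zA-Z]+`-style regex that copies its input to the output.

```python
def step(states, ch):
    """Advance every live (node, output) pair over one input character.

    Assumption: an uppercase input character is tried both as itself and
    as its lowercase version, and the regex loop accepts either one.
    """
    alternatives = {ch, ch.lower()}  # a lowercase ch yields just one alternative
    return {(node, out + alt) for node, out in states for alt in alternatives}


def count_states(word):
    """Return how many live states remain after reading the whole word."""
    states = {(0, "")}  # start in the single looping node with empty output
    for ch in word:
        states = step(states, ch)
    return len(states)


if __name__ == "__main__":
    for word in ("foobar", "FooBar", "FOOBAR", "ABCDEFGHIJKLMNOP"):
        print(word, count_states(word))
```

In this model only uppercase letters double the state set, so a 16-letter all-caps word already reaches 2^16 states.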
Another option, which I think we should do regardless (to prevent me from DoS-ing our servers with bad regexes), is to have a max state size. As soon as you go over about 2^16 states you can feel the slowdown for a single (long, uppercase) word (and you're bound to be getting so many results that it's mostly garbage), so perhaps simply
Maybe with a warning the first time we reach that size.
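A rough sketch of what such a cap could look like, in Python rather than the project's own code. The `StateCap` class is hypothetical, and what to do with the excess is an assumption too: here the states past the limit are simply dropped, which may not be what the actual change does. The 65536 default is the 2^16 figure from this thread.

```python
import warnings

MAX_STATES = 1 << 16  # 65536, the threshold mentioned in this thread


class StateCap:
    """Hypothetical helper: bound the number of live states, warn only once."""

    def __init__(self, limit=MAX_STATES):
        self.limit = limit
        self.warned = False

    def apply(self, states):
        """Return states unchanged if within the limit, else the first `limit`."""
        if len(states) <= self.limit:
            return states
        if not self.warned:
            # Only the first overflow is reported, to avoid flooding the log.
            warnings.warn(f"state set exceeded {self.limit}; dropping the excess")
            self.warned = True
        return states[: self.limit]  # assumption: truncate rather than abort


if __name__ == "__main__":
    cap = StateCap(limit=4)
    print(cap.apply(list(range(10))))  # [0, 1, 2, 3], with one warning
    print(cap.apply(list(range(10))))  # [0, 1, 2, 3], no second warning
```

Truncating keeps some analyses for the user instead of none, at the cost of an arbitrary choice of which ones survive.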
@ftyers can you imagine anywhere we would want a lowercase character (on the output side) if the input was upper, upper is possible, and we end up in the same FST node? (That'd be like if you had both "iphone" and "iPhone" in your dix and the user wrote "iPhone" – analysis would now only give "iPhone" instead of both. I would see that as a good thing, but maybe there are use-cases for having both.)
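The pruning idea in that question could look roughly like this. It is a Python sketch under stated assumptions, not the lttoolbox implementation: states are modelled as `(node, output)` pairs, `prefer_input_case` is a made-up name, and "matches the input casing" is simplified to "output equals the input consumed so far".

```python
def prefer_input_case(states, consumed):
    """Collapse states that differ only in output casing within one node.

    states:   iterable of (node, output) pairs (a simplified model)
    consumed: the input read so far, with its original casing
    When several outputs in the same node are equal ignoring case, keep
    the one identical to the input if there is one, else the first seen.
    """
    best = {}
    for node, out in states:
        key = (node, out.lower())
        kept = best.get(key)
        if kept is None or (out == consumed and kept != consumed):
            best[key] = out
    return {(node, out) for (node, _), out in best.items()}


if __name__ == "__main__":
    states = {(0, "iPhone"), (0, "iphone"), (1, "iphone")}
    print(prefer_input_case(states, "iPhone"))
```

With input "iPhone", node 0 keeps only "iPhone", while the "iphone" state in node 1 survives because it has no competitor there.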
currently 65536, quite high but at least within what most modern machines can deal with. Also, delete FSTProcessor.current_state, since confusingly all the processors (except transliteration) make a local State called current_state. Should help a bit against #167.
Another thing we could do is within
b.dix:
If I remove A-Z from the regex, it's fine again, but I get wrong lemmas (e.g. FooBar becomes FooBar<guess>/Foobar<guess>, wrong capitalisation on the output).