Email address filtering needs optimization or relaxed mode #121

Closed
robfromboulder opened this issue Aug 9, 2024 · 1 comment

@robfromboulder
Collaborator

phileas-benchmark results show that email address detection is more CPU intensive (and requires more memory & stack space) than other regex-based filters.

Performance of single identifiers with 4k values:
mask_credit_cards - 35k calls/sec
mask_bitcoin_addresses - 31k calls/sec
mask_iban_codes - 26k calls/sec
mask_bank_routing_numbers - 27k calls/sec
mask_ssns - 16k calls/sec
mask_phone_numbers - 14k calls/sec
mask_email_addresses - 5k calls/sec 🔥

The current regex is known to be expensive, so it might make sense to offer a "relaxed" version that performs better without giving up too much accuracy.

@robfromboulder
Collaborator Author

@jzonthemtn I'm looking at a few regex variations that show better performance, but I need to do some more testing to see how accuracy is affected in the data I have available.

One interesting bit though -- the email address filter currently does not use the \b...\b fencing that many of the regex-based filters use. Wrapping the current email address regex in \b...\b roughly doubles performance on its own. I think that makes sense, since the leading \b lets the regex engine reject start positions in the middle of a word right away, which cuts down on wasted match attempts and backtracking.

👆 Since we're also discussing use of \b from a confidence standpoint (in #120), I thought this was kinda neat to see how much the \b...\b fencing plays into performance too.
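
For anyone who wants to try the \b...\b comparison on their own data, here is a minimal sketch in Python. The email pattern below is a simplified stand-in that I'm assuming for illustration; it is not the regex the email filter actually uses, and the sample text, `find_emails` and `time_pattern` are hypothetical helpers, not part of phileas-benchmark. Timings will vary by engine and input, so treat any numbers as indicative only.

```python
import re
import timeit

# Hypothetical simplified email pattern (an assumption for illustration,
# NOT the regex used by the actual email address filter).
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

unfenced = re.compile(EMAIL)
fenced = re.compile(r"\b" + EMAIL + r"\b")  # same pattern with \b...\b fencing


def find_emails(pattern, text):
    """Return every substring of text matched by the compiled pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


def time_pattern(pattern, text, number=200):
    """Return total seconds taken to scan text `number` times."""
    return timeit.timeit(lambda: find_emails(pattern, text), number=number)


# About 4k characters of filler followed by one address (sample size is an assumption).
sample = ("lorem ipsum dolor sit amet " * 150) + "contact alice@example.com today"

if __name__ == "__main__":
    print("unfenced:", time_pattern(unfenced, sample))
    print("fenced:  ", time_pattern(fenced, sample))
```

One caveat this sketch makes visible: the fencing can change what is matched, not just how fast. With this stand-in pattern, the input `" +foo@x.com "` yields `+foo@x.com` unfenced but `foo@x.com` fenced, because there is no word boundary in front of the `+`. That is the kind of accuracy difference that would need checking against real data.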

@jzonthemtn jzonthemtn added this to the 2.7.0 milestone Aug 25, 2024
jzonthemtn added a commit that referenced this issue Sep 3, 2024
* #131 Adding option to email filter for just email addresses with valid TLDs.

* #131 Adding property to docs.

* #121 Adding strict email option to docs.