Email address filtering needs optimization or relaxed mode #121

robfromboulder · 2024-08-09T21:15:25Z

phileas-benchmark results show that email address detection is more CPU-intensive (and requires more memory and stack space) than the other regex-based filters.

Performance of single identifiers with 4k values:
mask_credit_cards - 35k calls/sec
mask_bitcoin_addresses - 31k calls/sec
mask_iban_codes - 26k calls/sec
mask_bank_routing_numbers - 27k calls/sec
mask_ssns - 16k calls/sec
mask_phone_numbers - 14k calls/sec
mask_email_addresses - 5k calls/sec 🔥

The current regex is known to be pretty intense -- so it might make sense to have a "relaxed" version that performs better without trading off too much accuracy?
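To make the "relaxed" idea concrete, here is a minimal sketch in Python (Phileas itself is Java). Both patterns, the `mask` helper, and the `[EMAIL]` token are hypothetical illustrations. Neither pattern is the regex Phileas actually uses. The stricter pattern validates label structure and requires an alphabetic TLD of two or more letters. The relaxed one only checks for the rough `local@domain.tld` shape, so it accepts more borderline strings.

```python
import re

# Hypothetical patterns for illustration only; NOT the regexes used by Phileas.
# STRICT: dot-separated local part, well-formed domain labels, alphabetic TLD (2+ letters).
STRICT = re.compile(
    r"[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}"
)

# RELAXED: anything shaped like local@domain.tld, with no label or TLD validation.
RELAXED = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask(text, pattern, token="[EMAIL]"):
    """Replace every match of `pattern` in `text` with `token` (hypothetical helper)."""
    return pattern.sub(token, text)
```

Both patterns mask an ordinary address such as `bob.smith@example.com`. They differ on borderline input like `a@b.c`, which the relaxed pattern masks and the stricter one leaves alone. Whether a relaxed pattern is actually faster on real data would still need to be measured.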

robfromboulder · 2024-08-20T22:36:11Z

@jzonthemtn I'm looking at a few regex variations that show better performance, but I need to do some more testing to see how accuracy is affected in the data I have available.

One interesting bit though -- the email address filter currently does not use the \b...\b fencing that many of the regex-based filters use. Wrapping the current email address regex in \b...\b roughly doubles performance on its own. I think that makes sense since it reduces how greedy some of those matches will be.

👆 Since we're also discussing use of \b from a confidence standpoint (in #120), I thought this was kinda neat to see how much the \b...\b fencing plays into performance too.
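The `\b...\b` fencing described above can be sketched as follows, in Python rather than Phileas's Java. The `CORE` pattern is a simplified, hypothetical email regex, not the one Phileas uses, and `find` is an illustrative helper. One plausible reason fencing helps is that a leading `\b` lets the engine reject match attempts that start in the middle of a word, so each word is scanned once instead of once per character.

```python
import re
import timeit

# Hypothetical simplified email pattern; NOT the regex used by Phileas.
CORE = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

UNFENCED = re.compile(CORE)
FENCED = re.compile(r"\b" + CORE + r"\b")  # same pattern wrapped in word boundaries

# Mostly plain words with a single address near the end.
sample = ("lorem ipsum dolor sit amet " * 150) + "reach me at jane.doe@example.org today"


def find(pattern, text):
    """Return all substrings of `text` matched by `pattern` (hypothetical helper)."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # Rough timing only; results depend on the regex engine and the input text.
    for name, pattern in (("unfenced", UNFENCED), ("fenced", FENCED)):
        seconds = timeit.timeit(lambda: find(pattern, sample), number=200)
        print(f"{name}: {seconds:.3f}s for 200 passes")
```

Fencing can also change what is matched, not just how fast. For the input `-jane@example.org`, the unfenced pattern returns the whole string, while the fenced one returns `jane@example.org` because a match can no longer start on the leading hyphen. That is why accuracy needs to be checked alongside any speedup.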

robfromboulder mentioned this issue Aug 22, 2024

Improved performance for email address detection #132

Merged

jzonthemtn added this to the 2.7.0 milestone Aug 25, 2024

jzonthemtn closed this as completed Aug 25, 2024

jzonthemtn added a commit that referenced this issue Aug 26, 2024

#121 Adding strict email option to docs.

cf3505a

jzonthemtn added a commit that referenced this issue Sep 3, 2024

Adding optional check for email addresses with TLDs (#135)

45c567d

* 131 Adding option to email filter for just email addresses with valid TLDs.
* #131 Adding property to docs.
* #121 Adding strict email option to docs.
