Improve confidence estimation for credit card numbers #120

Open
robfromboulder opened this issue Aug 9, 2024 · 5 comments

@robfromboulder
Collaborator

robfromboulder commented Aug 9, 2024

I'm using Phileas to redact logging data, and see two interesting patterns that result in false positives on credit cards.

{ "data": {"quote_token":"null", "time_processed":"1647725122146" }}
👆 fails the Luhn check and is ignored by default (which is good!)

Result from System.currentTimeMillis masked as credit card: (confidence = 0.9)
{ "quote_token":"...", "time_processed":"1647725122227" }
{ "quote_token":"...", "time_processed":"*************" }

Portions of Java UUID masked as credit card: (confidence = 0.9)
{ "query":" { quote(account_token:\"47223179-9330-4259-b66c-f2db26efb20c\", amount_usd:\"62\", coin_type:\"BTC\" )}"}
{ "query":" { quote(account_token:\"******************-b66c-f2db26efb20c\", amount_usd:\"62\", coin_type:\"BTC\" )}"}

What is interesting is that Luhn checks (while certainly helpful) do not appear to be sufficient to reject all random data: roughly 5% of UUID or timestamp fields may contain a digit sequence with a valid Luhn checksum.
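To make the three examples above easy to reproduce, here is a minimal, standalone Luhn check. It is only a sketch for illustration: `LuhnSketch` and `passesLuhn` are names made up for this comment, not Phileas code, and skipping non-digit characters is an assumption about how a dashed match might be normalized.

```java
public class LuhnSketch {

    /** Returns true if the digits in s satisfy the Luhn checksum (non-digits are skipped). */
    public static boolean passesLuhn(String s) {
        int sum = 0;
        int digitCount = 0;
        boolean doubleIt = false; // the rightmost digit is not doubled

        // Walk from right to left, doubling every second digit.
        for (int i = s.length() - 1; i >= 0; i--) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                continue; // assumption: separators such as '-' or ' ' are ignored
            }
            int d = c - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
            digitCount++;
        }
        return digitCount > 0 && sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(passesLuhn("1647725122146"));      // false (first log line)
        System.out.println(passesLuhn("1647725122227"));      // true  (masked timestamp)
        System.out.println(passesLuhn("47223179-9330-4259")); // true  (masked UUID prefix)
    }
}
```

Both masked values really do carry a valid checksum, so a Luhn check alone cannot tell them apart from a card number.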

The solution to the first case could be reducing credit card confidence if the matched value falls in an expected timestamp range (for example, from one year in the past to 3 months in the future). I haven't done the math, but it seems like that's a small number of values with valid Luhn checksums to exclude if we're considering a reasonably small time range.
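A rough sketch of what that range check might look like, using only `java.time`. Everything here is hypothetical: the class and method names, the 90-day approximation of "3 months", and the 0.5 penalty factor are assumptions for illustration, not anything Phileas does today.

```java
import java.time.Duration;
import java.time.Instant;

public class TimestampWindowSketch {

    // Window taken from the idea above: one year back, about three months forward.
    static final Duration LOOK_BACK = Duration.ofDays(365);
    static final Duration LOOK_AHEAD = Duration.ofDays(90); // assumption: 3 months ~ 90 days

    /** True if match is a 13-digit number that falls inside the window around now. */
    public static boolean looksLikeRecentEpochMillis(String match, Instant now) {
        if (!match.matches("[0-9]{13}")) {
            return false; // only 13-digit values are treated as millisecond timestamps
        }
        long millis = Long.parseLong(match);
        long earliest = now.minus(LOOK_BACK).toEpochMilli();
        long latest = now.plus(LOOK_AHEAD).toEpochMilli();
        return millis >= earliest && millis <= latest;
    }

    /** Hypothetical adjustment: halve the confidence for plausible recent timestamps. */
    public static double adjustConfidence(double confidence, String match, Instant now) {
        return looksLikeRecentEpochMillis(match, now) ? confidence * 0.5 : confidence;
    }
}
```

Passing `now` in explicitly keeps the check deterministic in tests; a caller would normally supply `Instant.now()`.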

The solution to the second case could be reducing credit card confidence when the match is found within the context of a larger string. Confidence in phone numbers is reduced if the phone number is embedded within a larger string, and we've found this extremely helpful in eliminating false positives. It would be very helpful if credit card filtering had a similar behavior.
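For the embedded-match idea, a sketch of one possible heuristic: look at the characters immediately before and after the matched span, and lower the confidence if either one suggests the match is part of a longer token. The names, the choice of "token" characters, and the 0.5 penalty are all assumptions for illustration; I have not checked how the phone number filter actually implements this.

```java
public class EmbeddedMatchSketch {

    /** Characters that suggest the match continues into a larger token (an assumption). */
    private static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_';
    }

    /** True if the span [start, end) in text touches a token character on either side. */
    public static boolean isEmbedded(String text, int start, int end) {
        boolean before = start > 0 && isTokenChar(text.charAt(start - 1));
        boolean after = end < text.length() && isTokenChar(text.charAt(end));
        return before || after;
    }

    /** Hypothetical adjustment: halve the confidence for embedded matches. */
    public static double adjustConfidence(double confidence, String text, int start, int end) {
        return isEmbedded(text, start, end) ? confidence * 0.5 : confidence;
    }
}
```

In the UUID example, the span `47223179-9330-4259` is followed by `-b66c`, so the trailing hyphen marks it as embedded, while a card number surrounded by spaces or punctuation would keep its original confidence.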

Unfortunately there is no obvious/easy workaround, but seems like improved confidence estimation for credit cards would be generally useful (since detecting and redacting credit cards is a universal requirement for PII engines)

@robfromboulder
Collaborator Author

The first case turns out to be easy to solve with ignoredPatterns, since current Unix timestamps in milliseconds are 13 digits long and start with a predictable prefix.

CreditCard x = new CreditCard();
x.setIgnoredPatterns(List.of(new IgnoredPattern("1[5-8][0-9]{11}")));  // ignore unix timestamps

With ignoredPatterns set, Phileas still identifies these spans but does not apply them:

String value = "{ \"valid_until_millis\":\"1647725122227\" }";
FilterResponse fr = r.filter(value);
expect(fr.explanation().appliedSpans().size()).toEqual(0);
expect(fr.explanation().identifiedSpans().size()).toEqual(1);
expect(fr.explanation().identifiedSpans().get(0).getConfidence()).toEqual(0.9);
expect(fr.explanation().identifiedSpans().get(0).getFilterType().toString()).toEqual("credit-card");
expect(fr.explanation().identifiedSpans().get(0).getText()).toEqual("1647725122227");
expect(fr.filteredText()).toEqual(value);

👆 no changes to Phileas required to solve this first part
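For anyone reusing that regex, a quick standalone check of what it covers, using plain `java.util.regex` rather than the Phileas API. The class and method names are made up for this sketch, and the full-string match is an assumption; I have not verified how `IgnoredPattern` applies the expression internally.

```java
import java.time.Instant;
import java.util.regex.Pattern;

public class IgnoredPatternSketch {

    // Same expression as the ignored pattern above.
    static final Pattern EPOCH_MILLIS = Pattern.compile("1[5-8][0-9]{11}");

    /** True if the whole candidate string matches the timestamp pattern. */
    public static boolean isIgnored(String candidate) {
        return EPOCH_MILLIS.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        // Smallest and largest values the pattern can match, shown as dates.
        System.out.println(Instant.ofEpochMilli(1_500_000_000_000L));
        System.out.println(Instant.ofEpochMilli(1_899_999_999_999L));

        System.out.println(isIgnored("1647725122227"));    // true
        System.out.println(isIgnored("4111111111111111")); // false
    }
}
```

The pattern covers timestamps from roughly mid-2017 to early 2030, so the leading digits would need updating for data outside that window.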

@jzonthemtn
Member

That is really awesome. Do you think it would be beneficial to include that ignored pattern as an option in the filter profile, just to keep the user from having to set it manually? There could be a boolean on CreditCard called ignoreUnixTimestamps, and when true it checks the credit card against that pattern.

@robfromboulder
Collaborator Author

Well, I'm applying this ignoredPattern in multiple places already -- so if Phileas provided an option like that, I'd definitely use it. Beyond the reuse aspect, seems like a nice improvement to what Phileas understands about credit cards, for little new code 🤔

@jzonthemtn
Member

I agree. Wrote #130 to capture it separate from this issue.

@robfromboulder
Collaborator Author

The changes proposed in 129-credit-card-dashes will wrap up the rest of this one

Sorry this turned out to be a multi-part issue, I'll try to keep things more atomic ⚛️
