Improve confidence estimation for credit card numbers #120

robfromboulder · 2024-08-09T20:04:43Z

I'm using Phileas to redact logging data, and see two interesting patterns that result in false positives on credit cards.

{ "data": {"quote_token":"null", "time_processed":"1647725122146" }}
👆 fails LUHN check and is ignored by default (which is good!)

Result from System.currentTimeMillis() masked as credit card (confidence = 0.9):
{ "quote_token":"...", "time_processed":"1647725122227" }
{ "quote_token":"...", "time_processed":"*************" }

Portions of a Java UUID masked as credit card (confidence = 0.9):
{ "query":" { quote(account_token:\"47223179-9330-4259-b66c-f2db26efb20c\", amount_usd:\"62\", coin_type:\"BTC\" )}"}
{ "query":" { quote(account_token:\"******************-b66c-f2db26efb20c\", amount_usd:\"62\", coin_type:\"BTC\" )}"}

What is interesting is that Luhn checks (while certainly helpful) do not appear to be sufficient to reject random data that happens to look like a card number (~5% of UUID or timestamp fields may contain valid Luhn checksums).
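For reference, here is a minimal sketch of the Luhn check applied to the values above. It is not Phileas's implementation; the class and method names are made up for illustration. It shows why the first timestamp is rejected while the second timestamp and the digits of the UUID prefix are accepted.

```java
final class LuhnDemo {
    /** Returns true if the string is all digits and its Luhn checksum is divisible by 10. */
    static boolean passesLuhn(String digits) {
        if (digits.isEmpty()) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is never doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            char c = digits.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
            int d = c - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(passesLuhn("1647725122146"));    // false (checksum 52)
        System.out.println(passesLuhn("1647725122227"));    // true  (checksum 50)
        System.out.println(passesLuhn("4722317993304259")); // true  (checksum 80), UUID prefix without dashes
    }
}
```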

The solution to the first case could be reducing credit card confidence if the matched value is in an expected range (like timestamps over the last year and 3 months into the future). I haven't done the math but seems like that's a small number of values with valid LUHN checksums to exclude if we're considering a reasonably small time range.
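A rough sketch of that idea, assuming the match is available as a digit string and that the current time is passed in. Everything here is hypothetical: the class, the method names, the 365/90-day window, and the 0.5 penalty are placeholders, not anything Phileas provides.

```java
import java.time.Duration;
import java.time.Instant;

final class TimestampRangeSketch {
    // Assumed window: one year back, roughly three months ahead.
    static final Duration LOOK_BACK = Duration.ofDays(365);
    static final Duration LOOK_AHEAD = Duration.ofDays(90);

    /** True if the digits, read as epoch milliseconds, fall inside the window around 'now'. */
    static boolean looksLikeRecentTimestamp(String digits, Instant now) {
        long millis;
        try {
            millis = Long.parseLong(digits);
        } catch (NumberFormatException e) {
            return false; // not a number, or too long to be a plausible timestamp
        }
        long earliest = now.minus(LOOK_BACK).toEpochMilli();
        long latest = now.plus(LOOK_AHEAD).toEpochMilli();
        return millis >= earliest && millis <= latest;
    }

    /** Halves the confidence (an arbitrary penalty) when the match looks like a recent timestamp. */
    static double adjustConfidence(double confidence, String digits, Instant now) {
        return looksLikeRecentTimestamp(digits, now) ? confidence * 0.5 : confidence;
    }
}
```

With `now` set to 2022-06-01, the value 1647725122227 (2022-03-19 in UTC) falls inside the window and would be penalized, while a 16-digit card number read as milliseconds lands far in the future and keeps its confidence.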

The solution to the second case could be reducing credit card confidence when the match is found within the context of a larger string. Confidence in phone numbers is reduced if the phone number is embedded within a larger string, and we've found this extremely helpful in eliminating false positives. It would be very helpful if credit card filtering had a similar behavior.
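As an illustration of what "embedded within a larger string" could mean for credit cards, here is a sketch that looks at the characters on either side of the matched span. The names, the choice of joiner characters (letters, digits, dashes), and the 0.5 penalty are assumptions for the example, not how Phileas handles phone numbers or cards.

```java
final class EmbeddedSpanSketch {
    /** True if the character just before 'start' or at 'end' (exclusive span end) continues a token. */
    static boolean isEmbedded(String text, int start, int end) {
        return isJoiner(text, start - 1) || isJoiner(text, end);
    }

    // Letters, digits, and dashes are treated as part of a larger token; out-of-range indexes are not.
    private static boolean isJoiner(String text, int index) {
        if (index < 0 || index >= text.length()) {
            return false;
        }
        char c = text.charAt(index);
        return c == '-' || Character.isLetterOrDigit(c);
    }

    /** Halves the confidence (an arbitrary penalty) for spans embedded in a larger token. */
    static double adjustConfidence(double confidence, String text, int start, int end) {
        return isEmbedded(text, start, end) ? confidence * 0.5 : confidence;
    }
}
```

For the UUID example, the span "47223179-9330-4259" is followed by "-b66c...", so it counts as embedded. A quoted timestamp such as "1647725122227" is bordered by quotes on both sides and would not be caught by this check, which is why the first case needs separate handling.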

Unfortunately there is no obvious/easy workaround, but seems like improved confidence estimation for credit cards would be generally useful (since detecting and redacting credit cards is a universal requirement for PII engines)


robfromboulder · 2024-08-15T23:23:35Z

The first case turns out to be easy to solve with ignoredPatterns, since millisecond Unix timestamps (like those from System.currentTimeMillis) are currently 13 digits long and start with a predictable prefix.

CreditCard x = new CreditCard();
x.setIgnoredPatterns(List.of(new IgnoredPattern("1[5-8][0-9]{11}")));  // ignore unix timestamps

With ignoredPatterns set, Phileas still identifies these spans but does not apply them:

String value = "{ \"valid_until_millis\":\"1647725122227\" }";
FilterResponse fr = r.filter(value);
expect(fr.explanation().appliedSpans().size()).toEqual(0);
expect(fr.explanation().identifiedSpans().size()).toEqual(1);
expect(fr.explanation().identifiedSpans().get(0).getConfidence()).toEqual(0.9);
expect(fr.explanation().identifiedSpans().get(0).getFilterType().toString()).toEqual("credit-card");
expect(fr.explanation().identifiedSpans().get(0).getText()).toEqual("1647725122227");
expect(fr.filteredText()).toEqual(value);

👆 no changes to Phileas required to solve this first part
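To see what that pattern does and does not cover, here is a standalone check using java.util.regex. It assumes the pattern is matched against the whole span text, which may not be exactly how Phileas applies ignoredPatterns; the class and method names are just for illustration. The pattern accepts 13-digit values from 1500000000000 to 1899999999999, which as epoch milliseconds is roughly mid-July 2017 through mid-March 2030.

```java
import java.util.regex.Pattern;

final class TimestampPatternDemo {
    // Same expression as the ignored pattern above.
    static final Pattern MILLIS_TIMESTAMP = Pattern.compile("1[5-8][0-9]{11}");

    /** True if the entire span text matches the timestamp pattern. */
    static boolean isIgnored(String spanText) {
        return MILLIS_TIMESTAMP.matcher(spanText).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIgnored("1647725122227"));    // true: 13 digits starting with 16
        System.out.println(isIgnored("1499999999999"));    // false: just below the covered range
        System.out.println(isIgnored("1900000000000"));    // false: just above the covered range
        System.out.println(isIgnored("4111111111111111")); // false: 16 digits, starts with 4
    }
}
```

Timestamps outside that window would still be identified and masked, so the range in the pattern needs to be widened if older or later data shows up in the logs.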

jzonthemtn · 2024-08-16T14:50:52Z

That is really awesome. Do you think it would be beneficial to include that ignored pattern as an option in the filter profile, just to keep the user from having to set it manually? There could be a boolean on CreditCard called ignoreUnixTimestamps, and when true it checks the credit card against that pattern.

robfromboulder · 2024-08-16T18:55:38Z

Well, I'm applying this ignoredPattern in multiple places already -- so if Phileas provided an option like that, I'd definitely use it. Beyond the reuse aspect, seems like a nice improvement to what Phileas understands about credit cards, for little new code 🤔

jzonthemtn · 2024-08-16T21:44:17Z

I agree. Wrote #130 to capture it separately from this issue.

robfromboulder · 2024-08-28T21:32:15Z

The changes proposed in 129-credit-card-dashes will wrap up the rest of this one

Sorry this turned out to be a multi-part issue, I'll try to keep things more atomic ⚛️

robfromboulder mentioned this issue Aug 16, 2024

Reduce confidence when credit card spans are bordered by dashes #129

Closed

robfromboulder mentioned this issue Aug 20, 2024

Email address filtering needs optimization or relaxed mode #121

Closed

jzonthemtn mentioned this issue Aug 25, 2024

Add a property to the Credit Card filter to ignore spans which are in Unix timestamps #130

Closed

JessieAMorris mentioned this issue Nov 5, 2024

Add more in depth regex checking on credit cards based on BIN #152

Merged
