Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added substring support for field actions #30

Merged
merged 1 commit into from
Aug 23, 2024

Conversation

YuriMB
Copy link
Contributor

@YuriMB YuriMB commented Aug 13, 2024

We have a use case where we need the start of a field, but due to privacy issues do not want the entire field to be anonymized. So I've proposed a substring method, that takes an argument for how many chars it will copy.

This could also have been done with the new python implementation, but we had issues with mismatching python version on our build agent vs actual data node.

Not sure if you appreciate the direct proposal, but no harm in asking, right?

@ppomes
Copy link
Owner

ppomes commented Aug 20, 2024

Hi !

Thanks for the PR! I think it would be more logical to see it as an (optional) keyword of current text rules, in order to have :

testhash 10 substr 3
emailhash 'example.com' 10 substr 5

If you can adjust you PR, I will be happy to merge it.

However, please note the current hash processing is destructive:

  • sha256 hash is 64 bytes long, so using "texthash 20" will cut it.
  • each byte is modified to ascii "human readable", using a modulo, to have only from "a" to "z" character (see make_readable_hash function in main/myanon.c)

Based on the last point, maybe you do not to worry about privacy.

Pierre

@YuriMB
Copy link
Contributor Author

YuriMB commented Aug 23, 2024

Hi,

So the use case we have now is that we need the start of the actual string because logic depends on it. In our case the VAC number of customers start with a country code and we need to determine the correct VAC. We use an anonymized dump on our internal testing environment, so we can replay certain scenario's, and although it doesn't contain any more privacy related information, we'd still like to retain the same country code for each user in the test environment.
However, that information is stored at the start of the field. So we don't need to know the full number.
E.g. NL12345324534B01 could very well become NL123. But we need to retain the original 2 characters.

That functionality doesn't hold anymore if we use any hash, hence the separate keyword.
One other issue we found is that if you use this method on data with special characters (in our case \0 bytes), it might break.

We had a field that started with 55 \0 bytes. Originally looking something like this:

'\0\0\0\0\0\0...\0\0something here'

but due to the way the tokenizer works it takes '\' as a character and '0'. We cut it off at 5 characters, resulting in:

'\0\0\',...

escaping the terminating ', and breaking the entire dump. Maybe a nice improvement for the tokenizer :).

As for this PR: not really sure how to continue, I don't feel like it serves our use case in your scenario. It's OK if our scenario is not something you would like to support. Feel free to close the PR if so :).

P.S.
We've internally also introduces a "scramble" method, which replaces each character for another random character, so we can also test some UI logic on more or less realistic screens, and ran into a few more issues.
One of those is that passing data via the stack in a struct is really limited for the length of the result. So that's definitely not something we would offer as a PR at this stage.
(oh, and it is slow 🥲 )

@ppomes
Copy link
Owner

ppomes commented Aug 23, 2024

Well, I think my main problem is to have read your PR with a phone during holidays ;-)

I missed the strcpy, and I thought your need was to hash data using first 'n' bytes, and not to keep 'n' bytes. So my previous answer has no sense.

Your PR adds a new useful feature!

About escaped characters, I have some idea ;-)

@ppomes ppomes merged commit d1f3494 into ppomes:main Aug 23, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants