removal of words that could result in "offensive" combinations? #1
I've been contemplating opening issue 1 for this. Thank you for beating me to it. ;-)
I've been looking into addressing these kinds of issues in passwdqc for a long time. I realized that doing it well is a huge undertaking, and that the criteria would later be reconsidered, so instead of removing words and forgetting about them, they should be categorized. I also realized that it's not a productive use of my time to work on this further. So my current intent is to accept community contributions that move words to the end of the list, and maybe add more words. No removals, at least not until code changes are made, because removals affect not only random passphrases but also which passwords pass or fail the "word-based" check, and because categorization is needed to allow for future reconsideration.
I think I've added enough words recently to allow for quite a few removals from the initial 4096 (that is, moves to the end of the list). There are about 50% more words than required now. However, even this might not be enough. When I strictly and manually applied the criteria I had mentioned on passwdqc-users to a generated list of common English words, processing just the words starting with the letter "a" took a few hours and left me with only 55% of them. 45% would be gone. This means that with those criteria we need an initial total of 7500+, not the current 6000+, to have 4096 left. And indeed the resulting passphrases would be far harder to memorize - they would become kind of toothless. If the community wants this, they should feel free.
Per a Twitter poll I ran and some comments I received, there's also great demand for deliberately NSFW passphrases. Perhaps a mode like this should then be added. The categorization and bad words lists like the one you referenced (thanks!) should be helpful there. BTW, I think you could find useful
So I'd appreciate you and others sending PRs like this. It won't be my problem then. ;-)
I don't intend to merge further changes to the word list before releasing 2.0, though, because the word list is already set in stone in the passwdqc for Windows release 2.0, and I'd like releases for different platforms to be consistent. (I managed to push the Windows release out before the source code release because it excludes some other components that I'm considering making further changes to before 2.0.)
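The list-size estimate in the comment above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 55% retention rate observed for words starting with "a" carries over to the whole list, and the names `REQUIRED_WORDS`, `RETENTION_RATE` and `initial_words_needed` are made up for illustration.

```python
import math

REQUIRED_WORDS = 4096   # words the generator needs after filtering
RETENTION_RATE = 0.55   # share of words kept in the manual review of letter "a"


def initial_words_needed(required: int, retention: float) -> int:
    """Smallest starting list size that still leaves `required` words."""
    return math.ceil(required / retention)


needed = initial_words_needed(REQUIRED_WORDS, RETENTION_RATE)
print(needed)  # 7448
```

The result lands just under the rounded 7500+ figure quoted in the comment and well above the current 6000+ words.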
sigh ... everything becomes more complicated the more you look into it ... Besides vulgarity and profanity, there are other potentially relevant classifications based on group belonging, situation, the particular action, etc. It would be extremely hard to prevent all possible "insensitive" combinations of words that are "safe" in themselves. A starting point for what to avoid beyond flat-out slurs could be words related to sexuality, religion or ethnicity, and verbs that could be considered "violent".
There are a bunch of relevant projects, some abandoned for years, but none that provides regularly updated, classified word lists. "DansGuardian" (last update 2012): https://sourceforge.net/projects/dansguardian/files/dansguardian-2.12.0.3.tar.bz2/download
There are a few projects on (e.g.) GitHub that track "bad" words / slurs etc., but most have the same subsets of about 300-500 words vs. Luis von Ahn's 1.3k "bad-words.txt".
In summary, I couldn't (easily) find a source that provides classified, up-to-date word lists, and tackling this thoroughly would require a lot more effort, probably from someone who has already done related work. As an interim step, I could combine a few lists and move what can be moved below 4096. That would reduce the probability but definitely not come close to your desired state ...
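The interim step proposed above (combine a few lists, then move matches below position 4096 instead of deleting them) amounts to a stable partition. A minimal sketch, assuming the generator draws only from the first `active_size` entries; the function names and the sample words are hypothetical and this is not passwdqc code:

```python
def move_flagged_to_end(words, flagged):
    """Stable partition: unflagged words keep their order, flagged ones follow."""
    flagged_lower = {w.lower() for w in flagged}
    kept = [w for w in words if w.lower() not in flagged_lower]
    moved = [w for w in words if w.lower() in flagged_lower]
    return kept + moved


def active_set_is_clean(words, flagged, active_size):
    """True if the first `active_size` entries exist and none is flagged."""
    flagged_lower = {w.lower() for w in flagged}
    active = words[:active_size]
    return len(active) == active_size and not any(
        w.lower() in flagged_lower for w in active
    )


words = ["Adam", "bomb", "cable", "damn", "eagle", "fabric"]
flagged = ["BOMB", "damn"]  # e.g. the union of several bad-word lists

reordered = move_flagged_to_end(words, flagged)
print(reordered)
print(active_set_is_clean(reordered, flagged, active_size=4))
```

Nothing is removed, so the total word count is unchanged and a moved word can be moved back if the criteria are reconsidered later.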
I'd appreciate that, for merging into version 2.1 or such. Thank you!
I don't have a specific desired state - like I mentioned, trying to address all concerns about maybe-inappropriate words and combinations results in generated passphrases that are harder to memorize. More importantly, this should be as desired by the users, and their preferences vary.
One way to address this is to have a balanced list - avoid what's "obviously" inappropriate, keep the rest, insert some more common words that are currently missing for no reason (or are below 4096). (BTW, which words are common varies by corpus, and I think even more importantly the words should be recognizable by a large fraction of users rather than commonly used. There are words that people don't use very often, but generally know at least some meanings of. Conversely, there are words that some people use somewhat more often, so there are more occurrences in a corpus, but many other people don't recognize at all. I wish we could somehow rank words by the percentage of people who recognize them.)
Another way is to have multiple lists, or maybe two entry points into a list - e.g., we can group "bad" words at the very beginning and have a second entry point right after that sub-list, so we'll have separate sets of 4096 words that would efficiently share most of them. Drawbacks of the latter trick are that this would look bad in the source code (worse than a list ending in "bad" words) and that only two options might not be enough (e.g., besides a completely cleaned out list and a deliberately NSFW-focused list, it could make sense to also have an unbiased uncensored list similar to what we had before my recent changes). We can also use some C preprocessor magic to
Then there's some interest in non-English word lists for the random passphrases. (I know someone patched a Spanish word list into an older version of passwdqc.) Should they become external files configured at run time, then? This has its own pros and cons, and needs more code.
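The "two entry points" idea can be modeled as two overlapping windows over one list. The sketch below is only an illustration in Python with a tiny window size; in passwdqc itself this would presumably be an offset into a C array, and `entry_point_windows` plus the sample words are invented here.

```python
def entry_point_windows(words, flagged_count, size):
    """Return (uncensored, clean) windows of `size` words over one shared list.

    The list is assumed to start with `flagged_count` "bad" words; the clean
    window simply starts right after them.
    """
    if len(words) < flagged_count + size:
        raise ValueError("list too short for two full windows")
    uncensored = words[:size]
    clean = words[flagged_count:flagged_count + size]
    return uncensored, clean


flagged = ["x0", "x1"]                         # grouped at the very beginning
shared = ["s0", "s1", "s2", "s3", "s4", "s5"]  # used by both windows
extra = ["e0", "e1"]                           # only reachable from the clean window
words = flagged + shared + extra

uncensored, clean = entry_point_windows(words, len(flagged), size=8)
overlap = [w for w in uncensored if w in clean]
print(len(overlap))  # 6
```

With the real window size of 4096, the two sets would share 4096 minus the number of flagged words, so the list only has to grow by that number of extra entries.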
With code changes, we could also consider other word counts, length ranges, case alterations or lack thereof. Currently the code optionally toggles the case of the first letter of each word, so the 4096 input words are effectively 8192. We could instead e.g. have only 1024 words, all of length 4, and alter the case of each letter, which would effectively give 16384 with a lower maximum length and fewer maximum keypresses (needing to press Shift at most twice per word, for a total of 4 to 6 keypresses per word vs. the current maximum of 7 for a capitalized length 6 word), or 8192 if we limit to at most one Shift per word (maximum 5 keypresses per word). It's also way fewer words to review and categorize. Of course, that would be a move somewhat away from phrases and towards cryptic strings, which is probably a drawback.
We could also move in the other direction - allow for longer words so that we can use e.g. EFF's lists + BIP-0039 and have 4096 or even 8192 words coming right from there. Then it's kind of not our fault that some words might be bad, because the authors of these lists had tried to avoid bad words. However, I think too many of the words included in the EFF lists are too obscure.
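The effective word counts quoted above follow from doubling the count once per letter whose case may be toggled. A small sketch of that arithmetic, assuming each toggle is an independent choice and that one Shift press counts as one keypress; the helper names are hypothetical:

```python
import math


def effective_words(base_words: int, toggled_letters: int) -> int:
    """Each letter whose case can be toggled doubles the number of variants."""
    return base_words * 2 ** toggled_letters


def max_keypresses(word_length: int, max_shift_presses: int) -> int:
    """Letters typed plus the largest number of Shift presses allowed."""
    return word_length + max_shift_presses


current = effective_words(4096, 1)    # first letter optionally capitalized
proposed = effective_words(1024, 4)   # 4-letter words, every letter toggled

print(current, math.log2(current))    # 8192 13.0
print(proposed, math.log2(proposed))  # 16384 14.0
print(max_keypresses(6, 1), max_keypresses(4, 2))  # 7 6
```

In this model the 1024-word variant gains one bit per word while needing a quarter as many words to review.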
FWIW, some I had found are: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
I don't see any future branches as of yet, let me know when approximately you are planning on opening one and I'll set a reminder to make the PR.
@vbondzio This PR doesn't need to be against a new branch - it can be against
@vbondzio In case you didn't notice, 2.0 has been out for a while now, and you didn't need to wait anyway. ;-) BTW, there are some word removal commits here:
I swear I didn't forget about it! :-) Back then I just ran down a bit of a rabbit hole trying to find some more theory / research on this and ran out of "free weekend time". I'll bump a simple removal based on a bunch of the word lists up on my todo list. Thanks for the CLs, I'll check them out (this weekend)!
... and while most of them are funny and most people understand that this is based on chance and not personal etc., some distributions might disable "random" by default before ending up in some tweeted screenshot of a password that proposes to shoot someone of some sexual orientation or to bomb some deity.
I saw:
When I forked to edit I initially compared a random "bad word" list I found online (https://www.cs.cmu.edu/~biglou/resources/bad-words.txt) and just deleted those lines:
for word in $(cat ./bad-words.txt); do sed -i "/^[[:space:]]\"${word}\",.*$/Id" ./wordset_4k.c; done
Then I saw a bunch of words that would probably also fit the criteria, and some comments that made me think keeping those words might be by design? Hence I'm just leaving this as an FYI / issue, unless there is interest in "sanitizing" the list and I can look into a more complete "bad word" list?
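For comparison, the same line-deletion pass can be sketched in Python as a single scan over the source file, which avoids rewriting the file once per word and avoids treating list entries as regular expressions. The assumed entry format (optional whitespace, a quoted word, then a comma) is inferred from the sed pattern above rather than checked against `wordset_4k.c`, and `drop_flagged_lines` is a made-up helper name.

```python
import re

# Assumed entry format: optional whitespace, a quoted word, then a comma.
ENTRY = re.compile(r'^\s*"([^"]+)",')


def drop_flagged_lines(source_lines, flagged_words):
    """Return the lines whose quoted word is not in the flagged set (case-insensitive)."""
    flagged = {w.strip().lower() for w in flagged_words if w.strip()}
    kept = []
    for line in source_lines:
        match = ENTRY.match(line)
        if match and match.group(1).lower() in flagged:
            continue  # same effect as the sed "d" command for this line
        kept.append(line)
    return kept


sample = [
    "const char example_words[] = {",  # hypothetical surrounding code
    '\t"Adam",',
    '\t"Bomb", /* some note */',
    '\t"cable",',
    "};",
]
cleaned = drop_flagged_lines(sample, ["bomb", ""])
print(len(sample) - len(cleaned))  # 1
```

Reading `bad-words.txt` and the C file into lists of lines, then writing the result back, is left out to keep the sketch short.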