Add filtering of english words from entropy checks #240
KevinHock added a commit that referenced this issue on Sep 19, 2019:
- Add `pyahocorasick` as an optional dependency. See issue #240 for more information.
PR was merged, going to close. Will release a new version soon (today or tomorrow).
Note for posterity that, aside from e.g.
By filter I mean that, before alerting on a secret, we check whether an English word (of length 4 or greater) is in the string and, if so, don't alert on it, in order to reduce false positives.
To get the wordlist, we can either:
- add a `--word-list words.txt` option, or
- do what the How Bad Can It Git? paper did, and hard-code a list of 2,298 English words.

Once we have the wordlist, we have to do an Aho-Corasick type algorithm to efficiently tell if a secret has an English word in it. We can either:
or
For a first iteration, a word list argument and using a library seems to be the quickest way to get an MVP.
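As a rough illustration of the Aho-Corasick idea mentioned above, here is a minimal pure-Python sketch. It is not the project's implementation and does not use `pyahocorasick`; the function names `build_automaton` and `contains_word`, and the `min_length=4` default (taken from the "length 4 or greater" wording above), are assumptions made for this example.

```python
from collections import deque


def build_automaton(words, min_length=4):
    """Build a small Aho-Corasick automaton (hypothetical helper)."""
    goto = [{}]    # goto[node] maps a character to the next node
    fail = [0]     # fail[node] is the failure link of the node
    out = [False]  # out[node] is True if some word ends at this node

    # Insert every sufficiently long word into the trie.
    for word in words:
        word = word.strip().lower()
        if len(word) < min_length:
            continue
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(False)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node] = True

    # Breadth-first pass to compute failure links for nodes below depth 1.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # A node also matches if the node behind its failure link matches.
            out[child] = out[child] or out[fail[child]]
    return goto, fail, out


def contains_word(secret, automaton):
    """Return True if any word in the automaton occurs inside `secret`."""
    goto, fail, out = automaton
    node = 0
    for ch in secret.lower():
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        if out[node]:
            return True
    return False


automaton = build_automaton(["password", "example", "key"])
print(contains_word("my_Password_123", automaton))
print(contains_word("c3VwZXIgc2VjcmV0", automaton))
```

The scan visits each character of the candidate once, regardless of how many words are in the list, which is the main reason to prefer this over checking every word separately. A library would likely be faster than this sketch; the structure of the check would stay the same.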
From the How Bad Can It Git? paper:

Words Filter
Another intuition is that a random string should not contain linguistic sequences of characters [12]. For this check, we compiled a dictionary of English words of length at least as long as a defined threshold. Then we searched each candidate string for each one of these words and failed the check if detected.
A trade-off exists in choosing this threshold. If it is too small, randomly occurring sequences that happen to create words will create false negatives (marking valid secrets as invalid), but if it is too large, legitimate words will be missed and create false positives (marking invalid secrets as valid). In our experiments, we set the word length threshold to be 5. This threshold was chosen as a best judgment after careful manual review; unfortunately, experimental derivation of this threshold was not possible given limited initial ground truth.
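The threshold trade-off can be made concrete with a small sketch of the check as the paper describes it (search each candidate for each word, fail if one is found). The function name `passes_words_filter`, the tiny dictionary, and the sample strings are all made up for illustration.

```python
def passes_words_filter(candidate, dictionary, threshold):
    """Return True if no dictionary word of length >= threshold occurs in candidate."""
    lowered = candidate.lower()
    return not any(
        len(word) >= threshold and word in lowered for word in dictionary
    )


# Hypothetical dictionary and candidates, chosen only to show the trade-off.
dictionary = {"cat", "token", "password"}

# Threshold too small: a random-looking string that happens to contain "cat"
# fails the check, so a possibly valid secret is discarded.
print(passes_words_filter("x9Tcat7Qp2", dictionary, threshold=3))

# Threshold too large: "password" (8 letters) is ignored, so a string that is
# clearly built from a word still passes as a secret.
print(passes_words_filter("mypasswordhere", dictionary, threshold=9))

# The paper's choice of 5 handles both of these particular examples.
print(passes_words_filter("x9Tcat7Qp2", dictionary, threshold=5))
print(passes_words_filter("mypasswordhere", dictionary, threshold=5))
```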
A dictionary of every English word would contain words that would not likely be used as part of a string in a code file and cause high amounts of false negatives. Therefore, we took the intersection of an English dictionary [45] and a dictionary of the most common words used in source code files on GitHub [40]. The resulting dictionary contained the 2,298 English words that were likely to be used within code files, reducing the potential for false negatives.
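The dictionary construction described in the quote amounts to a set intersection followed by a length filter. A minimal sketch, assuming both sources are available as iterables of words; the function name `build_filter_dictionary` and the sample words are hypothetical, and the real inputs would be the dictionaries cited as [45] and [40].

```python
def build_filter_dictionary(english_words, code_words, threshold=5):
    """Keep words found in both sources that are at least `threshold` long."""
    english = {word.strip().lower() for word in english_words}
    code = {word.strip().lower() for word in code_words}
    return {word for word in english & code if len(word) >= threshold}


# Made-up stand-ins for the two source dictionaries.
english_words = ["password", "zyzzyva", "token", "the"]
code_words = ["Token", "password", "foo", "the"]

print(sorted(build_filter_dictionary(english_words, code_words)))
```

Words that appear in only one source, or that are shorter than the threshold, are dropped, which mirrors the paper's reasoning for shrinking the list.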