Add filtering of english words from entropy checks #240

KevinHock · 2019-09-16T22:20:08Z

By filter I mean that, before alerting on a secret, we check whether an English word (of length 4 or greater) is in the string and, if so, don't alert on it, in order to reduce false positives.

To get the wordlist, we can either:

- add an option, e.g. `--word-list words.txt`, or
- do what the `How Bad Can It Git?` paper did and hard-code a list of 2,298 English words.
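A minimal sketch of the first option, assuming a newline-delimited word file. Only the `--word-list` flag name comes from this issue; `load_word_list`, `build_parser`, and the `min_length` default of 4 are hypothetical names and values chosen for illustration, and lowercasing is an assumption about how matching would be made case-insensitive.

```python
import argparse


def load_word_list(path, min_length=4):
    """Return the lowercased words in `path` with at least `min_length` characters."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip().lower()
            # Short words match too easily inside random strings, so skip them.
            if len(word) >= min_length:
                words.add(word)
    return words


def build_parser():
    # Hypothetical parser: only the --word-list flag name is taken from the issue.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--word-list",
        dest="word_list",
        default=None,
        help="Path to a newline-delimited list of words to filter on.",
    )
    return parser
```

When the flag is omitted, `word_list` stays `None`, so the filter can be skipped entirely.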

Once we have the wordlist, we need an Aho-Corasick-type algorithm to efficiently tell whether a secret contains an English word. We can either:

- write hacky regex code to build a trie regex out of the English words, or
- use the seemingly best Python library for Aho-Corasick (there are other libs, but they don't look as good) and make it an optional dependency.

For a first iteration, a word list argument and using a library seems to be the quickest way to get an MVP.
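To make the Aho-Corasick idea concrete, here is a pure-Python sketch of the automaton. It is a stand-in for illustration only, not the library's API and not the code from the eventual PR; the class name `WordAutomaton` and its methods are hypothetical, and case-insensitive matching is an assumption.

```python
from collections import deque


class WordAutomaton:
    """Minimal Aho-Corasick automaton: does any known word occur in a string?"""

    def __init__(self, words):
        # Node 0 is the root. Each node has outgoing edges, a failure link,
        # and a flag that is True when some word ends at this node.
        self.edges = [{}]
        self.fail = [0]
        self.terminal = [False]
        for word in words:
            self._add(word.lower())
        self._link()

    def _add(self, word):
        if not word:
            return  # an empty word would match everything
        node = 0
        for ch in word:
            nxt = self.edges[node].get(ch)
            if nxt is None:
                nxt = len(self.edges)
                self.edges.append({})
                self.fail.append(0)
                self.terminal.append(False)
                self.edges[node][ch] = nxt
            node = nxt
        self.terminal[node] = True

    def _link(self):
        # Breadth-first, so shallower nodes are finished before deeper ones.
        queue = deque(self.edges[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.edges[node].items():
                fallback = self.fail[node]
                while fallback and ch not in self.edges[fallback]:
                    fallback = self.fail[fallback]
                self.fail[child] = self.edges[fallback].get(ch, 0)
                # A node also counts as a match if a shorter word ends here.
                if self.terminal[self.fail[child]]:
                    self.terminal[child] = True
                queue.append(child)

    def contains_word(self, text):
        node = 0
        for ch in text.lower():
            while node and ch not in self.edges[node]:
                node = self.fail[node]
            node = self.edges[node].get(ch, 0)
            if self.terminal[node]:
                return True
        return False
```

The automaton scans each candidate once, regardless of how many words are loaded, which is the reason to prefer it over testing every word against every string.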

From the `How Bad Can It Git?` paper:

Words Filter

Another intuition is that a random string should not contain linguistic sequences of characters [12]. For this check, we compiled a dictionary of English words of length at least as long as a defined threshold. Then we searched each candidate string for each one of these words and failed the check if detected.

A trade-off exists in choosing this threshold. If it is too small, randomly occurring sequences that happen to create words will create false negatives (marking valid secrets as invalid), but if it is too large, legitimate words will be missed and create false positives (marking invalid secrets as valid). In our experiments, we set the word length threshold to be 5. This threshold was chosen as a best judgment after careful manual review; unfortunately, experimental derivation of this threshold was not possible given limited initial ground truth.

A dictionary of every English word would contain words that would not likely be used as part of a string in a code file and cause high amounts of false negatives. Therefore, we took the intersection of an English dictionary [45] and a dictionary of the most common words used in source code files on GitHub [40]. The resulting dictionary contained the 2,298 English words that were likely to be used within code files, reducing the potential for false negatives.
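The procedure quoted above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, the inputs are arbitrary iterables of words, lowercasing is an assumption, and the search is the naive per-word substring check the quote describes.

```python
def build_filter_dictionary(english_words, code_words, min_length=5):
    """Intersect two word lists and keep words of at least `min_length` characters."""
    english = {w.lower() for w in english_words}
    code = {w.lower() for w in code_words}
    return {w for w in english & code if len(w) >= min_length}


def passes_words_filter(candidate, dictionary):
    """Return True when no dictionary word occurs anywhere in `candidate`."""
    lowered = candidate.lower()
    return not any(word in lowered for word in dictionary)
```

A candidate that fails the check is treated as a non-secret, so raising `min_length` keeps more candidates (more false positives) and lowering it discards more (more false negatives), as the quoted trade-off describes.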


KevinHock · 2019-09-24T02:13:45Z

PR was merged, going to close, will release a new version soon (today or tomorrow)

Note for posterity: aside from e.g. /usr/share/dict/words, you'll probably have to add entries like the following to get the most use out of this functionality:

  .org
  addr
  http
  attr
  href
  html
  yaml
  info
  %.2d
  json
  uri
  debug
  value
  123456789
  abcd
  !@#$%^&*(
  utf-8
  ISO-
  %2c
  %3A
  Mon:
  Fri:
  Sat:
  Mon-
  Fri-
  Sat-
  5:30PM
  10AM
  6PM
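One way to combine a system dictionary with extra entries like these is to write a single merged file and pass that to `--word-list`. The helper below is a hypothetical sketch; `merge_word_lists` is not part of the tool, and treating entries that differ only by case as duplicates is an assumption.

```python
def merge_word_lists(base_path, extra_entries, out_path):
    """Write the base list plus extra entries to `out_path`, one per line, no duplicates."""
    with open(base_path, encoding="utf-8") as f:
        candidates = [line.strip() for line in f]
    candidates.extend(extra_entries)

    seen = set()
    merged = []
    for entry in candidates:
        key = entry.lower()  # assumption: entries differing only by case are duplicates
        if entry and key not in seen:
            seen.add(key)
            merged.append(entry)

    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")
    return merged
```

For example, `merge_word_lists("/usr/share/dict/words", ["http", "json", "utf-8"], "words.txt")` would produce a `words.txt` suitable for the option, assuming the base file exists on the system.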


KevinHock added the accuracy label Sep 16, 2019

KevinHock self-assigned this Sep 16, 2019

KevinHock added a commit that referenced this issue Sep 19, 2019

🎉 Add --word-list option

c9f3875

- Add `pyahocorasick` as an optional dependency See issue #240 for more information.

KevinHock added a commit that referenced this issue Sep 19, 2019

🎉 Add --word-list option

f8cb31f

- Add `pyahocorasick` as an optional dependency See issue #240 for more information.

KevinHock mentioned this issue Sep 19, 2019

Add filtering of english words from entropy (and keyword) plugins #241

Merged

KevinHock closed this as completed Sep 24, 2019
