Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add filtering of english words from entropy (and keyword) plugins #241

Merged
merged 26 commits into from
Sep 24, 2019

Conversation

KevinHock
Copy link
Collaborator

@KevinHock KevinHock commented Sep 19, 2019

This is a solution for #240.

I will add that I am personally not against hardcoding a list of ~2,298 English words in the future, like from the How bad can it git paper, as that can be immediately beneficial to most users with no effort, but for now this is okay.

I tried to keep the commit history pretty clean 😁

`os.path.splitext(filename)[1]` includes the '.'
The current heuristic will never return True
Start capitalized and after 2 spaces when on the same line as code
- Add `pyahocorasick` as an optional dependency

See issue #240 for more information.
@KevinHock
Copy link
Collaborator Author

Forgot to make it case-insensitive, will do tomorrow.

@KevinHock
Copy link
Collaborator Author

KevinHock commented Sep 19, 2019

I used /usr/share/dict/words which has ~235k words, and it had a very minimal performance impact. It initially found 0 things, then I changed util.py to only allow adding words with len() > 3 and got way better results :D

Something random that made me think it was missing things was that --display-results outputs line context, not secret content, in the output for Yaml files. No big deal.

- 🐍 Refactor hadouken code
- 🐍 Remove high_entropy_strings.py from the uncovered files list in tox
- Change `is_secret` string names to `(true|false)-positives` (Negative meaning false-positive was confusing.)
- Replace list of secrets with {filename: {plaintext: line:}}
- Replace top-level `results` key with `plugins` since we have a results key already
And normalize main_test.py comments
Copy link
Contributor

@OiCMudkips OiCMudkips left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally seems fine to me. The main contention I have is the new class but if you have a good reason for it I want to know :)

LICENSE Show resolved Hide resolved
@@ -65,7 +79,7 @@ def initialize(
files_to_scan,
)

for file in files_to_scan:
for file in sorted(files_to_scan):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real reason, just looked nicer when I was running it on things.

detect_secrets/core/constants.py Outdated Show resolved Hide resolved
detect_secrets/plugins/base.py Outdated Show resolved Hide resolved
detect_secrets/plugins/common/filters.py Show resolved Hide resolved
detect_secrets/plugins/mailchimp.py Show resolved Hide resolved
detect_secrets/plugins/stripe.py Show resolved Hide resolved
detect_secrets/util.py Outdated Show resolved Hide resolved
@@ -570,15 +570,18 @@ def test_determine_audit_results_plugin_config(

results = audit.determine_audit_results(baseline, '.secrets.baseline')

assert results['results']['HexHighEntropyString']['config'].items() \
>= plugins_used[0].items()
assert (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL this formatting 👍

coverage report --show-missing --include=tests/* --fail-under 100
coverage report --show-missing --include=detect_secrets/* --fail-under 98
# This is so that we do not regress unintentionally
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@KevinHock KevinHock merged commit 5bc676e into master Sep 24, 2019
@KevinHock KevinHock deleted the add_filter_for_words branch September 24, 2019 02:00
killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request May 28, 2020
* use more python env

* keep pypy out
killuazhu pushed a commit to IBM/detect-secrets that referenced this pull request Jul 9, 2020
* use more python env

* keep pypy out
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants