forked from cfpb/clouseau

Search your repository's git history for undesirable text patterns such as passwords, ssh keys and other personally identifiable information


virtix/clouseau

 
 

Clouseau

What is Clouseau?

Clouseau is a silly git repo inspector.

Clouseau is a P.I. for your PII. It searches git commits -- source code and commit messages -- for undesirable text patterns, such as passwords, ssh keys and personally identifiable information. You can search for profanity or other information with a new pattern file or a regular expression specified on the command line.

See the Getting involved section at the end of this README for the current status of this project and how to contribute.

Dependencies

See the requirements.txt file for additional dependencies to be installed in the quick setup.

Quick setup

  1. Clone this repository somewhere you can execute Python code.

  2. From the cloned Clouseau project root, set up a virtualenv:

    virtualenv --no-site-packages --distribute venv    # creates the virtualenv named "venv"
    source venv/bin/activate                           # activates (places you in) the virtualenv

  3. Install the requirements:

    pip install -r requirements.txt

  4. Tell Python to also look in this directory for libraries:

    export PYTHONPATH=$PYTHONPATH:.

And that's it! Now follow the usage instructions below.

Basic Usage

Search a GitHub repository for pattern matches:

$ bin/clouseau --url [repo-url]

For example:

$ bin/clouseau --url https://github.com/virtix/cato.git

This will search against the default pattern file (clouseau/patterns/default.txt) and display any matches for each of the patterns the file contains.
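To make the pattern-file idea concrete, here is a minimal sketch of matching text against a list of regular expressions, one per line. It is not Clouseau's actual implementation: the function names `load_patterns` and `find_matches` are hypothetical, and treating `#` lines as comments is an assumption rather than documented behavior of `default.txt`.

```python
import re
from typing import Iterable, List, Tuple


def load_patterns(lines: Iterable[str]) -> List[str]:
    """Return non-empty, non-comment lines as regular-expression strings."""
    patterns = []
    for line in lines:
        stripped = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if stripped and not stripped.startswith("#"):
            patterns.append(stripped)
    return patterns


def find_matches(patterns: List[str], text: str) -> List[Tuple[str, int, str]]:
    """Return (pattern, 1-based line number, line) for every matching line."""
    compiled = [(source, re.compile(source)) for source in patterns]
    matches = []
    for number, line in enumerate(text.splitlines(), start=1):
        for source, regex in compiled:
            if regex.search(line):
                matches.append((source, number, line))
    return matches


if __name__ == "__main__":
    # Illustrative pattern only; it is not taken from the default pattern file.
    sample = load_patterns(["# credentials", r"password\s*=\s*\S+", ""])
    for hit in find_matches(sample, "user = bob\npassword = hunter2\n"):
        print(hit)
```

Reporting the pattern alongside each hit mirrors the way the tool displays matches per pattern.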

Matches are printed to the console. For sample output (from clouseau_thin), see the "Running as a post-commit hook" section below.

Additional Usage Options

Search using a single regular expression:
$ bin/clouseau --url https://github.com/virtix/cato.git --term "Your Name"

Search the entire history for a single term (quite slow and needs threading or multi-process work):
$ bin/clouseau --url https://github.com/virtix/cato.git --term "Your Name" --revlist all

Search the current revision using a different pattern file:
$ bin/clouseau -u https://github.com/virtix/cato.git --patterns ~/projects/patterns/profanity.txt

Search the current revision using multiple pattern files:
$ bin/clouseau -u https://github.com/virtix/cato.git --patterns ~/projects/patterns/profanity.txt,~/projects/patterns/custom_pattern.txt
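A comma-separated `--patterns` value has to be split into individual paths before the files can be read. The sketch below shows one way to do that; `split_pattern_paths` is a hypothetical helper, not a function from the Clouseau code base. Expanding `~` in each part is a design assumption worth noting: shells such as bash only expand a tilde at the start of a word, so the second `~` in the example above reaches the program unexpanded.

```python
import os
from typing import List


def split_pattern_paths(value: str) -> List[str]:
    """Split a comma-separated --patterns value into individual file paths."""
    paths = []
    for part in value.split(","):
        part = part.strip()
        if part:
            # Expand a leading '~' ourselves, since the shell may not have.
            paths.append(os.path.expanduser(part))
    return paths


if __name__ == "__main__":
    print(split_pattern_paths("~/projects/patterns/profanity.txt,~/projects/patterns/custom_pattern.txt"))
```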

Skip either cloning or pulling and just scan:
$ bin/clouseau -u https://github.com/virtix/cato.git --skip

Search a specific revision:
$ bin/clouseau -u https://github.com/virtix/cato.git --revlist 5c0b30b007

Search between the range of two commits:
$ bin/clouseau -u https://github.com/virtix/cato.git --revlist d46868fe...3ea013e8

Search since a given date:
$ bin/clouseau -u https://github.com/virtix/cato.git --after 03/10/13

Blame (search only commits made by a given author):
$ bin/clouseau -u https://github.com/virtix/cato.git --author bill
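The revision, date, and author options above correspond closely to commit-limiting options of `git rev-list`. The sketch below builds such a command line as an argument list. It only illustrates that correspondence and is an assumption about how the options could be translated, not a description of what Clouseau runs internally; `rev_list_command` is a hypothetical name. Note that `git rev-list <rev>` lists the named revision together with its ancestors, so a tool that wants exactly one commit would need an additional limit.

```python
from typing import List, Optional


def rev_list_command(revlist: str = "HEAD",
                     after: Optional[str] = None,
                     before: Optional[str] = None,
                     author: Optional[str] = None) -> List[str]:
    """Build a `git rev-list` argument vector for the given filters."""
    command = ["git", "rev-list"]
    if after:
        command.append("--after=" + after)
    if before:
        command.append("--before=" + before)
    if author:
        command.append("--author=" + author)
    if revlist == "all":
        # 'all' means the entire history, i.e. every ref.
        command.append("--all")
    else:
        # e.g. "HEAD", "5c0b30b007", or a range such as "d46868fe...3ea013e8"
        command.extend(revlist.split())
    return command


if __name__ == "__main__":
    print(rev_list_command("all", after="03/10/13", author="bill"))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.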

Intended command-line interface

$ bin/clouseau -h
usage: clouseau [-h] [-v] --url URL [--term TERM] [--patterns PATTERNS]
                [--clean] [--output OUTPUT_FORMAT]
                [--output-destination OUTPUT_DESTINATION] [--dest DEST]
                [--revlist REVLIST]

Clouseau: A silly git inspector

 optional arguments:
   -h, --help               show this help message and exit
   -v, --version            show program's version number and exit
   --url URL, -u URL        Fully qualified git URL (http://www.kernel.org/pub//software/scm/git/docs/git-clone.html)
   --term TERM, -t TERM     Search for a single regular expression instead of every term in patterns.txt
   --patterns PATTERNS, -p PATTERNS
                            Path to list of regular expressions to use.
   --clean, -c              Delete the existing git repo and re-clone
   --output OUTPUT_FORMAT, -o OUTPUT_FORMAT  (NOT YET IMPLEMENTED)
                            Output formats: console, markdown, raw, html, json
   --output-destination OUTPUT_DESTINATION, -od OUTPUT_DESTINATION  (NOT YET IMPLEMENTED)
                            Location where the output is to be stored. Default ./temp.
   --dest DEST, -d DEST  The directory where the git repo is stored. Default: ./temp  (NOT YET IMPLEMENTED)
   --revlist REVLIST, -rl REVLIST
                           A space-delimited list of revisions (commits) to search.
                           Defaults to HEAD. Specify 'all' to search the entire history.
   --before BEFORE, -b BEFORE
                            Search commits that occur prior to this date; e.g., Mar-08-2013
   --after AFTER, -a AFTER
                            Search commits that occur after this date; e.g., Mar-10-2013
   --author AUTHOR         Search only commits made by AUTHOR; e.g., an email address or name.
   --skip   SKIP           If specified, skips any calls to git-clone or git-pull.
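For readers who want to experiment with the interface, the following is a rough `argparse` sketch of the options listed above that are not marked as unimplemented. It is a reconstruction from the help text, not the project's parser: `build_parser` is a hypothetical name, treating `--skip` and `--clean` as boolean flags is an assumption, and the help strings are abbreviated.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Approximate the documented Clouseau options (a sketch, not the real parser)."""
    parser = argparse.ArgumentParser(
        prog="clouseau", description="Clouseau: A silly git inspector")
    parser.add_argument("--url", "-u", required=True,
                        help="Fully qualified git URL")
    parser.add_argument("--term", "-t",
                        help="Search for a single regular expression")
    parser.add_argument("--patterns", "-p",
                        help="Path to list of regular expressions to use")
    parser.add_argument("--clean", "-c", action="store_true",
                        help="Delete the existing git repo and re-clone")
    parser.add_argument("--revlist", "-rl", default="HEAD",
                        help="Revisions to search; 'all' for the entire history")
    parser.add_argument("--before", "-b",
                        help="Search commits that occur prior to this date")
    parser.add_argument("--after", "-a",
                        help="Search commits that occur after this date")
    parser.add_argument("--author",
                        help="Search commits made by this author")
    # Assumption: --skip is a flag that takes no value.
    parser.add_argument("--skip", action="store_true",
                        help="Skip any calls to git-clone or git-pull")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-u", "https://github.com/virtix/cato.git", "--revlist", "all", "--skip"])
    print(args)
```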

Minimal output

For continuous integration environments, minimal output may be desirable. In that case, use bin/clouseau_thin:

$ bin/clouseau_thin -u [git_url] ...

clouseau_thin supports all clouseau options and differs only in the verbosity and attractiveness of its output.

Running as a post-commit hook

First, install Clouseau: change directory to your cloned Clouseau project root and run pip install -e ./

Test the install by changing to any other directory and running clouseau and clouseau_thin.

Now, change to one of your local git repos.

Create .git/hooks/post-commit and make it executable (chmod +x .git/hooks/post-commit)

Edit it with content such as this:

#!/bin/sh

echo "running clouseau"
remote_url=$(git config --get remote.origin.url)
# Quote the expansions so URLs and paths containing spaces survive word splitting.
clouseau_thin -u "$remote_url" --skip --dest "$(dirname "$(pwd)")" --revlist="HEAD"

Now, make a commit to that project.

You should see that Clouseau runs and finds nothing.

Make another commit, this time adding something that looks like an SSN or IP address to the file and/or the commit message. Clouseau runs again, and you should see output such as this:

running clouseau
Skipping git-clone or git-pull as --skip was found on the command line.
Clouseau: a silly git inspector, searching [your_git_url]

✓  hooktest.txt
Search term:  username[ ]*=[ ]*.+
git@github.com:marcesher/cato/commit/0731c34b40bcd4322c6b4daf044ec3587211808a
Author: Marc Esher <marc.esher@gmail.com> Date:   Tue Feb 25 15:41:37 2014 -0500
my username=foo

+production_ip=127.0.0.1  Line:19
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

✓  Commit Message
Search term:  username[ ]*=[ ]*.+
git@github.com:marcesher/cato/commit/0731c34b40bcd4322c6b4daf044ec3587211808a
Author: Marc Esher <marc.esher@gmail.com> Date:   Tue Feb 25 15:41:37 2014 -0500
my username=foo

my username=foo  Line:1
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
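The first match in the output above is an added line of the commit's diff (note the leading `+`). As a minimal sketch of that idea, the function below scans only the added lines of a unified diff for a pattern. It is an illustration under stated assumptions: `added_line_matches` and `IP_LIKE` are hypothetical names, the IP-like expression is not taken from the default pattern file, and the line numbers returned are positions within the diff text, which may differ from the `Line:` values Clouseau reports.

```python
import re
from typing import List, Tuple

# Hypothetical pattern for dotted-quad, IPv4-looking text.
IP_LIKE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def added_line_matches(diff_text: str, regex: re.Pattern) -> List[Tuple[int, str]]:
    """Return (1-based diff line number, line) for added lines that match regex."""
    hits = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # '+++' is the new-file header of a unified diff, not an added line.
        if line.startswith("+") and not line.startswith("+++") and regex.search(line):
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    sample_diff = (
        "--- a/hooktest.txt\n"
        "+++ b/hooktest.txt\n"
        "@@ -18,0 +19 @@\n"
        "+production_ip=127.0.0.1\n"
    )
    for number, line in added_line_matches(sample_diff, IP_LIKE):
        print(number, line)
```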

Running with Docker

Clouseau is now in the Docker index and you can run it with a simple docker command:

docker run -i -e "GIT_URL=https://github.com/virtix/cato.git" -t dlapiduz/clouseau

Running unit tests

To run unit tests, issue:

nosetests

Getting involved

If you're interested in using Clouseau to scan your source code and commit messages for undesirable content, please get involved.

Clouseau is currently in an early stage of development and not recommended for production use.

  • Proof of concept
  • Multiple output formats
  • Works on reasonably sized repos (concurrency)
  • Stores previous runs

The intent is that this can be run against any repo and it will search the index for file blobs containing the patterns defined in a patterns.txt file or a regular expression specified on the command line.

We welcome feature requests, bug reports, and code / documentation improvements. We also welcome stories of how you're using Clouseau.

General instructions on how to contribute are described in CONTRIBUTING.

Open source licensing info

  1. TERMS
  2. LICENSE
  3. CFPB Source Code Policy
