Find similar code in Git repositories
Gemini is a tool for searching for similar 'items' in source code repositories. The supported granularity levels for items are:
- repositories (TBD)
- files
- functions
Gemini is based on its sister research project codenamed Apollo.
./hash <path-to-repos-or-siva-files>
./query <path-to-file>
./report
If you run Gemini in Docker, you will need to prefix the commands with docker-compose exec gemini. Read below how to start Gemini in Docker or standalone mode.
To pre-process a number of repositories for quick duplicate lookup, run:
./hash ./src/test/resources/siva
The input format of the repositories is the same as in src-d/Engine.
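If your repositories are stored as regular git clones rather than siva files, the -f flag described below selects the storage format; a minimal sketch, with a placeholder path:

```
# hash regular (non-siva) git clones stored under a local directory
./hash -f standard /path/to/cloned/repos
```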
To pre-process repositories for the search of similar functions, run:
./hash -m func ./src/test/resources/siva
Besides the local file system, Gemini supports different distributed storages; see below.
To find all duplicates of a single file, run:
./query <path-to-single-file>
To find all similar functions defined in a file, run:
./query -m func <path-to-single-file>
If you are interested in the similarities of only one function defined in the file, you can run:
./query -m func <path-to-single-file>:<function name>:<line number where the function is defined>
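For example, to look up a single function by name and definition line (the file path, function name, and line number here are hypothetical):

```
./query -m func ./src/parser.go:ParseHeader:42
```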
To find all duplicate files and similar functions across all repositories, run:
./report
All repositories must be hashed beforehand, and a community detection library must be installed.
Start containers:
docker-compose up -d
The local directories `repositories` and `query` are available as `/repositories` and `/query` inside the container.
Examples:
docker-compose exec gemini ./hash /repositories
docker-compose exec gemini ./query /query/consumer.go
docker-compose exec gemini ./report
You would need:
- JVM 1.8
- Apache Cassandra or ScyllaDB
- Apache Spark 2.2.x
- Python 3
- Bblfshd v2.5.0+
By default, all commands are going to use:
- an Apache Cassandra or ScyllaDB instance available at `localhost:9042`
- Apache Spark, available through `$SPARK_HOME`
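If your database lives elsewhere, the connection flags described below override these defaults; a sketch with a placeholder hostname:

```
# point hashing at a non-local Cassandra/ScyllaDB instance
./hash -h db.internal.example -p 9042 ./src/test/resources/siva
```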
# save some repos in .siva files using Borges
echo -e "https://github.com/src-d/borges.git\nhttps://github.com/erizocosmico/borges.git" > repo-list.txt
# get Borges from https://github.com/src-d/borges/releases
borges pack --loglevel=debug --workers=2 --to=./repos -f repo-list.txt
# start Apache Cassandra
docker run -p 9042:9042 \
--name cassandra -d rinscy/cassandra:3.11
# or ScyllaDB, with a workaround for https://github.com/gocql/gocql/issues/987
docker run -p 9042:9042 --volume $(pwd)/scylla:/var/lib/scylla \
--name some-scylla -d scylladb/scylla:2.0.0 \
--broadcast-address 127.0.0.1 --listen-address 0.0.0.0 --broadcast-rpc-address 127.0.0.1 \
--memory 2G --smp 1
# to get access to DB for development
docker exec -it some-scylla cqlsh
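From there you can, for instance, inspect the keyspace Gemini writes to (the `hashes` keyspace is the documented default; the exact table layout is not shown here):

```
docker exec some-scylla cqlsh -e "DESCRIBE KEYSPACE hashes"
```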
Use env variables to set memory for the hash job:
export DRIVER_MEMORY=30g
export EXECUTOR_MEMORY=60g
To use an external cluster, just set the Spark Master URL through an env var:
MASTER="spark://<spark-master-url>" ./hash <path>
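Putting the memory settings and an external master together (the master URL and repositories path are placeholders):

```
export DRIVER_MEMORY=30g
export EXECUTOR_MEMORY=60g
MASTER="spark://spark-master.internal:7077" ./hash /path/to/repos
```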
All three commands accept parameters for database connection and logging:
- `-h/--host` - cassandra/scylla db hostname, default `127.0.0.1`
- `-p/--port` - cassandra/scylla db port, default `9042`
- `-k/--keyspace` - cassandra/scylla db keyspace, default `hashes`
- `-v/--verbose` - produce more verbose output, default `false`
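For example, querying against a remote database with a custom keyspace and verbose output (hostname, keyspace, and file path are placeholders):

```
./query -h 10.1.0.12 -p 9042 -k gemini -v /path/to/file.go
```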
For the `query` and `hash` commands, parameters for bblfsh/feature extractor configuration are available:
- `-m/--mode` - similarity modes: `file` or `function`, default `file`
- `--bblfsh-host` - babelfish server host, default `127.0.0.1`
- `--bblfsh-port` - babelfish server port, default `9432`
- `--features-extractor-host` - features-extractor host, default `127.0.0.1`
- `--features-extractor-port` - features-extractor port, default `9001`
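For example, hashing in function-similarity mode against non-local bblfsh and feature extractor instances (all hosts are placeholders):

```
./hash -m func --bblfsh-host 10.1.0.20 --bblfsh-port 9432 \
  --features-extractor-host 10.1.0.21 --features-extractor-port 9001 \
  ./src/test/resources/siva
```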
Hash command specific arguments:
- `-l/--limit` - limit the number of repositories to be processed; all repositories are processed by default
- `-f/--format` - format of the stored repositories: `siva`, `bare` or `standard`, default `siva`
- `--gcs-keyfile` - path to a JSON keyfile for authentication in Google Cloud Storage
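For example, hashing at most 100 repositories stored as bare git repositories (the path is a placeholder):

```
./hash -f bare -l 100 /path/to/bare/repos
```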
Report specific arguments:
- `--output-format` - output format: `text` or `json`
- `--cassandra` - enable advanced CQL queries for the Apache Cassandra database
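For example, producing a machine-readable report while using Cassandra-specific queries:

```
./report --output-format json --cassandra
```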
Currently, Gemini targets medium-size repositories and datasets.
We set reasonable defaults and pre-filtering rules to provide the best results for this case. List of rules:
- Exclude binary files
- Exclude empty files from full duplication results
- Exclude files less than 500B from file-similarity results
- Similarity deduplication works only for languages supported by babelfish and syntactically correct files
We recommend running Spark with 10GB+ memory for each executor and for the driver. Gemini wouldn't benefit from more than 1 CPU per task.
Horizontal scaling doesn't work well for the first stage of the pipeline and depends on the size of the biggest repositories in a dataset, but the rest of the pipeline scales well.
Gemini supports different distributed storages in local and cluster mode. It already includes all necessary jars as part of the fat jar.
Path format to git repositories: hdfs://hdfs-namenode/path
To configure HDFS in local or cluster mode please consult Hadoop documentation.
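A minimal sketch, assuming a hypothetical namenode and path:

```
./hash hdfs://hdfs-namenode/path/to/repos
```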
Path format to git repositories: gs://bucket/path
To connect to GCS locally, use the `--gcs-keyfile` flag with the path to a JSON keyfile.
To use GCS in cluster mode please consult Google Cloud Storage Connector documentation.
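A minimal local-mode sketch; the bucket name and keyfile path are placeholders:

```
./hash --gcs-keyfile ./gcs-keyfile.json gs://my-bucket/repositories
```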
Path format to git repositories: s3a://bucket/path
To connect to S3 locally, use the following flags:
- `--aws-key` - AWS access key
- `--aws-secret` - AWS access secret
- `--aws-s3-endpoint` - region endpoint of your S3 bucket
Due to some limitations, passing the key & secret as part of the URI is not supported.
To use AWS S3 in cluster mode, please consult the hadoop-aws documentation.
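A minimal local-mode sketch; the credentials, endpoint, and bucket are placeholders:

```
./hash --aws-key <access-key> --aws-secret <secret> \
  --aws-s3-endpoint s3.eu-west-1.amazonaws.com s3a://my-bucket/repositories
```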
- Search for similarities in C# code isn't supported right now (patch with workaround)
- Timeout for UAST extraction is relatively low on real datasets according to our experience, and it isn't configurable (patch1 and patch2 with workaround)
- For standard & bare format gemini prints wrong repositories listing (issue)
If the env var DEV is set, ./sbt is used to compile and run all non-Spark commands: ./query and ./report.
This is convenient for local development, as not requiring a separate "compile" step allows for a dev workflow similar to the experience with interpreted languages.
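For example (the queried file path is a placeholder):

```
DEV=1 ./query /path/to/file.go
```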
To build the final .jars for all commands, run:
./sbt assemblyPackageDependency
./sbt assembly
Instead of one fat jar we build two, separating all the dependencies from the actual application code to allow for lower build times in case of simple changes.
To run the tests:
./sbt test
The latest generated code for gRPC is already checked in under src/main/scala/tech/sourced/featurext.
In case you update any of the src/main/proto/*.proto files, you will need to re-generate the gRPC code for the Feature Extractors:
./src/main/resources/generate_from_proto.sh
To generate new protobuf message fixtures for tests, you may use bblfsh-sdk-tools:
bblfsh-sdk-tools fixtures -p .proto -l <LANG> <path-to-source-code-file>
Copyright (C) 2018 source{d}. This project is licensed under the GNU General Public License v3.0.