Advanced similarity and duplicate source code at scale.


Gemini

Find similar code in Git repositories

Gemini is a tool for searching for similar 'items' in source code repositories. The supported granularity levels for items are:

  • repositories (TBD)
  • files
  • functions

Gemini is based on its sister research project codenamed Apollo.


./hash   <path-to-repos-or-siva-files>
./query  <path-to-file>

If you run Gemini in Docker, prefix each command with docker-compose exec gemini. See below for how to start Gemini in Docker or in standalone mode.


To pre-process a number of repositories so that duplicates can be found quickly, run:

./hash ./src/test/resources/siva

The input format of the repositories is the same as in src-d/Engine.

To pre-process repositories for search of similar functions run:

./hash -m func ./src/test/resources/siva

Besides the local file system, Gemini supports several distributed storages.


To find all duplicates of a single file, run:

./query <path-to-single-file>

To find functions similar to any of the functions defined in a file, run:

./query -m func <path-to-single-file>

If you are interested in similarities for only one function defined in the file, run:

./query -m func <path-to-single-file>:<function name>:<line number where the function is defined>
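
The target for a single-function query is the file path, the function name and the definition line joined by colons. A minimal sketch of composing it in a shell script follows. The function name and line number are hypothetical placeholders, and the command is printed rather than executed:

```shell
# Hypothetical values: replace them with a real file, a function defined in
# it, and the line number where that function is defined.
file="./query/consumer.go"
func_name="Consume"
line=42

# Format from above: <path-to-single-file>:<function name>:<line number>
target="${file}:${func_name}:${line}"

# Dry run: print the command; remove "echo" to actually run the query.
echo ./query -m func "$target"
```

Keeping the three parts in separate variables makes it easy to loop over several functions of the same file.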


To find all duplicate files and similar functions across all repositories, run:

./report

All repositories must be hashed beforehand, and a community detection library must be installed.



Start containers:

docker-compose up -d

The local directories repositories and query are available as /repositories and /query inside the container.


docker-compose exec gemini ./hash /repositories
docker-compose exec gemini ./query /query/consumer.go
docker-compose exec gemini ./report


You will need:

  • JVM 1.8
  • Apache Cassandra or ScyllaDB
  • Apache Spark 2.2.x
  • Python 3
  • Bblfshd v2.5.0+

By default, all commands use:

  • an Apache Cassandra or ScyllaDB instance available at localhost:9042
  • Apache Spark, available through $SPARK_HOME

# save some repos in .siva files using Borges
echo -e "\n" > repo-list.txt

# get Borges from
borges pack --loglevel=debug --workers=2 --to=./repos -f repo-list.txt

# start Apache Cassandra
docker run -p 9042:9042 \
  --name cassandra -d rinscy/cassandra:3.11

# or ScyllaDB with a workaround
docker run -p 9042:9042 --volume $(pwd)/scylla:/var/lib/scylla \
  --name some-scylla -d scylladb/scylla:2.0.0 \
  --broadcast-address --listen-address --broadcast-rpc-address \
  --memory 2G --smp 1

# to get access to DB for development
docker exec -it some-scylla cqlsh

Configuration for Apache Spark

Use environment variables to set the memory for the hash job:

export DRIVER_MEMORY=30g

To use an external cluster, set the URL of the Spark master through an environment variable:

MASTER="spark://<spark-master-url>" ./hash <path>

CLI arguments

All three commands accept parameters for database connection and logging:

  • -h/--host - cassandra/scylla db hostname, default
  • -p/--port - cassandra/scylla db port, default 9042
  • -k/--keyspace - cassandra/scylla db keyspace, default hashes
  • -v/--verbose - producing more verbose output, default false

The query and hash commands also accept parameters that configure bblfsh and the features extractor:

  • -m/--mode - similarity mode: file or function (passed as func in the examples above), default file
  • --bblfsh-host - babelfish server host, default
  • --bblfsh-port - babelfish server port, default 9432
  • --features-extractor-host - features-extractor host, default
  • --features-extractor-port - features-extractor port, default 9001

Hash command specific arguments:

  • -l/--limit - limit the number of repositories to be processed. All repositories will be processed by default
  • -f/--format - format in which the repositories are stored: siva, bare or standard, default siva
  • --gcs-keyfile - path to JSON keyfile for authentication in Google Cloud Storage

Report specific arguments:

  • --output-format - output format: text or json
  • --cassandra - enable advanced CQL queries for an Apache Cassandra database
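
As an illustration of how these options combine, the sketch below hashes repositories in function mode against a non-default database host and then requests a JSON report. The host name db.example.internal and the ./repos path are placeholders. The port and keyspace are the documented defaults, written out explicitly. The commands are printed rather than executed:

```shell
# Connection settings. The host is a placeholder (an assumption of this
# example); 9042 and "hashes" are the documented defaults.
db_host="db.example.internal"
db_port=9042
keyspace="hashes"

# Options accepted by all three commands.
db_opts="-h $db_host -p $db_port -k $keyspace"

# Hash at most 10 repositories stored as .siva files, in function mode.
hash_cmd="./hash $db_opts -m func -l 10 -f siva ./repos"

# Report duplicates and similar functions as JSON.
report_cmd="./report $db_opts --output-format json"

# Dry run: print the commands instead of executing them.
echo "$hash_cmd"
echo "$report_cmd"
```

Passing the same database options to hash and report matters because the report reads what the hash step stored.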


Currently Gemini targets medium-sized repositories and datasets.

We set reasonable defaults and pre-filtering rules to provide the best results for this case. List of rules:

  • Exclude binary files
  • Exclude empty files from full duplication results
  • Exclude files smaller than 500 B from file-similarity results
  • Similarity deduplication works only for syntactically correct files in languages supported by Babelfish
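
To estimate in advance which files the size and binary rules above would leave out, a rough approximation can be scripted. This is not Gemini's implementation: prefilter_status is a hypothetical helper, the binary check relies on the -I option of GNU or BSD grep, and the language and syntax rule is not covered.

```shell
# Rough approximation of the first three pre-filtering rules listed above.
# NOT Gemini's own code; prefilter_status is a hypothetical helper name.
prefilter_status() {
  f="$1"
  size=$(( $(wc -c < "$f") ))      # file size in bytes
  if [ "$size" -eq 0 ]; then
    echo "empty"                   # excluded from full duplication results
  elif ! grep -Iq '' "$f"; then
    echo "binary"                  # grep -I treats binary files as non-matching
  elif [ "$size" -lt 500 ]; then
    echo "small"                   # excluded from file-similarity results
  else
    echo "ok"
  fi
}

# Usage: classify every regular file in the local query directory.
for candidate in ./query/*; do
  if [ -f "$candidate" ]; then
    echo "$candidate: $(prefilter_status "$candidate")"
  fi
done
```

The empty check comes first because an empty file would otherwise be reported as binary by the grep test.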

Performance tips

We recommend running Spark with 10 GB or more of memory for each executor and for the driver. Gemini does not benefit from more than 1 CPU per task.

Horizontal scaling does not work well for the first stage of the pipeline, whose run time depends on the size of the biggest repositories in the dataset. The rest of the pipeline scales well.

Distributed storages

Gemini supports several distributed storages in both local and cluster mode. All the necessary jars are already included in the fat jar.


HDFS

Path format to git repositories: hdfs://hdfs-namenode/path

To configure HDFS in local or cluster mode, please consult the Hadoop documentation.

Google Cloud Storage

Path format to git repositories: gs://bucket/path

To connect to GCS locally, use the --gcs-keyfile flag with the path to a JSON keyfile.

To use GCS in cluster mode, please consult the Google Cloud Storage Connector documentation.

Amazon Web Services S3

Path format to git repositories: s3a://bucket/path

To connect to S3 locally, use the following flags:

  • --aws-key - AWS access keys
  • --aws-secret - AWS access secret
  • --aws-s3-endpoint - region endpoint of your S3 bucket

Due to some limitations, passing the key and secret as part of the URI is not supported.

To use AWS S3 in cluster mode, please consult the hadoop-aws documentation.
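
Putting the S3 flags together, a local run against a bucket might look like the sketch below. The bucket name, path, endpoint and credential values are placeholders, and the endpoint format shown is an assumption based on standard AWS regional endpoints. The command is printed rather than executed:

```shell
# Placeholders: substitute your own bucket, path and region endpoint.
bucket_path="s3a://my-bucket/repos"
endpoint="s3.eu-west-1.amazonaws.com"

# Dummy credentials for illustration only. They are passed as flags because
# embedding them in the URI is not supported.
key="<access-key>"
secret="<access-secret>"

cmd="./hash --aws-key $key --aws-secret $secret --aws-s3-endpoint $endpoint $bucket_path"

# Dry run: print the command instead of executing it.
echo "$cmd"
```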

Known bugs

  • Search for similarities in C# code isn't supported right now (patch with workaround)
  • In our experience, the timeout for UAST extraction is relatively low for real datasets, and it isn't configurable (patch1 and patch2 with workaround)
  • For the standard and bare formats, Gemini prints a wrong repository listing (issue)


Compile & Run

If the environment variable DEV is set, ./sbt is used to compile and run all non-Spark commands: ./query and ./report. This is convenient for local development: with no separate "compile" step, the workflow resembles that of an interpreted language.


To build the final .jars for all commands, run:

./sbt assemblyPackageDependency
./sbt assembly

Instead of one fat jar we build two, separating the dependencies from the application code. This keeps build times low when only the application code changes.


To run tests, that rely

./sbt test

Re-generate code

The latest generated code for gRPC is already checked in under src/main/scala/tech/sourced/featurext. If you update any of the src/main/proto/*.proto files, you need to regenerate the gRPC code for Feature Extractors:


To generate new protobuf message fixtures for tests, you may use bblfsh-sdk-tools:

bblfsh-sdk-tools fixtures -p .proto -l <LANG> <path-to-source-code-file>


Copyright (C) 2018 source{d}. This project is licensed under the GNU General Public License v3.0.