Identifying GitHub "sample repositories" (SR), that mostly contain educational or demonstration materials supposed to be copied instead of reused as a dependency

h1alexbel/sr-detection

sr-detection

The goal of this study is to create a model that, given a repository's README file and meta-information, identifies GitHub "sample repositories" (SR): repositories that mostly contain educational or demonstration materials meant to be copied rather than reused as a dependency.

Motivation. While working on the CaM project, we needed to filter out repositories that contain samples. No readily available technique or tool could do this, so we conducted research on the subject.

The repository is structured as follows:

  • sr-data, a module with a set of tasks that filter the collected metadata about GitHub repositories.
  • sr-train, a module for training ML models.
  • sr-detector, a trained, reusable model for SR detection.
  • sr-paper, the LaTeX source for a paper on SR detection.

Hypotheses

Our research is based on the following hypotheses:

  • SRs usually don't have a release pipeline inside .github/workflows
  • SRs usually have a less strict build pipeline inside .github/workflows
  • SRs usually don't have releases
  • SRs have fewer pull requests
  • SRs don't have a section on how to use them
  • SRs have more disconnected directories/files

Run experiments

First, prepare datasets:

docker run --rm -v "$(pwd)/output:/collection" -e START="<start date>" \
  -e END="<end date>" -e COLLECT_TOKEN="<GitHub PAT to collect repositories>" \
  -e COLLECT_TOKEN="<GitHub PAT to fetch metadata>" \
  -e HF_TOKEN="<Huggingface PAT>" -e COHERE_TOKEN="<Cohere API token>" \
  -e OUT="sr-data" h1alexbel/sr-detection

In the output directory you should have these datasets:

  • d1-scores.csv
  • d2-sbert.csv
  • d3-e5.csv
  • d4-embedv3.csv
  • d5-scores+sbert.csv
  • d6-scores+e5.csv
  • d7-scores+embedv3.csv

Alternatively, you can download the existing datasets from the gh-pages branch.

Then run the models against the collected datasets:

just cluster

TBD..

How to contribute

Make sure that you have Python 3.10+, just, and npm installed on your system. Fork this repository, make changes, and send us a pull request. We will review your changes and apply them to the master branch shortly, provided they don't violate our quality standards. To avoid frustration, please run the full build before sending us your pull request:

just full
