Various analyses of the project descriptions of the top 8,000 (4,000 for R) GitHub repositories across several languages.
We also train various models to predict a repo's language purely from its project description. Models include kNN, LASSO, and XGBoost in R, plus a PyTorch transformer (DistilRoBERTa). Overall, the transformer performed best on the holdout set (~67% accuracy vs. ~60% for XGBoost, the next closest).
The relevant files are:
- R/get_data.R - pulls data from the GitHub API (see the Python sketch after this list)
- R/simple_analysis.Rmd (HTML notebook) - EDA of repo information, excluding the description
- R/description_analysis.Rmd (HTML notebook) - EDA of the repo descriptions
- R/model.Rmd (HTML notebook) - fits and stacks the kNN, LASSO, and XGBoost models
- src/* - module for training the PyTorch transformer
- train.ipynb - simple Jupyter notebook for training the transformer and plotting training curves
- app.py - Streamlit app
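
For illustration, here is a minimal Python sketch of the kind of GitHub search API request R/get_data.R presumably makes; the actual script's queries and fields may differ. Note that the search API returns at most 1,000 results per query, so collecting the top 8,000 repos per language would need further slicing of the query (e.g., by star ranges), which is omitted here.

```python
# Hedged sketch of fetching the most-starred repos for one language.
# Not the repo's actual R/get_data.R; endpoints/fields are standard GitHub
# search API, but the real script's parameters may differ.
import requests

def top_repos(language: str, pages: int = 2) -> list[dict]:
    """Fetch the most-starred repos for a language, 100 per page."""
    repos = []
    for page in range(1, pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={
                "q": f"language:{language}",
                "sort": "stars",
                "order": "desc",
                "per_page": 100,
                "page": page,
            },
            headers={"Accept": "application/vnd.github+json"},
        )
        resp.raise_for_status()
        repos.extend(resp.json()["items"])
    # Keep only the fields the analysis needs: name and description.
    return [{"name": r["full_name"], "description": r["description"]}
            for r in repos]
```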
You can train the transformer model from the command line by running `python -m src.train` from the base directory.
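
For reference, here is a minimal sketch of what fine-tuning DistilRoBERTa as a sequence classifier looks like with Hugging Face transformers; this is illustrative, not the repo's actual src/train.py, and the label set, batch, and hyperparameters are assumptions.

```python
# Hedged sketch: fine-tuning DistilRoBERTa to classify descriptions by
# language. Labels and learning rate are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LANGUAGES = ["python", "r", "javascript", "java"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=len(LANGUAGES)
)

# One optimization step on a toy batch of (description, label) pairs.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(
    ["A fast web framework", "Tidyverse helpers for data frames"],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([LANGUAGES.index("python"), LANGUAGES.index("r")])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```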
After training, the model weights are saved in output/, and you can then run the Streamlit app to try out the model with `streamlit run app.py`.
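
As a rough idea of how app.py might load the saved weights and serve predictions (the paths, labels, and widgets below are assumptions, not the repo's actual code):

```python
# Hedged sketch of a Streamlit app that predicts a repo's language from a
# pasted description. Assumes the training script saved the model and
# tokenizer to output/ via save_pretrained; adjust loading if the repo
# stores raw state_dicts instead.
import streamlit as st
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LANGUAGES = ["python", "r", "javascript", "java"]  # hypothetical label set

@st.cache_resource
def load_model():
    tokenizer = AutoTokenizer.from_pretrained("output")
    model = AutoModelForSequenceClassification.from_pretrained("output")
    model.eval()
    return tokenizer, model

st.title("Predict a repo's language from its description")
description = st.text_area("Project description")

if st.button("Predict") and description:
    tokenizer, model = load_model()
    inputs = tokenizer(description, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    st.write(f"Predicted language: {LANGUAGES[logits.argmax(dim=-1).item()]}")
```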
To demo the app, visit the Heroku app (it may take a few minutes to boot up).