ilnaes/github-analysis

GitHub analysis

Various analyses of the top 8000 project descriptions (4000 for R) on GitHub across several languages.

We also train several models to predict a project's language based purely on its description. The models include k-NN, LASSO, and XGBoost in R, and a PyTorch transformer (DistilRoBERTa). Overall, the transformer performed best on the holdout set (~67%, versus ~60% for XGBoost, the next closest).
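To make the task concrete, here is a minimal, standard-library-only sketch of predicting a language from a description with a k-NN classifier over word counts. It is not the code in this repository (the k-NN model here is fit in R); the toy data and the function names `tokenize`, `cosine`, `knn_predict`, and `accuracy` are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, description, k=3):
    """Majority vote among the k training descriptions most similar to the query."""
    query = Counter(tokenize(description))
    scored = sorted(
        ((cosine(query, Counter(tokenize(text))), label) for text, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, holdout, k=3):
    """Fraction of holdout descriptions whose language is predicted correctly."""
    hits = sum(knn_predict(train, text, k) == label for text, label in holdout)
    return hits / len(holdout)


# Hypothetical toy data, not taken from the repository's dataset.
TRAIN = [
    ("A fast web framework for building APIs with Python", "Python"),
    ("Deep learning library with Python bindings", "Python"),
    ("Data frames and plotting for Python", "Python"),
    ("Tidy data manipulation for R", "R"),
    ("Grammar of graphics plotting package for R", "R"),
    ("Statistical modeling package for R users", "R"),
]
HOLDOUT = [
    ("A Python library for scraping the web", "Python"),
    ("An R package for tidy statistical summaries", "R"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, "A Python library for scraping the web"))
    print(accuracy(TRAIN, HOLDOUT))
```

The toy descriptions name their language outright, which makes the problem far easier than it is on real data, where many descriptions never mention the language at all.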

The relevant files are

  • R/get_data.R - get data from the GitHub API
  • R/simple_analysis.Rmd (html notebook) - EDA of repo information, not including the description
  • R/description_analysis.Rmd (html notebook) - EDA of repo descriptions
  • R/model.Rmd (html notebook) - fitting k-NN, LASSO, and XGBoost models and stacking them
  • src/* - module for training the PyTorch transformer
  • train.ipynb - simple Jupyter notebook for training transformer and plotting training curves
  • app.py - Streamlit app
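As a rough illustration of the data-collection step, the sketch below builds request URLs for GitHub's repository search endpoint, sorted by stars. It is an assumption about how such data could be fetched, not a description of what R/get_data.R does: ranking "top" projects by stars, and the helper names `search_url` and `page_urls`, are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def search_url(language, page=1, per_page=100):
    """Build a search URL for the most-starred repositories in one language."""
    params = {
        "q": f"language:{language}",
        "sort": "stars",   # assumption: "top" means most starred
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def page_urls(language, n_repos, per_page=100):
    """List the page URLs needed to cover the first n_repos results."""
    pages = -(-n_repos // per_page)  # ceiling division
    return [search_url(language, page, per_page) for page in range(1, pages + 1)]


if __name__ == "__main__":
    for url in page_urls("python", 250):
        print(url)
```

Note that the search API returns at most 1000 results per query, so collecting several thousand repositories per language would need the query split into smaller slices (for example, by star range).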

To train the transformer model from the command line, run the following from the base directory:

python -m src.train

After training, the model weights are saved in output. You can then run the Streamlit app to try out the model:

streamlit run app.py

To demo the app, visit the Heroku app. (It may need a few minutes to boot up.)
