This repository has been archived by the owner on Jan 31, 2022. It is now read-only.


Issue_Embeddings

Python 3.7 | License: MIT | Powered by fastai | Weights & Biases

An API that returns embeddings from GitHub Issue Text.

Embeddings are learned from a language model Trained On 16M+ GitHub Issues.

The manifest files in /deployment define a service that returns 2,400-dimensional embeddings given the text of an issue. The manifests define an in-cluster (internal) service suitable for use as a backend for other services. If you need to expose it publicly, you can do so by:

  1. Exposing it behind the Kubeflow ingress so you get authorization and access control
  2. Creating its own ingress

All routes expect POST requests. The available endpoints are:

  1. http://issue-embedding-server/text: expects a JSON payload with title and body fields and returns a single 2,400-dimensional vector that represents latent features of the text. For example, this is how you would interact with this endpoint from Python:

    import requests
    import numpy as np

    API_ENDPOINT = 'http://issue-embedding-server/text'

    # A toy example of a GitHub Issue title and body
    data = {'title': 'Fix the issue',
            'body': 'I am encountering an error\n when trying to push the button.'}

    # Send the POST request and save the response
    r = requests.post(url=API_ENDPOINT, json=data)

    # Convert the raw bytes back into a numpy array of little-endian float32
    embeddings = np.frombuffer(r.content, dtype='<f4')
  2. https://issue-embedding-server/all_issues/<owner>/<repo> 🚧 this returns a numpy array of shape (# of labeled issues in the repo, 2400), as well as a list of the labels for each issue. This endpoint is still under construction.
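Since this endpoint is still under construction, its response format is not documented. The following is only a sketch of how one might decode such a response, assuming the embeddings arrive as a flat little-endian float32 buffer (mirroring the single-vector endpoint above) that can be reshaped to (n_issues, 2400):

```python
import numpy as np

EMBED_DIM = 2400  # embedding dimensionality stated in this README

def decode_embeddings(raw: bytes) -> np.ndarray:
    """Decode a flat float32 byte buffer into an (n_issues, 2400) array.

    Assumes the server serializes the matrix row-major as little-endian
    float32 -- an assumption, since the endpoint is under construction.
    """
    flat = np.frombuffer(raw, dtype='<f4')
    if flat.size % EMBED_DIM != 0:
        raise ValueError('buffer length is not a multiple of the embedding size')
    return flat.reshape(-1, EMBED_DIM)

# Simulate a response body containing 3 issue embeddings
payload = np.random.rand(3, EMBED_DIM).astype('<f4').tobytes()
matrix = decode_embeddings(payload)
print(matrix.shape)  # (3, 2400)
```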

A Language Model Trained On 16M+ GitHub Issues For Transfer Learning

Motivation: Issue Label Bot predicts three generic issue labels: bug, feature request, and question. However, it would be nice to predict personalized issue labels instead of generic ones. To accomplish this, we can use the issues that are already labeled in a repository as training data for a model that predicts personalized issue labels. One challenge with this approach is that each repository often has only a small number of labeled issues. To mitigate this, we use transfer learning: we train a language model on over 16 million GitHub Issues and fine-tune it to predict issue labels.
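To make the transfer-learning idea concrete, here is a toy sketch (not the project's actual classifier): treat the 2,400-dimensional embeddings as fixed features and fit a simple model on a repository's few labeled issues. A nearest-centroid classifier stands in for whatever downstream model would actually be used:

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Compute one mean embedding (centroid) per label."""
    labels = np.array(labels)
    return {lab: embeddings[labels == lab].mean(axis=0) for lab in set(labels)}

def predict(centroids, vector):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: np.linalg.norm(centroids[lab] - vector))

# Toy data: fake 2,400-dim "embeddings" for 4 labeled issues in one repo
rng = np.random.default_rng(0)
bug_a, bug_b = rng.normal(0, 1, 2400), rng.normal(0, 1, 2400)
feat_a, feat_b = rng.normal(5, 1, 2400), rng.normal(5, 1, 2400)
centroids = fit_centroids(np.stack([bug_a, bug_b, feat_a, feat_b]),
                          ['bug', 'bug', 'feature', 'feature'])
print(predict(centroids, rng.normal(5, 1, 2400)))  # feature
```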

Training the Language Model

The language model is built with the fastai library. The notebooks folder contains a tutorial covering the steps needed to build the language model:

  1. 01_AcquireData.ipynb: Describes how to acquire and pre-process the data using mdparse, which parses and annotates markdown files.
  2. 02_fastai_DataBunch.ipynb: The fastai library wraps PyTorch's DataLoader in an object called a DataBunch to encapsulate additional metadata and functionality. This notebook walks through the steps of preparing this data structure, which the model uses for training.
  3. 03_Create_Model.ipynb: This walks through the process of instantiating the fastai language model, along with callbacks for early stopping, logging and saving of artifacts. Additionally, this notebook illustrates how to train the model.
  4. 04_Inference.ipynb: shows how to use the language model to perform inference, extracting latent features in the form of a 2,400-dimensional vector from GitHub Issue text. This notebook shows how to load the DataBunch and model and save only the model for inference. /flask_app/inference.py contains utilities that make the inference process easier.
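As a rough illustration of how a fixed-length vector can come out of a language model's variable-length output (the notebooks above hold the authoritative details), a common ULMFiT-style recipe concatenates the last hidden state with a max-pool and a mean-pool over the sequence. With a hidden size of 800 (an assumption here, chosen only because 3 × 800 = 2,400), that yields the stated dimensionality:

```python
import numpy as np

def concat_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Pool a (seq_len, hidden_dim) matrix of hidden states into one vector.

    Concatenates [last state, max-pool, mean-pool], the ULMFiT-style
    pooling commonly used for fixed-size text representations.
    """
    last = hidden_states[-1]
    max_pool = hidden_states.max(axis=0)
    mean_pool = hidden_states.mean(axis=0)
    return np.concatenate([last, max_pool, mean_pool])

# Hidden states for a 12-token issue, hidden size 800 (assumed)
states = np.random.rand(12, 800).astype('float32')
features = concat_pool(states)
print(features.shape)  # (2400,)
```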

Putting it all together: hyper-parameter tuning

The hyperparam_sweep folder contains lm_tune.py, the script used to train the model. Most importantly, we use this script in conjunction with hyper-parameter sweeps in Weights & Biases.

We tried 538 different hyper-parameter combinations, running Bayesian and random grid search concurrently, to choose the best model.

The hyperparameter tuning process is described in greater detail in the hyperparam_sweep folder.
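The actual sweep configuration lives in the hyperparam_sweep folder; as a minimal sketch of what random search over a hyper-parameter space looks like, here is a toy sampler (the parameter names and ranges below are illustrative assumptions, not the project's real settings):

```python
import random

# Hypothetical search space -- illustrative names/ranges, not the real sweep config
SEARCH_SPACE = {
    'lr': lambda: 10 ** random.uniform(-5, -2),   # log-uniform learning rate
    'dropout': lambda: random.uniform(0.1, 0.6),
    'bptt': lambda: random.choice([70, 80, 90]),  # truncated-BPTT window
}

def sample_config():
    """Draw one random hyper-parameter combination from the search space."""
    return {name: draw() for name, draw in SEARCH_SPACE.items()}

random.seed(0)
trials = [sample_config() for _ in range(538)]  # 538 combinations, as in the README
print(len(trials))  # 538
```

In a real sweep, each sampled configuration would be passed to lm_tune.py and scored on validation loss, with Weights & Biases tracking the runs.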

Files

  • /notebooks: contains notebooks on how to gather and clean the data and train the language model.
  • /hyperparam_sweep: this folder contains instructions on doing a hyper-parameter sweep with Weights & Biases.
  • /flask_app: code for a flask app that is the API that listens for POST requests.
  • /script: this directory contains the entry point for running the REST API server that end users will interface with:
    • dev: this bash script pulls the necessary docker images and starts the API server.
    • bootstrap: this re-builds the docker image and pushes it to DockerHub. The container must be re-built any time the code for the flask app or language model is updated.
  • /deployment: This directory contains files that are helpful in deploying the app.
    • Dockerfile: the definition of the container used to run the flask app. The build for this container is hosted on DockerHub at hamelsmu/issuefeatures-api-cpu.
    • *.yaml: these files relate to a Kubernetes deployment.

Appendix: Location of Language Model Artifacts

Google Cloud Storage

Weights & Biases Run

https://app.wandb.ai/github/issues_lang_model/runs/22zkdqlr/overview