This is the official repository for CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios, accepted to the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) 2024.
CoderUJB (Unified Java Benchmark) is a benchmark designed to evaluate LLMs across diverse Java programming tasks that are executable and reflective of actual development scenarios, acknowledging Java's prevalence in real-world software production.
- Install CoderUJB:
```bash
# create a new conda environment
conda create -n ujb python=3.10
conda activate ujb

# clone and install codeujb
git clone https://github.com/ZZR0/ISSTA24-CoderUJB.git
cd ISSTA24-CoderUJB
pip install -r requirements.txt
pip install -e .
```
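A quick sanity check that the install succeeded (this assumes the package is importable as `code_ujb`, matching the repository layout):

```bash
# verify the editable install is importable
python -c "import code_ujb; print('code_ujb OK')"
```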
For detailed package versions, please refer to `requirements.txt`.
- Refer to the [defects4j](https://github.com/rjust/defects4j) repository to set up the execution environment (see the setup sketch below).
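A minimal defects4j setup sketch, following the steps in the defects4j README (requires Java, Git, SVN, and Perl with `cpanm`; confirm version requirements in their documentation):

```bash
# clone defects4j and install its Perl dependencies
git clone https://github.com/rjust/defects4j
cd defects4j
cpanm --installdeps .
# download and set up the project repositories
./init.sh
# put the defects4j command on PATH and run a sanity check
export PATH=$PATH:$(pwd)/framework/bin
defects4j info -p Lang
```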
We support three backbones for generating CoderUJB answers: `hf`, `openai`, and `tgi`.
```bash
# generate answers with the huggingface `transformers` backbone.
python code_ujb/generate_hf.py \
    --model-path $model_name_or_path \
    --model-id $run_id \
    --gen-mode $gen_mode \
    --bench-name $dataset \
    --num-samples $num_samples \
    --save-generations-path ./log/$run_id/$dataset/generations-$gen_mode.json
```
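For example, to draw 10 samples per question on `codeujbcomplete` with a local instruction-tuned checkpoint (the checkpoint path below is illustrative):

```bash
python code_ujb/generate_hf.py \
    --model-path $HOME/models/deepseekcoder-instruct-7b \
    --model-id deepseekcoder-instruct-7b \
    --gen-mode chat \
    --bench-name codeujbcomplete \
    --num-samples 10 \
    --save-generations-path ./log/deepseekcoder-instruct-7b/codeujbcomplete/generations-chat.json
```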
```bash
# generate answers with the OpenAI API backbone.
export OPENAI_API_BASE=''
export OPENAI_API_KEY=''
python code_ujb/generate_api.py \
    --model-path $run_id \
    --model-id $run_id \
    --gen-mode $gen_mode \
    --bench-name $dataset \
    --num-samples $num_samples \
    --parallel 8 \
    --save-generations-path ./log/$run_id/$dataset/generations-$gen_mode.json
```
```bash
# If `model-id` is not in the OpenAI model list, `generate_api.py` will generate
# answers with the Text Generation Inference (TGI) backbone.
# Please refer to Text Generation Inference
# (https://github.com/huggingface/text-generation-inference) for deploying your TGI server first.
export TGI_API_URL_${run_id//-/_}=http://127.0.0.1:8081,http://127.0.0.1:8082 # the TGI API URLs (comma-separated).
python code_ujb/generate_api.py \
    --model-path $run_id \
    --model-id $run_id \
    --gen-mode $gen_mode \
    --bench-name $dataset \
    --num-samples $num_samples \
    --parallel 32 \
    --save-generations-path ./log/$run_id/$dataset/generations-$gen_mode.json
```
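A minimal sketch of serving a local checkpoint with TGI via Docker so the URL above resolves (the image tag, shared-memory size, and model path are assumptions; see the TGI documentation for recommended settings):

```bash
# serve the model on 127.0.0.1:8081 (container port 80)
docker run --gpus all --shm-size 1g -p 8081:80 \
    -v $HOME/models:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data/deepseekcoder-instruct-7b
```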
Arguments:
- `[model-path]` is the path to the model weights, which can be a local folder or a Hugging Face repo ID. If you are using `generate_api.py`, it should be the same as the model ID.
- `[model-id]` is a name you give to the model.
- `[gen-mode]` has two options: `complete` for models without instruction fine-tuning and `chat` for instruction-fine-tuned models.
- `[bench-name]` is the name of the dataset you want to evaluate. There are five datasets in CoderUJB: `codeujbrepair`, `codeujbcomplete`, `codeujbtestgen`, `codeujbtestgenissue`, and `codeujbdefectdetection`.
- `[num-samples]` is the number of samples to generate for each coding question.
- `[save-generations-path]` is the path where the generated answers are saved.
- `[parallel]` is the number of parallel API calls.

e.g.,
```bash
python code_ujb/generate_api.py --model-path gpt-3.5-turbo --model-id gpt-3.5-turbo --gen-mode chat --bench-name codeujbcomplete --num-samples 10 --save-generations-path log/gpt-3.5-turbo/codeujbcomplete/generations-chat.jsonl
```
The answers will be saved to `log/gpt-3.5-turbo/codeujbcomplete/generations-chat.jsonl`.
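To spot-check a generation before running the evaluation (assuming the `.jsonl` file stores one JSON record per line):

```bash
# pretty-print the first generated record
head -n 1 log/gpt-3.5-turbo/codeujbcomplete/generations-chat.jsonl | python -m json.tool
```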
Please make sure you have installed `defects4j` first.
```bash
python3 code_ujb/evaluate.py \
    --model-path $model_name_or_path \
    --model-id $run_id \
    --gen-mode $gen_mode \
    --bench-name $dataset \
    --num-samples $num_samples \
    --load-generations-path ./log/$run_id/$dataset/generations-$gen_mode.json \
    --eval-output-path ./log/$run_id/$dataset/evaluation-$gen_mode.json
```
Arguments:
- `[load-generations-path]` is the path to the generated answers.
- `[eval-output-path]` is the path to save the evaluation results.

e.g.,
```bash
python code_ujb/evaluate.py --model-path gpt-3.5-turbo --model-id gpt-3.5-turbo --gen-mode chat --bench-name codeujbcomplete --num-samples 10 --load-generations-path log/gpt-3.5-turbo/codeujbcomplete/generations-chat.jsonl --eval-output-path ./log/gpt-3.5-turbo/codeujbcomplete/evaluation-chat.json
```
The evaluation results will be saved to `./log/gpt-3.5-turbo/codeujbcomplete/evaluation-chat.json`.
```bash
# generate and evaluate with the OpenAI API; please set the OpenAI API key first.
# export OPENAI_API_BASE=''
# export OPENAI_API_KEY=''
./scripts/run_code_ujb.sh api_gen chat multiplepython gpt-3.5-turbo gpt-3.5-turbo
./scripts/run_code_ujb.sh eval chat multiplepython gpt-3.5-turbo gpt-3.5-turbo

# generate with ray inference
./scripts/run_code_ujb.sh local_gen chat multiplepython $HOME/models/deepseekcoder-instruct-7b deepseekcoder-instruct-7b
./scripts/run_code_ujb.sh eval chat multiplepython $HOME/models/deepseekcoder-instruct-7b deepseekcoder-instruct-7b

# generate with tgi inference
./scripts/run_code_ujb.sh tgi_gen chat multiplepython $HOME/models/deepseekcoder-instruct-7b deepseekcoder-instruct-7b
./scripts/run_code_ujb.sh eval chat multiplepython $HOME/models/deepseekcoder-instruct-7b deepseekcoder-instruct-7b
```
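The third positional argument selects the benchmark, so presumably any of the five CoderUJB datasets can be substituted for `multiplepython` (an assumption; check `scripts/run_code_ujb.sh` for the accepted values):

```bash
# assumption: the third argument is the benchmark name
./scripts/run_code_ujb.sh api_gen chat codeujbcomplete gpt-3.5-turbo gpt-3.5-turbo
./scripts/run_code_ujb.sh eval chat codeujbcomplete gpt-3.5-turbo gpt-3.5-turbo
```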