Releases: evalplus/repoqa

RepoQA v0.1.2

25 May 06:05

Notable updates

  • Fixed wget dependency
  • Propagated trust_remote_code for tokenizers

RepoQA v0.1.1

19 May 19:44

Notable updates

  • Trimming output before post-processing substantially improved results in certain cases @ganler
  • Fixed HF backend @zyzzzz-123 @ganler
  • HF backend supports attn-implementation to enable flash-attn 2 @ganler
  • Optimized the computation of trained context size @JialeTomTian #38
  • End-of-string optimization substantially improved inference speed @ganler
  • Improved post-processing accuracy by using a better regular expression @ganler

Finished features/fixes are listed as notable.
WIP updates will be listed in subsequent releases once they are fully done.

Full changelog: v0.1.0...v0.1.1

Quick examples

pip install repoqa
repoqa.search_needle_function --model "gpt4-turbo" --backend openai
repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code --attn-implementation "flash_attention_2"
repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google
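
To evaluate several hosted models in one go, the commands above can be generated from a list of model/backend pairs. The following is only a minimal sketch: it reuses the model names and the --model/--backend flags shown above, prints each command as a dry run instead of executing it, and the `count` variable is an illustrative helper, not part of RepoQA.

```shell
#!/bin/sh
# Dry run: print one evaluation command per model/backend pair.
# Remove the leading "echo" to actually launch the evaluations.
count=0
while read -r model backend; do
  echo repoqa.search_needle_function --model "\"$model\"" --backend "$backend"
  count=$((count + 1))
done <<'EOF'
gpt4-turbo openai
claude-3-haiku-20240307 anthropic
gemini-1.5-pro-latest google
EOF
echo "printed $count commands"
```

Backend-specific flags such as --trust-remote-code or --attn-implementation are left out of the sketch because they apply only to some backends.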

RepoQA v0.1.0

26 Apr 09:58

RepoQA for Long-Context Code Understanding

Introduction

RepoQA is a benchmark that aims to exercise LLMs' long-context code understanding ability.

  • Multi-Lingual: RepoQA now supports repositories from 5 programming languages:
    • Python
    • C++
    • TypeScript
    • Rust
    • Java
  • Application-driven: RepoQA aims to evaluate LLMs on long-context tasks that reflect real-life uses. Before RepoQA, long-context evaluations mainly focused on synthetic tasks that probe weak spots in an LLM's long-context handling, such as "Needle in the Code" by CodeQwen and "Needle in a Haystack".
  • The first RepoQA task we propose is 🔍 Searching Needle Function:
    • 500 sub-tasks = 5 PLs x 10 repos x 10 needles
    • Asks the model to search for the corresponding function (we call it the needle function) given a precise natural language description
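
The sub-task count follows directly from the grid of languages, repositories, and needles. As a quick check of that arithmetic, here is a small sketch; the lowercase language labels and variable names are illustrative assumptions and do not reflect how the dataset is actually keyed.

```shell
#!/bin/sh
# Count the sub-tasks implied by the grid: languages x repos x needles.
languages="python cpp typescript rust java"  # the 5 languages listed above (labels are illustrative)
repos_per_language=10
needles_per_repo=10

total=0
for lang in $languages; do
  per_lang=$((repos_per_language * needles_per_repo))
  echo "$lang: $per_lang sub-tasks"
  total=$((total + per_lang))
done
echo "total: $total sub-tasks"
```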

RepoQA is easy to use

  • Supports the following backends
    • OpenAI
    • Anthropic
    • vLLM
    • HuggingFace transformers
    • Google Generative AI API (Gemini)
  • 🚀 Evaluation can be done in one command
  • 🏆 A leaderboard: https://evalplus.github.io/repoqa.html

Quick examples

pip install repoqa
repoqa.search_needle_function --model "gpt4-turbo" --backend openai
repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code
repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google

RepoQA v0.1.0 Release Candidate 1

24 Apr 06:54
Pre-release
v0.1.0rc1

refactor: clean files for release

RepoQA Search-Needle-Function Dataset 2024-04-20

21 Apr 02:04
dev-dataset

refactor: optimize dataset name

Evaluated Results

20 Apr 03:06
Evaluated Results Pre-release
Pre-release

See attachment; some results might be incomplete.

Release of dependency and base dataset

11 Apr 00:24
Pre-release

We use this release to upload the per-language dependency files produced by https://github.com/evalplus/repoqa/tree/main/scripts/curate/dep_analysis