The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Evaluate your LLM's responses with Prometheus and GPT-4 💯
⚖️ The First Coding Agent-as-a-Judge
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
CodeUltraFeedback: aligning large language models to coding preferences
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
Solving Inequality Proofs with Large Language Models.
A set of tools to create synthetically generated data from documents
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
Harnessing Large Language Models for Curated Code Reviews
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?
MCP for Root Signals Evaluation Platform
LLM-as-a-judge for Extractive QA datasets
Explore techniques to use small models as jailbreaking judges
Controversial Questions for Argumentation and Retrieval
Use Groq for evaluations
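Most of the projects above implement some variant of the same core pattern: prompt a strong "judge" model with an evaluation rubric and ask it to score another model's output. The sketch below is a minimal, generic illustration of that pattern, assuming the OpenAI Python client; the rubric wording, model name, and `judge` helper are illustrative and are not taken from any repository listed here.

```python
# Minimal LLM-as-a-judge sketch: a strong model scores another model's answer
# against a simple 1-5 rubric. The rubric, prompt wording, and model name are
# assumptions for illustration, not the API of any project listed above.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the ASSISTANT ANSWER to the
USER QUESTION on a 1-5 scale (5 = fully correct, relevant, and well-grounded).
Reply with JSON: {{"score": <1-5>, "rationale": "<one sentence>"}}.

USER QUESTION:
{question}

ASSISTANT ANSWER:
{answer}
"""

def judge(question: str, answer: str, model: str = "gpt-4o") -> dict:
    """Ask the judge model for a score and a short rationale."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, answer=answer),
            }
        ],
        temperature=0,  # keep the judgment as deterministic as the API allows
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(completion.choices[0].message.content)

if __name__ == "__main__":
    verdict = judge(
        question="What is the capital of Australia?",
        answer="The capital of Australia is Sydney.",
    )
    print(verdict)  # e.g. {"score": 1, "rationale": "..."}
```

In practice, the listed projects differ mainly in what surrounds this loop: reference-free vs. reference-based rubrics, pairwise vs. pointwise scoring, fine-tuned judge models (e.g. Prometheus, Themis, xFinder) vs. general-purpose APIs, and how judgments are aggregated and audited for bias.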