The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
Evaluate your LLM's responses with Prometheus and GPT-4 💯
⚖️ The First Coding Agent-as-a-Judge
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
CodeUltraFeedback: aligning large language models to coding preferences
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
Solving Inequality Proofs with Large Language Models.
A set of tools to create synthetically generated data from documents
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
Harnessing Large Language Models for Curated Code Reviews
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
The official repository for our ACL 2024 paper: Are LLM-based Evaluators Confusing NLG Quality Criteria?
MCP for Root Signals Evaluation Platform
LLM-as-a-judge for Extractive QA datasets
Explore techniques to use small models as jailbreaking judges
Controversial Questions for Argumentation and Retrieval
Use Groq for evaluations
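Most of the projects above implement some variant of the same core pattern: prompt a strong "judge" model with an evaluation rubric and ask it to score another model's output. The sketch below is a minimal, generic illustration of that pattern, assuming the OpenAI Python client; the rubric wording, model name, and `judge` helper are illustrative and are not taken from any repository listed here.

```python
# Minimal LLM-as-a-judge sketch: a strong model scores another model's answer
# against a simple 1-5 rubric. The rubric, prompt wording, and model name are
# assumptions for illustration, not the API of any project listed above.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the ASSISTANT ANSWER to the
USER QUESTION on a 1-5 scale (5 = fully correct, relevant, and well-grounded).
Reply with JSON: {{"score": <1-5>, "rationale": "<one sentence>"}}.

USER QUESTION:
{question}

ASSISTANT ANSWER:
{answer}
"""

def judge(question: str, answer: str, model: str = "gpt-4o") -> dict:
    """Ask the judge model for a score and a short rationale."""
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, answer=answer),
            }
        ],
        temperature=0,  # keep the judgment as deterministic as the API allows
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(completion.choices[0].message.content)

if __name__ == "__main__":
    verdict = judge(
        question="What is the capital of Australia?",
        answer="The capital of Australia is Sydney.",
    )
    print(verdict)  # e.g. {"score": 1, "rationale": "..."}
```

In practice, the listed projects differ mainly in what surrounds this loop: reference-free vs. reference-based rubrics, pairwise vs. pointwise scoring, fine-tuned judge models (e.g. Prometheus, Themis, xFinder) vs. general-purpose APIs, and how judgments are aggregated and audited for bias.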