llms-benchmarking

Here are 44 public repositories matching this topic...

bboylyg / BackdoorLLM

[NeurIPS 2025] BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models

attack backdoor defense llms llms-benchmarking

Updated Sep 20, 2025
Python

lechmazur / nyt-connections

Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words

testing benchmark evaluation puzzles reasoning claude llm gpt-5 llms-benchmarking gemini-pro grok4

Updated Sep 20, 2025
Python

lerogo / MMGenBench

Official repository of MMGenBench

mllm llms-benchmarking mmgenbench

Updated Mar 8, 2025
Python

lamalab-org / chembench

How good are LLMs at chemistry?

benchmark machine-learning chemistry safety materials-science llm llms llms-benchmarking

Updated Sep 11, 2025
Python

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Feb 13, 2025
Python

FSoft-AI4Code / XMainframe

Language Model for Mainframe Modernization

migration cobol mainframe code-summarization codellm llms-benchmarking

Updated Aug 23, 2024
Python

multinear / multinear

Develop reliable AI apps

reliability evaluation llm llms llm-eval llm-evaluation llms-benchmarking llm-evaluation-framework

Updated Sep 2, 2025
Python

rajpurkarlab / craft-md

conversational-ai llms-benchmarking clinical-llm multiturn-conversations

Updated Mar 14, 2025
Python

amazon-science / llm-code-preference

Training and Benchmarking LLMs for Code Preference.

code-generation llm-training llm-evaluation llms-benchmarking

Updated Nov 15, 2024
Python

epfl-dlab / cc_flows

The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".

ai competitive-programming agents competitive-programming-contests competitive-coding llms llms-reasoning llms-benchmarking aiflows

Updated Feb 12, 2024
Python

declare-lab / resta

Restore safety in fine-tuned language models through task arithmetic

alignment safety alignment-algorithm llm llms llm-safety llms-benchmarking llm-safety-benchmark

Updated Mar 28, 2024
Python

Laoyu84 / 4onebench

A minimalist benchmarking tool designed to test the routine-generation capabilities of LLMs.

agents large-language-models llms-benchmarking

Updated Nov 28, 2024
Python

RUC-GSAI / YuLan-SwarmIntell

🐝 SwarmBench: Benchmarking LLMs' Swarm Intelligence

benchmark swarm swarm-intelligence kilobots swarm-robotics llms-benchmarking rlvr

Updated May 21, 2025
Python

gautierdag / plancraft

Plancraft is a Minecraft environment and agent suite for testing planning capabilities in LLMs

minecraft planning interactive-environments llm multimodal-large-language-models llms-benchmarking agentic-ai

Updated Jul 9, 2025
Python

Paulescu / text-embedding-evaluation

Join 15k builders to the Real-World ML Newsletter ⬇️⬇️⬇️

machine-learning embeddings llms llms-benchmarking

Updated Apr 19, 2024
Python

Kartik-3004 / facexbench

FaceXBench: Evaluating Multimodal LLMs on Face Understanding

attributes face face-recognition gender-classification facial-expression-recognition race-detection age-estimation crowd-counting face-antispoofing multimodal face-segmentation face-perception deepfake-detection headpose-estimation llms llms-benchmarking multimodal-llms

Updated Feb 4, 2025
Python

SergioV3005 / llm-belief-bias

Belief-Bias evaluation of local LLMs

bias-detection llms-reasoning llms-benchmarking

Updated Jul 3, 2025
Python

nachoDRT / MERIT-Dataset

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area the authors are actively working on. This repository is actively maintained, and new features are continuously being added.

donut biases synthetic-dataset-generation layoutlm synthetic-dataset layoutxlm token-classification layoutlmv3 layoutlmv2 llms-benchmarking idefics2

Updated Jul 16, 2025
Python

microsoft / private-benchmarking

A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.

platform benchmarking inference secure private mpc contamination trusted-execution-environment confidential-computing large-language-models llms-benchmarking private-benchmarking ezpc

Updated Sep 16, 2024
Python

cosmaadrian / romath

Official repository for "RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴"

mathematics dataset romanian llms-benchmarking

Updated Feb 14, 2025
Python
