[NeurIPS 2025] BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models
Updated Sep 20, 2025 - Python
Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words
How good are LLMs at chemistry?
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
Language Model for Mainframe Modernization
Develop reliable AI apps
Training and Benchmarking LLMs for Code Preference.
The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".
Restore safety in fine-tuned language models through task arithmetic
A minimalist benchmarking tool designed to test the routine-generation capabilities of LLMs.
🐝 SwarmBench: Benchmarking LLMs' Swarm Intelligence
Plancraft is a Minecraft environment and agent suite for testing the planning capabilities of LLMs
Join 15k builders on the Real-World ML Newsletter
FaceXBench: Evaluating Multimodal LLMs on Face Understanding
Belief-Bias evaluation of local LLMs
The MERIT Dataset is a fully synthetic, labeled dataset for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area of active work. The repository is actively maintained, and new features are added continuously.
A platform for private benchmarking of machine learning models. It supports evaluating models under different levels of trust between model owners and dataset owners.
Official repository for "RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴"