🐢 Open-Source Evaluation & Testing library for LLM Agents
Deliver safe & effective language models
MIT-licensed framework for testing LLMs, RAG pipelines, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.
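As a rough illustration of the YAML-driven pattern (not this framework's actual configuration schema), a test suite can be described declaratively and executed in CI; the config keys and the `fake_llm` stand-in below are hypothetical:

```python
# Hypothetical sketch only: the YAML schema and check logic are illustrative,
# not this framework's real configuration format.
import yaml

CONFIG = """
suite: smoke-tests
cases:
  - prompt: "What is the capital of France?"
    must_contain: "Paris"
  - prompt: "Summarize: The quick brown fox jumps over the lazy dog."
    max_words: 20
"""

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "Paris is the capital of France."

def run_suite(config_text: str) -> list[tuple[str, bool]]:
    config = yaml.safe_load(config_text)
    results = []
    for case in config["cases"]:
        answer = fake_llm(case["prompt"])
        ok = True
        if "must_contain" in case:
            ok = case["must_contain"] in answer
        if "max_words" in case:
            ok = ok and len(answer.split()) <= case["max_words"]
        results.append((case["prompt"], ok))
    return results

if __name__ == "__main__":
    for prompt, passed in run_suite(CONFIG):
        print(f"{'PASS' if passed else 'FAIL'}: {prompt}")
```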
A Python library for verifying code properties using natural language assertions.
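A minimal sketch of the general idea, using the OpenAI client as the judge; the `check_property` helper and prompt format are illustrative assumptions, not this library's API:

```python
# Illustrative sketch: an LLM judging whether code satisfies a
# natural-language property. Not the library's actual interface.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def check_property(code: str, assertion: str, model: str = "gpt-4o-mini") -> bool:
    """Ask the model to answer YES/NO on whether `code` satisfies `assertion`."""
    prompt = (
        "Does the following Python code satisfy this property?\n"
        f"Property: {assertion}\n\nCode:\n{code}\n\n"
        "Answer with exactly YES or NO."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content or ""
    return answer.strip().upper().startswith("YES")

snippet = "def add(a, b):\n    return a + b\n"
print(check_property(snippet, "The function returns the sum of its two arguments."))
```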
Open-source framework for stress-testing LLMs and conversational AI. Identify hallucinations, policy violations, and edge cases with scalable, realistic simulations. Join the Discord: https://discord.gg/ssd4S37WNW
Prompture is an API-first library for requesting structured output (JSON or any other structure) from LLMs, validating it against a schema, and running comparative tests between models.
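The underlying pattern might look like the following sketch, which uses the OpenAI client and `jsonschema` directly rather than Prompture's own API; the schema, model names, and prompt are placeholders:

```python
# Generic sketch of the pattern (structured JSON from an LLM, validated
# against a schema, compared across models); not Prompture's actual API.
import json
from jsonschema import ValidationError, validate
from openai import OpenAI

client = OpenAI()

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year_founded": {"type": "integer"},
    },
    "required": ["name", "year_founded"],
}

def extract(model: str, text: str) -> dict:
    """Request JSON output from the model and parse it."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Return JSON with keys 'name' and 'year_founded' for: {text}",
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Comparative test: run the same extraction against several models.
for model in ["gpt-4o-mini", "gpt-4o"]:
    data = extract(model, "Mozilla was founded in 1998.")
    try:
        validate(instance=data, schema=SCHEMA)
        print(model, "valid:", data)
    except ValidationError as err:
        print(model, "schema violation:", err.message)
```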
Integration of OpenAI with Pytest to automate API test generation.
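A hedged sketch of one way such an integration could work, asking the model to draft pytest cases from an endpoint description; the `generate_tests` helper and output file name are hypothetical, not the repository's actual code:

```python
# Minimal sketch: have an LLM draft pytest cases from an API description.
from openai import OpenAI

client = OpenAI()

def generate_tests(endpoint_description: str, model: str = "gpt-4o-mini") -> str:
    """Return pytest source code drafted by the model for the described endpoint."""
    prompt = (
        "Write pytest test functions (using the `requests` library) for this API "
        f"endpoint. Output only Python code.\n\n{endpoint_description}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content or ""

if __name__ == "__main__":
    description = "GET /users/{id} returns 200 with JSON {'id': int, 'name': str}, or 404 if missing."
    with open("test_users_generated.py", "w") as fh:
        fh.write(generate_tests(description))
```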
🚀 ARM64 browser automation for Claude Code: SaaS testing on an 80 Raspberry Pi budget. Works on ARM64 where Playwright/Puppeteer fail; autonomous testing without human debugging.
An automated approach for exploring and testing conversational agents using large language models. TRACER discovers chatbot functionalities, generates user profiles, and creates comprehensive test suites for conversational AI systems.
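A rough sketch of the simulated-user exploration pattern such a tool relies on; the persona, the `chatbot_under_test` stub, and the turn loop below are illustrative assumptions, not TRACER's implementation:

```python
# Rough sketch: an LLM acting as a simulated user to probe a chatbot.
from openai import OpenAI

client = OpenAI()

def simulated_user_turn(history: list[dict], persona: str) -> str:
    """Generate the next probing question for a given user persona."""
    messages = [{
        "role": "system",
        "content": f"You are a {persona} exploring a customer-support chatbot. "
                   "Ask one question per turn to discover what it can do.",
    }] + history
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content

def chatbot_under_test(user_message: str) -> str:
    """Stand-in for the conversational system being explored."""
    return "I can help you track orders and process refunds."

transcript: list[dict] = []
for _ in range(3):  # a few exploration turns
    question = simulated_user_turn(transcript, persona="impatient first-time customer")
    answer = chatbot_under_test(question)
    # From the simulated user's perspective, its own questions are "assistant"
    # messages and the chatbot's replies are "user" inputs.
    transcript += [{"role": "assistant", "content": question},
                   {"role": "user", "content": answer}]

print(transcript)  # transcripts like this can be mined into regression test cases
```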
Open-source tools, SDKs, and resources for AetherLab AI quality control platform
A lightweight dashboard to view and analyze test automation results. Built with Streamlit + PostgreSQL, and powered by AI (Gemini) to help debug test failures faster.
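A minimal sketch of such a dashboard, assuming a hypothetical `test_results` table and connection string; it is not the project's actual code:

```python
# Minimal Streamlit + PostgreSQL sketch; table name, columns, and credentials
# are placeholders.
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/testdb")

st.title("Test Automation Results")

# Load recent runs from PostgreSQL into a DataFrame.
results = pd.read_sql(
    "SELECT suite, test_name, status, duration_s, run_at "
    "FROM test_results ORDER BY run_at DESC LIMIT 500",
    engine,
)

# Summary metric plus a filterable table of failures for the chosen suite.
st.metric("Pass rate", f"{(results['status'] == 'passed').mean():.0%}")
suite = st.selectbox("Suite", sorted(results["suite"].unique()))
st.dataframe(results[(results["suite"] == suite) & (results["status"] == "failed")])
```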
Agentic Workflow Evaluation: Text Summarization Agent. An AI agent evaluation workflow built around a text summarization model using the OpenAI API and the Transformers library. It follows an iterative approach: generate summaries, analyze metrics, adjust parameters, and retest to refine the agent for accuracy, readability, and performance.
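The iterative loop could be sketched as follows, using a Hugging Face summarization pipeline and a simple compression-ratio metric in place of the project's real metrics and parameters:

```python
# Illustrative sketch of the generate / measure / adjust / retest loop.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

ARTICLE = (
    "Large language models are increasingly used in production systems, "
    "which makes systematic evaluation of their outputs essential. "
    "Summarization agents in particular must balance brevity, faithfulness, "
    "and readability across many document types."
)

target_ratio = 0.4   # desired summary length relative to the source
max_len = 60         # generation parameter adjusted between iterations

for iteration in range(3):
    summary = summarizer(
        ARTICLE, max_length=max_len, min_length=10, do_sample=False
    )[0]["summary_text"]
    ratio = len(summary.split()) / len(ARTICLE.split())
    print(f"iter {iteration}: ratio={ratio:.2f} summary={summary!r}")
    if ratio > target_ratio:
        max_len = max(15, int(max_len * 0.7))  # too long: tighten the budget and retest
    else:
        break
```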
This repository contains a study comparing the web search capabilities of four AI assistants: Gemini 2.0 Flash, ChatGPT-4 Turbo, DeepSeek-R1, and Grok 3.
Modular and extensible Python framework for applying synthetic inference and controlled perturbations to AI model inputs, labels, and hyperparameters. Evaluate robustness, sensitivity, and stability of algorithms under realistic variations and adverse scenarios.
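A generic sketch of the perturbation idea, using Gaussian input noise against a scikit-learn classifier; the framework's own API and perturbation types will differ:

```python
# Generic robustness sketch: measure how predictions change under controlled
# input perturbations. Model, data, and noise scales are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

def perturb(X: np.ndarray, noise_scale: float) -> np.ndarray:
    """Add Gaussian noise scaled to each feature's standard deviation."""
    return X + rng.normal(0.0, noise_scale, X.shape) * X.std(axis=0)

baseline = model.predict(X)
for scale in [0.01, 0.1, 0.5]:
    flipped = (model.predict(perturb(X, scale)) != baseline).mean()
    print(f"noise scale {scale}: {flipped:.1%} of predictions changed")
```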