A benchmark that challenges language models to code solutions for scientific problems
Updated Sep 29, 2025 · Python
AGI-Elo: How Far Are We From Mastering A Task?
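The name suggests Elo-style ratings applied to models and tasks: each model-task attempt is scored like a chess match, so a task's rating rises when it defeats strong models. A minimal sketch of the standard Elo update (the K-factor, starting ratings, and the model-vs-task framing here are illustrative assumptions, not the project's exact formulation):

```python
def elo_update(r_model: float, r_task: float, model_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update for one model-vs-task 'match'.

    model_won=True means the model solved the task. A task that
    defeats highly rated models gains rating, i.e. it is 'harder'.
    """
    # Model's expected score against this task.
    expected = 1.0 / (1.0 + 10 ** ((r_task - r_model) / 400.0))
    score = 1.0 if model_won else 0.0
    new_model = r_model + k * (score - expected)
    # The task gets the complementary score and expectation.
    new_task = r_task + k * ((1.0 - score) - (1.0 - expected))
    return new_model, new_task

# Example: a 1500-rated model fails a 1500-rated task;
# the model's rating drops and the task's rises symmetrically.
print(elo_update(1500.0, 1500.0, model_won=False))
```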
A production-grade benchmarking suite that evaluates vector databases (Qdrant, Milvus, Weaviate, ChromaDB, Pinecone, SQLite, TopK) for music semantic search applications. Features automated performance testing, statistical analysis across 15-20 iterations, a real-time web UI for database comparison, and comprehensive reporting for production use.
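The iteration-based measurement described above typically reduces to timing each engine's query path repeatedly and aggregating the samples. A minimal, database-agnostic sketch of that loop (the `query_fn` callables, the 20-iteration default, and the engine names in the usage comment are assumptions, not this repo's actual API):

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark_query(query_fn: Callable[[], object],
                    iterations: int = 20) -> Dict[str, float]:
    """Time a single search callable over repeated iterations.

    Returns latency statistics in milliseconds, mirroring the
    suite's 15-20 iteration analysis.
    """
    latencies: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        query_fn()  # the engine-specific search call under test
        latencies.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(latencies),
        "stdev_ms": statistics.stdev(latencies),
        "min_ms": min(latencies),
        "max_ms": max(latencies),
    }

# Hypothetical usage: run the same music-search query against two engines.
# results = {
#     "qdrant": benchmark_query(lambda: qdrant_search("moody jazz piano")),
#     "chroma": benchmark_query(lambda: chroma_search("moody jazz piano")),
# }
```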
LlamaEval is a rapid prototype developed during a hackathon to provide a user-friendly dashboard for evaluating and comparing Llama models using the TogetherAI API.
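Comparing models through the TogetherAI API amounts to sending the same prompt to each model ID and collecting the replies for side-by-side display. A minimal sketch assuming the `together` Python SDK's OpenAI-style chat interface; the model IDs and prompt are placeholders, not LlamaEval's actual configuration:

```python
from together import Together  # pip install together; reads TOGETHER_API_KEY from the env

client = Together()

MODELS = [  # placeholder Llama model IDs on TogetherAI
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
]


def compare(prompt: str) -> dict[str, str]:
    """Send one prompt to each model and return {model_id: reply} for the dashboard."""
    replies = {}
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        replies[model] = resp.choices[0].message.content
    return replies


print(compare("Explain Elo ratings in one sentence."))
```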