An MNIST-like fashion product database. Benchmark 👇
OpenMMLab Pose Estimation Toolbox and Benchmark.
Benchmarks of approximate nearest neighbor libraries in Python
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
A series of large language models developed by Baichuan Intelligent Technology
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus, and leaderboard
Python package for the evaluation of odometry and SLAM
A 13B large language model developed by Baichuan Intelligent Technology
A unified evaluation framework for large language models
[ICLR 2024] SWE-bench: Can Language Models Resolve Real-world Github Issues?
A heterogeneous benchmark for information retrieval. Easy to use; evaluate your models across 15+ diverse IR datasets.
A machine learning toolkit for log parsing [ICSE'19, DSN'16]
Reference implementations of MLPerf™ training benchmarks
⚡FlashRAG: A Python Toolkit for Efficient RAG Research
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Efficient Retrieval Augmentation and Generation Framework
📊 Benchmark multiple object trackers (MOT) in Python
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024