N-Compariw: End-to-End Workflow for Neural Networks Comparison
Updated Oct 2, 2021
A hybrid search engine based on the BM25 and VSM retrieval models (a minimal hybrid-scoring sketch appears after this list).
Large language model evaluation framework for logic and open-ended Q&A with a variety of RAG and other contextual information sources.
A simple template module for safely evaluating user-supplied or runtime-unknown value expressions using Python's 'eval' (see the restricted-eval sketch after this list).
A program to automate testing open source LLMs for their political compass scores
Frontier papers on evaluation methodologies for language models.
This is the accompanying repo of the NeurIPS '24 D&B Spotlight paper, PertEval, including code, data, and main results.
An experimental information retrieval framework and a workbench for innovation in entity-oriented search.
Web interface for evaluating the different GDSC entries.
Homebrew tap for vivaria, METR's AI evaluation tool
Evaluate open-source language models on agent, formatted-output, instruction-following, long-text, multilingual, coding, and custom-task capabilities.
ETUDE (Evaluation Tool for Unstructured Data and Extractions) is a Python-based tool that provides consistent evaluation options across a range of annotation schemata and corpus formats
LLM evaluation framework
Official implementation of the ACL 2024 paper "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs" (https://arxiv.org/abs/2402.11199).
A tool to perform functional testing and performance testing of the Dhruva Platform
MODELAR: MODular and EvaLuative framework to improve surgical Augmented Reality visualization
A Visual Dashboard for Fundamental Benchmarking of LLMs
The AndroTest24 Study is the first comprehensive statistical study of existing Android GUI testing metrics. This repository provides the corresponding ① AndroTest24 App Benchmark, ② Study Data, and ③ SATE (Statistical Android Testing Evaluation) Framework.
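For the hybrid BM25/VSM search engine listed above, the sketch below shows one way such a blend can be scored. The toy corpus, the `alpha` blend weight, and all function names are illustrative assumptions, not that repository's API.

```python
# Minimal sketch of hybrid BM25 + VSM (term-frequency cosine) ranking over an
# in-memory corpus of tokenized documents. Parameters and weights are illustrative.
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the classic BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def vsm_score(query, doc):
    """Cosine similarity between raw term-frequency vectors (a simple VSM)."""
    q, d = Counter(query), Counter(doc)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def hybrid_score(query, doc, corpus, alpha=0.7):
    """Blend the two models with a tunable weight alpha (hypothetical choice)."""
    return alpha * bm25_score(query, doc, corpus) + (1 - alpha) * vsm_score(query, doc)

corpus = [["neural", "network", "evaluation"], ["retrieval", "models", "bm25"], ["vector", "space", "model"]]
query = ["bm25", "retrieval"]
ranked = sorted(corpus, key=lambda d: hybrid_score(query, d, corpus), reverse=True)
print(ranked[0])  # the BM25/VSM document ranks first
```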
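For the 'eval'-based template module listed above, the sketch below shows a common pattern for restricting Python's eval to a whitelist of names. SAFE_NAMES, safe_eval, and the whitelist contents are hypothetical, and a restricted eval of this kind should not be treated as a complete sandbox.

```python
# Minimal sketch of evaluating a runtime-supplied expression with a restricted
# environment; the whitelist below is illustrative, not the module's actual API.
import math

SAFE_NAMES = {"abs": abs, "min": min, "max": max, "round": round, "sqrt": math.sqrt}

def safe_eval(expression, variables=None):
    """Evaluate an expression string with builtins disabled and only
    whitelisted names plus caller-supplied variables visible."""
    env = {"__builtins__": {}}        # block access to builtins
    env.update(SAFE_NAMES)            # expose a small whitelist of functions
    env.update(variables or {})       # bind user/runtime-unknown values
    return eval(expression, env, {})  # deliberate, restricted eval

print(safe_eval("sqrt(max(a, b)) + 1", {"a": 4, "b": 9}))  # -> 4.0
```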