
MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens

Introduction

Welcome to MedOdyssey, a medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: a medical-context "Needles in a Haystack" (NIAH) task and a series of tasks specific to medical applications, together comprising 10 datasets.

[Figure: the architecture of MedOdyssey]
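
To make the NIAH setup concrete, below is a minimal sketch of how a "needle" fact can be planted in a long context at a controlled depth. The needle text, the haystack, and the five-depth grid are illustrative assumptions (consistent with the 20×7×5 example count in the table below), not MedOdyssey's actual data or construction code.

```python
def build_niah_prompt(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` into `haystack` at a relative depth
    (0.0 = very beginning, 1.0 = very end)."""
    cut = int(len(haystack) * depth)
    return haystack[:cut] + "\n" + needle + "\n" + haystack[cut:]

# Hypothetical example; not taken from the benchmark.
needle = "The fictitious drug Medorex is dosed at 42 mg twice daily."
question = "What is the recommended dosage of Medorex?"
haystack = "..."  # a long medical text trimmed to the target length level

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):  # assumed depth grid
    prompt = build_niah_prompt(haystack, needle, depth)
    # `prompt` plus `question` is then sent to the model under evaluation.
```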

Dataset Statistics

| Task | Annotation | # Examples | Avg. Len | MIC | NFI | CIR | Eval Metrics |
| --- | --- | --- | --- | --- | --- | --- | --- |
| En.NIAH | Auto & Human | 20×7×5 | 179.2k / 32 | | | | Acc. |
| Zh.NIAH | Auto & Human | 20×7×5 | 45.6k / 10.2 | | | | Acc. |
| En.Counting | Auto | 4×7 | 179.0k / 13.6 | | | | Acc. |
| Zh.Counting | Auto | 4×7 | 45.6k / 12.3 | | | | Acc. |
| En.KG | Auto & Human | 100 | 186.4k / 68.8 | | | | P., R., F1 |
| Zh.KG | Auto & Human | 100 | 42.5k / 2.0 | | | | P., R., F1 |
| En.Term | Auto | 100 | 183.1k / 11.7 | | | | Acc. |
| Zh.Term | Auto | 100 | 32.6k / 7.0 | | | | Acc. |
| Zh.Case | Auto & Human | 100 | 47.7k / 1.3 | | | | Acc. |
| Zh.Table | Auto & Human | 100 | 53.6k / 1.4 | | | | P., R., F1 |

In the table above, "MIC" stands for Maximum Identical Context, "NFI" for Novel Facts Injection, and "CIR" for Counter-intuitive Reasoning.
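
For the tasks scored with P., R., and F1 (the KG and Table tasks), the sketch below shows one common set-based way to compute those metrics over extracted items. It assumes gold and predicted items can be normalized into comparable strings; MedOdyssey's actual matching rules may differ.

```python
def prf1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 over extracted items
    (e.g., knowledge-graph triples), assuming exact-match comparison."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: two of three predicted triples are correct.
gold = {"(aspirin, treats, headache)", "(ibuprofen, treats, fever)"}
pred = {"(aspirin, treats, headache)", "(ibuprofen, treats, fever)",
        "(aspirin, treats, fever)"}
print(prf1(pred, gold))  # (0.667, 1.0, 0.8)
```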

Baselines

We surveyed current state-of-the-art long-context LLMs and report the performance of two kinds of baselines on MedOdyssey. For closed-source commercial LLMs, we call the official APIs to obtain responses for each task; open-source models we deploy for inference ourselves (a sketch of both paths follows the list). The LLMs and versions we selected are as follows:

  • GPT-4: Released in March 2023, GPT-4 is a state-of-the-art language model developed by OpenAI. It originally supported a context window of 8,192 tokens, extended to 128k with the November 2023 GPT-4 Turbo update. (gpt-4-turbo-2024-04-09)

  • GPT-4o: An optimized variant of GPT-4 introduced in May 2024, GPT-4o has a 128k context window and a knowledge cut-off date of October 2023. (gpt-4o-2024-05-13)

  • Claude 3: Launched by Anthropic in March 2024, the family includes three models in ascending order of capability, Haiku, Sonnet, and Opus, letting users choose a trade-off between capability and cost. All three offered a 200k context window at launch. (claude-3-haiku-20240307 and claude-3-sonnet-20240229)

  • Moonshot-v1: Released in 2023 by Moonshot AI, it emphasizes scalability and supports a 128k-token context window for working with very long texts. (moonshot-v1-128k)

  • ChatGLM3-6b-128k: Developed by ZHIPU·AI in 2024, it builds on ChatGLM3-6B and extends its long-context handling to 128k tokens.

  • InternLM2: An open-source LLM introduced in 2024 by Shanghai AI Lab, available in 7b and 20b sizes. It was initially trained on 4k-token contexts, extended to 32k tokens in later pre-training and fine-tuning stages, and officially supports 200k-token inference.

  • Yi-6b-200k: The Yi series models are next-generation open-source large language models trained from scratch by 01.AI. The 6B version, open-sourced in November 2023, supports a 200k-token context window.

  • Yarn-Mistral-7b-128k: Developed by NousResearch and released in November 2023, it is Mistral-7B-v0.1 further pre-trained on long-context data for 1,500 steps using the YaRN context-extension method, and supports a 128k-token context window.
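
As noted above, we query closed-source baselines through their official APIs and serve open-source baselines locally. The sketch below shows both paths, assuming the `openai` and `transformers` packages and the Hugging Face model id `01-ai/Yi-6B-200K`; `long_context` and `question` are placeholders, and MedOdyssey's actual prompts and decoding settings may differ.

```python
long_context = "..."  # a 4K-200K-token medical context (placeholder)
question = "..."      # the task question (placeholder)
prompt = long_context + "\n\n" + question

# 1) Closed-source baseline via the official API (OpenAI shown here).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # deterministic decoding for benchmarking
)
print(resp.choices[0].message.content)

# 2) Open-source baseline deployed locally (Yi-6B-200K shown here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-6B-200K")
model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-6B-200K", torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```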

Overall Evaluation Results

Main Results of Needles in a Haystack

Notes: the default evaluation uses the exact string-matching strategy; "SSM" denotes the subset string-matching strategy.
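
A minimal sketch of the two matching strategies as described in the note above: exact matching requires the response to equal the gold answer, while subset string matching (SSM) accepts any response that contains the gold answer. The whitespace/case normalization shown is an assumption.

```python
def exact_match(response: str, answer: str) -> bool:
    """Default strategy: the response must equal the gold answer
    (after trivial whitespace/case normalization, assumed here)."""
    return response.strip().lower() == answer.strip().lower()

def subset_match(response: str, answer: str) -> bool:
    """SSM: it suffices that the gold answer appears somewhere in the response."""
    return answer.strip().lower() in response.strip().lower()

print(exact_match("42 mg twice daily", "42 mg twice daily"))                  # True
print(exact_match("The dosage is 42 mg twice daily.", "42 mg twice daily"))   # False
print(subset_match("The dosage is 42 mg twice daily.", "42 mg twice daily"))  # True
```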

Citation

Thank you for your interest! If you use this project, please give us a 🌟 and cite the following paper:

@misc{2406.15019,
  author = {Yongqi Fan and Hongli Sun and Kui Xue and Xiaofan Zhang and Shaoting Zhang and Tong Ruan},
  title  = {MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens},
  year   = {2024},
  eprint = {arXiv:2406.15019},
}
