
InternAgent: When Agent Becomes the Scientist – Building Closed-Loop System from Hypothesis to Verification

[ Paper 📓 ] [ Apply Page 💡 ] [ Website 🏠 ]

From One Idea to Autonomous Experimentation

🔥 News

  • 2025.09.29:   🔥 Our deep research agent, InternAgent-DR, demonstrates strong competitiveness across the GAIA, HLE, GPQA, and TRQA benchmarks, achieving state-of-the-art results on multiple tasks.

  • 2025.09.12:   🔥 Our latest coding agent, InternAgent-MLE, achieved a 36.44% medal rate on MLE-bench, ranking #1 among all competing methods; see details at openai/mle-bench.

  • 2025.08.06:   🔥 InternAgent now supports Intern-S1 which combines strong general-task capabilities with state-of-the-art performance on a wide range of scientific tasks. Check here for how to use Intern-S1.

  • 2025.07.17:   🔥 The source code of InternAgent has been partially open-sourced; the complete version (covering 12 types of autonomous scientific research tasks) will be open-sourced soon. This repository supports full-cycle autonomous scientific research, from hypothesis generation to automated experimental execution, and includes the source code of our initial version, covering paper retrieval, idea generation, coding, and experimental execution.

  • 2025.07.10:   NovelSeek has been renamed InternAgent. The new name reflects our vision for an autonomous scientific research framework, and we hope it will empower all researchers to achieve great scientific discoveries.

📖 Overview

InternAgent

InternAgent supports 12 types of scientific research tasks spanning AI and scientific domains, including reaction yield prediction, molecular dynamics, power flow estimation, time series forecasting, transcription prediction, enhancer activity prediction, sentiment classification, 2D image classification, 3D point classification, 2D semantic segmentation, 3D autonomous driving, and large vision-language model fine-tuning.

🌟 Core Features

Framework

InternAgent covers three main capabilities: (1) Self-evolving idea generation with human-interactive feedback, (2) Idea-to-methodology construction, and (3) Evolutionary experimental planning and execution.

It is a unified, closed-loop multi-agent system designed to automate and accelerate innovative research across scientific domains. Through intelligent agent collaboration, our system enables end-to-end automation from idea generation and methodology construction to experimental execution, dramatically enhancing research efficiency and creativity.
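The closed loop described above can be sketched as a simple control flow: ideas feed methodology construction, methodology feeds experiments, and results feed back into the next round of ideation. This is a minimal illustrative sketch; all function and class names are hypothetical stand-ins, not the InternAgent API.

```python
from dataclasses import dataclass, field

# Stub stages standing in for the idea, methodology, and experiment agents.
def generate_idea(task, history):
    return f"idea for {task} (round {len(history) + 1})"

def build_method(idea):
    return f"methodology derived from: {idea}"

def run_experiment(methodology):
    return {"method": methodology, "score": 0.0}

@dataclass
class ResearchState:
    """Hypothetical state object carried around the closed loop."""
    idea: str = ""
    methodology: str = ""
    results: list = field(default_factory=list)

def closed_loop(task, rounds=3):
    """Idea -> methodology -> experiment, with results fed back each round."""
    state = ResearchState()
    for _ in range(rounds):
        state.idea = generate_idea(task, state.results)
        state.methodology = build_method(state.idea)
        state.results.append(run_experiment(state.methodology))
    return state

state = closed_loop("time series forecasting")
print(len(state.results))  # 3
```

In the real system each stub would be a multi-agent stage (LLM calls, code execution, evaluation); the point here is only the feedback wiring.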

💡 Self-Evolving Idea Generation with Human-Interactive Feedback

  • Autonomous generation, selection, and evolution of innovative research ideas through multi-agent collaboration
  • Supports interactive human feedback, enabling continuous refinement of ideas with expert insights
  • Dynamically integrates literature, code, and domain knowledge to inspire diverse innovation pathways

🏗️ Idea-to-Methodology Construction

  • Systematically transforms creative ideas into actionable and verifiable research methodologies
  • Integrates baseline code, literature, and expert knowledge to automatically generate comprehensive methodological frameworks
  • Supports iterative refinement and traceability of research methods

🛠️ Evolutionary Experimental Planning and Execution

  • Automates complex experimental workflow planning, code implementation, and debugging
  • Employs exception-guided intelligent debugging to automatically identify and resolve code issues
  • Enables adaptive evolution and continuous optimization of experimental plans
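The exception-guided debugging idea can be sketched as a retry loop: execute the generated experiment code, and on failure hand the traceback to a repair step that proposes a fix. Here `propose_fix` is a trivial stand-in for an LLM repair call; every name is a hypothetical illustration, not the repository's implementation.

```python
import traceback

def propose_fix(code, error_report):
    # Stand-in for an LLM call that rewrites code given the traceback.
    # Here it just patches the known bug in the demo snippet.
    return code.replace("1 / 0", "1 / 1")

def run_with_debugging(code, max_attempts=3):
    """Execute candidate code; on exception, repair from the traceback and retry."""
    for _ in range(max_attempts):
        try:
            namespace = {}
            exec(code, namespace)         # run the candidate experiment code
            return namespace.get("result")
        except Exception:
            code = propose_fix(code, traceback.format_exc())
    raise RuntimeError("could not repair code")

result = run_with_debugging("result = 1 / 0")
print(result)  # 1.0
```

The traceback string is the key signal: a real repair step would condition the rewrite on the exception type and failing line rather than a hard-coded substitution.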

🤖 Multi-Agent Orchestration

  • Coordinates specialized agents such as the Survey, Coding, Idea Innovation, and Assessment Agents
  • Manages data flow, task scheduling, and human interaction points for efficient and coherent research processes
  • Supports extensibility and compatibility with diverse scientific tasks
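A minimal orchestration sketch: a registry of specialized agents and a scheduler that routes the payload through each pipeline stage. The agent names follow the text above; the registry and dispatch mechanics are assumptions for illustration only.

```python
class Agent:
    """Toy agent: real agents would wrap LLM calls, tools, and state."""
    def __init__(self, name):
        self.name = name

    def handle(self, payload):
        return f"{self.name} processed: {payload}"

# Registry of specialized agents, keyed by pipeline stage.
REGISTRY = {stage: Agent(stage)
            for stage in ("survey", "idea_innovation", "coding", "assessment")}

def run_pipeline(task, stages=("survey", "idea_innovation", "coding", "assessment")):
    """Route the task through each stage; output of one stage feeds the next."""
    payload = task
    for stage in stages:
        payload = REGISTRY[stage].handle(payload)   # data flow between agents
    return payload

out = run_pipeline("suzuki yield prediction")
```

Extensibility here means registering a new `Agent` under a new stage key; human-interaction points would slot in as stages that block on user input.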

InternAgent delivers end-to-end algorithmic innovation, empowering AI+X researchers to rapidly complete the full research loop, from idea to methodology to experimental validation, accelerating scientific discovery and breakthroughs.

🔬 Supported Research Tasks

  • Suzuki Yield Prediction
  • Molecular Dynamics Simulation
  • Enhancer Activity Prediction
  • Transcription Prediction for Perturbation Response
  • Power Flow Estimation
  • Time Series Forecasting
  • Semantic Segmentation
  • Image Classification
  • Sentiment Analysis
  • Point Cloud Classification
  • Autonomous Driving
  • VLM & LLM Fine-tuning
  • ......

🎉 Benchmark Results

Results on 12 different types of research tasks

The results report both maximum performance and mean performance (i.e., the average across runs with performance gains) achieved by InternAgent and Dolphin. InternAgent consistently improves upon the baseline and outperforms Dolphin across all tasks, spanning AI and scientific task domains.

Max Performance (per task)

| Task | Metric | Baseline | Dolphin | InternAgent |
|---|---|---|---|---|
| AutoRYP | R^2 ↑ | 27.6 | 31.8 (+4.2) | 35.4 (+7.8) |
| AutoMD | Forces-MAE ↓ | 0.158 | 0.152 | 0.148 |
| AutoPower | RMSE ↓ | 0.00473 | 0.00455 | 0.00426 |
| AutoTSF | MAE ↓ | 0.4382 | 0.4627 | 0.4331 |
| AutoTPPR | MSE ↓ | 0.197 | 0.173 | 0.146 |
| AutoEAP | HK-PCC ↑ | 0.65 | 0.76 | 0.79 |
| AutoSenCls | Acc ↑ | 91.0 | 92.5 (+1.5) | 93.5 (+2.5) |
| Auto2DCls | Top-1 Acc ↑ | 81.2 | 82.0 (+0.8) | 83.3 (+2.1) |
| Auto3DCls | OA ↑ | 91.0 | 93.9 (+2.9) | 95.5 (+4.5) |
| Auto2DSeg | mIoU ↑ | 78.8 | - | 81.0 (+2.2) |
| AutoPCDet | mAP ↑ | 65.0 | - | 65.9 (+0.9) |
| AutoVLM | QA ↑ | 67.1 | - | 67.6 (+0.5) |

Average Performance (per task)

| Task | Metric | Baseline | Dolphin | InternAgent |
|---|---|---|---|---|
| AutoRYP | R^2 ↑ | 27.6 | 31.3 (+3.7) | 33.5 (+5.9) |
| AutoMD | Forces-MAE ↓ | 0.158 | 0.155 | 0.152 |
| AutoPower | RMSE ↓ | 0.00473 | 0.00459 | 0.00447 |
| AutoTSF | MAE ↓ | 0.4382 | - | 0.4346 |
| AutoTPPR | MSE ↓ | 0.197 | 0.179 | 0.170 |
| AutoEAP | HK-PCC ↑ | 0.65 | 0.73 | 0.77 |
| AutoSenCls | Acc ↑ | 91.0 | 91.8 (+0.8) | 92.5 (+1.5) |
| Auto2DCls | Top-1 Acc ↑ | 81.2 | 81.8 (+0.6) | 82.2 (+1.0) |
| Auto3DCls | OA ↑ | 91.0 | 92.0 (+1.0) | 93.4 (+2.4) |
| Auto2DSeg | mIoU ↑ | 78.8 | - | 80.1 (+1.3) |
| AutoPCDet | mAP ↑ | 65.0 | - | 65.7 (+0.7) |
| AutoVLM | QA ↑ | 67.1 | - | 67.6 (+0.5) |

👨‍💻 MLE-Bench: Record-Breaking Performance in Just 12 Hours!

InternAgent-MLE achieved a 36.44% medal rate on MLE-Bench, securing the #1 position among all competing methods; remarkably, this was accomplished in only 12 hours of running time!

| Agent | Model | Low (Lite) (%) | Medium (%) | High (%) | All (%) | Running Time (hours) | Date |
|---|---|---|---|---|---|---|---|
| InternAgent-MLE | deepseek-r1 | 62.12 ± 3.03 | 26.32 ± 2.63 | 24.44 ± 2.22 | 36.44 ± 1.18 | 12 | 2025-09-12 |
| Neo | multi-agent | 48.48 ± 1.52 | 29.82 ± 2.32 | 24.44 ± 2.22 | 34.22 ± 0.89 | 36 | 2025-07-28 |
| R&D-Agent | o3 + GPT-4.1 | 51.52 ± 6.9 | 19.3 ± 5.5 | 26.67 ± 0 | 30.22 ± 1.5 | 24 | 2025-08-15 |
| ML-Master | deepseek-r1 | 48.5 ± 1.5 | 20.2 ± 2.3 | 24.4 ± 2.2 | 29.3 ± 0.8 | 12 | 2025-06-17 |
| R&D-Agent | o1-preview | 48.18 ± 2.49 | 8.95 ± 2.36 | 18.67 ± 2.98 | 22.4 ± 1.1 | 24 | 2025-05-14 |
| AIDE | o1-preview | 34.3 ± 2.4 | 8.8 ± 1.1 | 10.0 ± 1.9 | 16.9 ± 1.1 | 24 | 2024-10-08 |
| AIDE | gpt-4o-2024-08-06 | 19.0 ± 1.3 | 3.2 ± 0.5 | 5.6 ± 1.0 | 8.6 ± 0.5 | 24 | 2024-10-08 |
| AIDE | claude-3-5-sonnet-20240620 | 19.4 ± 4.9 | 2.6 ± 1.5 | 2.3 ± 2.3 | 7.5 ± 1.8 | 24 | 2024-10-08 |
| OpenHands | gpt-4o-2024-08-06 | 11.5 ± 3.4 | 2.2 ± 1.3 | 1.9 ± 1.9 | 5.1 ± 1.3 | 24 | 2024-10-08 |
| AIDE | llama-3.1-405b-instruct | 8.3 ± 2.6 | 1.2 ± 0.8 | 0.0 ± 0.0 | 3.1 ± 0.9 | 24 | 2024-10-08 |
| MLAB | gpt-4o-2024-08-06 | 4.2 ± 1.5 | 0.0 ± 0.0 | 0.0 ± 0.0 | 1.3 ± 0.5 | 24 | 2024-10-08 |

🧪 GAIA, GPQA-diamond and HLE benchmarks

We benchmark InternAgent-DR on a series of benchmarks, including GAIA, HLE, and GPQA, achieving state-of-the-art results.

| Model/Framework | GAIA Avg | GPQA Avg | HLE (text only) | HLE (All) |
|---|---|---|---|---|
| Intern-S1 | 15.15 | 78.26 | 8.90 | 8.30 |
| Deepseek-R1 | 18.78 | 82.32 | 8.60 | - |
| o4-mini | 16.97 | 78.28 | 14.50 | 14.28 |
| GPT-5 | - | 85.35 | 25.85 | 24.76 |
| OpenAI DR | 67.36 | - | - | 26.60 |
| Manus | 73.30 | - | - | - |
| Gemini Deep Research | - | - | - | 26.90 |
| MiroFlow | 74.50 | - | 29.50 | 27.20 |
| OWL | 69.70 | - | - | - |
| X-Masters | - | - | 32.10 | 27.72 |
| InternAgent-DR (Qwen-235B) | 58.79 | 66.16 | 15.04 | 14.84 |
| InternAgent-DR (o4-mini) | 76.96 | 87.37 | 31.60 | 30.80 |

🚀 How to use the early version, Dolphin?

Installation

```shell
conda create -n dolphin python=3.11
conda activate dolphin

# Install PyPI requirements
pip install -r requirements.txt
```

Start Auto-Research using Dolphin

```shell
bash launch_dolphin.sh

# If round > 0, modify line 189 of launch_dolphin.py:
# exp_base_file_list = [List your exp dir]
```
  • Note that you need to add api_key and specify the model and topic in launch_dolphin.sh. Refer to the doc if you want to use a self-deployed model.
  • Data for the Point Classification, Image Classification, and Sentiment Classification tasks can be downloaded here.

Citation

```bibtex
@article{team2025novelseek,
  title={NovelSeek: When Agent Becomes the Scientist--Building Closed-Loop System from Hypothesis to Verification},
  author={Team, NovelSeek and Zhang, Bo and Feng, Shiyang and Yan, Xiangchao and Yuan, Jiakang and Yu, Zhiyin and He, Xiaohan and Huang, Songtao and Hou, Shaowei and Nie, Zheng and others},
  journal={arXiv preprint arXiv:2505.16938},
  year={2025}
}
```
