- Chinese Medical Large Language Model
- Large-scale Supervised Fine-Tuning: PULSE is fine-tuned on approximately 4,000,000 samples (about 9.6B tokens) from both the medical and general domains.
- Variety of Medical Tasks: PULSE supports a wide range of natural language processing tasks in the medical field, including medical question answering, report interpretation, structured data extraction, diagnosis and treatment planning support, and more.
- PULSE-7b (fine-tuned from bloomz-7b1-mt)
- PULSE-20b (fine-tuned from InternLM-20B)
- Quantized models will be released soon. For larger models, please contact us about collaboration.
- The models of this project are for research purposes in the medical domain only. If the models, or any modified versions thereof, generate inaccurate information or are used in a service that results in misleading or harmful statements causing adverse effects, the responsibility lies with the service provider and is not attributable to this project.
- We cannot guarantee the accuracy, completeness, or relevance of the generated information. We strongly recommend that users consult qualified healthcare professionals for personalized medical advice and treatment plans.
Model Name | AVG Rank | MedQA-USMLE | MedQA-Mainland | PromptCBLUE | WebMedQA | CheckupQA | MedicineQA | DialogSumm | MedTriage (F1) |
---|---|---|---|---|---|---|---|---|---|
GPT-4 | 1.25 | 1129 | 1117 | 1110 | 1116 | 1096 | 1098 | 1109 | 0.65 |
PULSE-Pro | 1.75 | 1089 | 1092 | 1088 | 1119 | 1105 | 1083 | 1096 | 0.63 |
ChatGPT | 4.00 | 1086 | 1057 | 1064 | 1053 | 1020 | 1029 | 1080 | 0.43 |
PULSE-20b | 4.12 | 1042 | 1024 | 1039 | 1059 | 1049 | 1069 | 1076 | 0.40 |
Baichuan2 | 4.50 | 1024 | 1041 | 1065 | 1044 | 1062 | 1035 | 1069 | 0.33 |
ChatGLM3 | 5.62 | 1038 | 1062 | 997 | 1012 | 1003 | 1024 | 1021 | 0.06 |
HuatuoGPT2 | 7.62 | 955 | 993 | 985 | 963 | 983 | 1003 | 980 | 0.01 |
QiZhenGPT | 8.38 | 955 | 959 | 945 | 989 | 1039 | 932 | 921 | 0.00 |
BenTsao | 8.75 | 961 | 921 | 936 | 910 | 927 | 986 | 920 | 0.02 |
BianQue2 | 10.12 | 913 | 928 | 919 | 988 | 974 | 900 | 908 | 0.00 |
MING | 10.75 | 902 | 909 | 924 | 867 | 862 | 960 | 918 | 0.01 |
DoctorGLM | 11.12 | 906 | 896 | 930 | 879 | 880 | 880 | 905 | 0.00 |
- To control evaluation costs, we primarily use GPT-4 as the judge. As noted in the QLoRA paper, model comparisons based solely on GPT-4 scores carry substantial randomness, which matches our observations. We therefore adopted the widely used Elo rating tournament method, as recommended by QLoRA.
Public Datasets [eval/data]
- MedQA_USMLE: 150 samples drawn from the USMLE/test subset of MedQA.
- MedQA_Mainland: 150 samples drawn from the Mainland/test subset of MedQA.
- PromptCBLUE: 150 samples drawn from the test subset of PromptCBLUE.
- webMedQA: 150 samples drawn from the test subset of webMedQA.
- CheckupQA: A numerical consultation dataset for physical-examination scenarios; evaluates the model's ability to understand and reason over medical values.
- MedicineQA: A medication consultation dataset with reference documents; evaluates the model in the RAG (retrieval-augmented generation) scenario.
- DialogSumm: Summarization of doctor-patient conversations; evaluates the model's long-text capabilities.
- MedTriage: Produces triage suggestions based on user information; evaluates the model's ability to select the correct department from given candidates.
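The RAG setting tested by MedicineQA can be illustrated with a toy prompt builder: pick the reference documents most relevant to the question and prepend them to the prompt. This is a minimal sketch using naive keyword-overlap retrieval; the function names are ours for illustration, and real pipelines would use a proper retriever (dense embeddings, BM25, etc.):

```python
def build_rag_prompt(question, documents, top_k=2):
    """Assemble a RAG-style prompt from the top_k most relevant documents.

    Relevance here is a naive word-overlap count between question and
    document, purely for illustration.
    """
    def score(doc):
        return len(set(question.split()) & set(doc.split()))

    retrieved = sorted(documents, key=score, reverse=True)[:top_k]
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(retrieved))
    return (
        f"Reference documents:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer based on the references above."
    )
```

The model's answer can then be graded on whether it stays grounded in the retrieved references.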
- GPT4: OpenAI API "gpt-4-1106-preview"
- ChatGPT: OpenAI API "gpt-3.5-turbo-1106"
- PULSE_pro: >100B
- PULSE_20b
- Baichuan2
- ChatGLM3
- HuatuoGPT2 (Official Website)
- QiZhenGPT (QiZhen-CaMA-13B-Checkpoint-12400)
- BenTsao (LoRA for Huozi 1.0)
- BianQue2 (BianQue-2.0)
- MING
- DoctorGLM (p-tuningv2)
- For cost considerations, we performed 360 rounds of random evaluation on each dataset. The order in which models face each other in pairwise (PK) matches was randomized to counteract order-related bias, with the random seed set to 42. The implementation of the Elo rating and the remaining hyperparameters follow Vicuna's Elo code. The Elo parameters used were K=4 and an initial rating of 1000.
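The Elo update applied after each pairwise match can be sketched as follows, using the parameters stated above (K=4, initial rating 1000). The function name is ours; see Vicuna's Elo code for the actual implementation:

```python
def elo_update(r_a, r_b, score_a, k=4):
    """One Elo update after a pairwise comparison.

    r_a, r_b: current ratings of models A and B (initially 1000).
    score_a:  1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    Returns the updated (r_a, r_b).
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new
```

With K=4, a single win between two equally rated models shifts each rating by only 2 points, which is why many randomized rounds are needed for the rankings to stabilize.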
- Please refer to PULSE-EVAL for detailed code, data and results.
- We also launched MedBench on OpenCompass, which provides more evaluation metrics and datasets for evaluating large language models in the medical field.
We provide PULSE-tuner, built on the LLaMA-Factory project, for fine-tuning PULSE models.
- For the newly released PULSE-20b, please check LMDeploy for a quantization solution.
- We also provide GPTQ-for-PULSE for the PULSE-7b model.
The table below lists the GPU memory required for local inference with PULSE at a batch size of 1.
Model Param | Quantization | GPU Memory |
---|---|---|
7B | FP16 | 14GB |
7B | INT4 | 6GB |
20B | FP16 | 40GB |
20B | INT4 | 12GB |
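As a rough sanity check on the table, weights-only memory is parameters × bits-per-parameter / 8. The FP16 rows match this estimate exactly; the INT4 rows are higher than the weights-only figure because runtime overhead (activations, KV cache, dequantization buffers) adds on top. A minimal sketch (the helper name is ours, not part of PULSE):

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Weights-only memory estimate in GB (treating 1 GB as 1e9 bytes).

    Actual deployment needs more than this: activations, the KV cache,
    and framework overhead are not included.
    """
    return params_billion * bits_per_param / 8
```

For example, a 7B model in FP16 (16 bits/param) needs about 14 GB for weights alone, matching the table above.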
- Download the contents of this repository to your local/remote server.
git clone https://github.com/openmedlab/PULSE
cd PULSE
- Create a conda environment and install dependencies.
conda env create -f llm.yml
conda activate llm
The versions of torch and transformers should be higher than the suggested versions.
Gradio
python web_demo_gradio.py
You can run cli_demo.py in the repository to start a simple command-line demo.
python cli_demo.py
Medical Question Answering
Medical Licensing Examination
Report Interpretation
Diagnosis and Treatment Planning Support
Declining to Answer Unrelated Questions
If you have other open-source projects that use or improve PULSE, you are welcome to submit a Pull Request to add them to the README, or to contact us via Issues.
An application that combines PULSE with an X-ray visual encoder, achieving multi-modal conversational capabilities.
A model fine-tuned based on the PULSE, incorporating an in-house corpus of COVID-19 knowledge databases from the Guangzhou Laboratory.
A structuring tool based on PULSE, designed to assist users in processing and analyzing free-text data. It offers features such as single selection, multiple selection, and information extraction.
An application based on PULSE for term normalization. The task of normalization is to map various clinical expressions of the same diagnosis, surgery, drug, examination, symptom, etc., to standard terminology.
A chatbot developed on PULSE, where users can add customized knowledge bases for their own application scenarios.
- Shanghai AI Lab
- SJTU - Qing Yuan Research Institute
- ECUST - NLP&BigData Lab
@article{pulse2023,
title={PULSE: Pretrained and Unified Language Service Engine},
author={Xiaofan Zhang and Kui Xue and Shaoting Zhang},
year={2023},
url={https://github.com/openmedlab/PULSE}
}
The code of this project is licensed under Apache 2.0, and the model weights are licensed under GNU AGPL 3.0.