- Chinese Medical Large Language Model
- Large-scale Supervised Fine-Tuning: PULSE is fine-tuned on approximately 4,000,000 samples (about 9.6B tokens) from both the medical and general domains.
- Variety of Medical Tasks: PULSE supports a wide range of natural language processing tasks in the medical field, including medical question answering, report interpretation, structured data extraction, diagnosis and treatment planning support, and more.
- PULSE-7b (fine-tuned from bloomz-7b1-mt)
- PULSE-20b (fine-tuned from InternLM-20B)
- Quantized models will be released soon. For larger models, please contact us about collaboration.
- The models of this project are for research purposes in the medical domain only. If the models, or any modified versions thereof, generate inaccurate information or are used in a service that results in misleading or harmful statements causing adverse effects, the responsibility lies with the service provider and is not attributable to this project.
- We cannot guarantee the accuracy, completeness, or relevance of the generated information. We strongly recommend that users consult qualified healthcare professionals for personalized medical advice and treatment plans.
Model Name | AVG Rank | MedQA-USMLE | MedQA-Mainland | PromptCBLUE | WebMedQA | CheckupQA | MedicineQA | DialogSumm | MedTriage (F1) |
---|---|---|---|---|---|---|---|---|---|
GPT-4 | 1.25 | 1129 | 1117 | 1110 | 1116 | 1096 | 1098 | 1109 | 0.65 |
PULSE-Pro | 1.75 | 1089 | 1092 | 1088 | 1119 | 1105 | 1083 | 1096 | 0.63 |
ChatGPT | 4.00 | 1086 | 1057 | 1064 | 1053 | 1020 | 1029 | 1080 | 0.43 |
PULSE-20b | 4.12 | 1042 | 1024 | 1039 | 1059 | 1049 | 1069 | 1076 | 0.40 |
Baichuan2 | 4.50 | 1024 | 1041 | 1065 | 1044 | 1062 | 1035 | 1069 | 0.33 |
ChatGLM3 | 5.62 | 1038 | 1062 | 997 | 1012 | 1003 | 1024 | 1021 | 0.06 |
HuatuoGPT2 | 7.62 | 955 | 993 | 985 | 963 | 983 | 1003 | 980 | 0.01 |
QiZhenGPT | 8.38 | 955 | 959 | 945 | 989 | 1039 | 932 | 921 | 0.00 |
BenTsao | 8.75 | 961 | 921 | 936 | 910 | 927 | 986 | 920 | 0.02 |
BianQue2 | 10.12 | 913 | 928 | 919 | 988 | 974 | 900 | 908 | 0.00 |
MING | 10.75 | 902 | 909 | 924 | 867 | 862 | 960 | 918 | 0.01 |
DoctorGLM | 11.12 | 906 | 896 | 930 | 879 | 880 | 880 | 905 | 0.00 |
- To control evaluation costs, we primarily use GPT-4 as the judge. As noted in the QLoRA paper, model comparisons based solely on GPT-4 scores carry substantial randomness, which matches our observations. We therefore adopted the widely used Elo rating tournament method, as recommended by QLoRA.
Public Datasets [eval/data]
- MedQA_USMLE: 150 samples drawn from the USMLE/test subset of MedQA.
- MedQA_Mainland: 150 samples drawn from the Mainland/test subset of MedQA.
- PromptCBLUE: 150 samples drawn from the test subset of PromptCBLUE.
- webMedQA: 150 samples drawn from the test subset of webMedQA.
- CheckupQA: A numerical consultation dataset for physical-examination scenarios; evaluates the model's ability to understand and reason over medical values.
- MedicineQA: A medication consultation dataset with reference documents; evaluates the model in the RAG (retrieval-augmented generation) scenario.
- DialogSumm: Summarization of doctor-patient conversations; evaluates the model's long-text capabilities.
- MedTriage: Produces triage suggestions based on user information; evaluates the model's ability to select the correct department from given candidates.
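The RAG setting tested by MedicineQA can be illustrated with a toy prompt builder: pick the reference documents most relevant to the question and prepend them to the prompt. This is a minimal sketch using naive keyword-overlap retrieval; the function names are ours for illustration, and real pipelines would use a proper retriever (dense embeddings, BM25, etc.):

```python
def build_rag_prompt(question, documents, top_k=2):
    """Assemble a RAG-style prompt from the top_k most relevant documents.

    Relevance here is a naive word-overlap count between question and
    document, purely for illustration.
    """
    def score(doc):
        return len(set(question.split()) & set(doc.split()))

    retrieved = sorted(documents, key=score, reverse=True)[:top_k]
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(retrieved))
    return (
        f"Reference documents:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer based on the references above."
    )
```

The model's answer can then be graded on whether it stays grounded in the retrieved references.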
- GPT4: OpenAI API "gpt-4-1106-preview"
- ChatGPT: OpenAI API "gpt-3.5-turbo-1106"
- PULSE_pro: >100B
- PULSE_20b
- Baichuan2
- ChatGLM3
- HuatuoGPT2 (Official Website)
- QiZhenGPT (QiZhen-CaMA-13B-Checkpoint-12400)
- BenTsao (LoRA for Huozi 1.0)
- BianQue2 (BianQue-2.0)
- MING
- DoctorGLM (p-tuningv2)
- For cost considerations, we performed 360 rounds of random evaluation on each dataset. The order in which models face each other in pairwise (PK) matches was randomized to counteract order-related bias, with the random seed set to 42. The implementation of the Elo rating and the remaining hyperparameters follow Vicuna's Elo code. The Elo parameters used were K=4 and an initial rating of 1000.
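The Elo update applied after each pairwise match can be sketched as follows, using the parameters stated above (K=4, initial rating 1000). The function name is ours; see Vicuna's Elo code for the actual implementation:

```python
def elo_update(r_a, r_b, score_a, k=4):
    """One Elo update after a pairwise comparison.

    r_a, r_b: current ratings of models A and B (initially 1000).
    score_a:  1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    Returns the updated (r_a, r_b).
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new
```

With K=4, a single win between two equally rated models shifts each rating by only 2 points, which is why many randomized rounds are needed for the rankings to stabilize.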
- Please refer to PULSE-EVAL for detailed code, data and results.
- We also launched MedBench on OpenCompass, which provides more evaluation metrics and datasets for evaluating large language models in the medical field.
We provide PULSE-tuner, built on the LLaMA-Factory project, for fine-tuning PULSE models.
- For the newly released PULSE-20b, please check LMDeploy for a quantization solution.
- We also provide GPTQ-for-PULSE for the PULSE-7b model.
The table below lists the GPU memory required for local inference with PULSE at a batch size of 1.
Model Param | Quantization | GPU Memory |
---|---|---|
7B | FP16 | 14GB |
7B | INT4 | 6GB |
20B | FP16 | 40GB |
20B | INT4 | 12GB |
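As a rough sanity check on the table, weights-only memory is parameters × bits-per-parameter / 8. The FP16 rows match this estimate exactly; the INT4 rows are higher than the weights-only figure because runtime overhead (activations, KV cache, dequantization buffers) adds on top. A minimal sketch (the helper name is ours, not part of PULSE):

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Weights-only memory estimate in GB (treating 1 GB as 1e9 bytes).

    Actual deployment needs more than this: activations, the KV cache,
    and framework overhead are not included.
    """
    return params_billion * bits_per_param / 8
```

For example, a 7B model in FP16 (16 bits/param) needs about 14 GB for weights alone, matching the table above.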
- Download the contents of this repository to your local/remote server.
git clone https://github.com/openmedlab/PULSE
cd PULSE
- Create a conda environment and install dependencies.
conda env create -f llm.yml
conda activate llm
The versions of torch and transformers should be higher than the suggested versions.
Gradio
python web_demo_gradio.py
You can run cli_demo.py in the repository to start a simple command-line demo.
python cli_demo.py
Medical Question Answering
Medical Licensing Examination
Report Interpretation
Diagnosis and Treatment Planning Support
Declining to Answer Unrelated Questions
If you have other open-source projects that use or improve PULSE, you are welcome to submit a Pull Request to add them to the README, or to contact us via Issues.
An application that combines PULSE with an X-ray visual encoder, achieving multi-modal conversational capabilities.
A model fine-tuned based on the PULSE, incorporating an in-house corpus of COVID-19 knowledge databases from the Guangzhou Laboratory.
A structuring tool based on PULSE, designed to assist users in processing and analyzing free-text data. It offers features such as single selection, multiple selection, and information extraction.
An application based on PULSE for term normalization. The task of normalization is to map various clinical expressions of the same diagnosis, surgery, drug, examination, symptom, etc., to standard terminology.
A chatbot developed on PULSE, where users can add customized knowledge bases for their own application scenarios.
- Shanghai AI Lab
- SJTU - Qing Yuan Research Institute
- ECUST - NLP&BigData Lab
@article{pulse2023,
title={PULSE: Pretrained and Unified Language Service Engine},
author={Xiaofan Zhang and Kui Xue and Shaoting Zhang},
year={2023},
url={https://github.com/openmedlab/PULSE}
}
The code of this project is licensed under Apache 2.0, and the model weights are licensed under GNU AGPL 3.0.