PULSE


Code License Model License Open in OpenXLab

[Chinese Version] [English]

Model

Key Features

  • Chinese Medical Large Language Model
  • Large-scale Supervised Fine-Tuning: The PULSE model is fine-tuned on approximately 4,000,000 samples (about 9.6B tokens) from both the medical and general domains.
  • Variety of Medical-related Tasks: PULSE supports a wide range of natural language processing tasks in the medical field, including answering medical-related questions, report interpretation, structured data extraction, diagnosis and treatment planning support, etc.

Download Link

Limitations

  • The models of this project are intended for research purposes in the medical domain only. If the models, or any modified versions thereof, generate inaccurate information or are used in a service that produces misleading or harmful statements causing adverse effects, the responsibility lies with the service provider and is not associated with or attributable to this project.
  • We cannot guarantee the accuracy, completeness, or relevance of the generated information. We strongly recommend that users consult qualified healthcare professionals for personalized medical advice and treatment plans.

Elo Evaluation

| Model Name | AVG Rank | MedQA-USMLE | MedQA-Mainland | PromptCBLUE | WebMedQA | CheckupQA | MedicineQA | DialogSumm | MedTriage (F1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 1.25 | 1129 | 1117 | 1110 | 1116 | 1096 | 1098 | 1109 | 0.65 |
| PULSE-Pro | 1.75 | 1089 | 1092 | 1088 | 1119 | 1105 | 1083 | 1096 | 0.63 |
| ChatGPT | 4.00 | 1086 | 1057 | 1064 | 1053 | 1020 | 1029 | 1080 | 0.43 |
| PULSE-20b | 4.12 | 1042 | 1024 | 1039 | 1059 | 1049 | 1069 | 1076 | 0.40 |
| Baichuan2 | 4.50 | 1024 | 1041 | 1065 | 1044 | 1062 | 1035 | 1069 | 0.33 |
| ChatGLM3 | 5.62 | 1038 | 1062 | 997 | 1012 | 1003 | 1024 | 1021 | 0.06 |
| HuatuoGPT2 | 7.62 | 955 | 993 | 985 | 963 | 983 | 1003 | 980 | 0.01 |
| QiZhenGPT | 8.38 | 955 | 959 | 945 | 989 | 1039 | 932 | 921 | 0.00 |
| BenTsao | 8.75 | 961 | 921 | 936 | 910 | 927 | 986 | 920 | 0.02 |
| BianQue2 | 10.12 | 913 | 928 | 919 | 988 | 974 | 900 | 908 | 0.00 |
| MING | 10.75 | 902 | 909 | 924 | 867 | 862 | 960 | 918 | 0.01 |
| DoctorGLM | 11.12 | 906 | 896 | 930 | 879 | 880 | 880 | 905 | 0.00 |

Evaluation Method

  • To balance costs, we primarily use GPT-4 for evaluation. As described in QLoRA, model comparisons based solely on GPT-4 scores involve substantial randomness, which aligns with our observations. We therefore adopted the widely used Elo rating tournament evaluation method, as recommended by QLoRA.

Evaluation Datasets

Public Datasets [eval/data]

  • MedQA_USMLE: 150 samples extracted from the USMLE/test subset of MedQA.
  • MedQA_Mainland: 150 samples extracted from the Mainland/test subset of MedQA.
  • PromptCBLUE: 150 samples extracted from the test subset of PromptCBLUE.
  • webMedQA: 150 samples extracted from the test subset of webMedQA.

Private Dataset

  • CheckupQA: A numerical consultation dataset for physical examination scenarios; evaluates the model's ability to understand and analyze medically relevant values.
  • MedicineQA: A medication consultation dataset with reference documents; evaluates the model in the retrieval-augmented generation (RAG) scenario.
  • DialogSumm: Doctor-patient conversation summarization; evaluates the model's long-text capabilities.
  • MedTriage: Triage suggestions based on user information; evaluates the model's ability to select the correct department from the given candidates.

Evaluation Models

Hyperparameter Selection

  • For cost considerations, we performed 360 rounds of random evaluation on each dataset. The order in which models face each other in the pairwise (PK) matches was randomized to counteract order-related bias, with the random seed set to 42. The Elo rating implementation and the other hyperparameters follow Vicuna's Elo code, with K=4 and an initial rating of 1000.
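A minimal sketch of such an Elo tournament update is shown below. It assumes the standard Elo formula with K=4, an initial rating of 1000, and a shuffled match order seeded with 42; the `battles` list and scores are hypothetical placeholders, not the actual GPT-4 judgments.

```python
import random

K = 4
INIT_RATING = 1000

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def compute_elo(battles, k=K, init=INIT_RATING, seed=42):
    """battles: list of (model_a, model_b, score); score 1 = A wins, 0 = B wins, 0.5 = tie."""
    rng = random.Random(seed)
    battles = list(battles)
    rng.shuffle(battles)  # randomize match order to counteract order-related bias
    ratings = {}
    for model_a, model_b, score in battles:
        r_a = ratings.setdefault(model_a, init)
        r_b = ratings.setdefault(model_b, init)
        e_a = expected_score(r_a, r_b)
        ratings[model_a] = r_a + k * (score - e_a)
        ratings[model_b] = r_b + k * ((1 - score) - (1 - e_a))
    return ratings

# Hypothetical usage with made-up match outcomes:
battles = [("model_a", "model_b", 1), ("model_b", "model_a", 0.5)]
print(compute_elo(battles))
```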

Related Repository

  • Please refer to PULSE-EVAL for detailed code, data and results.
  • We have also launched MedBench on OpenCompass, which provides more evaluation metrics and datasets for evaluating large language models in the medical field.

Fine-tuning

We provide PULSE-tuner, built on the LLaMA-Factory project, for fine-tuning the PULSE model.
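As a rough illustration of what a LoRA-style supervised fine-tune of a PULSE checkpoint could look like with the Hugging Face stack (this is not the PULSE-tuner / LLaMA-Factory interface itself; the model ID, data file, and LoRA target modules below are assumptions to adapt to your setup):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "OpenMEDLab/PULSE-7bv5"  # placeholder; point this at your downloaded checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# bf16 assumes an Ampere-or-newer GPU; fall back to fp32 otherwise.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Attach small LoRA adapters instead of updating all base weights.
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # module names depend on the base architecture
)
model = get_peft_model(model, lora)

# Expects a JSONL file whose "text" field already contains fully formatted dialogue samples.
dataset = load_dataset("json", data_files="sft_data.jsonl", split="train")
dataset = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pulse-lora", per_device_train_batch_size=1,
        gradient_accumulation_steps=16, num_train_epochs=1,
        learning_rate=1e-4, bf16=True, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("pulse-lora")  # saves only the LoRA adapter weights
```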

Quantization

Inference

Hardware Requirements

The table below provides the required GPU memory size for local deployment of PULSE for inference with a batch size of 1.

| Model Param | Quantization | GPU Memory |
| --- | --- | --- |
| 7B | FP16 | 14GB |
| 7B | INT4 | 6GB |
| 20B | FP16 | 40GB |
| 20B | INT4 | 12GB |
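As a rough guide to how these figures map to code, the following is a minimal loading-and-generation sketch using the Hugging Face transformers API. The model ID is a placeholder, and the 4-bit path shown here uses bitsandbytes as one possible INT4 option rather than a quantization method shipped with this repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "OpenMEDLab/PULSE-7bv5"  # placeholder; use your local or Hub checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# FP16: roughly 2 bytes per parameter (~14GB of GPU memory for the 7B model).
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)

# INT4 alternative: roughly 0.5 bytes per parameter plus overhead (~6GB for 7B).
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True),
#     device_map="auto", trust_remote_code=True)

prompt = "Please briefly explain what an elevated ALT value in a checkup report may indicate."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```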

Installation

  1. Download the contents of this repository to your local/remote server.
git clone https://github.com/openmedlab/PULSE
cd PULSE
  2. Create a conda environment and install dependencies.
conda env create -f llm.yml
conda activate llm

The installed versions of torch and transformers should not be lower than the suggested versions.

Examples

Web Demo

Gradio

python web_demo_gradio.py

Command Line Demo

You can run cli_demo.py in the repository to start a simple command-line demo.

python cli_demo.py

Use Cases

Medical Question Answering


Medical Licensing Examination


Report Interpretation


Diagnosis and Treatment Planning Support


Refusing to Respond to Unrelated Questions


Related Links

If you have other open-source projects that use or improve PULSE, you are welcome to submit a Pull Request to add them to this README, or to contact us through Issues.

XrayPULSE

An application that combines PULSE with an X-ray visual encoder, achieving multi-modal conversational capabilities.

openmedlab/XrayPULSE


PULSE-COVID-19

A model fine-tuned from PULSE, incorporating an in-house corpus of COVID-19 knowledge databases from the Guangzhou Laboratory.

openmedlab/PULSE-COVID-19


Structured Data Extraction

A structuring tool based on PULSE, designed to assist users in processing and analyzing free-text data. It offers features such as single selection, multiple selection, and information extraction.

JuneYaooo/llm_structure_tool


Clinical Term Normalization

An application based on PULSE for term normalization. The task of normalization is to map various clinical expressions of the same diagnosis, surgery, drug, examination, symptom, etc., to standard terminology.

JOHNNY-fans/HierNorm


Knowledge Based Chatbot

A chatbot developed on PULSE, where users can add customized knowledge bases for their own application scenarios.

JuneYaooo/medical_kb_chatbot


Acknowledgement

  • Shanghai AI Lab
  • SJTU - Qing Yuan Research Institute
  • ECUST - NLP&BigData Lab

Citation

@article{pulse2023,
      title={PULSE: Pretrained and Unified Language Service Engine}, 
      author={Xiaofan Zhang and Kui Xue and Shaoting Zhang},
      year={2023},
      url={https://github.com/openmedlab/PULSE}
}

License

The code of this project is licensed under Apache 2.0, and the model weights are licensed under GNU AGPL 3.0.