[ English | 中文 ]
Based on BigBio project, we have defined a set of lightweight, task-specific schema to help simplify programmatic access to common biomedical datasets. This schema should be implemented for each dataset in addition to a schema that preserves the original dataset format.
- DUTIR-BioNLP Data Schema Documentation
This is a simple container format with minimal nesting that supports a range of information extraction tasks:
- Named entity recognition (NER) (Example: Chinese/CMeEE-V2) and Named entity normalization/linking (NEN) (Example:English/ncbi_disease)
- Relation extraction (RE) (Example: Chinese/CMeIE-V2)
- Causal relation extraction (CRE)
- Event extraction (EE)
- Coreference resolution (COREF)
{
"id": "ABCDEFG",
"document_id": "XXXXXX",
"passages": [...],
"entities": [...],
"events": [...],
"coreferences": [...],
"relations": [...],
}
Schema Notes
id
fields appear at the top (i.e. document) level. They start with "0" that makes everyid
field in a dataset unique.document_id
should be a dataset provided document id. If not provided in the dataset, it can be set equal to the top levelid
.passages
captures document structure such as named sections. It can be empty fields if no document strcture.entities
,events
,coreferences
,relations
may be empty fields depending on the dataset and specific task.
Passages capture document structure, e.g., the title and abstact sections of a document. Passages
can be empty fields if no document strcture.
offsets
contain character offsets into the string that would be created from" ".join([passage["text"] for passage in passages])
offsets
andtext
are always lists to support discontinous spans. For continuous spans, they will have the formoffsets=[(lo,hi)], text=["text span"]
. For discontinuous spans, they will have the formoffsets=[(lo1,hi1), (lo2,hi2), ...], text=["text span 1", "text span 2", ...]
{
"id": "0",
"document_id": "227508",
"passages": [
{
"id": "0",
"type": "title",
"text": ["Naloxone reverses the antihypertensive effect of clonidine."],
"offsets": [[0, 59]]
},
{
"id": "1",
"type": "abstract",
"text": ["In unanesthetized, spontaneously hypertensive rats the decrease in blood pressure and heart rate produced by intravenous clonidine, 5 to 20 micrograms/kg, was inhibited or reversed by nalozone, 0.2 to 2 mg/kg. The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone. Naloxone alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence clonidine-suppressible binding of [3H]-dihydroergocryptine (1 nM). These findings indicate that in spontaneously hypertensive rats the effects of central alpha-adrenoceptor stimulation involve activation of opiate receptors. As naloxone and clonidine do not appear to interact with the same receptor site, the observed functional antagonism suggests the release of an endogenous opiate by clonidine or alpha-methyldopa and the possible role of the opiate in the central control of sympathetic tone."],
"offsets": [[60, 1075]]
},
],
}
normalized
sub-component may contain 1 or more normalized links to database entity identifiers. Here, "db_name" denotes the database name, "db_id" denotes the entity indentifier in this database.
"entities": [
{
"id": "0",
"offsets": [[0, 8]],
"text": ["Naloxone"],
"type": "Chemical",
"normalized": [{"db_name": "MESH", "db_id": "D009270"}]
},
...
],
type
is the relation type.arg1_id
is the first entity identifier of the relation.arg2_id
is the second entity identifier of the relation.
"relations": [
{
"id": "0",
"type": "chemical-induced disease",
"arg1_id": "10",
"arg2_id": "32"
}
]
type
is the relation type.arg1
is the first entity identifier of the relation. It can be an entity (the arg 1 "type" should be set to "entity") or a relation (the arg2 "type" should be set to "relation").arg2
is the second entity identifier of the relation. It can be an entity (the arg 1 "type" should be set to "entity") or a relation (the arg2 "type" should be set to "relation").
"causal": [
{
"id": "0",
"type": "条件关系",
"arg1": {"id":"3", "type":"entity"},
"arg2": {"id":"0", "type":"relation"}
}
]
"events": [
{
"id": "0",
"type": "Reaction",
"trigger": {
"offsets": [[0,6]],
"text": ["reacts"]
},
"arguments": [
{
"role": "Theme",
"ref_id": "5",
"text": "corpus luteum"
},
{
"role": "Instrument", # if the argument is not entity, set "ref_id" to "", only provide the text.
"ref_id": ""
"text": "...."
}
]
}
]
entity_ids
denotes the several entity names refer to the same actual entity.
"coreferences": [
{
"id": "0",
"entity_ids": ["1", "10", "23"],
},
...
]
This is a simple container format with minimal nesting that supports a range of text classification tasks. (Example file: Chinese/TCM-SD-TC)
{
"id": "ABCDEFG",
"document_id": "XXXXXX",
"text": "xxxxx",
"labels": [...]
}
Schema Notes
labels
is the label of the text type.
Example:
{
"id": "0",
"document_id": "32653511",
"text": "Potential benefits and risks of omega-3 fatty acids supplementation to patients with COVID-19.",
"labels": ["Treatment"],
}
This is a simple container format with minimal nesting that supports a range of question answering tasks.
{
"id": "ABCDEFG",
"document_id": "XXXXXX",
"question_id": "xxxxx"
"question": "...",
"type": "...",
"choices": [...],
"context": "...",
"answer": [...],
"long_answer": [...]
}
Schema Notes
id
fields appear at the top (i.e. document) level. They start with "0" that makes everyid
field in a dataset unique.document_id
should be a dataset provided document id. If not provided in the dataset, it can be set equal to the top levelid
.question_id
should be a dataset provided question id. If not provided in the dataset, it can be set equal to the upper leveldocument_id
.question
should be the question input. Some datasets' questions and contexts may be independent, please put the contexts intocontext
field.type
can be "yesno", "multiple_choice" or "dialogue".choices
whould be provided when the dataset type is "yesno" or "multiple_choice".answer
should be the answer output. Some "yesno" and "multiple_choice" datasets may provide addtional long answers to explain the choice, please put them into 'long_answer
' field.context
,long_answer
may be empty fields depending on the dataset and specific task.
(Example file: English/med_qa)
{
"id": "0",
"question_id": "0",
"document_id": "0",
"question": "A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?",
"type": "multiple_choice",
"choices": ["Ampicillin", "Ceftriaxone", "Ciprofloxacin", "Doxycycline", "Nitrofurantoin"],
"context": "",
"answer": ["Nitrofurantoin"],
"long_answer": []
}
(Example file: Chinese/BenTsao_livercacer)
{
"id": "2",
"question_id": "2",
"document_id": "2",
"question": "Hello doctor,I just got one side of my wisdom teeth removed both upper and lower six days ago, and I have another one scheduled after two days for the left side. The upper is 100 % fine, which they pulled. The lower they pulled the tooth without removing the root because it was too close to the nerve. I do have some stitching in this area currently on the lower cheek. Noticed a hard lump about 1 inch between my lower jawline and cheek area. Is this normal because of swelling or something else. Currently taking Amoxicillin 500 mg and Ibuprofen 600 mg. Please let me know what this hard lump is, also generally on the lower side where this lump is, remains sore.",
"type": "sqa",
"choices": [],
"context": "",
"answer": ["Hello. The lump is mostly a hard swelling which forms postsurgical removal of the wisdom tooth or maybe even by swollen lymph nodes due to infection. It takes a week or two for the swelling to subside. Sometimes, if the root is still inside, the infection may be remaining there causing swelling. Get an X-ray done to find out the exact problem and stronger antibiotics like Augmentin (Amoxicillin and Clavulanic acid) along with a painkiller for a week."],
"long_answer": []
}
(Example file: Chinese/TCM_Literature_QA)
{
"id": "0",
"document_id": "828",
"question_id": "0",
"question": "“干呕、吐涎沫、头痛者吴茱萸汤主之”这句话曾出现在哪本医学巨著中?",
"type": "cqa",
"choices": [],
"context": "反佐配伍的典范,始见于张仲景《伤寒杂病论》,其中记载“干呕、吐涎沫、头痛者吴茱萸汤主之”。患者病机为肝寒犯胃,浊气上逆所致头痛。胃阳不布产生涎沫随浊气上逆而吐出,肝脉与督脉交会于巅顶,肝经寒邪,循经上冲则头痛,以吴茱萸汤主治。可在吴茱萸汤中加入少许黄连反佐,用以防止方中吴茱萸、人参、干姜等品辛热太过,从而达到温降肝胃、泄浊通阳而止头痛的功效。后代医者多在清热剂和温里剂中运用此法。",
"answer": "《伤寒杂病论》",
"long_answer": []
}
This is a simple container format with minimal nesting that supports a range of multi-round dialogue tasks. (Example file: Chinese/MedDialog)
{
"id": "ABCDEFG",
"conversation_id": "XXXXXX",
"num_turns": int,
"context": "...",
"chat": [...],
}
Schema Notes
id
fields appear at the top (i.e. document) level. They start with "0" that makes everyid
field in a dataset unique.conversation_id
should be a dataset provided conversation id. If not provided in the dataset, it can be set equal to the top levelid
.num_turns
is the total number of the rounds in the chat.chat
should be chat context for each round.context
may be empty fields depending on the dataset and specific task.
{
"id": "159",
"conversation_id": "159",
"num_turns": 2,
"context": "",
"chat": [
{
"patient": "hello doctor, i get a cough for the last few days, which is heavy during night times. no raise in temperature but feeling tired with no travel history. no contact with any covid-19 persons. it has been four to five days and has drunk a lot of benadryl and took paracetamol too. doctors have shut the op so do not know what to do? please help.",
"doctor": "hello, i understand your concern. i just have a few more questions. does your cough has phlegm? any other symptoms like difficulty breathing? any other medical condition such as asthma, hypertension? are you a smoker? alcoholic beverage drinker?"
},
{
"patient": "thank you doctor, i have phlegm but not a lot. a tiny amount comes out most of the time. i have no difficulty in breathing. no medical conditions and not a smoker nor a drinker.",
"doctor": "hi, i would recommend you take n-acetylcysteine 200 mg powder dissolved in water three times a day. you may also nebulize using pnss (saline nebulizer) three times a day. this will help the phlegm to come out. i would also recommend you take vitamin c 500 mg and zinc to boost your immune system. if symptoms persist, worsen or new onset of symptoms has been noted, further consult is advised."
}
]
}
This is a simple container format with minimal nesting that supports a range of machine translation tasks. (Example file: English/paramed)
{
"id": "ABCDEFG",
"document_id": "XXXXXX",
"multilingual": [...]
}
Schema Notes
multilingual
sub-component may contain two or more normalized parallel texts.language
is the languge of this text. "zh" denotes Chinese, "en" denotes English.
{
"id": "0",
"document_id": "0",
"multilingual": [
{
"id": "0",
"text": "也许不能: 分析结果提示激素疗法在维持去脂体重方面作用很小。",
"language": "zh",
},
{
"id": "1",
"text": "probably not: analysis suggests minimal effect of HT in maintaining lean body mass.",
"language": "en",
}
]
}
This is a simple container format with minimal nesting that supports a range of text pairs tasks.
{
"id": "ABCDEFG",
"document_id": "XXXXXX",
"text_1": "...",
"text_2": "...",
"label": [...]
}
Schema Notes
id
fields appear at the top (i.e. document) level. They start with "0" that makes everyid
field in a dataset unique.document_id
should be a dataset provided document id. If not provided in the dataset, it can be set equal to the top levelid
.text_1
is the first text.text_2
is the second text.label
is the label of Text Pairs.
{
"id": "0",
"document_id": "0",
"text_1": "Pluto rotates once on its axis every 6.39 Earth days;",
"text_2": "Earth rotates on its axis once times in one day.",
"label": "neutral"
}
Schema Notes
text_1
is the premise.text_2
is the hypothesis.
{
"id": "0",
"document_id": "0",
"text_1": "Am I over weight (192.9) for my age (39)?",
"text_2": "I am a 39 y/o male currently weighing about 193 lbs. Do you think I am overweight?",
"label": "1"
}
Schema Notes
label
is the label of the semantic similarity between the texts are the same.
This is a simple container format with minimal nesting that supports a range of text generation tasks.
- Text Summarization
- Text to Struct
{
"id": "ABCDEFG",
"document_id": "XXXXXX",
"type": "...",
"input": "...",
"output": [...],
}
type
can be "txt2txt" or "txt2struct".
{
"id": "0",
"document_id": "1-131188152",
"type": "txt2txt",
"input": "SUBJECT: who and where to get cetirizine - D\nMESSAGE: I need/want to know who manufscturs Cetirizine. My Walmart is looking for a new supply and are not getting the recent",
"output": ["Who manufactures cetirizine?"],
}
Schema Notes
input
is the original text.output
is the summarization.
{
"id": "0",
"document_id": "10301581",
"type": "txt2struct",
"input": "医生:你好,咳嗽是连声咳吗?有痰吗?有没流鼻涕,鼻塞?\n医生: 咳嗽有几天了?\n患者: 有三天\n......",
"output": [{
"主诉": "晚上咳嗽,磨牙。",
"现病史": "患儿夜间咳嗽三天,磨牙,大便干。未服用药物。",
"辅助检查": "暂缺。",
"既往史": "不详。",
"诊断": "消化不良。",
"建议": "小儿消积止咳口服液,益生菌,到医院化验血常规。"
}],
}
Schema Notes
input
is the input text.output
is the struct output.