## Randomized Control Trials (RCTs), Prompts, and Answers
There are three main types of data for this project, each of which will be described in depth in the following sections.
### RCTs
In this project, we use texts from RCTs, or randomized control trials. These RCTs are articles that directly compare two different treatments. For example, a given article might aim to determine the effectiveness of ibuprofen in counteracting headaches compared to other treatments, such as tylenol. These papers often compare multiple treatments (e.g., ibuprofen, tylenol) and their effects with respect to various outcomes (e.g., headaches, pain).
### Prompts
A prompt takes the following form: "With respect to *outcome*, characterize the reported difference between patients receiving *intervention* and those receiving *comparator*." The prompt has three fill-in-the-blank slots, each of which maps directly onto the RCT. For instance, if we use the example described in the RCTs section, we get:
* **Outcome** = 'number of headaches'
* **Intervention** = 'ibuprofen'
* **Comparator** = 'tylenol'
* "With respect to *number of headaches*, characterize the reported difference between patients receiving *ibuprofen* and those receiving *tylenol*"
A given article might contain 10+ of these comparisons. For example, if the RCT article also compared *ibuprofen* and *tylenol* with respect to *side effects*, this could also be used as a prompt.
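To make the structure concrete, here is a minimal sketch (hypothetical Python, not part of the released code) of how a prompt can be represented as an (outcome, intervention, comparator) triplet and rendered from the template above:

```python
from typing import NamedTuple

class Prompt(NamedTuple):
    """One outcome-intervention-comparator triplet drawn from an RCT article."""
    outcome: str
    intervention: str
    comparator: str

TEMPLATE = ("With respect to {outcome}, characterize the reported difference "
            "between patients receiving {intervention} and those receiving {comparator}.")

def render(prompt: Prompt) -> str:
    """Fill the three blanks of the prompt template."""
    return TEMPLATE.format(**prompt._asdict())

# The example from the RCTs section:
example = Prompt(outcome="number of headaches",
                 intervention="ibuprofen",
                 comparator="tylenol")
print(render(example))
```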
### Answers
Given a prompt, we must characterize the reported relationship between the two treatments with respect to the outcome. Let us use the prompt described previously:
* "With respect to *number of headaches*, characterize the reported difference between patients receiving *ibuprofen* and those receiving *tylenol*"
There are three answers we could give: 'significantly increased', 'significantly decreased', or 'no significant difference'. Take, for example, three sentences that *could* appear in an article, each of which would lead to a different answer.
1. **Significantly increased**: "Ibuprofen relieved 60 headaches, while tylenol relieved 120; therefore ibuprofen is worse than tylenol for reducing the number of headaches (p < 0.05)."
* This can be seen as an answer of significantly increased, since ibuprofen technically *increases* the chance of having a headache if you use it instead of tylenol. We can see this because more people benefited from the use of tylenol in comparison to ibuprofen.
2. **Significantly decreased**: "Ibuprofen reduced twice the number of headaches that tylenol did, and therefore reduced a greater number of headaches (p < 0.05)."
* This is an answer of significantly decreased since ibuprofen **decreased** the number of headaches in comparison to tylenol.
3. **No significant difference**: "Ibuprofen relieved more headaches than tylenol, but the difference was not statistically significant."
* We only care about statistical significance here. In this case there is no statistically significant difference between the two, warranting an answer of no significant difference.
As an answer, we would submit two things (a minimal example record is sketched after this list):
1. The answer (significantly inc./dec./no-diff.).
2. A quote from the text that supports our answer (one of the sentences described above).
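Here is a minimal sketch of such an answer record, again in hypothetical Python (the field names are illustrative, not the released schema):

```python
from dataclasses import dataclass

# The three allowed labels.
LABELS = ("significantly increased", "significantly decreased", "no significant difference")

@dataclass
class Answer:
    """An answer to a prompt: a label plus a supporting quote from the article."""
    label: str      # one of LABELS
    evidence: str   # verbatim sentence from the article that supports the label

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")

# Example for the headaches prompt, using sentence 3 above:
answer = Answer(
    label="no significant difference",
    evidence=("Ibuprofen relieved more headaches than tylenol, "
              "but the difference was not statistically significant."),
)
print(answer.label)
```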
## Process Description
Gathering the data involves three main processes: prompt generation, annotation, and verification. We hire M.D.s from Upwork, each of whom works on only one of the processes. We use Flask and AWS to host the servers on which the M.D.s work.
### Prompt Generation
A prompt generator is hired to look at a set of articles taken from PubMed Central Open Access. Each of these articles is an RCT comparing multiple treatments with respect to various outcomes. Prompt generators look for triplets of outcome, intervention, and comparator that fill in the following sentence: "With respect to outcome, characterize the reported difference between patients receiving intervention and those receiving comparator." To find these prompts, prompt generators will generally locate sentences describing the actual result of the comparison. Thus, we ask prompt generators not only to select an answer to the prompt (how the intervention relates to the outcome relative to the comparator), but also to give the reasoning behind their answer. The answer will be one of 'significantly increased', 'significantly decreased', or 'no significant difference', while the reasoning will be a direct quote from the text. For each article, the prompt generator attempts to find a maximum of five unique prompts.
The prompt generator instructions can be found here: http://www.ccs.neu.edu/home/lehmer16/prompt-gen-instruction/templates/instructions.html
### Annotator
An annotator is given the article and a prompt. As with prompt generation, the answer will be one of 'significantly increased', 'significantly decreased', or 'no significant difference', while the reasoning will be a direct quote from the text. The annotator has access only to the prompt and the article, and therefore must search the article for the evidence and the answer. If the prompt is incoherent or simply invalid, it can be marked as such. Annotators first attempt to find the answer in the abstract; if it is not available there, they may look at the remaining sections of the article.
The annotator instructions can be found here: http://www.ccs.neu.edu/home/lehmer16/annotation-instruction-written/templates/instructions.html
### Verifier
The verifier is given the prompt, the article, the reasoning and answer of the annotator, and the reasoning and answer of the prompt generator. However, both reasoning-and-answer pairs are presented as if they came from annotators. This ensures that the verifier does not side with the prompt generator on the intuition that their answer would be more accurate. The verifier determines whether all answers and reasonings are valid, and likewise whether the prompt itself is valid.
The verifier instructions can be found here: http://www.ccs.neu.edu/home/lehmer16/Verification-Instructions/instructions.html
A link to the description of the data can be found here: https://github.com/jayded/evidence-inference/tree/master/annotations
## README.md
The dataset could be used for automatic data extraction of the results of a given RCT. This would enable readers to discover the effectiveness of different treatments without needing to read the paper.
See [README.annotation_process.md](./README.annotation_process.md) for information about the annotation process.

## Data

Raw documents are generated in both the PubMed nxml format and a plain text version suitable for human and machine readability (you can use your favorite tokenizer and model). Annotations are described in detail in [the annotation description](./annotations/README.md).

We distribute annotation in a csv format ([prompts](./annotations/prompts_merged.csv) and [labels](./annotations/annotations_merged.csv)). If you prefer to work with a json format, we provide a [script](./evidence_inference/preprocess/convert_annotations_to_json.py) to convert from the csv format.
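As a quick sanity check on the release, here is a small sketch for loading the csv files (assuming pandas is installed; the column names are not listed in this document, so the snippet only inspects them rather than assuming a schema):

```python
# Sketch: load the released annotation csv files and inspect their structure.
# Paths follow the repository layout linked above.
import pandas as pd

prompts = pd.read_csv("annotations/prompts_merged.csv")
annotations = pd.read_csv("annotations/annotations_merged.csv")

# Print the available columns and row counts rather than assuming a schema.
print("prompt columns:", list(prompts.columns))
print("annotation columns:", list(annotations.columns))
print(len(prompts), "prompts /", len(annotations), "annotation rows")
```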
## Reproduction

See [SETUP.md](./SETUP.md) for information about how to configure and reproduce primary paper results.

## Citation

### Standard Form Citation

Eric Lehman, Jay DeYoung, Regina Barzilay, and Byron C. Wallace. 2019. Inferring which medical treatments work from reports of clinical trials. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3705–3717, Minneapolis, Minnesota. Association for Computational Linguistics.

### Bibtex Citation

When citing this project, please use the following bibtex citation:
<pre>
@inproceedings{lehman-etal-2019-inferring,
    title = "Inferring Which Medical Treatments Work from Reports of Clinical Trials",
    author = "Lehman, Eric  and
      DeYoung, Jay  and
      Barzilay, Regina  and
      Wallace, Byron C.",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    pages = "3705--3717",
}
</pre>