Commit

Merge pull request #130 from JHU-CLSP/jan18-2
improve the readme and change the task names for better readability
danyaljj authored Mar 14, 2024
2 parents 83c5c76 + c570562 commit 13ded0f
Showing 229 changed files with 1,566 additions and 122,575 deletions.
70 changes: 33 additions & 37 deletions README.md
@@ -1,4 +1,4 @@
# Web-Grounded Natural Language Instructions
# TurkingTest: Challenging AI Agents with Web-based Tasks

<hr>

@@ -9,12 +9,15 @@ visual information.
Here are two example tasks:
![Screen Shot 2023-02-20 at 12 21 22 PM](https://user-images.githubusercontent.com/2441454/220168815-10c22ddd-2deb-422f-b41e-2203bee25e25.png)

You can also see a demo of the oracle model on this page: https://turkingtest.github.io/

**Where can I see more tasks?**
You can see the instructions for each task [here](https://jhu-clsp.github.io/turk-instructions/mturk.html).
Note that in this visualization, the variables are not filled in with any values.
During the evaluation, the variables are filled in with the input instances.
We have prepared the necessary scripts for simulating the interaction with the templates (see below).

**Note:** The repository expects Python 3.10 or later.

Background
---
@@ -81,16 +84,6 @@ following visualization:

![Screenshot](data/screenshot.png)

# Obtaining Statistics

Overall, the repository contains about xx tasks. Of this, about 20 tasks are part of our evaluation. You can see the
evaluation tasks [here](data/splits/evaluation_tasks.txt).

The data contains a variety of input fields, though their distribution is not uniform. Here is the distribution of the
input fields:
![field-dist.png](data%2Ffield-dist.png)
Last but not least, the data contains various modalities of data. Take a look at our paper (bottom of the page) for the details.


# Interacting with the tasks and evaluating the oracle baselines
<img style="float: right;" src="data/llm-python-browser-interaction.png" width="30%">
@@ -106,49 +99,52 @@ the browser. The Python library will then return the results back to the AI syst
A quick way to see what the model execution is expected to look like is to run our oracle baseline, which has access to the ground-truth labels.
Run the script for evaluating the baseline by passing in the names of the
tasks:
```python
# prepare the evaluation
eval = __import__('4_run_evaluation')

evaluation = eval.Evaluation(
solver_type="oracle", # the choice of the model
tasks="all", # whether to look over all tasks
do_eval=True, # whether to evaluate the model
dump_features=False, # whether to dump input-output features that will be used for training models
report_field_stats=True, # whether to report the field statistics after the run is finished
headless=False # whether to show the browser
)

# execute
evaluation.enumerate_tasks(max_instance_count=1)
```bash
python3 4_run_evaluation.py --solver_type oracle --tasks test_easy --max_instance_count 20
```

This opens a browser and shows the execution of the oracle model on the test tasks.
Under the hood, the oracle model generates a sequence of commands (Python commands from our action library) that ultimately get executed on each task (web page). The picture below illustrates this idea:

![Screen Shot 2023-02-20 at 12 22 37 PM](https://user-images.githubusercontent.com/2441454/220168960-9080b552-446b-4385-bca3-7f662ce95e20.png)

After this is done, the code will dump a JSON and a CSV file containing the results of the oracle model.
If you want to run this faster without the visualization, you can set the `--headless` flag.

**Note:** To use Chrome as your webdriver, you need to first download the ChromeDriver executable from the ChromeDriver website and make sure it’s in your system’s `PATH`.

## Dumping the training features
## Existing baseline models and how to evaluate them
You can find a list of baseline models here: [baselines.py](src%2Fevaluation%2Fbaselines.py)
You can run these existing baselines by specifying the `solver_type` argument in the above script.
You can also add your own models by adding a new class to the `baselines.py` file, as sketched below.
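
As an illustration, a new baseline could follow the pattern below. This is a minimal sketch: the import path is an assumption, and the exact `Baseline`/`Input` signatures (and what `solve` is expected to return) should be checked against `src/evaluation/baselines.py`.

```python
# A hypothetical custom baseline; verify the import path and the exact
# Baseline/Input signatures against src/evaluation/baselines.py.
from evaluation.baselines import Baseline, Input


class MyCustomBaseline(Baseline):
    """
    A toy baseline that returns an empty action string for every input field,
    i.e., it never interacts with the page.
    """

    def solve(self, input: Input, **kwargs) -> str:
        # Real baselines typically inspect the task (e.g., its HTML, reachable
        # through the URL passed in kwargs) and return a string of Python
        # actions to be executed in the browser.
        return ""
```

Once the class is added, it can be selected through the same `solver_type` mechanism used for the existing baselines.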

## Optional: Dumping the input/output data for offline training/evaluation
There are scenarios in which you may want to build a model that is trained on the input/output pairs of the tasks.
However, doing this is not efficient if you need to interact with the browser for each task.
As a result, we have created functionality that allows you to dump the features of the tasks.
The oracle model can be used to dump the training features, which can then be used for training other models.
All that needs to be done is to set `dump_features=True` in the above script.
You can also find our script `src/5_dump_features.py`, which dumps the features for all tasks.
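
As a rough sketch, dumping the features programmatically could look like the snippet below; the `Evaluation` arguments mirror the ones used by `4_run_evaluation.py`, but the exact argument names and defaults should be double-checked there.

```python
# Sketch of dumping input/output features with the oracle solver.
# The Evaluation arguments mirror those accepted by 4_run_evaluation.py and
# may need to be adjusted to match the current interface.
eval_module = __import__('4_run_evaluation')

evaluation = eval_module.Evaluation(
    solver_type="oracle",      # the oracle has access to the ground-truth labels
    tasks="all",               # dump features for all tasks
    do_eval=False,             # no need to score the oracle while dumping
    dump_features=True,        # write the input/output features to disk
    report_field_stats=False,  # field statistics are not needed here
    headless=True,             # no need to show the browser
)

# iterate over the tasks, dumping the features of one instance per task
evaluation.enumerate_tasks(max_instance_count=1)
```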

## Training models
The dumped features contain both the visual content and the HTML content of the tasks.
One can use either source of signal, depending on the model.

The output of these models will be strings (sequences of Python actions) that will be executed in the browser.

After training a model offline, you can also generate its predictions and evaluate them.
This functionality is implemented in `4_run_evaluation.py` by specifying the solver to be "offline_predictions".
In this setting, you will also need to pass in the path of a file that contains your model's predictions.
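
A hypothetical sketch of this setting is shown below. The `offline_predictions` solver name comes from the description above, but the argument used to pass the predictions file is a placeholder, so check `4_run_evaluation.py` for the actual name.

```python
# Hypothetical sketch of scoring offline predictions; `predictions_file` is a
# placeholder argument name, not necessarily the actual parameter name used by
# 4_run_evaluation.py.
eval_module = __import__('4_run_evaluation')

evaluation = eval_module.Evaluation(
    solver_type="offline_predictions",            # evaluate pre-computed predictions
    tasks="test_easy",                            # the split to evaluate on
    do_eval=True,                                 # score the predictions
    dump_features=False,
    report_field_stats=False,
    headless=True,
    predictions_file="path/to/predictions.json",  # placeholder name for the predictions file
)

evaluation.enumerate_tasks(max_instance_count=20)
```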

## Evaluating the predictions of a model
TODO

# Citation
If you find this data useful, please cite this repository.

<!--
Publication
---
Feel free to cite us. -->
```bibtex
@article{turkingtest2024xu,
title={TurkingTest: A Challenge Benchmark for Web Agents},
author={Xu, Kevin and Kordi, Yeganeh and Sanders, Kate and Wang, Yizhong and Byerly, Adam and Zhang, Jack and Van Durme, Benjamin and Khashabi, Daniel},
year={2024},
eprint={TBD},
archivePrefix={arXiv},
}
```

License
---
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
32 changes: 16 additions & 16 deletions data/splits/evaluation_tasks_easy.txt
@@ -1,20 +1,20 @@
ROT Details [m=50] rocstories - 0 - 99
Ethical rule-of-thumb quality
Associate countries and languages with Ethnologue
Radiology Report Sentence Classification
Reddit In-group Analysis Comment annotation 3
Reddit In-group Analysis
Scalar Adjectives Identification
Script KD eval LONG V2 - disc result eval 1
Sherlock IMG 2 TXT Eval 15
wikiHow step-goal linking pilot cleanse-url
Arch - Rel Eval 3
Compression HIT
Formalize this
JiminyCricket-HumanVal-b10
winogrande validation (grammar) additional_ph
atomic_event2event-effects 4
ANLI Generation Eval - Reflective_ACL_1
ATOMIC - Object Rationale 13
Missing Adjective FITB
Simplicity HIT
Step 3 Creating Answers Given Questions 23
Goal feasibility
Image captioning
WikiHow goal linking
Story Relations
Sentence Compression
Formalize sentence
Text Game Eval
Winogrande plausiblity
Event effect
ANLI Generation
ATOMIC - Object Rationale
Missing Adjective
Simplicity rating
Creating Answers to Questions
Word Formality Annotation
14 changes: 7 additions & 7 deletions data/splits/evaluation_tasks_hard.txt
@@ -3,15 +3,15 @@ Compile list of area chairs
Elicitation obj
Elicitation subj
Gun violence structured extraction
Rationale Generation 5
Rationale Generation
Photo Collection GVDB
ESNLI Rationale Generation 4
ESNLI Rationale Generation
JJ-NN HIT
COMET2020 ATOMIC Inference Vp 5
COMET2020 ATOMIC Inference Vp
neural-pop (PLAN evaluation) t5-human-test b
VQA Rationale Generation 5
VQA Rationale Generation
Lattice
What breaks the flow - no categories 4
What breaks the flow - no categories
Spanish Word Alignment
BiSECT Human Evaluation II (2)
NER - Task scruples 26,200 - 30,922
BiSECT Human Evaluation
NER - Scruples
Binary file removed gpt/field-dist.png
47 changes: 0 additions & 47 deletions gpt/gpt4v.py

This file was deleted.

File renamed without changes.
6 changes: 1 addition & 5 deletions src/4_run_evaluation.py
@@ -81,11 +81,7 @@
)

# Check if task is empty
if args.task:
if args.task:
eval.enumerate_tasks(max_instance_count, task=args.task)
else:
eval.enumerate_tasks(max_instance_count)
# Debugging mode
# eval.enumerate_tasks(max_instance_count, task="ethics_sbic dialogue 2nd 0", first_instance_only=True)
# Collecting example code: python 4_run_evaluation.py --no-do_eval --headless > extract.txt
# eval.enumerate_tasks(max_instance_count, task="ethics_sbic dialogue 2nd 0", first_instance_only=True, input_name="norm")
20 changes: 4 additions & 16 deletions src/evaluation/baselines.py
@@ -37,20 +37,6 @@ def solve(self, input: Input, **kwargs):
# TODO decide what the output of this function should be
raise NotImplementedError("This method should be implemented by the subclass.")

def get_action_list(self):
"""
This function returns the list of actions that can be performed on a HTML page as implemented in the Actions class.
This list is particularly useful for designing "tool" (actin)-augmented web-browsing agents.
"""
# get the list of methods in the Actions class
action_list = [method for method in dir(MyActions) if not method.startswith('_')]
# include their docstrings as well
action_list = [(method, getattr(MyActions, method).__doc__) for method in action_list]
return action_list





class NewBaseline(Baseline):

@@ -177,6 +163,7 @@ def solve(self, input: Input, **kwargs):
class RandomBaseline(Baseline):
"""
This baseline randomly selects an action from the list of actions that can be performed on a HTML page.
Because this is a somewhat complex implementation, we prefer to use the `DoNothingBaseline` instead.
"""

def solve(self, input: Input, **kwargs):
@@ -246,6 +233,7 @@ def solve(self, input: Input, **kwargs):
print("random choices options:", options)
return random.choice(options)


class DoNothingBaseline(Baseline):
"""
This baseline randomly does nothing!
@@ -354,7 +342,7 @@ def solve(self, input: Input, **kwargs) -> None:

# extract HTML
html = self.get_html(input, kwargs['url'])

command = self.model.get_text_baseline_action(input.name, html, self.num_demonstrations, self.use_relevant_html)

# find the index of "self.actions(" and drop anything before it.
@@ -452,4 +440,4 @@ def solve(self, input: Input, **kwargs) -> None:
print(f"{Fore.BLUE}Executing one action: {command}")
exec(command)
except Exception as error:
print(f"{Fore.RED}Failed to execute an action {command}, error: {error}")
print(f"{Fore.RED}Failed to execute an action {command}, error: {error}")
2 changes: 1 addition & 1 deletion src/evaluation_class.py
@@ -286,7 +286,7 @@ def extract_input_values_from_url(self, url, task_name, input_names=None) -> Lis
inputs.append(element)
except:
# the reason that we have try-catch here is becuase elements exists in CSV but they're not created
# in HTML (they're created dynamically via JS). An exmaple task is "HTER - longer sentences -27 Sep 1129"
# in HTML (they're created dynamically via JS). An exmaple task is "HTER - longer sentences"
print(f"{Fore.RED}Could not find input field with name `{name}`")
else:
inputs = self.driver.find_elements(By.XPATH, '//input | //textarea | //select')
2 changes: 1 addition & 1 deletion src/run_single.py
@@ -11,7 +11,7 @@
import logging

TURKLE_URL = "http://localhost:8000"
TEST_NAME = "DI Rationale Gen. evaluation - single 2"
TEST_NAME = "DI Rationale Gen. evaluation"
SPECIFIED_INDEX = 0
RUN_ALL = False

26 changes: 0 additions & 26 deletions src/test_claude.py

This file was deleted.

3 changes: 0 additions & 3 deletions src/tests.py
@@ -11,9 +11,6 @@

def test_actions():
baseline = Baseline(driver=evaluation.driver, actions=evaluation.actions)
action_list = baseline.get_action_list()
print(action_list)
assert len(action_list) > 0, f"The action list should not be empty: {action_list}"

# Dummy input
dummy_input = Input(
2 changes: 1 addition & 1 deletion src/utils/clean_csv.py
@@ -28,7 +28,7 @@ def clean_checkboxes(csv_file):
# cleaning cases where the checkbox solutions are like q1.1 q1.2 q1.3 where these are question 1 answer 1 answer 2 answer 3 etc.
# and there is False in all the ones except a True in the right answer, like True in answer 2
# Replace with q1 value of 2
# This was used to get the answers (last few columns) for the wikiHow Step Membership Task
# This was used to get the answers (last few columns) for the WikiHow Step Membership Task
def clean_checkboxes_true(csv_file):
true = [True, "True"]
df = pd.read_csv(csv_file, low_memory=False)
