Commit

Merge pull request #130 from JHU-CLSP/jan18-2
improve the readme and change the task names for better readability
danyaljj authored Mar 14, 2024
2 parents 83c5c76 + c570562 commit 13ded0f
Showing 229 changed files with 1,566 additions and 122,575 deletions.
70 changes: 33 additions & 37 deletions README.md
@@ -1,4 +1,4 @@
# Web-Grounded Natural Language Instructions
# TurkingTest: Challenging AI Agents with Web-based Tasks

<hr>

@@ -9,12 +9,15 @@ visual information.
Here are two example tasks:
![Screen Shot 2023-02-20 at 12 21 22 PM](https://user-images.githubusercontent.com/2441454/220168815-10c22ddd-2deb-422f-b41e-2203bee25e25.png)

You can also see a demo of the oracle model on this page: https://turkingtest.github.io/

**Where can I see more tasks?**
You can see the instructions for each task [here](https://jhu-clsp.github.io/turk-instructions/mturk.html).
Note that in this visualization, the variables are not filled in with any values.
During the evaluation, the variables are filled in with the input instances.
We have prepared the necessary scripts for simulating the interaction with the templates (see below).

**Note:** The repository expects Python 3.10 or later.

Background
---
@@ -81,16 +84,6 @@ following visualization:

![Screenshot](data/screenshot.png)

# Obtaining Statistics

Overall, the repository contains about xx tasks. Of this, about 20 tasks are part of our evaluation. You can see the
evaluation tasks [here](data/splits/evaluation_tasks.txt).

The data contains a variety of input fields, though their distribution is not uniform. Here is the distribution of the
input fields:
![field-dist.png](data%2Ffield-dist.png)
Last but not least, the data contains various modalities of data. Take a look at our paper (bottom of the page) for the details.


# Interacting with the tasks and evaluating the oracle baselines
<img style="float: right;" src="data/llm-python-browser-interaction.png" width="30%">
@@ -106,49 +99,52 @@ the browser. The Python library will then return the results back to the AI syst
A quick way to see what the model execution is expected to look like is to run our oracle baseline, which has access to the ground-truth labels.
Run the script for evaluating the baseline by passing in the names of the
tasks:
```python
# prepare the evaluation
eval = __import__('4_run_evaluation')

evaluation = eval.Evaluation(
solver_type="oracle", # the choice of the model
tasks="all", # whether to look over all tasks
do_eval=True, # whether to evaluate the model
dump_features=False, # whether to dump input-output features that will be used for training models
report_field_stats=True, # whether to report the field statistics after the run is finished
headless=False # whether to show the browser
)

# execute
evaluation.enumerate_tasks(max_instance_count=1)
```bash
python3 4_run_evaluation.py --solver_type oracle --tasks test_easy --max_instance_count 20
```

This opens a browser and shows the execution of the oracle model on the test tasks.
Under the hood, the oracle model generates a sequence of commands (Python commands from our action library) that ultimately get executed on each task (web page). The picture below illustrates this idea:

![Screen Shot 2023-02-20 at 12 22 37 PM](https://user-images.githubusercontent.com/2441454/220168960-9080b552-446b-4385-bca3-7f662ce95e20.png)

After this is done, the code will dump a JSON and a CSV file containing the results of the oracle model.
If you want to run this faster without the visualization, you can set the `--headless` flag.

**Note:** To use Chrome as your webdriver, you need to first download the ChromeDriver executable from the ChromeDriver website and make sure it’s in your system’s `PATH`.

## Dumping the training features
## Existing baseline models and how to evaluate them
You can find a list of baseline models here: [baselines.py](src%2Fevaluation%2Fbaselines.py)
You can run these existing baselines by specifying the `solver_type` argument in the above script.
You can also add your own models by adding a new class to the `baselines.py` file, as sketched below.
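
As an illustration, a new baseline could follow the pattern below. This is a minimal sketch: the import path is an assumption, and the exact `Baseline`/`Input` signatures (and what `solve` is expected to return) should be checked against `src/evaluation/baselines.py`.

```python
# A hypothetical custom baseline; verify the import path and the exact
# Baseline/Input signatures against src/evaluation/baselines.py.
from evaluation.baselines import Baseline, Input


class MyCustomBaseline(Baseline):
    """
    A toy baseline that returns an empty action string for every input field,
    i.e., it never interacts with the page.
    """

    def solve(self, input: Input, **kwargs) -> str:
        # Real baselines typically inspect the task (e.g., its HTML, reachable
        # through the URL passed in kwargs) and return a string of Python
        # actions to be executed in the browser.
        return ""
```

Once the class is added, it can be selected through the same `solver_type` mechanism used for the existing baselines.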

## Optional: Dumping the input/output data for offline training/evaluation
There are scenarios in which you may want to build a model that is trained on the input/output pairs of the tasks.
However, doing this is not efficient if you need to interact with the browser for each task.
As a result, we have created functionality that allows you to dump the features of the tasks.
The oracle model can be used to dump the training features, which can then be used for training other models.
All that needs to be done is to set `dump_features=True` in the above script.
You can also find our script `src/5_dump_features.py`, which dumps the features for all tasks.
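
As a rough sketch, dumping the features programmatically could look like the snippet below; the `Evaluation` arguments mirror the ones used by `4_run_evaluation.py`, but the exact argument names and defaults should be double-checked there.

```python
# Sketch of dumping input/output features with the oracle solver.
# The Evaluation arguments mirror those accepted by 4_run_evaluation.py and
# may need to be adjusted to match the current interface.
eval_module = __import__('4_run_evaluation')

evaluation = eval_module.Evaluation(
    solver_type="oracle",      # the oracle has access to the ground-truth labels
    tasks="all",               # dump features for all tasks
    do_eval=False,             # no need to score the oracle while dumping
    dump_features=True,        # write the input/output features to disk
    report_field_stats=False,  # field statistics are not needed here
    headless=True,             # no need to show the browser
)

# iterate over the tasks, dumping the features of one instance per task
evaluation.enumerate_tasks(max_instance_count=1)
```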

## Training models
The dumped features contain both the visual content and the HTML content of the tasks.
One can use either source of signal, depending on the model.

The output of these models will be strings (sequences of Python actions) that will be executed in the browser.

After training a model offline, you can also generate its predictions and evaluate them.
This functionality is implemented in `4_run_evaluation.py` by specifying the solver to be "offline_predictions".
In this setting, you will also need to pass in the path of a file that contains your model's predictions.
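
A hypothetical sketch of this setting is shown below. The `offline_predictions` solver name comes from the description above, but the argument used to pass the predictions file is a placeholder, so check `4_run_evaluation.py` for the actual name.

```python
# Hypothetical sketch of scoring offline predictions; `predictions_file` is a
# placeholder argument name, not necessarily the actual parameter name used by
# 4_run_evaluation.py.
eval_module = __import__('4_run_evaluation')

evaluation = eval_module.Evaluation(
    solver_type="offline_predictions",            # evaluate pre-computed predictions
    tasks="test_easy",                            # the split to evaluate on
    do_eval=True,                                 # score the predictions
    dump_features=False,
    report_field_stats=False,
    headless=True,
    predictions_file="path/to/predictions.json",  # placeholder name for the predictions file
)

evaluation.enumerate_tasks(max_instance_count=20)
```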

## Evaluating the predictions of a model
TODO

# Citation
If you find this data useful, please cite this repository.

<!--
Publication
---
Feel free to cite us. -->
```bibtex
@article{turkingtest2024xu,
title={TurkingTest: A Challenge Benchmark for Web Agents},
author={Xu, Kevin and Kordi, Yeganeh and Sanders, Kate and Wang, Yizhong and Byerly, Adam and Zhang, Jack and Van Durme, Benjamin and Khashabi, Daniel},
year={2024},
eprint={TBD},
archivePrefix={arXiv},
}
```

License
---
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
32 changes: 16 additions & 16 deletions data/splits/evaluation_tasks_easy.txt
@@ -1,20 +1,20 @@
ROT Details [m=50] rocstories - 0 - 99
Ethical rule-of-thumb quality
Associate countries and languages with Ethnologue
Radiology Report Sentence Classification
Reddit In-group Analysis Comment annotation 3
Reddit In-group Analysis
Scalar Adjectives Identification
Script KD eval LONG V2 - disc result eval 1
Sherlock IMG 2 TXT Eval 15
wikiHow step-goal linking pilot cleanse-url
Arch - Rel Eval 3
Compression HIT
Formalize this
JiminyCricket-HumanVal-b10
winogrande validation (grammar) additional_ph
atomic_event2event-effects 4
ANLI Generation Eval - Reflective_ACL_1
ATOMIC - Object Rationale 13
Missing Adjective FITB
Simplicity HIT
Step 3 Creating Answers Given Questions 23
Goal feasibility
Image captioning
WikiHow goal linking
Story Relations
Sentence Compression
Formalize sentence
Text Game Eval
Winogrande plausiblity
Event effect
ANLI Generation
ATOMIC - Object Rationale
Missing Adjective
Simplicity rating
Creating Answers to Questions
Word Formality Annotation
14 changes: 7 additions & 7 deletions data/splits/evaluation_tasks_hard.txt
@@ -3,15 +3,15 @@ Compile list of area chairs
Elicitation obj
Elicitation subj
Gun violence structured extraction
Rationale Generation 5
Rationale Generation
Photo Collection GVDB
ESNLI Rationale Generation 4
ESNLI Rationale Generation
JJ-NN HIT
COMET2020 ATOMIC Inference Vp 5
COMET2020 ATOMIC Inference Vp
neural-pop (PLAN evaluation) t5-human-test b
VQA Rationale Generation 5
VQA Rationale Generation
Lattice
What breaks the flow - no categories 4
What breaks the flow - no categories
Spanish Word Alignment
BiSECT Human Evaluation II (2)
NER - Task scruples 26,200 - 30,922
BiSECT Human Evaluation
NER - Scruples
Binary file removed gpt/field-dist.png
47 changes: 0 additions & 47 deletions gpt/gpt4v.py

This file was deleted.

File renamed without changes.
6 changes: 1 addition & 5 deletions src/4_run_evaluation.py
@@ -81,11 +81,7 @@
)

# Check if task is empty
if args.task:
if args.task:
eval.enumerate_tasks(max_instance_count, task=args.task)
else:
eval.enumerate_tasks(max_instance_count)
# Debugging mode
# eval.enumerate_tasks(max_instance_count, task="ethics_sbic dialogue 2nd 0", first_instance_only=True)
# Collecting example code: python 4_run_evaluation.py --no-do_eval --headless > extract.txt
# eval.enumerate_tasks(max_instance_count, task="ethics_sbic dialogue 2nd 0", first_instance_only=True, input_name="norm")
20 changes: 4 additions & 16 deletions src/evaluation/baselines.py
@@ -37,20 +37,6 @@ def solve(self, input: Input, **kwargs):
# TODO decide what the output of this function should be
raise NotImplementedError("This method should be implemented by the subclass.")

def get_action_list(self):
"""
This function returns the list of actions that can be performed on a HTML page as implemented in the Actions class.
This list is particularly useful for designing "tool" (actin)-augmented web-browsing agents.
"""
# get the list of methods in the Actions class
action_list = [method for method in dir(MyActions) if not method.startswith('_')]
# include their docstrings as well
action_list = [(method, getattr(MyActions, method).__doc__) for method in action_list]
return action_list





class NewBaseline(Baseline):

@@ -177,6 +163,7 @@ def solve(self, input: Input, **kwargs):
class RandomBaseline(Baseline):
"""
This baseline randomly selects an action from the list of actions that can be performed on a HTML page.
Because this is a somewhat complex implementation, we prefer to use the `DoNothingBaseline` instead.
"""

def solve(self, input: Input, **kwargs):
@@ -246,6 +233,7 @@ def solve(self, input: Input, **kwargs):
print("random choices options:", options)
return random.choice(options)


class DoNothingBaseline(Baseline):
"""
This baseline randomly does nothing!
@@ -354,7 +342,7 @@ def solve(self, input: Input, **kwargs) -> None:

# extract HTML
html = self.get_html(input, kwargs['url'])

command = self.model.get_text_baseline_action(input.name, html, self.num_demonstrations, self.use_relevant_html)

# find the index of "self.actions(" and drop anything before it.
@@ -452,4 +440,4 @@ def solve(self, input: Input, **kwargs) -> None:
print(f"{Fore.BLUE}Executing one action: {command}")
exec(command)
except Exception as error:
print(f"{Fore.RED}Failed to execute an action {command}, error: {error}")
print(f"{Fore.RED}Failed to execute an action {command}, error: {error}")
2 changes: 1 addition & 1 deletion src/evaluation_class.py
@@ -286,7 +286,7 @@ def extract_input_values_from_url(self, url, task_name, input_names=None) -> Lis
inputs.append(element)
except:
# the reason that we have try-catch here is becuase elements exists in CSV but they're not created
# in HTML (they're created dynamically via JS). An exmaple task is "HTER - longer sentences -27 Sep 1129"
# in HTML (they're created dynamically via JS). An exmaple task is "HTER - longer sentences"
print(f"{Fore.RED}Could not find input field with name `{name}`")
else:
inputs = self.driver.find_elements(By.XPATH, '//input | //textarea | //select')
2 changes: 1 addition & 1 deletion src/run_single.py
@@ -11,7 +11,7 @@
import logging

TURKLE_URL = "http://localhost:8000"
TEST_NAME = "DI Rationale Gen. evaluation - single 2"
TEST_NAME = "DI Rationale Gen. evaluation"
SPECIFIED_INDEX = 0
RUN_ALL = False

26 changes: 0 additions & 26 deletions src/test_claude.py

This file was deleted.

3 changes: 0 additions & 3 deletions src/tests.py
@@ -11,9 +11,6 @@

def test_actions():
baseline = Baseline(driver=evaluation.driver, actions=evaluation.actions)
action_list = baseline.get_action_list()
print(action_list)
assert len(action_list) > 0, f"The action list should not be empty: {action_list}"

# Dummy input
dummy_input = Input(
2 changes: 1 addition & 1 deletion src/utils/clean_csv.py
@@ -28,7 +28,7 @@ def clean_checkboxes(csv_file):
# cleaning cases where the checkbox solutions are like q1.1 q1.2 q1.3 where these are question 1 answer 1 answer 2 answer 3 etc.
# and there is False in all the ones except a True in the right answer, like True in answer 2
# Replace with q1 value of 2
# This was used to get the answers (last few columns) for the wikiHow Step Membership Task
# This was used to get the answers (last few columns) for the WikiHow Step Membership Task
def clean_checkboxes_true(csv_file):
true = [True, "True"]
df = pd.read_csv(csv_file, low_memory=False)
