Remove extra period in task BigCodeBench/16 #38

Merged: 2 commits, Sep 10, 2024

hvaara (Contributor) commented Sep 1, 2024

This removes the extra period from the test and canonical_solution entries in the task BigCodeBench/16.
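A minimal sketch of how such a fix could be expressed as a per-sample mapping function. The field names `test`, `canonical_solution`, and `task_id` come from this PR; the exact string being corrected is an assumption (the trailing period is assumed to sit in the 'No logs found to backup' message), and this is not the actual code in the fix script.

```python
TASK_ID = "BigCodeBench/16"

# Assumption: the stray period is at the end of this message.
OLD_MESSAGE = "No logs found to backup."
NEW_MESSAGE = "No logs found to backup"


def map_ds(sample: dict) -> dict:
    """Return a copy of the sample with the extra period removed.

    Samples belonging to other tasks are returned unchanged.
    """
    if sample.get("task_id") != TASK_ID:
        return sample

    fixed = dict(sample)  # shallow copy so the input is not mutated
    for field in ("canonical_solution", "test"):
        fixed[field] = fixed[field].replace(OLD_MESSAGE, NEW_MESSAGE)
    return fixed
```

A function of this shape can be passed to a dataset's `map` call, since it takes one record and returns one record.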

TODO

  • Get initial feedback.
  • Are we still targeting v0.2.0 for the Python library?
  • Which version of the dataset on hf.co are we targeting?
  • Test generation and eval with current solution before submitting.

Fixes #30

cc @terryyz

hvaara (Contributor Author) left a comment

Added a couple of questions.

Two review threads on tools/fix_020.py (outdated, resolved).

terryyz (Collaborator) commented Sep 2, 2024

Thank you, @hvaara, for the PR. Replied.

hvaara (Contributor Author) commented Sep 3, 2024

Still have to test generation and eval. If that looks good I'm ready to merge.

hvaara (Contributor Author) commented Sep 5, 2024

Tested with the following; please verify that I did this correctly.

# export OPENAI_API_KEY=REDACTED  # Set OpenAI API key and uncomment line

git diff fix_v0110.py
diff --git a/tools/fix_v0110.py b/tools/fix_v0110.py
index a6aadac..e8004ef 100644
--- a/tools/fix_v0110.py
+++ b/tools/fix_v0110.py
@@ -34,32 +34,20 @@ if __name__ == "__main__":
     hard_ds_dict = load_dataset(BIGCODEBENCH_HARD_HF)
     ds = ds_dict[BIGCODEBENCH_VERSION]
     hard_ds = hard_ds_dict[BIGCODEBENCH_VERSION]
-    function_id = [16, 37]
+    function_id = [16]

     new_ds = ds.map(map_ds)
     new_ds.to_json("BigCodeBench.jsonl")
     ds_dict[BIGCODEBENCH_NEW_VERSION] = new_ds
-    ds_dict.push_to_hub(BIGCODEBENCH_HF)
+    #ds_dict.push_to_hub(BIGCODEBENCH_HF)

     new_hard_ds = hard_ds.map(map_ds)
     new_hard_ds.to_json("BigCodeBench-Hard.jsonl")
     hard_ds_dict[BIGCODEBENCH_NEW_VERSION] = new_hard_ds
-    hard_ds_dict.push_to_hub(BIGCODEBENCH_HARD_HF)
-
+    #hard_ds_dict.push_to_hub(BIGCODEBENCH_HARD_HF)
+
     for i in function_id:
         old_sample = ds.select([i])
         new_sample = new_ds.select([i])
         old_sample.to_json("old.jsonl")
         new_sample.to_json("new.jsonl")
-        api.upload_file(
-            path_or_fileobj="old.jsonl",
-            path_in_repo=f"{i}/old.jsonl",
-            repo_id=BIGCODEBENCH_UPDATE,
-            # repo_type="dataset"
-        )
-        api.upload_file(
-            path_or_fileobj="new.jsonl",
-            path_in_repo=f"{i}/new.jsonl",
-            repo_id=BIGCODEBENCH_UPDATE,
-            # repo_type="dataset"
-        )

Generate task definition for altered task

python fix_v0110.py
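
Before generating, it can help to confirm that only the intended fields differ between the two files the script writes. A minimal sketch, assuming `old.jsonl` and `new.jsonl` each hold one JSON record per line; the helper names `load_jsonl` and `changed_fields` are hypothetical and not part of the repository.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def changed_fields(old, new):
    """Return the sorted names of fields whose values differ between two records."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Example usage (assumes both files exist in the current directory):
# old = load_jsonl("old.jsonl")[0]
# new = load_jsonl("new.jsonl")[0]
# print(changed_fields(old, new))
```

For this PR, the expectation would be that only `canonical_solution` and `test` show up as changed.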

Generate output from OpenAI

BIGCODEBENCH_OVERRIDE_PATH=$(pwd)/new.jsonl OPENAI_API_KEY=${OPENAI_API_KEY:?} bigcodebench.generate --model gpt-4o-mini --split complete --subset full --resume --backend openai

Sanitize the output

bigcodebench.sanitize --samples samples.jsonl --calibrate

Run the eval. Note that this is potentially unsafe, as it runs the LLM-generated code on your machine; use Docker or a throwaway VM to mitigate this risk.

BIGCODEBENCH_OVERRIDE_PATH=$(pwd)/new.jsonl bigcodebench.evaluate --split complete --samples samples-sanitized-calibrated.jsonl
Load from ground-truth from /home/hvaara/.cache/bigcodebench/b2db11e110a0b35b31132e278af5d7cc.pkl
Reading samples...
1it [00:00, 21.58it/s]
100%|███████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.81s/it]
BigCodeBench-Complete-calibrated (Hard)
Groundtruth pass rate: 1.000
pass@1: 1.000

Verify output from eval

cat samples-sanitized-calibrated_hard_eval_results.json
{
  "date": "2024-09-05 12:40",
  "eval": {
    "BigCodeBench/16": [
      {
        "task_id": "BigCodeBench/16",
        "solution": "import os\nimport glob\nimport subprocess\ndef task_func(directory, backup_dir='/path/to/backup'):\n\n    pass\nimport os\nimport glob\nimport subprocess\ndef task_func(directory, backup_dir='/path/to/backup'):\n    \"\"\"\n    Backup all '.log' files in a specified directory to a tar.gz file and delete the original files after backup.\n    The backup file is named 'logs_backup.tar.gz' and placed in the specified backup directory.\n    \n    Parameters:\n    - directory (str): The directory that contains the log files to be backed up.\n    - backup_dir (str, optional): The directory where the backup file will be saved.\n                                  Default is '/path/to/backup'.\n    \n    Returns:\n    - str: The path to the backup file if logs are found, otherwise returns a message 'No logs found to backup'.\n    \n    Raises:\n    - FileNotFoundError: If the specified directory does not exist.\n    \"\"\"\n    \n    # Check if the specified directory exists\n    if not os.path.exists(directory):\n        raise FileNotFoundError(f\"The specified directory '{directory}' does not exist.\")\n    \n    # Find all .log files in the specified directory\n    log_files = glob.glob(os.path.join(directory, '*.log'))\n    \n    # Check if there are any log files to backup\n    if not log_files:\n        return 'No logs found to backup'\n    \n    # Create the backup directory if it does not exist\n    os.makedirs(backup_dir, exist_ok=True)\n    \n    # Define the backup file path\n    backup_file = os.path.join(backup_dir, 'logs_backup.tar.gz')\n    \n    # Create a tar.gz file containing all the log files\n    with subprocess.Popen(['tar', '-czf', backup_file] + log_files, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc:\n        stdout, stderr = proc.communicate()\n        if proc.returncode != 0:\n            raise Exception(f\"Error during backup: {stderr.decode().strip()}\")\n    \n    # Delete the original log files after successful backup\n    for log_file in log_files:\n        os.remove(log_file)\n    \n    return backup_file",
        "status": "pass",
        "details": {}
      }
    ]
  }
}
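
Reading the results file by eye works for a single task; for larger runs the same check could be scripted. A minimal sketch, assuming the results keep the layout shown above (an `eval` object mapping each task ID to a list of entries with a `status` field); the helper name `failing_tasks` is hypothetical.

```python
import json


def failing_tasks(results):
    """Return the sorted task IDs that have at least one entry whose status is not 'pass'."""
    return sorted(
        task_id
        for task_id, runs in results["eval"].items()
        if any(run["status"] != "pass" for run in runs)
    )


# Example usage (assumes the results file from the eval step exists):
# with open("samples-sanitized-calibrated_hard_eval_results.json", encoding="utf-8") as fh:
#     print(failing_tasks(json.load(fh)))
```

An empty list means every evaluated sample passed.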

If this LGTY I think we're ready to merge and I can take a look at the other issues as well.

@hvaara hvaara requested a review from terryyz September 5, 2024 12:50
@terryyz terryyz merged commit b0676a7 into bigcode-project:main Sep 10, 2024
@hvaara hvaara deleted the bcb16-period-fix branch September 10, 2024 20:06
Development

Successfully merging this pull request may close these issues.

🐛 [TaskRemoval/TaskRepair] - 16 Extra period in tests vs prompt