add poetry dataset setup #2730
❌ pre-commit failed.
The input CSV will need to be removed from the Git repo, and the script should include code to download it from the original source.
Added some minor comments, but otherwise this is very well done. You have added all the fixes requested previously.
Please add the dataset to the parent __init__.py and the HF dataset card to the script's README.md, and it should be good to go!
some more comments
Added poetry dataset to init
Dataset Description
This dataset contains around 14,000 poems from the PoetryFoundation.org site. They are converted to question:response pairs, using the tags as topics.
5% of the dataset is titling requests -- the user provides a poem and asks the assistant to title it.
Languages
English
Dataset Structure
This dataset follows the OA format, which is:
INSTRUCTION (string): The user asks for a poem (using one of a variety of premade prompts) on the given topics (tags). If the poem has no tags, the user simply asks for a poem without specifying topics.
RESPONSE (string): The assistant replies with the poem and title (from a variety of premade prompts).
SOURCE (string): The source is PoetryFoundation.org and the poet's name.
METADATA (JSON String):
{"author": "author of the original poem",
"title": "title of the poem",
"tags": "tags from poetry foundation."}
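As a rough illustration of the structure above, here is a minimal sketch of how one poem record might be turned into an OA-format row. It is not the code in prepare.py: the prompt templates, the `build_row` helper, the input keys (`title`, `author`, `tags`, `poem`), the comma-separated tag format, and the exact layout of the SOURCE string are all assumptions made for the example.

```python
import json
import random

# Hypothetical templates for illustration only; the actual premade
# prompts used by prepare.py are not shown in this README.
TAGGED_PROMPTS = [
    "Write a poem about {topics}.",
    "Can you compose a poem on the topics of {topics}?",
]
UNTAGGED_PROMPTS = [
    "Write me a poem.",
    "Can you compose a poem for me?",
]
RESPONSE_TEMPLATES = [
    'Here is a poem titled "{title}":\n\n{poem}',
    'I call this one "{title}":\n\n{poem}',
]


def build_row(poem, rng):
    """Convert one poem record (a dict) into an OA-format row.

    Assumes `poem` has the keys "title", "author", "tags" and "poem",
    with "tags" given as a comma-separated string (possibly empty).
    """
    tags = [t.strip() for t in poem.get("tags", "").split(",") if t.strip()]
    if tags:
        # Use the tags as the requested topics.
        instruction = rng.choice(TAGGED_PROMPTS).format(topics=", ".join(tags))
    else:
        # No tags: ask for a poem without naming any topic.
        instruction = rng.choice(UNTAGGED_PROMPTS)

    response = rng.choice(RESPONSE_TEMPLATES).format(
        title=poem["title"], poem=poem["poem"]
    )
    metadata = {
        "author": poem["author"],
        "title": poem["title"],
        "tags": poem.get("tags", ""),
    }
    return {
        "INSTRUCTION": instruction,
        "RESPONSE": response,
        # Assumed layout: site name followed by the poet's name.
        "SOURCE": f"PoetryFoundation.org - {poem['author']}",
        # METADATA is stored as a JSON string, not as a nested object.
        "METADATA": json.dumps(metadata),
    }


if __name__ == "__main__":
    example = {
        "title": "Example Title",
        "author": "Example Poet",
        "tags": "Nature, Love",
        "poem": "First line\nSecond line",
    }
    print(build_row(example, random.Random(0)))
```

Passing a seeded `random.Random` keeps the choice of template reproducible between runs, which makes it easier to compare two builds of the dataset.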
Preparing the Dataset
The dataset can be created with prepare.py. Make sure to install the required libraries listed in requirements.txt first!
Contributions
Created by Check
Original dataset source - https://www.kaggle.com/datasets/tgdivy/poetry-foundation-poems
You can view it on my Hugging Face page here: https://huggingface.co/datasets/checkai/instruction-poems
(this time i ran pre-commit so it should be good :D )