Add data loader for HF oasst1 #2951

Merged
merged 17 commits into from
Jun 13, 2023

Contributor

@grgau grgau commented Apr 28, 2023

Added a PR as requested in #2828

Changes in this PR:

  • Add the attributes `hf_dataset_name` and `use_hf_dataset` to the `oasst_export` configurations to declare which HF oasst dataset to use.
  • Load `hf_dataset_name` in the `load_oasst_export` function, making it possible to work with datasets directly from HuggingFace and not just local datasets.
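
A minimal sketch of how the two attributes might drive the choice of data source. The `OasstExportConfig` class, the `input_file_path` field, and the `resolve_source` helper are hypothetical names used for illustration and are not the PR's actual code; only `hf_dataset_name` and `use_hf_dataset` come from the description above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple, Union


@dataclass
class OasstExportConfig:
    # Hypothetical container; only the last two attribute names come from the PR.
    input_file_path: Optional[str] = None  # local export file (assumed field name)
    use_hf_dataset: bool = False
    hf_dataset_name: Optional[str] = None  # e.g. "OpenAssistant/oasst1"


def resolve_source(cfg: OasstExportConfig) -> Tuple[str, Union[str, Path]]:
    """Decide whether to read from the Hugging Face Hub or from a local file."""
    if cfg.use_hf_dataset:
        if not cfg.hf_dataset_name:
            raise ValueError("use_hf_dataset is set but hf_dataset_name is missing")
        # The caller could then pass this name to datasets.load_dataset(...).
        return ("hf", cfg.hf_dataset_name)
    if not cfg.input_file_path:
        raise ValueError("input_file_path is required when use_hf_dataset is False")
    return ("local", Path(cfg.input_file_path))
```

With `use_hf_dataset=True` and `hf_dataset_name="OpenAssistant/oasst1"`, the returned name could be handed to `datasets.load_dataset`; otherwise the local path is used as before.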

Very happy to contribute to this project in some way! Looking forward to your review 😄

@github-actions

pre-commit failed.
Please run `pre-commit run --all-files` locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md

grgau and others added 5 commits May 7, 2023 23:39
works towards LAION-AI#2819

---------

Co-authored-by: Andreas Köpf <andreas.koepf@provisio.com>
In the [PR to introduce RM for the dataset entry
class](LAION-AI#2867) I forgot
that with RM we have multiple answers per question, e.g. `[Q1,
(A11, A12)]`. I had introduced questions and answers as plain `list`
types, so there was no way to connect a question to its answers: given
`questions=[Q1, Q2]` and `answers=[A11, A12, A21, A22]`, nothing
indicates that `A11, A12` belong to `Q1` and `A21, A22` to `Q2`. So I
introduced a `list[list[str]]` type for the answers, so that questions
and answers are connected by index: with `questions=[Q1, Q2]` and
`answers=[[A11, A12], [A21, A22]]`, `answers[0]` belongs to
`questions[0]`. Note that this is backwards compatible since `answers`
is a union type of `list[str] | list[list[str]]`. Also added tests for
this.
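
The index-based pairing described above can be sketched roughly as follows. `Entry` and `pair_questions_with_answers` are hypothetical stand-ins for illustration, not the repository's `DatasetEntry` implementation; the sketch only assumes the two shapes of `answers` named in the message.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Entry:
    # Hypothetical stand-in for the dataset entry class.
    questions: List[str]
    # Either one answer per question, or a list of candidate answers per question.
    answers: Union[List[str], List[List[str]]]


def pair_questions_with_answers(entry: Entry) -> List[Tuple[str, List[str]]]:
    """Connect questions[i] with answers[i], normalising both shapes to lists."""
    if len(entry.questions) != len(entry.answers):
        raise ValueError("questions and answers must have the same length")
    pairs = []
    for question, answer in zip(entry.questions, entry.answers):
        # A plain string is the older flat format: wrap it in a one-item list.
        candidates = answer if isinstance(answer, list) else [answer]
        pairs.append((question, candidates))
    return pairs
```

Normalising the flat format into one-item lists lets downstream code treat both shapes the same way, which is one way to keep the change backwards compatible.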

Ran `python check_dataset_appearances.py -d webgpt --cache_dir .cache
--mode rm`
and found one entry with an empty question:
```python
DatasetEntry(questions=[''],
                      answers=[['Lebensraum is a German geopolitical concept that means "living space." The term was originally used to support colonialism, and was later adapted by Nazi leader Adolf Hitler to support his quest for German expansion to the east . German geographer and ethnographer Friedrich Ratzel first published an essay called "Der Lebensraum" ("The Living Space") in 1901, in which he posited that all people, animals, and plants need to expand their living space in order to survive . According to Ratzel, species that successfully adapted to one location would spread naturally to others . Hitler believed that Germany required Lebensraum in order to survive, and this conviction that this living space could be gained only in the east and, specifically, from Russia, shaped his policy after his take-over of power in Germany in 1933 . The Nazi Generalplan Ost policy (\'Master Plan for the East\') was based on the tenets of Lebensraum . It stipulated that Germany required a Lebensraum necessary for its survival and that most of the indigenous populations of Central and Eastern Europe would have to be removed permanently (either through mass deportation to Siberia, extermination, or enslavement) .', 'There are several ways to unblock blocked websites. One way is to use a good web-based proxy server . Another way is to type in the URL of the blocked site you want to access in the address bar, and then press Go or Enter . The web content will be sent to the proxy server where it can then be viewed from your device . This may make browsing a bit slower, but you should still be able to access any of your favorite websites . Another way to unblock blocked websites is to use a VPN (Virtual Private Network) . A VPN can be used to access region-restricted websites, shield your web browsing activities on public WiFi networks, and more .']],
                  context=None,
                  lang=None,
                  length=None,
                  quality=None,
                  humor=None,
                  creativity=None
)
```
So this was the result:
```bash
'Found the following occurances in TRAIN webgpt:'
{re.compile('^[\\s\\n]*$'): ['']}
```
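
The reported pattern matches strings that are empty or contain only whitespace. A small sketch of that check, where `find_blank` is a hypothetical helper and not the script's actual function:

```python
import re
from typing import Iterable, List

# Same pattern as in the output above: empty or whitespace-only strings.
BLANK = re.compile(r"^[\s\n]*$")


def find_blank(strings: Iterable[str]) -> List[str]:
    """Return every string that is empty or consists only of whitespace."""
    return [s for s in strings if BLANK.match(s)]
```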
@grgau grgau force-pushed the 2828_add_oasst1_data_loader branch from 92a7952 to 9a3edb2 Compare May 8, 2023 02:40
@grgau
Contributor Author

grgau commented May 8, 2023

Had an idea, still working on this

@grgau grgau requested a review from andreaskoepf May 21, 2023 04:56
@grgau
Contributor Author

grgau commented Jun 1, 2023

@olliestanley any updates about it?

@olliestanley
Collaborator

> @olliestanley any updates about it?

This will need reviewing by the ML code owner(s), and I am not one of them. I'll remind them.

@andreaskoepf
Collaborator

andreaskoepf commented Jun 13, 2023

.. loading needs to restore the ORIGINAL format. Sorry, forgot about it.

@grgau
Contributor Author

grgau commented Jun 13, 2023

@andreaskoepf I will work on this

@andreaskoepf
Collaborator

I already started working on it myself …

Collaborator

@andreaskoepf andreaskoepf left a comment

I think it is good to go. It would be great if someone else could review the changes quickly.

@andreaskoepf andreaskoepf merged commit 463d729 into LAION-AI:main Jun 13, 2023
Successfully merging this pull request may close these issues.

Add data loader for HF oasst1