[EXAMPLE] Add CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation example #953

Open
plaguss opened this issue Sep 9, 2024 · 6 comments
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@plaguss
Contributor

plaguss commented Sep 9, 2024

Is your feature request related to a problem? Please describe.
We could create an example replicating this paper. We already have most of the pieces, and it seems quite interesting:
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Describe the solution you'd like
Add a replica/example of the paper to the documentation (possibly adding relevant steps/tasks, if applicable).

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@plaguss added the documentation and enhancement labels on Sep 9, 2024
@bikash119
Contributor

Hi @plaguss, can I take this up?

@plaguss
Contributor Author

plaguss commented Sep 28, 2024

Hi @bikash119, sure! Besides the special prompts, this task would need a data sampler, specifically one that samples from a vector database. For testing, though, I think it would be easier to play around with a simpler sampler. You can use something like the one in this PR: https://github.com/argilla-io/distilabel/pull/925/files#diff-528f40365aa1e97bc8332703af9a49626c919e5e42145eb12e443fc4021fc1a2. It's not merged yet, but you can copy it.
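
To make the idea of a "simpler sampler" concrete, here is a minimal sketch in plain Python. It is not the code from the linked PR and does not use any distilabel API; the function name `sample_rows` and its parameters (`size`, `samples`, `seed`) are hypothetical, chosen only for illustration. It assumes the corpus fits in memory as a list of dicts and draws rows uniformly at random.

```python
import random
from typing import Any, Dict, Iterator, List


def sample_rows(
    data: List[Dict[str, Any]],
    size: int,
    samples: int,
    seed: int = 42,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield `samples` batches, each holding `size` rows drawn without replacement.

    Hypothetical helper for local testing; a real pipeline would sample
    from a vector database instead of an in-memory list.
    """
    if size > len(data):
        raise ValueError("size cannot exceed the number of rows in data")
    rng = random.Random(seed)  # local RNG so repeated runs give the same batches
    for _ in range(samples):
        yield rng.sample(data, size)


# Toy corpus standing in for the real document collection.
corpus = [{"text": f"document {i}"} for i in range(10)]
batches = list(sample_rows(corpus, size=3, samples=2))
```

Each batch could then be paired with a few-shot example while the prompts are being developed, and the random draw swapped for a similarity search later.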

@bikash119
Contributor

Hi @plaguss, thank you for pointing me to the resources. I am going through the code and trying to internalize it. Nothing is wrong with the code; I am just a bit slow here.
On another note, since CRAFT needs vector storage to query the corpus based on the few-shot examples, do you think it would be worthwhile to look into sqlite-vec and, if possible, incorporate it as a step in the CRAFT task?
Please let me know your thoughts.
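
For context, the retrieval step in question (querying a corpus with few-shot examples) can be sketched without any vector database at all. The snippet below is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the corpus is a small in-memory list, and all function names are hypothetical. A store such as sqlite-vec or LanceDB would replace the brute-force loop in `retrieve`; no calls to either library are shown here.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(
    few_shots: List[str], corpus: List[str], k: int = 2
) -> Dict[str, List[Tuple[float, str]]]:
    """For each few-shot example, return the k most similar corpus documents."""
    corpus_vecs = [(doc, embed(doc)) for doc in corpus]
    results: Dict[str, List[Tuple[float, str]]] = {}
    for shot in few_shots:
        query = embed(shot)
        # Brute-force scan; a vector database would do this lookup instead.
        scored = sorted(
            ((cosine(query, vec), doc) for doc, vec in corpus_vecs), reverse=True
        )
        results[shot] = scored[:k]
    return results


corpus = [
    "How do vaccines train the immune system?",
    "Recipe for sourdough bread with rye flour",
    "The immune system and how antibodies work",
    "Stock markets fell sharply on Monday",
]
few_shots = ["Explain how the immune system responds to vaccines"]
hits = retrieve(few_shots, corpus, k=2)
```

Here `hits[few_shots[0]]` holds the two immunology documents, ranked by similarity, which are the kind of corpus samples the augmentation prompts would then receive.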

@plaguss
Contributor Author

plaguss commented Oct 1, 2024

Hi! We are still working out the best way of building this. For example, #1006 is still a work in progress, but I would say LanceDB is a good choice. If you find a way of extending this to other databases, that would be perfect, but don't put much effort into that. I would suggest taking a look at the implementation in the PR to see whether it would work here for CRAFT; otherwise we will have to update it 😄

@bikash119
Contributor

Thank you, @plaguss, for the direction. I will continue with the data sampler for CRAFT. Would it be OK to ping you on Discord?

@plaguss
Contributor Author

plaguss commented Oct 3, 2024

Sure!
