[Question] QA-Corpus mapping #854
Comments
Hello @kun432
raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
There is nothing wrong with this code. So, in conclusion,
|
Wohh...
|
@thap2331 Okay we will. |
@vkehfdl1 Thank you. Well, I think I got it. Let's take the step2 notebook as an example. First, parse a PDF with Parser.
This creates "/content/parse_project_dir/0/0.parquet" as one "parsed" parquet file. Then chunk it with Chunker using this YAML config.
This creates 4 "chunked" parquet files, depending on the combination of chunk_method and chunk_size.
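To illustrate why four files appear, here is a minimal sketch of how two chunk_method values and two chunk_size values combine. The concrete values and the file numbering order are assumptions for illustration only, since the actual YAML config is not shown in this thread.

```python
from itertools import product

# Hypothetical values: the real YAML config is not shown in this thread.
chunk_methods = ["Token", "Sentence"]
chunk_sizes = [512, 1024]

# One "chunked" parquet file per (chunk_method, chunk_size) combination.
# The numbering order of the files is an assumption, not confirmed behavior.
combinations = list(product(chunk_methods, chunk_sizes))
for index, (method, size) in enumerate(combinations):
    path = f"/content/chunk_project_dir/0/{index}.parquet"
    print(f"{path} <- chunk_method={method}, chunk_size={size}")
```

With two methods and two sizes this yields four combinations, which would match the four "chunked" parquet files.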
Then, to generate QA, prepare Raw and Corpus.
At this point, I think all of the code below creates the same "parsed" instance.
|
@kun432 The So,
This is the basic structure.
raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)
This code is not the intended usage. The chunked parquet file must be in
@thap2331 I assigned it to you, thanks :) |
@vkehfdl1 hmm...kind of lost again. I think I need more time to think... will read the docs and try. Thanks, anyway. |
Here is a brief image of our Data Creation structure. I hope it helps. For RAG, here are the steps.
|
I'm confused because the step2 notebook and the tutorial document are a little different.
In the step2 notebook, QA is generated from these 2 instances:
It seems that initial_qa and the initial corpus are created from one of the "chunked" parquet files, and that the optimized QA data and corpus are created from the combination of another "chunked" parquet file and the initial corpus and QA.
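To make the mapping concrete, here is a rough sketch of how ground-truth passages from an initial corpus could be carried over to a differently chunked corpus by comparing character spans in the same source file. This is not AutoRAG's implementation. The `Chunk` tuple, the `overlaps` and `remap_retrieval_gt` helpers, and the sample IDs are all hypothetical names made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical structure: (source path, start offset, end offset), end inclusive.
Chunk = Tuple[str, int, int]


def overlaps(a: Chunk, b: Chunk) -> bool:
    """Return True when both chunks come from the same file and share text."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def remap_retrieval_gt(
    retrieval_gt: List[str],
    initial_corpus: Dict[str, Chunk],
    new_corpus: Dict[str, Chunk],
) -> List[str]:
    """Map ground-truth doc_ids of the initial corpus to overlapping new doc_ids."""
    remapped: List[str] = []
    for old_id in retrieval_gt:
        old_chunk = initial_corpus[old_id]
        for new_id, new_chunk in new_corpus.items():
            if overlaps(old_chunk, new_chunk) and new_id not in remapped:
                remapped.append(new_id)
    return remapped


# Initial corpus: two chunks of 100 characters each (made-up data).
initial_corpus = {"a0": ("doc.pdf", 0, 99), "a1": ("doc.pdf", 100, 199)}
# New corpus: the same document chunked with different boundaries.
new_corpus = {
    "b0": ("doc.pdf", 0, 49),
    "b1": ("doc.pdf", 50, 149),
    "b2": ("doc.pdf", 150, 199),
}

new_gt = remap_retrieval_gt(["a1"], initial_corpus, new_corpus)
print(new_gt)
```

A span comparison like this only works if both corpora can be traced back to the same source text, which is why it matters whether the raw data is the "parsed" output or a "chunked" file.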
On the other hand, in the tutorial document (https://docs.auto-rag.com/data_creation/tutorial.html), it seems:
In this tutorial, the optimized QA data and corpus are created from the combination of "initial_raw_df", which is the same as the "parsed" data, and the initial corpus and QA.
It seems that different data are used, and I'm confused.
So my basic question: is this code in the step2 notebook correct?