[Question] QA-Corpus mapping #854

Open
kun432 opened this issue Oct 17, 2024 · 8 comments

kun432 commented Oct 17, 2024

I'm confused because the step2 notebook and the tutorial document are slightly different.

In the step2 notebook, the QA dataset is generated from these two instances:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

initial_qa = (
    corpus_instance.sample(random_single_hop, n=3)
    (snip)
)

initial_qa.to_parquet('/content/initial_qa.parquet', '/content/initial_corpus.parquet')

new_corpus_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
new_corpus_instance = Corpus(new_corpus_df, raw_instance)

new_qa = initial_qa.update_corpus(new_corpus_instance)
new_qa.to_parquet("/content/new_qa.parquet", "/content/new_corpus.parquet")

It seems initial_qa and initial_corpus are created from one of the "chunked" parquet files, and then the optimized QA data and corpus are created by combining another "chunked" parquet file with the initial corpus and QA.

On the other hand, the tutorial document (https://docs.auto-rag.com/data_creation/tutorial.html) seems to do this:

# initial_raw seems to be "parsed" data
initial_raw = Raw(initial_raw_df)

# initial chunk
initial_corpus = initial_raw.chunk(
    "llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5
)
llm = OpenAI()
initial_qa = (
    initial_corpus.sample(random_single_hop, n=3)
   (snip)
)
initial_qa.to_parquet("./initial_qa.parquet", "./initial_corpus.parquet")

# chunk optimization, now we have other variations of "chunk"s
chunker = Chunker.from_parquet("./initial_raw.parquet", "./chunk_project_dir")
chunker.start_chunking("./chunking.yaml")

# corpus-qa mapping
raw = Raw(initial_raw_df)   # <-- "parsed" data, right?
corpus = Corpus(initial_corpus_df, raw)
qa = QA(initial_qa_df, corpus)

new_qa = qa.update_corpus(Corpus(new_corpus_df, raw))

In this tutorial, the optimized QA data and corpus are created by combining initial_raw_df, which is the same as the "parsed" data, with the initial corpus and QA.

seems "different data" are used and I'm confused.

So my basic question is: is this code in the step2 notebook correct?

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
vkehfdl1 (Contributor) commented

Hello @kun432

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

There is nothing wrong with this code. A Raw instance is basically parsed data. You can make it using Parser and a parse.yaml: https://docs.auto-rag.com/data_creation/parse/parse.html
Or you can use the Hugging Face Space to get a raw.parquet.
Of course, you can also build it yourself with pandas if you already have parsed data.

In conclusion, raw.parquet is the 'parsed' file, so it does not matter whether you load it from a parquet file or from a dataframe. In short:

  • Raw : parsed data
  • Corpus : chunked data
  • QA : Question & Answer dataset based on the corpus

thap2331 commented

Whoa...
This needs to be somewhere in the docs... please!

Raw : parsed data
Corpus : chunked data
QA : Question & Answer dataset based on the corpus

vkehfdl1 (Contributor) commented

@thap2331 Okay, we will. I'll file this as a documentation issue.

@vkehfdl1 vkehfdl1 added the documentation Improvements or additions to documentation label Oct 20, 2024
@vkehfdl1 vkehfdl1 self-assigned this Oct 20, 2024
kun432 (Author) commented Oct 20, 2024

@vkehfdl1 Thank you.

Well, I think I got it. Raw turns its input parquet/dataframe into a "parsed" instance regardless of whether that input has been chunked or not, right?

Let's use the step2 notebook as an example.

First, parse a PDF with Parser:

from autorag.parser import Parser

parser = Parser(data_path_glob="/content/raw_documents/sample.pdf", project_dir="/content/parse_project_dir")
parser.start_parsing("/content/parse.yaml")

this makes "/content/parse_project_dir/0/0.parquet" as 1 "parsed" parquet file.

Then chunk it with Chunker, using this YAML config:

modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: en

from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="/content/parse_project_dir/0/0.parquet", project_dir="/content/chunk_project_dir")
chunker.start_chunking("/content/chunk.yaml")

This produces four "chunked" parquet files, one for each combination of chunk_method and chunk_size (a small enumeration sketch follows the list):

  • /content/chunk_project_dir/0/0.parquet
  • /content/chunk_project_dir/0/1.parquet
  • /content/chunk_project_dir/0/2.parquet
  • /content/chunk_project_dir/0/3.parquet
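The file-to-settings mapping comes from the cartesian product of the YAML options. A quick way to enumerate the four combinations (which index maps to which combination is an assumption here, so confirm it against whatever summary output the chunker writes into the project directory):

# Illustrative only: enumerate the 2 x 2 chunking combinations from the YAML.
# ASSUMPTION: which combination lands in 0.parquet vs. 3.parquet is not
# guaranteed here; check the chunker's summary output to be sure.
from itertools import product

chunk_methods = ["Token", "Sentence"]
chunk_sizes = [1024, 512]

for i, (method, size) in enumerate(product(chunk_methods, chunk_sizes)):
    print(f"/content/chunk_project_dir/0/{i}.parquet -> method={method}, size={size}")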

Then, to generate the QA dataset, prepare Raw and Corpus:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

At this point, I think all of the snippets below produce the same "parsed" instance:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)
# choose "parsed" parquet before chunked
raw_df = pd.read_parquet("/content/parse_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

thap2331 commented

> @thap2331 Okay, we will. I'll file this as a documentation issue.

@vkehfdl1 You can assign this to me, or I can get it going myself; if I do, I will assign it to you.

vkehfdl1 (Contributor) commented Oct 21, 2024

@kun432 The Raw instance should only get a 'non-chunked' dataframe. A chunked dataframe must go into the Corpus instance.
Yes, you can technically make a Raw instance from a 'chunked' dataframe, but that is not the intended usage.

So,

  • Raw : parsed data
  • Corpus : chunked data
  • QA : Question & Answer dataset based on the corpus

This is the basic structure.
All of your code is correct and working, but

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)

This code is not the intended usage. A chunked parquet file must go into a Corpus instance.
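For contrast, the intended wiring looks like this (again assuming the schema import path below, which may differ by version):

import pandas as pd
from autorag.data.qa.schema import Raw, Corpus  # import path is an assumption

# Intended: Raw holds the parsed (non-chunked) parquet...
raw = Raw(pd.read_parquet("/content/parse_project_dir/0/0.parquet"))

# ...and each chunked parquet becomes its own Corpus tied to that same Raw.
corpus_0 = Corpus(pd.read_parquet("/content/chunk_project_dir/0/0.parquet"), raw)
corpus_1 = Corpus(pd.read_parquet("/content/chunk_project_dir/0/1.parquet"), raw)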

@thap2331 I assigned it to you, thanks :)
Go to the docs/ folder and edit .md files.

kun432 (Author) commented Oct 21, 2024

@vkehfdl1 Hmm... kind of lost again. I think I need more time to think about it... I'll read the docs and try. Thanks anyway.

vkehfdl1 (Contributor) commented

@kun432 [image: Data Creation structure diagram]

Here is a brief image of our Data Creation structure. I hope it helps.

For RAG, here are the steps (a rough end-to-end sketch follows the list):

  1. Parse raw documents into text. => This will be the Raw instance.
  2. Chunk the parsed text into small passages (chunks). => This will be the Corpus instance.
  3. Generate questions & answers from the corpus dataset. => This will be the QA instance.
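As a rough end-to-end sketch (the paths and the schema import path are assumptions; the QA-generation step follows the notebook excerpts above):

import pandas as pd
from autorag.parser import Parser
from autorag.chunker import Chunker
from autorag.data.qa.schema import Raw, Corpus  # import path is an assumption

# 1. Parse raw documents into text => Raw instance
parser = Parser(data_path_glob="./raw_documents/*.pdf", project_dir="./parse_project_dir")
parser.start_parsing("./parse.yaml")
raw = Raw(pd.read_parquet("./parse_project_dir/0/0.parquet"))

# 2. Chunk the parsed text into passages => Corpus instance
chunker = Chunker.from_parquet(parsed_data_path="./parse_project_dir/0/0.parquet",
                               project_dir="./chunk_project_dir")
chunker.start_chunking("./chunk.yaml")
corpus = Corpus(pd.read_parquet("./chunk_project_dir/0/0.parquet"), raw)

# 3. Generate Q&A from the corpus => QA instance
# (this uses corpus.sample(random_single_hop, n=...) plus the steps that were
#  snipped in the notebook excerpts above, so it is omitted here)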
