[Question] QA-Corpus mapping #854

Open
kun432 opened this issue Oct 17, 2024 · 8 comments

kun432 commented Oct 17, 2024

I'm confused because the step2 notebook and the tutorial document are slightly different.

In the step2 notebook, the QA dataset is generated from these two instances:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

initial_qa = (
    corpus_instance.sample(random_single_hop, n=3)
    (snip)
)

initial_qa.to_parquet('/content/initial_qa.parquet', '/content/initial_corpus.parquet')

new_corpus_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
new_corpus_instance = Corpus(new_corpus_df, raw_instance)

new_qa = initial_qa.update_corpus(new_corpus_instance)
new_qa.to_parquet("/content/new_qa.parquet", "/content/new_corpus.parquet")

It seems initial_qa and initial_corpus are created from one of the "chunked" parquet files, and then the optimized QA data and corpus are created by combining another "chunked" parquet file with the initial corpus and QA.

On the other hand, the tutorial document (https://docs.auto-rag.com/data_creation/tutorial.html) seems to do this:

# initial_raw seems to be "parsed" data
initial_raw = Raw(initial_raw_df)

# initial chunk
initial_corpus = initial_raw.chunk(
    "llama_index_chunk", chunk_method="token", chunk_size=128, chunk_overlap=5
)
llm = OpenAI()
initial_qa = (
    initial_corpus.sample(random_single_hop, n=3)
   (snip)
)
initial_qa.to_parquet("./initial_qa.parquet", "./initial_corpus.parquet")

# chunk optimization, now we have other variations of "chunk"s
chunker = Chunker.from_parquet("./initial_raw.parquet", "./chunk_project_dir")
chunker.start_chunking("./chunking.yaml")

# corpus-qa mapping
raw = Raw(initial_raw_df)   # <-- "parsed" data, right?
corpus = Corpus(initial_corpus_df, raw)
qa = QA(initial_qa_df, corpus)

new_qa = qa.update_corpus(Corpus(new_corpus_df, raw))

In this tutorial, the optimized QA data and corpus are created by combining initial_raw_df, which is the same as the "parsed" data, with the initial corpus and QA.

seems "different data" are used and I'm confused.

So my basic question is: is this code in the step2 notebook correct?

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
vkehfdl1 (Contributor) commented

Hello @kun432

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

There is nothing wrong with this code. A Raw instance is basically parsed data. You can make it using Parser and a parse.yaml: https://docs.auto-rag.com/data_creation/parse/parse.html
Or you can use the Hugging Face Space to get a raw.parquet.
Of course, you can also build it yourself with pandas if you already have parsed data.

In conclusion, raw.parquet is the 'parsed' file, so it does not matter whether you load it from a parquet file or from a dataframe. In short:

  • Raw : parsed data
  • Corpus : chunked data
  • QA : Question & Answer dataset based on the corpus

thap2331 commented

Whoa...
This needs to be somewhere in the docs... please!

Raw : parsed data
Corpus : chunked data
QA : Question & Answer dataset based on the corpus

vkehfdl1 (Contributor) commented

@thap2331 Okay, we will. I'll file this as a documentation issue.

@vkehfdl1 vkehfdl1 added the documentation Improvements or additions to documentation label Oct 20, 2024
@vkehfdl1 vkehfdl1 self-assigned this Oct 20, 2024
kun432 (Author) commented Oct 20, 2024

@vkehfdl1 Thank you.

Well, I think I got it. Raw turns its input parquet/dataframe into a "parsed" instance regardless of whether that input has been chunked or not, right?

Let's use the step2 notebook as an example.

First, parse a PDF with Parser:

from autorag.parser import Parser

parser = Parser(data_path_glob="/content/raw_documents/sample.pdf", project_dir="/content/parse_project_dir")
parser.start_parsing("/content/parse.yaml")

this makes "/content/parse_project_dir/0/0.parquet" as 1 "parsed" parquet file.

Then chunk it with Chunker, using this YAML config:

modules:
  - module_type: llama_index_chunk
    chunk_method: [ Token, Sentence ]
    chunk_size: [ 1024, 512 ]
    chunk_overlap: 24
    add_file_name: en

from autorag.chunker import Chunker

chunker = Chunker.from_parquet(parsed_data_path="/content/parse_project_dir/0/0.parquet", project_dir="/content/chunk_project_dir")
chunker.start_chunking("/content/chunk.yaml")

This produces four "chunked" parquet files, one for each combination of chunk_method and chunk_size (a small enumeration sketch follows the list):

  • /content/chunk_project_dir/0/0.parquet
  • /content/chunk_project_dir/0/1.parquet
  • /content/chunk_project_dir/0/2.parquet
  • /content/chunk_project_dir/0/3.parquet
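The file-to-settings mapping comes from the cartesian product of the YAML options. A quick way to enumerate the four combinations (which index maps to which combination is an assumption here, so confirm it against whatever summary output the chunker writes into the project directory):

# Illustrative only: enumerate the 2 x 2 chunking combinations from the YAML.
# ASSUMPTION: which combination lands in 0.parquet vs. 3.parquet is not
# guaranteed here; check the chunker's summary output to be sure.
from itertools import product

chunk_methods = ["Token", "Sentence"]
chunk_sizes = [1024, 512]

for i, (method, size) in enumerate(product(chunk_methods, chunk_sizes)):
    print(f"/content/chunk_project_dir/0/{i}.parquet -> method={method}, size={size}")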

Then, to generate the QA dataset, prepare Raw and Corpus:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

corpus_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
corpus_instance = Corpus(corpus_df, raw_instance)

At this point, I think all of the snippets below produce the same "parsed" instance:

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)
# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)
# choose "parsed" parquet before chunked
raw_df = pd.read_parquet("/content/parse_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

thap2331 commented

> @thap2331 Okay, we will. I'll file this as a documentation issue.

@vkehfdl1 You can assign this to me, or I can get it going myself; if I do, I will assign it to you.

vkehfdl1 (Contributor) commented Oct 21, 2024

@kun432 The Raw instance should only get a 'non-chunked' dataframe. A chunked dataframe must go into the Corpus instance.
Yes, you can technically make a Raw instance from a 'chunked' dataframe, but that is not the intended usage.

So,

  • Raw : parsed data
  • Corpus : chunked data
  • QA : Question & Answer dataset based on the corpus

This is the basic structure.
All of your code is correct and working, but

raw_df = pd.read_parquet("/content/chunk_project_dir/0/0.parquet")
raw_instance = Raw(raw_df)

# choose different "chunked" parquet
raw_df = pd.read_parquet("/content/chunk_project_dir/0/1.parquet")
raw_instance = Raw(raw_df)

This code is not the intended usage. A chunked parquet file must go into a Corpus instance.
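For contrast, the intended wiring looks like this (again assuming the schema import path below, which may differ by version):

import pandas as pd
from autorag.data.qa.schema import Raw, Corpus  # import path is an assumption

# Intended: Raw holds the parsed (non-chunked) parquet...
raw = Raw(pd.read_parquet("/content/parse_project_dir/0/0.parquet"))

# ...and each chunked parquet becomes its own Corpus tied to that same Raw.
corpus_0 = Corpus(pd.read_parquet("/content/chunk_project_dir/0/0.parquet"), raw)
corpus_1 = Corpus(pd.read_parquet("/content/chunk_project_dir/0/1.parquet"), raw)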

@thap2331 I assigned it to you, thanks :)
Go to the docs/ folder and edit .md files.

kun432 (Author) commented Oct 21, 2024

@vkehfdl1 Hmm... kind of lost again. I think I need more time to think about it... I'll read the docs and try. Thanks anyway.

vkehfdl1 (Contributor) commented

@kun432 [image: Data Creation structure diagram]

Here is a brief image of our Data Creation structure. I hope it helps.

For RAG, here are the steps (a rough end-to-end sketch follows the list):

  1. Parse raw documents into text. => This will be the Raw instance.
  2. Chunk the parsed text into small passages (chunks). => This will be the Corpus instance.
  3. Generate questions & answers from the corpus dataset. => This will be the QA instance.
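As a rough end-to-end sketch (the paths and the schema import path are assumptions; the QA-generation step follows the notebook excerpts above):

import pandas as pd
from autorag.parser import Parser
from autorag.chunker import Chunker
from autorag.data.qa.schema import Raw, Corpus  # import path is an assumption

# 1. Parse raw documents into text => Raw instance
parser = Parser(data_path_glob="./raw_documents/*.pdf", project_dir="./parse_project_dir")
parser.start_parsing("./parse.yaml")
raw = Raw(pd.read_parquet("./parse_project_dir/0/0.parquet"))

# 2. Chunk the parsed text into passages => Corpus instance
chunker = Chunker.from_parquet(parsed_data_path="./parse_project_dir/0/0.parquet",
                               project_dir="./chunk_project_dir")
chunker.start_chunking("./chunk.yaml")
corpus = Corpus(pd.read_parquet("./chunk_project_dir/0/0.parquet"), raw)

# 3. Generate Q&A from the corpus => QA instance
# (this uses corpus.sample(random_single_hop, n=...) plus the steps that were
#  snipped in the notebook excerpts above, so it is omitted here)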
