[Question] How can I retrieve images or files from AutoRAG? / How can I store images, files, etc. in RAG? #844

Open
TheClevers opened this issue Oct 14, 2024 · 1 comment
@TheClevers

Hi, I'm building a modular RAG system, but my data includes images, tables (saved as HTML), and files.
To embed the images and files, I convert them to text with OCR, and I keep the original sources as URLs.

AutoRAG looks like the system I need, but I can't figure out how to use the HTML text and file links with it.

Can you suggest any hints or ideas for using images and files in a modular RAG?

Another question: I don't understand the mechanism of the PASS module yet. How do you judge which answer is the best?

Thanks for reading.

vkehfdl1 commented Oct 14, 2024

@TheClevers

Hello! First of all, thank you for using AutoRAG.

Are you trying to set up a multi-modal RAG that includes images such as charts and graphs? AutoRAG does not support image data yet, but @bwook00 plans to add this feature soon.

As for HTML text, you can follow LlamaIndex's guide to parse the HTML properly, then save the result as raw.parquet and use it from the data creation stage onward. You can refer to the related documentation. For tables stored as HTML, it could also work to make each table its own chunk.
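To make the table suggestion concrete, here is a minimal sketch that uses only Python's standard `html.parser` to pull each `<table>` out as its own chunk and keep the remaining text separate. It is not the LlamaIndex parser mentioned above, and the row keys are assumptions: `texts` and `path` are my guess at the raw.parquet columns (please check the data creation documentation for the exact schema), and `kind` plus the function and class names are hypothetical helpers for illustration.

```python
from html import escape
from html.parser import HTMLParser


class TableSplitter(HTMLParser):
    """Collect each top-level <table> as one chunk; keep other text apart."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []      # one HTML string per top-level table
        self.text_parts = []  # text found outside of tables
        self._depth = 0       # current nesting depth of <table> tags
        self._buf = []        # pieces of the table being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        if self._depth:
            self._buf.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, unchanged.
        if self._depth:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._buf.append(f"</{tag}>")
        if tag == "table":
            self._depth -= 1
            if self._depth == 0:
                self.tables.append("".join(self._buf))
                self._buf = []

    def handle_data(self, data):
        if self._depth:
            # Re-escape because convert_charrefs already decoded entities.
            self._buf.append(escape(data, quote=False))
        elif data.strip():
            self.text_parts.append(data.strip())


def html_to_rows(html, source_url):
    """Return one row for the plain text and one row per table.

    'texts' and 'path' are assumed column names; 'kind' is hypothetical.
    """
    splitter = TableSplitter()
    splitter.feed(html)
    splitter.close()
    rows = []
    body = " ".join(splitter.text_parts)
    if body:
        rows.append({"texts": body, "path": source_url, "kind": "text"})
    for table in splitter.tables:
        rows.append({"texts": table, "path": source_url, "kind": "table"})
    return rows


sample = (
    "<p>Revenue by year</p>"
    "<table><tr><td>2023</td><td>10</td></tr></table>"
    "<p>See notes.</p>"
)
rows = html_to_rows(sample, "https://example.com/report.html")
for row in rows:
    print(row["kind"], "|", row["texts"])
```

Keeping the original URL in each row means the source file can still be shown to the user after retrieval, even though only the OCR or HTML text is embedded. The rows could then be written out with something like pandas `DataFrame.to_parquet`, once the column names match what the data creation stage expects.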

Secondly, the "pass module" is effectively a module that does nothing. For example, when testing a passage reranker, you might also want to test the option of not using a reranker at all. In that case, the pass reranker lets you run the experiment as if no reranker were used.
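As an illustration of the idea only (this is not AutoRAG's code), the sketch below treats "no reranker" as just another candidate and compares it with a toy reranker. Everything here is an assumption made for the example: the function names, the keyword-overlap reranker, and the use of recall at k against known relevant passage IDs as the way to decide which candidate is best.

```python
def pass_reranker(query, passages):
    """Do nothing: return the retrieved passages in their original order."""
    return list(passages)


def keyword_reranker(query, passages):
    """Toy stand-in for a real reranker: sort by words shared with the query."""
    query_words = set(query.lower().split())
    return sorted(
        passages,
        key=lambda p: len(query_words & set(p["text"].lower().split())),
        reverse=True,
    )


def recall_at_k(ranked, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k results."""
    top_ids = {p["id"] for p in ranked[:k]}
    return len(top_ids & relevant_ids) / len(relevant_ids)


def pick_best(candidates, query, passages, relevant_ids, k):
    """Score every candidate, including the pass-through, and keep the best."""
    scores = {
        name: recall_at_k(fn(query, passages), relevant_ids, k)
        for name, fn in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


passages = [
    {"id": "a", "text": "AutoRAG supports many retrieval modules"},
    {"id": "b", "text": "The weather is nice today"},
    {"id": "c", "text": "A reranker reorders retrieved passages"},
]
candidates = {
    "pass_reranker": pass_reranker,
    "keyword_reranker": keyword_reranker,
}
best, scores = pick_best(
    candidates,
    query="what does a reranker do",
    passages=passages,
    relevant_ids={"c"},
    k=1,
)
print(best, scores)
```

If the pass-through scored as well as the real reranker on your data, that would suggest the reranking step is not helping and could be left out.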
