Shreyas/unstructured #900

shreyaspimpalgaonkar · 2024-08-20T20:00:42Z

🚀	This description was created by Ellipsis for commit `c776bf1`

Summary:

This PR updates dependencies and integrates a new unstructured parsing and chunking strategy, modifying configurations, providers, and services.

Key points:

Update python-pptx to ^1.0.1 in pyproject.toml.
Add unstructured dependency with all-docs extras.
Modify chunk method in py/core/base/providers/chunking.py to accept Union[str, DocumentExtraction].
Update ParsingConfig to include chunking_config.
Add unstructured.toml for parsing and chunking configurations.
Pass chunking_config in R2RConfig.
Handle unstructured_api in create_parsing_provider.
Enhance error logging in py/core/pipes/ingestion/chunking_pipe.py.
Log errors and handle DocumentExtraction in ChunkingPipe.
Handle DocumentExtraction in R2RChunkingProvider.
Yield text directly in UnstructuredChunkingProvider if partitioned.
Support API-based parsing in UnstructuredParsingProvider.

Generated with ❤️ by ellipsis.dev

ellipsis-dev

❌ Changes requested. Reviewed everything up to f8aa82a in 35 seconds

More details

Looked at 368 lines of code in 11 files
Skipped 0 files when reviewing.
Skipped posting 3 drafted comments based on config settings.

1. r2r/providers/chunking/unstructured_chunking.py:29

Draft comment:
Ensure that parsed_document.data is already chunked or processed appropriately before yielding it directly when partitioned_by_unstructured is True. This could lead to incorrect processing if the data is not in the expected format.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment is speculative and asks for verification, which violates the rules. It does not point out a definite issue but rather suggests a potential one. The code seems to handle the condition where partitioned_by_unstructured is True by directly yielding the data, which is likely intentional.
I might be overlooking a potential issue with the data format when partitioned_by_unstructured is True, but the comment does not provide a definite problem, only a speculative one.
Even if there is a potential issue, the comment does not provide a clear, actionable change, and thus does not meet the criteria for a useful comment.
Remove the comment as it is speculative and does not point out a definite issue, violating the rules.

2. r2r/providers/parsing/unstructured_parsing.py:88

Draft comment:
Ensure that element is always a dictionary before iterating over it. Add a check or handle cases where element might not be a dictionary to avoid runtime errors.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The code already includes a check to ensure 'element' is a dictionary before iterating over it. The comment seems redundant because the check is already implemented. The comment does not suggest any additional code change that is required.
I might be overlooking a scenario where 'element' could be something other than a dictionary, but the existing check seems to handle this.
The existing check for 'element' being a dictionary should suffice, and the comment does not add any new information or suggest a necessary change.
The comment is unnecessary because the code already checks if 'element' is a dictionary. It should be removed.

3. r2r/main/assembly/factory.py:97

Draft comment:
Consider explicitly passing a boolean value for use_api when initializing UnstructuredParsingProvider for better clarity and to avoid potential logical errors.
Reason this comment was not posted:
Confidence changes required: 50%
In r2r/main/assembly/factory.py, the create_parsing_provider method has a potential issue with the way it initializes UnstructuredParsingProvider. The boolean logic for use_api is directly passed, which might not be clear. It would be better to explicitly pass a boolean value for clarity.

Workflow ID: wflow_q4rCzkTx0UksAh9x

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

r2r/main/services/ingestion_service.py

ellipsis-dev

👍 Looks good to me! Incremental review on a702131 in 18 seconds

More details

Looked at 27 lines of code in 2 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. py/core/main/services/ingestion_service.py:385

Draft comment:
The exc_info=True parameter was removed from the logger.error call, which may reduce the amount of information logged during an error. Consider whether this change is intentional and if the stack trace is needed for debugging.
Reason this comment was not posted:
Confidence changes required: 50%
The PR description mentions removing exc_info=True from the logger.error call, but this change is not reflected in the code. This discrepancy should be addressed.

Workflow ID: wflow_a3Zb0ogzO0A3Fnki

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev

👍 Looks good to me! Incremental review on c776bf1 in 25 seconds

More details

Looked at 92 lines of code in 4 files
Skipped 0 files when reviewing.
Skipped posting 2 drafted comments based on config settings.

1. py/core/providers/chunking/r2r_chunking.py:11

Draft comment:
Duplicate import of ChunkingProvider and Method. Remove this line to clean up the code.
Reason this comment was not posted:
Confidence changes required: 10%
The import statement for ChunkingProvider and Method is duplicated, which is unnecessary and can be removed to clean up the code.

2. py/core/providers/chunking/unstructured_chunking.py:26

Draft comment:
Ensure parsed_document is either a string or DocumentExtraction to prevent runtime errors. Consider adding a type check and raise an appropriate error if the type is unexpected.
Reason this comment was not posted:
Confidence changes required: 30%
The chunk method in UnstructuredChunkingProvider should handle cases where parsed_document is not a string or DocumentExtraction. This could lead to runtime errors if unexpected types are passed.

Workflow ID: wflow_grAcjbX7Sep07FsN

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

* Feature/merge graphrag group mgmt (#876) * add group ids to document abstraction, first steps * extend group permissions * up * add tests for new group features * up * fixup auth * onboard extensive regression tests * adding regression tests * finish tests * rm selenium * test observability * uncomment tests * checkin first set of group tests * modify search, passing vector tests * checkin work * full delete logic * update search to use new filters * check in * Clean up * Check in * add search * tests/test_end_to_end.py::test_ingest_txt_document passing * cleanup logging * make schemas explicit * move to run logger abstraction * cleanup some test workflows * revive tests * tweak to pass tests * tweak rrf * finish hybrid search cleanup * fixup on regr tests, regen payloads * refresh payloads * refactor api model * Feature/refactor api model (#868) * cleanup imports * flake and cleanup * coherent global import / export structure * add ingestion response models * add management response models * cleanups * checkin work on routes * remove request models * last fixes * merge * add user / group gating * working test groups * updating client --------- Co-authored-by: NolanTrem <34580718+NolanTrem@users.noreply.github.com> * Clean up API (#878) * Get running * fixes in sdk * Add in more fixes * Feature/merge dev owen changes (#880) * add group ids to document abstraction, first steps * extend group permissions * up * add tests for new group features * up * fixup auth * onboard extensive regression tests * adding regression tests * finish tests * rm selenium * test observability * uncomment tests * checkin first set of group tests * modify search, passing vector tests * checkin work * full delete logic * update search to use new filters * check in * Clean up * Check in * add search * tests/test_end_to_end.py::test_ingest_txt_document passing * cleanup logging * make schemas explicit * move to run logger abstraction * cleanup some test workflows * revive tests * tweak to pass tests * tweak rrf * finish hybrid search cleanup * fixup on regr tests, regen payloads * refresh payloads * refactor api model * Feature/refactor api model (#868) * cleanup imports * flake and cleanup * coherent global import / export structure * add ingestion response models * add management response models * cleanups * checkin work on routes * remove request models * last fixes * merge * add user / group gating * working test groups * updating client * rename service to restructure * add get documents for group endpoint * fix client bugs * return delete format * merge cleanups * merge * finalize --------- Co-authored-by: NolanTrem <34580718+NolanTrem@users.noreply.github.com> * Shreyas/graphrag test (#881) * add group ids to document abstraction, first steps * extend group permissions * up * add tests for new group features * up * fixup auth * onboard extensive regression tests * adding regression tests * finish tests * rm selenium * test observability * uncomment tests * checkin first set of group tests * modify search, passing vector tests * checkin work * full delete logic * update search to use new filters * check in * Clean up * Check in * add search * tests/test_end_to_end.py::test_ingest_txt_document passing * cleanup logging * make schemas explicit * move to run logger abstraction * cleanup some test workflows * revive tests * tweak to pass tests * tweak rrf * finish hybrid search cleanup * fixup on regr tests, regen payloads * refresh payloads * refactor api model * Feature/refactor api model (#868) * cleanup imports * flake and cleanup * coherent global import / export structure * add ingestion response models * add management response models * cleanups * checkin work on routes * remove request models * last fixes * merge * add user / group gating * sync * enrich * up * fix global search * rag * remove client.py * rm configs * rm configs --------- Co-authored-by: emrgnt-cmplxty <owen@algofi.org> Co-authored-by: NolanTrem <34580718+NolanTrem@users.noreply.github.com> Co-authored-by: emrgnt-cmplxty <68796651+emrgnt-cmplxty@users.noreply.github.com> * Feature/fix embedding pipe (#882) * up * fixup concurrency * fix ollama embeddings * fix batching with ollama * checkin all cleanups * rm kg cruft (#884) * rm kg cruft * tweaks * tweak 2 (#885) * Feature/fix retrieval endpoint cruft (#887) * tweak 2 * fix retrieval endpoint descriptions * Python SDK (#886) Clean up Python SDK and routes * Separate out SDK, add js and go sdk to monorepo (#888) * Add r2r-js sdk * Add go sdk * Pull out python sdk * remove venv * Update packages * Check in fixes * Remove alembic dependencies * Feature/merge w nolan (#894) * cleanup hybrid search * cleanups in * Fix structure * Make graspologic optional * fix rag stream (#895) * add py r2r (#896) * Clean up (#897) * fix agent (#898) * define `RAGAgentResponse` (#899) * Shreyas/unstructured (#900) * api + oss lib * rm pdb * rm poetry lock * update version * fixes * Feature/cleanup client obj logic (#901) * define `RAGAgentResponse` * cleanup client logic * Shreyas/tests (#889) * init * tests * rename service * api model * add * merge * rm restructure router * print descriptions * Refactor CLI (#903) * Rm files readded by git (#904) * Remove Execution Wrapper (#905) * Rm files readded by git * Fix merge botch * Feature/fix auth revive tests rebased (#906) * adding the client touch ups * fix auth, revive tests * add back tests * uncomment run auth workflow * decruft * refresh test kg * fixup toml (#908) * Feature/fix ingestion update (#909) * fixup toml * fix update * Fix CLI Tests (#912) Fix CLI tests * Shreyas/kg runtime cfg (#913) add kg runtime config * rename kgenrichmentresponse (#914) * Feature/add nltk hybrid expansion rebased (#917) * expand hybrid search with nltk * cleanups * cleanup hybrid search * format * add setup.py * update * add script (#918) * Fix bug in document chunks (#921) * Fix bug in update files (#923) * Shreyas/unstructured (#922) * fix dockerfiles * adding config * fix paths * mv unstructured dep to docker * clean * Update docker_utils.py * Update unstructured_parsing.py * Update r2r_chunking.py * Update app_entry.py * Feature/repair logging (#925) * fixing logs * fix * rm double logging (#929) * Configs (#926) * Fix config logic * Update config * Clean up cli entry point * Disable SSL when installing nltk wordnet (#930) * Fix analytics endpoint * Update OpenAI sdk calls (#933) * Feature/revive advanced rag (#932) * rm double logging * revive advanced rag examples * merge (#934) * sync model (#935) * Feature/remove version from ingestion end pt (#936) * sync model * remove ability to set version * tweak versions impl * fix version bug * Move docker (#938) * Move docker * remove from root * Clean up sdk/restructure.py * Fix js tests, completion scoring (#939) * Shreyas/unstructured docker image (#940) unstructured docker image * Update JS (#941) * Update models (#942) * Feature/complete group logic (#945) * fix group logic * up * Fix Dockerbuild, Symlink Readme (#944) * Add back tast prompt override and include title if availible * Fix docker, sym link readme * Fix compose file path * Shreyas/KG Search Result model (#937) * return type to kg_search_result * add model * local and global results * modify config * refresh should not be gated by auth (#946) * Linting sync (#947) * Remove email from refresh (#948) * Fix link to image * Feature/rm print cruft rebase (#953) * refresh should not be gated by auth * rm print cruft * black and sort * merge * rm * update api return type * Update Actions (#954) * Update Github Actions (#956) * Update Actions * Update actions * Shreyas/kgsearchresult model (#957) * return type to kg_search_result * add model * local and global results * modify config * add models * up * fix config path * fix models * Login and refresh token bug (#959) * Update Actions * Fix bug in login with refresh token * Point pytest to linux (#960) * collection docs (#955) * Feature/merge dev to main (#962) * merge dev and main * git rm * add back collection fix * fix docker builds (#963) * Running unstructured docker + code cleanups (#964) * Small bugfixes on prompts, return types (#965) * Fix failing CLI tests * NPM publish action * remove tarball * Feature/fix dev tests (#966) * update auth tests * fix tests * back and sort * decruft * revert back to gpt-4o --------- Co-authored-by: NolanTrem <34580718+NolanTrem@users.noreply.github.com> Co-authored-by: Shreyas Pimpalgaonkar <shreyas.gp.7@gmail.com>

shreyaspimpalgaonkar added 4 commits August 20, 2024 12:54

api + oss lib

eb8b109

rm pdb

b011b08

rm poetry lock

c1d9000

Merge remote-tracking branch 'origin/dev' into shreyas/unstructured

f8aa82a

ellipsis-dev bot reviewed Aug 20, 2024

View reviewed changes

r2r/main/services/ingestion_service.py Outdated Show resolved Hide resolved

shreyaspimpalgaonkar added 2 commits August 20, 2024 13:04

Merge remote-tracking branch 'origin/dev' into shreyas/unstructured

9e220c3

update version

a702131

ellipsis-dev bot reviewed Aug 20, 2024

View reviewed changes

fixes

c776bf1

ellipsis-dev bot reviewed Aug 20, 2024

View reviewed changes

shreyaspimpalgaonkar merged commit 675a18c into dev Aug 20, 2024
1 check failed

shreyaspimpalgaonkar deleted the shreyas/unstructured branch August 20, 2024 20:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shreyas/unstructured #900

Shreyas/unstructured #900

shreyaspimpalgaonkar commented Aug 20, 2024 •

edited by ellipsis-dev bot

Loading

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

Shreyas/unstructured #900

Shreyas/unstructured #900

Conversation

shreyaspimpalgaonkar commented Aug 20, 2024 • edited by ellipsis-dev bot Loading

Summary:

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

shreyaspimpalgaonkar commented Aug 20, 2024 •

edited by ellipsis-dev bot

Loading