Feature/fix zerox ingestion #1659

Merged
merged 5 commits into from
Dec 5, 2024

emrgnt-cmplxty
Contributor

@emrgnt-cmplxty emrgnt-cmplxty commented Dec 5, 2024

Important

Adjust concurrency limits, improve ingestion workflows, and enhance PDF parsing and retrieval services.

  • Concurrency and Limits:
    • Adjust ingestion_concurrency_limit to 16 and kg_concurrency_limit to 4 in orchestration.py.
    • Update concurrency limits in full.toml, full_azure.toml, and r2r.toml.
  • Ingestion Workflow:
    • Modify parse() in ingestion_workflow.py to set retries to 0.
    • Add logic to handle both collection_id and document_id in kg_workflow.py.
  • PDF Parsing:
    • Introduce _create_temp_dir() in pdf_parser.py for unique temporary directories.
    • Adjust PDF image conversion and processing logic.
  • Retrieval Service:
    • Fix message handling in completion() in retrieval_service.py.
    • Update search_rag_pipe.py to handle different KGSearchResultType cases.
  • Miscellaneous:
    • Update version to 3.3.2 in pyproject.toml.
    • Correct typos and improve logging in various files.
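
The unique temporary directory item in the PDF Parsing bullets can be illustrated with a minimal sketch. The body of `_create_temp_dir()` in pdf_parser.py is not shown in this conversation, so the function names, the `pdf_images_` prefix, and the cleanup helper below are assumptions for illustration only, built on the standard library's `tempfile.mkdtemp`.

```python
import shutil
import tempfile
from pathlib import Path


def create_temp_dir(prefix: str = "pdf_images_") -> Path:
    """Create a uniquely named temporary directory and return its path.

    Hypothetical stand-in for `_create_temp_dir()`; the prefix is an assumption.
    """
    # mkdtemp always creates a fresh directory, so two concurrent parses
    # never write page images into the same location.
    return Path(tempfile.mkdtemp(prefix=prefix))


def remove_temp_dir(path: Path) -> None:
    """Delete the directory and its contents (hypothetical cleanup helper)."""
    shutil.rmtree(path, ignore_errors=True)
```

Giving each parse its own directory matters more once ingestion_concurrency_limit is raised, since several PDFs may be converted to images at the same time.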

This description was created by Ellipsis for 906cab2. It will automatically update as commits are pushed.

@emrgnt-cmplxty emrgnt-cmplxty marked this pull request as ready for review December 5, 2024 01:45
@emrgnt-cmplxty emrgnt-cmplxty merged commit 01659ea into main Dec 5, 2024
12 of 30 checks passed
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

👍 Looks good to me! Reviewed everything up to 906cab2 in 2 minutes and 9 seconds

More details
  • Looked at 669 lines of code in 13 files
  • Skipped 0 files when reviewing.
  • Skipped posting 6 drafted comments based on config settings.
1. py/core/base/providers/orchestration.py:18
  • Draft comment:
    The kg_concurrency_limit has been changed to 4 here, but in the full.toml and full_azure.toml files, it is set to 8. Ensure consistency across the codebase.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The change in the concurrency limit from 8 to 4 in the OrchestrationConfig class should be reflected in the configuration files for consistency.
2. py/core/configs/full.toml:19
  • Draft comment:
    Typo in kg_creation_concurrency_lipmit. It should be kg_creation_concurrency_limit. This typo is also present in full_azure.toml.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
3. py/core/configs/full_azure.toml:57
  • Draft comment:
    The ingestion_concurrency_limit is set to 4 here, but in orchestration.py, it is set to 16. Ensure consistency across the codebase.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The ingestion_concurrency_limit in full_azure.toml is set to 4, which is inconsistent with the orchestration.py file where it is set to 16. This inconsistency should be addressed.
4. py/core/main/orchestration/hatchet/ingestion_workflow.py:59
  • Draft comment:
    The parse function now has retries=0. Ensure this is consistent with other similar functions if applicable.
  • Reason this comment was not posted:
    Confidence changes required: 30%
    The parse function in ingestion_workflow.py was updated to have retries=0. This change should be consistent across similar functions if applicable.
5. py/core/main/services/ingestion_service.py:118
  • Draft comment:
    The version check was removed from ingest_file_ingress. Ensure this change is intentional and consistent with the overall logic.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The ingest_file_ingress function in ingestion_service.py was updated to remove the version check. Ensure this change is intentional and consistent with the overall logic.
6. py/core/main/services/retrieval_service.py:136
  • Draft comment:
    Ensure that messages are converted to dictionaries before passing to aget_completion, as done in this function.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The completion function in retrieval_service.py was updated to convert messages to dictionaries before passing them to aget_completion. The same conversion should be applied in retrieval_router.py for consistency.
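
The conversion described in draft comment 6 could look roughly like the sketch below. The `Message` dataclass and the `to_message_dicts` helper are hypothetical; the real message type and the signature of aget_completion are not shown in this conversation.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class Message:
    """Hypothetical message type; the real class in the codebase may differ."""

    role: str
    content: str


def to_message_dicts(messages: list[Any]) -> list[dict[str, Any]]:
    """Return plain dicts, leaving entries that are already dicts untouched."""
    converted = []
    for message in messages:
        if isinstance(message, dict):
            # Already in the expected shape; pass it through unchanged.
            converted.append(message)
        else:
            # Assumes a dataclass instance; asdict raises TypeError otherwise.
            converted.append(asdict(message))
    return converted
```

Normalizing in one helper would let both the service and the router accept a mix of message objects and dictionaries without duplicating the check.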

Workflow ID: wflow_JhiG8UFMwTpOyWLJ


@emrgnt-cmplxty emrgnt-cmplxty deleted the feature/fix-zerox-ingestion branch December 5, 2024 18:11