Add document summary to extraction process #1682

Merged 1 commit on Dec 11, 2024

@NolanTrem (Collaborator) commented Dec 10, 2024

An analysis run on content about the Nobel Prizes in science indicates that giving the extractor access to document summaries during knowledge graph extraction improves quality and relevance. The summary-assisted extraction was more focused and coherent across three dimensions: entity selection, relationship mapping, and community structure.

Key Statistics:

  • Summary-assisted: 189 entities, 101 relationships, 4 communities (47.25 entities/community)
  • Baseline: 183 entities, 121 relationships, 13 communities (14.08 entities/community)
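The entities-per-community figures follow directly from the counts listed above. A minimal sketch of the arithmetic, where the helper name `entities_per_community` is hypothetical and not part of the PR:

```python
def entities_per_community(entities: int, communities: int) -> float:
    """Average number of entities per community, rounded to two decimals."""
    return round(entities / communities, 2)


# Counts taken from the Key Statistics in the PR description.
summary_assisted = entities_per_community(189, 4)
baseline = entities_per_community(183, 13)

print(f"summary-assisted: {summary_assisted} entities/community")
print(f"baseline: {baseline} entities/community")
```

Fewer, larger communities in the summary-assisted run are consistent with the description's claim of a more focused graph, though the ratio alone does not measure quality.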

Although the baseline extraction captured more relationships (121 vs. 101), the summary-assisted version produced more meaningful connections focused on core Nobel Prize concepts, particularly at the intersection of neural networks and statistical physics. These findings support including document summaries in the extraction pipeline for improved knowledge representation.


Important

Add document summaries to the knowledge graph extraction process for improved entity and relationship extraction, and update related prompt templates.

  • Behavior:
    • Add document summary retrieval in augment_document_info() in ingestion_service.py and _extract_kg() in kg_service.py.
    • Use document summaries in entity and relationship extraction processes.
  • Prompts:
    • Update graphrag_entity_description.yaml to include document_summary in entity description generation.
    • Update graphrag_relationships_extraction_few_shot.yaml to use document_summary for relationship extraction.
  • Migration:
    • Add user_count and document_count columns to collections in c45a9cf6a8a4_add_user_and_document_count_to_.py.
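The prompt changes above amount to passing a `document_summary` value into the template alongside the existing inputs. The sketch below illustrates that idea only; the template text, the `entity_info` field, the fallback string, and the function `build_entity_prompt` are all assumptions for illustration and do not reproduce the contents of `graphrag_entity_description.yaml`.

```python
from typing import Optional

# Hypothetical template text: the real prompt lives in the YAML files
# named in the PR and is not reproduced here.
ENTITY_DESCRIPTION_TEMPLATE = (
    "Document summary:\n{document_summary}\n\n"
    "Entity information:\n{entity_info}\n\n"
    "Describe the entity in the context of the document."
)


def build_entity_prompt(entity_info: str, document_summary: Optional[str]) -> str:
    """Fill the template, falling back to a placeholder when no summary exists."""
    summary = document_summary or "No summary available."
    return ENTITY_DESCRIPTION_TEMPLATE.format(
        document_summary=summary,
        entity_info=entity_info,
    )


prompt = build_entity_prompt(
    entity_info="Hopfield network",
    document_summary="Overview of the Nobel Prizes in science.",
)
print(prompt)
```

Using a fallback string keeps the template valid for documents that have no stored summary; whether the actual pipeline handles that case this way is not stated in the PR.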

This description was created by Ellipsis for 97c5ab8. It will automatically update as commits are pushed.

@ellipsis-dev bot (Contributor) left a comment

👍 Looks good to me! Reviewed everything up to 97c5ab8 in 49 seconds

More details
  • Looked at 255 lines of code in 6 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. py/core/main/services/ingestion_service.py:252
  • Draft comment:
    The # FIXME: Why are we hardcoding the model here? comment indicates a potential issue with hardcoding the model in the GenerationConfig. Consider making the model configurable to allow flexibility and avoid potential issues with model updates.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The PR introduces a new feature to include document summaries in the extraction process. This involves changes in multiple files, including services, pipes, and prompt templates. The changes seem to be consistent with the PR description, which aims to improve the quality and relevance of knowledge graph extraction by using document summaries. However, there are some areas that need attention.
2. py/core/main/services/kg_service.py:1004
  • Draft comment:
    The use of # type: ignore in await self.providers.database.document_handler.get_documents_overview suggests that there might be type-checking issues. Consider addressing the root cause of the type mismatch instead of ignoring it to ensure type safety.
  • Reason this comment was not posted:
    Confidence changes required: 50%
3. py/core/main/services/kg_service.py:1010
  • Draft comment:
    The code assumes that response["results"][0].summary exists. Consider adding a check to ensure that response["results"] is not empty before accessing the summary to avoid potential IndexError.
  • Reason this comment was not posted:
    Confidence changes required: 50%
4. py/core/pipes/kg/description.py:81
  • Draft comment:
    The use of # type: ignore in await self.database_provider.document_handler.get_documents_overview suggests that there might be type-checking issues. Consider addressing the root cause of the type mismatch instead of ignoring it to ensure type safety.
  • Reason this comment was not posted:
    Confidence changes required: 50%
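Draft comment 3 raises the possibility of an IndexError when `response["results"]` is empty. One way to guard that access is sketched below. The `DocumentOverview` dataclass and the `first_summary` helper are hypothetical stand-ins; only the `response["results"][0].summary` access pattern comes from the review comment.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class DocumentOverview:
    """Hypothetical stand-in for the objects returned in response["results"]."""

    summary: Optional[str] = None


def first_summary(response: Mapping[str, Any]) -> Optional[str]:
    """Return the first result's summary, or None if there are no results."""
    results = response.get("results") or []
    if not results:
        # No documents matched: avoid indexing into an empty list.
        return None
    return getattr(results[0], "summary", None)


print(first_summary({"results": []}))
print(first_summary({"results": [DocumentOverview(summary="A short summary.")]}))
```

Returning `None` lets the caller decide whether to skip the summary or substitute a default, rather than failing the whole extraction for one document.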

Workflow ID: wflow_pEUCD9M90euNajwV



@NolanTrem NolanTrem merged commit d84aa43 into main Dec 11, 2024
12 of 30 checks passed
@NolanTrem NolanTrem deleted the Nolan/summaryInExtraction branch December 11, 2024 18:37