Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Skip resolve_timestamp_path when using --data in local search #1049

Closed
wants to merge 2 commits into from

Conversation

ka2048
Copy link

@ka2048 ka2048 commented Aug 29, 2024

Description

The errors during the indexing process that led to rebuilding once again have been quite frustrating, so I specified the --resume parameter in the CLI to avoid repeatedly consuming tokens to build the index. However, instead of using the timestamp in the example, I customized a name, like acc (since there are too many previously generated timestamp directories under this directory, a custom name makes it more noticeable). After successfully completing the index build, I attempted to use global search with the --data parameter, and it worked well. However, when I used local search with the same --data parameter, I received a ValueError.

For example, the CLI command is as follows:

python -m graphrag.query \
--root ./ragtest \
--data ./ragtest/output/acc/artifacts \
--method local \
"Who is Scrooge, and what are his main relationships?"

The ValueError originates from graphrag/config/resolve_timestamp_path.py at lines 76-78:

if len(timestamp_dirs) == 0:
    msg = f"No timestamp directories found in {parent_dir} that match {pattern.pattern}."
    raise ValueError(msg)

This is very confusing since I didn’t use a timestamp as the value for the --data parameter, so why is it prompting that none were found?
Upon inspecting the code, I discovered that in the method _configure_paths_and_settings within the graphrag/query/cli.py file, a conditional check is already performed based on the --data parameter:

def _configure_paths_and_settings(
    data_dir: str | None,
    root_dir: str | None,
    config_filepath: str | None,
) -> tuple[str, str | None, GraphRagConfig]:
    config = _create_graphrag_config(root_dir, config_filepath)
    if data_dir is None and root_dir is None:
        msg = "Either data_dir or root_dir must be provided."
        raise ValueError(msg)
    if data_dir is None:
        base_dir = Path(str(root_dir)) / config.storage.base_dir
        data_dir = str(resolve_timestamp_path(base_dir))
    return data_dir, root_dir, config

Further on, the local search methods in both graphrag/query/cli.py and graphrag/query/api.py repeatedly call resolve_timestamp_path, which led to the ValueError mentioned earlier.

As a result, I made modifications to the code and submitted this PR.

I modified the local_search function’s parameters to pass in the data_path parameter, which was previously handled in _configure_paths_and_settings in cli.py, as resolved_base_dir. This is only used for the lancedb_dir variable.

Related Issues

I did not find any related issues.

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Thank you for taking the time to review this PR.

I appreciate your consideration of these changes.

* Improved local search logic by avoiding redundant execution of `resolve_timestamp_path` when `--data` argument is specified
* Prevented potential errors caused by repeated calls

---------
@ka2048 ka2048 requested review from a team as code owners August 29, 2024 08:05
dworthen added a commit that referenced this pull request Aug 30, 2024
- Provide a consistent way to load configuration
- Resolve potential timestamp directories upfront
    upon config object creation
- Add unit tests for resolving timestamp directories
- Resolves #599
- Resolves #1049
@dworthen dworthen mentioned this pull request Aug 30, 2024
4 tasks
KylinMountain pushed a commit to KylinMountain/graphrag-server that referenced this pull request Sep 12, 2024
* Consistent config load_config

- Provide a consistent way to load configuration
- Resolve potential timestamp directories upfront
    upon config object creation
- Add unit tests for resolving timestamp directories
- Resolves microsoft#599
- Resolves microsoft#1049

* fix formatting issues

* remove unnecessary path resolution

* fix smoke tests

* update prompts to use load_config

* Update none checks

* Update none checks

* Update searching for config method signature

* Update unit tests

* fix formatting issues
KylinMountain added a commit to KylinMountain/graphrag-server that referenced this pull request Sep 12, 2024
* Bump micromatch from 4.0.5 to 4.0.8 in /docsite (microsoft#1013)

Bumps [micromatch](https://github.com/micromatch/micromatch) from 4.0.5 to 4.0.8.
- [Release notes](https://github.com/micromatch/micromatch/releases)
- [Changelog](https://github.com/micromatch/micromatch/blob/4.0.8/CHANGELOG.md)
- [Commits](micromatch/micromatch@4.0.5...4.0.8)

---
updated-dependencies:
- dependency-name: micromatch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump ruff from 0.5.7 to 0.6.2 (microsoft#1014)

Bumps [ruff](https://github.com/astral-sh/ruff) from 0.5.7 to 0.6.2.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.5.7...0.6.2)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* Ensure entity types to be str in prompt tune (microsoft#1015)

* Fix weight casting during graph extraction (microsoft#1016)

* Fix weight casting during graph extraction

* Format

* Format

* Update developer guide (microsoft#1029)

* Add missing config parameter for prompt tuning docs (microsoft#1017)

* Improve search type hint (microsoft#1031)

* update get_local_search_engine and get_global_search_engine return annotation

* add semversioner file

* reorder imports

* fix pyright errors

* revert change and ignore previous pyright error

---------

Co-authored-by: wanhua.gu <wanhua.gu@wiz.ai>
Co-authored-by: longyunfeigu <2514553187@qq.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* Patch "past" dependency issues (microsoft#1033)

* Patch "past" dependency issues

* Semver

* Release v0.3.2 (microsoft#1034)

* Update VectorStoreSearchResult score value range (microsoft#937)

update VectorStoreSearchResult score comment

Co-authored-by: wanhua.gu <wanhua.gu@wiz.ai>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* Add source URL to the package (microsoft#927)

Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* Fix/text unit code cleanup (microsoft#1040)

* Optimized _build_text_unit_context function for improved time and space complexity

Refactored the _build_text_unit_context function to enhance its performance and efficiency. Key optimizations include:

1. Set for Text Unit IDs: Replaced list-based membership checks with a set (text_unit_ids_set) to achieve constant-time complexity for membership checks, reducing overall time complexity.
2. Direct Attribute Removal: Utilized pop with a default value (None) to directly remove attributes entity_order and num_relationships from text units, minimizing overhead and avoiding potential KeyError.
3. Default Dictionary for Entity Orders: Implemented defaultdict for managing entity orders, simplifying the ranking process and improving readability.

These improvements result in a more efficient function with better performance, especially when handling large datasets or numerous selected entities. The refactoring ensures that the core functionality remains unchanged while enhancing both time and space complexity.

* Format

* Ruff fixes

* semver

---------

Co-authored-by: arjun-234 <arjun.darji@yudiz.com>
Co-authored-by: Arjun D. <103405661+arjun-234@users.noreply.github.com>

* Fix INIT_YAML embeddings default settings (microsoft#1039)

Co-authored-by: Thanh Long Phan <long.phan@dida.do>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* Bump pytest-asyncio from 0.23.8 to 0.24.0 (microsoft#1022)

Bumps [pytest-asyncio](https://github.com/pytest-dev/pytest-asyncio) from 0.23.8 to 0.24.0.
- [Release notes](https://github.com/pytest-dev/pytest-asyncio/releases)
- [Commits](pytest-dev/pytest-asyncio@v0.23.8...v0.24.0)

---
updated-dependencies:
- dependency-name: pytest-asyncio
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* Bump json-repair from 0.26.0 to 0.28.4 (microsoft#1044)

Bumps [json-repair](https://github.com/mangiucugna/json_repair) from 0.26.0 to 0.28.4.
- [Release notes](https://github.com/mangiucugna/json_repair/releases)
- [Commits](mangiucugna/json_repair@0.26.0...v0.28.4)

---
updated-dependencies:
- dependency-name: json-repair
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump lancedb from 0.11.0 to 0.12.0 (microsoft#1024)

Bumps [lancedb](https://github.com/lancedb/lancedb) from 0.11.0 to 0.12.0.
- [Release notes](https://github.com/lancedb/lancedb/releases)
- [Changelog](https://github.com/lancedb/lancedb/blob/main/release_process.md)
- [Commits](lancedb/lancedb@python-v0.11.0...python-v0.12.0)

---
updated-dependencies:
- dependency-name: lancedb
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump textual from 0.76.0 to 0.78.0 (microsoft#1038)

Bumps [textual](https://github.com/Textualize/textual) from 0.76.0 to 0.78.0.
- [Release notes](https://github.com/Textualize/textual/releases)
- [Changelog](https://github.com/Textualize/textual/blob/main/CHANGELOG.md)
- [Commits](Textualize/textual@v0.76.0...v0.78.0)

---
updated-dependencies:
- dependency-name: textual
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix/entity extraction strategy (microsoft#1046)

* fix strategy config in entity_extraction

* update init content

---------

Co-authored-by: KylinMountain <kose2livs@gmail.com>

* fix for issue 515 (microsoft#925)

* fix for issue 515

* semver impact document

---------

Co-authored-by: Kanishk Tyagi <kanishktyagi@Kanishks-MacBook-Pro.local>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* docs: update manual_prompt_tuning.md (microsoft#963)

paramater -> parameter

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* Update indexer_adapters.py (microsoft#895)

Update the lines 71 and 72
before:
entity_df["community"] = entity_df["community"].fillna(-1)
entity_df["community"] = entity_df["community"].astype(int)
after:
entity_df.loc[:, "community"] = entity_df["community"].fillna(-1)
entity_df.loc[:, "community"] = entity_df["community"].astype(int)

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* Fix circular dependency on prompt tune api (microsoft#1054)

* Bump notebook from 7.2.1 to 7.2.2 (microsoft#1055)

Bumps [notebook](https://github.com/jupyter/notebook) from 7.2.1 to 7.2.2.
- [Release notes](https://github.com/jupyter/notebook/releases)
- [Changelog](https://github.com/jupyter/notebook/blob/@jupyter-notebook/tree@7.2.2/CHANGELOG.md)
- [Commits](https://github.com/jupyter/notebook/compare/@jupyter-notebook/tree@7.2.1...@jupyter-notebook/tree@7.2.2)

---
updated-dependencies:
- dependency-name: notebook
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>

* Bump jupyterlab from 4.2.4 to 4.2.5 (microsoft#1056)

Bumps [jupyterlab](https://github.com/jupyterlab/jupyterlab) from 4.2.4 to 4.2.5.
- [Release notes](https://github.com/jupyterlab/jupyterlab/releases)
- [Changelog](https://github.com/jupyterlab/jupyterlab/blob/@jupyterlab/lsp@4.2.5/CHANGELOG.md)
- [Commits](https://github.com/jupyterlab/jupyterlab/compare/@jupyterlab/lsp@4.2.4...@jupyterlab/lsp@4.2.5)

---
updated-dependencies:
- dependency-name: jupyterlab
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update Prompt Tuning docs (microsoft#1057)

* Update Prompt Tuning docs

* Semver

* Update bash example in docs for prompt tune (microsoft#1059)

* Semver

* Update bash command

* Fix img for autotune (microsoft#1060)

* Fix img for autotune

* Add line breaks to tune docs

* More line breaks

* Fix img width (microsoft#1061)

* Consistent config load_config (microsoft#1065)

* Consistent config load_config

- Provide a consistent way to load configuration
- Resolve potential timestamp directories upfront
    upon config object creation
- Add unit tests for resolving timestamp directories
- Resolves microsoft#599
- Resolves microsoft#1049

* fix formatting issues

* remove unnecessary path resolution

* fix smoke tests

* update prompts to use load_config

* Update none checks

* Update none checks

* Update searching for config method signature

* Update unit tests

* fix formatting issues

* fix setting base_dir to full paths when not using file system. (microsoft#1096)

* fix setting base_dir to full paths when not using file system.

* add general resolve_path

* Clean and organize run index code (microsoft#1090)

* Create entypoint for cli and api (microsoft#1067)

* Add cli and api entrypoints for update index

* Semver

* Update docs

* Run tests on feature branch main

* Better /main handling in tests

* Clean and organize run index code

* Ruff fix

* Pyright fix

* Format fixes

* Pyright fix

* Format

* Fix integ tests

* Fix ruff

* Reorganize and clean up

* Load query from blob (microsoft#1095)

* Moved query loading from file to helper function

* added loading parquets from blob to function

* resolved adlfs async error

* debugging cleanup and small fixes

* added connection string support

* semversioner and ruff fixes

* completed testing for merge with main

* more ruff changes

* fixed unbound vars warning

* rewrote function to use storage utils

* removed unused vars

---------

Co-authored-by: Kenny Zhang <zhangken@microsoft.com>

* Update create_pipeline_config.py (microsoft#1108)

* Update create_pipeline_config.py

Order switched to ensure that user settings at runtime take precedence.

* Updated semversioner.

* release v0.3.3 (microsoft#1116)

* Deep copy txt units on local search to avoid race conditions (microsoft#1118)

* Deep copy txt units on local search to avoid race conditions

* Format

* Fix summarization including empty descriptions (microsoft#1124)

* Fix summarization including empty descriptions

* Update

* Release v0.3.4 (microsoft#1125)

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: wanhua.gu <wanhua.gu@wiz.ai>
Co-authored-by: longyunfeigu <2514553187@qq.com>
Co-authored-by: Konstantin Gukov <gukkos@gmail.com>
Co-authored-by: arjun-234 <arjun.darji@yudiz.com>
Co-authored-by: Arjun D. <103405661+arjun-234@users.noreply.github.com>
Co-authored-by: TLP <104315397+TLongP@users.noreply.github.com>
Co-authored-by: Thanh Long Phan <long.phan@dida.do>
Co-authored-by: fantom845 <77169323+fantom845@users.noreply.github.com>
Co-authored-by: Kanishk Tyagi <kanishktyagi@Kanishks-MacBook-Pro.local>
Co-authored-by: Ikko Eltociear Ashimine <eltociear@gmail.com>
Co-authored-by: guangxiangdebizi <154864206+guangxiangdebizi@users.noreply.github.com>
Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
Co-authored-by: KennyZhang1 <90438893+KennyZhang1@users.noreply.github.com>
Co-authored-by: Kenny Zhang <zhangken@microsoft.com>
Co-authored-by: Doug Orbaker <107270698+dorbaker@users.noreply.github.com>
@ka2048 ka2048 deleted the fix-local-search-with-data branch September 12, 2024 09:14
@ka2048 ka2048 restored the fix-local-search-with-data branch September 12, 2024 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants