-
-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: chunk initial symlinks on re-ingest #882
Conversation
WalkthroughThe changes in the pull request modify the Changes
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (6)
src/program/program.py (6)
400-402
: Consider memory optimization for large librariesConverting the entire generator to a list using
items_list = [item for item in items]
could consume significant memory for large libraries. Consider processing the generator in chunks directly.-items_list = [item for item in items if isinstance(item, (Movie, Show))] -total_items = len(items_list) +items_list = [] +for item in items: + if isinstance(item, (Movie, Show)): + items_list.append(item) +total_items = len(items_list)
407-415
: Consider making chunk size configurableThe chunk size is hardcoded to 100 items. Consider making it configurable via environment variable or settings, similar to how workers are configured, to allow tuning based on system resources and library size.
-chunk_size = 100 +chunk_size = int(os.getenv("SYMLINK_CHUNK_SIZE", 100))
426-428
: Improve duplicate symlink error messageThe error message for duplicate symlinks could be more informative by including both the duplicate paths.
-errors.append(f"Duplicate symlink directory found for {item.log_string}") +errors.append(f"Duplicate symlink directory found for {item.log_string} with IMDB ID: {item.imdb_id}")
434-438
: Add retry mechanism for failed enhancementsConsider implementing a retry mechanism for items that fail enhancement, as temporary network issues could cause failures.
+from tenacity import retry, stop_after_attempt, wait_exponential +@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10)) def _enhance_item(self, item: MediaItem) -> MediaItem | None: try: enhanced_item = next(self.services[TraktIndexer].run(item, log_msg=False)) return enhanced_item except StopIteration: return None
Line range hint
454-465
: Enhance error reporting with statisticsConsider adding error statistics to provide a better overview of the initialization process.
if errors: logger.error("Errors encountered during initialization") + error_stats = { + "duplicate_symlinks": len([e for e in errors if "Duplicate symlink" in e]), + "duplicate_db": len([e for e in errors if "Duplicate item found in database" in e]), + "enhancement_failures": len([e for e in errors if "Failed to enhance" in e]), + "invalid_directories": len([e for e in errors if "not a valid directory" in e]) + } + logger.error(f"Error statistics: {error_stats}") for error in errors: logger.error(error)
Line range hint
400-465
: Add tests for chunked processing implementationAs mentioned in the PR description, tests are missing for these changes. Consider adding the following test cases:
- Test chunked processing with various chunk sizes
- Test duplicate handling
- Test error scenarios
- Test progress reporting
Would you like me to help create these test cases?
src/program/program.py
Outdated
# Commit after each chunk | ||
session.commit() | ||
|
||
progress.update(task, log="Finished Indexing Symlinks!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider adding transaction rollback on errors
While committing after each chunk is good, consider adding explicit transaction rollback on errors to ensure data consistency.
-# Commit after each chunk
-session.commit()
+try:
+ # Commit after each chunk
+ session.commit()
+except Exception as e:
+ session.rollback()
+ logger.error(f"Failed to commit chunk: {e}")
+ errors.append(f"Failed to commit chunk: {e}")
Committable suggestion skipped: line range outside the PR's diff.
I cant test the rollback part as I dont have a way to force it to break 🤔 But I tested this on my 30k item library and it finished within 1 minute. 🥳 |
Pull Request Check List
Resolves: #issue-number-here
Description:
Summary by CodeRabbit
New Features
Bug Fixes