Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix recursion error during file attachment parsing of deep nested paths #173

Merged
merged 15 commits into from
May 9, 2023

Conversation

ionite34
Copy link
Member

@ionite34 ionite34 commented Mar 12, 2023

Fixes: #172

  • Now catches the potential RecursionError from glob / pathlib. On such event we return an exit code None, and stdout:
FileParsingError: Exceeded directory depth limit while parsing attachments

@coveralls
Copy link

coveralls commented Mar 13, 2023

Coverage Status

Coverage: 90.845% (+5.6%) from 85.255% when pulling 90910bd on file-scan-recursion-fix into 9acc6f5 on main.

@ionite34 ionite34 marked this pull request as ready for review March 13, 2023 18:23
},
timeout=self.files_timeout,
)
with time_limit(self.files_timeout):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test that asserts this time outs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a test parsing symlink files for timeout, but that doesn't determine if the timeout occured due to the time.monotic() comparison timeout or by the signal. The signal timeout would be if a very large single file was parsed, or pathlib ends up in some loop.

But also added 2 new tests for time_limit successfully interrupting time.sleep and a long-running built-in C function (math.factorial) in 1af5bf0

Comment on lines +29 to +34
signal.alarm(timeout)

try:
yield
finally:
signal.alarm(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever. I don't know enough about signals to say whether this is an appropriate use, but I think it's low risk to try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to work pretty accurately for interrupting time.sleep and built-in C functions in these tests 1af5bf0, so it seems okay for our usecase, and avoids the potential broken pipes issue with multiprocessing.

start_time = time.monotonic()
added = 0
files = glob.iglob(pattern, root_dir=str(self.output), recursive=True, include_hidden=False)
for file in (Path(self.output, f) for f in files):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elegant use of a generator expression here!


def files_list(
self,
limit: int,
pattern: str,
exclude_files: dict[Path, float] | None = None,
preload_dict: bool = False,
timeout: float | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on using float("inf") (aka math.inf) instead of None to specify no timeout? Maybe a bit odd at first glance, but I think it makes sense, and it does avoid dealing with None checks.

Copy link
Member Author

@ionite34 ionite34 Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the None is still useful on performance grounds since we currently bypass the time.monotonic call and comparison when timeout is None.

@ionite34 ionite34 requested a review from MarkKoz March 18, 2023 18:56
@@ -289,7 +289,7 @@ def python3(
args, None, "TimeoutError: Exceeded time limit while parsing attachments"
)
except Exception as e:
log.error(f"Unexpected {type(e).__name__} while parse attachments: {e}")
log.exception(f"Unexpected {type(e).__name__} while parse attachments", exc_info=e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exc_info is redundant when log.exception is used within an except block.

@ChrisLovering ChrisLovering merged commit 9804a10 into main May 9, 2023
@ChrisLovering ChrisLovering deleted the file-scan-recursion-fix branch May 9, 2023 21:08
@MarkKoz MarkKoz mentioned this pull request May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BOT-3EN: snekbox hits a recusion exceded exception in pathlib when parsing lots of nested directories and 500s
4 participants