Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Full cache #75

Closed
wants to merge 4 commits into from
Closed

Conversation

BrunoMSantos
Copy link
Collaborator

I've been holding this in different forms for quite a long time. Most of the complication and rebase conflicts were actually on the test folder and honestly I wasn't happy with any of it anyway, so I'm skipping it for now. We can at least have a discussion about the actual caching behaviour, which amounts to very little code changes in fact.

Note that I have no idea how to cleanly do the unit testing for this (ideas are welcome), but I did test it in an ad hoc fashion and it seems to work well.

I can also say that the changes did not measurably impact performance on my machine for the existing test set. For what it's worth, it actually scored better by a few seconds (cumulative) over 100 runs of make test. Then again I wasn't expecting any difference either.

@BrunoMSantos BrunoMSantos requested a review from jnikula November 23, 2021 01:36
if nmatches == 0:
self.logger.warning('No matching docstring.',
location=(self.env.docname, self.lineno))
elif nmatches > 1:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bah, this needs to be auto and ... or it would trigger in the cautodoc directive when issued with globs or multiple files.

@jnikula
Copy link
Owner

jnikula commented Nov 23, 2021

Interesting! I'm afraid I might not have time to look into it until the weekend.

@BrunoMSantos
Copy link
Collaborator Author

Interesting! I'm afraid I might not have time to look into it until the weekend.

No worries!

of cached entries that match the :rst:dir:`c:autodoc:clang` options.
Note that this includes any file outside the :data:`cautodoc_root`
search path that has been previously cached by a directive which
explicitly included it.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept thinking about this. This last sentence is a lie in the current version... Only files in or nested cautodoc_root will ever be considered in this case, even if others were cached in between.

nfiles = 0

filenames = glob.glob(os.path.join(self.env.config.cautodoc_root, '*.[ch]'))
filenames += glob.glob(os.path.join(self.env.config.cautodoc_root, '**/*.[ch]'))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, encountering a directive without filenames specified will scan all the *.[ch] files in source tree. For every directive.

The obvious improvement would be to scan all the files only once at extension setup. However, I still have really mixed feelings about on the one hand restricting to *.[ch] suffix and on the other hand scanning the entire source tree, and parsing all the files if there's just one directive without a filename specified.

Say, if you have a moderately big project, but only have a handful of files to document, you'd end up running clang on every file in the project across the entire project.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, encountering a directive without filenames specified will scan all the *.[ch] files in source tree. For every directive.

No, we take the file names on every directive, which might be a bit wasteful, but it's just walking the file system here. Clang will only be called once per file (not considering the issue you raised with parallel builds).

That said, we can take the file list only once though for a small performance improvement indeed.

The obvious improvement would be to scan all the files only once at extension setup.

At setup we don't know if there are any directives that do not list a file name. Essentially, this means that we will always have worst case parsing even if the user never uses the auto discovery explicitly, which would be against your last point and which I'm also against.

This may be worked around if we instead have to pass all the filenames we're interested in through the configuration file instead of a directory. Or maybe a white / black list of source files.

And as for the globing pattern(s), we could default to *.[ch] but make it user configurable.

However, I still have really mixed feelings about on the one hand restricting to *.[ch] suffix and on the other hand scanning the entire source tree, and parsing all the files if there's just one directive without a filename specified.

Say, if you have a moderately big project, but only have a handful of files to document, you'd end up running clang on every file in the project across the entire project.

Hmm, I don't think we can have it both ways, but that's why the documentation should be very clear that omitting the file names will likely trade performance for convenience.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say, if you have a moderately big project, but only have a handful of files to document, you'd end up running clang on every file in the project across the entire project.

Hmm, I don't think we can have it both ways, but that's why the documentation should be very clear that omitting the file names will likely trade performance for convenience.

Well, I guess a possible compromise would be to stop parsing when we get a match. But if anyone is ever going to rely on auto discovery, they should be assured that there is no ambiguity in their docs. And returning the 1st hit might miss the fact that there are multiple symbols with the same name. And the likelihood of such duplicate symbol names only increase with project size.

def run(self):
clang_args = self.__get_clang_args()
cache = self.env.temp_data.setdefault('cautodoc_cache', {})
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my limited understanding of Sphinx parallel build (sphinx-build -j N argument) is that self.env is not shared during document reading i.e. each document has its own copy, and the data gets merged at a later stage.

See documentation for parallel_read_safe under https://www.sphinx-doc.org/en/master/extdev/index.html#extension-metadata as well as env-merge-info event documentation at https://www.sphinx-doc.org/en/master/extdev/appapi.html#event-env-merge-info

I can't be certain without actually digging into Sphinx source and/or testing, but I think this implementation is still a per-document cache, not a global cache. I'd be happy to be proven wrong, but this was my conclusion also when I opted for the simple cache.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also see the todo example extension at https://www.sphinx-doc.org/en/master/development/tutorials/todo.html
I think the only ways to benefit from global cache is to either

  1. Disable parallel build in the extension setup, so we can use the cache synchronously across the documents.
    or
  2. Switch to a three phase build where the first phase simply gathers the directives and the filenames to parse, the second phase is the env-merge-info merges them, and the third phase is the doctree-resolved event where stuff gets parsed and output generated.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I didn't test with parallel build indeed so I failed to notice this. I also assumed that part was fine because this is how you handled the single file cache, which apparently does not cache as much as I thought it did. Of course this needs to be looked into before going any further.

Thanks for the references. I'll see what can be done about this.

@BrunoMSantos BrunoMSantos deleted the full-cache branch December 11, 2021 17:43
@BrunoMSantos BrunoMSantos restored the full-cache branch December 11, 2021 17:43
@BrunoMSantos
Copy link
Collaborator Author

Sorry, I didn't mean to close this, it was a confusion on my part with the automatic mirror from GitLab... still learning how it all fits together. Feel free to keep the discussion going of course.

I'll open a new PR once I sort out the parallel build issue at least.

@BrunoMSantos BrunoMSantos reopened this Dec 11, 2021
We need this method to return a bit more context to the caller so we
know when to do automatic name resolution or not. Returning a list
instead of yielding each file name also allows us to issue an additional
warning for the CAutoDocDirective's implementation when no file is
matched.
@BrunoMSantos
Copy link
Collaborator Author

BrunoMSantos commented Oct 23, 2022

Nothing to see here! I just pushed some of the smaller changes I had worked on ages ago so that they lived elsewhere other than my hard drive. I forgot github automatically updates PRs on code push... I haven't touched this in ages and don't remember what state it is in. Judging from CI, not a good one...

Maybe we should just close it for the time being?

@jnikula
Copy link
Owner

jnikula commented Oct 27, 2022

It's not like anything gets lost if we close this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants