-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Full cache #75
[RFC] Full cache #75
Conversation
hawkmoth/__init__.py
Outdated
if nmatches == 0: | ||
self.logger.warning('No matching docstring.', | ||
location=(self.env.docname, self.lineno)) | ||
elif nmatches > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bah, this needs to be auto and ...
or it would trigger in the cautodoc
directive when issued with globs or multiple files.
Interesting! I'm afraid I might not have time to look into it until the weekend. |
No worries! |
doc/directives.rst
Outdated
of cached entries that match the :rst:dir:`c:autodoc:clang` options. | ||
Note that this includes any file outside the :data:`cautodoc_root` | ||
search path that has been previously cached by a directive which | ||
explicitly included it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kept thinking about this. This last sentence is a lie in the current version... Only files in or nested cautodoc_root
will ever be considered in this case, even if others were cached in between.
nfiles = 0 | ||
|
||
filenames = glob.glob(os.path.join(self.env.config.cautodoc_root, '*.[ch]')) | ||
filenames += glob.glob(os.path.join(self.env.config.cautodoc_root, '**/*.[ch]')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, encountering a directive without filenames specified will scan all the *.[ch]
files in source tree. For every directive.
The obvious improvement would be to scan all the files only once at extension setup. However, I still have really mixed feelings about on the one hand restricting to *.[ch]
suffix and on the other hand scanning the entire source tree, and parsing all the files if there's just one directive without a filename specified.
Say, if you have a moderately big project, but only have a handful of files to document, you'd end up running clang on every file in the project across the entire project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, encountering a directive without filenames specified will scan all the
*.[ch]
files in source tree. For every directive.
No, we take the file names on every directive, which might be a bit wasteful, but it's just walking the file system here. Clang will only be called once per file (not considering the issue you raised with parallel builds).
That said, we can take the file list only once though for a small performance improvement indeed.
The obvious improvement would be to scan all the files only once at extension setup.
At setup we don't know if there are any directives that do not list a file name. Essentially, this means that we will always have worst case parsing even if the user never uses the auto discovery explicitly, which would be against your last point and which I'm also against.
This may be worked around if we instead have to pass all the filenames we're interested in through the configuration file instead of a directory. Or maybe a white / black list of source files.
And as for the globing pattern(s), we could default to *.[ch]
but make it user configurable.
However, I still have really mixed feelings about on the one hand restricting to
*.[ch]
suffix and on the other hand scanning the entire source tree, and parsing all the files if there's just one directive without a filename specified.Say, if you have a moderately big project, but only have a handful of files to document, you'd end up running clang on every file in the project across the entire project.
Hmm, I don't think we can have it both ways, but that's why the documentation should be very clear that omitting the file names will likely trade performance for convenience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say, if you have a moderately big project, but only have a handful of files to document, you'd end up running clang on every file in the project across the entire project.
Hmm, I don't think we can have it both ways, but that's why the documentation should be very clear that omitting the file names will likely trade performance for convenience.
Well, I guess a possible compromise would be to stop parsing when we get a match. But if anyone is ever going to rely on auto discovery, they should be assured that there is no ambiguity in their docs. And returning the 1st hit might miss the fact that there are multiple symbols with the same name. And the likelihood of such duplicate symbol names only increase with project size.
def run(self): | ||
clang_args = self.__get_clang_args() | ||
cache = self.env.temp_data.setdefault('cautodoc_cache', {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my limited understanding of Sphinx parallel build (sphinx-build -j N
argument) is that self.env
is not shared during document reading i.e. each document has its own copy, and the data gets merged at a later stage.
See documentation for parallel_read_safe
under https://www.sphinx-doc.org/en/master/extdev/index.html#extension-metadata as well as env-merge-info
event documentation at https://www.sphinx-doc.org/en/master/extdev/appapi.html#event-env-merge-info
I can't be certain without actually digging into Sphinx source and/or testing, but I think this implementation is still a per-document cache, not a global cache. I'd be happy to be proven wrong, but this was my conclusion also when I opted for the simple cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also see the todo example extension at https://www.sphinx-doc.org/en/master/development/tutorials/todo.html
I think the only ways to benefit from global cache is to either
- Disable parallel build in the extension setup, so we can use the cache synchronously across the documents.
or - Switch to a three phase build where the first phase simply gathers the directives and the filenames to parse, the second phase is the
env-merge-info
merges them, and the third phase is thedoctree-resolved
event where stuff gets parsed and output generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I didn't test with parallel build indeed so I failed to notice this. I also assumed that part was fine because this is how you handled the single file cache, which apparently does not cache as much as I thought it did. Of course this needs to be looked into before going any further.
Thanks for the references. I'll see what can be done about this.
eb9a0a5
to
f47b8df
Compare
Sorry, I didn't mean to close this, it was a confusion on my part with the automatic mirror from GitLab... still learning how it all fits together. Feel free to keep the discussion going of course. I'll open a new PR once I sort out the parallel build issue at least. |
We need this method to return a bit more context to the caller so we know when to do automatic name resolution or not. Returning a list instead of yielding each file name also allows us to issue an additional warning for the CAutoDocDirective's implementation when no file is matched.
f47b8df
to
7adf147
Compare
Nothing to see here! I just pushed some of the smaller changes I had worked on ages ago so that they lived elsewhere other than my hard drive. I forgot github automatically updates PRs on code push... I haven't touched this in ages and don't remember what state it is in. Judging from CI, not a good one... Maybe we should just close it for the time being? |
It's not like anything gets lost if we close this one. |
I've been holding this in different forms for quite a long time. Most of the complication and rebase conflicts were actually on the test folder and honestly I wasn't happy with any of it anyway, so I'm skipping it for now. We can at least have a discussion about the actual caching behaviour, which amounts to very little code changes in fact.
Note that I have no idea how to cleanly do the unit testing for this (ideas are welcome), but I did test it in an ad hoc fashion and it seems to work well.
I can also say that the changes did not measurably impact performance on my machine for the existing test set. For what it's worth, it actually scored better by a few seconds (cumulative) over 100 runs of
make test
. Then again I wasn't expecting any difference either.