-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Close #6525: linkcheck: Add linkcheck_ignore_redirects and linkcheck_warn_redirects #9234
Changes from all commits
05eb2ca
707319a
ce9e2e6
c97e9b1
4201a84
fe89a07
676834b
48c80b1
2887dd0
988a79d
5e5bca9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -272,8 +272,12 @@ def process_result(self, result: CheckResult) -> None: | |
except KeyError: | ||
text, color = ('with unknown code', purple) | ||
linkstat['text'] = text | ||
logger.info(color('redirect ') + result.uri + | ||
color(' - ' + text + ' to ' + result.message)) | ||
if self.config.linkcheck_allowed_redirects: | ||
logger.warning('redirect ' + result.uri + ' - ' + text + ' to ' + | ||
result.message, location=(filename, result.lineno)) | ||
else: | ||
logger.info(color('redirect ') + result.uri + | ||
color(' - ' + text + ' to ' + result.message)) | ||
self.write_entry('redirected ' + text, result.docname, filename, | ||
result.lineno, result.uri + ' to ' + result.message) | ||
else: | ||
|
@@ -496,13 +500,23 @@ def check_uri() -> Tuple[str, str, int]: | |
new_url = response.url | ||
if anchor: | ||
new_url += '#' + anchor | ||
# history contains any redirects, get last | ||
if response.history: | ||
|
||
if allowed_redirect(req_url, new_url): | ||
return 'working', '', 0 | ||
tk0miya marked this conversation as resolved.
Show resolved
Hide resolved
|
||
elif response.history: | ||
# history contains any redirects, get last | ||
code = response.history[-1].status_code | ||
return 'redirected', new_url, code | ||
else: | ||
return 'redirected', new_url, 0 | ||
|
||
def allowed_redirect(url: str, new_url: str) -> bool: | ||
for from_url, to_url in self.config.linkcheck_allowed_redirects.items(): | ||
if from_url.match(url) and to_url.match(new_url): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The specification from the issue also asks for a mean to reuse match groups from the IIUC, the issue asks something similar to: from_match = from_url.match(url)
if from_match:
to_url = from_match.expand(to_url)
if to_url.match(new_url):
return True I am unsure if that is feasible at all. From the
>>> import re
>>> to_url = r"^https://example.org/\w{2}/\1"
>>> match = re.match(r'https://example\.org/(.*)', 'https://example.org/the-real-page/')
>>> match.expand(to_url)
Traceback (most recent call last):
File "/usr/lib/python3.9/sre_parse.py", line 1039, in parse_template
this = chr(ESCAPES[this][1])
KeyError: '\\w'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/re.py", line 322, in _expand
template = sre_parse.parse_template(template, pattern)
File "/usr/lib/python3.9/sre_parse.py", line 1042, in parse_template
raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \w at position 21 There’s always the option of documenting that the canonical URI cannot contain special sequences (e.g. Another idea to tighten the validation would be to use named groups in both the source URI and canonical URI, and compare the named groups. Something like: # conf.py
linkcheck_ignore_redirects = {
'https://sphinx-doc.org/en/(?P<page>.*)': 'https://sphinx-doc.org/en/master/(?P<page>.*)',
}
# linkcheck
from_match = from_url.match(url)
to_match = to_url.match(to)
if from_match.groupdict() == to_match.group_dict():
return True That may not be as robust as one might expect. For example, I can imagine somebody documenting their regexp with capturing groups. IDK if the extra validation is worth the effort? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I would be inclined to use There may also need to be a regex and non-regex version because the need to escape every There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
For a page that does not exist, the server should return a 404 in the first place. I understand not all web servers behave that way, but let’s keep in mind we’re dealing with poorly configured web servers here. I wouldn’t advise implementing a brittle solution to work around bad behaving servers. I consider the
The comparison of groups seems more robust to me. One can adjust the groups to capture the same portion on both sides. It needs some work to experiment (e.g. try to implement the validation in the #6525 with that system). I’ll try it in the days to come.
I think it would mostly be confusing. There would be two ways of doing the same thing. Besides, there’s a strong probability the unexpected redirect URL will be different enough that the regexp won’t match, even if There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As @nomis says, the idea here is that it is possible to accept redirect targets matching a certain pattern to be considered "working" and, if they don't match, then they should still show up as "redirected". I've had a play with the example that @francoisfreitag posted and think that we can make this work by escaping non-backreference escapes: >>> import re
>>> to_url = r"^https://example.org/\w{2}/\1"
>>> match = re.match(r'https://example\.org/(.*)', 'https://example.org/the-real-page/')
>>> match.expand(to_url)
Traceback (most recent call last):
File "/usr/lib/python3.9/sre_parse.py", line 1039, in parse_template
this = chr(ESCAPES[this][1])
KeyError: '\\w'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/re.py", line 322, in _expand
template = sre_parse.parse_template(template, pattern)
File "/usr/lib/python3.9/sre_parse.py", line 1042, in parse_template
raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \w at position 21
>>> to_url2 = re.sub(r'(\\[^0-9g])', r'\\\1', to_url)
>>> to_url
'^https://example.org/\\w{2}/\\1'
>>> to_url2
'^https://example.org/\\\\w{2}/\\1'
>>> match.expand(to_url2)
'^https://example.org/\\w{2}/the-real-page/' So |
||
return True | ||
|
||
return False | ||
|
||
def check(docname: str) -> Tuple[str, str, int]: | ||
# check for various conditions without bothering the network | ||
if len(uri) == 0 or uri.startswith(('#', 'mailto:', 'tel:')): | ||
|
@@ -667,11 +681,25 @@ def rewrite_github_anchor(app: Sphinx, uri: str) -> Optional[str]: | |
return None | ||
|
||
|
||
def compile_linkcheck_allowed_redirects(app: Sphinx, config: Config) -> None: | ||
"""Compile patterns in linkcheck_allowed_redirects to the regexp objects.""" | ||
for url, pattern in list(app.config.linkcheck_allowed_redirects.items()): | ||
try: | ||
app.config.linkcheck_allowed_redirects[re.compile(url)] = re.compile(pattern) | ||
except re.error as exc: | ||
logger.warning(__('Failed to compile regex in linkcheck_allowed_redirects: %r %s'), | ||
exc.pattern, exc.msg) | ||
finally: | ||
# Remove the original regexp-string | ||
app.config.linkcheck_allowed_redirects.pop(url) | ||
|
||
|
||
def setup(app: Sphinx) -> Dict[str, Any]: | ||
app.add_builder(CheckExternalLinksBuilder) | ||
app.add_post_transform(HyperlinkCollector) | ||
|
||
app.add_config_value('linkcheck_ignore', [], None) | ||
app.add_config_value('linkcheck_allowed_redirects', {}, None) | ||
app.add_config_value('linkcheck_auth', [], None) | ||
app.add_config_value('linkcheck_request_headers', {}, None) | ||
app.add_config_value('linkcheck_retries', 1, None) | ||
|
@@ -684,6 +712,8 @@ def setup(app: Sphinx) -> Dict[str, Any]: | |
app.add_config_value('linkcheck_rate_limit_timeout', 300.0, None) | ||
|
||
app.add_event('linkcheck-process-uri') | ||
|
||
app.connect('config-inited', compile_linkcheck_allowed_redirects, priority=800) | ||
app.connect('linkcheck-process-uri', rewrite_github_anchor) | ||
|
||
return { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
exclude_patterns = ['_build'] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
`local server1 <http://localhost:7777/path1>`_ | ||
`local server2 <http://localhost:7777/path2>`_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's no regex capture then this isn't checking what it implies it's checking. To allow for a more complete example without comparing the source and destination paths, I've added an allowance for redirecting to
www.
but this could be removed too.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the example should use patterns for both source and canonical URIs. So I think the current one is enough to describe the feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is not just allowing redirects to the canonical version of a URL, it's allowing any redirects.
You may as well use
r'https://sphinx-doc\.org/.*': r'https://sphinx-doc\.org/.*'
because there isn't really any validation of the destination.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? The current rule verifies the pages under
https://sphinx-doc.org/
to the URL underhttps://sphinx-doc.org/en/master/
. It does not mean "any redirects".There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's allowing practically any redirect, so not what I would consider a good example.
It's not far off allowing
https://www\.example\.com/.*
tohttps://www\.example\.com/.*
which then causes linkcheck to ignore removed pages that just redirect to the home page.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If
https://sphinx-doc.org/foo/bar
redirects tohttps://sphinx-doc.org/other/path
, it's not "allowed". Is this allowing any redirect? I'm confused.