wikipedia: query parameters cause URL parse failure #2412
Comments
Note that this seems to happen for any query parameters, not just the key/val pair I noticed:
This does not seem to occur if there is no anchor, so this may be an interaction between query parameters and anchors
On the other hand:
>>> import re
>>> p = re.compile(r'https?:\/\/([a-z]+(?:\.m)?\.wikipedia\.org)\/wiki\/((?!File\:)[^ #]+)#?([^ ]*)')
>>> m = p.match('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form')
>>> m.groups()
('en.wikipedia.org', 'Gauss%27s_law?useskin=vector', 'Differential_form')
This is the regex used in the plugin; you can see that the second group is now 'Gauss%27s_law?useskin=vector', query string included. If I remove the anchor part, here is the result:
>>> m2 = p.match('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector')
>>> m2.groups()
('en.wikipedia.org', 'Gauss%27s_law?useskin=vector', '')
One option is to take only the part of the page title before any ?. The other option is to modify the regex such that the title portion no longer matches the query string.
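A minimal sketch of the second option, assuming the title character class is extended to stop at a literal `?` and an optional, uncaptured query string is allowed before the anchor. The name `PATTERN` and this exact pattern are illustrative assumptions, not the plugin's actual fix.

```python
import re

# Hypothetical variant of the plugin's pattern (an assumption, not the real fix):
# the title class [^ #?]+ stops at "?", and (?:\?[^ #]*)? skips over an
# optional query string without capturing it.
PATTERN = re.compile(
    r'https?:\/\/([a-z]+(?:\.m)?\.wikipedia\.org)\/wiki\/'
    r'((?!File\:)[^ #?]+)(?:\?[^ #]*)?#?([^ ]*)'
)

with_anchor = PATTERN.match(
    'https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form'
)
without_anchor = PATTERN.match(
    'https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector'
)
# A "?" that belongs to the title is percent-encoded as %3F in the URL,
# so it is not mistaken for the start of a query string.
encoded_title = PATTERN.match(
    'https://en.wikipedia.org/wiki/%3FOryzomys_pliocaenicus'
)

print(with_anchor.groups())
print(without_anchor.groups())
print(encoded_title.groups())
```

With this variant the second group holds only the title in all three cases, at the cost of a longer pattern that still has to anticipate every URL component.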
To be honest, the best way is probably to stop doing regex for that and use the good old urlparse:
>>> from urllib.parse import urlparse
>>> urlparse('https://en.wikipedia.org/wiki/%3FOryzomys_pliocaenicus')
ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/%3FOryzomys_pliocaenicus', params='', query='', fragment='')
>>> urlparse('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector')
ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/Gauss%27s_law', params='', query='useskin=vector', fragment='')
>>> urlparse('https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form')
ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/Gauss%27s_law', params='', query='useskin=vector', fragment='Differential_form')
And then use a regex to work on the path only. This way, we would not have to use a regex to detect the anchor, the query parameters, or what have you, and would only work on the path.
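The parse-first approach described above could look roughly like the following sketch. The names `HOST_RE`, `PATH_RE`, and `parse_wikipedia_url` are hypothetical and not taken from the plugin; the host and `File:` rules are simply carried over from the regex quoted earlier in the thread.

```python
import re
from urllib.parse import urlparse

# Hypothetical names, for illustration only.
HOST_RE = re.compile(r'^[a-z]+(?:\.m)?\.wikipedia\.org$')
PATH_RE = re.compile(r'^/wiki/((?!File:).+)$')


def parse_wikipedia_url(url):
    """Return (host, title, section), or None if url is not an article link."""
    parts = urlparse(url)
    if parts.scheme not in ('http', 'https'):
        return None
    if not HOST_RE.match(parts.netloc):
        return None
    match = PATH_RE.match(parts.path)
    if match is None:
        return None
    # The query string (parts.query) is simply ignored; the title stays
    # percent-encoded, as in the regex results shown in the thread.
    return parts.netloc, match.group(1), parts.fragment


print(parse_wikipedia_url(
    'https://en.wikipedia.org/wiki/Gauss%27s_law?useskin=vector#Differential_form'
))
```

Because `urlparse` has already split off the query and fragment, the path regex only has to recognize the `/wiki/` prefix and reject `File:` pages. The sketch does not handle ports or uppercase host names in `netloc`.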
You know what? You're right. Our plugins (built-in and external) do not take enough advantage of Pattern for Wikipedia article links can be something simple, like just the first part that excludes |
I agree re: using However, I agree with the idea that if a plugin wants to deal with a URL, it would be much more appropriate to hand over a |
…2412) Co-authored-by: dgw <dgw@technobabbl.es> Co-authored-by: Florian Strzelecki <florian.strzelecki@gmail.com>
Description
Wikipedia URLs that contain ?query=params can confuse Sopel, leading to a KeyError:
Reproduction steps
wikipedia plugin active
Expected behavior
The same article-retrieval behavior as a Wikipedia URL without query parameters:
Relevant logs
No response
Notes
Exception details from the error log:
Sopel version
7.1.9 (also happens on master)
Installation method
pip install
Python version
3.9.9
Operating system
Ubuntu 20.04
IRCd
No response
Relevant plugins
wikipedia