problems with get_encodings_from_content #2206
Oy, I just realized:

    def get_encoding_from_headers(headers):
        """Returns encodings from given HTTP Header Dict.

        :param headers: dictionary to extract encoding from.
        """
        content_type = headers.get('content-type')

        if not content_type:
            return None

        content_type, params = cgi.parse_header(content_type)

        if 'charset' in params:
            return params['charset'].strip("'\"")

        if 'text' in content_type:
            return 'ISO-8859-1'

Would you accept a PR to change this to something like the following?

    def get_encoding_from_headers(headers, text_fallback='ISO-8859-1'):
        """Returns encodings from given HTTP Header Dict.

        :param headers: dictionary to extract encoding from.
        """
        content_type = headers.get('content-type')

        if not content_type:
            return None

        content_type, params = cgi.parse_header(content_type)

        if 'charset' in params:
            return params['charset'].strip("'\"")

        if text_fallback and 'text' in content_type:
            return text_fallback

That way you could write:

    encoding = get_encoding_from_headers(headers, text_fallback=None) or \
               get_encoding_from_content(raw)

when that's what you need.
There is a good reason not to do it at module scope. That function is so rarely used by 98% of requests users that triggering the compile only when they actually call it seems like the right trade-off. To return just the first match (in document order), we would presumably have to write something like:

    m1 = charset_re.search(raw)
    m2 = pragma_re.search(raw)
    m3 = xml_re.search(raw)
    # (any pattern that finds nothing gives None here, which would need handling)
    first = sorted([m1, m2, m3], key=lambda m: m.start())[0]
    return first.groups()

But that seems kind of arbitrary. On the one hand, returning all of them is also arbitrary, but if we had written it the way you want, someone else would have wanted us to write it the way it is currently written. To address your concern about get_encoding_from_headers, see #2086.
Thanks for the thorough reply, @sigmavirus24.
A typical solution is to compile them lazily (on first use) then, no? That way, when you don't need the function you don't have to pay for it, and when you do, you're not punished for it on every call. That said, out of curiosity I measured the performance impact of adding

    import re

    CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    PRAGMA_RE = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    XML_RE = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

to an otherwise empty module, and the difference was indistinguishable from noise. Did I measure wrong?
(I was even thinking of something simpler, along these lines:)

    for p in (CHARSET_RE, PRAGMA_RE, XML_RE):  # compile these lazily if you like
        match = p.search(raw)
        if match:
            return match.group(1)  # group(1) is the captured encoding name
Hm. Having a function that returns the first match, rather than all matches for all three patterns, makes the use case of getting the charset declared in the head of a large HTML document much more efficient. That seems like a primary use case for this function, in light of what I'm finding in the wild. I did a GitHub code search for get_encodings_from_content and found code like:

    if 'charset' not in content_type:
        # requests only looks at headers to detect the encoding, we must find the charset ourselves
        # we can't use req.content because regex doesn't work on bytestrings apparently
        encodings = get_encodings_from_content(req.text)
        if encodings:
            req.encoding = encodings[0]

I'm not sure if you've gotten any requests for it to behave the way it's currently written, but the way I'm proposing seems more efficient for the use cases I'm finding. Thanks for the pointer to #2086. I'll keep an eye on that and also look forward to others' feedback on this.
Our position on […]. Note that your proposed […].
As for #2086, our position is unclear: I haven't been able to have a chat with Kenneth yet.
Sorry if I didn't explain clearly: I'm not proposing using it automatically. I'm proposing a version that returns just the first match. If you're talking about #2205, that only uses it from another utility function which is documented to do so, so not in some hidden and inappropriate way.
That's what I intended; sorry if I didn't explain it as clearly as I should have.
Ok, sorry, I misunderstood the meaning of 'first'. I presumed that 'first' meant 'first to occur in the response', not first to occur in some other ordering. In that case I'm very much neutral on all of these changes. =)
I don't have time for a full reply, but wanted to respond to this:
This is how I understood "first" too.
Yeah, I saw that from the code example you gave; sorry again for not being clearer. Trying the regexes in an order prioritized by how commonly they occur, and just returning the first match, seems principled and reasonable enough, and better supports both users' and requests.utils.get_unicode_from_response's use cases.
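To make the distinction concrete, here is a contrived example (made up for illustration, not taken from the thread) where the two readings of "first" disagree:

    import re

    # the charset and xml-declaration patterns quoted earlier in the thread
    CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    XML_RE = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    raw = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n'
           '<html><head><meta charset="utf-8"></head><body>...</body></html>')

    # "first" as pattern priority: try CHARSET_RE before XML_RE
    for p in (CHARSET_RE, XML_RE):
        m = p.search(raw)
        if m:
            print(m.group(1))  # prints: utf-8
            break

    # "first to occur in the response" would instead be the XML declaration's
    # ISO-8859-1, since that declaration appears earlier in the document.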
Yeah, I'm not averse to a pattern like this:

    XML_RE = None

    def get_encodings_from_content(content):
        global XML_RE
        if XML_RE is None:
            XML_RE = re.compile('...')
You didn't show us how you measured it.
I'm not sure it's necessary though. I understand there are a lot of people using this just like you describe but I also found plenty of people using the entire list.
I wouldn't be surprised if it were. That said, I'd like to reiterate @Lukasa's assertion that requests is fundamentally an HTTP library. I have not so quietly been an advocate of not having this in requests at all. Few people can agree on how it should behave, and some of the decisions have been quite arbitrary.
LGTM
I'm new to this code and not privy to the history, so I'm just offering feedback based on what I find there now. It doesn't seem crazy to me to keep this in requests, given it's just a few lines and seems to provide useful, related, and fairly oft-needed functionality for library users who'd otherwise end up duplicating it, and maybe not getting it quite as right as requests could. But I of course defer to the maintainers' better wisdom. If requests.utils is going to keep get_encodings_from_content […]
I understand that. I'm not upset with you. This code is a holdover from before 1.0 and should have been excised then, in my opinion.
We likely won't, but every so often issues like this pop up, and the most frustrating part is that we can't easily accommodate everyone. We would love to maximize our users' happiness, but we have to reject some ideas. For example, we can't just change what get_encodings_from_content returns without breaking the people who currently rely on getting the whole list.
That's not how I would measure it. I would move the compilation step out of the function in requests/utils.py itself, and then compare how long it takes to import requests before and after.
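(A rough sketch of one way to do that; the exact harness isn't specified here:)

    # time how long it takes to import requests; run once with the re.compile
    # calls inside get_encodings_from_content, and once with them hoisted to
    # module scope in requests/utils.py, then compare the two numbers
    import time

    start = time.time()
    import requests  # noqa: F401
    print('import requests took %.4f seconds' % (time.time() - start))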
Done. Moving the re.compile calls to module scope had no effect on import speed.
I actually got literally the exact same time before and after. I'd be interested to hear if you get different results. If not: if you want to make that change, cool; otherwise, feel free to close this as WONTFIX.
My position here is basically a less negative version of @sigmavirus24's. I simply don't care about these functions. They used to be used by requests pre-1.0 and aren't now, but we left them in because some people found them and used them. They aren't documented and the maintainers don't advertise them, but ordinarily they don't represent a maintenance cost to us. However, fundamentally, all of these functions are trivial utilities. My recommendation is that, rather than change behaviour or add yet another function I'll studiously ignore, people should just copy the function out of requests.utils and use it in their own code.
I think we've come to the conclusion that we will not be fixing this, then. @Lukasa, are you okay with my closing this?
Suits me. =)
Here is the code for get_encodings_from_content in the requests.utils module. Is there any reason to recompile the regexes on every call rather than making them module-level constants?
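For reference, the function reads roughly like this (reconstructed here from the regexes quoted elsewhere in the thread; exact whitespace and docstring may differ):

    import re

    def get_encodings_from_content(content):
        """Returns encodings from given content string.

        :param content: bytestring to extract encodings from.
        """
        charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
        pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
        xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

        return (charset_re.findall(content) +
                pragma_re.findall(content) +
                xml_re.findall(content))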
More fundamentally, wouldn't it be more useful to have a function that just returned the first match, rather than a tuple of all matches, and that could therefore often stop processing the input after a few lines, rather than always having to make three passes through the entire thing? That way, if you're fetching a large HTML document using stream=True from a server that may not have sent the charset in a Content-Type header, we can write something along the lines of the sketch below, which is all we care about.
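(A sketch of the shape I mean, not an exact snippet; assume raw holds the start of the streamed body, as set up next:)

    from requests.utils import get_encodings_from_content

    # today's API: scan the whole string three times, then take the head of the list
    encodings = get_encodings_from_content(raw)
    encoding = encodings[0] if encodings else None

    # a first-match variant could instead stop at the first declaration it finds,
    # rather than making three full passes over a large document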
(Assume raw has been set already with something like the setup sketched below.)