problems with get_encodings_from_content #2206

Closed
jab opened this issue Sep 5, 2014 · 16 comments

jab commented Sep 5, 2014

Here is the code for get_encodings_from_content in the requests.utils module:

def get_encodings_from_content(content):
    """Returns encodings from given content string.

    :param content: bytestring to extract encodings from.
    """

    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))

Is there any reason to recompile the regexes on every call rather than making them module-level constants?

More fundamentally, wouldn't it be more useful to have a function that returns just the first match, rather than a list of all matches? Such a function could often stop processing the input after a few lines, rather than always making three passes through the entire document. That way, if you're fetching a large HTML document using stream=True from a server that may not have sent the charset in a Content-Type header, you can write:

encoding = get_encoding_from_headers(res.headers) or \
    get_encoding_from_content(raw)  # just get the first encoding declared in content, if any

which is all we care about. (A sketch of such a helper appears at the end of this comment.)

(Assume raw has been set already with something like:

from io import BytesIO

bio = BytesIO()
bytesread = 0
for chunk in res.iter_content(chunk_size=CHUNK_SIZE):
    bio.write(chunk)
    bytesread += len(chunk)
    if bytesread > RES_MAX_BYTES:
        return _too_big()  # bail out once the response exceeds our size cap
raw = bio.getvalue()

)
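
For concreteness, here is a minimal sketch of the kind of first-match helper I mean. The name get_first_encoding_from_content and the fixed priority order are illustrative only, not anything requests currently provides:

import re

# Illustrative sketch: the same patterns requests.utils already uses, but
# stop at the first match instead of making three full passes over content.
CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
PRAGMA_RE = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
XML_RE = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

def get_first_encoding_from_content(content):
    """Return the first encoding declared in content, or None."""
    for pattern in (CHARSET_RE, PRAGMA_RE, XML_RE):
        match = pattern.search(content)
        if match:
            return match.group(1)
    return None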

jab commented Sep 5, 2014

Oy, I just realized get_encoding_from_headers() returns a fallback value of 'ISO-8859-1' if no encoding was found in the headers, and there's no way to control this behavior:

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

Would you accept a PR to change this to something like:

def get_encoding_from_headers(headers, text_fallback='ISO-8859-1'):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :param text_fallback: encoding to assume for text content types that
        declare no charset; pass None to disable the fallback.
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if text_fallback and 'text' in content_type:
        return text_fallback

?

That way you could write:

encoding = get_encoding_from_headers(headers, text_fallback=None) or \
        get_encoding_from_content(raw)

when that's what you need.

sigmavirus24 (Contributor) commented:

There is a good reason not to do it at module scope. That function is so rarely used (98% of requests users never touch it) that triggering the compile when they import requests would hurt their performance for no real benefit. That said, how would you determine which was the first encoding in the body? You could do something like:

m1 = charset_re.search(raw)
m2 = pragma_re.search(raw)
m3 = xml_re.search(raw)

# drop patterns that found nothing, then take the earliest match by position
matches = [m for m in (m1, m2, m3) if m is not None]
if not matches:
    return None
first = sorted(matches, key=lambda m: m.start())[0]

return first.groups()

But that seems kind of arbitrary. Returning all of them is also arbitrary, but if we had written it the way you want, someone else would have wanted us to write it the way it is currently written.

To address your concern about get_encoding_from_headers: that behavior is no longer conformant with RFC 7230, and we have plans to change it in #2086. Right now we could accept that pull request, but I'm not sure it's desirable. I'll have to think about it, and @Lukasa would have to weigh in too.

jab commented Sep 5, 2014

Thanks for the thorough reply, @sigmavirus24.

There is a good reason not to do it at module scope. That function is so rarely used (98% of requests users never touch it) that triggering the compile when they import requests would hurt their performance for no real benefit.

A typical solution is to compile them lazily (on first use) then, no? That way, when you don't need the function, you don't pay for it, and when you do, you're not punished on every call.
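
(A concrete sketch of what I mean, with illustrative names: compilation happens on the first call rather than at import time.)

import re

_ENCODING_RES = None  # compiled lazily so `import requests` pays nothing

def _encoding_res():
    # Compile and cache the patterns on first use; later calls reuse them.
    global _ENCODING_RES
    if _ENCODING_RES is None:
        _ENCODING_RES = (
            re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I),
            re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I),
            re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]'),
        )
    return _ENCODING_RES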

That said, out of curiosity I measured the performance impact of adding

import re
CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
PRAGMA_RE = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
XML_RE = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

to an otherwise empty module, and the difference was indistinguishable from noise. Did I measure wrong?

That said, how would you determine which was the first encoding in the body? You could do something like...

(I was even thinking something simpler, something like:

for p in (CHARSET_RE, PRAGMA_RE, XML_RE):  # compile these lazily if you like
    match = p.search(raw)
    if match:
        return match.group(1)

)

But that seems kind of arbitrary. Returning all of them is also arbitrary, but if we had written it the way you want, someone else would have wanted us to write it the way it is currently written.

Hm. Having a function that returns the first match, rather than all matches for all three patterns, would make the use case of getting the charset declared in the head of a large HTML document much more efficient. That seems like a primary use case for this function, in light of get_unicode_from_response's docstring (though see #2205). So it's not arbitrary. At the least, an additional get_first_encoding_from_content function could be offered to facilitate this use case.

I did a github code search for get_encodings_from_content to see how people are using this and so far most are using it like this:

if 'charset' not in content_type:
    # requests only looks at headers to detect the encoding, we must find the charset ourselves
    # we can't use req.content because regex doesn't work on bytestrings apparently
    encodings = get_encodings_from_content(req.text)
    if encodings:
        req.encoding = encodings[0]

I'm not sure if you've gotten any requests for it to be the way it's currently written, but the way I'm proposing seems more efficient for the use cases I'm finding.
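
(With a first-match helper like the one sketched in my first comment, that found snippet could collapse to something like the following; illustrative only:)

if 'charset' not in content_type:
    encoding = get_first_encoding_from_content(req.text)
    if encoding:
        req.encoding = encoding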

Thanks for the pointer to #2086. I'll keep an eye on that and also look forward to others' feedback on this.

Lukasa commented Sep 5, 2014

Our position on get_encodings_from_content is very clear: we don't expect to use it automatically. Requests is an HTTP library: we should not be reaching into body content without the user's say-so. We provide get_encodings_from_content as a helper function, but I would be strongly opposed to ever using it automatically.

Note that your proposed get_first_encoding_from_content does not check which one occurs first, it prioritises them. That's a very different thing to do.

Lukasa commented Sep 5, 2014

As for #2086, our position is unclear: I haven't been able to have a chat with Kenneth yet.

jab commented Sep 5, 2014

Our position on get_encodings_from_content is very clear: we don't expect to use it automatically. Requests is an HTTP library: we should not be reaching into body content without the user's say-so. We provide get_encodings_from_content as a helper function, but I would be strongly opposed to ever using it automatically.

Sorry if I didn't explain clearly: I'm not proposing using it automatically, I'm proposing a version that returns just the first match. If you're talking about #2205, that only uses it from another utility function which is documented to do so, not in some hidden and inappropriate way.

Note that your proposed get_first_encoding_from_content does not check which one occurs first, it prioritises them. That's a very different thing to do.

That's what I intended, sorry if I didn't explain it as clearly as I should have.

Lukasa commented Sep 5, 2014

Ok, sorry, I misunderstood the meaning of 'first'. I presumed that 'first' meant 'first to occur in the response', not first to occur in some other ordering. In that case I'm very much neutral on all of these changes. =)

sigmavirus24 (Contributor) commented:

I don't have time for a full reply but wanted to respond to this:

I misunderstood the meaning of 'first'. I presumed that 'first' meant 'first to occur in the response', not first to occur in some other ordering.

This is how I understood "first" too.

jab commented Sep 5, 2014

Yeah, I saw that from the code example you gave; sorry again for not being clearer. Trying the regexes in an order prioritized by how commonly they occur, and just returning the first match, seems principled and reasonable enough, and it better supports both users' use cases and requests.utils.get_unicode_from_response's.

sigmavirus24 (Contributor) commented:

A typical solution is to compile them lazily (on first use) then, no? That way, when you don't need the function, you don't pay for it, and when you do, you're not punished on every call.

Yeah, I'm not averse to a pattern like this:

XML_RE = None

def get_encodings_from_content(content):
    global XML_RE
    if XML_RE is None:
        XML_RE = re.compile('...')

That said, out of curiosity I measured the performance impact [...] Did I measure wrong?

You didn't show us how you measured it.

At least, an additional get_first_encoding_from_content function could be offered to facilitate this use case.

I'm not sure it's necessary though. I understand there are a lot of people using this just like you describe but I also found plenty of people using the entire list.

I'm not sure if you've gotten any requests for it to be the way it's currently written, but the way I'm proposing seems more efficient for the use cases I'm finding.

I wouldn't be surprised if it were. That said, I'd like to reiterate @Lukasa's assertion that requests is fundamentally an HTTP library. I have not so quietly been an advocate of not having this in requests.utils at all. Further, a function like the one you describe is something I don't feel comfortable including. If we return the first of the three that match, then someone will want to reorder the list of regular expressions we use. That will only lead to pain for @Lukasa and myself. The whole soup-y mess that consistently results from the get_encodings_from_content function is enough to make me want to push to remove it yet again.

Few people can agree on how it should behave, and some of the decisions have been quite arbitrary.

jab commented Sep 6, 2014

Yeah, I'm not averse to a pattern like this: [...]

LGTM

You didn't show us how you measured it.

python -m timeit 'import foo' where foo.py was blank vs. having the 3 compile calls.

I have not so quietly been an advocate of not having this in requests.utils at all. [...]
Few people can agree on how it should behave, and some of the decisions have been quite arbitrary.

I'm new to this code / not privy to the history, so just offering feedback based on what I find there now. It doesn't seem crazy to me to keep this in: it's just a few lines, and it provides useful, related, and fairly oft-needed functionality for library users who would otherwise have to duplicate it, and maybe not get it quite as right as requests could. But I of course defer to the maintainers' better wisdom. If requests.utils is going to keep get_unicode_from_response and friends, it seems reasonable to make this change; but if you're going to get rid of them due to maintenance burden exceeding benefit, that's totally understandable.

sigmavirus24 (Contributor) commented:

I'm new to this code / not privy to the history, so just offering feedback based on what I find there now.

I understand that. I'm not upset with you. This code is a holdover from before 1.0 and, in my opinion, should have been excised then.

if you're going to get rid of them due to maintenance burden exceeding benefit

We likely won't, but every so often issues like this pop up, and the most frustrating part is that we can't easily accommodate everyone. We would love to maximize our users' happiness, but we have to reject some ideas. For example, we can't just change what get_encodings_from_content returns. We could add get_first_encoding_from_content, but that implementation would inevitably spark debate about the best approach. Introducing more points of contention is exactly the last thing I want for requests.

python -m timeit 'import foo' where foo.py was blank vs. having the 3 compile calls.

That's not how I would measure it. I would move the compilation step out of the function in requests.utils and then python -m timeit "import requests".

jab commented Sep 6, 2014

I would move the compilation step out of the function in requests.utils and then python -m timeit "import requests"

Done. Moving the re.compile calls to module scope had no effect on import speed.

> python -m timeit 'import requests'
1000000 loops, best of 3: 0.478 usec per loop
> vim path/to/requests/utils.py  # move compile calls to module scope
> python -m timeit 'import requests'
1000000 loops, best of 3: 0.478 usec per loop

I actually got literally the exact same time before and after.

Would be interested to hear if you get different results. If not, if you want to make that change, cool, otherwise, feel free to close this as WONTFIX.

Lukasa commented Sep 6, 2014

My position here is basically a less negative version of @sigmavirus24's. I simply don't care about these functions. They used to be used by requests pre-1.0 and aren't now, but we left them in because some people found them and used them. They aren't documented and the maintainers don't advertise them, but ordinarily they don't represent a maintenance cost to us.

However, fundamentally, all of these functions are trivial utilities. My recommendation is that, rather than change behaviour or add yet another function I'll studiously ignore, people should just copy the function out of utils.py and change it to do what they need it to do. It should take all of 10 seconds. =)

sigmavirus24 (Contributor) commented:

I think we've come to the conclusion that we will not be fixing this then. @Lukasa are you okay with my closing this?

Lukasa commented Sep 11, 2014

Suits me. =)

github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2021