problems with get_encodings_from_content #2206

Closed
jab opened this issue Sep 5, 2014 · 16 comments

jab commented Sep 5, 2014

Here is the code for get_encodings_from_content in the requests.utils module:

def get_encodings_from_content(content):
    """Returns encodings from given content string.

    :param content: bytestring to extract encodings from.
    """

    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))

Is there any reason to recompile the regexes on every call rather than making them module-level constants?

More fundamentally, wouldn't it be more useful to have a function that returns just the first match, rather than a list of all matches? Such a function could often stop processing the input after a few lines, rather than always making three passes through the entire document. That way, if you're fetching a large HTML document using stream=True from a server that may not have sent the charset in a Content-Type header, you can write:

encoding = get_encoding_from_headers(res.headers) or \
    get_encoding_from_content(raw)  # just get the first encoding declared in content, if any

which is all we care about. (A sketch of such a helper appears at the end of this comment.)

(Assume raw has been set already with something like:

from io import BytesIO

bio = BytesIO()
bytesread = 0
for chunk in res.iter_content(chunk_size=CHUNK_SIZE):
    bio.write(chunk)
    bytesread += len(chunk)
    if bytesread > RES_MAX_BYTES:
        return _too_big()  # bail out once the response exceeds our size cap
raw = bio.getvalue()

)
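
For concreteness, here is a minimal sketch of the kind of first-match helper I mean. The name get_first_encoding_from_content and the fixed priority order are illustrative only, not anything requests currently provides:

import re

# Illustrative sketch: the same patterns requests.utils already uses, but
# stop at the first match instead of making three full passes over content.
CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
PRAGMA_RE = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
XML_RE = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

def get_first_encoding_from_content(content):
    """Return the first encoding declared in content, or None."""
    for pattern in (CHARSET_RE, PRAGMA_RE, XML_RE):
        match = pattern.search(content)
        if match:
            return match.group(1)
    return None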

jab commented Sep 5, 2014

Oy, I just realized get_encoding_from_headers() returns a fallback value of 'ISO-8859-1' if no encoding was found in the headers, and there's no way to control this behavior:

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

Would you accept a PR to change this to something like:

def get_encoding_from_headers(headers, text_fallback='ISO-8859-1'):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :param text_fallback: encoding to assume for text content types that
        declare no charset; pass None to disable the fallback.
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if text_fallback and 'text' in content_type:
        return text_fallback

?

That way you could write:

encoding = get_encoding_from_headers(headers, text_fallback=None) or \
        get_encoding_from_content(raw)

when that's what you need.

sigmavirus24 (Contributor) commented:

There is a good reason not to do it at module scope. That function is so rarely used (98% of requests users never touch it) that triggering the compile when they import requests would hurt their performance for no real benefit. That said, how would you determine which was the first encoding in the body? You could do something like:

m1 = charset_re.search(raw)
m2 = pragma_re.search(raw)
m3 = xml_re.search(raw)

# drop patterns that found nothing, then take the earliest match by position
matches = [m for m in (m1, m2, m3) if m is not None]
if not matches:
    return None
first = sorted(matches, key=lambda m: m.start())[0]

return first.groups()

But that seems kind of arbitrary. Returning all of them is also arbitrary, but if we had written it the way you want, someone else would have wanted us to write it the way it is currently written.

To address your concern about get_encoding_from_headers: that behavior is no longer conformant with RFC 7230, and we have plans to change it in #2086. Right now we could accept that pull request, but I'm not sure it's desirable. I'll have to think about it, and @Lukasa would have to weigh in too.

jab commented Sep 5, 2014

Thanks for the thorough reply, @sigmavirus24.

There is a good reason not to do it at module scope. That function is so rarely used (98% of requests users never touch it) that triggering the compile when they import requests would hurt their performance for no real benefit.

A typical solution is to compile them lazily (on first use) then, no? That way, when you don't need the function, you don't pay for it, and when you do, you're not punished on every call.
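
(A concrete sketch of what I mean, with illustrative names: compilation happens on the first call rather than at import time.)

import re

_ENCODING_RES = None  # compiled lazily so `import requests` pays nothing

def _encoding_res():
    # Compile and cache the patterns on first use; later calls reuse them.
    global _ENCODING_RES
    if _ENCODING_RES is None:
        _ENCODING_RES = (
            re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I),
            re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I),
            re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]'),
        )
    return _ENCODING_RES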

That said, out of curiosity I measured the performance impact of adding

import re
CHARSET_RE = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
PRAGMA_RE = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
XML_RE = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

to an otherwise empty module, and the difference was indistinguishable from noise. Did I measure wrong?

That said, how would you determine which was the first encoding in the body? You could do something like...

(I was even thinking something simpler, something like:

for p in (CHARSET_RE, PRAGMA_RE, XML_RE):  # compile these lazily if you like
    match = p.search(raw)
    if match:
        return match.group(1)

)

But that seems kind of arbitrary. Returning all of them is also arbitrary, but if we had written it the way you want, someone else would have wanted us to write it the way it is currently written.

Hm. Having a function that returns the first match, rather than all matches for all three patterns, would make the use case of getting the charset declared in the head of a large HTML document much more efficient. That seems like a primary use case for this function, in light of get_unicode_from_response's docstring (though see #2205). So it's not arbitrary. At the least, an additional get_first_encoding_from_content function could be offered to facilitate this use case.

I did a github code search for get_encodings_from_content to see how people are using this and so far most are using it like this:

if 'charset' not in content_type:
    # requests only looks at headers to detect the encoding, we must find the charset ourselves
    # we can't use req.content because regex doesn't work on bytestrings apparently
    encodings = get_encodings_from_content(req.text)
    if encodings:
        req.encoding = encodings[0]

I'm not sure if you've gotten any requests for it to be the way it's currently written, but the way I'm proposing seems more efficient for the use cases I'm finding.
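
(With a first-match helper like the one sketched in my first comment, that found snippet could collapse to something like the following; illustrative only:)

if 'charset' not in content_type:
    encoding = get_first_encoding_from_content(req.text)
    if encoding:
        req.encoding = encoding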

Thanks for the pointer to #2086. I'll keep an eye on that and also look forward to others' feedback on this.

Lukasa commented Sep 5, 2014

Our position on get_encodings_from_content is very clear: we don't expect to use it automatically. Requests is an HTTP library: we should not be reaching into body content without the user's say-so. We provide get_encodings_from_content as a helper function, but I would be strongly opposed to ever using it automatically.

Note that your proposed get_first_encoding_from_content does not check which one occurs first, it prioritises them. That's a very different thing to do.

Lukasa commented Sep 5, 2014

As for #2086, our position is unclear: I haven't been able to have a chat with Kenneth yet.

jab commented Sep 5, 2014

Our position on get_encodings_from_content is very clear: we don't expect to use it automatically. Requests is an HTTP library: we should not be reaching into body content without the user's say-so. We provide get_encodings_from_content as a helper function, but I would be strongly opposed to ever using it automatically.

Sorry if I didn't explain clearly: I'm not proposing using it automatically, I'm proposing a version that returns just the first match. If you're talking about #2205, that only uses it from another utility function which is documented to do so, not in some hidden and inappropriate way.

Note that your proposed get_first_encoding_from_content does not check which one occurs first, it prioritises them. That's a very different thing to do.

That's what I intended, sorry if I didn't explain it as clearly as I should have.

Lukasa commented Sep 5, 2014

Ok, sorry, I misunderstood the meaning of 'first'. I presumed that 'first' meant 'first to occur in the response', not first to occur in some other ordering. In that case I'm very much neutral on all of these changes. =)

sigmavirus24 (Contributor) commented:

I don't have time for a full reply but wanted to respond to this:

I misunderstood the meaning of 'first'. I presumed that 'first' meant 'first to occur in the response', not first to occur in some other ordering.

This is how I understood "first" too.

jab commented Sep 5, 2014

Yeah, I saw that from the code example you gave; sorry again for not being clearer. Trying the regexes in an order prioritized by how commonly they occur, and just returning the first match, seems principled and reasonable enough, and it better supports both users' use cases and requests.utils.get_unicode_from_response's.

sigmavirus24 (Contributor) commented:

A typical solution is to compile them lazily (on first use) then, no? That way, when you don't need the function, you don't pay for it, and when you do, you're not punished on every call.

Yeah, I'm not averse to a pattern like this:

XML_RE = None

def get_encodings_from_content(content):
    global XML_RE
    if XML_RE is None:
        XML_RE = re.compile('...')

That said, out of curiosity I measured the performance impact [...] Did I measure wrong?

You didn't show us how you measured it.

At least, an additional get_first_encoding_from_content function could be offered to facilitate this use case.

I'm not sure it's necessary though. I understand there are a lot of people using this just like you describe but I also found plenty of people using the entire list.

I'm not sure if you've gotten any requests for it to be the way it's currently written, but the way I'm proposing seems more efficient for the use cases I'm finding.

I wouldn't be surprised if it were. That said, I'd like to reiterate @Lukasa's assertion that requests is fundamentally an HTTP library. I have not so quietly been an advocate of not having this in requests.utils at all. Further, a function like the one you describe is something I don't feel comfortable including. If we return the first of the three that match, then someone will want to reorder the list of regular expressions we use. That will only lead to pain for @Lukasa and myself. The whole soup-y mess that consistently results from the get_encodings_from_content function is enough to make me want to push to remove it yet again.

Few people can agree on how it should behave, and some of the decisions have been quite arbitrary.

jab commented Sep 6, 2014

Yeah, I'm not averse to a pattern like this: [...]

LGTM

You didn't show us how you measured it.

python -m timeit 'import foo' where foo.py was blank vs. having the 3 compile calls.

I have not so quietly been an advocate of not having this in requests.utils at all. [...]
Few people can agree on how it should behave, and some of the decisions have been quite arbitrary.

I'm new to this code / not privy to the history, so just offering feedback based on what I find there now. It doesn't seem crazy to me to keep this in: it's just a few lines, and it provides useful, related, and fairly oft-needed functionality for library users who would otherwise have to duplicate it, and maybe not get it quite as right as requests could. But I of course defer to the maintainers' better wisdom. If requests.utils is going to keep get_unicode_from_response and friends, it seems reasonable to make this change; but if you're going to get rid of them due to maintenance burden exceeding benefit, that's totally understandable.

sigmavirus24 (Contributor) commented:

I'm new to this code / not privy to the history, so just offering feedback based on what I find there now.

I understand that. I'm not upset with you. This code is a holdover from before 1.0 and, in my opinion, should have been excised then.

if you're going to get rid of them due to maintenance burden exceeding benefit

We likely won't, but every so often issues like this pop up, and the most frustrating part is that we can't easily accommodate everyone. We would love to maximize our users' happiness, but we have to reject some ideas. For example, we can't just change what get_encodings_from_content returns. We could add get_first_encoding_from_content, but that implementation would inevitably spark debate about the best approach. Introducing more points of contention is exactly the last thing I want for requests.

python -m timeit 'import foo' where foo.py was blank vs. having the 3 compile calls.

That's not how I would measure it. I would move the compilation step out of the function in requests.utils and then python -m timeit "import requests".

jab commented Sep 6, 2014

I would move the compilation step out of the function in requests.utils and then python -m timeit "import requests"

Done. Moving the re.compile calls to module scope had no effect on import speed.

> python -m timeit 'import requests'
1000000 loops, best of 3: 0.478 usec per loop
> vim path/to/requests/utils.py  # move compile calls to module scope
> python -m timeit 'import requests'
1000000 loops, best of 3: 0.478 usec per loop

I actually got literally the exact same time before and after.

Would be interested to hear if you get different results. If not, if you want to make that change, cool, otherwise, feel free to close this as WONTFIX.

Lukasa commented Sep 6, 2014

My position here is basically a less negative version of @sigmavirus24's. I simply don't care about these functions. They used to be used by requests pre-1.0 and aren't now, but we left them in because some people found them and used them. They aren't documented and the maintainers don't advertise them, but ordinarily they don't represent a maintenance cost to us.

However, fundamentally, all of these functions are trivial utilities. My recommendation is that, rather than change behaviour or add yet another function I'll studiously ignore, people should just copy the function out of utils.py and change it to do what they need it to do. It should take all of 10 seconds. =)

sigmavirus24 (Contributor) commented:

I think we've come to the conclusion that we will not be fixing this then. @Lukasa are you okay with my closing this?

Lukasa commented Sep 11, 2014

Suits me. =)

github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2021