-
-
Notifications
You must be signed in to change notification settings - Fork 9.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Suggestion] Simplify charset handling #1737
Comments
You've got lots of good suggestions here. I'll respond to some of them. In no particular order:
Stripping all the functionality from
|
@Lukasa you were tricked into saying this:
This is clearly the agenda of this issue as you can tell by:
@itsadok clearly wants the Let me address one other thing that @Lukasa didn't before I add my opinion.
With that addressed, let me address one more theme of this issue: Because some negligible percentage of all issues have been filed about x, x should be (changed|removed). One thing to note is that all the issues with numbers lower than 1000 were filed before requests 1.0 which is when the API was finalized. If there were legitimate bugs in this attribute prior to that, I would be far from surprised. Also some of those issues are instead about the choice that chardet/charade made. Those are not bugs in requests or Finally, after the release of 1.0 we had a lot of issues about the change from The |
Sorry about the closing and reopening, that was a mis-click. @sigmavirus24 I'm sorry if it seems like I have an agenda. I'm honestly just trying to help. I read through the discussions in all of the issues I posted. They all really seem to revolve around the same basic confusion, with people expecting Response.text to be an all-encompassing solution where in reality it is not. Let me just clarify some misunderstandings in what I wrote.
I merely meant that it should be noted that
The
This is a valid point, and perhaps serves to explain the weird nature of .text. Perhaps all that is needed is a note in the documentation. |
OK, mea culpa. I missed the part where It seems like I hit a nerve, which was really not my intention. I'm going to close the issue, but there is one pun that I just have to make.
The thing is, it seems that in most cases where you'd even want to access the |
Just because something can be used somewhere does not mean it should be.
But we never use it. It is cruft and should be removed. It is the wrong way to do things.
You're assuming everything behaves the same way and that RFCs are followed by servers. They're not. Charade is occasionally used. We keep it because it is essentially part of the API. Response's couldn't have an
I'm going to re-open it. You made valid points as @Lukasa pointed out. It just needs to be clear that this is not any agreement about removing the property. |
Heh, @sigmavirus24 both have the same reactions. Neither of us thinks this issue should be closed: I reopened it just before he could! Neither of us is angry, or unhappy about having this feedback. We're both delighted. You just can't tell because of the limitations of textual communication! =D The reality is that text encoding is hard, everyone is doing the wrong thing from time to time, and we cannot possibly please anyone. For that reason, we do the best we can, and then we expose the You've identified good issues with |
To cut to the chase, here are my suggestions:
Long version
There seems to be a lot of confusion regarding how the
.text
property works. After getting into some trouble with it myself, I searched the issues list, and found a dozen or so issues, all boiling down to the same mismatch between users' expectations and the intent of the library designers.#147 - bytecode string returned when page has charset=UTF-8
#156 - get_unicode_from_response does not check charsets from meta tags
#592 - Internal encoding detection doesn't match external chardet call
#654 - requests.get() ignores charset=UTF-8 and BOM
#765 - Chardet sometimes fails and force the wrong encoding
#861 - parsing encoding utf-8 page doesn't as expected
#1087 - Encodings from content
#1150 - On some pages requests detect encoding incorrectly
#1546 - use a default encoding in Response's text property
#1589 - Make sure content is checked when setting encoding
#1604 - Response.text returns improperly decoded text
#1683 - models.text Behaviour (encoding choice)
(It must be tiring to have the same conversation over and over again. I hope I'm being helpful here and not just piling on).
The argument seems to be:
I accept both these arguments. However, the documentation seems a bit coy, saying that "Requests makes an educated guess about the encoding", implying chardet/charade. In practice, for any content with a "text" media subtype, charade will not be used unless the user explicitly sets the
response.encoding
to None before reading the.text
property.Additionally, while ISO-8859-1 can be used as a default, won't it make more sense to handle that default in
.text
and not in theget_encoding_from_headers
method? This way, theencoding
property will be None if indeed no encoding is specified, allowing the user to make the decision on how to proceed.If you're going to keep the
.text
property, I think it should do a simple decoding if the charset is specified in the headers, and throw an exception otherwise. This way is much less confusing than the state of affairs now. Additionally, the documentation should contain a warning not to use it for arbitrary web pages, and perhaps a code sample showing the proper way to do it.The text was updated successfully, but these errors were encountered: