'big5' encoding not detected #4299

Closed
samuelclay opened this issue Sep 22, 2017 · 4 comments

@samuelclay

See this feed: http://www.digitimes.com.tw/tech/rss/xml/xmlrss_10_0.xml

The top line of the feed is:

<?xml version="1.0" encoding="big5" ?>

But the detected encoding is iso-8859-1. It should be big5.

Expected Result

r.encoding should be big5.

Actual Result

r.encoding is iso-8859-1.

Reproduction Steps

>>> import requests
>>> r = requests.get('http://www.digitimes.com.tw/tech/rss/xml/xmlrss_10_0.xml')
>>> r.encoding
'ISO-8859-1'
>>> print r.text[900:1000]
xÆW¤Ó¶§¯à¹q¦À¤jÁp·ù¡A´Á¯ài¦æ¾ã¦X¡A¥Ø«e±À´ú¬ù¦³4®a¼t°Ó°Ñ»P¡A¥i¯à¥é·Ó¦­´ÁÁp¹q°Æ¸³¨Æªø«Å©ú´¼¥l¶°²Õ¦¨¥x
>>> r.encoding = 'big5'
>>> print r.text[900:1000]
均三緘其口即將跨入10月PV Taiwan年度展期近期台太陽能市場熱鬧非凡除了對模組聯盟最終拍板......]]></description>
<keyword><![CDATA[太陽能矽晶圓

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  }, 
  "cryptography": {
    "version": "1.6"
  }, 
  "idna": {
    "version": "2.6"
  }, 
  "implementation": {
    "name": "CPython", 
    "version": "2.7.10"
  }, 
  "platform": {
    "release": "16.7.0", 
    "system": "Darwin"
  }, 
  "pyOpenSSL": {
    "openssl_version": "100020af", 
    "version": "16.2.0"
  }, 
  "requests": {
    "version": "2.18.4"
  }, 
  "system_ssl": {
    "version": "9081df"
  }, 
  "urllib3": {
    "version": "1.22"
  }, 
  "using_pyopenssl": true
}
@sethmlarson
Member

We rely on chardet to determine the encoding of the content. Whatever the chardet library reports is what we auto-detect the content as.

@samuelclay
Author

Something strange is going on then...

>>> import chardet
>>> chardet.__version__
'3.0.4'
>>> import urllib
>>> r = urllib.urlopen("http://www.digitimes.com.tw/tech/rss/xml/xmlrss_10_0.xml").read()
>>> chardet.detect(r)
{'confidence': 0.99, 'language': 'Chinese', 'encoding': 'Big5'}
>>> requests.get('http://www.digitimes.com.tw/tech/rss/xml/xmlrss_10_0.xml').encoding
'ISO-8859-1'

@Terr

Terr commented Sep 22, 2017

I had something similar recently, so I decided to take a look.

The encoding selected for this response seems to come from get_encoding_from_headers() at https://github.com/requests/requests/blob/master/requests/utils.py#L428. Because the digitimes.com.tw server returns a Content-Type with 'text' in its value (text/xml) and no charset, ISO-8859-1 is assumed.
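To make the behaviour described above concrete, here is a minimal sketch of that kind of header-based decision. It is not the library's source: `encoding_from_content_type` is a hypothetical name, and the parsing is deliberately simplified. It only illustrates why a `text/xml` header without a charset ends up as ISO-8859-1.

```python
def encoding_from_content_type(content_type):
    """Hypothetical, simplified stand-in for a header-based encoding guess."""
    if not content_type:
        return None

    # Split "text/xml; charset=big5" into the media type and its parameters.
    parts = [part.strip() for part in content_type.split(";")]
    media_type, params = parts[0].lower(), parts[1:]

    # An explicit charset parameter always wins.
    for param in params:
        key, _, value = param.partition("=")
        if key.strip().lower() == "charset" and value:
            return value.strip(" '\"")

    # No charset given: any "text" media type falls back to ISO-8859-1.
    if "text" in media_type:
        return "ISO-8859-1"

    return None


print(encoding_from_content_type("text/xml"))
print(encoding_from_content_type("text/xml; charset=big5"))
```

Under these assumptions, the feed in this issue would only be decoded as big5 if the server put the charset in the Content-Type header itself.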

The character set detected by chardet (apparent_encoding) is only used when no encoding has been set (before parsing the response content body?)

One solution would be to remove the assumption in get_encoding_from_headers(), but I don't know what consequences that would have.
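A caller-side alternative that leaves get_encoding_from_headers() untouched might look like the sketch below. The helper name `prefer_detected_encoding` is hypothetical; it relies only on the `headers`, `encoding` and `apparent_encoding` attributes of a Requests response, and it assumes that a missing charset parameter means the header-derived encoding is just a default guess.

```python
def prefer_detected_encoding(response):
    """Hypothetical helper: trust chardet when the server sent no charset."""
    content_type = response.headers.get("Content-Type", "")

    # Only override when the header carries no explicit charset, i.e. when
    # the current encoding is a default assumption rather than a server claim.
    if "charset" not in content_type.lower():
        response.encoding = response.apparent_encoding

    return response
```

Usage would be along the lines of `r = prefer_detected_encoding(requests.get(url))` before reading `r.text`. Note that `apparent_encoding` runs chardet over the whole body, so it can be slow on large responses.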

@Lukasa
Member

Lukasa commented Sep 22, 2017

This is covered by #2086. However, I should note that Requests will never look into the HTML itself; at most it will use chardet.
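Since the library will not inspect the body's own declaration, a caller who wants to honour the `encoding="big5"` attribute from the feed's first line has to read it themselves. The following is a rough sketch under stated assumptions: `declared_xml_encoding` is a hypothetical helper, it only handles ASCII-compatible declarations (not UTF-16 or BOM-prefixed documents), and it is not a substitute for a real XML parser.

```python
import re

# Matches the encoding attribute of an XML declaration at the start of the body.
_XML_DECLARATION = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def declared_xml_encoding(body):
    """Hypothetical helper: return the encoding named in the XML declaration."""
    # The declaration must come first, so a short prefix is enough to scan.
    match = _XML_DECLARATION.match(body[:200])
    if match is None:
        return None
    return match.group(1).decode("ascii")


sample = b'<?xml version="1.0" encoding="big5" ?>\n<rss></rss>'
print(declared_xml_encoding(sample))
```

With a real response this could be combined with the existing attributes, for example `r.encoding = declared_xml_encoding(r.content) or r.apparent_encoding`, before accessing `r.text`.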

@Lukasa Lukasa closed this as completed Sep 22, 2017
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2021