'big5' encoding not detected #4299

samuelclay · 2017-09-22T00:20:45Z

See this feed: http://www.digitimes.com.tw/tech/rss/xml/xmlrss_10_0.xml

The top line of the feed is:

<?xml version="1.0" encoding="big5" ?>

But the detected encoding is iso-8859-1. It should be big5.

Expected Result

r.encoding should be big5.

Actual Result

r.encoding is iso-8859-1.

Reproduction Steps

>>> import requests
>>> r = requests.get('http://www.digitimes.com.tw/tech/rss/xml/xmlrss_10_0.xml')
>>> r.encoding
'ISO-8859-1'
>>> print r.text[900:1000]
xÆW¤Ó¶§¯à¹q¦À¤jÁp·ù¡A´Á¯à¶i¦æ¾ã¦X¡A¥Ø«e±À´ú¬ù¦³4®a¼t°Ó°Ñ»P¡A¥i¯à¥é·Ó¦´ÁÁp¹q°Æ¸³¨Æªø«Å©ú´¼¥l¶°²Õ¦¨¥x
>>> r.encoding = 'big5'
>>> print r.text[900:1000]
均三緘其口。即將跨入10月PV Taiwan年度展期，近期台太陽能市場熱鬧非凡，除了對模組聯盟最終拍板......]]></description>
<keyword><![CDATA[太陽能矽晶圓

System Information

$ python -m requests.help

{
  "chardet": {
    "version": "3.0.4"
  }, 
  "cryptography": {
    "version": "1.6"
  }, 
  "idna": {
    "version": "2.6"
  }, 
  "implementation": {
    "name": "CPython", 
    "version": "2.7.10"
  }, 
  "platform": {
    "release": "16.7.0", 
    "system": "Darwin"
  }, 
  "pyOpenSSL": {
    "openssl_version": "100020af", 
    "version": "16.2.0"
  }, 
  "requests": {
    "version": "2.18.4"
  }, 
  "system_ssl": {
    "version": "9081df"
  }, 
  "urllib3": {
    "version": "1.22"
  }, 
  "using_pyopenssl": true
}


sethmlarson · 2017-09-22T01:53:54Z

We rely on chardet to determine the encoding of the content. Whatever the chardet library reports is what we auto-detect the content as.

samuelclay · 2017-09-22T03:18:28Z

Something strange is going on then...

>>> import chardet
>>> chardet.__version__
'3.0.4'
>>> import urllib
>>> r = urllib.urlopen("http://www.digitimes.com.tw/tech/rss/xml/xmlrss_10_0.xml").read()
>>> chardet.detect(r)
{'confidence': 0.99, 'language': 'Chinese', 'encoding': 'Big5'}
>>> requests.get('http://www.digitimes.com.tw/tech/rss/xml/xmlrss_10_0.xml').encoding
'ISO-8859-1'

Terr · 2017-09-22T11:30:21Z

I had something similar recently, so I decided to take a look.

The encoding selected for this response seems to come from get_encoding_from_headers() at https://github.com/requests/requests/blob/master/requests/utils.py#L428. Because the digitimes.com.tw server returns a Content-Type with 'text' in its value (text/xml) and no charset parameter, ISO-8859-1 is assumed.

The character set detected by chardet (apparent_encoding) is only used when no encoding has been set (before parsing the response content body?)

One solution would be to remove the assumption in get_encoding_from_headers(), but I don't know what consequences that would have.
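
To make the behaviour described above concrete, here is a minimal sketch of the header-based fallback. It is a simplified stand-in, not the actual Requests source: the function name guess_encoding_from_content_type is made up for illustration, and the parameter parsing is deliberately naive.

```python
def guess_encoding_from_content_type(content_type):
    """Simplified, hypothetical imitation of a header-based encoding guess.

    Not the real requests.utils.get_encoding_from_headers(); it only
    mirrors the behaviour described in this thread.
    """
    if not content_type:
        return None

    # Split "text/xml; charset=big5" into the media type and its parameters.
    media_type, _, rest = content_type.partition(";")
    for param in rest.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            # An explicit charset from the server always wins.
            return value.strip().strip("'\"")

    # No charset given: any text/* type falls back to ISO-8859-1.
    if "text" in media_type.lower():
        return "ISO-8859-1"

    return None


print(guess_encoding_from_content_type("text/xml"))
print(guess_encoding_from_content_type("text/xml; charset=big5"))
print(guess_encoding_from_content_type("application/json"))
```

With a header of plain text/xml the fallback branch is taken, which matches the 'ISO-8859-1' seen in the report; chardet is never consulted in that case.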

Lukasa · 2017-09-22T16:35:19Z

This is covered by #2086. However, I should note that Requests will never look into the HTML itself; at most it will use chardet.
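
Until that changes, a caller-side workaround might look like the sketch below. It assumes the response object exposes headers, encoding and apparent_encoding the way a Requests response does; the helper name prefer_detected_encoding is hypothetical, and a SimpleNamespace stands in for a real response so the snippet runs without network access.

```python
from types import SimpleNamespace


def prefer_detected_encoding(response):
    """Hypothetical helper: use the detected encoding when the server
    did not send an explicit charset in its Content-Type header."""
    content_type = response.headers.get("Content-Type", "")
    if "charset" not in content_type.lower():
        # Without a charset, response.encoding is only a default guess,
        # so prefer the detected value if there is one.
        detected = response.apparent_encoding
        if detected:
            response.encoding = detected
    return response


# Stand-in for a real response, using the values reported in this issue.
fake = SimpleNamespace(
    headers={"Content-Type": "text/xml"},
    encoding="ISO-8859-1",
    apparent_encoding="Big5",
)
prefer_detected_encoding(fake)
print(fake.encoding)
```

With a real response this would be called right after requests.get(...) and before reading r.text. An explicit charset from the server is left untouched, so the override only applies where the encoding was a default in the first place.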

Lukasa closed this as completed Sep 22, 2017

github-actions bot locked as resolved and limited conversation to collaborators Sep 8, 2021
