catalog/v2
Kiwix Server end-point does not specify anymore the charset in the HTTP headers
#941
Comments
Some background that may shed additional light on this issue, which I find puzzling. IIAB reads the Kiwix catalog in Python using requests.get and then parses it with Beautiful Soup to produce an equivalent JSON file used to generate lists of available ZIMs. Only two lines differ between the code that reads root.xml and the code that reads v2/entries?count=-1
https://github.com/iiab/iiab-admin-console/blob/master/roles/cmdsrv/templates/scripts/get_kiwix_catalog_rootxml.py
Does requests honor the charset directive?
Hmm, it's an XML document and the encoding is properly declared in the XML declaration (the very first line).
@tim-moody beautifulsoup is supposed to parse and use the declared encoding. Can you share a short piece of code that shows it not working?
@rgaudin as I said above the only difference between the code that works and the code that doesn't is the url that is read. root.xml works and v2/entries doesn't. (The links to the code are above.) But let's keep it simple:
The first prints
The second
|
Well, pardon me for being focused on our end of the issue and not yours 😅 Now, if you're looking at requests only, it's only fair that the encoding isn't picked up, since it's declared in the document and requests is being used here for transport only. As I wrote before, beautifulsoup being the XML parser (with lxml), its documentation states that it reads the XMLDecl and adjusts for it. That said, I see those strings properly decoded here. And the response's
Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>>
>>> kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
>>> kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
>>>
>>> test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'
>>>
>>> r = requests.get(kiwix_root_catalog)
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
<title>Matemáticas por Wikipedia</title>
<updated>
>>>
>>> r = requests.get(kiwix_v2_catalog)
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
<title>Matemáticas por Wikipedia</title>
<updated>
>>> r.apparent_encoding
'utf-8'
>>>
FWIW that didn't work for me on Ubuntu 23.04, even after I upgraded its requests from 2.28.1 (Jun 29, 2022) to 2.28.2 (Jan 12, 2023):
Anything else I should try above?
Ah OK, so maybe requests can't pick it up and defaulted to the system encoding. bs4 should, though. I'll look into your script on Monday.
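To illustrate the difference between decoding at the transport level and decoding in the XML parser, here is a small sketch that works on hard-coded bytes rather than a live response. The sample document is made up, ISO-8859-1 is only one plausible wrong guess (not necessarily what requests picked here), and the standard library's xml.etree.ElementTree stands in for BeautifulSoup/lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a tiny UTF-8 XML document, held as raw bytes the way
# an HTTP client receives it (this is not the real catalog).
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<entry><title>Matemáticas por Wikipedia</title></entry>"
).encode("utf-8")

# Decoding with the wrong codec does not raise; it silently garbles the text.
wrong_guess = raw.decode("iso-8859-1")
start = wrong_guess.find("<title>") + len("<title>")
garbled_title = wrong_guess[start : wrong_guess.find("</title>")]

# Handing the raw bytes to an XML parser lets it read the XML declaration
# and decode accordingly.
parsed_title = ET.fromstring(raw).findtext("title")

print(garbled_title)  # accented character comes out as two wrong characters
print(parsed_title)   # accented character survives
```

The practical takeaway under these assumptions: pass bytes (r.content in requests) to the parser rather than already-decoded text (r.text).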
...
@rgaudin is there anything special about your OS / environment / conditions above? (So far others cannot reproduce your output above, despite trying on several different "stock" freshly installed Linux OSes like Ubuntu, Debian and Raspberry Pi OS. Any suggestions or tips?)
so we can work around this by explicitly decoding:
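A minimal sketch of what such explicit decoding could look like. The helper name and the stand-in response object are hypothetical; with requests, the raw body is available as r.content, and UTF-8 is assumed because that is what the catalog is reported to declare.

```python
from types import SimpleNamespace


def explicit_text(response, encoding="utf-8"):
    """Decode the raw body ourselves instead of relying on a guessed encoding.

    `response` only needs a bytes attribute named `content`, which is what
    requests.Response provides. The UTF-8 default is an assumption.
    """
    return response.content.decode(encoding)


# Stand-in for a real response so the sketch runs without network access.
fake_response = SimpleNamespace(
    content="<title>Matemáticas por Wikipedia</title>".encode("utf-8")
)
print(explicit_text(fake_response))
```

With a real requests response, setting r.encoding = 'utf-8' before reading r.text is another way to get the same effect.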
Still, library.xml, library_zims.xml, and root.xml can all be read by a browser, and this is useful for troubleshooting.
Yes; seems quite obvious that using I'd suggest you pass
@tim-moody @kelson42 can we close this ticket?
@rgaudin Wouldn't it be better (than the current situation) if the HTTP header gave the charset as well?
I think not but for lousy reasons:
All in all, I think it's the better practice. In the example above, the previous code was correct because the HTTP header was sent and so it was used, but the new one was not because it wasn't reading the encoding from anywhere. What if there's a discrepancy between HTTP and XML? One should use XML, so by not setting the charset in the HTTP header, we're enforcing the better practice 😉 I can live with both situations though. Your call (or the libkiwix developers').
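The preference argued in this comment (trust the XML declaration first, treat the HTTP charset as a fallback) could be sketched as follows. This is an illustration of that design choice, not a statement of what any specification mandates; the function name and the final UTF-8 fallback are assumptions.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of an ASCII-compatible byte stream. BOMs and UTF-16/UTF-32 input are
# out of scope for this sketch.
_XML_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def pick_encoding(raw: bytes, header_charset: Optional[str] = None) -> str:
    """Prefer the XML declaration, then the HTTP charset, then UTF-8."""
    match = _XML_DECL_ENCODING.match(raw)
    if match:
        return match.group(1).decode("ascii").lower()
    if header_charset:
        return header_charset.lower()
    return "utf-8"


doc = b'<?xml version="1.0" encoding="UTF-8"?><feed/>'
print(pick_encoding(doc, header_charset="iso-8859-1"))       # declaration wins
print(pick_encoding(b"<feed/>", header_charset="ISO-8859-1"))  # header fallback
print(pick_encoding(b"<feed/>"))                              # last-resort default
```

In practice a real XML parser fed with bytes already does the declaration sniffing, so a helper like this only matters when the text has to be decoded before parsing.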
It was a trivial fix but I used the opportunity for some minor refactoring |
Are we still talking about https://library.kiwix.org/catalog/v2/entries?count=-1 ? When viewed in Chrome it has:
|
@tim-moody It's been merged but not deployed. Nightlies are deployed every day at https://dev.library.kiwix.org/ but library.kiwix.org only changes after kiwix-tools releases.
Following @tim-moody's remark, I ran the following test:
and then
The new end-point no longer specifies charset=utf-8 in the content-type header. Is there a good reason, or did we just forget?
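For anyone wanting to check this from a script rather than from browser tools, a small sketch of pulling the charset parameter out of a Content-Type value with the Python standard library follows. The helper name is hypothetical, and the media type in the sample values is an assumption, since the thread does not show the exact headers returned by either endpoint.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


# Sample header values (assumed for illustration, not captured from the server).
print(charset_from_content_type("application/atom+xml; charset=utf-8"))
print(charset_from_content_type("application/atom+xml"))
```

With requests, the value to feed in would be r.headers.get("Content-Type", "").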