Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

catalog/v2 Kiwix Server end-point does not specify anymore the charset in the HTTP headers #941

Closed
kelson42 opened this issue Apr 22, 2023 · 14 comments · Fixed by #942
Closed

Comments

@kelson42
Copy link
Collaborator

Following remark of @tim-moody, I have run the following test:

$ curl -I https://library.kiwix.org/catalog/root.xml
HTTP/2 200 
date: Sat, 22 Apr 2023 06:57:53 GMT
content-type: application/atom+xml; profile=opds-catalog; kind=acquisition; charset=utf-8
content-length: 5002801
access-control-allow-origin: *
cache-control: max-age=0, must-revalidate
vary: Accept-Encoding
x-varnish: 770016197 769329506
age: 458
via: 1.1 varnish (Varnish/7.1)
etag: W/"1670254420720624352.12572515/z"
accept-ranges: bytes
strict-transport-security: max-age=15724800; includeSubDomains

and then

$ curl -I https://library.kiwix.org/catalog/v2/entries?count=10
HTTP/2 200 
date: Sat, 22 Apr 2023 06:57:36 GMT
content-type: application/atom+xml;profile=opds-catalog;kind=acquisition
content-length: 13227
access-control-allow-origin: *
cache-control: max-age=0, must-revalidate
vary: Accept-Encoding
x-varnish: 770016157 769361332
age: 15
via: 1.1 varnish (Varnish/7.1)
etag: W/"1670254420720624352.12572515/z"
accept-ranges: bytes
strict-transport-security: max-age=15724800; includeSubDomains

The new end-point does not specify anymore the charset=utf-8 in the content-type header. Any good reason or we have just forgotten?

@tim-moody
Copy link

A little background to shed additional light on this issue, and which I find puzzling.

IIAB reads the Kiwix catalog in python using requests.get and then parses it using Beautiful Soup in order to produce an equivalent json file for use in generating lists of available zims.

There are only two lines that are different between the code to read root.xml and the code to read v2/entries?count=-1

# kiwix_catalog_url = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
kiwix_catalog_url = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint

https://github.com/iiab/iiab-admin-console/blob/master/roles/cmdsrv/templates/scripts/get_kiwix_catalog_rootxml.py
https://github.com/iiab/iiab-admin-console/blob/master/roles/cmdsrv/templates/scripts/get_kiwix_catalog_opds_v2.py

Does requests honor the charset directive?

@rgaudin
Copy link
Member

rgaudin commented Apr 22, 2023

Hum, it's an XML document and the encoding is properly declary in the XML Declaration Node (very first line)

❯ curl https://library.kiwix.org/catalog/v2/entries?count=1
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:opds="https://specs.opds.io/opds-1.2"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <id>a072cd94-2127-291e-d54a-b8fc5f9dee62</id>

  <link rel="self"
        href="/catalog/v2/entries?count=1"
        type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
  <link rel="start"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <link rel="up"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>

  <title>Filtered Entries (count=1)</title>
  <updated>2023-04-22T14:38:18Z</updated>
  <totalResults>4116</totalResults>
  <startIndex>0</startIndex>
  <itemsPerPage>1</itemsPerPage>
  <entry>
    <id>urn:uuid:00099ae9-499a-15b4-2e48-27fe7b56e28f</id>
    <title>Reverse Engineering</title>
    <updated>2022-11-09T00:00:00Z</updated>
    <summary>Q&amp;A for researchers and developers who explore the principles of a system through analysis of its structure, function, and operation</summary>
    <language>eng</language>
    <name>reverseengineering.stackexchange.com_en_all</name>
    <flavour></flavour>
    <category>stack_exchange</category>
    <tags>_category:stack_exchange;stack_exchange;_ftindex:no;_pictures:yes;_videos:yes;_details:yes</tags>
    <articleCount>16304</articleCount>
    <mediaCount>8262</mediaCount>
    <link rel="http://opds-spec.org/image/thumbnail"
          href="/catalog/v2/illustration/00099ae9-499a-15b4-2e48-27fe7b56e28f/?size=48"
          type="image/png;width=48;height=48;scale=1"/>
    <link type="text/html" href="/content/reverseengineering.stackexchange.com_en_all_2022-11" />
    <author>
      <name>Stack Exchange</name>
    </author>
    <publisher>
      <name>Kiwix</name>
    </publisher>
    <dc:issued>2022-11-09T00:00:00Z</dc:issued>
    <link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://download.kiwix.org/zim/stack_exchange/reverseengineering.stackexchange.com_en_all_2022-11.zim.meta4" length="107168768" />
  </entry>
</feed>

@tim-moody beautifulsoup is supposed to parse and use the declared encoding. Can you share a short piece of code that shows it not working?

@tim-moody
Copy link

Can you share a short piece of code that shows it not working?

@rgaudin as I said above the only difference between the code that works and the code that doesn't is the url that is read. root.xml works and v2/entries doesn't. (The links to the code are above.)

But let's keep it simple:

import requests

kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint

test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'

r = requests.get(kiwix_root_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])

r = requests.get(kiwix_v2_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])

The first prints

print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea
<title>Matemáticas por Wikipedia</title>

The second

print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea
<title>Matemáticas por Wikipedia</title>
<updated

@rgaudin
Copy link
Member

rgaudin commented Apr 22, 2023

@rgaudin as I said above the only difference between the code that works and the code that doesn't is the url that is read. root.xml works and v2/entries doesn't. (The links to the code are above.)

Well pardon me for being focused on our end of the issue and not yours 😅
It's clear from your initial message that the charset is not conveyed via HTTP. That doesn't make it a problem since we are sending the proper Content-Type as per the spec and the charset is specified in the document.

Now if you're looking at requests only, it would be just fair that encoding isn't picked up since it's in the document and requests here is being used for the transport only.

As I wrote before, beautifulsoup being the xml parser (with lxml), it states in its doc that it reads the XMLDecl and adjusts for it.

That said, I see those strings properly decoded here. And the response's apparent_encoding is correct. Can you try updating requests?

Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>>
>>> kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
>>> kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
>>>
>>> test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'
>>>
>>> r = requests.get(kiwix_root_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])

r = requests.get(kiwix_v2_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>
>>>
>>> r = requests.get(kiwix_v2_catalog)
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>
>>> r.apparent_encoding
'utf-8'
>>>

@holta
Copy link

holta commented Apr 22, 2023

Can you try updating requests?

FWIW that didn't work for me on Ubuntu 23.04.

Even after I upgraded its requests from 2.8.1 (Jun 29, 2022) to 2.8.2 (Jan 12, 2023):

Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> print(requests.__version__)
2.28.2
>>> r = requests.get('https://library.kiwix.org/catalog/v2/entries?count=-1')
>>> offset = r.text.find('637723c8-a736-6b1c-10d6-1b63d79ed5ea')
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated
>>> r.apparent_encoding
'Windows-1254'

Anything else I should try above?

@rgaudin
Copy link
Member

rgaudin commented Apr 22, 2023

Ah ok so maybe requests can't pick it and defaulted to system encoding. Bs4 should though. I'll look into your script on Monday

@holta
Copy link

holta commented Apr 23, 2023

Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>>
>>> kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
>>> kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
>>>
>>> test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'

...

>>> r = requests.get(kiwix_v2_catalog)
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>
>>> r.apparent_encoding
'utf-8'

@rgaudin is there anything special about your OS / environment / conditions above?

(So far others cannot reproduce your output above — despite trying on several different "stock" freshly installed Linux OS's like Ubuntu, Debian and Raspberry Pi OS — any suggestions or tips?)

@tim-moody
Copy link

so we can work around this by explicitly decoding:

utf8_xlm = r.content.decode("utf-8")
offset = utf8_xlm.find(test_entry)
print(utf8_xlm[offset:offset+100])

Still, library.xml, library_zims.xml, and root.xml can all be read by a browser, and this is usefull for troubleshooting.

@rgaudin
Copy link
Member

rgaudin commented Apr 24, 2023

Yes ; seems quite obvious that using r.text was using the requests-decoded string (and thus System one) ; leading to the issue.

https://github.com/iiab/iiab-admin-console/blob/6138f5217a4ce4f19c6d130a8ebfc65c2116ae1d/roles/cmdsrv/templates/scripts/get_kiwix_catalog_opds_v2.py#L94

I'd suggest you pass r.content directly to BS instead of decoding to UTF-8. There's no chance we'll switch to another encoding but by passing bytes, BS will use the declared encoding.

@tim-moody @kelson42 can we close this ticket?

@kelson42
Copy link
Collaborator Author

@rgaudin Would that not be better (than current situation) if HTTP header gives charset as well?

@rgaudin
Copy link
Member

rgaudin commented Apr 24, 2023

@rgaudin Would that not be better (than current situation) if HTTP header gives charset as well?

I think not but for lousy reasons:

  • By respecting the spec's Content-Types we're preventing other tools that would expect this exact CTs from breaking if they didn't account for an extra charset specification. HTTP allows us to specify the charset but it's not well known.
  • This would mean maintaining the same information twice. Can surely be refactored so we only specify it once once use it both in the XML Declaration and the various endpoints with the different CTs though.

All in all, I think it's a better practice. In the example above, previous code was correct because the HTTP header was sent so it used it but the new one was not because it was not reading the encoding anywhere. What if there's a discrepency between HTTP and XML? One should use XML so by not setting the charset in HTTP header, we're enforcing the better practice 😉

I can live with both situations though. Your call (or libkiwix developers')

@veloman-yunkan
Copy link
Collaborator

It was a trivial fix but I used the opportunity for some minor refactoring

@tim-moody
Copy link

Are we still talking about https://library.kiwix.org/catalog/v2/entries?count=-1? that when viewed in chrome has:

<entry>
    <id>urn:uuid:225744bb-9488-085c-cfa8-19bf352c9e44</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>2023-04-23T00:00:00Z</updated>
    <summary>Una selección de artículos de Wikipedia sobre matemáticas</summary>
    <language>spa</language>
    <name>wikipedia_es_mathematics</name>
    <flavour>maxi</flavour>
    <category>wikipedia</category>
    <tags>wikipedia;_category:wikipedia;_pictures:yes;_videos:no;_details:yes;_ftindex:yes</tags>
    <articleCount>6474</articleCount>
    <mediaCount>82062</mediaCount>
    <link rel="http://opds-spec.org/image/thumbnail"
          href="/catalog/v2/illustration/225744bb-9488-085c-cfa8-19bf352c9e44/?size=48"
          type="image/png;width=48;height=48;scale=1"/>
    <link type="text/html" href="/content/wikipedia_es_mathematics_maxi_2023-04" />
    <author>
      <name>Wikipedia</name>
    </author>
    <publisher>
      <name>Kiwix</name>
    </publisher>
    <dc:issued>2023-04-23T00:00:00Z</dc:issued>
    <link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://download.kiwix.org/zim/wikipedia/wikipedia_es_mathematics_maxi_2023-04.zim.meta4" length="693760000" />
  </entry>

@rgaudin
Copy link
Member

rgaudin commented Apr 26, 2023

@tim-moody It's been merged but not deployed. Nightlies are deployed everyday at https://dev.library.kiwix.org/ but library.kiwix.org only changes after kiwix-tools releases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants