`catalog/v2` Kiwix Server end-point does not specify anymore the charset in the HTTP headers #941

kelson42 · 2023-04-22T07:03:26Z

Following remark of @tim-moody, I have run the following test:

$ curl -I https://library.kiwix.org/catalog/root.xml
HTTP/2 200 
date: Sat, 22 Apr 2023 06:57:53 GMT
content-type: application/atom+xml; profile=opds-catalog; kind=acquisition; charset=utf-8
content-length: 5002801
access-control-allow-origin: *
cache-control: max-age=0, must-revalidate
vary: Accept-Encoding
x-varnish: 770016197 769329506
age: 458
via: 1.1 varnish (Varnish/7.1)
etag: W/"1670254420720624352.12572515/z"
accept-ranges: bytes
strict-transport-security: max-age=15724800; includeSubDomains

and then

$ curl -I https://library.kiwix.org/catalog/v2/entries?count=10
HTTP/2 200 
date: Sat, 22 Apr 2023 06:57:36 GMT
content-type: application/atom+xml;profile=opds-catalog;kind=acquisition
content-length: 13227
access-control-allow-origin: *
cache-control: max-age=0, must-revalidate
vary: Accept-Encoding
x-varnish: 770016157 769361332
age: 15
via: 1.1 varnish (Varnish/7.1)
etag: W/"1670254420720624352.12572515/z"
accept-ranges: bytes
strict-transport-security: max-age=15724800; includeSubDomains

The new end-point does not specify anymore the charset=utf-8 in the content-type header. Any good reason or we have just forgotten?

The text was updated successfully, but these errors were encountered:

tim-moody · 2023-04-22T13:19:47Z

A little background to shed additional light on this issue, and which I find puzzling.

IIAB reads the Kiwix catalog in python using requests.get and then parses it using Beautiful Soup in order to produce an equivalent json file for use in generating lists of available zims.

There are only two lines that are different between the code to read root.xml and the code to read v2/entries?count=-1

# kiwix_catalog_url = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
kiwix_catalog_url = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint

https://github.com/iiab/iiab-admin-console/blob/master/roles/cmdsrv/templates/scripts/get_kiwix_catalog_rootxml.py
https://github.com/iiab/iiab-admin-console/blob/master/roles/cmdsrv/templates/scripts/get_kiwix_catalog_opds_v2.py

Does requests honor the charset directive?

rgaudin · 2023-04-22T14:46:24Z

Hum, it's an XML document and the encoding is properly declary in the XML Declaration Node (very first line)

❯ curl https://library.kiwix.org/catalog/v2/entries?count=1
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:opds="https://specs.opds.io/opds-1.2"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <id>a072cd94-2127-291e-d54a-b8fc5f9dee62</id>

  <link rel="self"
        href="/catalog/v2/entries?count=1"
        type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
  <link rel="start"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <link rel="up"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>

  <title>Filtered Entries (count=1)</title>
  <updated>2023-04-22T14:38:18Z</updated>
  <totalResults>4116</totalResults>
  <startIndex>0</startIndex>
  <itemsPerPage>1</itemsPerPage>
  <entry>
    <id>urn:uuid:00099ae9-499a-15b4-2e48-27fe7b56e28f</id>
    <title>Reverse Engineering</title>
    <updated>2022-11-09T00:00:00Z</updated>
    <summary>Q&amp;A for researchers and developers who explore the principles of a system through analysis of its structure, function, and operation</summary>
    <language>eng</language>
    <name>reverseengineering.stackexchange.com_en_all</name>
    <flavour></flavour>
    <category>stack_exchange</category>
    <tags>_category:stack_exchange;stack_exchange;_ftindex:no;_pictures:yes;_videos:yes;_details:yes</tags>
    <articleCount>16304</articleCount>
    <mediaCount>8262</mediaCount>
    <link rel="http://opds-spec.org/image/thumbnail"
          href="/catalog/v2/illustration/00099ae9-499a-15b4-2e48-27fe7b56e28f/?size=48"
          type="image/png;width=48;height=48;scale=1"/>
    <link type="text/html" href="/content/reverseengineering.stackexchange.com_en_all_2022-11" />
    <author>
      <name>Stack Exchange</name>
    </author>
    <publisher>
      <name>Kiwix</name>
    </publisher>
    <dc:issued>2022-11-09T00:00:00Z</dc:issued>
    <link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://download.kiwix.org/zim/stack_exchange/reverseengineering.stackexchange.com_en_all_2022-11.zim.meta4" length="107168768" />
  </entry>
</feed>

@tim-moody beautifulsoup is supposed to parse and use the declared encoding. Can you share a short piece of code that shows it not working?

tim-moody · 2023-04-22T16:16:10Z

Can you share a short piece of code that shows it not working?

@rgaudin as I said above the only difference between the code that works and the code that doesn't is the url that is read. root.xml works and v2/entries doesn't. (The links to the code are above.)

But let's keep it simple:

import requests

kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint

test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'

r = requests.get(kiwix_root_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])

r = requests.get(kiwix_v2_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])

The first prints

print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea
<title>Matemáticas por Wikipedia</title>

The second

print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea
<title>MatemÃ¡ticas por Wikipedia</title>
<updated

rgaudin · 2023-04-22T19:10:42Z

@rgaudin as I said above the only difference between the code that works and the code that doesn't is the url that is read. root.xml works and v2/entries doesn't. (The links to the code are above.)

Well pardon me for being focused on our end of the issue and not yours 😅
It's clear from your initial message that the charset is not conveyed via HTTP. That doesn't make it a problem since we are sending the proper Content-Type as per the spec and the charset is specified in the document.

Now if you're looking at requests only, it would be just fair that encoding isn't picked up since it's in the document and requests here is being used for the transport only.

As I wrote before, beautifulsoup being the xml parser (with lxml), it states in its doc that it reads the XMLDecl and adjusts for it.

That said, I see those strings properly decoded here. And the response's apparent_encoding is correct. Can you try updating requests?

Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>>
>>> kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
>>> kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
>>>
>>> test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'
>>>
>>> r = requests.get(kiwix_root_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])

r = requests.get(kiwix_v2_catalog)
offset = r.text.find(test_entry)
print(r.text[offset:offset+100])
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>
>>>
>>> r = requests.get(kiwix_v2_catalog)
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>
>>> r.apparent_encoding
'utf-8'
>>>

holta · 2023-04-22T19:54:51Z

Can you try updating requests?

FWIW that didn't work for me on Ubuntu 23.04.

Even after I upgraded its requests from 2.8.1 (Jun 29, 2022) to 2.8.2 (Jan 12, 2023):

Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> print(requests.__version__)
2.28.2
>>> r = requests.get('https://library.kiwix.org/catalog/v2/entries?count=-1')
>>> offset = r.text.find('637723c8-a736-6b1c-10d6-1b63d79ed5ea')
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>MatemÃ¡ticas por Wikipedia</title>
    <updated
>>> r.apparent_encoding
'Windows-1254'

Anything else I should try above?

rgaudin · 2023-04-22T19:57:09Z

Ah ok so maybe requests can't pick it and defaulted to system encoding. Bs4 should though. I'll look into your script on Monday

holta · 2023-04-23T03:32:11Z

Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>>
>>> kiwix_root_catalog = 'https://library.kiwix.org/catalog/root.xml' # OPDS catalog
>>> kiwix_v2_catalog = 'https://library.kiwix.org/catalog/v2/entries?count=-1' # 3/21/2023 new OPDS endpoint
>>>
>>> test_entry = '637723c8-a736-6b1c-10d6-1b63d79ed5ea'

...

>>> r = requests.get(kiwix_v2_catalog)
>>> offset = r.text.find(test_entry)
>>> print(r.text[offset:offset+100])
637723c8-a736-6b1c-10d6-1b63d79ed5ea</id>
    <title>Matemáticas por Wikipedia</title>
    <updated>
>>> r.apparent_encoding
'utf-8'

@rgaudin is there anything special about your OS / environment / conditions above?

(So far others cannot reproduce your output above — despite trying on several different "stock" freshly installed Linux OS's like Ubuntu, Debian and Raspberry Pi OS — any suggestions or tips?)

tim-moody · 2023-04-23T11:49:06Z

so we can work around this by explicitly decoding:

utf8_xlm = r.content.decode("utf-8")
offset = utf8_xlm.find(test_entry)
print(utf8_xlm[offset:offset+100])

Still, library.xml, library_zims.xml, and root.xml can all be read by a browser, and this is usefull for troubleshooting.

rgaudin · 2023-04-24T10:44:34Z

Yes ; seems quite obvious that using r.text was using the requests-decoded string (and thus System one) ; leading to the issue.

https://github.com/iiab/iiab-admin-console/blob/6138f5217a4ce4f19c6d130a8ebfc65c2116ae1d/roles/cmdsrv/templates/scripts/get_kiwix_catalog_opds_v2.py#L94

I'd suggest you pass r.content directly to BS instead of decoding to UTF-8. There's no chance we'll switch to another encoding but by passing bytes, BS will use the declared encoding.

@tim-moody @kelson42 can we close this ticket?

kelson42 · 2023-04-24T10:54:28Z

@rgaudin Would that not be better (than current situation) if HTTP header gives charset as well?

rgaudin · 2023-04-24T11:09:43Z

@rgaudin Would that not be better (than current situation) if HTTP header gives charset as well?

I think not but for lousy reasons:

By respecting the spec's Content-Types we're preventing other tools that would expect this exact CTs from breaking if they didn't account for an extra charset specification. HTTP allows us to specify the charset but it's not well known.
This would mean maintaining the same information twice. Can surely be refactored so we only specify it once once use it both in the XML Declaration and the various endpoints with the different CTs though.

All in all, I think it's a better practice. In the example above, previous code was correct because the HTTP header was sent so it used it but the new one was not because it was not reading the encoding anywhere. What if there's a discrepency between HTTP and XML? One should use XML so by not setting the charset in HTTP header, we're enforcing the better practice 😉

I can live with both situations though. Your call (or libkiwix developers')

veloman-yunkan · 2023-04-25T08:54:18Z

It was a trivial fix but I used the opportunity for some minor refactoring

tim-moody · 2023-04-26T15:55:38Z

Are we still talking about https://library.kiwix.org/catalog/v2/entries?count=-1? that when viewed in chrome has:

<entry>
    <id>urn:uuid:225744bb-9488-085c-cfa8-19bf352c9e44</id>
    <title>MatemÃ¡ticas por Wikipedia</title>
    <updated>2023-04-23T00:00:00Z</updated>
    <summary>Una selecciÃ³n de artÃculos de Wikipedia sobre matemÃ¡ticas</summary>
    <language>spa</language>
    <name>wikipedia_es_mathematics</name>
    <flavour>maxi</flavour>
    <category>wikipedia</category>
    <tags>wikipedia;_category:wikipedia;_pictures:yes;_videos:no;_details:yes;_ftindex:yes</tags>
    <articleCount>6474</articleCount>
    <mediaCount>82062</mediaCount>
    <link rel="http://opds-spec.org/image/thumbnail"
          href="/catalog/v2/illustration/225744bb-9488-085c-cfa8-19bf352c9e44/?size=48"
          type="image/png;width=48;height=48;scale=1"/>
    <link type="text/html" href="/content/wikipedia_es_mathematics_maxi_2023-04" />
    <author>
      <name>Wikipedia</name>
    </author>
    <publisher>
      <name>Kiwix</name>
    </publisher>
    <dc:issued>2023-04-23T00:00:00Z</dc:issued>
    <link rel="http://opds-spec.org/acquisition/open-access" type="application/x-zim" href="https://download.kiwix.org/zim/wikipedia/wikipedia_es_mathematics_maxi_2023-04.zim.meta4" length="693760000" />
  </entry>

rgaudin · 2023-04-26T15:57:15Z

@tim-moody It's been merged but not deployed. Nightlies are deployed everyday at https://dev.library.kiwix.org/ but library.kiwix.org only changes after kiwix-tools releases

kelson42 added bug question kiwix-serve labels Apr 22, 2023

kelson42 added this to the 12.1.0 milestone Apr 22, 2023

kelson42 assigned veloman-yunkan Apr 22, 2023

kelson42 mentioned this issue Apr 22, 2023

Add README to download.kiwix.org/library/ kiwix/operations#30

Closed

holta mentioned this issue Apr 22, 2023

Multi-lingual Characters (e.g. accents) mangled, perhaps due to recent Kiwix catalog changes? iiab/iiab#3545

Closed

veloman-yunkan mentioned this issue Apr 25, 2023

Explicit charset info in OPDS response Content-Type header #942

Merged

mgautierfr closed this as completed in #942 Apr 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`catalog/v2` Kiwix Server end-point does not specify anymore the charset in the HTTP headers #941

`catalog/v2` Kiwix Server end-point does not specify anymore the charset in the HTTP headers #941

kelson42 commented Apr 22, 2023

tim-moody commented Apr 22, 2023

rgaudin commented Apr 22, 2023

tim-moody commented Apr 22, 2023

rgaudin commented Apr 22, 2023 •

edited

Loading

holta commented Apr 22, 2023

rgaudin commented Apr 22, 2023

holta commented Apr 23, 2023

tim-moody commented Apr 23, 2023

rgaudin commented Apr 24, 2023

kelson42 commented Apr 24, 2023

rgaudin commented Apr 24, 2023

veloman-yunkan commented Apr 25, 2023

tim-moody commented Apr 26, 2023

rgaudin commented Apr 26, 2023

catalog/v2 Kiwix Server end-point does not specify anymore the charset in the HTTP headers #941

catalog/v2 Kiwix Server end-point does not specify anymore the charset in the HTTP headers #941

Comments

kelson42 commented Apr 22, 2023

tim-moody commented Apr 22, 2023

rgaudin commented Apr 22, 2023

tim-moody commented Apr 22, 2023

rgaudin commented Apr 22, 2023 • edited Loading

holta commented Apr 22, 2023

rgaudin commented Apr 22, 2023

holta commented Apr 23, 2023

tim-moody commented Apr 23, 2023

rgaudin commented Apr 24, 2023

kelson42 commented Apr 24, 2023

rgaudin commented Apr 24, 2023

veloman-yunkan commented Apr 25, 2023

tim-moody commented Apr 26, 2023

rgaudin commented Apr 26, 2023

`catalog/v2` Kiwix Server end-point does not specify anymore the charset in the HTTP headers #941

`catalog/v2` Kiwix Server end-point does not specify anymore the charset in the HTTP headers #941

rgaudin commented Apr 22, 2023 •

edited

Loading