OPDS catalog languages endpoint #553

veloman-yunkan · 2021-06-11T09:14:46Z

Fixes #584

This is a preliminary version of the /catalog/v2/languages endpoint intended for collecting early feedback.

Is the format of the language list navigation feed OK?
Is there a better way of translating language ISO codes to human-friendly names than using a hardcoded table?

codecov · 2021-06-11T09:14:56Z

Codecov Report

Merging #553 (ab30957) into master (49322f5) will increase coverage by 0.25%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #553      +/-   ##
==========================================
+ Coverage   64.92%   65.17%   +0.25%     
==========================================
  Files          53       53              
  Lines        3738     3765      +27     
  Branches     1874     1886      +12     
==========================================
+ Hits         2427     2454      +27     
  Misses       1309     1309              
  Partials        2        2

Impacted Files	Coverage Δ
include/library.h	`91.66% <ø> (ø)`
include/opds_dumper.h	`100.00% <ø> (ø)`
src/server/internalServer.h	`100.00% <ø> (ø)`
src/library.cpp	`78.51% <100.00%> (-0.70%)`	⬇️
src/opds_dumper.cpp	`100.00% <100.00%> (ø)`
src/server/internalServer_catalog_v2.cpp	`94.91% <100.00%> (+1.16%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 49322f5...ab30957. Read the comment docs.

MananJethwani · 2021-06-12T15:03:38Z

@veloman-yunkan can we have the ISO codes as well in the response they are required for making requests.

kelson42

@veloman-yunkan Thank you for your quick move on this! Regarding the XML, I have the feeling you try to follow OPDS formalism and I'm not familliar with these requirements. I prefer to let you judge with @mgautierfr on this.

What I see is that this is pretty verbose and what I would have basically foreseen is the title node and a missing node with the ISO 639-3 language code (from the ZIM meta info Language) to allow filtering (like requested by @MananJethwani). Maybe this should be used in place of the current id node value? If OPDS expects from us the other fields, I would obviously keep them. Otherwise I would recommend to consider removing them if no real usage/reason for them.

Could you please as well please document the new end-points (at least the URL for the moment) at https://wiki.kiwix.org/wiki/OPDS, for both language and category end-points? This should be documented properly if we want people to use these new end-points.

One additional side remark: Kiwix Android, in its current filter displays the number of books per language. Considering that we only display the language with books, I wonder if adding a node with the number of books would be apropriate in this feed?

veloman-yunkan · 2021-06-13T19:42:11Z

@veloman-yunkan can we have the ISO codes as well in the response they are required for making requests.

@MananJethwani The language code was already present in the <link> element but I also added it as a separate <dc:language> element similar to the example from the OPDS spec.

MananJethwani · 2021-06-14T19:37:54Z

@veloman-yunkan I get this error when I try to fetch data from localhost:8000/catalog/v2/languages

the server response in logs ->

Requesting : 
full_url  : /catalog/v2/languages
method    : GET (0)
version   : HTTP/1.1
request#  : 33
headers   :
 - accept : '*/*'
 - accept-encoding : 'gzip, deflate'
 - accept-language : 'en-GB,en;q=0.5'
 - connection : 'keep-alive'
 - host : 'localhost:8000'
 - referer : 'http://localhost:8000/'
 - user-agent : 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0'
arguments :
Parsed : 
full_url: /catalog/v2/languages
url   : /catalog/v2/languages
acceptEncodingDeflate : 1
has_range : 0
is_valid_url : 1
.............
===== Unhandled error : map::at
** running handle_catalog** running handle_catalog_v2========== INTERNAL ERROR !! ============
Response :
httpResponseCode : 500
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
 - Content-Encoding: 'deflate'
 - Vary: 'Accept-Encoding'
Request time : 0.001489s
----------------------

I don't know if this is intentional and you are working on it but thought it would be good to mention.😅

veloman-yunkan · 2021-06-15T05:02:06Z

@MananJethwani That's expected. The language table currently contains only three languages used in the unit tests.

kelson42 · 2021-06-15T05:09:38Z

@veloman-yunkan OK, so lets wait to @mgautierfr feedback to get an agreement on the XML format so you can complete the PR with all available ZIM languages and fix the error.

veloman-yunkan · 2021-06-15T05:23:32Z

Could you please as well please document the new end-points (at least the URL for the moment) at https://wiki.kiwix.org/wiki/OPDS, for both language and category end-points? This should be documented properly if we want people to use these new end-points.

We don't yet want people to start using these new end-points. We will document them once the new API is stable.

One additional side remark: Kiwix Android, in its current filter displays the number of books per language. Considering that we only display the language with books, I wonder if adding a node with the number of books would be apropriate in this feed?

I am not going to judge about the appropriateness of this enhancement, but I see no difficulty implementing it.

mgautierfr · 2021-06-15T12:16:08Z

Is the format of the language list navigation feed OK?

Seems good enough for me.

Is there a better way of translating language ISO codes to human-friendly names than using a hardcoded table?

Do we need human-friendly names ?
As @MananJethwani suggests, the iso code should be our primary "data".
We could provide human friendly names but I'm not sure we want to.
The human-friendly names are part of the UI. It should be handled by the client presenting the language to the user.
We MAY have this in libkiwix, but it should be done on the client side, not in the OPDS stream.

(To answer your question, it is probably better to use icu to handle languages)

That's expected. The language table currently contains only three languages used in the unit tests.

This could be expected that we don't support language we don't know. But we must not have a internal server error.
We can never be sure we know about all the language. (And anyway, one could create a zim with wrong language set)

Considering that we only display the language with books, I wonder if adding a node with the number of books would be apropriate in this feed?

The OPDS Spec speaks about thr:count attribute for Facet (https://specs.opds.io/opds-1.2#the-thrcount-attribute). (Which came from https://datatracker.ietf.org/doc/html/rfc4685).
Even if nothing is said about feed, we could use thr:count here.

kelson42 · 2021-06-15T12:57:33Z

The proposed "title" node is the language fullname or friendlyname.

It is important to put this language fullname in the OPDS stream to make it directly usable. We can not assume here the OPDS indgester will have access to libkiwix ABI.

The mapping beetween code and fullname should be done with libicu indeed.

if no mapping, i would avoid to create a "title" node or put an empty string value.

mgautierfr · 2021-06-15T13:56:52Z

It is important to put this language fullname in the OPDS stream to make it directly usable.

We COULD add the fullname as a helper yes. But it is not our primary data.

And one important point here is that we don't know the language of the client, so the fullname will be in the language itself or in English.
That is why the real "translation" (from iso code to fullname) should be made on client side (either with libkiwix or with one of the numerous ways available to get a name from a iso code)

We can not assume here the OPDS indgester will have access to libkiwix ABI.

Even more, we must design a API usable without libkiwix.
libkiwx on the client side is just a helper for our tools.

That is why we need to design the stream closest to the standard. And the standard is ISO 639 (either version 639-1, 639-2 or 639-3. And we must decide now as it is part of the spec.)

kelson42 · 2021-06-15T19:51:21Z

It is important to put this language fullname in the OPDS stream to make it directly usable.

We COULD add the fullname as a helper yes. But it is not our primary data.

And one important point here is that we don't know the language of the client, so the fullname will be in the language itself or in English.

Language name should be in its own language.

That is why the real "translation" (from iso code to fullname) should be made on client side (either with libkiwix or with one of the numerous ways available to get a name from a iso code)

For now, for me, this is not a requirement, at any level. Not sure if it will ever be required to have language names translated.

We can not assume here the OPDS indgester will have access to libkiwix ABI.

Even more, we must design a API usable without libkiwix.
libkiwx on the client side is just a helper for our tools.

That is why we need to design the stream closest to the standard. And the standard is ISO 639 (either version 639-1, 639-2 or 639-3. And we must decide now as it is part of the spec.)

Like stated in an earlier comment, we need ISO 639-3 because this is what is used in the ZIM metadata.

stale · 2021-06-23T11:16:06Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

kelson42 · 2021-06-23T14:49:16Z

@veloman-yunkan Do I'm right, you need to finish this Pr?

stale · 2021-07-01T03:27:17Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

veloman-yunkan · 2021-07-01T07:04:54Z

@veloman-yunkan Do I'm right, you need to finish this Pr?

@kelson42 Yes, I will finish it.

veloman-yunkan · 2021-07-05T19:10:13Z

@mgautierfr @kelson42

it is probably better to use icu to handle languages

The mapping beetween code and fullname should be done with libicu indeed.

The version of libicu built by kiwix-build explicitly excludes the language data (among other types of data). The conversion from language code to human-friendly names is achieved in new commit 0198cda paired with kiwix/kiwix-build#494. Note that since language data in ICU contains much more information than we need for this enhancement, libicu (as modified by kiwix/kiwix-build#494) gets bigger by a couple of megabytes, whereas the solution utilizing a custom map would need only a few kilobytes in libkiwix.

Please advise which solution should be adopted for this PR.

kelson42 · 2021-07-06T11:36:00Z

Lets "pay" the few MBs.

mgautierfr

Two small issues open to discussion.
Else we have to finish kiwix/kiwix-build#494 and wait for nightly ci to run to have the last icu data.

src/server/internalServer_catalog_v2.cpp

mgautierfr · 2021-07-06T13:10:16Z

src/opds_dumper.cpp

 std::string getLanguageSelfName(const std::string& lang) {
-  return langMap.at(lang).selfName;
+  const icu::Locale locale(lang.c_str());
+  icu::UnicodeString ustring;
+  locale.getDisplayLanguage(locale, ustring);
+  std::string result;
+  ustring.toUTF8String(result);
+  return result;
 };

 std::string getLanguageEnglishName(const std::string& lang) {
-  return langMap.at(lang).englishName;
+  const icu::Locale locale(lang.c_str());
+  icu::UnicodeString ustring;
+  locale.getDisplayLanguage(icu::Locale("en"), ustring);
+  std::string result;
+  ustring.toUTF8String(result);
+  return result;
 };


I wonder about the performance of this (although not really important).
Creating the icu::Locale may be time conssuming (to check).
Maybe we could have only one method returning a pair of string and so get the locale only once.

What about simply dropping getLanguageEnglishName(). @kelson42 considers the current format of the language entry pretty verbose, so removing the <content> node will solve both concerns at the same time.

veloman-yunkan · 2021-07-08T12:29:37Z

That's expected. The language table currently contains only three languages used in the unit tests.

This could be expected that we don't support language we don't know. But we must not have a internal server error.
We can never be sure we know about all the language. (And anyway, one could create a zim with wrong language set)

Switching to ICU-based language handling eliminated the crash. However, for a language code unknown to ICU the code text is returned unmodified as the human-friendly name. Is that acceptable? An alternative is to convert it to a string of the form "Unknown language langcode" or similar.

kelson42 · 2021-07-08T12:31:14Z

Acceptable, at least for now and in the libkiwix.

veloman-yunkan · 2021-07-08T13:02:24Z

Considering that we only display the language with books, I wonder if adding a node with the number of books would be apropriate in this feed?

The OPDS Spec speaks about thr:count attribute for Facet (https://specs.opds.io/opds-1.2#the-thrcount-attribute). (Which came from https://datatracker.ietf.org/doc/html/rfc4685).
Even if nothing is said about feed, we could use thr:count here.

This is also done

kelson42 · 2021-07-13T12:35:40Z

@veloman-yunkan I have rebased. Is this PR ready for a new review pass? If yes, please re-request a review. BTW, part of the CI seems to fail.

veloman-yunkan · 2021-07-13T18:45:40Z

@veloman-yunkan I have rebased. Is this PR ready for a new review pass? If yes, please re-request a review. BTW, part of the CI seems to fail.

@kelson42 kiwix/kiwix-build#494 must be merged first

stale · 2021-07-20T23:28:58Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

kelson42 · 2021-07-28T09:35:35Z

I have rebased.
@MananJethwani Please confirms the new end-point delivers what you need for #555
@mgautierfr This is ready to review/merge now.

MananJethwani · 2021-07-28T15:31:20Z

@veloman-yunkan why is title in ISO-2 format, could it be possible to have full name in title?

veloman-yunkan · 2021-07-31T11:57:13Z

@veloman-yunkan why is title in ISO-2 format, could it be possible to have full name in title?

@MananJethwani That is already the case. The title contains the full name of the language in itself. The ISO code (but ISO-3 rather than ISO-2) is in the <dc:language> element.

kelson42 · 2021-07-31T16:39:00Z

Here is how it looks like to me:

$ curl http://localhost:8080/catalog/v2/languages
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:opds="https://specs.opds.io/opds-1.2">
  <id>54688692-5eac-41c8-ee1d-bb0b9aa51c3a</id>
  <link rel="self"
        href="/catalog/v2/languages"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <link rel="start"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <title>List of languages</title>
  <updated>2021-07-31T18:37:26Z</updated>

  <entry>
    <title>English</title>
    <dc:language>eng</dc:language>
    <thr:count>1</thr:count>
    <link rel="subsection"
          href="/catalog/v2/entries?lang=eng"
          type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
    <updated>2021-07-31T18:37:26Z</updated>
    <id>c0c20680-d69e-0eb4-548f-b4fc22bcc400</id>
  </entry>
  <entry>
    <title>français</title>
    <dc:language>fra</dc:language>
    <thr:count>1</thr:count>
    <link rel="subsection"
          href="/catalog/v2/entries?lang=fra"
          type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
    <updated>2021-07-31T18:37:26Z</updated>
    <id>a07fb2e6-ce28-6a50-708d-ef76c614c166</id>
  </entry>
</feed>

Looks to be the kind of thing I was expected. LGTM

MananJethwani · 2021-08-01T08:31:34Z

sorry I might have checked and older version LGTM as well

Added a new entry in /catalog/v2/root.xml that points to a not-yet-existing list of languages navigation feed.

Introduced a helper function `Library::getBookPropValueSet()` and deduplicated Library::getBooks{Languages,Creators,Publishers}() methods.

Language code to human friendly name translation is now done with the help of the ICU library. It works if the line ``` -include $(LANGSRCDIR)/resfiles.mk ``` in the file `source/data/Makefile.in` of the icu4c dependency is not commented out. Currently, the said line is commented out (along with some other include's) by the `icu4c_custom_data.patch` patch of the `kiwix-build` tool.

mgautierfr · 2021-08-03T09:34:12Z

Look good to me also (code and opds output).

Rebased-fixup on master.

veloman-yunkan requested review from mgautierfr and kelson42 June 11, 2021 09:15

kelson42 mentioned this pull request Jun 13, 2021

Use language/category API to populate Kiwix Serve filter select boxes #555

Closed

kelson42 reviewed Jun 13, 2021

View reviewed changes

stale bot added the stale label Jun 23, 2021

stale bot removed the stale label Jun 23, 2021

stale bot added the stale label Jul 1, 2021

veloman-yunkan force-pushed the catalog_languages_endpoint branch from 23c3a3a to 0198cda Compare July 4, 2021 11:38

stale bot removed the stale label Jul 4, 2021

veloman-yunkan mentioned this pull request Jul 4, 2021

Language data is packaged into ICU kiwix/kiwix-build#494

Merged

mgautierfr requested changes Jul 6, 2021

View reviewed changes

kelson42 force-pushed the catalog_languages_endpoint branch from 0dae28e to 2be68a9 Compare July 13, 2021 12:35

veloman-yunkan requested a review from mgautierfr July 13, 2021 18:46

stale bot added the stale label Jul 20, 2021

kelson42 force-pushed the catalog_languages_endpoint branch from 2be68a9 to 2b517d1 Compare July 28, 2021 09:29

stale bot removed the stale label Jul 28, 2021

kelson42 force-pushed the catalog_languages_endpoint branch from 2b517d1 to 6cd950f Compare July 31, 2021 16:27

veloman-yunkan added 9 commits August 3, 2021 11:32

List of languages entry in /catalog/v2/root.xml

b2027b3

Added a new entry in /catalog/v2/root.xml that points to a not-yet-existing list of languages navigation feed.

Helper function Library::getBookPropValueSet()

18871b4

Introduced a helper function `Library::getBookPropValueSet()` and deduplicated Library::getBooks{Languages,Creators,Publishers}() methods.

Made test library.xml a multi-language library

64b55db

Preliminary version of /catalog/v2/languages

5f90f5e

Language code in /catalog/v2/languages entries

8a4248e

OPDSDumper::categoriesOPDSFeed() with no args

96cf7e7

Got rid of <content> node in languages OPDS entry

45adda4

Languages OPDS feed includes book counts

ab30957

mgautierfr force-pushed the catalog_languages_endpoint branch from 6cd950f to ab30957 Compare August 3, 2021 09:32

mgautierfr approved these changes Aug 3, 2021

View reviewed changes

mgautierfr merged commit b4f7dfa into master Aug 3, 2021

mgautierfr deleted the catalog_languages_endpoint branch August 3, 2021 09:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OPDS catalog languages endpoint #553

OPDS catalog languages endpoint #553

veloman-yunkan commented Jun 11, 2021 •

edited by kelson42

Loading

codecov bot commented Jun 11, 2021 •

edited

Loading

MananJethwani commented Jun 12, 2021

kelson42 left a comment

veloman-yunkan commented Jun 13, 2021

MananJethwani commented Jun 14, 2021

veloman-yunkan commented Jun 15, 2021

kelson42 commented Jun 15, 2021

veloman-yunkan commented Jun 15, 2021

mgautierfr commented Jun 15, 2021

kelson42 commented Jun 15, 2021

mgautierfr commented Jun 15, 2021

kelson42 commented Jun 15, 2021

stale bot commented Jun 23, 2021

kelson42 commented Jun 23, 2021

stale bot commented Jul 1, 2021

veloman-yunkan commented Jul 1, 2021

veloman-yunkan commented Jul 5, 2021

kelson42 commented Jul 6, 2021

mgautierfr left a comment

mgautierfr Jul 6, 2021

veloman-yunkan Jul 8, 2021

veloman-yunkan commented Jul 8, 2021

kelson42 commented Jul 8, 2021

veloman-yunkan commented Jul 8, 2021

kelson42 commented Jul 13, 2021 •

edited

Loading

veloman-yunkan commented Jul 13, 2021

stale bot commented Jul 20, 2021

kelson42 commented Jul 28, 2021

MananJethwani commented Jul 28, 2021 •

edited

Loading

veloman-yunkan commented Jul 31, 2021 •

edited

Loading

kelson42 commented Jul 31, 2021

MananJethwani commented Aug 1, 2021

mgautierfr commented Aug 3, 2021

OPDS catalog languages endpoint #553

OPDS catalog languages endpoint #553

Conversation

veloman-yunkan commented Jun 11, 2021 • edited by kelson42 Loading

codecov bot commented Jun 11, 2021 • edited Loading

Codecov Report

MananJethwani commented Jun 12, 2021

kelson42 left a comment

Choose a reason for hiding this comment

veloman-yunkan commented Jun 13, 2021

MananJethwani commented Jun 14, 2021

veloman-yunkan commented Jun 15, 2021

kelson42 commented Jun 15, 2021

veloman-yunkan commented Jun 15, 2021

mgautierfr commented Jun 15, 2021

kelson42 commented Jun 15, 2021

mgautierfr commented Jun 15, 2021

kelson42 commented Jun 15, 2021

stale bot commented Jun 23, 2021

kelson42 commented Jun 23, 2021

stale bot commented Jul 1, 2021

veloman-yunkan commented Jul 1, 2021

veloman-yunkan commented Jul 5, 2021

kelson42 commented Jul 6, 2021

mgautierfr left a comment

Choose a reason for hiding this comment

mgautierfr Jul 6, 2021

Choose a reason for hiding this comment

veloman-yunkan Jul 8, 2021

Choose a reason for hiding this comment

veloman-yunkan commented Jul 8, 2021

kelson42 commented Jul 8, 2021

veloman-yunkan commented Jul 8, 2021

kelson42 commented Jul 13, 2021 • edited Loading

veloman-yunkan commented Jul 13, 2021

stale bot commented Jul 20, 2021

kelson42 commented Jul 28, 2021

MananJethwani commented Jul 28, 2021 • edited Loading

veloman-yunkan commented Jul 31, 2021 • edited Loading

kelson42 commented Jul 31, 2021

MananJethwani commented Aug 1, 2021

mgautierfr commented Aug 3, 2021

veloman-yunkan commented Jun 11, 2021 •

edited by kelson42

Loading

codecov bot commented Jun 11, 2021 •

edited

Loading

kelson42 commented Jul 13, 2021 •

edited

Loading

MananJethwani commented Jul 28, 2021 •

edited

Loading

veloman-yunkan commented Jul 31, 2021 •

edited

Loading