Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPDS catalog languages endpoint #553

Merged
merged 9 commits into from
Aug 3, 2021
Merged

Conversation

veloman-yunkan
Copy link
Collaborator

@veloman-yunkan veloman-yunkan commented Jun 11, 2021

Fixes #584

This is a preliminary version of the /catalog/v2/languages endpoint intended for collecting early feedback.

  1. Is the format of the language list navigation feed OK?

  2. Is there a better way of translating language ISO codes to human-friendly names than using a hardcoded table?

@codecov
Copy link

codecov bot commented Jun 11, 2021

Codecov Report

Merging #553 (ab30957) into master (49322f5) will increase coverage by 0.25%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #553      +/-   ##
==========================================
+ Coverage   64.92%   65.17%   +0.25%     
==========================================
  Files          53       53              
  Lines        3738     3765      +27     
  Branches     1874     1886      +12     
==========================================
+ Hits         2427     2454      +27     
  Misses       1309     1309              
  Partials        2        2              
Impacted Files Coverage Δ
include/library.h 91.66% <ø> (ø)
include/opds_dumper.h 100.00% <ø> (ø)
src/server/internalServer.h 100.00% <ø> (ø)
src/library.cpp 78.51% <100.00%> (-0.70%) ⬇️
src/opds_dumper.cpp 100.00% <100.00%> (ø)
src/server/internalServer_catalog_v2.cpp 94.91% <100.00%> (+1.16%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 49322f5...ab30957. Read the comment docs.

@MananJethwani
Copy link
Contributor

@veloman-yunkan can we have the ISO codes as well in the response they are required for making requests.

Copy link
Collaborator

@kelson42 kelson42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veloman-yunkan Thank you for your quick move on this! Regarding the XML, I have the feeling you try to follow OPDS formalism and I'm not familliar with these requirements. I prefer to let you judge with @mgautierfr on this.

What I see is that this is pretty verbose and what I would have basically foreseen is the title node and a missing node with the ISO 639-3 language code (from the ZIM meta info Language) to allow filtering (like requested by @MananJethwani). Maybe this should be used in place of the current id node value? If OPDS expects from us the other fields, I would obviously keep them. Otherwise I would recommend to consider removing them if no real usage/reason for them.

Could you please as well please document the new end-points (at least the URL for the moment) at https://wiki.kiwix.org/wiki/OPDS, for both language and category end-points? This should be documented properly if we want people to use these new end-points.

One additional side remark: Kiwix Android, in its current filter displays the number of books per language. Considering that we only display the language with books, I wonder if adding a node with the number of books would be apropriate in this feed?

@veloman-yunkan
Copy link
Collaborator Author

@veloman-yunkan can we have the ISO codes as well in the response they are required for making requests.

@MananJethwani The language code was already present in the <link> element but I also added it as a separate <dc:language> element similar to the example from the OPDS spec.

@MananJethwani
Copy link
Contributor

@veloman-yunkan I get this error when I try to fetch data from localhost:8000/catalog/v2/languages
Screenshot from 2021-06-14 16-17-31

the server response in logs ->

Requesting : 
full_url  : /catalog/v2/languages
method    : GET (0)
version   : HTTP/1.1
request#  : 33
headers   :
 - accept : '*/*'
 - accept-encoding : 'gzip, deflate'
 - accept-language : 'en-GB,en;q=0.5'
 - connection : 'keep-alive'
 - host : 'localhost:8000'
 - referer : 'http://localhost:8000/'
 - user-agent : 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0'
arguments :
Parsed : 
full_url: /catalog/v2/languages
url   : /catalog/v2/languages
acceptEncodingDeflate : 1
has_range : 0
is_valid_url : 1
.............
===== Unhandled error : map::at
** running handle_catalog** running handle_catalog_v2========== INTERNAL ERROR !! ============
Response :
httpResponseCode : 500
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
 - Content-Encoding: 'deflate'
 - Vary: 'Accept-Encoding'
Request time : 0.001489s
----------------------

I don't know if this is intentional and you are working on it but thought it would be good to mention.😅

@veloman-yunkan
Copy link
Collaborator Author

@MananJethwani That's expected. The language table currently contains only three languages used in the unit tests.

@kelson42
Copy link
Collaborator

@veloman-yunkan OK, so lets wait to @mgautierfr feedback to get an agreement on the XML format so you can complete the PR with all available ZIM languages and fix the error.

@veloman-yunkan
Copy link
Collaborator Author

Could you please as well please document the new end-points (at least the URL for the moment) at https://wiki.kiwix.org/wiki/OPDS, for both language and category end-points? This should be documented properly if we want people to use these new end-points.

We don't yet want people to start using these new end-points. We will document them once the new API is stable.

One additional side remark: Kiwix Android, in its current filter displays the number of books per language. Considering that we only display the language with books, I wonder if adding a node with the number of books would be apropriate in this feed?

I am not going to judge about the appropriateness of this enhancement, but I see no difficulty implementing it.

@mgautierfr
Copy link
Member

Is the format of the language list navigation feed OK?

Seems good enough for me.

Is there a better way of translating language ISO codes to human-friendly names than using a hardcoded table?

Do we need human-friendly names ?
As @MananJethwani suggests, the iso code should be our primary "data".
We could provide human friendly names but I'm not sure we want to.
The human-friendly names are part of the UI. It should be handled by the client presenting the language to the user.
We MAY have this in libkiwix, but it should be done on the client side, not in the OPDS stream.

(To answer your question, it is probably better to use icu to handle languages)

That's expected. The language table currently contains only three languages used in the unit tests.

This could be expected that we don't support language we don't know. But we must not have a internal server error.
We can never be sure we know about all the language. (And anyway, one could create a zim with wrong language set)

Considering that we only display the language with books, I wonder if adding a node with the number of books would be apropriate in this feed?

The OPDS Spec speaks about thr:count attribute for Facet (https://specs.opds.io/opds-1.2#the-thrcount-attribute). (Which came from https://datatracker.ietf.org/doc/html/rfc4685).
Even if nothing is said about feed, we could use thr:count here.

@kelson42
Copy link
Collaborator

The proposed "title" node is the language fullname or friendlyname.

It is important to put this language fullname in the OPDS stream to make it directly usable. We can not assume here the OPDS indgester will have access to libkiwix ABI.

The mapping beetween code and fullname should be done with libicu indeed.

if no mapping, i would avoid to create a "title" node or put an empty string value.

@mgautierfr
Copy link
Member

It is important to put this language fullname in the OPDS stream to make it directly usable.

We COULD add the fullname as a helper yes. But it is not our primary data.

And one important point here is that we don't know the language of the client, so the fullname will be in the language itself or in English.
That is why the real "translation" (from iso code to fullname) should be made on client side (either with libkiwix or with one of the numerous ways available to get a name from a iso code)

We can not assume here the OPDS indgester will have access to libkiwix ABI.

Even more, we must design a API usable without libkiwix.
libkiwx on the client side is just a helper for our tools.

That is why we need to design the stream closest to the standard. And the standard is ISO 639 (either version 639-1, 639-2 or 639-3. And we must decide now as it is part of the spec.)

@kelson42
Copy link
Collaborator

It is important to put this language fullname in the OPDS stream to make it directly usable.

We COULD add the fullname as a helper yes. But it is not our primary data.

And one important point here is that we don't know the language of the client, so the fullname will be in the language itself or in English.

Language name should be in its own language.

That is why the real "translation" (from iso code to fullname) should be made on client side (either with libkiwix or with one of the numerous ways available to get a name from a iso code)

For now, for me, this is not a requirement, at any level. Not sure if it will ever be required to have language names translated.

We can not assume here the OPDS indgester will have access to libkiwix ABI.

Even more, we must design a API usable without libkiwix.
libkiwx on the client side is just a helper for our tools.

That is why we need to design the stream closest to the standard. And the standard is ISO 639 (either version 639-1, 639-2 or 639-3. And we must decide now as it is part of the spec.)

Like stated in an earlier comment, we need ISO 639-3 because this is what is used in the ZIM metadata.

@stale
Copy link

stale bot commented Jun 23, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jun 23, 2021
@kelson42
Copy link
Collaborator

@veloman-yunkan Do I'm right, you need to finish this Pr?

@stale stale bot removed the stale label Jun 23, 2021
@stale
Copy link

stale bot commented Jul 1, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jul 1, 2021
@veloman-yunkan
Copy link
Collaborator Author

@veloman-yunkan Do I'm right, you need to finish this Pr?

@kelson42 Yes, I will finish it.

@veloman-yunkan
Copy link
Collaborator Author

@mgautierfr @kelson42

it is probably better to use icu to handle languages

The mapping beetween code and fullname should be done with libicu indeed.

The version of libicu built by kiwix-build explicitly excludes the language data (among other types of data). The conversion from language code to human-friendly names is achieved in new commit 0198cda paired with kiwix/kiwix-build#494. Note that since language data in ICU contains much more information than we need for this enhancement, libicu (as modified by kiwix/kiwix-build#494) gets bigger by a couple of megabytes, whereas the solution utilizing a custom map would need only a few kilobytes in libkiwix.

Please advise which solution should be adopted for this PR.

@kelson42
Copy link
Collaborator

kelson42 commented Jul 6, 2021

Lets "pay" the few MBs.

Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small issues open to discussion.
Else we have to finish kiwix/kiwix-build#494 and wait for nightly ci to run to have the last icu data.

src/server/internalServer_catalog_v2.cpp Outdated Show resolved Hide resolved
Comment on lines 86 to 102
std::string getLanguageSelfName(const std::string& lang) {
return langMap.at(lang).selfName;
const icu::Locale locale(lang.c_str());
icu::UnicodeString ustring;
locale.getDisplayLanguage(locale, ustring);
std::string result;
ustring.toUTF8String(result);
return result;
};

std::string getLanguageEnglishName(const std::string& lang) {
return langMap.at(lang).englishName;
const icu::Locale locale(lang.c_str());
icu::UnicodeString ustring;
locale.getDisplayLanguage(icu::Locale("en"), ustring);
std::string result;
ustring.toUTF8String(result);
return result;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder about the performance of this (although not really important).
Creating the icu::Locale may be time conssuming (to check).
Maybe we could have only one method returning a pair of string and so get the locale only once.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about simply dropping getLanguageEnglishName(). @kelson42 considers the current format of the language entry pretty verbose, so removing the <content> node will solve both concerns at the same time.

@veloman-yunkan
Copy link
Collaborator Author

That's expected. The language table currently contains only three languages used in the unit tests.

This could be expected that we don't support language we don't know. But we must not have a internal server error.
We can never be sure we know about all the language. (And anyway, one could create a zim with wrong language set)

Switching to ICU-based language handling eliminated the crash. However, for a language code unknown to ICU the code text is returned unmodified as the human-friendly name. Is that acceptable? An alternative is to convert it to a string of the form "Unknown language langcode" or similar.

@kelson42
Copy link
Collaborator

kelson42 commented Jul 8, 2021

Acceptable, at least for now and in the libkiwix.

@veloman-yunkan
Copy link
Collaborator Author

Considering that we only display the language with books, I wonder if adding a node with the number of books would be apropriate in this feed?

The OPDS Spec speaks about thr:count attribute for Facet (https://specs.opds.io/opds-1.2#the-thrcount-attribute). (Which came from https://datatracker.ietf.org/doc/html/rfc4685).
Even if nothing is said about feed, we could use thr:count here.

This is also done

@kelson42
Copy link
Collaborator

kelson42 commented Jul 13, 2021

@veloman-yunkan I have rebased. Is this PR ready for a new review pass? If yes, please re-request a review. BTW, part of the CI seems to fail.

@veloman-yunkan
Copy link
Collaborator Author

@veloman-yunkan I have rebased. Is this PR ready for a new review pass? If yes, please re-request a review. BTW, part of the CI seems to fail.

@kelson42 kiwix/kiwix-build#494 must be merged first

@stale
Copy link

stale bot commented Jul 20, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jul 20, 2021
@stale stale bot removed the stale label Jul 28, 2021
@kelson42
Copy link
Collaborator

I have rebased.
@MananJethwani Please confirms the new end-point delivers what you need for #555
@mgautierfr This is ready to review/merge now.

@MananJethwani
Copy link
Contributor

MananJethwani commented Jul 28, 2021

@veloman-yunkan why is title in ISO-2 format, could it be possible to have full name in title?

@veloman-yunkan
Copy link
Collaborator Author

veloman-yunkan commented Jul 31, 2021

@veloman-yunkan why is title in ISO-2 format, could it be possible to have full name in title?

@MananJethwani That is already the case. The title contains the full name of the language in itself. The ISO code (but ISO-3 rather than ISO-2) is in the <dc:language> element.

@kelson42
Copy link
Collaborator

Here is how it looks like to me:

$ curl http://localhost:8080/catalog/v2/languages
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/terms/"
      xmlns:opds="https://specs.opds.io/opds-1.2">
  <id>54688692-5eac-41c8-ee1d-bb0b9aa51c3a</id>
  <link rel="self"
        href="/catalog/v2/languages"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <link rel="start"
        href="/catalog/v2/root.xml"
        type="application/atom+xml;profile=opds-catalog;kind=navigation"/>
  <title>List of languages</title>
  <updated>2021-07-31T18:37:26Z</updated>

  <entry>
    <title>English</title>
    <dc:language>eng</dc:language>
    <thr:count>1</thr:count>
    <link rel="subsection"
          href="/catalog/v2/entries?lang=eng"
          type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
    <updated>2021-07-31T18:37:26Z</updated>
    <id>c0c20680-d69e-0eb4-548f-b4fc22bcc400</id>
  </entry>
  <entry>
    <title>français</title>
    <dc:language>fra</dc:language>
    <thr:count>1</thr:count>
    <link rel="subsection"
          href="/catalog/v2/entries?lang=fra"
          type="application/atom+xml;profile=opds-catalog;kind=acquisition"/>
    <updated>2021-07-31T18:37:26Z</updated>
    <id>a07fb2e6-ce28-6a50-708d-ef76c614c166</id>
  </entry>
</feed>

Looks to be the kind of thing I was expected. LGTM

@MananJethwani
Copy link
Contributor

sorry I might have checked and older version LGTM as well

Added a new entry in /catalog/v2/root.xml that points to a
not-yet-existing list of languages navigation feed.
Introduced a helper function `Library::getBookPropValueSet()` and
deduplicated Library::getBooks{Languages,Creators,Publishers}() methods.
Language code to human friendly name translation is now done with the
help of the ICU library. It works if the line

```
-include $(LANGSRCDIR)/resfiles.mk
```

in the file `source/data/Makefile.in` of the icu4c dependency is not
commented out. Currently, the said line is commented out (along with
some other include's) by the `icu4c_custom_data.patch` patch of the
`kiwix-build` tool.
@mgautierfr
Copy link
Member

Look good to me also (code and opds output).

Rebased-fixup on master.

@mgautierfr mgautierfr merged commit b4f7dfa into master Aug 3, 2021
@mgautierfr mgautierfr deleted the catalog_languages_endpoint branch August 3, 2021 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OPDS endpoint for languages
4 participants