Split tags in their own attributes #131

Closed
automactic opened this issue Apr 12, 2018 · 12 comments

@automactic
Member

automactic commented Apr 12, 2018

Currently we identify the existence of pictures, an embedded index, videos, etc. with tags. I can see a few shortcomings with this approach.

  1. Clarity: Take pictures as an example. When tags contains nopic, we know the file has no pictures. But when tags does not contain nopic, the natural inference is that we do not know whether it has pictures or not.

  2. Self-contradictory: If tags has the value nopic;pic, which should we believe?

  3. Multiple variants: The no-picture attribute can be expressed in many ways: nopic, no_pic, _nopic, NOPIC, nopicture, no_picture, without_pic, etc.

  4. Not self-documenting: It is hard for someone who does not have experience with libkiwix to know what to test against. Suppose we have a zim file with "medicine" in tags, what exactly does it mean? Is this a zim file with only medicine-related content, or does it just mean it has medicine articles among many other topics?

I propose we separate tags into attributes or functions, for example: kiwix::Reader::hasPicture -> bool and kiwix::Reader::hasEmbeddedIndex -> bool.

@mgautierfr
Member

I mainly agree here.

Free tags have the great advantage of being extensible without any change in code but, if used without care, they quickly lead to an inconsistent system.

To add a few more examples of the complexity of this:
Some recent zims have tags ";wikipedia;novid;_ftindex", ";wikipedia;nopic;_ftindex". Questions:

  • Does ";wikipedia;novid;_ftindex" have pictures?
  • Does ";wikipedia;nopic;_ftindex" have videos?
  • How do you parse this? nopic, novid and _ftindex may be useful if you know in advance (so, in the code) that those tags mean something and you check for their presence or absence. But wikipedia? It seems to be some kind of category. But the categories are an "open" list. You cannot assume you know all the categories of all zims. So, must you assume that all tags you don't know are categories? What about new "technical" tags (such as nodet) that the code doesn't know about? It may assume this is a category and display the zim wrongly.

More complicated: older zims never contain videos, but when we started adding videos to zims, we introduced the novid tag for new zims not containing videos. But if you find a zim without the novid tag, does it mean that it is a new zim with videos or an old zim without videos?
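
To make the ambiguity concrete, here is a minimal C++ sketch. It is not kiwix-lib code: `splitTags`, `videoStatus` and `VideoStatus` are hypothetical names, and the only thing taken from this thread is that tags are a `;`-separated string. With flat tags, only the negative case is certain, so the helper can answer "no videos" or "unknown", never "has videos".

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: split a ';'-separated tag string, dropping empty items
// (the examples in this thread start with a leading ';').
std::set<std::string> splitTags(const std::string& tags) {
    std::set<std::string> result;
    std::istringstream stream(tags);
    std::string item;
    while (std::getline(stream, item, ';')) {
        if (!item.empty()) {
            result.insert(item);
        }
    }
    return result;
}

// Hypothetical result type: flat tags cannot express a certain "yes".
enum class VideoStatus { NoVideos, Unknown };

// Returns NoVideos when "novid" is present. Otherwise the answer is Unknown,
// because an old zim without videos also lacks the "novid" tag.
VideoStatus videoStatus(const std::string& tags) {
    if (splitTags(tags).count("novid") > 0) {
        return VideoStatus::NoVideos;
    }
    return VideoStatus::Unknown;
}
```

Under these assumptions, `videoStatus(";wikipedia;novid;_ftindex")` gives `NoVideos`, while `videoStatus(";wikipedia;nopic;_ftindex")` can only give `Unknown`.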

However, I agree with Kelson, as he explained in another thread (#130): adding a new API for each new tag seems counter-productive. I like the idea of letting anyone create zims with the tags they want and use those tags the way they want in the final application without any change in libzim or kiwix-lib.

That said, I may have the beginning of a solution (to be discussed).
Tags (which are actually a zim metadata entry, "Tags", that you can get with kiwix-lib's value = Reader::getTags() or Reader::getMetatag("Tags", value)*) are for now a list of values separated by ;.
Instead of a simple list of values, I suggest having a list of key/value pairs using the following format: key0:value0;key1:value1;... for example ftindex:yes;vid:no;pic:yes;category:wikipedia.

This way, we solve most (if not all) of our problems while keeping the constraints we want:

  • An application knowing a tag can look specifically for it.
  • Unknown tags are simply ignored by the application.
  • No self-contradiction. It is always possible to create invalid tags, but in the normal case it is avoided, and a contradiction (pic:yes;pic:no) can be detected.
  • Self-documenting. By choosing the key name wisely, we can give a pretty good hint about what "medicine" means.
  • We still have a generic system. New tags can be introduced without breaking existing implementations.
  • The non-existence of a tag is information in itself. New zims would always have vid:yes or vid:no; if a zim has no vid:… tag, it means it is an old one without videos.
  • We can even introduce a new method in kiwix-lib that handles all the parsing while still being generic: string Reader::getTagValue(string key) (category = reader->getTagValue("category") or reader->getTagValue("pic") == "yes")
  • By keeping this information in the tags and not in the metadata, this flexibility is also kept in the library.xml or OPDS stream.
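
A rough sketch of what such parsing could look like, under stated assumptions: `parseKeyValueTags` is a hypothetical helper, and `getTagValue` is written here as a free function rather than the `Reader` method suggested above, since no such method exists yet. Returning an empty string for an absent key and throwing on a contradiction are illustrative choices, not decisions made in this thread.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical parser for the proposed "key0:value0;key1:value1" format.
// Items without a ':' are skipped, so legacy flat tags are simply ignored.
// A key repeated with a different value (e.g. "pic:yes;pic:no") is rejected.
std::map<std::string, std::string> parseKeyValueTags(const std::string& tags) {
    std::map<std::string, std::string> result;
    std::istringstream stream(tags);
    std::string item;
    while (std::getline(stream, item, ';')) {
        const std::string::size_type colon = item.find(':');
        if (colon == std::string::npos || colon == 0) {
            continue;  // not a key:value item
        }
        const std::string key = item.substr(0, colon);
        const std::string value = item.substr(colon + 1);
        const auto existing = result.find(key);
        if (existing != result.end() && existing->second != value) {
            throw std::invalid_argument("contradictory values for tag: " + key);
        }
        result[key] = value;
    }
    return result;
}

// Sketch of the generic accessor: returns the value for `key`, or an empty
// string when the key is absent (absence is itself information).
std::string getTagValue(const std::string& tags, const std::string& key) {
    const std::map<std::string, std::string> parsed = parseKeyValueTags(tags);
    const auto found = parsed.find(key);
    return found == parsed.end() ? std::string() : found->second;
}
```

With the example string from above, `getTagValue("ftindex:yes;vid:no;pic:yes;category:wikipedia", "category")` yields `"wikipedia"`, and an unknown key such as `"foo"` yields an empty string.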

Comments ? :)

* this method is wrongly named and should be getMetadata, but that's not the point here.

@automactic
Member Author

automactic commented Apr 26, 2018

I agree with @mgautierfr's approach.

The only thing I want to make sure of is that we hide implementation details from users of this library, so they don't have to parse the underlying backing data (whether it is encoded JSON, comma-separated values or something else). Specifically:

In libkiwix, Reader::getMetaTags should return std::unordered_map or something similar. Or maybe Reader::getMeta(std::string tagName). But I can see some complications in defining the value or return type with this approach.

In library.xml or a future online library API, store the metadata parsed in native XML or JSON format.

@kelson42
Collaborator

kelson42 commented Apr 26, 2018

What for sure will not happen is trying to build something which avoids inconsistencies because with tags we can have inconsistencies. Inconsistencies are inherent to any open system, and ZIM publishers simply have to take care of that. So point (2) of @automactic is a "no point" to me.

@kelson42
Collaborator

kelson42 commented Apr 26, 2018

Basically the plan is to:

  • Keep the tags
  • Rename them in a positive manner (the opposite of today)
  • Move the ones that applications rely on strongly to "system" tags, with "_" at the beginning
  • System tags should be documented and normalized.
  • If necessary we could move content identifiers like "youtube" or "wikipedia" to "_wikipedia"... But I'm not sure about that.
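
As an illustration of the "system tag" convention in the plan above, a reader could separate tags by their leading underscore. This is only a sketch: `ClassifiedTags` and `classifyTags` are hypothetical names, and the rule "a leading `_` means system tag" is the single assumption taken from the list.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container separating the two kinds of tags.
struct ClassifiedTags {
    std::vector<std::string> systemTags;  // start with '_', meant for software
    std::vector<std::string> freeTags;    // everything else, publisher-defined
};

// Split a ';'-separated tag string and sort each tag into one of the groups.
ClassifiedTags classifyTags(const std::string& tags) {
    ClassifiedTags result;
    std::istringstream stream(tags);
    std::string item;
    while (std::getline(stream, item, ';')) {
        if (item.empty()) {
            continue;  // tolerate leading or doubled separators
        }
        if (item[0] == '_') {
            result.systemTags.push_back(item);
        } else {
            result.freeTags.push_back(item);
        }
    }
    return result;
}
```

Applied to ";wikipedia;novid;_ftindex", this puts `_ftindex` in the system group and leaves `wikipedia` and `novid` as free tags, which shows why tags that software relies on would need to be renamed for the convention to work.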

We do not solve the problem of "old" ZIM files or wrongly tagged ZIM files... but this is not a solvable problem to me. We should simply release more often to mitigate that problem.

@mgautierfr I'm not in favor of a key=value system; it sounds too complicated to me... and so far I'm not aware of any concrete use case/feature at the reader level which is not doable with the current tag system (or an improved/normalised version of it).

We have a similar problem with the filenames, even if the application should never rely on the filename to sort ZIM files. I will publish a plan regarding all of this and we will have an opportunity to talk about it again. Please keep this ticket open and be a bit patient.

@automactic
Member Author

automactic commented Apr 26, 2018

What for sure will not happen is trying to build something which avoids inconsistencies because with tags we can have inconsistencies.

I have a problem with this statement. Surely no system is perfect. But there is absolutely no reason for the designer not to try to prevent inconsistencies. It is the same reason people use compilers and static checkers: catch errors as early as possible, at compile time, not at run time. If we do not try to create a system that reduces inconsistency, that leaves zim creators with more hoops to jump through and more things to check, and makes the end product problematic.

@automactic
Member Author

@kelson42, I understand you do not want to keep modifying libkiwix every time a new attribute is added. And I agree with you on this point.

But we cannot simply throw stuff into tags when we cannot find a good place for some of those attributes.

@kelson42
Collaborator

@automactic Yes, we can. You can do whatever you want with this system. If you can't, please provide a user story in another ticket explaining clearly what you want to do from a user perspective (which means as a Kiwix for iOS/MacOS developer).

@rgaudin
Member

rgaudin commented Jun 19, 2019

As mentioned in kiwix/kiwix-tools/issues/291, there are two concepts behind “a zim file”:

  • the content. It's what the user eventually wants and is itself defined by:
    • a category (wikipedia, phet, other)
    • a language (fr, en, fa)
    • a selection (movies, all)
    • a format or variant (mini, nopic)
  • the actual file. It's a date-versioned version of the content:
    • date (2019-06, 2019-05, latest).

Ideally, catalog-manipulating tools should allow filtering on any of those fields. For this to happen, the ZIM should record a value for those fields (implementation doesn't matter).
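
To illustrate what filtering on those fields could look like, here is a small sketch. `ZimEntry`, `matches` and `filterEntries` are hypothetical names, not an existing API, and how the values are actually recorded in the ZIM is deliberately left open, as stated above.

```cpp
#include <string>
#include <vector>

// Hypothetical record of the fields listed above.
struct ZimEntry {
    std::string category;   // e.g. "wikipedia", "phet"
    std::string language;   // e.g. "fr", "en", "fa"
    std::string selection;  // e.g. "movies", "all"
    std::string variant;    // e.g. "mini", "nopic"
    std::string date;       // e.g. "2019-06"
};

// An empty field in the filter means "match anything" for that field.
bool matches(const ZimEntry& entry, const ZimEntry& filter) {
    const auto fieldMatches = [](const std::string& value,
                                 const std::string& wanted) {
        return wanted.empty() || value == wanted;
    };
    return fieldMatches(entry.category, filter.category) &&
           fieldMatches(entry.language, filter.language) &&
           fieldMatches(entry.selection, filter.selection) &&
           fieldMatches(entry.variant, filter.variant) &&
           fieldMatches(entry.date, filter.date);
}

// Return every catalog entry that satisfies the filter.
std::vector<ZimEntry> filterEntries(const std::vector<ZimEntry>& catalog,
                                    const ZimEntry& filter) {
    std::vector<ZimEntry> result;
    for (const ZimEntry& entry : catalog) {
        if (matches(entry, filter)) {
            result.push_back(entry);
        }
    }
    return result;
}
```

A filter with only `category` and `language` set would then select the same content across all dates and variants, which is the kind of stable reference a permalink needs.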

Recording this data would allow us to implement ZIM search engines (on top of OPDS) and/or provide permalinks to content without using the UUID, referencing the (changing) date, or hacking around the file names.

@kelson42
Collaborator

We will carry on with our effort of renaming a few tags, see openzim/mwoffliner#485. A few will be "standardized", documented and/or made "hidden tags" that software can rely on. That said, we will keep the idea that this stays a tag system with all its strengths and weaknesses.

@sobaee

sobaee commented Mar 25, 2022

Interesting

Now, please tell me the difference between two Wikipedia English medicine ZIM files found next to each other in the library: one has the tag "maxi" and the other has no tag.

There is a very big difference in size.
Is it worth downloading the 7 GB file?

Screenshot_20220326002229

@rgaudin
Member

rgaudin commented Mar 25, 2022

No tag means the full thing, so with videos for instance, which maxi doesn't include (though it does include pictures). I don't know wikimed, so there might be additional differences.

@mgautierfr
Member

Tags are information stored in the zim files themselves and can be found in the catalog (this is an OPDS catalog with different API endpoints, but if you want a simple XML feed with all the information, it is https://library.kiwix.org/catalog/v2/entries).

The filename follows a different naming system; this is probably specified somewhere, but I don't know where. @kelson42 may help here.
