Split tags in their own attributes #131

automactic · 2018-04-12T20:43:39Z

Currently we identify the existence of pictures, an embedded index, videos, etc. with tags. I can see a few shortcomings with this approach.

Clarity: Take pictures as an example. When tags contains nopic, we know the file does not have pictures. But when tags does not contain nopic, the natural inference is that we do not know whether it has pictures or not.
Self-contradictory: If tags has the value nopic;pic, what should we believe?
Multiple variants: The no-picture attribute can be expressed in many ways: nopic, no_pic, _nopic, NOPIC, nopicture, no_picture, without_pic, etc.
Not self-documenting: It is hard for someone who does not have experience with libkiwix to know what to test against. Suppose we have a zim file with "medicine" in tags, what exactly does it mean? Is this a zim file with only medicine-related content, or does it just mean it has a medicine article among many other topics?

I propose we separate tags into attributes or functions, for example: kiwix::Reader::hasPicture -> bool and kiwix::Reader::hasEmbeddedIndex -> bool.
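To make the ambiguity concrete, here is a minimal sketch of what a picture query over today's free-form tags could look like. The names `Presence`, `splitTags` and `hasPicture` are hypothetical and are not part of libkiwix; the sketch also assumes a three-state answer rather than the plain bool proposed above, because a missing or contradictory tag cannot honestly be mapped to true or false.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Three-state answer: the tags may simply not say anything.
enum class Presence { Yes, No, Unknown };

// Split a ';'-separated tag string, dropping empty entries
// (some tag strings start with a leading ';').
std::vector<std::string> splitTags(const std::string& tags) {
    std::vector<std::string> out;
    std::istringstream stream(tags);
    std::string tag;
    while (std::getline(stream, tag, ';')) {
        if (!tag.empty()) out.push_back(tag);
    }
    return out;
}

// Hypothetical helper, not an existing libkiwix function.
// "nopic" alone -> No, "pic" alone -> Yes,
// neither or both ("nopic;pic") -> Unknown.
Presence hasPicture(const std::string& tags) {
    bool sawPic = false;
    bool sawNoPic = false;
    for (const auto& tag : splitTags(tags)) {
        if (tag == "pic") sawPic = true;
        if (tag == "nopic") sawNoPic = true;
    }
    if (sawPic == sawNoPic) return Presence::Unknown;
    return sawPic ? Presence::Yes : Presence::No;
}
```

Note that the spelling variants listed above (no_pic, NOPIC, ...) would all fall into `Unknown` here, which is exactly the "multiple variants" problem.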


mgautierfr · 2018-04-24T10:14:55Z

I mainly agree here.

Free tags have the great advantage of being extensible without any change in code but, if used without care, they quickly lead to an inconsistent system.

To add a few more examples of the complexity of this:
Some recent zims have tags ";wikipedia;novid;_ftindex", ";wikipedia;nopic;_ftindex". Questions:

Does ";wikipedia;novid;_ftindex" have pictures?
Does ";wikipedia;nopic;_ftindex" have videos?
How do you parse this? nopic, novid, _ftindex may be useful if you know in advance (so, in the code) that those tags mean something and you search for their presence or absence. But wikipedia? It seems to be some kind of category. But the categories are an "open" list. You cannot assume you know all the categories of all zims. So, must you assume that all tags you don't know are categories? What about new "technical" tags (such as nodet) that the code doesn't know about? It may assume such a tag is a category and display the zim wrongly.

More complicated: older zims never contain videos, but when we started adding videos to zims, we introduced the novid tag for new zims not containing videos. So if you find a zim without the novid tag, does it mean that it is a new zim with videos or an old zim without videos?
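The parsing problem described above can be sketched as follows. This is only an illustration under stated assumptions: `ClassifiedTags` and `classifyTags` are hypothetical names, and the hard-coded list of known technical tags stands in for whatever a given application release happens to know.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct ClassifiedTags {
    std::vector<std::string> technical;   // tags this build knows about
    std::vector<std::string> categories;  // everything else, by assumption
};

// Hypothetical classifier: any tag that is not in the hard-coded list
// is assumed to be a category.
ClassifiedTags classifyTags(const std::string& tags) {
    static const std::set<std::string> knownTechnical = {
        "nopic", "novid", "_ftindex"};
    ClassifiedTags result;
    std::istringstream stream(tags);
    std::string tag;
    while (std::getline(stream, tag, ';')) {
        if (tag.empty()) continue;  // leading ';' yields an empty entry
        if (knownTechnical.count(tag) != 0) {
            result.technical.push_back(tag);
        } else {
            result.categories.push_back(tag);
        }
    }
    return result;
}
```

With this code, `classifyTags(";wikipedia;novid;_ftindex")` reports wikipedia as the only category, but `classifyTags("wikipedia;nodet")` reports both wikipedia and nodet as categories, because this build has never heard of nodet.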

However, I agree with Kelson, as he explained in another thread (#130): adding a new API for each new tag seems counter-productive. I like the idea of letting anyone create zims with the tags they want and use those tags the way they want in the final application, without any change in libzim or kiwix-lib.

That said, I may have the beginning of a solution (to be discussed).
Tags (which is actually a zim metadata entry, "Tags", that you can get with kiwix-lib's value = Reader::getTags() or Reader::getMetatag("Tags", value)*) is for now a list of values separated by ;.
Instead of a simple list of values, I suggest having a list of key/value pairs using the following format: key0:value0;key1:value1;... for example ftindex:yes;vid:no;pic:yes;category:wikipedia.

This way, we solve most (if not all) of our problems, while keeping the constraints we want:

An application that knows a tag can look specifically for it.
Unknown tags are simply ignored by the application.
No self-contradiction. It is always possible to create invalid tags, but in the normal case this is avoided, and a contradiction (pic:yes;pic:no) can be detected.
Self-documenting. By choosing the name of the key wisely, we can give a pretty good hint about what "medicine" means.
We still have a generic system. New tags can be introduced without breaking existing implementations.
The non-existence of a tag is information in itself. New zims would always have vid:yes or vid:no; if a zim has no vid:… tag, it means it is an old one without videos.
We can even introduce a new method in kiwix-lib that handles all the parsing while still being generic: string Reader::getTagValue(string key) (category = reader->getTagValue("category") or reader->getTagValue("pic") == "yes")
By keeping this information in the tags and not in separate metadata, this flexibility is also kept in the library.xml or OPDS stream.
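A minimal sketch of the key:value format proposed above, assuming the `;` and `:` separators from the example string. `parseKeyValueTags` is a hypothetical name, and `getTagValue` is written here as a free function standing in for the suggested `Reader::getTagValue`; neither exists in kiwix-lib as shown. Throwing on a contradiction and skipping entries without a `:` are choices made for this sketch only.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Parse "key0:value0;key1:value1" into a map.
// Throws std::invalid_argument on a contradiction such as "pic:yes;pic:no".
// Entries without ':' (old-style tags) are skipped in this sketch.
std::map<std::string, std::string> parseKeyValueTags(const std::string& tags) {
    std::map<std::string, std::string> result;
    std::istringstream stream(tags);
    std::string entry;
    while (std::getline(stream, entry, ';')) {
        const std::string::size_type colon = entry.find(':');
        if (colon == std::string::npos) continue;
        const std::string key = entry.substr(0, colon);
        const std::string value = entry.substr(colon + 1);
        const auto found = result.find(key);
        if (found != result.end() && found->second != value) {
            throw std::invalid_argument(
                "contradictory values for tag '" + key + "'");
        }
        result[key] = value;
    }
    return result;
}

// Stand-in for the proposed Reader::getTagValue(key).
// Returns an empty string when the key is absent, so an unknown
// or missing tag is simply ignored by the caller.
std::string getTagValue(const std::map<std::string, std::string>& tags,
                        const std::string& key) {
    const auto found = tags.find(key);
    return found == tags.end() ? std::string() : found->second;
}
```

For the example string ftindex:yes;vid:no;pic:yes;category:wikipedia, `getTagValue(tags, "category")` yields "wikipedia", and an absent key such as "vid" on an old zim yields an empty string, which the caller can treat as "old file, no videos".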

Comments ? :)

* this method is wrongly named and should be getMetadata, but that's not the point here.

automactic · 2018-04-26T15:12:25Z

I agree with @mgautierfr's approach.

The only thing I want to make sure of is that we hide implementation details from users of this library, so they don't have to parse the underlying backing data (whether it is encoded JSON, comma-separated values or something else). Specifically,

In libkiwix, Reader::getMetaTags should return std::unordered_map or something similar. Or maybe Reader::getMeta(std::string tagName). But I can see some complications in defining the value or return type with this approach.

In library.xml or a future online library API, store the metadata already parsed, in native XML or JSON format.
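One possible way around the return-type complication mentioned above is to keep string values in the map and put the interpretation into small typed accessors. The sketch below assumes that approach; the `MetaTags` class and its methods are hypothetical and do not exist in libkiwix, and the "yes"/"no" spelling of booleans is taken from the key:value example earlier in this thread.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical wrapper around already-parsed tags. Callers never see
// the backing format, only typed accessors with explicit fallbacks.
class MetaTags {
public:
    explicit MetaTags(std::unordered_map<std::string, std::string> values)
        : values_(std::move(values)) {}

    bool has(const std::string& key) const {
        return values_.count(key) != 0;
    }

    // Returns the stored string, or fallback when the key is absent.
    std::string getString(const std::string& key,
                          const std::string& fallback = "") const {
        const auto found = values_.find(key);
        return found == values_.end() ? fallback : found->second;
    }

    // Interprets "yes" / "no"; a missing key or any other value
    // yields fallback.
    bool getBool(const std::string& key, bool fallback) const {
        const auto found = values_.find(key);
        if (found == values_.end()) return fallback;
        if (found->second == "yes") return true;
        if (found->second == "no") return false;
        return fallback;
    }

private:
    std::unordered_map<std::string, std::string> values_;
};
```

The explicit fallback argument forces the caller to decide what a missing tag means, instead of the library guessing.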

kelson42 · 2018-04-26T15:19:31Z

What will certainly not happen is trying to build something which avoids inconsistencies, because with tags we can have inconsistencies. Inconsistencies are inherent to any open system, and ZIM publishers simply have to take care of that. So point (2) of @automactic is a "no point" to me.

kelson42 · 2018-04-26T15:32:15Z

Basically the plan is to:

Keep the tags.
Rename them in a positive manner (the opposite of today).
Move the ones that applications rely strongly on to "system" tags, with "_" at the beginning.
System tags should be documented and normalized.
If necessary we could move content identifiers like "youtube" or "wikipedia" to "_wikipedia"... But I'm not sure about that.
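The plan above keeps a flat tag list and separates "system" tags by a leading underscore. A minimal sketch of how a reader could split the two groups is shown below; `SplitTags` and `splitSystemTags` are hypothetical names, and the exact prefix rule is an assumption based on the existing _ftindex tag.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct SplitTags {
    std::vector<std::string> systemTags;  // tags starting with '_'
    std::vector<std::string> userTags;    // everything else
};

// Hypothetical helper: separates tags by the leading-underscore
// convention, without needing a list of known tag names.
SplitTags splitSystemTags(const std::string& tags) {
    SplitTags result;
    std::istringstream stream(tags);
    std::string tag;
    while (std::getline(stream, tag, ';')) {
        if (tag.empty()) continue;  // tolerate a leading ';'
        if (tag[0] == '_') {
            result.systemTags.push_back(tag);
        } else {
            result.userTags.push_back(tag);
        }
    }
    return result;
}
```

Unlike a hard-coded list of known names, the prefix lets an older application recognise a system tag it has never seen and keep it out of the user-facing tag display, although it still cannot know what that tag means.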

We do not solve the problem of "old" ZIM files or wrongly tagged ZIM files... but this is not a solvable problem to me. We should simply release more often to mitigate that problem.

@mgautierfr I'm not in favor of a key=value system, it sounds too complicated to me... and so far I'm not aware of any concrete use case/feature at reader level which is not doable with the current tag system (or an improved/normalised version of it).

We have a similar problem with the filenames, even if the application should never rely on the filename to sort ZIM files. I will publish a plan regarding all of this and we will have an opportunity to talk about that again. Please keep this ticket open and be a bit patient.

automactic · 2018-04-26T18:11:29Z

What for sure will not happen is trying to build something which avoids inconsistencies because with tags we can have inconsistencies.

I have a problem with this statement. Surely no system is perfect. But there is absolutely no reason for the designer not to try to prevent inconsistencies from happening. It is the same reason people use compilers and static checkers: catch errors as early as possible, at compile time, not at run time. If we do not try to create a system that reduces inconsistency, that leaves zim creators with more hoops to jump through and more things to check, and makes the end product problematic.

automactic · 2018-04-26T18:13:11Z

@kelson42, I understand you do not want to keep modifying libkiwix every time a new attribute is added. And I agree with you on this point.

But we cannot simply throw stuff into tags when we cannot find a good place for some of those attributes.

kelson42 · 2018-04-27T05:22:20Z

@automactic Yes, we can. You can do whatever you want with this system. If you can't, please provide a user story in another ticket explaining clearly what you want to do from a user perspective (which means as a Kiwix for iOS/MacOS developer).

rgaudin · 2019-06-19T17:55:53Z

As mentioned in kiwix/kiwix-tools/issues/291, there are two concepts behind “a zim file”:

the content. It's what the user eventually wants and is itself defined by:
- a category (wikipedia, phet, other)
- a language (fr, en, fa)
- a selection (movies, all)
- a format or variant (mini, nopic)
the actual file. It's a date-versioned version of the content:
- date (2019-06, 2019-05, latest).

Ideally, catalog-manipulating tools should allow filtering on any of those fields. For this to happen, the ZIM should record a value for those fields (implementation doesn't matter).
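As an illustration of filtering on those fields, here is a small sketch. `CatalogEntry`, `matches` and `filterEntries` are hypothetical names, and the flat struct of strings is an assumption made only for this example; as noted above, how the ZIM actually records the values does not matter.

```cpp
#include <string>
#include <vector>

// One record per ZIM file, using the fields listed above.
struct CatalogEntry {
    std::string category;   // e.g. "wikipedia", "phet"
    std::string language;   // e.g. "fr", "en", "fa"
    std::string selection;  // e.g. "movies", "all"
    std::string variant;    // e.g. "mini", "nopic"
    std::string date;       // e.g. "2019-06"
};

// A filter is an entry whose empty fields match anything.
bool matches(const CatalogEntry& entry, const CatalogEntry& filter) {
    return (filter.category.empty() || filter.category == entry.category) &&
           (filter.language.empty() || filter.language == entry.language) &&
           (filter.selection.empty() || filter.selection == entry.selection) &&
           (filter.variant.empty() || filter.variant == entry.variant) &&
           (filter.date.empty() || filter.date == entry.date);
}

// Returns the entries that satisfy every non-empty field of the filter.
std::vector<CatalogEntry> filterEntries(
    const std::vector<CatalogEntry>& entries, const CatalogEntry& filter) {
    std::vector<CatalogEntry> hits;
    for (const auto& entry : entries) {
        if (matches(entry, filter)) hits.push_back(entry);
    }
    return hits;
}
```

Leaving `date` empty in the filter selects the content regardless of which dated file carries it, which is the distinction between "the content" and "the actual file" drawn above.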

Recording this data would allow us to implement ZIM search engines (on top of OPDS) and/or provide permalinks to content without using the UUID nor referencing the date (changing) nor hacking around the file names.

kelson42 · 2019-08-13T14:06:57Z

We will carry on with our effort of renaming a few tags, see openzim/mwoffliner#485. A few will be "standardized", documented and/or made "hidden tags", ready to be relied on in software. That said, we will keep the idea that this stays a tag system, with all its strengths and weaknesses.

sobaee · 2022-03-25T21:42:42Z

Interesting

Now, please tell me the difference between two English Wikipedia medicine zim files found next to each other: one has a "maxi" tag and the other has no tag.

There is a very big difference in size.
Is it worth downloading the 7 GB file?

rgaudin · 2022-03-25T21:56:14Z

No tag means the full thing, so with videos for instance, which maxi doesn't include (though it includes pictures). I don't know wikimed, so there might be additional differences.

mgautierfr · 2022-03-28T09:22:24Z

Tags are information stored in the zim files themselves and can be found in the catalog (this is an OPDS catalog with different API endpoints, but if you want a simple XML feed with all the information, it is https://library.kiwix.org/catalog/v2/entries).

The filenames follow a different naming system; this is probably specified somewhere but I don't know where. @kelson42 may help here.

automactic mentioned this issue Apr 13, 2018

provide a way to get category #130

Closed

mgautierfr mentioned this issue Jan 7, 2019

Rename a few tags openzim/mwoffliner#485

Closed

This was referenced May 11, 2019

Simplify Wikipedia ZIM file offering openzim/zim-requests#129

Closed

Feature Request - Tags kiwix/kiwix-tools#71

Closed

mgautierfr mentioned this issue May 13, 2019

Include human readable IDs in OPDS feed kiwix/kiwix-tools#291

Closed

mgautierfr mentioned this issue Jun 24, 2019

Better display of the tags kiwix/kiwix-desktop#172

Closed

kelson42 closed this as completed Aug 13, 2019

kelson42 self-assigned this Aug 13, 2019

kelson42 added enhancement question labels Aug 13, 2019

mgautierfr mentioned this issue Mar 3, 2021

Add category support to OPDS kiwix/kiwix-tools#318

Closed
