-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simply the "Imports" section in the Makefile template #1194
Conversation
In order to make the Makefile template more readable, we need to move some of the logic it contains into the odk.py script. The general principle is that the configuration object that is given to the templates should contain "pre-computed" fields that can be used directly by the templates. This pre-computation will be done by the new `_derive_fields()` method in the `ProductGroup` class and its derived classes. In this commit, we migrate some of the logic needed to deal with imports. For example, instead of letting the Makefile template having to deal with the possibility that the `module_type` of a given module might not be set (in which case the template needs to get the value of the `module_type` at the level of the entire ImportGroup), we ensure in the `_derive_fields()` method of ImportGroup that all individual modules have an explicit `module_type` (set from the group-level value if no module type has been explicitly specified in the configuration file), so that the Makefile template can simply use something like: {% if 'filter' == ont.module_type %} instead of {% if ('filter' == ont.module_type) or (ont.module_type is none and 'filter' == project.import_group.module_type) %} Likewise for the `slme_individuals` and `module_type_slme` parameters. Also, we add a `special_products` derived field to the ImportGroup, containing the list of all the modules that either (1) are marked as "large", or (2) use a different type than the default, group-level module type. Overall, this makes the section of the template responsible for the production of the rules for the individual modules easier to read.
The rules that produce the individual import modules are only generated when use_base_merging is False, so there is no need, for every single rule, to test for use_base_merging and to make the module dependent on `$(MIRRORDIR)/merged.owl` if use_base_merging is True -- that can never happen, if we are generating a rule for an individual import module then we are necessarily under use_base_merging=False.
Indent Jinja2 conditionals (within the {% ... %} blocks to make the template easier to read. That is, instead of: {% if foo %} {% for b in bar %} {% if b.is_baz %} ... {% endif %} {% endfor %} {% endif %} we do instead: {% if foo %} {% for b in bar %} {% if b.is_baz %} ... {% endif %} {% endfor %} {% endif %} This does not solve *all* the readability issues of the Makefile template, but at least it provides a relatively easy way to find, for example, the `if` that corresponds to a given `endif`.
Reformat some import rules to make them more readable. Apply fine-grained control of Jinja2 whitespaces to avoid generating spurious blank lines in the Makefile.
If we are generating a `$(IMPORTDIR)/%_import.owl` rule, then it necessarily means that `use_base_merging` is False, so there is no need to test that field when generating the dependencies of the rule. (This should have been done at the same time as the same change concerning the rules for the "special" import modules.)
Move the rule that produce the OBO version of an import module towards the end of the imports section, so that it is not mixed with the rules that produce the actual import modules. Apply some minor Jinja2 reformatting, and also add a comment explaining why we need non-generic rules for the "special" imports. Include the type of the module in the generated Makefile comment, as advanced users might find that useful.
When generating a ROBOT command that takes repeated arguments from a list, write each item of the list on a separate line. That is: robot extract --base-iri IRI1 \ --base-iri IRI2 \ --base-iri IRI3 \ ... instead of: robot extract --base-iri IRI1 --base-iri IRI2 --base-iri IRI3 ... as the second form could produce arbitrary long lines depending on how many items we need to expand. This makes the generated Makefile easier to read, at the expense of making the *template* slightly harder to read. But since the Makefile is what the users are exposed to, its readability is more important than the readability of its template (which only ODK developers need worry about).
Though never officially documented, 'fast_slme' was a module type that was almost identical to 'slme', the only difference being that a "fast" SLME module was produced without removing the original ontology annotations and without adding a 'dc:source' annotation. Since those tasks used to be performed by SPARQL queries, they were very memory-intensive and could be a bottlenecks for large modules. In fact, SLME modules that were marked as "large" were automatically treated as "fast SLME" modules because of their high memory requirement. Now that adding the 'dc:source' annotation is added by the ODK plugin in a way that does not involve SPARQL, that step is as fast as it could be and, more importantly, does not depend on the size of the import module anymore. Therefore, there is no longer any real need for a "fast track" for large SLME modules. So we remove that fast track entirely. All SLME modules are now equal, regardless of their size. For backwards compatibility, if a module is explicitly configured as a 'fast_slme' module, we silently recognise it as a standard 'slme' module. This also means that for now, the only effect of the 'is_large' parameter is to allow to selectively *not* refresh large import modules, while refreshing the other, non-large modules.
As suggested by a comment in the template, there is no real need for the Makefile to catenate, sort, and "uniquify" the lists of terms we use as a seed for imports (the automatically generated $(IMPORTSEED) and the manually maintained $(IMPORT)_terms.txt). ROBOT is perfectly able to accept: * more than one --term-file; * containing comments; * containing unsorted terms; * containing duplicated terms. Anything we might do before passing the lists to ROBOT would in fact be redundant with what ROBOT will do anyway, so we might as well let ROBOT do everything. This means that we no longer need any *_terms_combined.txt file.
Add a 'scan_signature' option in the ImportGroup. If true (which is the default), a "pre-seed" is extracted from the -edit file (and its components, if any). That pre-seed is added to the main import seed (tmp/seed.txt) along with any terms extracted from the DOSDP patterns (if any). This is the behaviour of the current version of the ODK. If that option is set to false, then the pre-seed is not extracted, and terms from DOSDP patterns are not added to the main seed as well. This means that the main seed can now only contain the custom seed optionally set by the 'use_custom_import_module'. This, in turn, means that the only terms that will effectively be imported will be those actively selected by the user (either in the custom import module, or in the individual *_terms.txt files). If we do not scan the signature and there is no custom import module, then the tmp/seed.txt file would effectively be empty, so in that case we do not even generate the corresponding rule in the Makefile. closes #638
When refactoring the `$(IMPORTDIR)/merged_import.owl` rule, an error was introduced: the `remove` command that removes annotation properties only took as input (for the list of annotation properties to preserve) the properties listed in ANNOTATION_PROPERTIES. It should also take as input all the seeds, so that any annotation property listed here is preserved as well (this is the original behaviour).
When creating the merged import module, there is no need to preserve the original ontology annotations. They are actually misleading since they are only the annotations of the _first_ mirror to have been merged. (Prior to the refactoring, those annotations were removed by the "postprocess" SPARQL script.) Conversely, when creating a generic SLME import module, during the extraction step we do want to preserve annotations -- otherwise the dc:source annotation we have explicitly added at the beginning of the command pipeline would be lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the code line by line and LGTM; I made some mostly irrelevant comments, and currently running some tests. This may take a bit, but just wanted to ensure you I am working in it :)
FYI I tested it on all FlyBase ontologies (FBdv, FBbt, DPO, FBcv) without issues. But of course they are all using a merged import module, so this did not test all the code paths. |
@matentzn Are you still testing this? |
Sorry, I was buried in a backlog, now I am running some tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested with three ontologies I care about (HPO, RO and CHR) and it the behaviour looks fine to me. Awesome! THANK you for this!
This PR attempts to make the "Imports" section in
Makefile.jinja2
easier to read and maintain.First, it moves some of the logic into the
odk.py
script. Notably, the script now takes care of automatically assigning the default module type to all individual modules that do not have an explicitmodule_type
parameter, so that we do not have to go fetch the default value in the template. That is, instead of having to do things like that:(testing if the module is a SLME module or if it does not have an explicit type and the default module type is SLME), we can do simply:
That change alone should reduce the risks of blood clots forming in the brain of ODK developers trying to read the template by at least a few percent.
Second, it removes rules that were generated for nothing: when
use_base_merging
is used, there is no need to generate rules for the individual imports (see #1189).Third, it removes the “fast track” for “large“ SLME modules. This was a special rule invoked for (1) SLME modules marked as “large” or (2) module using the (undocumented) type
fast_slme
. The only difference compared to the standard SLME rule is that the “preprocessing step” (removing original ontology annotations and inserting adc:source
annotation instead) is skipped in the fast track. This was because the preprocessing step used to be performed with SPARQL queries, which was memory-intensive for large modules. That step is now performed by the ODK plugin using pure Java code, and has no excessive memory requirement. Therefore, it is no longer needed to treat large modules differently.Fourth, and though it is not really a “simplification” (on the contrary, it’s a new feature), it adds the possibility of disabling the extraction of a seed from the -edit file (option
scan_signature = False
) (#638).Fifth, it re-formats the template according to a few rules:
--input
instead of-i
,--term-file
instead of-T
, etc);{% endif %}{# !use_base_merging -#}
).