Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simply the "Imports" section in the Makefile template #1194

Merged
merged 15 commits into from
Mar 25, 2025
Merged

Conversation

gouttegd
Copy link
Contributor

@gouttegd gouttegd commented Mar 4, 2025

This PR attempts to make the "Imports" section in Makefile.jinja2 easier to read and maintain.

First, it moves some of the logic into the odk.py script. Notably, the script now takes care of automatically assigning the default module type to all individual modules that do not have an explicit module_type parameter, so that we do not have to go fetch the default value in the template. That is, instead of having to do things like that:

{% if ('slme' == ont.module_type) or (ont.module_type is none and 'slme' == project.import_group.module_type) %}

(testing if the module is a SLME module or if it does not have an explicit type and the default module type is SLME), we can do simply:

{% if 'slme' == ont.module_type %}

That change alone should reduce the risks of blood clots forming in the brain of ODK developers trying to read the template by at least a few percent.

Second, it removes rules that were generated for nothing: when use_base_merging is used, there is no need to generate rules for the individual imports (see #1189).

Third, it removes the “fast track” for “large“ SLME modules. This was a special rule invoked for (1) SLME modules marked as “large” or (2) module using the (undocumented) type fast_slme. The only difference compared to the standard SLME rule is that the “preprocessing step” (removing original ontology annotations and inserting a dc:source annotation instead) is skipped in the fast track. This was because the preprocessing step used to be performed with SPARQL queries, which was memory-intensive for large modules. That step is now performed by the ODK plugin using pure Java code, and has no excessive memory requirement. Therefore, it is no longer needed to treat large modules differently.

Fourth, and though it is not really a “simplification” (on the contrary, it’s a new feature), it adds the possibility of disabling the extraction of a seed from the -edit file (option scan_signature = False) (#638).

Fifth, it re-formats the template according to a few rules:

  • break all lines so that no lines is longer than 100 characters (max) or 72 characters (preferred);
  • consistently use long option names (e.g., --input instead of -i, --term-file instead of -T, etc);
  • indent Jinja2 conditionals;
  • add comments at the end of conditional blocks to help identify which block is closed (e.g., {% endif %}{# !use_base_merging -#}).

gouttegd added 10 commits March 3, 2025 20:59
In order to make the Makefile template more readable, we need to move
some of the logic it contains into the odk.py script.

The general principle is that the configuration object that is given to
the templates should contain "pre-computed" fields that can be used
directly by the templates. This pre-computation will be done by the new
`_derive_fields()` method in the `ProductGroup` class and its derived
classes.

In this commit, we migrate some of the logic needed to deal with
imports. For example, instead of letting the Makefile template having to
deal with the possibility that the `module_type` of a given module might
not be set (in which case the template needs to get the value of the
`module_type` at the level of the entire ImportGroup), we ensure in the
`_derive_fields()` method of ImportGroup that all individual modules
have an explicit `module_type` (set from the group-level value if no
module type has been explicitly specified in the configuration file), so
that the Makefile template can simply use something like:

  {% if 'filter' == ont.module_type %}

instead of

  {% if ('filter' == ont.module_type) or
        (ont.module_type is none
         and 'filter' == project.import_group.module_type) %}

Likewise for the `slme_individuals` and `module_type_slme` parameters.

Also, we add a `special_products` derived field to the ImportGroup,
containing the list of all the modules that either (1) are marked as
"large", or (2) use a different type than the default, group-level
module type.

Overall, this makes the section of the template responsible for the
production of the rules for the individual modules easier to read.
The rules that produce the individual import modules are only generated
when use_base_merging is False, so there is no need, for every single
rule, to test for use_base_merging and to make the module dependent on
`$(MIRRORDIR)/merged.owl` if use_base_merging is True -- that can never
happen, if we are generating a rule for an individual import module then
we are necessarily under use_base_merging=False.
Indent Jinja2 conditionals (within the {% ... %} blocks to make the
template easier to read.

That is, instead of:

  {% if foo %}
  {% for b in bar %}
  {% if b.is_baz %}
  ...
  {% endif %}
  {% endfor %}
  {% endif %}

we do instead:

  {% if foo %}
  {%   for b in bar %}
  {%     if b.is_baz %}
  ...
  {%     endif %}
  {%   endfor %}
  {% endif %}

This does not solve *all* the readability issues of the Makefile
template, but at least it provides a relatively easy way to find, for
example, the `if` that corresponds to a given `endif`.
Reformat some import rules to make them more readable. Apply
fine-grained control of Jinja2 whitespaces to avoid generating spurious
blank lines in the Makefile.
If we are generating a `$(IMPORTDIR)/%_import.owl` rule, then it
necessarily means that `use_base_merging` is False, so there is no need
to test that field when generating the dependencies of the rule.

(This should have been done at the same time as the same change
concerning the rules for the "special" import modules.)
Move the rule that produce the OBO version of an import module towards
the end of the imports section, so that it is not mixed with the rules
that produce the actual import modules.

Apply some minor Jinja2 reformatting, and also add a comment explaining
why we need non-generic rules for the "special" imports.

Include the type of the module in the generated Makefile comment, as
advanced users might find that useful.
When generating a ROBOT command that takes repeated arguments from a
list, write each item of the list on a separate line. That is:

  robot extract --base-iri IRI1 \
                --base-iri IRI2 \
                --base-iri IRI3 \
                ...

instead of:

  robot extract --base-iri IRI1 --base-iri IRI2 --base-iri IRI3 ...

as the second form could produce arbitrary long lines depending on how
many items we need to expand.

This makes the generated Makefile easier to read, at the expense of
making the *template* slightly harder to read. But since the Makefile is
what the users are exposed to, its readability is more important than
the readability of its template (which only ODK developers need worry
about).
Though never officially documented, 'fast_slme' was a module type that
was almost identical to 'slme', the only difference being that a "fast"
SLME module was produced without removing the original ontology
annotations and without adding a 'dc:source' annotation. Since those
tasks used to be performed by SPARQL queries, they were very
memory-intensive and could be a bottlenecks for large modules.

In fact, SLME modules that were marked as "large" were automatically
treated as "fast SLME" modules because of their high memory requirement.

Now that adding the 'dc:source' annotation is added by the ODK plugin in
a way that does not involve SPARQL, that step is as fast as it could be
and, more importantly, does not depend on the size of the import module
anymore. Therefore, there is no longer any real need for a "fast track"
for large SLME modules.

So we remove that fast track entirely. All SLME modules are now equal,
regardless of their size. For backwards compatibility, if a module is
explicitly configured as a 'fast_slme' module, we silently recognise it
as a standard 'slme' module.

This also means that for now, the only effect of the 'is_large'
parameter is to allow to selectively *not* refresh large import modules,
while refreshing the other, non-large modules.
As suggested by a comment in the template, there is no real need for the
Makefile to catenate, sort, and "uniquify" the lists of terms we use as
a seed for imports (the automatically generated $(IMPORTSEED) and the
manually maintained $(IMPORT)_terms.txt). ROBOT is perfectly able to
accept:

* more than one --term-file;
* containing comments;
* containing unsorted terms;
* containing duplicated terms.

Anything we might do before passing the lists to ROBOT would in fact be
redundant with what ROBOT will do anyway, so we might as well let ROBOT
do everything.

This means that we no longer need any *_terms_combined.txt file.
Add a 'scan_signature' option in the ImportGroup. If true (which is the
default), a "pre-seed" is extracted from the -edit file (and its
components, if any). That pre-seed is added to the main import seed
(tmp/seed.txt) along with any terms extracted from the DOSDP patterns
(if any). This is the behaviour of the current version of the ODK.

If that option is set to false, then the pre-seed is not extracted, and
terms from DOSDP patterns are not added to the main seed as well. This
means that the main seed can now only contain the custom seed optionally
set by the 'use_custom_import_module'. This, in turn, means that the
only terms that will effectively be imported will be those actively
selected by the user (either in the custom import module, or in the
individual *_terms.txt files).

If we do not scan the signature and there is no custom import module,
then the tmp/seed.txt file would effectively be empty, so in that case
we do not even generate the corresponding rule in the Makefile.

closes #638
@gouttegd gouttegd self-assigned this Mar 4, 2025
gouttegd added 2 commits March 4, 2025 19:42
When refactoring the `$(IMPORTDIR)/merged_import.owl` rule, an error was
introduced: the `remove` command that removes annotation properties only
took as input (for the list of annotation properties to preserve) the
properties listed in ANNOTATION_PROPERTIES. It should also take as input
all the seeds, so that any annotation property listed here is preserved
as well (this is the original behaviour).
@gouttegd gouttegd requested a review from matentzn March 4, 2025 23:23
gouttegd added 2 commits March 5, 2025 12:54
When creating the merged import module, there is no need to preserve the
original ontology annotations. They are actually misleading since they
are only the annotations of the _first_ mirror to have been merged.
(Prior to the refactoring, those annotations were removed by the
"postprocess" SPARQL script.)

Conversely, when creating a generic SLME import module, during the
extraction step we do want to preserve annotations -- otherwise the
dc:source annotation we have explicitly added at the beginning of the
command pipeline would be lost.
Copy link
Contributor

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the code line by line and LGTM; I made some mostly irrelevant comments, and currently running some tests. This may take a bit, but just wanted to ensure you I am working in it :)

@gouttegd
Copy link
Contributor Author

currently running some tests

FYI I tested it on all FlyBase ontologies (FBdv, FBbt, DPO, FBcv) without issues. But of course they are all using a merged import module, so this did not test all the code paths.

@gouttegd
Copy link
Contributor Author

@matentzn Are you still testing this?

@matentzn
Copy link
Contributor

Sorry, I was buried in a backlog, now I am running some tests.

Copy link
Contributor

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested with three ontologies I care about (HPO, RO and CHR) and it the behaviour looks fine to me. Awesome! THANK you for this!

@gouttegd gouttegd merged commit ab99581 into master Mar 25, 2025
1 check passed
@gouttegd gouttegd deleted the rework-imports branch March 25, 2025 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants