Template Rework #403

Merged
merged 48 commits into from
Jun 25, 2019

beckyjackson
Contributor

@beckyjackson beckyjackson commented Nov 15, 2018

This PR completely reworks the logic of template. Instead of using a static TemplateOperation class, it now uses a Template class. This allows for a cleaner way to add more features to template.

This rework also supports new template strings. I've included the documentation and examples below. I apologize in advance about the length 😅 TLDR:

  • Property template strings
    • property logic (super properties, equivalents, disjoints, inverses)
    • extra property types (functional, inverse functional, symmetric, etc.)
  • Individual template strings
    • property assertions (object and data)
    • same and different individual assertions
  • Aliases for entity TYPE declarations
    • e.g. object property can be used instead of owl:ObjectProperty

@ontodev/robot-team This is an experimental update, so I would appreciate tests from anyone. The old template strings should produce the same results as before, even though the logic is different.

Property Template Strings

  • PROPERTY_TYPE: ROBOT creates a property for each row of data that has a TYPE of either an object or data property. The property type can be (any type followed by a * can ONLY be used for object properties):
    • logical types: these types link the created property to other properties (annotation properties can only use subproperty)
      • subproperty: the created property will be a subproperty of each templated property expression (default)
      • equivalent: the created property will be equivalent to all of the templated property expressions
      • disjoint: the created property will be disjoint from each templated property expression and the values cannot be the same
      • inverse*: the created object property will be the inverse of each templated property expression
    • property types: these types define the type of the created property, and will not work with annotation properties
      • functional: the created property will be functional, meaning each entity (subject) can have at most one value
      • inverse functional*: the created object property will be inverse functional, meaning each value can have at most one subject
      • irreflexive*: the created object property will be irreflexive, meaning the subject cannot also be the value
      • reflexive*: the created object property will be reflexive, meaning each subject is also a value
      • symmetric*: the created object property will be symmetric, meaning the subject and value can be reversed
      • asymmetric*: the created object property will be asymmetric, meaning the subject and value cannot be reversed
      • transitive*: the created object property will be transitive, meaning the property can be chained
  • P property expression: If the template string starts with a P and a space then it will be interpreted as a property expression. The value of the current cell will be substituted into the template, replacing all occurrences of the % character. Then the result will be parsed into an OWL property expression. ROBOT uses the same syntax for property expressions as Protégé: Manchester Syntax. If it does not recognize a name, ROBOT will assume that you're trying to refer to an entity by its IRI or CURIE. This can lead to unexpected behavior, but it allows you to refer to entities without loading them into the input ontology.
    • object properties: the only supported object property expression is the inverse object property expression. The template string is P inverse(%). A single object property for a value can be specified by P %.
    • data properties: data property expressions are not yet supported by OWL. A data property for a value (e.g. for a parent property) can be specified by P %.
    • annotation properties: annotation property expressions are not possible. An annotation property for a value (e.g. for a parent property) can be specified by P %.
  • DOMAIN: The domain of a property is a class expression in Manchester Syntax (for object and data properties). For annotation properties, the domain must be a single class specified by label, CURIE, or IRI.
  • RANGE: The range of a property is either a class expression in Manchester Syntax (for object properties) or the name, CURIE, or IRI of a datatype (for annotation and data properties).

Example of Property Template Strings

TYPE | PROPERTY_TYPE | P % | DOMAIN | RANGE
owl:ObjectProperty | subproperty | Property 1 | Class 1 | Class 2
owl:DataProperty | functional | Property 2 | Class 2 | xsd:string

The functional data property will still default to a subproperty logical axiom for the P % template string, unless a different logical property type (equivalent, disjoint) is provided. Property type can be split, e.g. PROPERTY_TYPE SPLIT=|.
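The substitution step described above can be sketched in a few lines. This is only an illustration, not ROBOT's actual Java implementation: the function name `expand_property_template` and its `split` parameter are hypothetical, and the sketch stops before the Manchester Syntax parsing step.

```python
from typing import List, Optional


def expand_property_template(template: str, cell: str,
                             split: Optional[str] = None) -> List[str]:
    """Substitute cell values for every '%' in a 'P ...' template string.

    Hypothetical helper: returns the expression strings that would then be
    handed to a Manchester Syntax parser.
    """
    if not template.startswith("P "):
        raise ValueError(f"not a property template string: {template!r}")
    expression = template[2:]  # drop the leading "P "
    # With SPLIT=<char>, one cell can carry several values.
    values = cell.split(split) if split else [cell]
    return [expression.replace("%", v.strip()) for v in values if v.strip()]


print(expand_property_template("P inverse(%)", "part_of"))
print(expand_property_template("P %", "part_of|has_part", split="|"))
```

Each returned string corresponds to one property expression, so a split cell yields one logical axiom per value.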

Individual Template Strings

  • INDIVIDUAL_TYPE: ROBOT creates an individual for each row of data that has a TYPE of another class. The individual type can be:
    • named: the created individual will be a default named individual. When the INDIVIDUAL_TYPE is left blank, this is the default. This should be used when adding object property or data property assertions
    • same: the created individual will be asserted to be the same individual as each templated individual in the row
    • different: the created individual will be asserted to be a different individual than any of the templated individuals in the row
  • I individual assertion:
    • I property %: when creating a named individual, replace property with an object property or data property to add assertions. The % will be replaced by the template cell value or values. For object property assertions, this is another individual. For data property assertions, this is a literal value.
    • I %: when creating a same or different individual, this template string is used to specify which individual will be the value of the same or different individual axiom.

Example of Individual Template Strings

INDIVIDUAL_TYPE | I 'Property 1' some % | I %
named | Individual 2 |
different | | Individual 1
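As a rough sketch of how the two kinds of I template strings differ, the following function turns one cell into a tuple describing the axiom it would produce. The function name, the tuple layout, and the example property `part_of` are assumptions made for illustration; ROBOT itself builds OWL API axioms.

```python
def individual_axiom(subject: str, individual_type: str,
                     template: str, cell: str) -> tuple:
    """Describe the axiom for one cell of an 'I ...' column (hypothetical)."""
    if not template.startswith("I ") or not template.endswith("%"):
        raise ValueError(f"not an individual template string: {template!r}")
    kind = individual_type.strip().lower() or "named"  # blank means named
    prop = template[2:-1].strip()  # text between "I " and "%"
    if prop:
        # "I property %": a property assertion on the created individual
        return ("PropertyAssertion", prop, subject, cell)
    # "I %": the cell names the other individual
    if kind == "same":
        return ("SameIndividual", subject, cell)
    if kind == "different":
        return ("DifferentIndividuals", subject, cell)
    raise ValueError("'I %' needs an INDIVIDUAL_TYPE of same or different")


print(individual_axiom("Individual 1", "", "I part_of %", "Individual 2"))
print(individual_axiom("Individual 2", "different", "I %", "Individual 1"))
```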

Datatype Template Strings

Datatypes can currently be added (TYPE = owl:Datatype), but I have not yet added support for datatype definitions. I'm still figuring out the best way to do this that wouldn't make the template strings too complex. There's nothing in the docs for this yet. Suggestions would be appreciated.

TYPE declarations

  • TYPE: this is the rdf:type for the row. Because ROBOT is focused on ontology development, the default value is owl:Class and this column is optional. When creating an OWLIndividual, specify the class to which it belongs in this column.
    • class or owl:Class
    • object property or owl:ObjectProperty
    • data property or owl:DataProperty
    • annotation property or owl:AnnotationProperty
    • datatype or owl:Datatype
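The alias handling for TYPE could look roughly like this. It is a sketch that assumes aliases are matched case-insensitively and that any other value is passed through unchanged as the class of an individual; neither detail is stated in this PR, and `normalize_type` is a hypothetical name.

```python
TYPE_ALIASES = {
    "class": "owl:Class",
    "object property": "owl:ObjectProperty",
    "data property": "owl:DataProperty",
    "annotation property": "owl:AnnotationProperty",
    "datatype": "owl:Datatype",
}


def normalize_type(value: str) -> str:
    """Map a TYPE cell to an rdf:type CURIE (hypothetical helper)."""
    value = value.strip()
    if not value:
        return "owl:Class"  # TYPE is optional and defaults to owl:Class
    # Unknown values are kept as-is: they name the class of an individual.
    return TYPE_ALIASES.get(value.lower(), value)


print(normalize_type("object property"))
print(normalize_type(""))
print(normalize_type("ex:Rat"))
```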

@beckyjackson beckyjackson changed the title Template Rework WIP: Template Rework Dec 28, 2018
Member

@jamesaoverton jamesaoverton left a comment

This code is much better than my original code.

Minor problems: comments inline.

Major problems:

  • japicmp says this breaks backwards compatibility
  • I think the >C and >P are just wrong. We should only care about the type of the current column (>A comment, >AI see also, >AT foo:count^^xsd:integer, etc.), and the type of the column to the left shouldn't matter. We need tests to cover this.

Please address these and then I'll take another look.

@beckyjackson
Contributor Author

>C and >A is how we've been doing it, but I can see how that doesn't make sense. Should I switch everything to >A?

@jamesaoverton
Member

I'd prefer to switch everything to >A*, since it's more flexible and makes more sense. I didn't catch this in previous versions -- my fault.

For backwards compatibility, I guess we could allow >C as an undocumented option, treating it as >A. Maybe we could do a string substitution early in the processing and log a warning.
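A minimal sketch of that early string substitution, assuming the template strings are available as plain strings before any other processing. The function name and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("template")  # hypothetical logger name


def normalize_axiom_annotation(template: str) -> str:
    """Rewrite the legacy '>C' prefix to '>A' and log a warning."""
    if template.startswith(">C"):
        replacement = ">A" + template[2:]
        logger.warning("'%s' is deprecated; treating it as '%s'",
                       template, replacement)
        return replacement
    return template


print(normalize_axiom_annotation(">C rdfs:comment"))
print(normalize_axiom_annotation(">AI rdfs:seeAlso"))
```

Doing the rewrite once, up front, means the rest of the processing only ever has to deal with the >A* forms.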

@beckyjackson
Contributor Author

beckyjackson commented Mar 15, 2019

As rows are parsed, exception messages will be logged as ERRORs, e.g.:

2019-03-15 09:59:00,005 ERROR org.obolibrary.robot.Template - MANCHESTER PARSE ERROR the expression 'is_specified_output_of some 'protein-protein interaction detection'' at row 171, column 5 cannot be parsed: encountered unknown 'protein-protein interaction detection'
2019-03-15 09:59:00,006 ERROR org.obolibrary.robot.Template - MANCHESTER PARSE ERROR the expression 'is_specified_output_of some 'extracellular electrophysiology recording'' at row 173, column 5 cannot be parsed: encountered unknown 'extracellular electrophysiology recording'

If there were any exception messages after parsing all rows, the operation will fail.

You can override this with --force true (default is false, so it will fail if there are any errors in the template). It will still attempt to create an ontology, even if there were parse errors. Those axioms will probably just not be included in the output.

The operation will fail immediately if there are any errors with the headers.

See #300

@jamesaoverton
Member

For many users, I think that it's easier to fail on the first error, then fix and try again. That was the old behaviour. It's better to start fixing errors at the top than the bottom, but by printing all errors to the terminal you'll see the last error first.

I can see how advanced users might want all errors at once, but that requires a more complex mental model of the operation.

@beckyjackson
Contributor Author

beckyjackson commented Mar 18, 2019

I can see how that can be confusing. I find it frustrating when I have a template and I need to keep running it after fixing one thing at a time.

Maybe the default behavior, instead of showing all errors, is to fail on the first error. Then we can add another option that lets you print all errors? I'm not sure what to call it though, some ideas (I don't particularly like any of these, though):

  • --log-all-errors true (default false)
  • --fail-first-error false (default true)
  • --fail-fast false (default true)

Or the behavior could just be if --force false then fail on the first error. If --force true, log all errors and attempt to create an ontology.
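The --force behavior proposed in that last paragraph can be sketched as follows. This is an illustration under assumptions, not the merged code: `process_rows`, `TemplateError`, and the `parse_row` callback are hypothetical names, and a `ValueError` stands in for a parse failure.

```python
class TemplateError(Exception):
    """Raised when a row fails to parse and force is off (hypothetical)."""


def process_rows(rows, parse_row, force=False):
    """Fail on the first bad row unless force is set; then collect errors."""
    results, errors = [], []
    for number, row in enumerate(rows, start=1):
        try:
            results.append(parse_row(row))
        except ValueError as exc:
            message = f"row {number}: {exc}"
            if not force:
                # --force false: stop at the first error, top of the file first
                raise TemplateError(message) from exc
            # --force true: record the error and keep going
            errors.append(message)
    return results, errors


# int() stands in for a real row parser in this example.
results, errors = process_rows(["1", "x", "3", "y"], int, force=True)
print(results)
print(len(errors))
```

With force off, the caller sees exactly one error, for the earliest failing row, which matches the fix-and-retry workflow described above.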

@jamesaoverton
Member

I like your final suggestion.

@beckyjackson
Contributor Author

beckyjackson commented May 6, 2019

@ontodev/robot-admin

While working on GAZ, I found that this new template rework handles very large template files MUCH better than the old template. It's faster, and doesn't fail if the file is too large.

I ran template on a 20.5MB template CSV with this new code, and it finished in 28 seconds. When I ran the same command using master, it failed after 52 minutes with java.lang.OutOfMemoryError: Java heap space.

I think it would be good to get this merged into master soon, assuming the code has no breaking changes. The command line interface should run exactly the same, and as far as I can tell, it is entirely backwards compatible with existing templates.

I had to make a change to the labels in order to get it backwards-compatible with the OBI templates...

It makes sense that you should be able to define a new class and use it by-label in the template. But, we changed the special label column to only work with LABEL, since people may use custom label properties. Therefore, if you're not using LABEL, you can't reference new classes. I think a good workaround is to assume that rdfs:label is the label property, but allow users to specify a different label property in the command:

If you're using A rdfs:label as a template string, you don't need to do anything and you can reference new labels in the template. This --label-property option will also set a different property in the QuotedEntityChecker so that all labels are picked up using the custom property.
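To illustrate the idea of resolving new entities by label, here is a sketch that builds a label-to-ID index from the template itself. It assumes labels come from a LABEL column or from an A column for the chosen label property; `build_label_index` and its parameters are hypothetical and do not reflect how QuotedEntityChecker is implemented.

```python
def build_label_index(template_row, data_rows, label_property="rdfs:label"):
    """Map each label in the template to its ID (hypothetical helper)."""
    id_col = template_row.index("ID")
    # Columns that supply labels: the special LABEL column, or an
    # annotation column for the configured label property.
    label_templates = {"LABEL", f"A {label_property}"}
    label_cols = [i for i, t in enumerate(template_row) if t in label_templates]
    index = {}
    for row in data_rows:
        for col in label_cols:
            label = row[col].strip()
            if label:
                index[label] = row[id_col]
    return index


template_row = ["ID", "A rdfs:label", "SC %"]
data_rows = [
    ["ex:1", "Class 1", ""],
    ["ex:2", "Class 2", "Class 1"],  # refers to the new class by label
]
print(build_label_index(template_row, data_rows))
```

Passing a different `label_property` picks up labels from a custom annotation column instead.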

Member

@jamesaoverton jamesaoverton left a comment

I've looked through the code and it's good. I made comments about a few minor things.

The NI option is not documented, and I'm not sure that I understand it.

The last big thing I would like is an example file / integration test that exercises a wide range of the new features. The inline tables in the Markdown file are good but they aren't integration tests. The current integration tests only show the old way of doing things. We want to keep them -- maybe they could be moved to unit tests. But we need at least one good new example that shows the new "right way" to work with ROBOT templates.

@beckyjackson
Contributor Author

> The NI option is not documented, and I'm not sure that I understand it.

This NI option isn't really necessary anymore. It's exactly the same as I so I'm going to remove it. Let me know if you think otherwise.

@jamesaoverton
Member

jamesaoverton commented Jun 18, 2019

It's fine to remove NI.

@beckyjackson
Contributor Author

OK - I made the requested changes and added a new integration test with the new template strings.

Member

@jamesaoverton jamesaoverton left a comment

Almost there. A couple more small things.

ID,A rdfs:label,A IAO:0000115,>A IAO:0000119,AI rdfs:seeAlso,A IAO:0000117,TYPE,SC %,DC %,DOMAIN,RANGE,I weight in kilograms
ex:F344N,F 344/N,An inbred strain of rat used in many scientific investigations.,James A. Overton,http://www.informatics.jax.org/external/festing/rat/docs/F344.shtml,James A. Overton,class,NCBITaxon:10116,,,,
ex:B6C3F1,B6C3F1,An inbred strain of mouse used in many scientific investigations.,James A. Overton,http://jaxmice.jax.org/strain/100010.html,James A. Overton,class,NCBITaxon:10090,F 344/N,F 344/N' or B6C3F1,,
Member

The third row is a class with domain "'F 344/N' or B6C3F1". That can't be right.
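A check for this kind of mistake could be sketched as below. The CSV is an abridged version of the rows under review (only the ID, TYPE, SC %, DOMAIN, and RANGE columns are kept), and `misplaced_property_cells` is a hypothetical helper, not something in ROBOT.

```python
import csv
import io

# Abridged from the template under review; other columns omitted.
TEMPLATE = """\
ID,TYPE,SC %,DOMAIN,RANGE
ex:F344N,class,NCBITaxon:10116,,
ex:B6C3F1,class,NCBITaxon:10090,F 344/N' or B6C3F1,
"""


def misplaced_property_cells(text):
    """Return (ID, column) pairs where a class row fills DOMAIN or RANGE."""
    reader = csv.reader(io.StringIO(text))
    templates = next(reader)  # the row of template strings
    problems = []
    for row in reader:
        if not row:
            continue
        record = dict(zip(templates, row))
        row_type = record.get("TYPE", "").strip().lower()
        if row_type in ("", "class", "owl:class"):
            for column in ("DOMAIN", "RANGE"):
                if record.get(column, "").strip():
                    problems.append((record["ID"], column))
    return problems


print(misplaced_property_cells(TEMPLATE))
```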

@@ -0,0 +1,5 @@
ID,Label,Definition,Definition Source,See Also,Editor,RDF Type,Class Type,Parent IRI
Member

Renaming this file is fine, but we need the test to run somewhere. Right now, I don't think it does.

@jamesaoverton jamesaoverton changed the title WIP: Template Rework Template Rework Jun 25, 2019
@jamesaoverton jamesaoverton merged commit a548540 into ontodev:master Jun 25, 2019
@jamesaoverton
Member

🎉 🎂 🎈 !!!
