Add export command #481

Merged
merged 24 commits into from
Mar 6, 2020

Contributor

@beckyjackson beckyjackson commented May 20, 2019

Export

Contents

  1. Formats
  2. Columns
  3. Including and Excluding Entities
  4. Rendering Cell Values
  5. Preparing the Ontology

ROBOT can export details about ontology entities as a table. At minimum, the export command expects an input ontology (--input), a set of column headers (--header), and a file to write to (--export):

robot export --input nucleus_part_of.owl \
  --header "ID|LABEL" \
  --export nucleus.csv

Formats

The following formats are currently supported:

  • tsv
  • csv
  • html

These can be specified with the --format option:

robot export --input nucleus_part_of.owl \
  --header "LABEL|SubClass Of" \
  --format html --export results/nucleus.html

If this option is not included, export will infer the format from the file extension of the --export path. If the extension does not match a supported format, it defaults to tsv.
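The fallback described above can be sketched as follows. This is a minimal illustration in Python, not ROBOT's own code (ROBOT is written in Java); the function name `infer_format` and the lower-casing of the file extension are assumptions made for the example.

```python
import os

# Formats listed in the documentation above.
SUPPORTED_FORMATS = {"tsv", "csv", "html"}


def infer_format(export_path, format_option=None):
    """Pick an export format (hypothetical helper, not part of ROBOT)."""
    # An explicit --format value always wins.
    if format_option is not None:
        return format_option
    # Otherwise look at the extension of the --export path.
    # Lower-casing the extension is an assumption of this sketch.
    extension = os.path.splitext(export_path)[1].lstrip(".").lower()
    # Unknown extensions fall back to tsv.
    return extension if extension in SUPPORTED_FORMATS else "tsv"
```

With this sketch, `infer_format("results/nucleus.html")` selects html, while a path such as `out.txt` falls back to tsv.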

The html format will output an HTML table with Bootstrap styling. All entities referenced will be rendered as clickable links.

Columns

The --header option is a pipe-separated list of special keywords or properties used in the ontology. The columns in the --header argument will exactly match the first line of the export file (the column headers).

Various --header types are supported:

  • Special Headers:
    • IRI: creates an "IRI" column based on the full unique identifier
    • ID: creates an "ID" column based on the short form of the unique identifier (CURIE)
    • LABEL: creates a "Label" column based on rdfs:label
    • SYNONYMS: creates a "SYNONYMS" column based on all synonyms (oboInOwl exact, broad, narrow, related, or IAO alternative term)
    • SubClass Of: creates a "SubClass Of" column based on rdfs:subClassOf
    • SubClasses: creates a "SubClasses" column based on direct children of a class
    • Equivalent Class: creates an "Equivalent Classes" column based on owl:equivalentClass
    • SubProperty Of: creates a "SubProperty Of" column based on rdfs:subPropertyOf
    • Equivalent Property: creates an "Equivalent Properties" column based on owl:equivalentProperty
    • Disjoint With: creates a "Disjoint With" column based on owl:disjointWith
    • Type: creates an "Instance Of" column based on rdf:type for named individuals
  • Property CURIES: you can always reference a property by the short form of the unique identifier (e.g. oboInOwl:hasDbXref). Any prefix used must be defined.
  • Property Labels: as long as a property label is defined in the input ontology, you can reference a property by label (e.g. database_cross_reference). This label will also be used as the column header.

The first header in the --header list is used to sort the rows of the export. You can change the column that is sorted on by including --sort <header>. This can either be one header, or a pipe-separated list of headers that will be sorted in-order:

robot export --input nucleus_part_of.owl \
  --header "ID|LABEL|SubClass Of" \
  --sort "LABEL|SubClass Of" \
  --export results/nucleus-sorted.csv

In the example above, the rows are first sorted on the LABEL field, and then sorted by SubClass Of. This means that entities with the same parent will be grouped in alphabetical order.

If the --sort header starts with ^, the column will be sorted in reverse order.

robot export --input nucleus_part_of.owl \
  --header "ID|LABEL|SubClass Of" \
  --sort "^LABEL" \
  --export results/nucleus-reversed.csv
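The multi-column sort with an optional ^ prefix can be illustrated with a short Python sketch. It assumes rows are dictionaries keyed by column header and that values compare as plain strings; how ROBOT actually compares cell values is not stated here, and `sort_rows` is a hypothetical name.

```python
def sort_rows(rows, sort_option):
    """Sort rows by a pipe-separated list of headers (illustrative only)."""
    keys = sort_option.split("|")
    # Python's sort is stable, so applying the keys from last to first
    # yields a sort on the first key with ties broken by later keys.
    for key in reversed(keys):
        reverse = key.startswith("^")       # ^ requests reverse order
        column = key[1:] if reverse else key
        rows = sorted(rows, key=lambda row: row[column], reverse=reverse)
    return rows
```

For example, `sort_rows(rows, "LABEL|SubClass Of")` orders by label first and by parent within equal labels, while `sort_rows(rows, "^LABEL")` reverses the label order.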

All special keyword columns will include both named OWL objects (named classes, properties, and individuals) and anonymous expressions (class expressions, property expressions). When using another object or data property, the values will include both individuals and class expressions (from subclass or equivalent statements) in Manchester syntax. When using an annotation property, the literal value will be returned.

By default, multiple values in a cell are separated with a pipe character (|). You can update this to anything you'd like with the --split option. For example, you could separate with commas:

robot export --input nucleus_part_of.owl \
  --header "LABEL|SubClass Of" \
  --split ", " \
  --export results/nucleus-split.csv

The output of any cell with multiple values is sorted in alphabetical order.
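As a small illustration of the two rules above (values joined with the --split string, and sorted first), here is a Python sketch. The helper name `render_cell` is hypothetical and plain string ordering is assumed.

```python
def render_cell(values, split="|"):
    """Join the values of one cell in alphabetical order (sketch)."""
    # Sorting before joining keeps the output stable between runs,
    # which avoids spurious diffs in exported tables.
    return split.join(sorted(values))
```

Calling `render_cell(["nuclear part", "intracellular organelle lumen"])` gives a pipe-separated cell, and passing `split=", "` mirrors the --split example.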

Including and Excluding Entities

By default, the export includes details on the classes and individuals in an ontology. Properties are excluded. You can configure which types of entities you wish to include with the --include <entity types> option. The <entity types> argument is a space-, comma-, or tab-separated list of one or more of the following entity types:

  • classes
  • individuals
  • properties

For example, to return the details of individuals only:

robot export --input template.owl \
  --header "ID|LABEL|Type" \
  --include "individuals" \
  --export results/individuals.csv

To return details of classes and properties:

robot export --input nucleus_part_of.owl \
  --header "ID|LABEL|SubClass Of|SubProperty Of" \
  --include "classes properties" \
  --export results/classes-properties.csv

The --include option does not need to be specified if you are getting details on individuals and classes. If you do specify an --include, it cannot be an empty string, as no entities will be included in the export.
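A minimal Python sketch of how an --include value could be interpreted, based only on the description above. The function name `parse_include` and the choice to raise an error for empty or unknown values are assumptions, not ROBOT's documented behavior.

```python
import re

ENTITY_TYPES = {"classes", "individuals", "properties"}
DEFAULT_INCLUDE = ["classes", "individuals"]  # properties are excluded by default


def parse_include(include_option=None):
    """Turn an --include string into a list of entity types (sketch)."""
    if include_option is None:
        return list(DEFAULT_INCLUDE)
    # The list may be separated by spaces, commas, or tabs.
    types = [t for t in re.split(r"[ ,\t]+", include_option.strip()) if t]
    if not types:
        # An empty string would select no entities at all.
        raise ValueError("--include cannot be empty")
    unknown = [t for t in types if t not in ENTITY_TYPES]
    if unknown:
        raise ValueError("unknown entity types: " + ", ".join(unknown))
    return types
```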

By default, the export includes anonymous expressions (subclasses, equivalent classes, property expressions). If you only wish to include named entities, add --exclude-anonymous true:

robot export --input nucleus_part_of.owl \
  --header "LABEL|SubClass Of|part of" \
  --exclude-anonymous true \
  --export results/nucleus.csv

Note that in the example above, the first two headers are special keywords and the third is the label of a property used in the ontology.

Rendering Cell Values

Entities used in cell values are rendered by one of four different strategies:

  • NAME - render the entity by label (if label does not exist, entity is rendered by CURIE)
  • ID - render the entity by short form ID/CURIE
  • IRI - render the entity by full IRI
  • LABEL - render the entity by label ONLY (if label does not exist, entity is rendered as an empty string)

By default, values are rendered with the NAME strategy. To update the strategy globally, you can use the --entity-format option and provide one of the above values:

robot export --input nucleus_part_of.owl \
  --header "ID|SubClass Of" \
  --entity-format ID \
  --exclude-anonymous true \
  --export results/nucleus-ids.csv

In the above example, all the "subclass of" values will be rendered by their short form ID.
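The four strategies listed above can be summarized in a short Python sketch. The function `render_entity` and its arguments are hypothetical; the fallback rules follow the list above (NAME falls back to the CURIE, LABEL falls back to an empty string).

```python
def render_entity(iri, curie, label, strategy="NAME"):
    """Render one entity according to a rendering strategy (sketch)."""
    strategy = strategy.upper()
    if strategy == "NAME":
        # Label when present, otherwise the short form ID.
        return label if label else curie
    if strategy == "ID":
        return curie
    if strategy == "IRI":
        return iri
    if strategy == "LABEL":
        # Label only: an unlabelled entity renders as an empty string.
        return label if label else ""
    raise ValueError("unknown rendering strategy: " + strategy)
```

For an unlabelled term, NAME and LABEL therefore differ: the first yields the CURIE, the second an empty cell.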

You can also specify different rendering strategies for different columns by including the strategy name in a square-bracket-enclosed tag after the column name:

robot export --input nucleus_part_of.owl \
  --header "LABEL|SubClass Of [ID]|SubClass Of [IRI]" \
  --exclude-anonymous true \
  --export results/nucleus-iris.csv

These tags should not be used with the following default columns: LABEL, ID, or IRI as they will not change the rendered values.
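To make the tag syntax concrete, the following Python sketch splits a --header value into column names and per-column strategies. `parse_header` is a hypothetical helper, and the exact whitespace handling around the square brackets is an assumption.

```python
import re

# A column name optionally followed by a strategy tag in square brackets.
TAG = re.compile(r"^(?P<name>.*?)\s*\[(?P<tag>NAME|ID|IRI|LABEL)\]$")


def parse_header(header_option, default="NAME"):
    """Return (column name, rendering strategy) pairs (sketch)."""
    columns = []
    for raw in header_option.split("|"):
        raw = raw.strip()
        match = TAG.match(raw)
        if match:
            columns.append((match.group("name"), match.group("tag")))
        else:
            # No tag: the global default (see --entity-format) applies.
            columns.append((raw, default))
    return columns
```

For the header used above, this yields `SubClass Of` twice, once with ID and once with IRI, alongside an untagged LABEL column.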

Preparing the Ontology

When exporting details on classes using object or data properties, we recommend running reason, relax, and reduce first. You can also create a subset of entities using remove or filter.

@jamesaoverton
Member

See #459

Member

@jamesaoverton jamesaoverton left a comment

This is a good start, but I'd like some significant changes. Then I'll review it again.

docs/export.md (outdated, resolved)
docs/export.md (resolved)
docs/export.md (outdated, resolved)
@jamesaoverton
Member

I think this export command is almost ready, but there are some things I would like feedback on:

  1. (minor) For reverse sort we're currently prefixing the column name with a *, but would ^ or something else be better? e.g. --sort "^LABEL"
  2. (major) Currently when the output is a class expression we're getting Manchester with labels but not quotes, e.g. has part some nucleus. I think this should be 'has part' some nucleus, which would match Protege and ROBOT template.
  3. (major) Should we plan to use export for formats other than tables, such as JSON? If so, is there anything we're doing now that we'll regret?

@jamesaoverton
Member

  1. If you ask export for part of, the result will be X but will not distinguish between part of some X and part of only X. I don't really like that, but that seems to be what was requested.

@jamesaoverton
Member

  1. The current code does not handle cardinality restrictions.

@beckyjackson
Contributor Author

Currently when the output is a class expression we're getting Manchester with labels but not quotes

Fixed, for example:

LABEL, SubClass Of
a, 'has part' some b

If you ask export for part of, the result will be X but will not distinguish between part of some X and part of only X

Updated to distinguish, for example:

LABEL, has part
a, some (b or c)

The current code does not handle cardinality restrictions.

Cardinality restrictions added, for example:

LABEL, has part
a, min 1 b

@cmungall
Contributor

minor: spaces in OWL constructs look odd to me. I would s/SubClass Of/SubClassOf/ and keep it identical with OWLAPI. I think we do this elsewhere too?

@cmungall
Contributor

For OPs, the default should always be some, and only showing named classes in the column

E.g.

X SubClassOf 'part of' some Y

CURIE,part of
X,Y

I can't think of any use case for showing only, or showing class expressions as values. Anyone who wants to work with OWL will work with OWL.

One exception may be for reversibility with templates. Maybe there could be a global option for this.

@cmungall
Contributor

A common idiom in TSVs is to stripe IDs and Labels. E.g.

Class: X_1
    Annotations: label "foo"
   SubClassOf 'part of' some X_2
Class: X_2
    Annotations: label "bar"

would be good to see

ID,label,part of,part of label
X:1,foo,X:2,bar
X:2,bar,,

Specifying this on a per OP basis could be tedious for the user. So I think we should have options that apply to any field that denotes an OWL object

how about:

 --add-labels <bool> if true, add label column for every OWL object. Append ' label' to column name.
--use-curies <bool> if true, any emitted OWL object is serialized as a CURIE
--use-iris <bool> if true, any emitted OWL object is serialized as an IRI. If both curies and iris are true, then emit both

@cmungall
Contributor

Property CURIES: you can always reference a property by the short form of the unique identifier (e.g. oboInOwl:hasDbXref)

I am responsible for this shortform thing, I wouldn't encourage it.

instead, how about allowing an OP to be specified by its rdfs:label?

@cmungall
Contributor

apologies for the scattergun comments, and for not noticing this has been open for a while. Overall this is really awesome and I'm looking forward to having this in. It's good to think hard about and get a range of opinions about things like class expressions in exported values. The communities I work with would want a big dumb denormalized table with no parsing required. Something that can be loaded directly into pandas or an R data frame.

@matentzn
Contributor

Pretty cool feature! In case this was not tested: "," or tabs in all fields should be escaped properly. I assume multiple labels are piped as well. I am not that interested in the class expression export being parseable, since I hope this feature is for documentation purposes only (and not for reverse injecting this with template back into ontology; very unsafe IMHO).

Will be incorporating this into ODK as soon as it is out!

@jamesaoverton
Member

The "striping" use case is a good addition. I think it's good for the --header in the request to be exactly the same in the resulting table. So for striping IDs and labels I would prefer something like ID,label,part of [ID],part of where you can specify IRI or CURIE/ID or LABEL in square brackets, and the default is LABEL. Even if it's more verbose, I prefer that to adding the options @cmungall suggested.

@cmungall Are you sure that you don't want the output to include expressions? What if including expressions was an option, turned off by default?

I was thinking about matching export to template but I've changed my mind. It's feature creep. We should have a separate "reverse template" and not try to overload this.

Yes, we should have better tests for escaping delimiters, and especially for escaping quotes. There are so many dumb edge cases that using a proper CSV library is probably worth it.
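To illustrate why a CSV library helps with these edge cases, here is a small sketch using Python's standard csv module. It only stands in for whichever library ROBOT ends up using in Java; `write_table` is a hypothetical helper.

```python
import csv
import io


def write_table(header, rows, delimiter=","):
    """Serialize a table, letting the csv module handle escaping (sketch)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    # Cells containing the delimiter or a quote are quoted automatically,
    # and embedded quotes are doubled.
    writer.writerows(rows)
    return buffer.getvalue()
```

A cell such as `a "quoted", comma value` survives a round trip through `csv.reader` unchanged, which is the kind of test case the comment above asks for.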

I'd still like to know if we plan to add JSON output to this command in the future, in which case we might want to change things now. Like maybe change --header to --fields?

@jamesaoverton jamesaoverton changed the title Add export command WIP: Add export command Jun 25, 2019
@cmungall
Contributor

cmungall commented Dec 22, 2019

Sorry for the delay in responding

@cmungall Are you sure that you don't want the output to include expressions? What if including expressions was an option, turned off by default?

Off by default is fine, but I am not really sure I see the use case for including these.

I have checked out the branch and running:

robot export --input nucleus.owl   --header "CURIE,LABEL,SubClass Of,part of"   --include "classes"  --exclude-anonymous true  --export /tmp/classes-properties.csv

which gives me:

CURIE,LABEL,SubClass Of,part of
...
GO:0031981,nuclear lumen,intracellular organelle lumen|nuclear part,some nucleus
...

So some nucleus isn't even a valid class expression, it's a portion of one. I can see the logic that joining the column name with column value makes the class expression, and that in some scenarios you might want to see universal restrictions, cardinality restrictions, etc. But anyone advanced enough to want these will be comfortable with working with the OWL.

But if it doesn't add complexity and you think there is a use case, I don't object to a non-default option to emit a partial class expression, so long as the default is to emit the Y in R some Y.

My list of proposed changes prior to merge:

  1. emit named classes by default for existential restrictions
  2. emit CURIEs, not just labels: Add export command #481 (comment)
  3. Allow default OBO CURIEs to be used in header in addition to labels or fragments. Currently this works as expected --header "CURIE,LABEL,SubClass Of,BFO_0000050,IAO_0000115" but not --header "CURIE,LABEL,SubClass Of,BFO:0000050,IAO:0000115"

@cmungall
Contributor

cmungall commented Dec 22, 2019

A minor annoyance:

robot export --input uberon_annotated.owl   --header "CURIE,LABEL,SubClass Of,BFO_0000050,IAO_0000115,hasDbXref"   --include "classes"  --exclude-anonymous true  --export /tmp/classes-properties.csv
2019-12-21 17:20:36,342 ERROR org.obolibrary.robot.ExportOperation - Missing literal for 'UBERON_0002530' hasDbXref
2019-12-21 17:20:36,343 ERROR org.obolibrary.robot.ExportOperation - Missing literal for 'UBERON_0002530' hasDbXref
2019-12-21 17:20:36,343 ERROR org.obolibrary.robot.ExportOperation - Missing literal for 'UBERON_0002530' hasDbXref
2019-12-21 17:20:36,343 ERROR org.obolibrary.robot.ExportOperation - Missing literal for 'UBERON_0002530' hasDbXref
....

The ERROR appears to be a false positive, because the file looks fine:

CURIE,LABEL,SubClass Of,BFO_0000050,IAO_0000115,hasDbXref
UBERON:0000062,organ,,some 'anatomical system',Anatomical structure that performs a specific function or group of functions [WP].,MA:0003001|OpenCyc:Mx4rv5XMb5wpEbGdrcN5Y29ycA|OpenCyc:Mx4rwP3iWpwpEbGdrcN5Y29ycA|EFO:0000634|EMAPA:35949|ENVO:01000162|WBbt:0003760|FMA:67498|UMLS:C0178784
...
UBERON:0002530,gland,organ,,an organ that functions as a secretory or excretory organ,FBbt:00100317|UMLS:C1285092|EHDAA:4475|BTO:0000522|EHDAA:6522|HAO:0000375|OpenCyc:Mx4rwP3vyJwpEbGdrcN5Y29ycA|WikipediaCategory:Glands|galen:Gland|EHDAA:2161|MAT:0000021|MA:0003038|FMA:86294|MIAA:0000021|AAO:0000212|EFO:0000797|AEO:0000096|EMAPA:18425|EHDAA2:0003096

@cmungall
Contributor

cmungall commented Dec 22, 2019

Gosh I'm just full of complaints amn't I... running this on hp.owl at the moment, seems very slow. Just writing this as note to self to do some profiling/optimization.

$ time robot export --input ~/repos/human-phenotype-ontology/hp.obo --header "CURIE,LABEL,SubClass Of,IAO_0000115,hasDbXref,comment"   --include "classes"  --exclude-anonymous true  --export /tmp/hp.csv

real    35m30.746s
user    31m41.848s
sys     0m20.502s

@beckyjackson
Contributor Author

emit named classes by default for existential restrictions

Would you prefer this looks like:

part of
nucleus

Or...

'part of' some
nucleus

emit CURIEs, not just labels

OK, I see your options in that comment. Is the default behavior to add these "label" columns?

Allow default OBO CURIEs to be used in header in addition to labels or fragments

Agreed that this should be allowed.

The ERROR appears to be a false positive

I'll take a look and see what's going on here. It's been a while since I've looked at this code 😅

@cmungall
Contributor

I strongly recommend the former, i.e.

part of
nucleus

OK, I see your options in that comment. Is the default behavior to add these "label" columns?

Yes, I think this makes sense as a default

@jamesaoverton
Member

@beckyjackson Please remove all those target="__blank" from the HTML renderer and the HTML test file.

@cmungall
Contributor

cmungall commented Mar 2, 2020

--help gives:

 -f,--format <arg>              output file format (TSV, CSV)

If I try -f TSV

UNKNOWN FORMAT ERROR 'TSV' is an unknown export format

However, tsv is allowed

Either make it case insensitive or change the help message

@cmungall
Contributor

cmungall commented Mar 2, 2020

Test:

robot export -f tsv --input go.owl --header "CURIE|LABEL|IAO_0000115|part of" --include "classes" --exclude-anonymous true --export /tmp/go.tsv

gives:

"CURIE" "LABEL" "IAO_0000115"   "part of"
"GO:0000001"    "mitochondrion inheritance"     """The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.""^^xsd:string"       ""

there should be no quoting in TSV

(the definition field is triple-double quoted!!!)

Also string literals should just emit the literal, not the xsd type or language. Again like with class expressions the general principle for tabular outputs is that values should be as atomic as possible.
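The "emit just the literal" behavior requested above can be sketched in Python as a post-processing step on an already rendered literal. This is only an illustration: `literal_value` is a hypothetical helper, it does not handle escaped quotes inside the value, and an implementation working directly with the ontology library would more likely read the lexical value from the literal object instead of parsing a string.

```python
import re

# A quoted value followed by an optional ^^datatype or @language suffix.
LITERAL = re.compile(
    r'^"(?P<value>.*)"(?:\^\^\S+|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?$',
    re.DOTALL,
)


def literal_value(rendered):
    """Strip quotes, datatype, and language tag from a rendered literal."""
    match = LITERAL.match(rendered)
    # Anything that does not look like a quoted literal is returned as-is.
    return match.group("value") if match else rendered
```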

@cmungall
Contributor

cmungall commented Mar 2, 2020

Also, still not emitting IDs:

"CURIE" "LABEL" "IAO_0000115"   "SubClass Of"   "part of"
"GO:0000001"    "mitochondrion inheritance"     """The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.""^^xsd:string"       "mitochondrion distribution|organelle inheritance"      ""

@jamesaoverton
Member

jamesaoverton commented Mar 2, 2020

Thanks @cmungall. @beckyjackson will work on these today. Keep them coming 😄

What IDs are you expecting in your previous comment?

@cmungall
Contributor

cmungall commented Mar 2, 2020

It also seems to be 'inferring' labels for unlabeled classes

E.g for merged classes:

   <owl:Class rdf:about="http://purl.obolibrary.org/obo/GO_0000004">
        <obo:IAO_0000231 rdf:resource="http://purl.obolibrary.org/obo/IAO_0000227"/>
        <obo:IAO_0100001 rdf:resource="http://purl.obolibrary.org/obo/GO_0008150"/>
        <owl:deprecated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</owl:deprecated>
    </owl:Class>

yields:

"CURIE" "LABEL" "IAO_0000115"   "SubClass Of"   "part of"
...
"GO:0000004"    "obo:GO_0000004"        ""      ""      ""

I'm guessing the URI is used as the label if not present (and a slightly different CURIE contracting algorithm...)

Here the field value should be blank/empty

@cmungall
Contributor

cmungall commented Mar 2, 2020

Request/proposal: use ID rather than CURIE

@beckyjackson
Contributor Author

beckyjackson commented Mar 2, 2020

@cmungall - if you're trying to get the IDs of terms, the [ID] tag needs to be included in the header, e.g. prop [ID]|prop will emit ID, label.

And I agree that we should use ID instead of CURIE, since that's what the tag is.

@jamesaoverton
Member

@cmungall We think we've addressed all your comments from this morning. Please try again.

@cmungall
Contributor

cmungall commented Mar 2, 2020

export.md says:

Finally, the export will include anonymous expressions (subclasses, equivalent classes, property expressions). If you only wish to include named entities, add --exclude-anonymous true:

In fact, exclude is true by default (which is my preferred default), so the docs should be changed

UPDATE I see in fact it is false by default. This is not my preference but I can live with this.

@cmungall
Contributor

cmungall commented Mar 2, 2020

if you're trying to get the IDs of terms, the [ID] tag needs to be included in the header, e.g. prop [ID]|prop will emit ID, label.

I see. What do we think of making the ID or IRI the default (e.g. SubClassOf) and making asking for the label the non-default (e.g. SubClassOf [LABEL])?

@cmungall
Contributor

cmungall commented Mar 2, 2020

Can we also remove tautologies by default? I don't think it's useful to see that root nodes and obsolete classes are subClasses of owl:Thing.

(you could argue that it is not completely content free - e.g. someone may have manually classified an incoherent class under Nothing, or we could be running export post-reason, but even here, the inclusion of the assertion is so arbitrary depending on a sequence of owlapi operations, it renders it useless for any purpose)

@cmungall
Contributor

cmungall commented Mar 2, 2020

Can I just say again this command is AWESOME

OK, I think we just have to make a decision on the following two things:

  • owl:Thing
  • defaults for IDs vs Labels

I can live with whichever decision is made either way but it's worth making a considered decision here

A few other minor things that can be punted to a future release so long as they are not considered compatibility breaking, just adding so they do not get forgotten:

  • uses of | delimiter in string/lang literals should be escaped. I have never seen this in the wild, but good to be safe.
  • consider use of a super property to group all 4 obo synonym scopes together
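One way the delimiter escaping mentioned above could work is sketched below in Python. The thread does not say which escaping scheme is intended, so the backslash convention here is purely an assumption, and both helper names are hypothetical.

```python
def escape_value(value, split="|"):
    """Escape the cell delimiter inside a single value (assumed scheme)."""
    # Escape backslashes first so the escaping stays unambiguous.
    return value.replace("\\", "\\\\").replace(split, "\\" + split)


def join_cell(values, split="|"):
    """Join escaped values into one multi-value cell, sorted for stability."""
    return split.join(sorted(escape_value(v, split) for v in values))
```

With this scheme, a literal `a|b` next to `c` is written as `a\|b|c`, so a consumer can still tell the two values apart.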

Docs (I can make these changes later):

  • docs could have more guidance on getting desired annotation assertions out (I can write this)
  • change Eqivalent to Equivalent
  • it looks as if multi-valued fields that are pipe-separated are first sorted prior to concatenation. This is good (non-spurious diffs). Worth explicitly documenting this

@jamesaoverton
Member

jamesaoverton commented Mar 3, 2020

Thanks for the detailed feedback @cmungall!

  1. tautologies: If you're just asking to ignore subclass of owl:Thing, then I guess I'm OK with that. I don't want to build anything fancier into export, since we already have remove --axioms tautologies and remove --axioms structural-tautologies. Most exports will require some sort of pre-processing anyway.

  2. labels: My strong preference is to use labels when we have them, otherwise an ID, failing that an IRI. Maybe we can call that strategy NAME? (I couldn't think of anything better.) So NAME would be the default strategy, and we can add an option to set that default --entity-format [NAME/LABEL/ID/IRI], and you can override the default in your column header e.g. SubClass Of [ID].

  3. synonyms: I like the idea of a special case SYNONYMS (or OBO SYNONYMS?) to handle all the OBO synonym types: IAO alternative term, OIO exact/narrow/broad synonym. Is that what you mean?

To Do before release:

  • ignore subclass of owl:Thing
  • add --entity-format option (case insensitive), rename LABEL-ONLY to LABEL in the code
  • switch sorting from * to ^
  • escape | in values
  • fix typo "Eqivalent"
  • make CURIE a synonym of ID
  • add synonyms without spaces (case insensitive) for these: SubClassOf, EquivalentClass, SubPropertyOf, EquivalentProperty, DisjointWith
  • add special case for synonyms
  • document that multi-value cells are sorted
  • drop resource= attribute from HTML (that's for RDFa, later)
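The header synonyms in the list above (forms without spaces, and CURIE for ID) could be resolved with a lookup like the following Python sketch. `normalize_header` is a hypothetical name, and treating every special keyword case-insensitively is an assumption that goes slightly beyond what the list states.

```python
SPECIAL_HEADERS = [
    "IRI", "ID", "LABEL", "SYNONYMS", "SubClass Of", "SubClasses",
    "Equivalent Class", "SubProperty Of", "Equivalent Property",
    "Disjoint With", "Type",
]


def _key(name):
    # Ignore spaces and case when comparing header keywords.
    return name.replace(" ", "").lower()


LOOKUP = {_key(header): header for header in SPECIAL_HEADERS}
LOOKUP["curie"] = "ID"  # CURIE is accepted as a synonym of ID


def normalize_header(name):
    """Map a header synonym to its canonical keyword (sketch)."""
    # Anything that is not a special keyword (e.g. a property label)
    # is returned unchanged.
    return LOOKUP.get(_key(name), name)
```

This only affects how a column is looked up; the column title written to the export would still be whatever the user typed in --header.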

@beckyjackson
Contributor Author

I like the idea of renaming LABEL to NAME.

The only use case I see for LABEL (currently LABEL-ONLY) is for the label column itself, though. If somebody is using the tag LABEL in another column, like the subclass of column, and the parent doesn't have a label, then nothing would show up. Would you still want to proceed with this behavior?

@jamesaoverton
Member

Good question about missing LABELs. I think I'm ok with unlabelled things disappearing, as long as the documentation is clear. But I'm not completely sure... If we ask for LABEL and an unlabelled term is part of a class expression like "foo subclass of ID:123", then what would we see?

@beckyjackson
Contributor Author

We would see

foo subclass of 

@beckyjackson
Contributor Author

All checklist items above have been addressed. I updated the documentation in the first comment to reflect all new behavior.

@jamesaoverton
Member

My plan is to merge this now, then ask obo-tools for more feedback before releasing 1.7.0.

@cmungall Does that sound good?

@jamesaoverton jamesaoverton changed the title WIP: Add export command Add export command Mar 4, 2020
@beckyjackson beckyjackson mentioned this pull request Mar 4, 2020
5 tasks
@jamesaoverton jamesaoverton merged commit e687020 into ontodev:master Mar 6, 2020
@cmungall
Contributor

cmungall commented Mar 6, 2020

I am testing master now, so far it all looks fantastic, thanks!


4 participants