Add export command #481

beckyjackson · 2019-05-20T15:50:35Z

Export

Special Headers:
- IRI: creates an "IRI" column based on the full unique identifier
- ID: creates an "ID" column based on the short form of the unique identifier (CURIE)
- LABEL: creates a "Label" column based on rdfs:label
- SYNONYMS: creates a "SYNONYMS" column based on all synonyms (oboInOwl exact, broad, narrow, related, or IAO alternative term)
- SubClass Of: creates a "SubClass Of" column based on rdfs:subClassOf
- SubClasses: creates a "SubClasses" column based on direct children of a class
- Equivalent Class: creates an "Equivalent Classes" column based on owl:equivalentClass
- SubProperty Of: creates a "SubProperty Of" column based on rdfs:subPropertyOf
- Equivalent Property: creates an "Equivalent Properties" column based on owl:equivalentProperty
- Disjoint With: creates a "Disjoint With" column based on owl:disjointWith
- Type: creates an "Instance Of" column based on rdf:type for named individuals
Property CURIES: you can always reference a property by the short form of the unique identifier (e.g. oboInOwl:hasDbXref). Any prefix used must be defined.
Property Labels: as long as a property label is defined in the input ontology, you can reference a property by label (e.g. database_cross_reference). This label will also be used as the column header.

The first header in the --header list is used to sort the rows of the export. You can change the column that is sorted on by including --sort <header>. This can either be one header, or a pipe-separated list of headers that will be sorted in-order:

robot export --input nucleus_part_of.owl \
  --header "ID|LABEL|SubClass Of" \
  --sort "LABEL|SubClass Of" \
  --export results/nucleus-sorted.csv

In the example above, the rows are first sorted on the LABEL field, and then sorted by SubClass Of. This means that entities with the same parent will be grouped in alphabetical order.

If the --sort header starts with ^, the column will be sorted in reverse order.

robot export --input nucleus_part_of.owl \
  --header "ID|LABEL|SubClass Of" \
  --sort "^LABEL" \
  --export results/nucleus-reversed.csv
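
The sorting rules described above (a pipe-separated list of sort headers, with ^ marking a reversed column) can be sketched in Python. This is only an illustration of the documented behavior, not ROBOT's implementation: sort_rows and the sample rows are hypothetical, and rows are modeled as plain dicts keyed by header.

```python
def sort_rows(rows, sort_spec):
    """Sort dict rows by a pipe-separated list of headers.

    A header prefixed with "^" is sorted in reverse order.
    (Hypothetical helper; not part of ROBOT.)
    """
    keys = sort_spec.split("|")
    result = list(rows)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last makes the first header the primary sort key.
    for key in reversed(keys):
        reverse = key.startswith("^")
        column = key[1:] if reverse else key
        result.sort(key=lambda row: row.get(column, ""), reverse=reverse)
    return result


# Made-up sample rows for illustration only.
rows = [
    {"LABEL": "nucleus", "SubClass Of": "organelle"},
    {"LABEL": "cell", "SubClass Of": "material entity"},
    {"LABEL": "nuclear lumen", "SubClass Of": "organelle lumen"},
]

print([r["LABEL"] for r in sort_rows(rows, "LABEL|SubClass Of")])
# ['cell', 'nuclear lumen', 'nucleus']
print([r["LABEL"] for r in sort_rows(rows, "^LABEL")])
# ['nucleus', 'nuclear lumen', 'cell']
```

The sketch compares plain strings, so ordering is case-sensitive; whether ROBOT compares case-insensitively is not stated here.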

All special keyword columns will include both named OWL objects (named classes, properties, and individuals) and anonymous expressions (class expressions, property expressions). When using another object or data property, the values will include both individuals and class expressions (from subclass or equivalent statements) in Manchester syntax. When using an annotation property, the literal value will be returned.

By default, multiple values in a cell are separated with a pipe character (|). You can update this to anything you'd like with the --split option. For example, you could separate with commas:

robot export --input nucleus_part_of.owl \
  --header "NAME|SubClass Of" --split ", "

The output of any cell with multiple values is sorted in alphabetical order.
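
As a rough sketch of how a multi-valued cell is built and read back, assuming the behavior described above (values sorted alphabetically, then joined with the --split separator): join_values and split_cell are hypothetical helper names, and the sort here is Python's default string ordering, which may differ in detail from ROBOT's.

```python
def join_values(values, split="|"):
    """Build one cell: sort the values alphabetically, then join them."""
    return split.join(sorted(values))


def split_cell(cell, split="|"):
    """Recover the individual values from an exported cell."""
    return cell.split(split) if cell else []


parents = ["nuclear part", "intracellular organelle lumen"]

print(join_values(parents))
# intracellular organelle lumen|nuclear part
print(join_values(parents, split=", "))
# intracellular organelle lumen, nuclear part
print(split_cell("intracellular organelle lumen|nuclear part"))
# ['intracellular organelle lumen', 'nuclear part']
```

Note that a separator such as ", " only round-trips cleanly when no individual value contains it.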

Including and Excluding Entities

By default, the export includes details on the classes and individuals in an ontology. Properties are excluded. You can configure which types of entities you wish to include with the --include <entity types> option. The <entity types> argument is a space-, comma-, or tab-separated list of one or more of the following entity types:

classes
individuals
properties

For example, to return the details of individuals only:

robot export --input template.owl \
  --header "ID|LABEL|Type" \
  --include "individuals" \
  --export results/individuals.csv

To return details of classes and properties:

robot export --input nucleus_part_of.owl \
  --header "ID|LABEL|SubClass Of|SubProperty Of" \
  --include "classes properties" \
  --export results/classes-properties.csv

The --include option does not need to be specified if you are getting details on individuals and classes. If you do specify an --include, it cannot be an empty string, as no entities will be included in the export.
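
The --include argument format described above (a space-, comma-, or tab-separated list that must not be empty) could be parsed along these lines. This is a sketch under stated assumptions, not ROBOT's code: parse_include is a hypothetical name, and rejecting unknown entity types is an assumption this thread does not confirm.

```python
import re

ENTITY_TYPES = {"classes", "individuals", "properties"}


def parse_include(value):
    """Parse an --include style string into a set of entity types."""
    # Split on any run of spaces, commas, or tabs and drop empty pieces.
    types = [t for t in re.split(r"[ ,\t]+", value.strip()) if t]
    if not types:
        raise ValueError("include cannot be empty: no entities would be exported")
    unknown = set(types) - ENTITY_TYPES
    if unknown:
        # Assumption: unknown types are treated as an error.
        raise ValueError("unknown entity type(s): " + ", ".join(sorted(unknown)))
    return set(types)


print(sorted(parse_include("classes properties")))
# ['classes', 'properties']
print(sorted(parse_include("classes,individuals")))
# ['classes', 'individuals']
```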

Finally, the export will include anonymous expressions (subclasses, equivalent classes, property expressions). If you only wish to include named entities, add --exclude-anonymous true:

robot export --input nucleus_part_of.owl \
  --header "LABEL|SubClass Of|part of" \
  --exclude-anonymous true \
  --export results/nucleus.csv

Note that in the example above, the first two headers are special keywords and the third is the label of a property used in the ontology.

Rendering Cell Values

Entities used in cell values are rendered by one of four different strategies:

NAME - render the entity by label (if label does not exist, entity is rendered by CURIE)
ID - render the entity by short form ID/CURIE
IRI - render the entity by full IRI
LABEL - render the entity by label ONLY (if label does not exist, entity is rendered as an empty string)

By default, values are rendered with the NAME strategy. To update the strategy globally, you can use the --entity-format option and provide one of the above values:

robot export --input nucleus_part_of.owl \
  --header "ID|SubClass Of" \
  --entity-format ID \
  --exclude-anonymous true \
  --export results/nucleus-ids.csv

In the above example, all the "subclass of" values will be rendered by their short form ID.
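
The four rendering strategies listed above differ mainly in how they treat a missing label. A minimal Python sketch of that fallback logic follows; the render function and the dict-based entity shape are hypothetical and only mirror the descriptions in this section.

```python
def render(entity, strategy="NAME"):
    """Render an entity dict with keys 'iri', 'curie', and optional 'label'."""
    label = entity.get("label")
    if strategy == "NAME":
        # Label if present, otherwise fall back to the CURIE.
        return label or entity["curie"]
    if strategy == "ID":
        return entity["curie"]
    if strategy == "IRI":
        return entity["iri"]
    if strategy == "LABEL":
        # Label only: an unlabeled entity becomes an empty string.
        return label or ""
    raise ValueError("unknown rendering strategy: " + strategy)


labeled = {
    "iri": "http://purl.obolibrary.org/obo/GO_0031981",
    "curie": "GO:0031981",
    "label": "nuclear lumen",
}
unlabeled = {
    "iri": "http://purl.obolibrary.org/obo/GO_0000004",
    "curie": "GO:0000004",
    "label": None,
}

print(render(labeled))             # nuclear lumen
print(render(unlabeled))           # GO:0000004
print(repr(render(unlabeled, "LABEL")))  # ''
```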

You can also specify different rendering strategies for different columns by including the strategy name in a square-bracket-enclosed tag after the column name:

robot export --input nucleus_part_of.owl \
  --header "LABEL|SubClass Of [ID]|SubClass Of [IRI]" \
  --exclude-anonymous true \
  --export results/nucleus-iris.csv

These tags should not be used with the following default columns: LABEL, ID, or IRI as they will not change the rendered values.
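
To make the square-bracket tag syntax concrete, here is a small Python sketch that splits a header such as "SubClass Of [ID]" into a column name and a rendering strategy. parse_header is a hypothetical helper, and falling back to the default strategy when the tag is missing or unrecognized is an assumption.

```python
import re

STRATEGIES = {"NAME", "ID", "IRI", "LABEL"}
# Column name, optional whitespace, then a trailing [TAG].
TAG = re.compile(r"^(?P<column>.*?)\s*\[(?P<tag>[A-Z]+)\]$")


def parse_header(header, default="NAME"):
    """Return (column, strategy) for one entry of the header list."""
    header = header.strip()
    match = TAG.match(header)
    if match and match.group("tag") in STRATEGIES:
        return match.group("column"), match.group("tag")
    return header, default


spec = "LABEL|SubClass Of [ID]|SubClass Of [IRI]"
print([parse_header(h) for h in spec.split("|")])
# [('LABEL', 'NAME'), ('SubClass Of', 'ID'), ('SubClass Of', 'IRI')]
```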

Preparing the Ontology

When exporting details on classes using object or data properties, we recommend running reason, relax, and reduce first. You can also create a subset of entities using remove or filter.

jamesaoverton · 2019-06-05T18:16:53Z

See #459

jamesaoverton

This is a good start, but I'd like some significant changes. Then I'll review it again.

robot-core/src/main/java/org/obolibrary/robot/ExportOperation.java

docs/export.md

jamesaoverton · 2019-06-12T14:46:23Z

I think this export command is almost ready, but there are some things I would like feedback on:

(minor) For reverse sort we're currently prefixing the column name with a *, but would ^ or something else be better? e.g. --sort "^LABEL"
(major) Currently when the output is a class expression we're getting Manchester with labels but not quotes, e.g. has part some nucleus. I think this should be 'has part' some nucleus, which would match Protege and ROBOT template.
(major) Should we plan to use export for formats other than tables, such as JSON? If so, is there anything we're doing now that we'll regret?

jamesaoverton · 2019-06-12T14:59:36Z

If you ask export for part of, the result will be X but will not distinguish between part of some X and part of only X. I don't really like that, but that seems to be what was requested.

jamesaoverton · 2019-06-12T15:00:35Z

The current code does not handle cardinality restrictions.

beckyjackson · 2019-06-12T16:43:08Z

Currently when the output is a class expression we're getting Manchester with labels but not quotes

Fixed, for example:

LABEL, SubClass Of
a, 'has part' some b

If you ask export for part of, the result will be X but will not distinguish between part of some X and part of only X

Updated to distinguish, for example:

LABEL, has part
a, some (b or c)

The current code does not handle cardinality restrictions.

Cardinality restrictions added, for example:

LABEL, has part
a, min 1 b

cmungall · 2019-06-13T00:48:18Z

minor: spaces in OWL constructs look odd to me. I would s/SubClass Of/SubClassOf/ and keep it identical with OWLAPI. I think we do this elsewhere too?

cmungall · 2019-06-13T02:21:34Z

For OPs, the default should always be some, and only showing named classes in the column

E.g.

X SubClassOf 'part of' some Y

CURIE,part of
X,Y

I can't think of any use case for showing only, or showing class expressions as values. Anyone who wants to work with OWL will work with OWL.

One exception may be for reversibility with templates. Maybe there could be a global option for this.

cmungall · 2019-06-13T02:30:11Z

A common idiom in TSVs is to stripe IDs and Labels. E.g.

Class: X_1
    Annotations: label "foo"
   SubClassOf 'part of' some X_2
Class: X_2
    Annotations: label "bar"

would be good to see

ID,label,part of,part of label
X:1,foo,X:2,bar
X:2,bar,,

Specifying this on a per OP basis could be tedious for the user. So I think we should have options that apply to any field that denotes an OWL object

how about:

--add-labels <bool> if true, add a label column for every OWL object. Append ' label' to the column name.
--use-curies <bool> if true, any emitted OWL object is serialized as a CURIE
--use-iris <bool> if true, any emitted OWL object is serialized as an IRI. If both curies and iris are true, then emit both

cmungall · 2019-06-13T02:32:49Z

Property CURIES: you can always reference a property by the short form of the unique identifier (e.g. oboInOwl:hasDbXref)

I am responsible for this shortform thing, I wouldn't encourage it.

instead, how about allowing an OP to be specified by its rdfs:label?

cmungall · 2019-06-13T02:35:36Z

Apologies for the scattergun comments, and for not noticing this has been open for a while. Overall this is really awesome and I'm looking forward to having this in. It's good to think hard about and get a range of opinions about things like class expressions in exported values. The communities I work with would want a big dumb denormalized table with no parsing required: something that can be loaded directly into pandas or an R dataframe.

matentzn · 2019-06-13T06:56:39Z

Pretty cool feature! In case this was not tested: "," or tabs in all fields should be escaped properly. I assume multiple labels are piped as well. I am not that interested in the class expression export being parseable, since I hope this feature is for documentation purposes only (and not for reverse injecting this with template back into ontology; very unsafe IMHO).

Will be incorporating this into ODK as soon as it is out!

jamesaoverton · 2019-06-13T13:49:07Z

The "striping" use case is a good addition. I think it's good for the --header in the request to be exactly the same in the resulting table. So for striping IDs and labels I would prefer something like ID,label,part of [ID],part of where you can specify IRI or CURIE/ID or LABEL in square brackets, and the default is LABEL. Even if it's more verbose, I prefer that to adding the options @cmungall suggested.

@cmungall Are you sure that you don't want the output to include expressions? What if including expressions was an option, turned off by default?

I was thinking about matching export to template but I've changed my mind. It's feature creep. We should have a separate "reverse template" and not try to overload this.

Yes, we should have better tests for escaping delimiters, and especially for escaping quotes. There are so many dumb edge cases that using a proper CSV library is probably worth it.
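
To illustrate the escaping edge cases mentioned here, the following sketch uses Python's standard csv module (not the Java library this PR would use) to round-trip a cell that contains the delimiter and double quotes. write_table and read_table are hypothetical helper names.

```python
import csv
import io


def write_table(rows, delimiter=","):
    """Serialize rows, letting the csv module quote cells only when needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def read_table(text, delimiter=","):
    """Parse text produced by write_table back into rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows = [["ID", "LABEL"], ["X:1", 'foo, "bar"']]
text = write_table(rows)
print(text)
# ID,LABEL
# X:1,"foo, ""bar"""

# The tricky cell survives a write/read round trip unchanged.
assert read_table(text) == rows
```

The same helpers work for tab-separated output by passing delimiter="\t"; only cells that contain the delimiter, a quote, or a line break are quoted.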

I'd still like to know if we plan to add JSON output to this command in the future, in which case we might want to change things now. Like maybe change --header to --fields?

cmungall · 2019-12-22T01:19:47Z

Sorry for the delay in responding

@cmungall Are you sure that you don't want the output to include expressions? What if including expressions was an option, turned off by default?

Off by default is fine, but I am not really sure I see the use case for including these.

I have checked out the branch and am running:

robot export --input nucleus.owl   --header "CURIE,LABEL,SubClass Of,part of"   --include "classes"  --exclude-anonymous true  --export /tmp/classes-properties.csv

which gives me:

CURIE,LABEL,SubClass Of,part of
...
GO:0031981,nuclear lumen,intracellular organelle lumen|nuclear part,some nucleus
...

So some nucleus isn't even a valid class expression, it's a portion of one. I can see the logic that joining the column name with column value makes the class expression, and that in some scenarios you might want to see universal restrictions, cardinality restrictions, etc. But anyone advanced enough to want these will be comfortable with working with the OWL.

But if it doesn't add complexity and you think there is a use case, I don't object to a non-default option to emit a partial class expression, so long as the default is to emit the Y in R some Y.

My list of proposed changes prior to merge:

emit named classes by default for existential restrictions
emit CURIEs, not just labels (see the earlier comment on #481)
Allow default OBO CURIEs to be used in header in addition to labels or fragments. Currently this works as expected --header "CURIE,LABEL,SubClass Of,BFO_0000050,IAO_0000115" but not --header "CURIE,LABEL,SubClass Of,BFO:0000050,IAO:0000115"

cmungall · 2019-12-22T01:22:16Z

A minor annoyance:

robot export --input uberon_annotated.owl   --header "CURIE,LABEL,SubClass Of,BFO_0000050,IAO_0000115,hasDbXref"   --include "classes"  --exclude-anonymous true  --export /tmp/classes-properties.csv
2019-12-21 17:20:36,342 ERROR org.obolibrary.robot.ExportOperation - Missing literal for 'UBERON_0002530' hasDbXref
2019-12-21 17:20:36,343 ERROR org.obolibrary.robot.ExportOperation - Missing literal for 'UBERON_0002530' hasDbXref
2019-12-21 17:20:36,343 ERROR org.obolibrary.robot.ExportOperation - Missing literal for 'UBERON_0002530' hasDbXref
2019-12-21 17:20:36,343 ERROR org.obolibrary.robot.ExportOperation - Missing literal for 'UBERON_0002530' hasDbXref
....

The ERROR appears to be a false positive, because the file looks fine:

CURIE,LABEL,SubClass Of,BFO_0000050,IAO_0000115,hasDbXref
UBERON:0000062,organ,,some 'anatomical system',Anatomical structure that performs a specific function or group of functions [WP].,MA:0003001|OpenCyc:Mx4rv5XMb5wpEbGdrcN5Y29ycA|OpenCyc:Mx4rwP3iWpwpEbGdrcN5Y29ycA|EFO:0000634|EMAPA:35949|ENVO:01000162|WBbt:0003760|FMA:67498|UMLS:C0178784
...
UBERON:0002530,gland,organ,,an organ that functions as a secretory or excretory organ,FBbt:00100317|UMLS:C1285092|EHDAA:4475|BTO:0000522|EHDAA:6522|HAO:0000375|OpenCyc:Mx4rwP3vyJwpEbGdrcN5Y29ycA|WikipediaCategory:Glands|galen:Gland|EHDAA:2161|MAT:0000021|MA:0003038|FMA:86294|MIAA:0000021|AAO:0000212|EFO:0000797|AEO:0000096|EMAPA:18425|EHDAA2:0003096

cmungall · 2019-12-22T01:48:52Z

Gosh I'm just full of complaints amn't I... running this on hp.owl at the moment, seems very slow. Just writing this as note to self to do some profiling/optimization.

$ time robot export --input ~/repos/human-phenotype-ontology/hp.obo --header "CURIE,LABEL,SubClass Of,IAO_0000115,hasDbXref,comment"   --include "classes"  --exclude-anonymous true  --export /tmp/hp.csv

real    35m30.746s
user    31m41.848s
sys     0m20.502s

beckyjackson · 2019-12-23T18:20:30Z

emit named classes by default for existential restrictions

Would you prefer this looks like:

part of
nucleus

Or...

'part of' some
nucleus

emit CURIEs, not just labels

OK, I see your options in that comment. Is the default behavior to add these "label" columns?

Allow default OBO CURIEs to be used in header in addition to labels or fragments

Agreed that this should be allowed.

The ERROR appears to be a false positive

I'll take a look and see what's going on here. It's been awhile since I've looked at this code 😅

cmungall · 2020-02-13T18:47:12Z

I strongly recommend the former, i.e.

part of
nucleus

OK, I see your options in that comment. Is the default behavior to add these "label" columns?

Yes, I think this makes sense as a default

robot-core/src/main/java/org/obolibrary/robot/ExportOperation.java

jamesaoverton · 2020-02-28T18:59:39Z

@beckyjackson Please remove all those target="__blank" from the HTML renderer and the HTML test file.

cmungall · 2020-03-02T17:14:07Z

--help gives:

 -f,--format <arg>              output file format (TSV, CSV)

If I try -f TSV

UNKNOWN FORMAT ERROR 'TSV' is an unknown export format

However, tsv is allowed

Either make it case insensitive or change the help message

cmungall · 2020-03-02T17:16:19Z

Test:

robot export -f tsv --input go.owl --header "CURIE|LABEL|IAO_0000115|part of" --include "classes" --exclude-anonymous true --export /tmp/go.tsv

gives:

"CURIE" "LABEL" "IAO_0000115"   "part of"
"GO:0000001"    "mitochondrion inheritance"     """The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.""^^xsd:string"       ""

there should be no quoting in TSV

(the definition field is triple-double quoted!!!)

Also string literals should just emit the literal, not the xsd type or language. Again like with class expressions the general principle for tabular outputs is that values should be as atomic as possible.
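
The "emit just the literal" request can be sketched as a small Python function that strips a datatype or language suffix from a quoted literal. lexical_value is a hypothetical name, the regular expression covers only the simple forms shown in this comment (no escape sequences), and this is not how ROBOT itself handles literals.

```python
import re

# A quoted lexical form followed by an optional ^^datatype or @lang suffix.
LITERAL = re.compile(
    r'^"(?P<lex>.*)"(?:\^\^\S+|@[A-Za-z0-9-]+)?$',
    re.DOTALL,
)


def lexical_value(rendered):
    """Return only the lexical form of a rendered literal."""
    match = LITERAL.match(rendered)
    # Values that are not quoted literals are returned unchanged.
    return match.group("lex") if match else rendered


print(lexical_value('"mitochondrion inheritance"^^xsd:string'))
# mitochondrion inheritance
print(lexical_value('"nucleus"@en'))
# nucleus
print(lexical_value("GO:0000001"))
# GO:0000001
```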

cmungall · 2020-03-02T17:23:31Z

Also, still not emitting IDs:

"CURIE" "LABEL" "IAO_0000115"   "SubClass Of"   "part of"
"GO:0000001"    "mitochondrion inheritance"     """The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.""^^xsd:string"       "mitochondrion distribution|organelle inheritance"      ""

jamesaoverton · 2020-03-02T17:33:55Z

Thanks @cmungall. @beckyjackson will work on these today. Keep them coming 😄

What IDs are you expecting in your previous comment?

cmungall · 2020-03-02T17:38:46Z

It also seems to be 'inferring' labels for unlabeled classes

E.g for merged classes:

   <owl:Class rdf:about="http://purl.obolibrary.org/obo/GO_0000004">
        <obo:IAO_0000231 rdf:resource="http://purl.obolibrary.org/obo/IAO_0000227"/>
        <obo:IAO_0100001 rdf:resource="http://purl.obolibrary.org/obo/GO_0008150"/>
        <owl:deprecated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</owl:deprecated>
    </owl:Class>

yields:

"CURIE" "LABEL" "IAO_0000115"   "SubClass Of"   "part of"
...
"GO:0000004"    "obo:GO_0000004"        ""      ""      ""

I'm guessing the URI is used as the label if not present (and a slightly different CURIE contracting algorithm...)

Here the field value should be blank/empty

cmungall · 2020-03-02T17:39:02Z

Request/proposal: use ID rather than CURIE

beckyjackson · 2020-03-02T17:39:27Z

@cmungall - if you're trying to get the IDs of terms, the [ID] tag needs to be included in the header, e.g. prop [ID]|prop will emit ID, label.

And I agree that we should use ID instead of CURIE, since that's what the tag is.

jamesaoverton · 2020-03-02T20:31:35Z

@cmungall We think we've addressed all your comments from this morning. Please try again.

cmungall · 2020-03-02T22:00:15Z

export.md says:

Finally, the export will include anonymous expressions (subclasses, equivalent classes, property expressions). If you only wish to include named entities, add --exclude-anonymous true:

In fact, exclude is true by default (which is my preferred default), so the docs should be changed

UPDATE I see in fact it is false by default. This is not my preference but I can live with this.

cmungall · 2020-03-02T22:07:57Z

if you're trying to get the IDs of terms, the [ID] tag needs to be included in the header, e.g. prop [ID]|prop will emit ID, label.

I see. What do we think of making the ID or IRI the default (e.g. SubClassOf) and making asking for the label the non-default (e.g. SubClassOf [LABEL])?

cmungall · 2020-03-02T22:32:16Z

Can we also remove tautologies by default? I don't think it's useful to see that root nodes and obsolete classes are subClasses of owl:Thing.

(you could argue that it is not completely content free - e.g. someone may have manually classified an incoherent class under Nothing, or we could be running export post-reason, but even here, the inclusion of the assertion is so arbitrary depending on a sequence of owlapi operations, it renders it useless for any purpose)

cmungall · 2020-03-02T22:45:29Z

Can I just say again this command is AWESOME

OK, I think we just have to make a decision on the following two things:

owl:Thing
defaults for IDs vs Labels

I can live with whichever decision is made either way but it's worth making a considered decision here

A few other minor things that can be punted to a future release so long as they are not considered compatibility breaking, just adding so they do not get forgotten:

uses of | delimiter in string/lang literals should be escaped. I have never seen this in the wild, but good to be safe.
consider use of a super property to group all 4 obo synonym scopes together

Docs (I can make these changes later):

docs could have more guidance on getting desired annotation assertions out (I can write this)
change Eqivalent to Equivalent
it looks as if multi-valued fields that are pipe-separated are sorted prior to concatenation. This is good (non-spurious diffs). Worth explicitly documenting this

jamesaoverton · 2020-03-03T13:26:42Z

beckyjackson · 2020-03-03T15:25:27Z

I like the idea of renaming LABEL to NAME.

The only use case I see for LABEL (currently LABEL-ONLY) is for the label column itself, though. If somebody is using the tag LABEL in another column, like the subclass of column, and the parent doesn't have a label, then nothing would show up. Would you still want to proceed with this behavior?

jamesaoverton · 2020-03-03T15:38:47Z

Good question about missing LABELs. I think I'm ok with unlabelled things disappearing, as long as the documentation is clear. But I'm not completely sure... If we ask for LABEL and an unlabelled term is part of a class expression like "foo subclass of ID:123", then what would we see?

beckyjackson · 2020-03-03T15:58:41Z

We would see

foo subclass of

beckyjackson · 2020-03-03T17:01:04Z

All checklist items above have been addressed. I updated the documentation in the first comment to reflect all new behavior.

jamesaoverton · 2020-03-04T17:30:54Z

My plan is to merge this now, then ask obo-tools for more feedback before releasing 1.7.0.

@cmungall Does that sound good?

cmungall · 2020-03-06T21:13:20Z

I am testing master now, so far it all looks fantastic, thanks!

rctauber added 3 commits May 1, 2019 14:26

Create export command

e001287

Add export command and documentation

e447a66

Update export docs and minor bugs

4a21b4e

jamesaoverton previously requested changes Jun 5, 2019

rctauber added 3 commits June 6, 2019 09:48

Code review updates

60c358e

Fix documentation about labels and CURIEs

7135ada

change error name

85ff3dc

jamesaoverton mentioned this pull request Jun 12, 2019

Standard TSV exports for ontologies #459

Open

Support single quoting and parentheses, support different restrictions

e3cb2a4

Improve documentation strings

5dfe2e4

jamesaoverton changed the title ~~Add export command~~ WIP: Add export command Jun 25, 2019

cmungall mentioned this pull request Dec 22, 2019

Add robot export to standard output products INCATools/ontology-development-kit#303

Open

jamesaoverton reviewed Feb 20, 2020

robot-core/src/main/java/org/obolibrary/robot/ExportOperation.java (outdated)

Add HTML test

5d42a63

Remove blank target

fc623e4

Feedback updates

ec9f763

rctauber added 2 commits March 3, 2020 08:53

Major updates

d4472a7

Only create providers once

b668e5b

rctauber added 2 commits March 3, 2020 09:14

Update JavaDocs

52c61f3

Add JavaDocs

f9505b1

jamesaoverton changed the title ~~WIP: Add export command~~ Add export command Mar 4, 2020

beckyjackson mentioned this pull request Mar 4, 2020

Add JSON format to export #645

Merged

5 tasks

jamesaoverton merged commit e687020 into ontodev:master Mar 6, 2020
