First pass as expand operation #100

Open
wants to merge 3 commits into
base: master
31 changes: 22 additions & 9 deletions README.md
@@ -56,9 +56,9 @@ When using `gizmos` as a Python module, all operations accept a `sqlalchemy` Con

## Modules

### `gizmos.check`
### `check`

The `check` module validates a SQLite database for use with other `gizmos` modules. We recommend running your database through `gizmos.check` before using the other commands.
The `check` module validates a SQLite database for use with other `gizmos` modules. We recommend running your database through `check` before using the other commands.
```
python3 -m gizmos.check [path-to-database]
```
@@ -75,7 +75,20 @@ This command will check that both the `prefix` and `statements` tables exist wit

All errors are logged, and if errors are found, the command will exit with status code `1`. Only the first 10 messages about specific rows in the `statements` table are logged to save time. If you wish to override this, use the `--limit <int>`/`-l <int>` option. To print all messages, include `--limit none`.

### `gizmos.export`
### `expand`

The `expand` module takes an [import table](#creating-import-modules) and creates an explicit import table that lists every term that will be in the extracted module, along with the reason it is included. Terms listed in the import table itself have the reason "defined in input". Every other term's reason is one or more of: ancestor, child, descendant, or parent of an included term.
```
python3 -m gizmos.expand -d [path-to-database] -i [import-table] > [output-tsv]
```

The default output format is TSV, but if you want to write a CSV, you can include `-f csv`/`--format csv`.
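
As a rough illustration of the format switch, the sketch below maps a format name to a delimiter and writes rows with Python's `csv` module. The `write_table` helper and the sample row are hypothetical and not part of `gizmos`; falling back to TSV for an unrecognized format mirrors the warning that `run_expand` logs in this pull request.

```python
import csv
import logging
from io import StringIO


def write_table(rows, fmt="tsv"):
    """Serialize a list of dicts as TSV (default) or CSV. Hypothetical helper."""
    if fmt == "csv":
        sep = ","
    else:
        if fmt != "tsv":
            logging.warning(f"Unknown output format ({fmt}) - writing TSV instead")
        sep = "\t"
    if not rows:
        return ""
    out = StringIO()
    # lineterminator="\n" avoids the csv module's default "\r\n" line endings
    writer = csv.DictWriter(
        out, delimiter=sep, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Sample row; the ID and label are made up for illustration
rows = [{"ID": "EX:1", "Label": "example term", "Reason": "defined in input"}]
tsv_output = write_table(rows)
csv_output = write_table(rows, fmt="csv")
```

Only the delimiter changes between the two formats, so the same rows can be written either way.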

By default, up to three related terms are listed for each reason (e.g. a term that is a descendant of included terms X, Y, and Z gets the reason "descendant of X, Y, Z"). When the term is related to more than three terms, the reason is shown as "descendant of N terms", where N is the number of terms. You can change this limit with `-L`/`--limit`. The limit must be an integer.
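
To make the limit concrete, here is a minimal sketch of how a reason string could be built from the set of related terms. It follows the same idea as `create_reason_str` in `gizmos/expand.py` from this pull request, but `reason_for` and the term names are hypothetical, and the names are sorted here only to make the output predictable.

```python
def reason_for(relation, related_terms, limit=3):
    """Describe why a term is included, collapsing long lists to a count."""
    if len(related_terms) > limit:
        return f"{relation} of {len(related_terms)} terms"
    return f"{relation} of " + ", ".join(sorted(related_terms))


short_reason = reason_for("descendant", {"X", "Y", "Z"})
long_reason = reason_for("descendant", {"W", "X", "Y", "Z"})
print(short_reason)  # descendant of X, Y, Z
print(long_reason)  # descendant of 4 terms
```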

Like `extract`, `expand` accepts the `-I`/`--intermediates` option. Pass the same value here that you plan to pass to `extract` when you create your extracted module. For more details on this option, see [`extract`](#extract).

### `export`

The `export` module creates a table (default TSV) output containing the terms and their predicates written to stdout.
```
@@ -120,7 +133,7 @@ If an ontology term has more than one value for a given predicate, it will be re

If you have many predicates to include, you can use `-P <file>`/`--predicates <file>` for a list of predicates (CURIE or label), each on one line.

### `gizmos.extract`
### `extract`

The `extract` module creates a TTL or JSON-LD file containing the term, predicates, and ancestors written to stdout.
```
@@ -141,7 +154,7 @@ Finally, if you want to annotate all extracted terms with a source ontology IRI,

#### Creating Import Modules

`gizmos.extract` can also be used with import configuration files (`-i <file>`/`--imports <file>`):
`extract` can also be used with import configuration files (`-i <file>`/`--imports <file>`):

```
python3 -m gizmos.extract -d [path-to-database] -i [path-to-imports] > [output-ttl]
@@ -181,7 +194,7 @@ This is a TSV or CSV with the following headers:

The config file can be useful for handling multiple imports with different options in a `Makefile`. If your imports all use the same `--intermediates` option and the same predicates, there is no need to specify a config file.

### `gizmos.search`
### `search`

The `search` module returns a list of JSON objects for use with the tree browser search bar.

@@ -222,7 +235,7 @@ Search is run over all three properties, so even if a term's label does not matc

Finally, the search only returns the first 30 results by default. If you wish to return fewer or more, you can specify this with `--limit <int>`/`-l <int>`.

### `gizmos.tree`
### `tree`

The `tree` module produces a CGI tree browser for a given term contained in a SQL database.

@@ -248,7 +261,7 @@ The `term` should be a CURIE with a prefix already defined in the `prefix` table

This can be useful when writing scripts that return trees from different databases.

If you provide the `-s`/`--include-search` flag, a search bar will be included in the page. This search bar uses [typeahead.js](https://twitter.github.io/typeahead.js/) and expects the output of [`gizmos.search`](#gizmos.search). The URL for the fetching the data for [Bloodhound](https://github.com/twitter/typeahead.js/blob/master/doc/bloodhound.md) is `?text=[search-text]&format=json`, or `?db=[db]&text=[search-text]&format=json` if the `-d` flag is also provided. The `format=json` is provided as a flag for use in scripts. See the CGI Example below for details on implementation.
If you provide the `-s`/`--include-search` flag, a search bar will be included in the page. This search bar uses [typeahead.js](https://twitter.github.io/typeahead.js/) and expects the output of [`search`](#search). The URL for fetching the data for [Bloodhound](https://github.com/twitter/typeahead.js/blob/master/doc/bloodhound.md) is `?text=[search-text]&format=json`, or `?db=[db]&text=[search-text]&format=json` if the `-d` flag is also provided. The `format=json` is provided as a flag for use in scripts. See the CGI Example below for details on implementation.
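
The two query strings described above can be assembled with the standard library. This is only a sketch: `build_search_query` is a hypothetical helper that is not part of `gizmos`, and the search text and database name are placeholders.

```python
from urllib.parse import urlencode


def build_search_query(text, db=None):
    """Build the query string requested by the search bar (hypothetical helper)."""
    params = {}
    if db is not None:
        # only present when the tree is served with the -d flag
        params["db"] = db
    params["text"] = text
    params["format"] = "json"
    return "?" + urlencode(params)


without_db = build_search_query("liver")
with_db = build_search_query("liver", db="obi")
```

`urlencode` also escapes spaces and other reserved characters in the search text, which matters once users type multi-word queries.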

The title displayed in the HTML output is the database file name. If you'd like to override this, you can use the `-t <title>`/`--title <title>` option. The output is a full HTML page. If you just want the content without `<html>` and `<body>` tags, include `-c`/`--content-only`.

@@ -270,7 +283,7 @@ The formatting string must contain `{curie}`, and optionally contain `{db}`. Any

#### Predicates

When displaying a term, `gizmos.tree` will display all predicate-value pairs listed in alphabetical order by predicate label on the right-hand side of the window. You can define which predicates to include with the `-p`/`--predicate` and `-P`/`--predicates` options.
When displaying a term, `tree` will display all predicate-value pairs listed in alphabetical order by predicate label on the right-hand side of the window. You can define which predicates to include with the `-p`/`--predicate` and `-P`/`--predicates` options.

You can pass one or more predicate CURIEs in the command line using `-p`/`--predicate`. These will appear in the order that you pass:
```
177 changes: 177 additions & 0 deletions gizmos/expand.py
@@ -0,0 +1,177 @@
import csv
import logging
import sys

from argparse import ArgumentParser
from collections import defaultdict, OrderedDict
from io import StringIO
from typing import List

from sqlalchemy.engine.base import Connection
from sqlalchemy.sql.expression import text as sql_text
from .helpers import get_ancestors, get_children, get_connection, get_descendants, get_parents


def main():
    parser = ArgumentParser()
    parser.add_argument(
        "-d", "--database", required=True, help="Database file (.db) or configuration (.ini)"
    )
    parser.add_argument(
        "-i", "--imports", required=True, help="TSV or CSV file containing import module details"
    )
    parser.add_argument(
        "-I",
        "--intermediates",
        help="Included ancestor/descendant intermediates (default: all)",
        default="all",
    )
    parser.add_argument(
        "-L",
        "--limit",
        help="Max number of terms to display as reason (default: 3)",
        type=int,
        default=3,
    )
    parser.add_argument("-f", "--format", help="Output table format (default: tsv)", default="tsv")
    args = parser.parse_args()
    sys.stdout.write(run_expand(args))


def run_expand(args):
    conn = get_connection(args.database)
    # the input delimiter is chosen from the import file extension
    sep = "\t"
    if args.imports.endswith(".csv"):
        sep = ","

    with open(args.imports, "r") as f:
        reader = csv.DictReader(f, delimiter=sep)
        explicit_rows = expand(conn, list(reader), intermediates=args.intermediates, limit=args.limit)

    # the output delimiter is chosen from --format; unknown formats fall back to TSV
    sep = "\t"
    if args.format == "csv":
        sep = ","
    elif args.format != "tsv":
        logging.warning(f"Unknown output format ({args.format}) - output will be written as TSV")
    if not explicit_rows:
        # an empty import table produces no output
        return ""
    out = StringIO()
    headers = list(explicit_rows[0].keys())
    if "Related" in headers:
        headers.remove("Related")
    writer = csv.DictWriter(out, delimiter=sep, fieldnames=headers, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(explicit_rows)
    return out.getvalue()


def expand(conn: Connection, rows: List[dict], intermediates="all", limit=3) -> List[dict]:
    # dict of term ID -> row from import table
    terms = {}
    # dict of term ID -> row number in the import table (used in error messages)
    row_numbers = {}
    # get headers for writing table
    headers = []
    # create terms dict & get headers; row 1 is the header row, so data starts at row 2
    for i, row in enumerate(rows, start=2):
        row["Reason"] = "defined in input"
        terms[row["ID"]] = dict(row)
        row_numbers[row["ID"]] = i
        if not headers:
            headers = list(row.keys())
            if "Related" in headers:
                headers.remove("Related")

    # create dict of explicit terms and the reason(s) they are included
    explicit_terms = {}
    for term_id, row in terms.items():
        logging.debug(term_id)
        related = row.get("Related")
        if not related:
            continue
        label = row.get("Label")
        for rel in related.split(","):
            rel = rel.strip()
            if rel == "ancestors":
                ancestors = get_ancestors(conn, term_id, set(terms.keys()), intermediates)
                logging.debug(ancestors)
                if term_id in ancestors:
                    # remove self relation
                    ancestors.remove(term_id)
                for a in ancestors:
                    explicit_terms = update_explicit_terms(label or term_id, a, explicit_terms, "ancestor_of")
            elif rel == "children":
                children = get_children(conn, term_id)
                for c in children:
                    explicit_terms = update_explicit_terms(label or term_id, c, explicit_terms, "child_of")
            elif rel == "descendants":
                descendants = get_descendants(conn, term_id, intermediates)
                if term_id in descendants:
                    # remove self relation
                    descendants.remove(term_id)
                for d in descendants:
                    explicit_terms = update_explicit_terms(label or term_id, d, explicit_terms, "descendant_of")
            elif rel == "parents":
                parents = get_parents(conn, term_id)
                for p in parents:
                    explicit_terms = update_explicit_terms(label or term_id, p, explicit_terms, "parent_of")
            else:
                raise Exception(f"Unknown relation for {term_id} on row {row_numbers[term_id]}: {rel}")

    # add explicit terms to all terms dict
    for term_id, reasons in explicit_terms.items():
        if term_id in terms:
            # terms defined in the input keep their original row and reason
            continue
        row = {"ID": term_id}
        if "Label" in headers:
            query = sql_text(
                "SELECT value FROM statements WHERE subject = :term_id AND predicate = 'rdfs:label'"
            )
            res = conn.execute(query, term_id=term_id).fetchone()
            if res:
                row["Label"] = res["value"]
        reason_str = []
        if "ancestor_of" in reasons:
            reason_str.append(create_reason_str(reasons, "ancestor", limit=limit))
        if "child_of" in reasons:
            reason_str.append(create_reason_str(reasons, "child", limit=limit))
        if "descendant_of" in reasons:
            reason_str.append(create_reason_str(reasons, "descendant", limit=limit))
        if "parent_of" in reasons:
            reason_str.append(create_reason_str(reasons, "parent", limit=limit))
        row["Reason"] = " & ".join(reason_str)
        terms[term_id] = row

    # return rows sorted by term ID
    return list(OrderedDict(sorted(terms.items())).values())


def create_reason_str(reasons: dict, relation: str, limit: int = 3):
    """Return a string defining the reason a term is included in the explicit output."""
    related_terms = reasons[f"{relation}_of"]
    if len(related_terms) > limit:
        return f"{relation} of {len(related_terms)} terms"
    # sort the set so the output is the same on every run
    return f"{relation} of " + ", ".join(sorted(related_terms))


def update_explicit_terms(
    term_id_or_label: str, related_term_id: str, explicit_terms: dict, key: str
):
    """Update the explicit terms dictionary with the relation between the related term and the given term."""
    # check if this term already exists, and get existing relations if so
    term_dict = defaultdict(set)
    if related_term_id in explicit_terms:
        term_dict = explicit_terms[related_term_id]
    if key not in term_dict:
        term_dict[key] = set()

    # add the term either by label or ID; labels containing spaces are quoted
    if " " in term_id_or_label:
        term_dict[key].add(f"'{term_id_or_label}'")
    else:
        term_dict[key].add(term_id_or_label)

    # update the master dict
    explicit_terms[related_term_id] = term_dict
    return explicit_terms


if __name__ == "__main__":
    main()
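
For readers following `update_explicit_terms`, the bookkeeping it performs can be sketched in isolation. This is a simplified variant and not the code in this pull request: `add_reason` is a hypothetical name, the term IDs and label are made up, and nested `defaultdict`s stand in for the explicit existence checks.

```python
from collections import defaultdict


def add_reason(explicit_terms, related_term_id, key, term_id_or_label):
    """Record that related_term_id is included because of term_id_or_label."""
    # labels containing spaces are quoted so they stay readable in a comma-separated reason
    if " " in term_id_or_label:
        name = f"'{term_id_or_label}'"
    else:
        name = term_id_or_label
    explicit_terms[related_term_id][key].add(name)


# term ID -> relation key -> set of names
explicit_terms = defaultdict(lambda: defaultdict(set))
add_reason(explicit_terms, "EX:1", "ancestor_of", "EX:2")
add_reason(explicit_terms, "EX:1", "ancestor_of", "example term")
add_reason(explicit_terms, "EX:1", "parent_of", "EX:2")
```

Because each relation key maps to a set, recording the same relation twice has no effect, which is why a term reached through several paths is listed once per reason.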