ontodev · beckyjackson · Nov 26, 2021 · Nov 26, 2021 · May 24, 2022
diff --git a/README.md b/README.md
@@ -56,9 +56,9 @@ When using `gizmos` as a Python module, all operations accept a `sqlalchemy` Con
 
 ## Modules
 
-### `gizmos.check`
+### `check`
 
-The `check` module validates a SQLite database for use with other `gizmos` modules. We recommend running your database through `gizmos.check` before using the other commands.
+The `check` module validates a SQLite database for use with other `gizmos` modules. We recommend running your database through `check` before using the other commands.
 ```
 python3 -m gizmos.check [path-to-database]
 ```
@@ -75,7 +75,20 @@ This command will check that both the `prefix` and `statements` tables exist wit
 
 All errors are logged, and if errors are found, the command will exit with status code `1`. Only the first 10 messages about specific rows in the `statements` table are logged to save time. If you wish to override this, use the `--limit <int>`/`-l <int>` option. To print all messages, include `--limit none`.
 
-### `gizmos.export`
+### `expand`
+
+The `expand` module takes an [import table](#creating-import-modules) and creates an explicit import table that contains all terms that will be in the extracted module and the reason that each one is included. For terms listed in the import table, the reason is "defined in input"; for all other terms, it is one of: ancestor, child, descendant, or parent of an included term.
+```
+python3 -m gizmos.expand -d [path-to-database] -i [import-table] > [output-tsv]
+```
+
+The default output format is TSV, but if you want to write a CSV, you can include `-f csv`/`--format csv`.
+
+By default, up to three related terms are listed for each reason (e.g. "descendant of X, Y, Z"). When the term is related to more than three terms, the reason will be shown as "descendant of N terms", where N is the number of terms. You can change this limit with `-L`/`--limit`. The limit must be an integer.
+
+`expand` also includes the `-I`/`--intermediates` option, like `extract`. You should include this option if you plan to include it when you create your extracted module. For more details on this option, see [`extract`](#extract).
+
+### `export`
 
 The `export` module creates a table (default TSV) output containing the terms and their predicates written to stdout.
 ```
@@ -120,7 +133,7 @@ If an ontology term has more than one value for a given predicate, it will be re
 
 If you have many predicates to include, you can use `-P <file>`/`--predicates <file>` for a list of predicates (CURIE or label), each on one line.
 
-### `gizmos.extract`
+### `extract`
 
 The `extract` module creates a TTL or JSON-LD file containing the term, predicates, and ancestors written to stdout.
 ```
@@ -141,7 +154,7 @@ Finally, if you want to annotate all extracted terms with a source ontology IRI,
 
 #### Creating Import Modules
 
-`gizmos.extract` can also be used with import configuration files (`-i <file>`/`--imports <file>`):
+`extract` can also be used with import configuration files (`-i <file>`/`--imports <file>`):
 
 ```
 python3 -m gizmos.extract -d [path-to-database] -i [path-to-imports] > [output-ttl]
@@ -181,7 +194,7 @@ This is a TSV or CSV with the following headers:
 
 The config file can be useful for handling multiple imports with different options in a `Makefile`. If your imports all use the same `--intermediates` option and the same predicates, there is no need to specify a config file.
 
-### `gizmos.search`
+### `search`
 
 The `search` module returns a list of JSON objects for use with the tree browser search bar.
 
@@ -222,7 +235,7 @@ Search is run over all three properties, so even if a term's label does not matc
 
 Finally, the search only returns the first 30 results by default. If you wish to return less or more, you can specify this with `--limit <int>`/`-l <int>`.
 
-### `gizmos.tree`
+### `tree`
 
 The `tree` module produces a CGI tree browser for a given term contained in a SQL database.
 
@@ -248,7 +261,7 @@ The `term` should be a CURIE with a prefix already defined in the `prefix` table
 
 This can be useful when writing scripts that return trees from different databases.
 
-If you provide the `-s`/`--include-search` flag, a search bar will be included in the page. This search bar uses [typeahead.js](https://twitter.github.io/typeahead.js/) and expects the output of [`gizmos.search`](#gizmos.search). The URL for the fetching the data for [Bloodhound](https://github.com/twitter/typeahead.js/blob/master/doc/bloodhound.md) is `?text=[search-text]&format=json`, or `?db=[db]&text=[search-text]&format=json` if the `-d` flag is also provided. The `format=json` is provided as a flag for use in scripts. See the CGI Example below for details on implementation.
+If you provide the `-s`/`--include-search` flag, a search bar will be included in the page. This search bar uses [typeahead.js](https://twitter.github.io/typeahead.js/) and expects the output of [`search`](#search). The URL for fetching the data for [Bloodhound](https://github.com/twitter/typeahead.js/blob/master/doc/bloodhound.md) is `?text=[search-text]&format=json`, or `?db=[db]&text=[search-text]&format=json` if the `-d` flag is also provided. The `format=json` is provided as a flag for use in scripts. See the CGI Example below for details on implementation.
 
 The title displayed in the HTML output is the database file name. If you'd like to override this, you can use the `-t <title>`/`--title <title>` option. This is full HTML page. If you just want the content without `<html>` and `<body>` tags, include `-c`/`--content-only`.
 
@@ -270,7 +283,7 @@ The formatting string must contain `{curie}`, and optionally contain `{db}`. Any
 
 #### Predicates
 
-When displaying a term, `gizmos.tree` will display all predicate-value pairs listed in alphabetical order by predicate label on the right-hand side of the window. You can define which predicates to include with the `-p`/`--predicate` and `-P`/`--predicates` options.
+When displaying a term, `tree` will display all predicate-value pairs listed in alphabetical order by predicate label on the right-hand side of the window. You can define which predicates to include with the `-p`/`--predicate` and `-P`/`--predicates` options.
 
 You can pass one or more predicate CURIEs in the command line using `-p`/`--predicate`. These will appear in the order that you pass:
 ```

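Review note: the reason format described in the `expand` README text above can be sketched in a few lines. This is a minimal illustration, not the module's code; `format_reason` is a hypothetical helper name, and sorting the related terms is an assumption made here so that the output is deterministic.

```python
from typing import Iterable


def format_reason(relation: str, related_terms: Iterable[str], limit: int = 3) -> str:
    """Summarize why a term is included (hypothetical helper, not part of gizmos)."""
    # De-duplicate and sort so the same input always yields the same string
    terms = sorted(set(related_terms))
    if len(terms) > limit:
        # Too many related terms to list: report only the count
        return f"{relation} of {len(terms)} terms"
    return f"{relation} of " + ", ".join(terms)


print(format_reason("descendant", ["X", "Y", "Z"]))
print(format_reason("descendant", ["W", "X", "Y", "Z"]))
print(format_reason("descendant", ["W", "X", "Y", "Z"], limit=5))
```

With the default limit of three, the first call lists all three terms and the second collapses to a count; raising the limit in the third call lists all four terms again.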
diff --git a/gizmos/expand.py b/gizmos/expand.py
@@ -0,0 +1,177 @@
+import csv
+import logging
+import sys
+
+from argparse import ArgumentParser
+from collections import defaultdict, OrderedDict
+from io import StringIO
+from typing import List
+
+from sqlalchemy.engine.base import Connection
+from sqlalchemy.sql.expression import text as sql_text
+from .helpers import get_ancestors, get_children, get_connection, get_descendants, get_parents
+
+
+def main():
+    parser = ArgumentParser()
+    parser.add_argument(
+        "-d", "--database", required=True, help="Database file (.db) or configuration (.ini)"
+    )
+    parser.add_argument(
+        "-i", "--imports", required=True, help="TSV or CSV file containing import module details"
+    )
+    parser.add_argument(
+        "-I",
+        "--intermediates",
+        help="Included ancestor/descendant intermediates (default: all)",
+        default="all",
+    )
+    parser.add_argument(
+        "-L",
+        "--limit",
+        help="Max number of terms to display as reason (default: 3)",
+        type=int,
+        default=3,
+    )
+    parser.add_argument("-f", "--format", help="Output table format (default: tsv)", default="tsv")
+    args = parser.parse_args()
+    sys.stdout.write(run_expand(args))
+
+
+def run_expand(args):
+    conn = get_connection(args.database)
+    sep = "\t"
+    if args.imports.endswith(".csv"):
+        sep = ","
+
+    with open(args.imports, "r") as f:
+        reader = csv.DictReader(f, delimiter=sep)
+        explicit_rows = expand(conn, list(reader), intermediates=args.intermediates, limit=args.limit)
+
+    out = StringIO()
+    sep = "\t"
+    if args.format == "csv":
+        sep = ","
+    elif args.format != "tsv":
+        sep = "\t"
+        logging.warning(f"Unknown output format ({args.format}) - output will be written as TSV")
+    headers = list(explicit_rows[0].keys())
+    if "Related" in headers:
+        headers.remove("Related")
+    writer = csv.DictWriter(out, delimiter=sep, fieldnames=headers, extrasaction="ignore")
+    writer.writeheader()
+    writer.writerows(explicit_rows)
+    return out.getvalue()
+
+
+def expand(conn: Connection, rows: List[dict], intermediates="all", limit=3) -> List[dict]:
+    # dict of term ID -> row from import table
+    terms = {}
+    # dict of term ID -> row number in the import table, used in error messages
+    row_numbers = {}
+    headers = []
+    # create terms dict & get headers; the header is row 1, so data rows start at 2
+    for i, row in enumerate(rows, start=2):
+        row["Reason"] = "defined in input"
+        terms[row["ID"]] = dict(row)
+        row_numbers[row["ID"]] = i
+        if not headers:
+            headers = list(row.keys())
+    if "Related" in headers:
+        headers.remove("Related")
+
+    # create dict of explicit terms and the reason(s) they are included
+    explicit_terms = {}
+    for term_id, row in terms.items():
+        logging.debug(term_id)
+        related = row.get("Related")
+        if not related:
+            continue
+        label = row.get("Label")
+        for rel in related.split(","):
+            rel = rel.strip()
+            if rel == "ancestors":
+                ancestors = get_ancestors(conn, term_id, set(terms.keys()), intermediates)
+                logging.debug(ancestors)
+                if term_id in ancestors:
+                    # remove self relation
+                    ancestors.remove(term_id)
+                for a in ancestors:
+                    explicit_terms = update_explicit_terms(label or term_id, a, explicit_terms, "ancestor_of")
+            elif rel == "children":
+                children = get_children(conn, term_id)
+                for c in children:
+                    explicit_terms = update_explicit_terms(label or term_id, c, explicit_terms, "child_of")
+            elif rel == "descendants":
+                descendants = get_descendants(conn, term_id, intermediates)
+                if term_id in descendants:
+                    # remove self relation
+                    descendants.remove(term_id)
+                for d in descendants:
+                    explicit_terms = update_explicit_terms(label or term_id, d, explicit_terms, "descendant_of")
+            elif rel == "parents":
+                parents = get_parents(conn, term_id)
+                for p in parents:
+                    explicit_terms = update_explicit_terms(label or term_id, p, explicit_terms, "parent_of")
+            else:
+                raise Exception(f"Unknown relation for {term_id} on row {row_numbers[term_id]}: {rel}")
+
+    # add explicit terms to all terms dict
+    for term_id, reasons in explicit_terms.items():
+        row = {"ID": term_id}
+        if term_id in terms:
+            continue
+        if "Label" in headers:
+            query = sql_text(
+                "SELECT value FROM statements WHERE subject = :term_id AND predicate = 'rdfs:label'"
+            )
+            res = conn.execute(query, term_id=term_id).fetchone()
+            if res:
+                row["Label"] = res["value"]
+        reason_str = []
+        if "ancestor_of" in reasons:
+            reason_str.append(create_reason_str(reasons, "ancestor", limit=limit))
+        if "child_of" in reasons:
+            reason_str.append(create_reason_str(reasons, "child", limit=limit))
+        if "descendant_of" in reasons:
+            reason_str.append(create_reason_str(reasons, "descendant", limit=limit))
+        if "parent_of" in reasons:
+            reason_str.append(create_reason_str(reasons, "parent", limit=limit))
+        row["Reason"] = " & ".join(reason_str)
+        terms[term_id] = row
+
+    return list(OrderedDict(sorted(terms.items())).values())
+
+
+def create_reason_str(reasons: dict, relation: str, limit: int = 3):
+    """Return a string defining the reason a term is included in the explicit output."""
+    related_terms = reasons[f"{relation}_of"]
+    if len(related_terms) > limit:
+        return f"{relation} of {len(related_terms)} terms"
+    return f"{relation} of " + ", ".join(sorted(related_terms))
+
+
+def update_explicit_terms(
+    term_id_or_label: str, related_term_id: str, explicit_terms: dict, key: str
+):
+    """Update the explicit terms dictionary with the relation between the related term and the given term."""
+    # check if this term already exists, and get existing relations if so
+    term_dict = defaultdict(set)
+    if related_term_id in explicit_terms:
+        term_dict = explicit_terms[related_term_id]
+    if key not in term_dict:
+        term_dict[key] = set()
+
+    # add the term either by label or ID
+    if " " in term_id_or_label:
+        term_dict[key].add(f"'{term_id_or_label}'")
+    else:
+        term_dict[key].add(term_id_or_label)
+
+    # update the master dict
+    explicit_terms[related_term_id] = term_dict
+    return explicit_terms
+
+
+if __name__ == "__main__":
+    main()