Field | Value |
---|---|
title | BioHackrXiv template this is an example of a (too) long title mpla mpla mpla mpla mpla mpla mpla mpla mpla mpla mpla c wjfc wjknwjek nwjkwen jk |
title_short | Logic Programming for the Biomedical Sciences |
tags | |
authors | |
affiliations | |
date | 3 March 2020 |
cito-bibliography | paper.bib |
event | Fukuoka2019 |
biohackathon_name | NBDC/DBCLS BioHackathon |
biohackathon_url | |
biohackathon_location | Fukuoka, Japan, 2019 |
group | Logic programming group |
git_url | |
authors_short | Chris Mungall & Hirokazu Chiba \emph{et al.} |
As part of the one-week BioHackathon 2019 in Fukuoka, Japan, we formed a working group on logic programming for the biomedical sciences. Many bioinformaticians understand logic programming when it is presented in the form of relational SQL queries or SPARQL queries. More advanced logic programming, however, is underutilized in bioinformatics. Prolog, for example, is a high-level programming language that has its roots in first-order logic, also known as first-order predicate calculus. Another example, miniKanren, is an embedded Domain Specific Language for logic programming. Core miniKanren is exceptionally simple, with only three logical operators and one interface operator [@uses_method_in:reasoned2nd].
The introduction of logic programming is particularly relevant in the context of multi-model data representations where data can be accessed in memory as free data structures, but also on disk where data can be represented as tables, trees (documents), and graphs. In bioinformatics we can make use of all these different data sources and have a query engine that can mine them all efficiently.
Logic programming is well-suited for biological research. Essentially, a researcher writes a number of statements that include variables representing unknown information. The logic engine then searches the solution space (all data) for possible matches (see figure \ref{fig}). The rationale for, and implementations of, miniKanren and logic programming are covered in much more detail in Byrd's book \emph{The Reasoned Schemer, Second Edition} [@agreesWith:reasoned2nd], PhD thesis [@ByrdPhD], and online talks.
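The idea of statements with unknowns that an engine matches against all data can be sketched in a few lines of Python. This is a minimal illustration, not the way Prolog or miniKanren are implemented: the facts, the `?name` variable convention, and the `solve` function are all assumptions made up for this example.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Bindings = Dict[str, str]

# Hypothetical toy facts, invented for illustration only.
FACTS: List[Triple] = [
    ("geneA", "encodes", "proteinA"),
    ("geneB", "encodes", "proteinB"),
    ("proteinA", "interacts_with", "proteinB"),
]


def is_var(term: str) -> bool:
    """Terms starting with '?' are treated as logic variables (an assumed convention)."""
    return term.startswith("?")


def match(pattern: Triple, fact: Triple, bindings: Bindings) -> Optional[Bindings]:
    """Try to match one pattern against one fact; return extended bindings or None."""
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            # Bind the variable, or fail if it is already bound to something else.
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new


def solve(patterns: List[Triple], bindings: Optional[Bindings] = None) -> Iterator[Bindings]:
    """Search all facts for bindings that satisfy every pattern (a conjunction)."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for fact in FACTS:
        extended = match(first, fact, bindings)
        if extended is not None:
            yield from solve(rest, extended)


# Which gene encodes a protein that interacts with some partner?
query = [
    ("?gene", "encodes", "?protein"),
    ("?protein", "interacts_with", "?partner"),
]
results = list(solve(query))
print(results)
```

The researcher only states what must hold; the exhaustive search over `FACTS` is left to the engine. Real logic engines add full unification, rules, and far better search strategies.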
The `Logic Programming' working group at the 2019 edition of the annual Japanese BioHackathon applied logic programming to various problems. The working group:

\begin{itemize}
\item researched state-of-the-art mapping between graph stores and logic programming;
\item created methods for bridging between SPARQL and in-memory data representations using Prolog;
\item extended the Biolink model;
\item and added Relational Biolink type inference for mediKanren.
\end{itemize}
The working group researched current solutions for combining logic programming with SPARQL. ClioPatria is an in-memory RDF quad-store tightly coupled with SWI-Prolog by Jan Wielemaker, the main author of SWI-Prolog [@WielemakerBHO15]. SWI-Prolog is published under a BSD license. Python bindings for ClioPatria reportedly exist, although we were unable to locate their source code. We think ClioPatria and SWI-Prolog are particularly useful for teaching, and for (in-memory) semantic web applications. SWI-Prolog comes with client libraries for SQL and SPARQL queries.
A number of biological databases make their data available in RDF format and support SPARQL access; examples include UniProt, NCBI PubChem and the EBI RDF platform. SPARQL provides a subset of what logic programming can do. However, SPARQL queries are not composable: there is no way to reuse modular components across queries. For example, executing a range query on a genomic region using the FALDO model [@agreesWith:Bolleman2016] requires authoring a complex query over many triples. If we then wish to reuse parts of that query in a more complex query, we have to compose them manually.
The working group added code to SPARQLProg, which provides a way to define modular query components using logic programming. SPARQLProg is written in SWI-Prolog and has a Python interface library. All code has been made available in the example directory of SPARQLProg, which provides a sophisticated mapping of logic queries to SPARQL.
For example, a 4-part predicate `feature_in_range` can be composed with a binary `has_mouse_ortholog` predicate:

    feature_in_range(grch38:"X", 10000000, 20000000, HumanGene),
    has_mouse_ortholog(HumanGene, MouseGene)
This will compile down to a more complex SPARQL query, and execute it against a remote endpoint.
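To make the notion of composable query components concrete, here is a rough Python sketch of how two reusable fragments could be combined into one SPARQL query string. It does not reproduce SPARQLProg's actual compiler or its generated output: the function names mirror the predicates above, but the `ex:` prefix, the `ex:has_mouse_ortholog` property, the `ex:chrX` identifier, and the exact choice of FALDO triple patterns are assumptions for illustration.

```python
from typing import List

# The ex: namespace is a placeholder; faldo: is the FALDO vocabulary namespace.
PREFIXES = (
    "PREFIX faldo: <http://biohackathon.org/resource/faldo#>\n"
    "PREFIX ex: <http://example.org/>\n"
)


def feature_in_range(chrom: str, start: int, end: int, feature: str) -> List[str]:
    """Reusable fragment: features located within [start, end] on chrom."""
    return [
        f"{feature} faldo:location ?loc .",
        "?loc faldo:begin ?b ; faldo:end ?e .",
        f"?b faldo:position ?begin_pos ; faldo:reference {chrom} .",
        "?e faldo:position ?end_pos .",
        f"FILTER (?begin_pos >= {start} && ?end_pos <= {end})",
    ]


def has_mouse_ortholog(human_gene: str, mouse_gene: str) -> List[str]:
    """Reusable fragment: hypothetical orthology property, for illustration only."""
    return [f"{human_gene} ex:has_mouse_ortholog {mouse_gene} ."]


def compile_query(select_vars: List[str], *components: List[str]) -> str:
    """Join fragments into one SELECT query; shared variable names act as joins."""
    body = "\n".join("  " + line for part in components for line in part)
    return f"{PREFIXES}SELECT {' '.join(select_vars)} WHERE {{\n{body}\n}}"


query = compile_query(
    ["?human_gene", "?mouse_gene"],
    feature_in_range("ex:chrX", 10_000_000, 20_000_000, "?human_gene"),
    has_mouse_ortholog("?human_gene", "?mouse_gene"),
)
print(query)
```

Because both fragments share the `?human_gene` variable, composing them yields a join, much as the shared `HumanGene` variable does in the Prolog conjunction. A real compiler would also have to rename internal variables such as `?loc` so that a fragment can be used more than once in the same query.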
SPARQLProg now includes bindings for many common biological SPARQL endpoints. As part of this hackathon we developed code to access the RDF databases of MBGD [@Chiba2015], KEGG OC, TogoVar, JCM, Allie, EBI BioSamples, UniProt, and DisGeNET [@Queralt2016]. Future work includes using this Prolog code as building blocks for integrative analysis.
The Biolink Model (see above) is a data model developed for representing biological and biomedical knowledge. It includes a schema and generated objects for the data model and upper ontology. The Biolink Model was designed to standardize the way information is represented in a graph store, regardless of the formalism used. The working group focused on extending this model to support the representation of a wide variety of knowledge.
The following tasks were accomplished as part of the BioHackathon:
\begin{enumerate}
\item Represent datasets and their related metadata
\item Represent family and pedigree information to support clinical knowledge
\item Make the provenance model richer and more descriptive
\end{enumerate}
(note the list is written in embedded LaTeX)
For future work, the group will ensure that the new classes added to the model will have appropriate mappings to other schemas and ontologies.
miniKanren is an embedded Domain Specific Language for logic programming. The goal was to implement a relational type inferencer for the Biolink Model in miniKanren, which can be integrated into mediKanren. The working group added a `yaml` subdirectory to the mediKanren GitHub page, and created multiple files in https://github.com/webyrd/mediKanren/yaml where `yaml2sexp.py` generates the `biolink.scm` file which contains an s-expression version of the Biolink YAML file. `yaml.scm` contains miniKanren relations, and Chez Scheme code that generates miniKanren relations based on `biolink.scm`. These are giant miniKanren `conde` clauses that can be thought of as relational tables. `yaml.scm` also contains tests for the relations.
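As a rough illustration of the two steps described above, the following Python sketch turns a nested, YAML-like structure into an s-expression string and then flattens its `is_a` links into rows, which is the sense in which a large `conde` can be read as a relational table. This is not the code in `yaml2sexp.py` or `yaml.scm`: the `to_sexp` function, the association-list output layout, and the small schema fragment (including its class hierarchy) are assumptions for illustration.

```python
from typing import Any, List, Tuple

# Hypothetical fragment shaped like a parsed YAML schema.
# The class names and is_a links are illustrative, not the actual Biolink hierarchy.
schema = {
    "classes": {
        "gene": {"is_a": "genomic entity"},
        "genomic entity": {"is_a": "named thing"},
    }
}


def to_sexp(value: Any) -> str:
    """Render nested dicts/lists/scalars as a Scheme-style s-expression string."""
    if isinstance(value, dict):
        # Dicts become association lists of dotted pairs.
        items = " ".join(f"({to_sexp(k)} . {to_sexp(v)})" for k, v in value.items())
        return f"({items})"
    if isinstance(value, (list, tuple)):
        return "(" + " ".join(to_sexp(v) for v in value) + ")"
    if isinstance(value, bool):  # must be checked before int: bool is an int subclass
        return "#t" if value else "#f"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Flatten the hierarchy into (child, parent) rows: one row per would-be clause.
is_a_rows: List[Tuple[str, str]] = [
    (name, spec["is_a"])
    for name, spec in schema["classes"].items()
    if "is_a" in spec
]

print(to_sexp(schema))
print(is_a_rows)
```

Each `(child, parent)` row corresponds to one clause of a relation that can be queried in either direction, for example to ask for all children of a given parent as well as the parent of a given child.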
Future work includes:
- integrating this work into the Racket mediKanren code;
- integrating with the data categories in the knowledge graphs (KGs);
- and creating a query editor with decent type error messages, autocompletion, query synthesis, etc.
The working group concluded that there is ample scope for logic programming in bioinformatics. Future work includes expanding access to semantic web databases through SPARQLProg, extending the Biolink Model, and adding dynamic SPARQL support to miniKanren.
We thank the organizers of the NBDC/DBCLS BioHackathon 2019 for travel support for some of the authors.
We use pandoc-flavoured markdown, similar to RStudio; see \url{https://garrettgman.github.io/rmarkdown/authoring_pandoc_markdown.html}.
Tables can be added in the following way, though alternatives are possible:
Header 1 | Header 2 |
---|---|
item 1 | item 2 |
item 3 | item 4 |
Table: Note that the table caption is automatically numbered.
Term | MB tools/ontologies using this term | Frequency on Biology Stack Exchange | Search Term |
---|---|---|---|
Part | iGEM | 9065 | part + parts |
Component | SBOL, SBOLDesigner, SBOLCanvas | 2163 | component |
Module | SBOL | 311 | module |
Device | | 677 | device |
System | | 16098 | system |
RBS | | 548 | rbs |
Ribosome Entry Site | SO | 8 | ribosome entry site |
LaTeX table:
\begin{tabular}{|l|l|}\hline
Age & Frequency \\ \hline
18--25 & 15 \\
26--35 & 33 \\
36--45 & 22 \\ \hline
\end{tabular}
This is an example of embedding a graph
graph TD;
A-->B;
A-->C;
B-->D;
C-->D;
Unfortunately, it does not work without the mermaid plugin, which requires headless Chrome. If you run the command-line version of `gen-pdf`, it may be possible to get it to work with the right packages. Please tell us if you succeed.