Introduction

As part of the one-week BioHackathon 2019 in Fukuoka, Japan, we formed a working group on logic programming for the biomedical sciences. Many bioinformaticians are familiar with logic programming in the form of relational SQL queries or SPARQL queries. More advanced logic programming, however, is underutilized in bioinformatics. Prolog, for example, is a high-level programming language rooted in first-order logic (first-order predicate calculus). Another example, miniKanren, is an embedded domain-specific language for logic programming. Core miniKanren is exceptionally simple, with only three logical operators and one interface operator [@uses_method_in:reasoned2nd].

Figure: The logic programming resolver traverses the solution space to find all matches. \label{fig}

The introduction of logic programming is particularly relevant in the context of multi-model data representations where data can be accessed in memory as free data structures, but also on disk where data can be represented as tables, trees (documents), and graphs. In bioinformatics we can make use of all these different data sources and have a query engine that can mine them all efficiently.

Logic programming is well suited to biological research. Essentially, a researcher writes a number of statements that include variables representing unknown information. The logic engine then searches the solution space (all data) for possible matches (see figure \ref{fig}). Much more detail on the rationale and implementation of miniKanren and logic programming can be found in Byrd's book \emph{The Reasoned Schemer, Second Edition} [@agreesWith:reasoned2nd], his PhD thesis [@ByrdPhD], and online talks.
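The idea of statements with unknowns being matched against all data can be illustrated with a minimal Python sketch. It is not how Prolog or miniKanren are implemented; the fact base, the `?name` variable convention, and the function names are assumptions made up for this example.

```python
# Toy fact base of (subject, predicate, object) triples. All names are made up.
FACTS = [
    ("geneA", "encodes", "proteinA"),
    ("geneB", "encodes", "proteinB"),
    ("proteinA", "interacts_with", "proteinB"),
    ("proteinB", "located_in", "nucleus"),
]


def is_var(term):
    """Strings starting with '?' are logic variables (a convention of this sketch)."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact, bindings):
    """Extend `bindings` so that `pattern` equals `fact`; return None on failure."""
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            # Bind the variable, or fail if it is already bound to something else.
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new


def solve(patterns, bindings=None):
    """Yield every set of bindings that satisfies all patterns (depth-first search)."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for fact in FACTS:
        extended = match(first, fact, bindings)
        if extended is not None:
            yield from solve(rest, extended)


# Which gene encodes a protein that interacts with something located in the nucleus?
query = [
    ("?gene", "encodes", "?protein"),
    ("?protein", "interacts_with", "?partner"),
    ("?partner", "located_in", "nucleus"),
]
results = list(solve(query))
print(results)
```

The researcher only states the three patterns; the search over the fact base, including backtracking when a partial match cannot be completed, is left to the engine.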

The `Logic Programming' working group at the 2019 edition of the annual Japanese BioHackathon applied logic programming to various problems. The working group:

\begin{itemize}
\item researched state-of-the-art mapping between graph stores and logic programming;
\item created methods for bridging between SPARQL and in-memory data representations using Prolog;
\item extended the Biolink model;
\item and added Relational Biolink type inference for mediKanren.
\end{itemize}

Research of state-of-the-art logic programming facilities for SPARQL

The working group researched current solutions for combining logic programming with SPARQL. ClioPatria, by Jan Wielemaker, the main author of SWI-Prolog, is an in-memory RDF quad store tightly coupled with SWI-Prolog [@WielemakerBHO15]. SWI-Prolog is published under a BSD license, and bindings between ClioPatria and other languages such as Python exist, although we were unable to locate their source code. We think ClioPatria and SWI-Prolog are particularly useful for teaching and for (in-memory) semantic web applications. SWI-Prolog comes with client libraries for SQL and SPARQL queries.

Accessing biological databases using SPARQLProg

A number of biological databases make their data available in RDF format and support SPARQL access, for example UniProt, NCBI PubChem and the EBI RDF platform. SPARQL provides a subset of what logic programming can do. However, SPARQL queries are not composable: there is no way to reuse modular components across queries. For example, executing a range query on a genomic region using the FALDO model [@agreesWith:Bolleman2016] requires authoring a complex query over many triples. If we then wish to reuse parts of that query in a more complex query, we have to compose them together manually.

The working group added code to SPARQLProg, which provides a way to define modular query components using logic programming and maps logic queries to SPARQL. SPARQLProg is written in SWI-Prolog and has a Python interface library. All code has been made available in the example directory of SPARQLProg.

For example, a four-argument predicate feature_in_range can be composed with a binary has_mouse_ortholog predicate:

    feature_in_range(grch38:"X", 10000000, 20000000, HumanGene),
    has_mouse_ortholog(HumanGene, MouseGene)

This compiles down to a more complex SPARQL query, which is then executed against a remote endpoint.
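To make the notion of composable query components concrete, the following Python sketch builds a SPARQL query string from two reusable pieces. It is an illustration only and not SPARQLProg's compiler or its Python interface: the function bodies, the `ex:` namespace, the `ex:orthologousTo` property, and the `ex:chrX` identifier are hypothetical, while the `faldo:` properties are taken from the FALDO vocabulary.

```python
PREFIXES = {
    "faldo": "http://biohackathon.org/resource/faldo#",
    "ex": "http://example.org/",  # hypothetical namespace used only in this sketch
}


def feature_in_range(chrom, start, end, feature):
    """Graph patterns selecting features that lie within [start, end] on `chrom`."""
    loc, begin, stop = f"{feature}_loc", f"{feature}_begin", f"{feature}_end"
    return [
        f"{feature} faldo:location {loc} .",
        f"{loc} faldo:begin/faldo:position {begin} .",
        f"{loc} faldo:end/faldo:position {stop} .",
        f"{loc} faldo:begin/faldo:reference {chrom} .",
        f"FILTER({begin} >= {start} && {stop} <= {end})",
    ]


def has_mouse_ortholog(human_gene, mouse_gene):
    """Graph pattern linking a human gene to a mouse gene (hypothetical property)."""
    return [f"{human_gene} ex:orthologousTo {mouse_gene} ."]


def to_sparql(variables, patterns):
    """Render a SELECT query from a list of variables and graph patterns."""
    prefix_lines = [f"PREFIX {name}: <{iri}>" for name, iri in PREFIXES.items()]
    select = " ".join(variables)
    body = "\n".join(f"  {p}" for p in patterns)
    return "\n".join(prefix_lines) + f"\nSELECT {select} WHERE {{\n{body}\n}}"


# Composition is list concatenation: each component can be reused in other queries.
patterns = (
    feature_in_range("ex:chrX", 10000000, 20000000, "?human_gene")
    + has_mouse_ortholog("?human_gene", "?mouse_gene")
)
query = to_sparql(["?human_gene", "?mouse_gene"], patterns)
print(query)
```

Because both components share the variable `?human_gene`, concatenating their patterns yields a single query in which the range constraint and the ortholog lookup are joined, which is the step that otherwise has to be done by hand in plain SPARQL.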

SPARQLProg now includes bindings for many common biological SPARQL endpoints. As part of this hackathon we developed code to access the RDF databases of MBGD [@Chiba2015], KEGG OC, TogoVar, JCM, Allie, EBI BioSamples, UniProt, and DisGeNET [@Queralt2016]. Future work includes using this Prolog code as building blocks for integrative analysis.

Extending the Biolink Model

The Biolink Model is a data model for representing biological and biomedical knowledge. It includes a schema and generated objects for the data model and upper ontology. The Biolink Model was designed to standardize the way information is represented in a graph store, regardless of the formalism used. The working group focused on extending this model to support the representation of a wider variety of knowledge.

The following tasks were accomplished as part of the BioHackathon:

\begin{enumerate}
\item Represent datasets and their related metadata
\item Represent family and pedigree information to support clinical knowledge
\item Make the provenance model more rich and descriptive
\end{enumerate}

(note the list is written in embedded LaTeX)

For future work, the group will ensure that the new classes added to the model will have appropriate mappings to other schemas and ontologies.

Relational Biolink type inference for mediKanren

miniKanren is an embedded domain-specific language for logic programming. The goal was to implement a relational type inferencer for the Biolink Model in miniKanren that can be integrated into mediKanren. The working group added a yaml subdirectory to the mediKanren GitHub repository and created several files in https://github.com/webyrd/mediKanren/yaml. yaml2sexp.py generates biolink.scm, which contains an s-expression version of the Biolink yaml file. yaml.scm contains miniKanren relations, together with Chez Scheme code that generates miniKanren relations from biolink.scm. The generated relations are giant miniKanren conde clauses that can be thought of as relational tables. yaml.scm also contains tests for the relations.
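The conversion from a YAML document to an s-expression can be sketched in Python as follows. This is not the code of yaml2sexp.py, which may work differently; the function name `to_sexp`, the association-list output layout, and the small `schema` dictionary are assumptions for illustration. To stay self-contained, the sketch starts from an already parsed dictionary rather than reading a YAML file, which would need a parser such as PyYAML.

```python
def to_sexp(value):
    """Render nested dicts, lists, and scalars as an s-expression string."""
    if isinstance(value, dict):
        # A mapping becomes an association list of dotted pairs.
        items = " ".join(f"({to_sexp(k)} . {to_sexp(v)})" for k, v in value.items())
        return f"({items})"
    if isinstance(value, (list, tuple)):
        return "(" + " ".join(to_sexp(v) for v in value) + ")"
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return "#t" if value else "#f"
    if isinstance(value, (int, float)):
        return str(value)
    # Everything else is written as a string with backslashes and quotes escaped.
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


# Illustrative fragment in the spirit of a schema file; not copied from the Biolink yaml.
schema = {
    "classes": {
        "gene": {"is_a": "named thing", "abstract": False},
    },
    "slots": ["related to", "interacts with"],
}
print(to_sexp(schema))
```

An association-list layout like this one is convenient on the Scheme side because each class and slot can be looked up and turned into one clause of a generated relation.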

Future work includes:

integrating this work into the Racket mediKanren code;
integrating with the data categories in the knowledge graphs (KGs);
and creating a query editor with decent type error messages, autocompletion, query synthesis, etc.

Discussion

The working group concluded that there is ample scope for logic programming in bioinformatics. Future work includes expanding access to semantic web databases through SPARQLProg, extending the Biolink Model, and adding dynamic SPARQL support to miniKanren.

Acknowledgements

We thank the organizers of the NBDC/DBCLS BioHackathon 2019 for travel support for some of the authors.

Supplemental information

We use pandoc flavoured markdown, similar to RStudio; see \url{https://garrettgman.github.io/rmarkdown/authoring_pandoc_markdown.html}.

Tables and figures

Tables can be added in the following way, though alternatives are possible:

| Header 1 | Header 2 |
|----------|----------|
| item 1   | item 2   |
| item 3   | item 4   |

Table: Note that table caption is automatically numbered.

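| Term | MB tools/ontologies using this term | Frequency on Biology Stack Exchange | Search Term |
|------|-------------------------------------|-------------------------------------|-------------|
| Part | iGEM | 9065 | part + parts |
| Component | SBOL, SBOLDesigner, SBOLCanvas | 2163 | component |
| Module | SBOL | 311 | module |
| Device | | 677 | device |
| System | | 16098 | system |
| RBS | | 548 | rbs |
| Ribosome Entry Site | SO | 8 | ribosome entry site |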

LaTeX table:

\begin{tabular}{|l|l|}\hline
Age & Frequency \\ \hline
18--25 & 15 \\
26--35 & 33 \\
36--45 & 22 \\ \hline
\end{tabular}

Mermaid graphs

This is an example of embedding a graph

graph TD;
    A-->B;
    A-->C;
    B-->D;
    C-->D;

Unfortunately it does not work without the mermaid plugin and that requires headless chrome(?!). If you run the command line version of gen-pdf it may be possible to get it to work with the right packages. Please tell us if you succeed.
