Skip to content

Latest commit

 

History

History
315 lines (255 loc) · 12.1 KB

paper.md

File metadata and controls

315 lines (255 loc) · 12.1 KB
title title_short tags authors affiliations date cito-bibliography event biohackathon_name biohackathon_url biohackathon_location group git_url authors_short
BioHackrXiv template this is an example of a (too) long title mpla mpla mpla mpla mpla mpla mpla mpla mpla mpla mpla c wjfc wjknwjek nwjkwen jk
Logic Programming for the Biomedical Sciences
logic programming
name orcid affiliation
Chris Mungall
0000-0002-6601-2165
1
name affiliation orcid
Hirokazu Chiba
2
0000-0003-4062-8903
name affiliation orcid
Shuichi Kawashima
2
0000-0001-7883-3756
name affiliation orcid
Yasunori Yamamoto
2
0000-0002-6943-6887
name orcid affiliation
Pjotr Prins
0000-0002-8021-9162
3
name affiliation orcid
Nada Amin
4
0000-0002-0830-7248
name affiliation orcid
Deepak Unni
5
0000-0002-3583-7340
name affiliation orcid
<nobr>William&nbsp;E.&nbsp;Byrd</nobr>
6
0000-0003-4730-5293
name index
Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
1
name index
Database Center for Life Science, Research Organization of Information and Systems, Japan
2
name index
Department of Genetics, Genomics and Informatics, The University of Tennessee Health Science Center, Memphis, TN, USA.
3
name index ror
Harvard University, USA
4
03vek6s52
name index
Berkeley Lab, USA
5
name index
University of Alabama at Birmingham, USA
6
3 March 2020
paper.bib
Fukuoka2019
NBDC/DBCLS BioHackathon
Fukuoka, Japan, 2019
Logic programming group
Chris Mungall & Hirokazu Chiba \emph{et al.}

Introduction

As part of the one week Biohackathion 2019 in Fukuoka Japan, we formed a working group on logic programming for the biomedical sciences. Logic programming is understood by many bioinformaticians when it is presented in the form of relational SQL queries or SPARQL queries. More advanced logic programming, however, is underutilized in bioinformatics. Prolog, for example, is a high-level programming language that has its roots in first-order logic or first-order predicate calculus. Another example, miniKanren, is an embedded Domain Specific Language for logic programming. Core miniKanren is exceptionally simple, with only three logical operators and one interface operator [@uses_method_in:reasoned2nd].

Logic programming resolver traverses the solution space to find all matches \label{fig}

An SVG example

The introduction of logic programming is particularly relevant in the context of multi-model data representations where data can be accessed in memory as free data structures, but also on disk where data can be represented as tables, trees (documents), and graphs. In bioinformatics we can make use of all these different data sources and have a query engine that can mine them all efficiently.

Logic programming is well-suited for biological research. Essentially, a researcher writes a number of statements that include variables representing unknown information. The logic engine then goes through the solution space (all data) to find possible matches (see figure \ref{fig}). Much more detail on the rationale and implementations of miniKanren and logic programming are well summarized in Byrd's book \emph{The Reasoned Schemer, Second Edition} [@agreesWith:reasoned2nd], PhD thesis [@ByrdPhD], and online talks.

The `Logic Programming' working group at the 2019 edition of the annual Japanese BioHackathon applied logic programming to various problems. The working group: \begin{itemize} \item researched state-of-the-art mapping between graph stores and logic programming; \item created methods for bridging between SPARQL and in-memory data representations using Prolog; \item extended the Biolink model; \item and added Relational Biolink type inference for mediKanren. \end{itemize}

Research of state-of-the-art logic programming facilities for SPARQL

The working group researched current solutions for combining logic programming with SPARQL. ClioPatria is an in-memory RDF quad-store tightly coupled with SWI-Prolog by Jan Wielemaker, the main author of SWI-Prolog [@WielemakerBHO15]. SWI-Prolog is published under a BSD license, and there even exist bindings for ClioPatria and Python, for example, although we were unable to locate the source code. We think ClioPatria and SWI-Prolog are particularly useful for teaching, and for (in-memory) semantic web applications. SWI-Prolog comes with client libraries for SQL and SPARQL queries.

Accessing biological databases using SPARQLProg

A number of biological databases make their data available in RDF format, supporting SPARQL access---for example, Uniprot, NCBI Pubchem and the EBI RDF platform. SPARQL provides a subset of what logic programming can do. However, SPARQL queries lack the property of composability and there is no way to reuse modular components across queries. For example, to execute a range query on a genomic region using the FALDO model [@agreesWith:Bolleman2016] requires authoring a complex query over many triples. If we then wish to reuse parts of that query in a more complex query, we have to manually compose them together.

The working group added codes to SPARQLProg which provides a way to define modular query components using logic programming. SPARQLProg is written in SWI-Prolog and has a Python interface library. All code has been made available in the example directory of SPARQLProg which provides sophisticated mapping of logic queries to SPARQL.

For example, a 4-part predicate feature_in_range can be composed with a binary
has_mouse_ortholog predicate:

    feature_in_range(grch38:"X", 10000000, 20000000, HumanGene),
    has_mouse_ortholog(HumanGene, MouseGene)

This will compile down to a more complex SPARQL query, and execute it against a remote endpoint.

SPARQLProg now includes bindings for many common biological SPARQL endpoints. As part of this hackathon we developed codes to access RDF databases of MBGD [@Chiba2015], KEGG OC, TogoVar, JCM, Allie, EBI BioSamples, UniProt, and DisGeNET [@Queralt2016]. Future work includes using these Prolog codes as building blocks for integrative analysis.

Extending the Biolink Model

The Biolink Model (see above) is a data model developed for representing biological and biomedical knowledge. It includes a schema and generated objects for the data model and upper ontology. The BioLink Model was designed with the goal of standardizing the way information is represented in a graph store, regardless of the formalism used. The working group focused on extending this model to support representation of a wide variety of knowledge.

The following tasks were accomplished as part of the BioHackathon:

\begin{enumerate} \item Represent datasets and their related metadata \item Represent family and pedigree information to support clinical knowledge \item Make the provenance model more rich and descriptive \end{enumerate}

(note the list is written in embedded LaTeX)

For future work, the group will ensure that the new classes added to the model will have appropriate mappings to other schemas and ontologies.

Relational Biolink type inference for mediKanren

miniKanren is an embedded Domain Specific Language for logic programming. The goal was to implement a relational type inferencer for the Biolink Model in miniKanren, which can be integrated into mediKanren. The working group added a yaml subdirectory to the mediKanren GitHub page, and created multiple files in https://github.com/webyrd/mediKanren/yaml where yaml2sexp.py generates the biolink.scm file which contains an s-expression version of the Biolink yaml file. yaml.scm contains miniKanren relations, and Chez Scheme code that generates miniKanren relations based on biolink.scm. These are giant miniKanren conde clauses that can be thought of as relational tables. yaml.scm also contains tests for the relations.

Future work includes:

  1. integrating this work into the Racket mediKanren code;
  2. integrating with the data categories in the KGs;
  3. and creating query editor with decent type error messages, autocompletion, query synthesis, etc.

Discussion

The working group concluded that there is ample scope for logic programming in bioinformatics. Future work includes expansion of accessing semantic web databases using SPARQLProg, expanding the BioLink model, and adding dynamic SPARQL support to miniKanren.

Acknowledgements

We thank the organizers of the NBDC/DBCLS BioHackathon 2019 for travel support for some of the authors.

Supplemental information

We use pandoc flavoured markdown, similar to Rstudio see \url{https://garrettgman.github.io/rmarkdown/authoring_pandoc_markdown.html}.

Tables and figures

Tables can be added in the following way, though alternatives are possible:

Header 1 Header 2
item 1 item 2
item 3 item 4

Table: Note that table caption is automatically numbered.

Term MB tools/ontologies using this term Frequency on Biology Stack Exchange Search Term
Part iGEM 9065 part + parts
Component SBOL, SBOLDesigner, SBOLCanvas 2163 component
Module SBOL 311 module
Device 677 device
System 16098 system
RBS 548 rbs
Ribosome Entry Site SO 8 ribosome entry site

LaTeX table:

\begin{tabular}{|l|l|}\hline Age & Frequency \ \hline 18--25 & 15 \ 26--35 & 33 \ 36--45 & 22 \ \hline \end{tabular}

Mermaid graphs

This is an example of embedding a graph

graph TD;
    A-->B;
    A-->C;
    B-->D;
    C-->D;
Loading

Unfortunately it does not work without the mermaid plugin and that requires headless chrome(?!). If you run the command line version of gen-pdf it may be possible to get it to work with the right packages. Please tell us if you succeed.

References