The goal of this project is to extract identifiable genes, proteins and metabolites from published pathway figures. In addition to all the code for assembling and running the Pathway Figure OCR pipeline, this repo contains scripts specific to the QC, analysis and figure generation for our publications on this work. Here we document a few of the key files and folders relevant to each paper:
- 25 Years of Pathway Figures (Genome Biology 2020)
  - Interactive search tool for 65k pathway figures and their gene content: shiny app and code
  - NIH Figshare of identified pathway figures and OCR results as RDS datasets: collection
  - UpSet plot of top text and figure genes: script
  - Pie chart data for top disease terms for text and figure genes: script
  - Overlap matrix for Hippo Signaling pathway figure genes: script
  - Machine learning progression plots: script
  - Local database name: pfocr20200224
- Identifying Genes in Published Pathway Figure Images (BioRxiv 2018)
  - Performance assessment figures: folder
  - Local database name: pfocr2018121717
This work is supported by NIGMS, R01GM100039
Warning: this project is still in development and is not ready for production or even dev releases by external teams. So, don't expect things to work without some troubleshooting :) Contact us via Issues if you're interested in contributing to the development. All our code is open source.
- Install Nix
- Clone this repo:
    git clone https://github.com/wikipathways/pathway-figure-ocr.git
- Enter environment:
    cd pathway-figure-ocr && nix-shell
- Launch Jupyter (automatically opens notebook in browser):
    jupyter lab
The Jupyter Notebooks used to run the PFOCR pipeline are all in ./notebooks. Run them in the following order:
- pfocr_fetch.R.ipynb: get a list of likely pathway figures
- get_figures.ipynb: download those figures
- gcv_automl.ipynb: use a machine learning model we trained earlier to distinguish pathway vs. non-pathway figures
- gcv_ocr.ipynb: run OCR on the figures classified as pathway
- get_lexicon.ipynb: note that we actually just re-used the 20200224 lexicon for 20210515, so we didn't really finish this file.
- pp_classic.ipynb: extract genes (pp_ahocorasick.ipynb is an alternative that should work even better once validated.)
- pubtator.ipynb: extract chemicals and diseases via PubTator.
- merge_2020_2021.ipynb: this was just for the merge of 20200224 and 20210515. Obviously, it would require being updated for any other merge. Note this notebook is also where we get the metadata for the papers.
- bte_export.ipynb: export chemicals, diseases and genes for use in BioThings Explorer.
- bte_export_csv_files.ipynb: export figure data as CSV files for use in BioThings Explorer.
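The run order above can be written down as a short Python sketch that builds one jupyter nbconvert command per notebook. This is only an illustration and not part of the repo: the names PIPELINE_NOTEBOOKS, nbconvert_command and pipeline_commands are hypothetical, and it assumes the notebooks can run unattended from inside nix-shell, which may not hold for steps that need credentials or manual edits. It also assumes an R kernel is available for pfocr_fetch.R.ipynb. The functions only build the commands; nothing is executed.

```python
from pathlib import Path

# Notebook order as listed in this README. The names below come from the
# list above; everything else in this sketch is a hypothetical helper.
PIPELINE_NOTEBOOKS = [
    "pfocr_fetch.R.ipynb",         # list likely pathway figures
    "get_figures.ipynb",           # download the figures
    "gcv_automl.ipynb",            # classify pathway vs. non-pathway
    "gcv_ocr.ipynb",               # OCR on figures classified as pathway
    "get_lexicon.ipynb",           # unfinished; 20200224 lexicon was re-used
    "pp_classic.ipynb",            # extract genes
    "pubtator.ipynb",              # chemicals and diseases via PubTator
    "merge_2020_2021.ipynb",       # specific to the 20200224/20210515 merge
    "bte_export.ipynb",            # export for BioThings Explorer
    "bte_export_csv_files.ipynb",  # export figure data as CSV
]


def nbconvert_command(notebook, notebooks_dir="notebooks"):
    """Build (but do not run) a command that executes one notebook in place."""
    path = Path(notebooks_dir) / notebook
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", "--inplace",
        str(path),
    ]


def pipeline_commands(notebooks_dir="notebooks", skip=()):
    """Return the commands in pipeline order, leaving out any names in `skip`."""
    return [
        nbconvert_command(nb, notebooks_dir)
        for nb in PIPELINE_NOTEBOOKS
        if nb not in skip
    ]


if __name__ == "__main__":
    # Print the commands for review rather than running them.
    for cmd in pipeline_commands(skip={"get_lexicon.ipynb"}):
        print(" ".join(cmd))
```

To actually run a step, each command could be passed to subprocess.run(cmd, check=True) so that a failing notebook stops the sequence. Printing the commands first makes it easy to drop steps, such as get_lexicon.ipynb or the merge notebook, that do not apply to a given run.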
Note that we used a database for 20200224 but not for 20210515. Any future runs or merges will probably not need to use the old database.
In ./xpm2nix, you'll find packages from external package manager(s) made available as Nix packages. xpm is just an abbreviation we made up to refer to any eXternal Package Manager.
For Python modules:
    cd xpm2nix/python-modules
To add a package:
    poetry add --lock jupytext
To update packages:
    poetry update --lock
For Node packages:
    cd xpm2nix/node-packages
To add a package:
    npm install --package-lock-only --save @arbennett/base16-gruvbox-dark
To update packages:
    ./update