Pathway Figure OCR

The goal of this project is to extract identifiable genes, proteins and metabolites from published pathway figures. In addition to all the code for assembling and running the Pathway Figure OCR pipeline, this repo contains scripts for the QC, analysis and figure generation behind our publications on this work. Here we document a few of the key files and folders relevant to each paper:

25 Years of Pathway Figures (Genome Biology 2020)
- Interactive search tool for 65k pathway figures and their gene content: shiny app and code
- NIH Figshare of identified pathway figures and OCR results as RDS datasets: collection
- UpSet plot of top text and figure genes: script
- Pie chart data for top disease terms for text and figure genes: script
- Overlap matrix for Hippo Signaling pathway figure genes: script
- Machine learning progression plots: script
- Local database name: pfocr20200224

Identifying Genes in Published Pathway Figure Images (BioRxiv 2018)
- Performance assessment figures: folder
- Local database name: pfocr2018121717

This work is supported by NIGMS grant R01GM100039.

Install

Warning: this project is still in development and is not ready for production use, or even for development use by external teams, so don't expect things to work without some troubleshooting :) Contact us via Issues if you're interested in contributing to the development. All our code is open source.

1. Install Nix
2. Clone this repo: git clone https://github.com/wikipathways/pathway-figure-ocr.git
3. Enter environment: cd pathway-figure-ocr && nix-shell
4. Launch Jupyter: jupyter lab (automatically opens notebook in browser)

Pipeline

The Jupyter Notebooks used to run the PFOCR pipeline are all in ./notebooks. Run them in the following order:

1. pfocr_fetch.R.ipynb: get a list of likely pathway figures.
2. get_figures.ipynb: download those figures.
3. gcv_automl.ipynb: use a machine learning model we trained earlier to distinguish pathway vs. non-pathway figures.
4. gcv_ocr.ipynb: run OCR on the figures classified as pathway.
5. get_lexicon.ipynb: get the lexicon. Note that we re-used the 20200224 lexicon for 20210515, so this notebook is unfinished.
6. pp_classic.ipynb: extract genes (pp_ahocorasick.ipynb is an alternative that should work even better once validated).
7. pubtator.ipynb: extract chemicals and diseases via PubTator.
8. merge_2020_2021.ipynb: merge the 20200224 and 20210515 results. This notebook is specific to that merge and would need to be updated for any other. It is also where we get the metadata for the papers.
9. bte_export.ipynb: export chemicals, diseases and genes for use in BioThings Explorer.
10. bte_export_csv_files.ipynb: export figure data as CSV files for use in BioThings Explorer.
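The run order above can also be written down as data. The following is a minimal sketch, not part of the repo: it only builds and prints one `jupyter nbconvert --execute` command per notebook, in the order listed. The helper name `nbconvert_commands` and the `skip` parameter are hypothetical, and it is an assumption that these notebooks can run non-interactively at all (some may need credentials, an R kernel, or manual steps), so treat the printed commands as a checklist rather than a tested batch runner.

```python
import shlex

# Notebook order copied from the list above; all are assumed to live in ./notebooks.
PIPELINE = [
    "pfocr_fetch.R.ipynb",
    "get_figures.ipynb",
    "gcv_automl.ipynb",
    "gcv_ocr.ipynb",
    "get_lexicon.ipynb",
    "pp_classic.ipynb",
    "pubtator.ipynb",
    "merge_2020_2021.ipynb",
    "bte_export.ipynb",
    "bte_export_csv_files.ipynb",
]


def nbconvert_commands(notebooks, notebook_dir="notebooks", skip=()):
    """Return one nbconvert command (as an argument list) per notebook, in order.

    Hypothetical helper: it does not run anything, it only builds the commands.
    """
    commands = []
    for name in notebooks:
        if name in skip:
            continue  # e.g. a notebook that is unfinished or not needed for this run
        path = f"{notebook_dir}/{name}"
        commands.append(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", path]
        )
    return commands


if __name__ == "__main__":
    # Print the commands for review; get_lexicon.ipynb is skipped here because
    # the list above describes it as unfinished.
    for cmd in nbconvert_commands(PIPELINE, skip={"get_lexicon.ipynb"}):
        print(shlex.join(cmd))
```

Keeping the order in a single list makes it easy to swap in pp_ahocorasick.ipynb for pp_classic.ipynb, or to drop the merge step, without editing the rest of the script.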

Note that we used a database for 20200224 but not for 20210515. Any future runs or merges will probably not need to use the old database.

Internal Notes

xpm2nix

In ./xpm2nix, you'll find packages from external package manager(s) made available as Nix packages. xpm is just an abbreviation we made up to refer to any eXternal Package Manager.

For Python, we're using poetry2nix.

cd xpm2nix/python-modules

To add a package:

poetry add --lock jupytext

To update packages:

poetry update --lock

For JavaScript / Node.js, we're using node2nix.

cd xpm2nix/node-packages

To add a package:

npm install --package-lock-only --save @arbennett/base16-gruvbox-dark

To update packages:

./update
