The scientific literature is vast, and valuable information connecting findings from disparate works is easily missed. Teams of collaborators address this problem up to a point but could still benefit from systematic “big data” approaches that mine the entire literature to generate testable hypotheses on a large scale.
MeTeOR, or the MeSH Term Objective Reasoning network, mines the PubMed literature, revealing knowledge previously hidden in a sea of information. Given one biological entity (a gene, drug, or disease), it can give a ranked list of associations with other biological entities, and it can highlight papers pertaining to any two biological entities.
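As a rough illustration of the "ranked list of associations" idea, the sketch below ranks the neighbors of one entity in a small weighted network. The edge list, its weights, and the function names `build_adjacency` and `ranked_associations` are hypothetical and chosen only for this example; they are not MeTeOR's actual data format or API.

```python
from collections import defaultdict

# Hypothetical weighted edges: (entity_a, entity_b, association weight).
# The values are made up for illustration and are not MeTeOR results.
EDGES = [
    ("EGFR", "Lung Neoplasms", 120.0),
    ("EGFR", "Gefitinib", 95.0),
    ("EGFR", "Erlotinib", 80.0),
    ("Gefitinib", "Lung Neoplasms", 60.0),
]


def build_adjacency(edges):
    """Build an undirected adjacency map {entity: {neighbor: weight}}."""
    adjacency = defaultdict(dict)
    for a, b, weight in edges:
        adjacency[a][b] = weight
        adjacency[b][a] = weight
    return adjacency


def ranked_associations(adjacency, entity, top_n=10):
    """Return up to top_n (neighbor, weight) pairs, strongest first."""
    neighbors = adjacency.get(entity, {})
    ranked = sorted(neighbors.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


adjacency = build_adjacency(EDGES)
top_hits = ranked_associations(adjacency, "EGFR", top_n=2)
print(top_hits)
```

An entity that is absent from the network simply yields an empty list, which keeps the lookup safe for arbitrary queries.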
The MeTeOR network was assembled with Python 3; network assessment and prediction were performed in MATLAB.
A website serving the resulting network can be found here.
There is a shell script file that can be run to assemble MeTeOR and to assess the resulting network. This may be relevant if you wish to have the latest PubMed articles or if you wish to modify some aspect of the creation process. For example, you could create a custom weighting process or create a subnetwork based only on a certain part of the literature.
Alternatively, you can download the results and use those for your project. This can be done as described below:
- python3
- MATLAB (for prediction and network assessment)
- Graphviz (for the network visualization)
- python3-dev(Ex.)
- python3-tk
- npm
- dat
pyupset is required, but a modified version is already provided in this repository.
- PubMed Data: This project runs on the NLM bulk downloads. The raw XML can take upwards of 280 GB of space. The pipeline can also be run off specific queries, as was done for the publication version of the code; however, this method takes a very long time to obtain data from PubMed, so please allow 2-3 days depending on download speeds.
- All code was run on an Intel® Core™ i7-4820K CPU @ 3.70GHz × 8 with 64 GB RAM. From start to finish, everything should complete within a day for bulk downloads or within a week otherwise.
- The Non-negative Matrix Factorization (NMF) conducted in the analysis part of the pipeline and run in MATLAB can be very time- and memory-intensive. If you choose to, you can download pre-computed results to greatly speed up the analysis.
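To give a sense of why NMF is costly, here is a minimal sketch of NMF with the standard multiplicative update rules (Lee and Seung), written in dependency-free Python. It is not the MATLAB code used in the pipeline; the matrix `V`, the chosen rank, and all function names are assumptions made for illustration. Every iteration performs several matrix products, so the cost grows quickly with the size of the network.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate non-negative V as W @ H using multiplicative updates."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer = matmul(wt, v)
        denom = matmul(matmul(wt, w), h)
        h = [[h[i][j] * numer[i][j] / (denom[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer = matmul(v, ht)
        denom = matmul(w, matmul(h, ht))
        w = [[w[i][j] * numer[i][j] / (denom[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


# Hypothetical 4x4 non-negative matrix, for illustration only.
V = [[0, 3, 1, 0], [3, 0, 2, 0], [1, 2, 0, 4], [0, 0, 4, 0]]
W, H = nmf(V, rank=2)
print(round(frobenius_error(V, W, H), 3))
```

For real networks a numerical library (or the MATLAB pipeline itself) would be used instead of nested lists; the pure-Python form is only meant to make the update steps visible.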
chmod +x ./run.sh
./run.sh
python main.py
Because this downloads a lot of data, you can specify the storage directory for the PubMed XML.
python main.py --storagedir /path/to/dir
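Since the raw XML can take upwards of 280 GB, it may be worth checking free space in the storage directory before starting. The sketch below shows one way to do that; it is not taken from `main.py`, and the helper names `parse_args` and `has_enough_space` as well as the default of `.` are assumptions for illustration. Only the `--storagedir` flag itself comes from the command above.

```python
import argparse
import shutil

# Approximate size of the raw PubMed XML stated in the requirements (280 GB).
RAW_XML_BYTES = 280 * 10**9


def parse_args(argv=None):
    """Parse a --storagedir option similar to the one main.py accepts."""
    parser = argparse.ArgumentParser(description="Pre-flight check for MeTeOR downloads")
    parser.add_argument(
        "--storagedir",
        default=".",  # assumed default, for illustration only
        help="directory where the PubMed XML will be stored",
    )
    return parser.parse_args(argv)


def has_enough_space(path, required_bytes=RAW_XML_BYTES):
    """Return True if the filesystem holding `path` has enough free bytes."""
    return shutil.disk_usage(path).free >= required_bytes


# Example invocation with an explicit argument list.
args = parse_args(["--storagedir", "."])
if has_enough_space(args.storagedir):
    print("Enough free space for the bulk download.")
else:
    print("Less than 280 GB free; consider another --storagedir.")
```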
cd matlabpipeline
matlab -r pipeline
cd ../EGFR
python runEGFR.py
The MeTeOR network can be downloaded via the results bulk download or as flat files. All downloads are available here.
For smooth integration into the git project, I also used dat. The run script automatically downloads the dat data for ease of use. See below for use of dat.
To download data using dat, ensure node (version >= 4) is installed:
node -v
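If you want to script this check, a small helper along the following lines could parse the output of `node -v` and compare it with the minimum version. The function names are hypothetical, and the sketch assumes the usual `vMAJOR.MINOR.PATCH` output format.

```python
import shutil
import subprocess


def parse_node_version(output):
    """Parse output such as 'v10.16.3' into a tuple of ints, e.g. (10, 16, 3)."""
    core = output.strip().lstrip("v").split("-")[0]  # drop any pre-release suffix
    return tuple(int(part) for part in core.split("."))


def node_is_recent_enough(minimum_major=4):
    """Return True if node is on PATH and its major version is >= minimum_major."""
    if shutil.which("node") is None:
        return False
    result = subprocess.run(
        ["node", "-v"], capture_output=True, text=True, check=True
    )
    return parse_node_version(result.stdout)[0] >= minimum_major


print(parse_node_version("v10.16.3"))
```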
If node needs to be installed, go to the Node.js website, or run:
curl -sL https://deb.nodesource.com/setup_10.x | sudo -E bash -
sudo apt-get install -y nodejs
To install dat:
sudo npm install -g dat
Navigate to the directory you wish to download, either data or results, and use:
dat clone ./