MISHMASH

MIShMASh (MIcrobiome Sequence and Metadata Availability Standards) is a Python 3 package to download and evaluate sequence data and metadata associated with publications. It enables programmatic access to literature on PubMed Central and to data on the Sequence Read Archive (and other INSDC databases). Users can easily assess the openness of publication data and obtain said data for themselves.

Installation

Use the package manager pip to install MIShMASh and its dependencies.

pip install git+https://github.com/bokulich-lab/mishmash.git

Usage

MIShMASh provides two main commands to evaluate sequence and metadata reporting in publications.

Evaluate sequencing data reporting

To classify the quality of sequencing data reporting, run assess_sequences:

mishmash assess_sequences \
  --pmc_list  \
  --output_file

where:

  • --pmc_list is a space-separated list of PubMed Central IDs to search for associated INSDC accession IDs, e.g. the PMC ID PMC6240460 for the 2017 article by Naymagon et al.
  • --output_file specifies the output file to write the analysis to.

An alternative to --pmc_list is the flag --pmc_input_file, which takes the full path to a text file containing the PMC IDs, one ID per line.
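A minimal sketch of preparing such an input file from Python. It assumes that IDs have the shape `PMC` followed by digits, as in the example PMC6240460; the helper name `write_pmc_input_file` is hypothetical and not part of MIShMASh.

```python
import re
from pathlib import Path

# Assumed ID shape: "PMC" followed by digits, e.g. PMC6240460.
PMC_ID_PATTERN = re.compile(r"PMC\d+")


def write_pmc_input_file(pmc_ids, path):
    """Write one PMC ID per line and return the absolute path as a string."""
    # Drop surrounding whitespace and skip empty entries.
    cleaned = [pmc_id.strip() for pmc_id in pmc_ids if pmc_id.strip()]
    invalid = [pmc_id for pmc_id in cleaned if not PMC_ID_PATTERN.fullmatch(pmc_id)]
    if invalid:
        raise ValueError(f"Unexpected PMC ID format: {invalid}")
    target = Path(path).resolve()
    target.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return str(target)
```

The returned absolute path can then be passed to --pmc_input_file in place of --pmc_list.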

Evaluate metadata reporting

To retrieve metadata associated with a sequence record from an INSDC (e.g. SRA, DDBJ, ENA) database, run assess_metadata:

mishmash assess_metadata \
  --email  \
  --accession_list  \
  --output_file

where:

  • --email is your email address (required by NCBI).
  • --accession_list is a space-separated list of accession IDs to retrieve metadata for. These can be BioProject, BioSample, BioExperiment, or other accession IDs used within INSDC interfaces, e.g. the BioProject ID PRJNA607574 for the collection of samples uploaded by the Memorial Sloan Kettering Cancer Center.
  • --output_file specifies the output file to write the retrieved metadata to.

Optional parameters to assess_metadata include:

  • --n_jobs: an integer giving the number of threads to use for parallelization
  • --verbose: a flag to print intermediate process outputs to standard output; useful for debugging
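When calling assess_metadata from a script, the required and optional parameters above can be assembled into an argument list. The sketch below uses the flag names as they appear in the command shown above; the function name `build_assess_metadata_command` is hypothetical, and the sketch only builds the command without running it.

```python
import shlex


def build_assess_metadata_command(email, accessions, output_file,
                                  n_jobs=None, verbose=False):
    """Assemble the argument list for `mishmash assess_metadata`."""
    command = [
        "mishmash", "assess_metadata",
        "--email", email,
        "--accession_list", *accessions,  # space-separated on the command line
        "--output_file", output_file,
    ]
    if n_jobs is not None:
        command += ["--n_jobs", str(int(n_jobs))]
    if verbose:
        command.append("--verbose")
    return command


def as_shell_string(command):
    """Render the argument list as a shell-quoted string for logging or copy-paste."""
    return shlex.join(command)
```

Keeping the command as a list avoids shell-quoting problems if it is later handed to `subprocess.run`.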

Outputs

assess_sequences

This command generates a comma-separated values (CSV) file with the following fields:

  • PMC ID: Input PubMed Central ID for query
  • Sequence Accessibility Badge: Bronze, Silver, or Gold (or "Cannot be determined") as an evaluation of the accessibility of sequencing data from the paper
  • INSDC Accessions Numbers: Accession numbers corresponding to the sequencing data uploaded to INSDC databases
  • INSDC Database: Database associated with the uploaded sequencing data i.e. SRA, ENA, or DDBJ
  • Number of Sequence Records: Total number of sequencing records (INSDC Runs) associated with the input article
  • Primer Sequences: If an amplicon-based study, sequences of primers used to amplify variable regions for sequencing; output as a comma-separated string
  • Sequencing Method: Probability of sequencing method as either amplicon- or shotgun-based; output as a dictionary
  • Includes Code: True/False whether a code repository has been found for the paper
  • Code URL: Links to code repositories found in paper; output as a list of strings
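A sketch of loading this file for downstream analysis. It rests on two assumptions that should be checked against a real output file: that the CSV header names match the field names listed above, and that the dictionary and list fields are written as Python literals. The function name `read_sequence_report` is hypothetical.

```python
import ast
import csv


def _parse_literal(text, default):
    """Parse a Python-literal string; return `default` if it does not parse."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return default


def read_sequence_report(handle):
    """Read assess_sequences output from an open text handle into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "pmc_id": row["PMC ID"],
            "badge": row["Sequence Accessibility Badge"],
            # Assumed to be serialized as a Python dict literal.
            "sequencing_method": _parse_literal(row["Sequencing Method"], {}),
            "includes_code": row["Includes Code"].strip().lower() == "true",
            # Assumed to be serialized as a Python list literal.
            "code_urls": _parse_literal(row["Code URL"], []),
        })
    return rows
```

Open the output file with `newline=""`, as the `csv` module recommends, and pass the handle to `read_sequence_report`.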

Known Issues

"Invalid URL"

If the following error arises, your local Internet connection may be unstable. Simply relaunch your command.

The download URL https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC10913621 is likely invalid.

Traceback (most recent call last):
  File "/.venv/test_mm_3/bin/mishmash", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "mishmash/mishmash/cli.py", line 75, in main
    output_df = args.func(args)
                ^^^^^^^^^^^^^^^
  File "mishmash/mishmash/scrape_pdf.py", line 370, in analyze_pdf
    scrape_objects = [
                     ^
  File "mishmash/mishmash/scrape_pdf.py", line 371, in <lambda>
    x for x in filter(lambda el: not el.contains_blocking_comment(),
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mishmash/mishmash/scrape_pdf.py", line 96, in contains_blocking_comment
    return _contains_blocking_comment(self.get_xml())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "mishmash/mishmash/scrape_pdf.py", line 38, in _contains_blocking_comment
    for element in content(string=lambda text: isinstance(text, Comment)):
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'int' object is not callable
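For batch runs, relaunching can be automated. The sketch below re-runs a command until it exits with status 0 or a fixed number of attempts is used up; it assumes the failure surfaces as a non-zero exit status, which is what an uncaught Python traceback like the one above produces. The function name `run_with_retries` and its defaults are hypothetical.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, delay_seconds=5.0, runner=subprocess.run):
    """Run `command`, relaunching it on failure.

    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError if every attempt exits with a non-zero status.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give a flaky connection a moment to recover before relaunching.
            time.sleep(delay_seconds)
    raise RuntimeError(f"Command still failing after {attempts} attempts: {command}")
```

The `runner` parameter exists only so the retry logic can be exercised without invoking the real command; by default it is `subprocess.run`.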

Contributions

Pull requests

To set up a development environment, use Poetry.

pip install poetry
poetry install

Test the code by running

poetry run pytest

License

MIShMASh is released under a BSD-3-Clause license. See LICENSE for more details.