Merge pull request #51 from sbslee/0.30.0-dev
0.30.0 dev
sbslee authored Feb 5, 2022
2 parents f4eb5f6 + 67d8082 commit 5972cab
Showing 15 changed files with 871 additions and 43 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.rst
@@ -1,6 +1,18 @@
Changelog
*********

0.30.0 (2022-02-05)
-------------------

* Update :command:`fuc-find` command to let users control whether files are retrieved recursively.
* Add new command :command:`ngs-trim`.
* Add new command :command:`ngs-quant`.
* Add new submodule ``pykallisto``.
* Update :meth:`pycov.CovFrame.from_bam` method to use filename as sample name when the SM tag is missing.
* Add new method :meth:`pyvcf.row_phased`. From now on, it's used to get the ``pyvcf.VcfFrame.phased`` property.
* Add new method :meth:`pyvcf.split` and :command:`vcf-split` command for splitting VCF by individual.
* Update :meth:`pyvcf.merge` method, :meth:`pyvcf.VcfFrame.merge` method, and :command:`vcf-merge` command to automatically handle the 'chr' string.

0.29.0 (2021-12-19)
-------------------

12 changes: 11 additions & 1 deletion README.rst
@@ -49,6 +49,7 @@ Additionally, fuc can be used to parse output data from the following programs:
- Ensembl Variant Effect Predictor (VEP)
- SnpEff
- bcl2fastq and bcl2fastq2
- Kallisto

Your contributions (e.g. feature ideas, pull requests) are most welcome.

@@ -128,7 +129,8 @@ For getting help on the fuc CLI:
fuc-compf Compare the contents of two files.
fuc-demux Parse the Reports directory from bcl2fastq.
fuc-exist Check whether certain files exist.
fuc-find Find all filenames matching a specified pattern recursively.
fuc-find Retrieve absolute paths of files whose name matches a
specified pattern, optionally recursively.
fuc-undetm Compute top unknown barcodes using undetermined FASTQ from bcl2fastq.
maf-maf2vcf Convert a MAF file to a VCF file.
maf-oncoplt Create an oncoplot with a MAF file.
@@ -139,6 +141,9 @@ For getting help on the fuc CLI:
ngs-hc Pipeline for germline short variant discovery.
ngs-m2 Pipeline for somatic short variant discovery.
ngs-pon Pipeline for constructing a panel of normals (PoN).
ngs-quant Pipeline for running RNAseq quantification from FASTQ files
with Kallisto.
ngs-trim Pipeline for trimming adapters from FASTQ files.
tabix-index Index a GFF/BED/SAM/VCF file with Tabix.
tabix-slice Slice a GFF/BED/SAM/VCF file with Tabix.
tbl-merge Merge two table files.
@@ -148,6 +153,7 @@ For getting help on the fuc CLI:
vcf-merge Merge two or more VCF files.
vcf-rename Rename the samples in a VCF file.
vcf-slice Slice a VCF file for specified regions.
vcf-split Split a VCF file by individual.
vcf-vcf2bed Convert a VCF file to a BED file.
vcf-vep Filter a VCF file by annotations from Ensembl VEP.
Expand All @@ -169,6 +175,7 @@ Below is the list of submodules available in the fuc API:
- **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alingment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
- **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
- **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
- **pykallisto** : The pykallisto submodule is designed for working with RNAseq quantification data from Kallisto. It implements ``pykallisto.KallistoFrame`` which stores Kallisto's output data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pykallisto.KallistoFrame`` class also contains many useful plotting methods such as ``KallistoFrame.plot_differential_abundance``.
- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
@@ -181,6 +188,9 @@ For getting help on a specific submodule (e.g. pyvcf):
>>> from fuc import pyvcf
>>> help(pyvcf)
In Jupyter Notebook and Lab, you can see the documentation for a Python
function by hitting ``SHIFT + TAB``. Hit it twice to expand the view.

CLI examples
============

7 changes: 7 additions & 0 deletions docs/api.rst
@@ -17,6 +17,7 @@ Below is the list of submodules available in the fuc API:
- **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
- **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
- **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
- **pykallisto** : The pykallisto submodule is designed for working with RNAseq quantification data from Kallisto. It implements ``pykallisto.KallistoFrame`` which stores Kallisto's output data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pykallisto.KallistoFrame`` class also contains many useful plotting methods such as ``KallistoFrame.plot_differential_abundance``.
- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
@@ -65,6 +66,12 @@ fuc.pygff
.. automodule:: fuc.api.pygff
:members:

fuc.pykallisto
==============

.. automodule:: fuc.api.pykallisto
:members:

fuc.pymaf
=========

145 changes: 130 additions & 15 deletions docs/cli.rst
@@ -35,7 +35,8 @@ For getting help on the fuc CLI:
fuc-compf Compare the contents of two files.
fuc-demux Parse the Reports directory from bcl2fastq.
fuc-exist Check whether certain files exist.
fuc-find Find all filenames matching a specified pattern recursively.
fuc-find Retrieve absolute paths of files whose name matches a
specified pattern, optionally recursively.
fuc-undetm Compute top unknown barcodes using undetermined FASTQ from bcl2fastq.
maf-maf2vcf Convert a MAF file to a VCF file.
maf-oncoplt Create an oncoplot with a MAF file.
@@ -46,6 +47,9 @@ For getting help on the fuc CLI:
ngs-hc Pipeline for germline short variant discovery.
ngs-m2 Pipeline for somatic short variant discovery.
ngs-pon Pipeline for constructing a panel of normals (PoN).
ngs-quant Pipeline for running RNAseq quantification from FASTQ files
with Kallisto.
ngs-trim Pipeline for trimming adapters from FASTQ files.
tabix-index Index a GFF/BED/SAM/VCF file with Tabix.
tabix-slice Slice a GFF/BED/SAM/VCF file with Tabix.
tbl-merge Merge two table files.
@@ -55,6 +59,7 @@ For getting help on the fuc CLI:
vcf-merge Merge two or more VCF files.
vcf-rename Rename the samples in a VCF file.
vcf-slice Slice a VCF file for specified regions.
vcf-split Split a VCF file by individual.
vcf-vcf2bed Convert a VCF file to a BED file.
vcf-vep Filter a VCF file by annotations from Ensembl VEP.
@@ -524,28 +529,28 @@ fuc-find
.. code-block:: text
$ fuc fuc-find -h
usage: fuc fuc-find [-h] [--dir PATH] pattern
usage: fuc fuc-find [-h] [-r] [-d PATH] pattern
Find all filenames matching a specified pattern recursively.
This command will recursively find all the filenames matching a specified
pattern and then return their absolute paths.
Retrieve absolute paths of files whose name matches a specified pattern,
optionally recursively.
Positional arguments:
pattern Filename pattern.
pattern Filename pattern.
Optional arguments:
-h, --help Show this help message and exit.
--dir PATH Directory to search in (default: current directory).
-h, --help Show this help message and exit.
-r, --recursive Turn on recursive retrieving.
-d PATH, --directory PATH
Directory to search in (default: current directory).
[Example] Find VCF files in the current directory:
[Example] Retrieve VCF files in the current directory only:
$ fuc fuc-find "*.vcf"
[Example] Find specific VCF files:
$ fuc fuc-find "*.vcf.*"
[Example] Retrieve VCF files recursively:
$ fuc fuc-find "*.vcf" -r
[Example] Find zipped VCF files in a specific directory:
$ fuc fuc-find "*.vcf.gz" --dir ~/test_dir
[Example] Retrieve VCF files in a specific directory:
$ fuc fuc-find "*.vcf" -d /path/to/dir
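The retrieval behavior described in the updated help text can be approximated with the standard library. The following is a minimal sketch and not fuc's actual implementation; the function name ``find_files`` and its parameters are hypothetical and only mirror the ``-r`` and ``-d`` options.

```python
import tempfile
from pathlib import Path


def find_files(pattern, directory=".", recursive=False):
    """Return sorted absolute paths of files whose name matches pattern.

    Hypothetical helper: it mimics the documented options, not fuc's code.
    """
    root = Path(directory).resolve()
    # rglob() descends into subdirectories; glob() stays at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(str(path) for path in matches if path.is_file())


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.vcf").touch()
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "nested" / "b.vcf").touch()
    top_only = find_files("*.vcf", tmp)
    everything = find_files("*.vcf", tmp, recursive=True)

print(len(top_only), len(everything))
```

Keeping recursion opt-in, as the new ``-r`` flag does, avoids walking large directory trees when only the top level is of interest.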
fuc-undetm
==========
@@ -980,6 +985,89 @@ ngs-pon
"-l h='node_A|node_B'" \
"-Xmx15g -Xms15g"
ngs-quant
=========

.. code-block:: text
$ fuc ngs-quant -h
usage: fuc ngs-quant [-h] [--thread INT] [--bootstrap INT] [--job TEXT]
[--force] [--posix]
manifest index output qsub
Pipeline for running RNAseq quantification from FASTQ files with Kallisto.
External dependencies:
- SGE: Required for job submission (i.e. qsub).
- kallisto: Required for RNAseq quantification.
Manifest columns:
- Name: Sample name.
- Read1: Path to forward FASTQ file.
- Read2: Path to reverse FASTQ file.
Positional arguments:
manifest Sample manifest CSV file.
index Kallisto index file.
output Output directory.
qsub SGE resource to request for qsub.
Optional arguments:
-h, --help Show this help message and exit.
--thread INT Number of threads to use (default: 1).
--bootstrap INT Number of bootstrap samples (default: 50).
--job TEXT Job submission ID for SGE.
--force Overwrite the output directory if it already exists.
--posix Set the environment variable HDF5_USE_FILE_LOCKING=FALSE
before running Kallisto. This is required for shared Posix
Filesystems (e.g. NFS, Lustre).
[Example] Specify queue:
$ fuc ngs-quant \
manifest.csv \
transcripts.idx \
output_dir \
"-q queue_name -pe pe_name 10" \
--thread 10
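The manifest described in the help text can be produced with Python's ``csv`` module. This is a sketch under assumptions: it supposes the manifest is comma-delimited with a header row naming the three documented columns, and the sample names and file paths are hypothetical placeholders.

```python
import csv
import io

# Hypothetical samples; only the column names come from the help text.
rows = [
    {"Name": "sampleA",
     "Read1": "/path/to/sampleA_R1.fastq.gz",
     "Read2": "/path/to/sampleA_R2.fastq.gz"},
    {"Name": "sampleB",
     "Read1": "/path/to/sampleB_R1.fastq.gz",
     "Read2": "/path/to/sampleB_R2.fastq.gz"},
]


def write_manifest(rows, handle):
    """Write rows to handle as CSV with Name, Read1, and Read2 columns."""
    writer = csv.DictWriter(handle, fieldnames=["Name", "Read1", "Read2"])
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer; pass an open file handle to write to disk.
buffer = io.StringIO()
write_manifest(rows, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```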
ngs-trim
========

.. code-block:: text
$ fuc ngs-trim -h
usage: fuc ngs-trim [-h] [--thread INT] [--job TEXT] [--force]
manifest output qsub
Pipeline for trimming adapters from FASTQ files.
External dependencies:
- SGE: Required for job submission (i.e. qsub).
- cutadapt: Required for trimming adapters.
Manifest columns:
- Name: Sample name.
- Read1: Path to forward FASTQ file.
- Read2: Path to reverse FASTQ file.
Positional arguments:
manifest Sample manifest CSV file.
output Output directory.
qsub SGE resource to request for qsub.
Optional arguments:
-h, --help Show this help message and exit.
--thread INT Number of threads to use (default: 1).
--job TEXT Job submission ID for SGE.
--force Overwrite the output directory if it already exists.
[Example] Specify queue:
$ fuc ngs-trim \
manifest.csv \
output_dir \
"-q queue_name -pe pe_name 10" \
--thread 10
tabix-index
===========

@@ -1218,7 +1306,10 @@ vcf-merge
Merge two or more VCF files.
Positional arguments:
vcf_files VCF files (compressed or uncompressed).
vcf_files VCF files (compressed or uncompressed). Note that the 'chr'
prefix in contig names (e.g. 'chr1' vs. '1') will be
automatically added or removed as necessary to match the
contig names of the first VCF.
Optional arguments:
-h, --help Show this help message and exit.
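The 'chr' handling described for ``vcf_files`` can be illustrated in plain Python. This is a minimal sketch of the idea and not the code used by :meth:`pyvcf.merge`; the function name ``match_chr_prefix`` is hypothetical, and it assumes the first VCF uses one convention consistently.

```python
def match_chr_prefix(contigs, reference_contigs):
    """Rewrite contig names to follow the convention of reference_contigs.

    Hypothetical helper: the reference is treated as using the 'chr'
    prefix if any of its contig names starts with 'chr'.
    """
    use_chr = any(name.startswith("chr") for name in reference_contigs)
    fixed = []
    for name in contigs:
        has_chr = name.startswith("chr")
        if use_chr and not has_chr:
            name = "chr" + name          # add the missing prefix
        elif not use_chr and has_chr:
            name = name[len("chr"):]     # strip the extra prefix
        fixed.append(name)
    return fixed


first_vcf = ["chr1", "chr2"]   # contigs of the first VCF set the convention
second_vcf = ["1", "2", "X"]
harmonized = match_chr_prefix(second_vcf, first_vcf)
print(harmonized)
```

Harmonizing names before merging matters because 'chr1' and '1' would otherwise be treated as different contigs.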
@@ -1314,6 +1405,30 @@ vcf-slice
[Example] Output a compressed file:
$ fuc vcf-slice in.vcf.gz regions.bed | fuc fuc-bgzip > out.vcf.gz
vcf-split
=========

.. code-block:: text
$ fuc vcf-split -h
usage: fuc vcf-split [-h] [--clean] [--force] vcf output
Split a VCF file by individual.
Positional arguments:
vcf VCF file to be split.
output Output directory.
Optional arguments:
-h, --help Show this help message and exit.
--clean By default, the command will only return variants present in
each individual. Use this flag to stop this behavior and make
sure that all individuals have the same number of variants.
--force Overwrite the output directory if it already exists.
[Example] Split a VCF file by individual:
$ fuc vcf-split in.vcf output_dir
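Splitting by individual can be sketched on raw VCF lines with the standard library alone. This is an illustration and not the implementation behind :meth:`pyvcf.split`; the names ``split_vcf``, ``has_variant``, and ``drop_absent`` are hypothetical, and the sketch assumes a variant counts as "present" when the genotype carries at least one allele other than ``0`` or ``.``.

```python
def has_variant(sample_field):
    """Return True if the GT subfield carries a non-reference allele."""
    genotype = sample_field.split(":")[0]
    alleles = genotype.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)


def split_vcf(lines, drop_absent=True):
    """Map each sample name to the lines of its single-sample VCF."""
    meta = [line for line in lines if line.startswith("##")]
    header = next(l for l in lines if l.startswith("#CHROM")).split("\t")
    records = [l.split("\t") for l in lines if not l.startswith("#")]
    result = {}
    # The first nine columns are fixed; sample columns start at index 9.
    for offset, sample in enumerate(header[9:]):
        column = 9 + offset
        body = [
            "\t".join(record[:9] + [record[column]])
            for record in records
            if not drop_absent or has_variant(record[column])
        ]
        result[sample] = meta + ["\t".join(header[:9] + [sample])] + body
    return result


vcf_lines = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB",
    "chr1\t100\t.\tG\tA\t.\t.\t.\tGT\t0/1\t0/0",
    "chr1\t200\t.\tT\tC\t.\t.\t.\tGT\t./.\t1/1",
]
per_sample = split_vcf(vcf_lines)
all_rows = split_vcf(vcf_lines, drop_absent=False)
```

Here ``drop_absent=True`` plays the role of the default behavior, while ``drop_absent=False`` corresponds to what the ``--clean`` option is described as doing: every individual keeps the same number of records.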
vcf-vcf2bed
===========

4 changes: 4 additions & 0 deletions docs/create.py
@@ -77,6 +77,7 @@
- Ensembl Variant Effect Predictor (VEP)
- SnpEff
- bcl2fastq and bcl2fastq2
- Kallisto
Your contributions (e.g. feature ideas, pull requests) are most welcome.
@@ -153,6 +154,9 @@
>>> from fuc import pyvcf
>>> help(pyvcf)
In Jupyter Notebook and Lab, you can see the documentation for a Python
function by hitting ``SHIFT + TAB``. Hit it twice to expand the view.
CLI examples
============
13 changes: 13 additions & 0 deletions fuc/api/pycov.py
@@ -8,7 +8,9 @@
and ``CovFrame.plot_uniformity``.
"""
from io import StringIO, IOBase
import warnings
import gzip
from pathlib import Path

from . import common, pybam, pybed

@@ -181,6 +183,9 @@ def from_bam(
Under the hood, the method computes read depth using the
:command:`samtools depth` command.
Sample name is extracted from the SM tag. If the tag is missing, the
method will set the filename as sample name.
Parameters
----------
bam : str or list, optional
@@ -257,6 +262,14 @@ def from_bam(
    name = names[i]
else:
    samples = pybam.tag_sm(bam_file)
    if not samples:
        basename = Path(bam_file).stem
        message = (
            f'SM tags were not found for {bam_file}, will use '
            f'file name as sample name ({basename})'
        )
        samples = [basename]
        warnings.warn(message)
    if len(samples) > 1:
        m = f'multiple sample names detected: {bam_file}'
        raise ValueError(m)
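The fallback added in this hunk can be exercised in isolation. The sketch below restates the same logic as a standalone function; ``resolve_sample_name`` is a hypothetical name, and the list of SM tags is passed in directly instead of being read with ``pybam.tag_sm``, so no BAM file is needed.

```python
import warnings
from pathlib import Path


def resolve_sample_name(bam_file, sm_tags):
    """Pick a sample name from SM tags, falling back to the file name.

    Hypothetical helper mirroring the hunk above: sm_tags stands in for
    the list that pybam.tag_sm(bam_file) would return.
    """
    if not sm_tags:
        # Path.stem drops only the final suffix ('a.sorted.bam' -> 'a.sorted').
        basename = Path(bam_file).stem
        warnings.warn(
            f'SM tags were not found for {bam_file}, will use '
            f'file name as sample name ({basename})'
        )
        sm_tags = [basename]
    if len(sm_tags) > 1:
        raise ValueError(f'multiple sample names detected: {bam_file}')
    return sm_tags[0]
```

Emitting a warning instead of raising keeps BAM files without read group metadata usable while still telling the user where the sample name came from.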