Merge pull request #51 from sbslee/0.30.0-dev

0.30.0 dev
sbslee · Feb 5, 2022 · 5972cab
2 parents f4eb5f6 + 67d8082
commit 5972cab

Showing 15 changed files with 871 additions and 43 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,18 @@
 Changelog
 *********
 
+0.30.0 (2022-02-05)
+-------------------
+
+* Update :command:`fuc-find` command to let users control whether files are retrieved recursively.
+* Add new command :command:`ngs-trim`.
+* Add new command :command:`ngs-quant`.
+* Add new submodule ``pykallisto``.
+* Update :meth:`pycov.CovFrame.from_bam` method to use filename as sample name when the SM tag is missing.
+* Add new method :meth:`pyvcf.row_phased`. From now on, it's used to get the ``pyvcf.VcfFrame.phased`` property.
+* Add new method :meth:`pyvcf.split` and :command:`vcf-split` command for splitting VCF by individual.
+* Update :meth:`pyvcf.merge` method, :meth:`pyvcf.VcfFrame.merge` method, and :command:`vcf-merge` command to automatically handle the 'chr' prefix in contig names.
+
 0.29.0 (2021-12-19)
 -------------------
 

diff --git a/README.rst b/README.rst
@@ -49,6 +49,7 @@ Additionally, fuc can be used to parse output data from the following programs:
 - Ensembl Variant Effect Predictor (VEP)
 - SnpEff
 - bcl2fastq and bcl2fastq2
+- Kallisto
 
 Your contributions (e.g. feature ideas, pull requests) are most welcome.
 
@@ -128,7 +129,8 @@ For getting help on the fuc CLI:
        fuc-compf    Compare the contents of two files.
        fuc-demux    Parse the Reports directory from bcl2fastq.
        fuc-exist    Check whether certain files exist.
-       fuc-find     Find all filenames matching a specified pattern recursively.
+       fuc-find     Retrieve absolute paths of files whose name matches a 
+                    specified pattern, optionally recursively.
        fuc-undetm   Compute top unknown barcodes using undetermined FASTQ from bcl2fastq.
        maf-maf2vcf  Convert a MAF file to a VCF file.
        maf-oncoplt  Create an oncoplot with a MAF file.
@@ -139,6 +141,9 @@ For getting help on the fuc CLI:
        ngs-hc       Pipeline for germline short variant discovery.
        ngs-m2       Pipeline for somatic short variant discovery.
        ngs-pon      Pipeline for constructing a panel of normals (PoN).
+       ngs-quant    Pipeline for running RNAseq quantification from FASTQ files 
+                    with Kallisto.
+       ngs-trim     Pipeline for trimming adapters from FASTQ files.
        tabix-index  Index a GFF/BED/SAM/VCF file with Tabix.
        tabix-slice  Slice a GFF/BED/SAM/VCF file with Tabix.
        tbl-merge    Merge two table files.
@@ -148,6 +153,7 @@ For getting help on the fuc CLI:
        vcf-merge    Merge two or more VCF files.
        vcf-rename   Rename the samples in a VCF file.
        vcf-slice    Slice a VCF file for specified regions.
+       vcf-split    Split a VCF file by individual.
        vcf-vcf2bed  Convert a VCF file to a BED file.
        vcf-vep      Filter a VCF file by annotations from Ensembl VEP.
    
@@ -169,6 +175,7 @@ Below is the list of submodules available in the fuc API:
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
 - **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
+- **pykallisto** : The pykallisto submodule is designed for working with RNAseq quantification data from Kallisto. It implements ``pykallisto.KallistoFrame`` which stores Kallisto's output data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pykallisto.KallistoFrame`` class also contains many useful plotting methods such as ``KallistoFrame.plot_differential_abundance``.
 - **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
 - **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
 - **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
@@ -181,6 +188,9 @@ For getting help on a specific submodule (e.g. pyvcf):
    >>> from fuc import pyvcf
    >>> help(pyvcf)
 
+In Jupyter Notebook and Lab, you can see the documentation for a Python
+function by hitting ``SHIFT + TAB``. Hit it twice to expand the view.
+
 CLI examples
 ============
 

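The revised ``fuc-find`` description above (absolute paths, recursion now optional) can be illustrated with a minimal sketch. This is not fuc's actual implementation, which is not shown in this diff. The function name ``find_files`` and its parameters are hypothetical and only mirror the ``pattern``, ``-d``, and ``-r`` options.

```python
from pathlib import Path


def find_files(pattern, directory=".", recursive=False):
    """Return sorted absolute paths of files whose name matches ``pattern``.

    Hypothetical helper: ``directory`` and ``recursive`` stand in for the
    ``-d`` and ``-r`` options of the command.
    """
    root = Path(directory)
    # rglob() descends into subdirectories; glob() stays in the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(str(p.resolve()) for p in matches if p.is_file())
```

Calling ``find_files("*.vcf")`` would correspond to ``fuc fuc-find "*.vcf"``, and passing ``recursive=True`` to adding ``-r``.
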
diff --git a/docs/api.rst b/docs/api.rst
@@ -17,6 +17,7 @@ Below is the list of submodules available in the fuc API:
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation. The ``pycov.CovFrame`` class also contains many useful plotting methods such as ``CovFrame.plot_region`` and ``CovFrame.plot_uniformity``.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
 - **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
+- **pykallisto** : The pykallisto submodule is designed for working with RNAseq quantification data from Kallisto. It implements ``pykallisto.KallistoFrame`` which stores Kallisto's output data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pykallisto.KallistoFrame`` class also contains many useful plotting methods such as ``KallistoFrame.plot_differential_abundance``.
 - **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
 - **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
 - **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
@@ -65,6 +66,12 @@ fuc.pygff
 .. automodule:: fuc.api.pygff
    :members:
 
+fuc.pykallisto
+==============
+
+.. automodule:: fuc.api.pykallisto
+   :members:
+
 fuc.pymaf
 =========
 

diff --git a/docs/cli.rst b/docs/cli.rst
@@ -35,7 +35,8 @@ For getting help on the fuc CLI:
        fuc-compf    Compare the contents of two files.
        fuc-demux    Parse the Reports directory from bcl2fastq.
        fuc-exist    Check whether certain files exist.
-       fuc-find     Find all filenames matching a specified pattern recursively.
+       fuc-find     Retrieve absolute paths of files whose name matches a 
+                    specified pattern, optionally recursively.
        fuc-undetm   Compute top unknown barcodes using undetermined FASTQ from bcl2fastq.
        maf-maf2vcf  Convert a MAF file to a VCF file.
        maf-oncoplt  Create an oncoplot with a MAF file.
@@ -46,6 +47,9 @@ For getting help on the fuc CLI:
        ngs-hc       Pipeline for germline short variant discovery.
        ngs-m2       Pipeline for somatic short variant discovery.
        ngs-pon      Pipeline for constructing a panel of normals (PoN).
+       ngs-quant    Pipeline for running RNAseq quantification from FASTQ files 
+                    with Kallisto.
+       ngs-trim     Pipeline for trimming adapters from FASTQ files.
        tabix-index  Index a GFF/BED/SAM/VCF file with Tabix.
        tabix-slice  Slice a GFF/BED/SAM/VCF file with Tabix.
        tbl-merge    Merge two table files.
@@ -55,6 +59,7 @@ For getting help on the fuc CLI:
        vcf-merge    Merge two or more VCF files.
        vcf-rename   Rename the samples in a VCF file.
        vcf-slice    Slice a VCF file for specified regions.
+       vcf-split    Split a VCF file by individual.
        vcf-vcf2bed  Convert a VCF file to a BED file.
        vcf-vep      Filter a VCF file by annotations from Ensembl VEP.
    
@@ -524,28 +529,28 @@ fuc-find
 .. code-block:: text
 
    $ fuc fuc-find -h
-   usage: fuc fuc-find [-h] [--dir PATH] pattern
+   usage: fuc fuc-find [-h] [-r] [-d PATH] pattern
    
-   Find all filenames matching a specified pattern recursively.
-   
-   This command will recursively find all the filenames matching a specified
-   pattern and then return their absolute paths.
+   Retrieve absolute paths of files whose name matches a specified pattern,
+   optionally recursively.
    
    Positional arguments:
-     pattern     Filename pattern.
+     pattern               Filename pattern.
    
    Optional arguments:
-     -h, --help  Show this help message and exit.
-     --dir PATH  Directory to search in (default: current directory).
+     -h, --help            Show this help message and exit.
+     -r, --recursive       Turn on recursive retrieving.
+     -d PATH, --directory PATH
+                           Directory to search in (default: current directory).
    
-   [Example] Find VCF files in the current directory:
+   [Example] Retrieve VCF files in the current directory only:
      $ fuc fuc-find "*.vcf"
    
-   [Example] Find specific VCF files:
-     $ fuc fuc-find "*.vcf.*"
+   [Example] Retrieve VCF files recursively:
+     $ fuc fuc-find "*.vcf" -r
    
-   [Example] Find zipped VCF files in a specific directory:
-     $ fuc fuc-find "*.vcf.gz" --dir ~/test_dir
+   [Example] Retrieve VCF files in a specific directory:
+     $ fuc fuc-find "*.vcf" -d /path/to/dir
 
 fuc-undetm
 ==========
@@ -980,6 +985,89 @@ ngs-pon
      "-l h='node_A|node_B'" \
      "-Xmx15g -Xms15g"
 
+ngs-quant
+=========
+
+.. code-block:: text
+
+   $ fuc ngs-quant -h
+   usage: fuc ngs-quant [-h] [--thread INT] [--bootstrap INT] [--job TEXT]
+                        [--force] [--posix]
+                        manifest index output qsub
+   
+   Pipeline for running RNAseq quantification from FASTQ files with Kallisto.
+   
+   External dependencies:
+     - SGE: Required for job submission (i.e. qsub).
+     - kallisto: Required for RNAseq quantification.
+   
+   Manifest columns:
+     - Name: Sample name.
+     - Read1: Path to forward FASTQ file.
+     - Read2: Path to reverse FASTQ file.
+   
+   Positional arguments:
+     manifest         Sample manifest CSV file.
+     index            Kallisto index file.
+     output           Output directory.
+     qsub             SGE resource to request for qsub.
+   
+   Optional arguments:
+     -h, --help       Show this help message and exit.
+     --thread INT     Number of threads to use (default: 1).
+     --bootstrap INT  Number of bootstrap samples (default: 50).
+     --job TEXT       Job submission ID for SGE.
+     --force          Overwrite the output directory if it already exists.
+     --posix          Set the environment variable HDF5_USE_FILE_LOCKING=FALSE 
+                      before running Kallisto. This is required for shared POSIX 
+                      filesystems (e.g. NFS, Lustre).
+   
+   [Example] Specify queue:
+     $ fuc ngs-quant \
+     manifest.csv \
+     transcripts.idx \
+     output_dir \
+     "-q queue_name -pe pe_name 10" \
+     --thread 10
+
+ngs-trim
+========
+
+.. code-block:: text
+
+   $ fuc ngs-trim -h
+   usage: fuc ngs-trim [-h] [--thread INT] [--job TEXT] [--force]
+                       manifest output qsub
+   
+   Pipeline for trimming adapters from FASTQ files.
+   
+   External dependencies:
+     - SGE: Required for job submission (i.e. qsub).
+     - cutadapt: Required for trimming adapters.
+   
+   Manifest columns:
+     - Name: Sample name.
+     - Read1: Path to forward FASTQ file.
+     - Read2: Path to reverse FASTQ file.
+   
+   Positional arguments:
+     manifest      Sample manifest CSV file.
+     output        Output directory.
+     qsub          SGE resource to request for qsub.
+   
+   Optional arguments:
+     -h, --help    Show this help message and exit.
+     --thread INT  Number of threads to use (default: 1).
+     --job TEXT    Job submission ID for SGE.
+     --force       Overwrite the output directory if it already exists.
+   
+   [Example] Specify queue:
+     $ fuc ngs-trim \
+     manifest.csv \
+     output_dir \
+     "-q queue_name -pe pe_name 10" \
+     --thread 10
+
 tabix-index
 ===========
 
@@ -1218,7 +1306,10 @@ vcf-merge
    Merge two or more VCF files.
    
    Positional arguments:
-     vcf_files      VCF files (compressed or uncompressed).
+     vcf_files      VCF files (compressed or uncompressed). Note that the 'chr'
+                    prefix in contig names (e.g. 'chr1' vs. '1') will be 
+                    automatically added or removed as necessary to match the 
+                    contig names of the first VCF.
    
    Optional arguments:
      -h, --help     Show this help message and exit.
@@ -1314,6 +1405,30 @@ vcf-slice
    [Example] Output a compressed file:
      $ fuc vcf-slice in.vcf.gz regions.bed | fuc fuc-bgzip > out.vcf.gz
 
+vcf-split
+=========
+
+.. code-block:: text
+
+   $ fuc vcf-split -h
+   usage: fuc vcf-split [-h] [--clean] [--force] vcf output
+   
+   Split a VCF file by individual.
+   
+   Positional arguments:
+     vcf         VCF file to be split.
+     output      Output directory.
+   
+   Optional arguments:
+     -h, --help  Show this help message and exit.
+     --clean     By default, the command will only return variants present in 
+                 each individual. Use this flag to stop this behavior and make 
+                 sure that all individuals have the same number of variants.
+     --force     Overwrite the output directory if it already exists.
+   
+   [Example] Split a VCF file by individual:
+     $ fuc vcf-split in.vcf output_dir
+
 vcf-vcf2bed
 ===========
 

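The ``vcf-merge`` help text in the docs/cli.rst diff above says the 'chr' prefix is added or removed so that contig names match those of the first VCF. A minimal sketch of that idea follows. It is not the code fuc uses, which does not appear in this part of the diff, and the helper names ``has_chr_prefix`` and ``match_chr_prefix`` are hypothetical.

```python
def has_chr_prefix(contigs):
    """Return True if any contig name starts with 'chr'."""
    return any(c.startswith("chr") for c in contigs)


def match_chr_prefix(contigs, reference_contigs):
    """Add or strip 'chr' so ``contigs`` follows the reference convention.

    Assumption: ``reference_contigs`` plays the role of the contig names
    of the first VCF being merged.
    """
    want_chr = has_chr_prefix(reference_contigs)
    result = []
    for contig in contigs:
        if want_chr and not contig.startswith("chr"):
            result.append("chr" + contig)          # '1' -> 'chr1'
        elif not want_chr and contig.startswith("chr"):
            result.append(contig[len("chr"):])     # 'chr1' -> '1'
        else:
            result.append(contig)                  # already consistent
    return result
```

With this sketch, contigs ``['1', 'X']`` from a second file would become ``['chr1', 'chrX']`` when the first file uses ``'chr1'``, and the reverse when it uses ``'1'``.
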
diff --git a/docs/create.py b/docs/create.py
@@ -77,6 +77,7 @@
 - Ensembl Variant Effect Predictor (VEP)
 - SnpEff
 - bcl2fastq and bcl2fastq2
+- Kallisto
 
 Your contributions (e.g. feature ideas, pull requests) are most welcome.
 
@@ -153,6 +154,9 @@
    >>> from fuc import pyvcf
    >>> help(pyvcf)
 
+In Jupyter Notebook and Lab, you can see the documentation for a Python
+function by hitting ``SHIFT + TAB``. Hit it twice to expand the view.
+
 CLI examples
 ============
 

diff --git a/fuc/api/pycov.py b/fuc/api/pycov.py
@@ -8,7 +8,9 @@
 and ``CovFrame.plot_uniformity``.
 """
 from io import StringIO, IOBase
+import warnings
 import gzip
+from pathlib import Path
 
 from . import common, pybam, pybed
 
@@ -181,6 +183,9 @@ def from_bam(
         Under the hood, the method computes read depth using the
         :command:`samtools depth` command.
 
+        The sample name is extracted from the SM tag. If the tag is missing,
+        the method will use the filename (without its extension) as the
+        sample name.
+
         Parameters
         ----------
         bam : str or list, optional
@@ -257,6 +262,14 @@ def from_bam(
                 name = names[i]
             else:
                 samples = pybam.tag_sm(bam_file)
+                if not samples:
+                    basename = Path(bam_file).stem
+                    message = (
+                        f'SM tags were not found for {bam_file}, will use '
+                        f'file name as sample name ({basename})'
+                    )
+                    samples = [basename]
+                    warnings.warn(message)
                 if len(samples) > 1:
                     m = f'multiple sample names detected: {bam_file}'
                     raise ValueError(m)
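
The fallback added to ``from_bam`` in the hunk above can be tried in isolation with the sketch below. It is a standalone approximation rather than the method itself: the function name ``resolve_sample_name`` is hypothetical, the list of SM tags is passed in directly instead of being read with ``pybam.tag_sm``, and returning the single remaining tag is an assumption because the hunk ends before that step.

```python
import warnings
from pathlib import Path


def resolve_sample_name(bam_file, sm_tags):
    """Pick a sample name from SM tags, falling back to the file name.

    ``sm_tags`` stands in for the list returned by ``pybam.tag_sm``.
    """
    if not sm_tags:
        # Path.stem drops the directory and the final suffix only,
        # e.g. '/data/NA12878.bam' -> 'NA12878'.
        basename = Path(bam_file).stem
        warnings.warn(
            f"SM tags were not found for {bam_file}, will use "
            f"file name as sample name ({basename})"
        )
        return basename
    if len(sm_tags) > 1:
        raise ValueError(f"multiple sample names detected: {bam_file}")
    # Assumed final step: exactly one SM tag, so use it as the name.
    return sm_tags[0]
```

Because ``Path.stem`` removes only the last suffix, a file named ``sample.sorted.bam`` would be given the sample name ``sample.sorted``.
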