Releases: mossmatters/HybPiper
HybPiper version 2.3.2
- Bugfix: allow hybpiper stats to be run when gene names in the target file contain a dot (e.g. taxon1-gene001.01). See issue #164.
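To illustrate why a dot in a gene name can matter, here is a minimal Python sketch of splitting a target-file header into taxon and gene. It assumes the gene name is everything after the last dash in a `taxon-gene` header; the function name `split_target_header` is hypothetical and this is not HybPiper's actual parsing code.

```python
from typing import Tuple


def split_target_header(header: str) -> Tuple[str, str]:
    """Split a 'taxon-gene' FASTA header into (taxon, gene).

    Assumption (not taken from the HybPiper source): the gene name is the
    text after the LAST dash, so dots inside the gene name are left intact.
    """
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("Empty FASTA header")
    name = tokens[0]  # ignore any description after the first whitespace
    taxon, separator, gene = name.rpartition("-")
    if not separator or not taxon or not gene:
        raise ValueError(f"Header is not in 'taxon-gene' form: {header!r}")
    return taxon, gene


# A gene name containing a dot is returned unchanged.
taxon, gene = split_target_header(">taxon1-gene001.01")
```

Splitting only on the dash means the dot is never treated as a delimiter, so `gene001.01` stays a single gene name.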
HybPiper version 2.3.1
- Re-write of the modules for commands hybpiper stats, hybpiper retrieve_sequences, and hybpiper paralog_retriever, to vastly speed up processing of compressed sample (*.tar.gz) folders.
  - Much improved speed using a single thread (the hard-coded default for HybPiper version 2.3.0).
  - Samples can now be processed in parallel when using these commands; new option --cpu added (default is to use all available CPUs minus one).
- Bugfix: ensure all intron sequences are recovered when running hybpiper retrieve_sequences using the intron option.
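The "all available CPUs minus one" default and per-sample parallelism can be sketched as below. This is an illustration only, not HybPiper's implementation: the names `default_cpu_count` and `process_samples` are hypothetical, and a thread pool is used to keep the example short (CPU-bound work would more likely use processes).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_cpu_count() -> int:
    """Return all available CPUs minus one, but never fewer than one."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, available - 1)


def process_samples(sample_names, worker, cpu=None):
    """Apply `worker` to each sample name in parallel, preserving input order.

    `cpu` plays the role of a --cpu style option; when it is None the
    'all CPUs minus one' default is used.
    """
    n_workers = cpu if cpu is not None else default_cpu_count()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, sample_names))


# Example: a trivial worker that just upper-cases each sample name.
results = process_samples(["sample_a", "sample_b"], str.upper, cpu=2)
```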
HybPiper version 2.3.0
- Add option --compress_sample_folder to command hybpiper assemble. Tarball and compress the sample folder after assembly has completed, i.e. <sample_name>.tar.gz.
  - This is useful when running HybPiper on HPC clusters with file number limits.
  - If both an uncompressed and compressed folder exist for a sample, a warning is shown and HybPiper exits.
  - All HybPiper subcommands (stats, recovery_heatmap, retrieve_sequences, paralog_retriever, filter_by_length) work with either compressed or uncompressed sample files/folders, or a combination of both.
  - If a <sample_name>.tar.gz file already exists for a sample, it will be extracted and used for the current run of hybpiper assemble, and the <sample_name>.tar.gz file will be deleted.
- When using BWA for read mapping, the command samtools flagstat is now run during the hybpiper assemble step, rather than during hybpiper stats, and the results are written to the <sample_name>_bam_flagstat.tsv / <sample_name>_unpaired_bam_flagstat.tsv file(s).
  - If the <sample_name>_bam_flagstat.tsv / <sample_name>_unpaired_bam_flagstat.tsv file(s) are not present in a sample directory (i.e. the sample was assembled with HybPiper version <2.3.0), samtools flagstat will be run during hybpiper stats. If the sample is a *.tar.gz file, the *.bam file(s) will first be extracted to disk to a temporary directory called temp_bam_files, within your current working directory. This temporary directory will be deleted after samtools flagstat has been run.
- Add option --not_protein_coding to hybpiper assemble. When this option is provided, sequences matching your target file references will be extracted from SPAdes contigs using BLASTn, rather than Exonerate. This should improve recovery when using a target file with non-protein-coding sequences. Note that this feature is new and might have bugs; please report any issues.
  - Only nucleotide *.FNA sequences will be produced (i.e. no amino-acid sequences).
  - Intronerate will not be run; intron and supercontig sequences will not be produced.
  - If BLASTx or DIAMOND is selected for read mapping (i.e. protein vs translated-nucleotide searches), a warning will be displayed and read mapping will switch to BWA.
- Add the following options to control BLASTn searches of SPAdes contigs when option --not_protein_coding is used:
  - --extract_contigs_blast_task. Task to use for blastn searches (blastn, blastn-short, megablast, dc-megablast). Default is blastn.
  - --extract_contigs_blast_evalue. Expectation value (E) threshold for saving hits. Default is 10.
  - --extract_contigs_blast_word_size. Word size for wordfinder algorithm (length of best perfect match).
  - --extract_contigs_blast_gapopen. Cost to open a gap.
  - --extract_contigs_blast_gapextend. Cost to extend a gap.
  - --extract_contigs_blast_penalty. Penalty for a nucleotide mismatch.
  - --extract_contigs_blast_reward. Reward for a nucleotide match.
  - --extract_contigs_blast_perc_identity. Percent identity.
  - --extract_contigs_blast_max_target_seqs. Maximum number of aligned sequences to keep (value of 5 or more is recommended). Default is 500.
- The final step of the hybpiper assemble pipeline has been renamed from exonerate_contigs to extract_contigs (as either Exonerate or BLASTn can now be used).
- Reorganised grouping of help options when running hybpiper assemble --help to improve clarity.
- Changed option --timeout_assemble for hybpiper assemble to --timeout_assemble_reads to match the step name.
- Changed option --timeout_exonerate_contigs for hybpiper assemble to --timeout_extract_contigs to match the step name.
- Changed option --exonerate_hit_sliding_window_size for hybpiper assemble to --trim_hit_sliding_window_size. This option now applies to either Exonerate hits (measured in amino-acids) or BLASTn hits (measured in nucleotides). Defaults are 5 amino-acids (Exonerate; changed from previous default of 3) or 15 nucleotides (BLASTn).
- Changed option --exonerate_hit_sliding_window_thresh for hybpiper assemble to --trim_hit_sliding_window_thresh. This option now applies to either Exonerate hits (measured via amino-acid similarity) or BLASTn hits (measured via nucleotide similarity). Defaults are 75 for amino-acids (Exonerate; changed from previous default of 55) or 65 for nucleotides (BLASTn).
- Fixed a bug in fix_targetfile.py: MAFFT is now called via subprocess rather than Bio.Align.Applications.MafftCommandline when checking for best match translations (see issue #156).
- Added a more informative error message if running hybpiper retrieve_sequences or hybpiper paralog_retriever from HybPiper version >=2.2.0 on sample folders from HybPiper version <2.2.0. This error occurs because the sample folders do not contain a <prefix>_chimera_check_performed.txt file (see issue #155).
- When extracting coding sequences from SPAdes contigs using Exonerate, changed the initial Exonerate run to not use the option --refine full (see Exonerate docs), unless the option --exonerate_refine_full is provided to hybpiper assemble. Although the Exonerate option --refine full should improve output alignments, in some cases it can result in spurious alignment regions (e.g. an intron/non-coding region being included as an "exon" alignment) that can get incorporated into the HybPiper output sequence.
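The compressed-sample behaviour described for --compress_sample_folder can be sketched with Python's standard tarfile module. This is a rough illustration under stated assumptions, not HybPiper's code: the function names are hypothetical, the archive is assumed to hold the folder under its own name, and the sketch raises an error where HybPiper shows a warning and exits.

```python
import tarfile
from pathlib import Path


def compress_sample_folder(sample_dir) -> Path:
    """Write <sample_name>.tar.gz next to the sample folder and return its path.

    Note: this sketch leaves the original folder in place.
    """
    sample_dir = Path(sample_dir)
    archive = sample_dir.parent / f"{sample_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(sample_dir, arcname=sample_dir.name)
    return archive


def read_sample_file(parent_dir, sample_name: str, relative_path: str) -> bytes:
    """Read one file from a sample, whether it is a folder or a tarball.

    `relative_path` uses forward slashes, relative to the sample folder.
    """
    parent_dir = Path(parent_dir)
    folder = parent_dir / sample_name
    archive = parent_dir / f"{sample_name}.tar.gz"
    if folder.is_dir() and archive.is_file():
        # Mirrors the documented rule: both forms present is treated as an error.
        raise RuntimeError(f"Both a folder and a tarball exist for {sample_name}")
    if folder.is_dir():
        return (folder / relative_path).read_bytes()
    if archive.is_file():
        with tarfile.open(archive, "r:gz") as tar:
            member = tar.extractfile(f"{sample_name}/{relative_path}")
            if member is None:  # e.g. the member is a directory
                raise FileNotFoundError(relative_path)
            return member.read()
    raise FileNotFoundError(f"No folder or tarball found for {sample_name}")
```

Reading a single member with `extractfile` avoids unpacking the whole archive, which is one way a tool could stay within file-number limits on an HPC cluster.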
HybPiper version 2.2.0
- Add option --end_with to command hybpiper assemble. Allows the user to end the assembly pipeline at a chosen step (map_reads, distribute_reads, assemble_reads, exonerate_contigs).
- Add option --exonerate_skip_hits_with_frameshifts to command hybpiper assemble. If provided, skip Exonerate hits where the SPAdes contig contains frameshifts when considering hits for assembly of an *.FNA sequence. Default behaviour in HybPiper v2.2.0 is to include these hits; previous versions allowed them automatically.
- Add option --exonerate_skip_hits_with_internal_stop_codons to command hybpiper assemble. If provided, skip Exonerate hits where the SPAdes contig contains internal in-frame stop codon(s) when considering hits for assembly of an *.FNA sequence. A single terminal stop codon is allowed. Default behaviour in HybPiper v2.2.0 is to include these hits; previous versions allowed them automatically.
- Add option --exonerate_skip_hits_with_terminal_stop_codons to command hybpiper assemble. If provided, skip Exonerate hits where the SPAdes sequence contains a single terminal stop codon. Only applies when option --exonerate_skip_hits_with_internal_stop_codons is also provided. Only use this flag if your target file exclusively contains protein-coding genes with no stop codons included, and you would like to prevent any in-frame stop codons in the output sequences. Default behaviour in HybPiper v2.2.0 is to include these hits; previous versions allowed them automatically.
- Add option --chimeric_stitched_contig_check to command hybpiper assemble. If provided, HybPiper will attempt to determine whether a stitched contig is a potential chimera of contigs from multiple paralogs. Default behaviour in HybPiper v2.2.0 is to skip this check; previous versions performed the check automatically. Skipping this check significantly speeds up the final 'exonerate_contigs' step of the pipeline.
- Add option --no_pad_stitched_contig_gaps_with_n to command hybpiper assemble. If provided, when constructing stitched contigs, do not pad any gaps between hits (with respect to the "best" protein reference) with a number of Ns corresponding to the reference gap multiplied by 3. Default behaviour in HybPiper v2.2.0 is to pad gaps with Ns; previous versions did this automatically.
- Add option --skip_targetfile_checks to command hybpiper assemble. Skip the target file checks. Can be used if you are confident that your target file has no issues (e.g. if you have previously run hybpiper check_targetfile).
- Add option --no_spades_eta to command hybpiper assemble. When SPAdes is run concurrently using GNU parallel, the "--eta" flag can result in many "sh: /dev/tty: Device not configured" errors written to stderr. Using this option removes the "--eta" flag to GNU parallel, silencing both ETA output and the error message.
- Fixed a bug in exonerate_hits.py that could (rarely) result in a duplicated region in the output *.FNA sequence.
- Fixed a bug in exonerate_hits.py that occurred when more than two Exonerate hits had identical query ranges and similarity scores; this could result in a sequence not being returned for the given gene.
- Added tests folder containing initial unit tests. Some tests require the Python package pyfakefs to run.
- Refactor of the hybpiper package. New module hybpiper_main.py with entry point (moved from assemble.py), and some assemble.py functions moved to utils.py. Target file checking functionality has been consolidated.
- HybPiper now logs to stdout rather than stderr.
- Commands hybpiper check_targetfile and hybpiper assemble now write a report file when checking the target file (check_targetfile_report-<target file name>.txt), rather than logging details to the main sample log. Command hybpiper check_targetfile writes the report to the current working directory, whereas command hybpiper assemble writes it to the sample directory.
- If the option --cpu is not specified for hybpiper assemble, HybPiper will now use all available CPUs minus one, rather than all available CPUs.
- Command hybpiper assemble now checks for output from previous runs for the pipeline steps selected via --start_from and --end_with (default is to select all steps). If previous output is found, HybPiper will exit with an error unless the option --force_overwrite is provided.
- Corrected the reading frame of sequence Artocarpus-gene660 in the test dataset target file.
- Command hybpiper assemble now writes the file <prefix>_chimera_check_performed.txt to the sample directory. This is a text file containing 'True' or 'False' depending on whether the option --chimeric_stitched_contig_check was provided to command hybpiper assemble. Used by hybpiper retrieve_sequences and hybpiper paralog_retriever.
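The N-padding rule described for stitched contigs (reference gap multiplied by 3) can be sketched as follows. This is an illustration only: the function name and the `(ref_start, ref_end, nucleotide_sequence)` hit layout are assumptions, with coordinates taken to be 0-based, half-open positions on the protein reference.

```python
def stitch_hits_with_n_padding(hits, pad_gaps=True):
    """Concatenate hit sequences in reference order, optionally padding gaps.

    hits: iterable of (ref_start, ref_end, nucleotide_sequence) tuples, where
    the coordinates are amino-acid positions on the protein reference
    (hypothetical layout, assumed non-overlapping).
    """
    ordered = sorted(hits, key=lambda hit: hit[0])
    pieces = []
    previous_end = None
    for ref_start, ref_end, sequence in ordered:
        if pad_gaps and previous_end is not None:
            gap_in_amino_acids = max(0, ref_start - previous_end)
            # Each missing amino-acid corresponds to three nucleotides.
            pieces.append("N" * (gap_in_amino_acids * 3))
        pieces.append(sequence)
        previous_end = ref_end
    return "".join(pieces)


# Two hits separated by a gap of two amino-acids on the reference.
padded = stitch_hits_with_n_padding([(0, 2, "ATGGCC"), (4, 5, "TTT")])
unpadded = stitch_hits_with_n_padding([(0, 2, "ATGGCC"), (4, 5, "TTT")], pad_gaps=False)
```

Passing `pad_gaps=False` corresponds to the behaviour requested with --no_pad_stitched_contig_gaps_with_n.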
HybPiper version 2.1.8
- Add new subcommand hybpiper filter_by_length, used to filter the sequence output of hybpiper retrieve_sequences by absolute length and/or length relative to mean length in target file representatives. This is done on a per-sample/per-gene basis, rather than the sample-level filtering available in hybpiper retrieve_sequences. See wiki for more information.
- Update the regex used to check target file fasta header formatting, to capture scenarios where a name contains multiple dashes and also ends with a dash.
- In the fix_targetfile.py module, remove the import of Bio.Align.Applications.MafftCommandline and call MAFFT via subprocess (see issue #147).
- In the gene_recovery_heatmap.py module, cast the dataframe from the seq_lengths_file to object dtype to avoid a deprecation warning.
- Add option --no_heatmap to command hybpiper paralog_retriever (see issue #150).
- Fix an Exonerate-related debug message in exonerate_hits.py.
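A per-gene length filter of the kind described for hybpiper filter_by_length could look roughly like this. It is a sketch under assumptions: the function name, parameter names, and the exact comparison rules are hypothetical and are not taken from HybPiper.

```python
from statistics import mean


def passes_length_filter(seq_length, target_lengths, min_length=0, min_fraction_of_mean=0.0):
    """Decide whether one recovered gene sequence passes the length filters.

    seq_length: length of the recovered sequence for one sample and gene.
    target_lengths: lengths of that gene's representatives in the target file.
    min_length: absolute minimum length (hypothetical parameter).
    min_fraction_of_mean: minimum length as a fraction of the mean target
        length (hypothetical parameter); 0 disables the relative filter.
    """
    if seq_length < min_length:
        return False
    if min_fraction_of_mean > 0:
        if not target_lengths:
            raise ValueError("No target lengths supplied for the relative filter")
        return seq_length >= min_fraction_of_mean * mean(target_lengths)
    return True


# Mean target length is 500, so a 300 bp sequence passes a 0.5 threshold.
keep = passes_length_filter(300, [400, 600], min_length=200, min_fraction_of_mean=0.5)
```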
HybPiper version 2.1.7
- The flag --run_intronerate was removed from the hybpiper assemble command in the run_hybpiper_test_dataset.sh file.
- Removed the legacy check and attempted download of the test dataset in the run_hybpiper_test_dataset.sh file.
- Added a check to hybpiper stats and hybpiper retrieve_sequences to ensure sample names in the namelist.txt file do not contain forward slashes (see issue #143).
- When checking for putative chimeric gene sequences in hybpiper retrieve_sequences and hybpiper paralog_retriever, generate a warning rather than an error if the file <sample_name>_genes_derived_from_putative_chimeric_stitched_contig.csv can't be found for a given sample. This file will not be written if no gene sequences were produced for this sample (i.e. no reads mapped, no SPAdes contigs, no sequences extracted from SPAdes contigs via Exonerate).
- Check that target file FASTA headers do not contain quotation marks (" or '); see issue #125.
- Updated the installation instructions in the README and Wiki to use the Bioconda package, and added installation instructions for Macs with Apple Silicon (M1/M2/M3 chips).
- Fixed a bug in exonerate_hits.py that meant that hits were not always trimmed to start with the first amino-acid with full alignment identity. This bug could potentially have had an effect on output sequences only if the values for --exonerate_hit_sliding_window_size and/or --exonerate_hit_sliding_window_thresh were changed from default values.
- Use importlib.metadata rather than pkg_resources for module version checks, due to deprecation of the latter.
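A version lookup with importlib.metadata (Python 3.8+), as opposed to pkg_resources, can be sketched like this. The helper name `installed_version` is hypothetical; only the standard-library calls are real.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# A distribution that is not installed yields None rather than an exception.
missing = installed_version("surely-not-a-real-distribution-xyz123")
```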
HybPiper version 2.1.6
- Intronerate is now run by default. The flag --run_intronerate for subcommand hybpiper assemble has been changed to --no_intronerate.
- If Intronerate fails, failed genes and errors will be printed and logged; the exonerate_contigs step of the pipeline will continue.
- Updated error handling and logging for the exonerate_contigs step of the pipeline.
- Change default DPI of heatmaps to 100 (previously 150) for hybpiper recovery_heatmap and hybpiper paralog_retriever.
- Enforce rendering of all loci (x-axis) and sample (y-axis) labels in heatmaps; previously, matplotlib/seaborn would dynamically drop labels if they were too closely spaced.
- Added flags --no_xlabels and --no_ylabels for hybpiper recovery_heatmap and hybpiper paralog_retriever; these turn off rendering of the corresponding labels in the saved figures.
- If the auto-calculated size of heatmaps for hybpiper recovery_heatmap and hybpiper paralog_retriever exceeds the maximum number of pixels (65536) in length and/or height, resize the figure to 400 inches and 100 DPI. Note that large datasets can fail to render fully in the saved figure even if the pixel dimensions are less than the maximum (see e.g. https://stackoverflow.com/questions/64393779/how-to-render-a-heatmap-for-a-large-array), but reducing the size/DPI further allows the full figure to be rendered.
- Added module version.py for a single location of HybPiper version number.
- Print and log HybPiper version when calling all subcommands.
- Added column 'TotalBasesRecovered' to the hybpiper stats report, listing the total number of nucleotides recovered for each sample (not counting N characters). Added 'TotalBasesRecovered' as a filtering option in hybpiper retrieve_sequences.
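The 'TotalBasesRecovered' figure (nucleotides recovered, not counting N characters) amounts to a simple count. The sketch below assumes a sample's gene sequences are available as plain strings and that lower-case n is also excluded; the function name is hypothetical.

```python
def total_bases_recovered(sequences):
    """Sum the lengths of all sequences, excluding N characters.

    Assumption: both 'N' and 'n' count as N characters.
    """
    return sum(len(seq) - seq.upper().count("N") for seq in sequences)


# 8 - 3 = 5 bases in the first sequence and 4 - 2 = 2 in the second.
total = total_bases_recovered(["ATGNNNCC", "nnAT"])
```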
HybPiper version 2.1.5
- Bugfix: fixed an issue in exonerate_hits.py that could result in initial Exonerate hits being trimmed too aggressively at their 3' ends.
- Bugfix: fixed an issue in exonerate_hits.py that could introduce minor insertions into the supercontig (concatenated exon and partial intron) sequence used when running Intronerate.
HybPiper version 2.1.4
Bugfix: fixed an issue when using --run_intronerate that could cause an error and result in no *.FNA sequence being produced for some genes.
HybPiper version 2.1.3
v2.1.3 Update to 2.1.3