LRScaf: improving draft genomes using long noisy reads
A hybrid assembly strategy is a reasonable and promising way to combine the strengths and offset the weaknesses of Next-Generation Sequencing (NGS) and Third-Generation Sequencing (TGS) technologies. Following this principle, we present LRScaf (Long Reads Scaffolder), a toolkit that uses TGS data to improve draft genome assemblies. Its main features are short running time, accuracy, and contiguity. Scaffolding a rice genome can be done in 20 minutes with the minimap mapper. For human, LRScaf improved the draft assembly NG50 from 127.5 Kb to 10.4 Mb on the 20x PacBio CHM1 dataset and from 115.7 Kb to 17.4 Mb on the ~35x Nanopore NA12878 dataset.
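For readers unfamiliar with the NG50 metric quoted above, the following is a minimal Python sketch of how it is usually computed: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the estimated genome size. The function name `ng50` and the toy numbers are illustrative only and are not part of LRScaf.

```python
def ng50(sequence_lengths, genome_size):
    """Return NG50: the length L such that sequences of length >= L
    together cover at least half of the estimated genome size."""
    half = genome_size / 2
    running_total = 0
    for length in sorted(sequence_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    # The assembly covers less than half of the genome size.
    return 0


# Toy example: six sequences and an assumed genome size of 300 bp.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_ng50 = ng50(toy_lengths, 300)
```

Unlike N50, which uses the total assembly length, NG50 uses the genome size, so values from different assemblies of the same genome are directly comparable.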
################################################################################
Requirements
################################################################################
Java version: 1.8+.
################################################################################
Building LRScaf project
################################################################################
There are two ways to build and run this project:
# 1. unzip the release package
>unzip lrscaf-<version>.zip
# 2. change the working folder
>cd lrscaf-<version>
# 3. compile the source code and package the project; a jar package named
# LRScaf-<version>.jar will be created under the target folder.
>mvn package
################################################################################
Quick start
################################################################################
# XML configuration style
>java -jar LRScaf-<version>.jar -x <configure.xml>
# or command-line in short style
>java -jar LRScaf-<version>.jar -c <draft_assembly.fasta> -a <alignment.m4> -t <m4> -o <output_folder> [options]
# or command-line in long style
>java -jar LRScaf-<version>.jar --contig <draft_assembly.fasta> --alignedFile <alignment.m4> -t <m4> --output <output_folder> [options]
################################################################################
An Oryza sativa L. Tutorial
################################################################################
# Improving a draft assembly using LRScaf generally takes three steps.
# The first step: generate a draft assembly using NGS reads.
# Download the NGS dataset (prefetch SRR8446493) and extract the NGS reads (fastq-dump SRR8446493);
# Download the TGS dataset under the project PRJNA318714 on NCBI and extract TGS reads of about 20-fold coverage;
# Construct the NGS draft assembly using SOAPdenovo2 (More details: https://sourceforge.net/projects/soapdenovo2/)
>SOAPdenovo127mer pregraph -s ./assembly.config -d 1 -K 83 -R -p 48 -o ./83/83
>SOAPdenovo127mer contig -R -g ./83/83
>SOAPdenovo127mer map -p 48 -s ./assembly.config -g ./83/83
>SOAPdenovo127mer scaff -p 48 -L 150 -F -g ./83/83
# The content of "assembly.config" file:
# max_rd_len=150
# [LIB]
# avg_ins=300
# reverse_seq=0
# asm_flags=3
# q1=read_R1.fq
# q2=read_R2.fq
# The second step: align the TGS long reads against the draft assembly.
# Map the TGS long reads against the draft assembly with minimap2 or BLASR.
>minimap2 -t 8 ./draft.fa ./tgs20x.fa >./aln.mm
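By default minimap2 writes PAF, a tab-separated format whose first 12 columns are fixed. As a rough way to inspect the alignment file before scaffolding, the sketch below parses one PAF line and computes the ratio of matching bases to alignment block length. The names `PafRecord`, `parse_paf_line` and `match_ratio` are hypothetical helpers, and whether LRScaf derives its `identity` value in exactly this way is an assumption.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns, in file order."""
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_len: int
    target_start: int
    target_end: int
    matches: int      # number of matching bases
    block_len: int    # alignment block length, including gaps
    mapq: int


def parse_paf_line(line):
    """Parse the mandatory columns of one PAF line; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def match_ratio(record):
    """Matching bases divided by alignment block length (0.0 for empty blocks)."""
    return record.matches / record.block_len if record.block_len else 0.0


# Made-up example line: a 10 kb read aligned to a 50 kb contig.
example = "read1\t10000\t100\t9000\t+\tctg1\t50000\t2000\t11000\t1800\t9000\t60\ttp:A:P"
record = parse_paf_line(example)
ratio = match_ratio(record)
```

Without base-level alignment this ratio tends to be low for noisy long reads, which is consistent with the small identity threshold used for Minimap input in this tutorial.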
# The last step: improve the draft assembly using LRScaf.
>java -Xms100g -Xmx100g -jar LRScaf.jar -x ./scafconf.xml
# The content of "scafconf.xml" file:
# <scaffold>
# <input>
# <contig>./draft.fa</contig>
# <mm>./aln.mm</mm>
# </input>
# <output>./</output>
# <paras>
# <min_contig_length>1200</min_contig_length>
# <identity>0.1</identity>
# <min_overlap_length>960</min_overlap_length>
# <min_overlap_ratio>0.8</min_overlap_ratio>
# <max_overhang_length>1000</max_overhang_length>
# <max_overhang_ratio>0.1</max_overhang_ratio>
# <max_end_length>1000</max_end_length>
# <max_end_ratio>0.1</max_end_ratio>
# <min_supported_links>1</min_supported_links>
# <tip_length>1000</tip_length>
# <ratio>0.2</ratio>
# <repeat_mask>true</repeat_mask>
# <iqr_time>3</iqr_time>
# <mmcm>20</mmcm> <!--only for Minimap Alignment.-->
# <process>4</process>
# </paras>
# </scaffold>
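When several datasets are scaffolded with different settings, the configuration file can be generated instead of edited by hand. The following is a minimal Python sketch that writes the same element layout as the "scafconf.xml" shown above using the standard library. The function `build_scafconf` is a hypothetical helper and is not shipped with LRScaf; the element names are taken from this README.

```python
import xml.etree.ElementTree as ET


def build_scafconf(contig, aln_tag, aln_path, output, paras):
    """Return the text of a scafconf.xml-style document.

    aln_tag is the alignment element name (m5, m4, sam or mm);
    paras maps XML parameter names to their values.
    """
    root = ET.Element("scaffold")
    input_el = ET.SubElement(root, "input")
    ET.SubElement(input_el, "contig").text = contig
    ET.SubElement(input_el, aln_tag).text = aln_path
    ET.SubElement(root, "output").text = output
    paras_el = ET.SubElement(root, "paras")
    for name, value in paras.items():
        # Booleans are written in lowercase, as in the template file.
        if isinstance(value, bool):
            text = "true" if value else "false"
        else:
            text = str(value)
        ET.SubElement(paras_el, name).text = text
    return ET.tostring(root, encoding="unicode")


xml_text = build_scafconf(
    "./draft.fa", "mm", "./aln.mm", "./",
    {"min_contig_length": 1200, "identity": 0.1, "mmcm": 20, "repeat_mask": True},
)
```

The returned string can be written to a file and passed to LRScaf with the -x option.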
################################################################################
Parameters of LRScaf
################################################################################
LRScaf parameters can be set through an XML configuration file or on the command line. Using the XML configuration file is recommended. The project includes a template configuration file in XML format named "scafconf.xml". On the command line, LRScaf supports GNU-like options in long (dash-dash) and short (dash) style. The following table shows the meaning of each parameter and its default value, if available.
The first and second columns are the command-line parameters in long style and the corresponding short style.
The third column is the code in the XML configuration file. NA means the option is not available in the XML configuration file.
The fourth column gives the details and the default value of the option, if available.
Parameter | Abbreviation | XML Code | Details |
---|---|---|---|
xml | x | NA | The XML configuration file. All other command-line parameters are ignored if this is set. |
contig | c | contig | The contigs file of draft assembly in fasta format. |
m5 | m5 | m5 | The alignment file in -m 5 format of BLASR. |
m4 | m4 | m4 | The alignment file in -m 4 format of BLASR. |
sam | sam | sam | The alignment file in sam format of BLASR. |
mm | mm | mm | The alignment file in PAF format of Minimap. |
output | o | output | The output folder. |
miniCntLen | micl | min_contig_length | The minimum contig length to be included for scaffolding. Default: <200> bp. |
identity | i | identity | The identity threshold for filtering invalid alignments. Default: <0.8>. This value must be modified according to the mapper. For a BLASR alignment file, a higher value means a higher identity. For a Minimap alignment file, the value should not be larger than 0.3 and can be set to 0.1. |
miniOLLen | mioll | min_overlap_length | The minimum overlap length of contig. Default: <160> bp. |
miniOLRatio | miolr | min_overlap_ratio | The minimum overlap length ratio of contig. Default: <0.8>. If the overlap length is larger than miniOLLen, the overlap ratio is computed as overlap_length/contig_length. |
maOHLen | maohl | max_overhang_length | The maximum overhang length of contig. Default: <300> bp. |
maOHRatio | maohr | max_overhang_ratio | The maximum overhang ratio of contig. Default: <0.1>. If the overhang length is less than maohl, the overhang ratio is computed as overhang_length/contig_length. |
maELen | mael | max_end_length | The maximum ending length of long read. Default: <300> bp. |
maERatio | maer | max_end_ratio | The maximum ending ratio of long read. Default: <0.1>. The ending length (ending_len) is computed as long_read_length * maer, then def_ending_len = (mael >= ending_len ? ending_len : mael). |
miSLN | misl | min_supported_links | The minimum number of supporting links. Default: <1>. If the depth of long reads is less than 10x, misl can be set to 1. |
ratio | r | ratio | The ratio for deleting error prone edges in divergence nodes. Default: <0.2>. |
mr | mr | repeat_mask | The indicator for masking repeats. Default: <true>. Masking repeats reduces the number of divergent nodes in the scaffolding graph and improves the contiguity of assemblies. Keeping this set to true is recommended. |
tiplength | tl | tip_length | The maximum tip length. Default: <1500> bp. |
iqrtime | iqrt | iqr_time | The IQR times for setting contigs as repeats by their coverages. Default: <1.5>. |
mmcm | mmcm | mmcm | The parameter to filter invalid Minimap alignments. Default: <8>. Only for Minimap alignment. |
process | p | process | The number of threads. Default: <4>. |
help | h | NA | Print this help information. |
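
The length and ratio rules in the table can be made concrete with a small calculation. The Python sketch below mirrors the formula given for maer, def_ending_len = (mael >= ending_len ? ending_len : mael), and the contig ratios described for miolr and maohr. The function names are hypothetical, and how LRScaf applies these values internally when filtering alignments is not shown here.

```python
def effective_end_length(read_length, mael=300, maer=0.1):
    """Effective ending length of a long read: the smaller of the
    absolute limit (mael) and read_length * maer."""
    ending_len = read_length * maer
    return ending_len if mael >= ending_len else mael


def length_ratio(part_length, contig_length):
    """Ratio used for overlap and overhang: part_length / contig_length."""
    return part_length / contig_length


# A 10 kb read: 10000 * 0.1 = 1000 exceeds mael, so the limit is 300 bp.
long_read_limit = effective_end_length(10000)
# A 2 kb read: 2000 * 0.1 = 200 is below mael, so the limit is about 200 bp.
short_read_limit = effective_end_length(2000)
# A 960 bp overlap on a 1200 bp contig gives an overlap ratio of 0.8.
overlap_ratio = length_ratio(960, 1200)
```

With the default values, the absolute limit applies to reads longer than 3 kb and the ratio applies to shorter reads.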
################################################################################
XML Configuration File Content
################################################################################
<?xml version="1.0" encoding="UTF-8"?>
<scaffold>
<!--The input file for scaffolding, including contigs and aligned files (i.e. m5, m4 or mm file) -->
<input>
<contig>Draft assembly in fasta format.</contig>
<m4>The aligned file in BLASR -m 4 format.</m4>
</input>
<!-- The output folder for scaffolding -->
<output>The output folder.</output>
<!-- The parameters for scaffolding-->
<paras>
<!--More details are shown in README.md-->
<min_contig_length>500</min_contig_length>
<identity>0.8</identity>
<min_overlap_length>400</min_overlap_length>
<min_overlap_ratio>0.8</min_overlap_ratio>
<max_overhang_length>500</max_overhang_length>
<max_overhang_ratio>0.1</max_overhang_ratio>
<max_end_length>500</max_end_length>
<max_end_ratio>0.1</max_end_ratio>
<min_supported_links>2</min_supported_links>
<tip_length>1500</tip_length>
<ratio>0.2</ratio>
<repeat_mask>true</repeat_mask>
<iqr_time>3</iqr_time>
<mmcm>8</mmcm> <!--only for Minimap Alignment.-->
<process>4</process>
</paras>
</scaffold>
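Because parameter names in the XML file are easy to misspell, it can help to check a configuration against the "XML Code" column of the parameter table before starting a long run. The following is a minimal Python sketch of such a check. `KNOWN_PARAS` and `unknown_paras` are hypothetical names, and how LRScaf itself treats an unrecognized element is not documented here, so this is only a pre-flight aid.

```python
import xml.etree.ElementTree as ET

# Parameter element names copied from the "XML Code" column of the table above.
KNOWN_PARAS = {
    "min_contig_length", "identity", "min_overlap_length", "min_overlap_ratio",
    "max_overhang_length", "max_overhang_ratio", "max_end_length",
    "max_end_ratio", "min_supported_links", "ratio", "repeat_mask",
    "tip_length", "iqr_time", "mmcm", "process",
}


def unknown_paras(xml_text):
    """Return the element names under <paras> that are not in KNOWN_PARAS."""
    root = ET.fromstring(xml_text)
    paras = root.find("paras")
    if paras is None:
        return []
    return [child.tag for child in paras if child.tag not in KNOWN_PARAS]


# Made-up configuration containing one misspelled element name.
sample = """<scaffold>
  <input><contig>draft.fa</contig><m4>aln.m4</m4></input>
  <output>out</output>
  <paras>
    <identity>0.8</identity>
    <tip_len>1500</tip_len>
  </paras>
</scaffold>"""
problems = unknown_paras(sample)
```

An empty result means every element under paras matches a name from the table.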
################################################################################
Licence
################################################################################
LRScaf is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
If you have any questions, please feel free to contact me <qinmao@caas.cn>.