Skip to content

Latest commit

 

History

History
172 lines (110 loc) · 6.11 KB

split_fasta.md

File metadata and controls

172 lines (110 loc) · 6.11 KB

Split a FASTA file

Review of the main ways to split a FASTA file

tool language One sequence per file Can select nb of output files Can select nb seq by file Can select size of output files Overlap possible (when sequence cut) Can cut sequences Subsample possible Example Comment
awk awk yes no yes no no no no example
split bash yes no yes yes no no no example Fasta must be single line fasta (one header + one single sequence line)
bash bash yes no no no no no no example Individual files will have the name of the corresponding sequence, without leading >
gaas_fasta_splitter.pl from GAAS Perl yes yes yes no yes yes yes (stop when nb of files with the nb of seq asked reached) example
PyFasta Python yes yes no no yes yes NA example
pyfaidx Python yes no no no no no no example
GenomeTools Mostly C yes yes no yes no no no example
seqretsplit from EMBOSS C yes no no no no no no example
bp_seqretsplit.pl from Bioperl perl yes no no no no no no example
faSplit from Kent utils C yes yes no yes yes yes no example
partition.sh from BBMap Java no yes no no no no no example multithreaded
seqkit Go yes yes yes no no no yes (subsequence of given region) example
SEDA java yes yes yes no no no yes (randomizable) example GUI only. Using Independent extractions and Randomize options give the possibility to get sequences picked several times. There is an extra function called regular expression split (use of regex for selecting sequence by matching headers)

Example

Awk

size = chunk size pre = output file prefix pad = padding width (the width of the numeric suffix).

awk -v size=1000 -v pre=prefix -v pad=5 '
   /^>/ { n++; if (n % size == 1) { close(fname); fname = sprintf("%s.%0" pad "d", pre, n) } }
   { print >> fname }
' input.fasta

Split

split -l 2000 input.fasta

Bash

while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=${line#>}.fa
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done < myseq.fa

GAAS

split the fasta file into one file per sequence

gaas_fasta_splitter.pl -f input.fa --nb_seq_by_chunk 1

split the fasta file into files of 100 sequences

gaas_fasta_splitter.pl --nb_seq_by_chunk 100

split the fasta file into 10 files

gaas_fasta_splitter.pl --nb_chunks 10

split the fasta file into 10 files and cut the sequence in chunk of 1000000 bp

gaas_fasta_splitter.pl --nb_chunks 10 --size_seq 1000000

split the fasta file into 10 files and cut the sequence in chunk of 1000000 bp with overlap of 2000 bp

gaas_fasta_splitter.pl --nb_chunks 10 --size_seq 1000000 --overlap 2000

split the fasta file into 10 files of 20 sequences and the original sequences are cut in chunk of 1000000 bp with overlap of 2000 bp. If all the input data cannot be contained into the 10 files of 20 sequences, the output is actually a subsample of the input data.

gaas_fasta_splitter.pl --nb_chunks 10 --nb_seq_by_chunk 20 --size_seq 1000000 --overlap 2000

PyFasta

split a fasta file into 6 new files of relatively even size:

pyfasta split -n 6 original.fasta

split the fasta file into one new file per header with “%(seqid)s” being filled into each filename.:

pyfasta split –header “%(seqid)s.fasta” original.fasta

create 1 new fasta file with the sequence split into 10K-mers:

pyfasta split -n 1 -k 10000 original.fasta

2 new fasta files with the sequence split into 10K-mers with 2K overlap:

pyfasta split -n 2 -k 10000 -o 2000 original.fasta

pyfaidx

faidx --split-files original.fasta

GenomeTools

gt splitfasta -splitdesc multifastafile.fa

EMBOSS

seqretsplit input.fa

bp_seqretsplit

bp_seqretsplit file1 file2

Similar to:

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-format => 'fasta',
                         -fh   => \*ARGV);
while( my $s = $in->next_seq ) {
    my ($id) = ($s->id =~ /^(?:\w+)\|(\S+)\|/);
    Bio::SeqIO->new(-format => 'fasta',
                    -file   => ">".$id.".fasta")->write_seq($s);
}

faSplit

Break up scaffolds.fa using sequence names as file names (one sequence per file). Use the terminating / on the outRoot to get it to work correctly:

faSplit byname scaffolds.fa outRoot/

break up estAll.fa into 100 files (numbered est001.fa est002.fa, ... est100.fa Files will only be broken at fa record boundaries:

faSplit sequence estAll.fa 100 est

break up chr1.fa into 10 files:

faSplit base chr1.fa 10 1_

break up input.fa into 2000 base chunks:

faSplit size input.fa 2000 outRoot

break up est.fa into files of about 20000 bytes each by record:

faSplit about est.fa 20000 outRoot

Break up chrN.fa into files of at most 20000 bases each, at gap boundaries if possible. If the sequence ends in N's, the last piece, if larger than 20000, will be all one piece:

faSplit gap chrN.fa 20000 outRoot

BBMap

Split the fasta file into 5 files:

partition.sh in=file.fasta out=part%.fasta ways=5

Reference

https://www.biostars.org/p/229441/