Split a FASTA file

Review of the main ways to split a FASTA file

tool	language	One sequence per file	Can select nb of output files	Can select nb seq by file	Can select size of output files	Overlap possible (when sequence cut)	Can cut sequences	Subsample possible	Example	Comment
awk	awk	yes	no	yes	no	no	no	no	example
split	bash	yes	no	yes	yes	no	no	no	example	Fasta must be single line fasta (one header + one single sequence line)
bash	bash	yes	no	no	no	no	no	no	example	Individual files will have the name of the corresponding sequence, without leading >
gaas_fasta_splitter.pl from GAAS	Perl	yes	yes	yes	no	yes	yes	yes (stop when nb of files with the nb of seq asked reached)	example
PyFasta	Python	yes	yes	no	no	yes	yes	NA	example
pyfaidx	Python	yes	no	no	no	no	no	no	example
GenomeTools	Mostly C	yes	yes	no	yes	no	no	no	example
seqretsplit from EMBOSS	C	yes	no	no	no	no	no	no	example
bp_seqretsplit.pl from Bioperl	perl	yes	no	no	no	no	no	no	example
faSplit from Kent utils	C	yes	yes	no	yes	yes	yes	no	example
partition.sh from BBMap	Java	no	yes	no	no	no	no	no	example	multithreaded
seqkit	Go	yes	yes	yes	no	no	no	yes (subsequence of given region)	example
SEDA	java	yes	yes	yes	no	no	no	yes (randomizable)	example	GUI only. Using `Independent extractions` and `Randomize` options give the possibility to get sequences picked several times. There is an extra function called regular expression split (use of regex for selecting sequence by matching headers)

Example

Awk

size = chunk size pre = output file prefix pad = padding width (the width of the numeric suffix).

awk -v size=1000 -v pre=prefix -v pad=5 '
   /^>/ { n++; if (n % size == 1) { close(fname); fname = sprintf("%s.%0" pad "d", pre, n) } }
   { print >> fname }
' input.fasta

Split

split -l 2000 input.fasta

Bash

while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=${line#>}.fa
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done < myseq.fa

GAAS

split the fasta file into one file per sequence

gaas_fasta_splitter.pl -f input.fa --nb_seq_by_chunk 1

split the fasta file into files of 100 sequences

gaas_fasta_splitter.pl --nb_seq_by_chunk 100

split the fasta file into 10 files

gaas_fasta_splitter.pl --nb_chunks 10

split the fasta file into 10 files and cut the sequence in chunk of 1000000 bp

gaas_fasta_splitter.pl --nb_chunks 10 --size_seq 1000000

split the fasta file into 10 files and cut the sequence in chunk of 1000000 bp with overlap of 2000 bp

gaas_fasta_splitter.pl --nb_chunks 10 --size_seq 1000000 --overlap 2000

split the fasta file into 10 files of 20 sequences and the original sequences are cut in chunk of 1000000 bp with overlap of 2000 bp. If all the input data cannot be contained into the 10 files of 20 sequences, the output is actually a subsample of the input data.

gaas_fasta_splitter.pl --nb_chunks 10 --nb_seq_by_chunk 20 --size_seq 1000000 --overlap 2000

PyFasta

split a fasta file into 6 new files of relatively even size:

pyfasta split -n 6 original.fasta

split the fasta file into one new file per header with “%(seqid)s” being filled into each filename.:

pyfasta split –header “%(seqid)s.fasta” original.fasta

create 1 new fasta file with the sequence split into 10K-mers:

pyfasta split -n 1 -k 10000 original.fasta

2 new fasta files with the sequence split into 10K-mers with 2K overlap:

pyfasta split -n 2 -k 10000 -o 2000 original.fasta

pyfaidx

faidx --split-files original.fasta

GenomeTools

gt splitfasta -splitdesc multifastafile.fa

EMBOSS

seqretsplit input.fa

bp_seqretsplit

bp_seqretsplit file1 file2

Similar to:

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new(-format => 'fasta',
                         -fh   => \*ARGV);
while( my $s = $in->next_seq ) {
    my ($id) = ($s->id =~ /^(?:\w+)\|(\S+)\|/);
    Bio::SeqIO->new(-format => 'fasta',
                    -file   => ">".$id.".fasta")->write_seq($s);
}

faSplit

Break up scaffolds.fa using sequence names as file names (one sequence per file). Use the terminating / on the outRoot to get it to work correctly:

faSplit byname scaffolds.fa outRoot/

break up estAll.fa into 100 files (numbered est001.fa est002.fa, ... est100.fa Files will only be broken at fa record boundaries:

faSplit sequence estAll.fa 100 est

break up chr1.fa into 10 files:

faSplit base chr1.fa 10 1_

break up input.fa into 2000 base chunks:

faSplit size input.fa 2000 outRoot

break up est.fa into files of about 20000 bytes each by record:

faSplit about est.fa 20000 outRoot

Break up chrN.fa into files of at most 20000 bases each, at gap boundaries if possible. If the sequence ends in N's, the last piece, if larger than 20000, will be all one piece:

faSplit gap chrN.fa 20000 outRoot

BBMap

Split the fasta file into 5 files:

partition.sh in=file.fasta out=part%.fasta ways=5

Reference

https://www.biostars.org/p/229441/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

split_fasta.md

split_fasta.md

Split a FASTA file

Review of the main ways to split a FASTA file

Example

Awk

Split

Bash

GAAS

PyFasta

pyfaidx

GenomeTools

EMBOSS

bp_seqretsplit

faSplit

BBMap

Reference

Files

split_fasta.md

Latest commit

History

split_fasta.md

File metadata and controls

Split a FASTA file

Review of the main ways to split a FASTA file

Example

Awk

Split

Bash

GAAS

PyFasta

pyfaidx

GenomeTools

EMBOSS

bp_seqretsplit

faSplit

BBMap

Reference