A SEDA pipeline created in Compi that implements the "Preparing datasets for large scale phylogenetic analyses" SEDA-based protocol. Created using the SEDA-Compi pipelines framework.
This protocol shows how to retrieve and process a large amount of coding sequences of a given gene. The portrayed example concerns GULO, a gene that encodes the protein that catalyzes the final oxidation step of the Vitamin C biosynthetic pathway in animals (http://doi.org/10.7554/eLife.06369).
It is recommended to run this quick-start example to check that everything works fine:
- Download this ZIP and decompress it. The path where it is extracted will be referred to as the "working directory" (`/path/to/working_dir`).
- Move to the working directory and run `./run.sh "$(pwd)"`. This will run the entire pipeline with eight input files, four from RefSeq and four from GenBank.
After running the quick-start, remove the `output` folder that was created in the working directory. Then, download the input data as explained here. GenBank data must be placed in `input/rename-ncbi_1` and RefSeq data in `input/rename-ncbi_2`.
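The layout described above can be sketched as follows; the directory names come from the pipeline itself, while the FASTA file name patterns are hypothetical examples of downloaded data:

```shell
# Create the input directories the pipeline expects inside the working directory.
WORKING_DIR="$(pwd)"
mkdir -p "$WORKING_DIR/input/rename-ncbi_1" "$WORKING_DIR/input/rename-ncbi_2"

# GenBank files go in input/rename-ncbi_1 and RefSeq files in input/rename-ncbi_2,
# e.g. (hypothetical file names):
# mv genbank_*.fasta "$WORKING_DIR/input/rename-ncbi_1/"
# mv refseq_*.fasta  "$WORKING_DIR/input/rename-ncbi_2/"
```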
Before running the pipeline, you can:
- Edit the `compi.params` file to change the batch size of the four "NCBI rename" (`rename-ncbi`) and "BLAST" (`blast`) operations. The batch size is the maximum number of files each SEDA command will process at the same time. If not provided, the command will attempt to process all files at the same time. The provided values are appropriate in most cases (workstations with 8-16 GB of RAM). For higher values, the amount of RAM that SEDA can use must be increased.
- Change the amount of RAM that SEDA can use by exporting the `SEDA_JAVA_MEMORY` variable. This must be done before running the `run.sh` script (e.g. `export SEDA_JAVA_MEMORY="-Xmx8G"`).
- Reduce disk usage (at the cost of increased running time) by creating the files `params/rename-ncbi_1.cliParams` and `params/rename-ncbi_2.cliParams` with the content `--output-gzip`. This way, the outputs of these operations will be compressed using GZIP, reducing the amount of disk space they require.
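The optional pre-run configuration steps above can be sketched together as follows; the `SEDA_JAVA_MEMORY` variable and the `.cliParams` file paths come from this pipeline, while the 8 GB heap size is just an example value:

```shell
# Give SEDA an 8 GB Java heap; must be exported before running run.sh.
export SEDA_JAVA_MEMORY="-Xmx8G"

# Compress the outputs of the two "NCBI rename" tasks with GZIP
# by passing --output-gzip to the corresponding SEDA commands.
mkdir -p params
echo "--output-gzip" > params/rename-ncbi_1.cliParams
echo "--output-gzip" > params/rename-ncbi_2.cliParams
```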
And now, the full pipeline can be executed with `./run.sh "$(pwd)"`.
To run specific tasks, an additional parameter can be passed to the `run.sh` script: `./run.sh "$(pwd)" "--single-task rename-ncbi_1"` or `./run.sh "$(pwd)" "--until merge_3"`.