sr2silo can convert millions of short-read nucleotide reads, stored as CIGAR alignments in a .bam file, to cleartext alignments. It also gracefully extracts insertions and deletions. Optionally, sr2silo can translate and align each read in amino acids using Diamond / BLASTX, again handling insertions and deletions.
Your input is a .bam/.sam file in which each line looks like:
294 163 NC_045512.2 79 60 31S220M = 197 400 CTCTTGTAGAT FGGGHHHHLMM ...
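To illustrate what converting a CIGAR alignment to a cleartext alignment involves, here is a minimal sketch in Python. It is not sr2silo's implementation: the function name `cigar_to_cleartext`, the padding character `N`, the gap character `-`, and the convention of reporting an insertion at the 1-based reference position preceding it are all assumptions made for this example. In the line above, `31S220M` at position 79 would mean 31 soft-clipped bases followed by 220 aligned bases.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_to_cleartext(pos, cigar, seq, ref_length, pad="N", gap="-"):
    """Expand a read into reference coordinates (hypothetical helper).

    pos is the 1-based leftmost mapping position (SAM POS field).
    Returns the padded aligned sequence and a list of insertions as
    (1-based reference position before the insertion, inserted bases).
    """
    aligned = []
    insertions = []
    read_idx = 0        # next unread base in seq
    ref_idx = pos - 1   # number of reference bases to the left of the cursor
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":      # consumes read and reference
            aligned.append(seq[read_idx:read_idx + length])
            read_idx += length
            ref_idx += length
        elif op == "I":      # consumes read only: record as an insertion
            insertions.append((ref_idx, seq[read_idx:read_idx + length]))
            read_idx += length
        elif op in "DN":     # consumes reference only: emit gap characters
            aligned.append(gap * length)
            ref_idx += length
        elif op == "S":      # soft clip: bases are in seq but not aligned
            read_idx += length
        # H and P consume neither the read nor the reference
    body = "".join(aligned)
    left = pad * (pos - 1)
    right = pad * (ref_length - (pos - 1) - len(body))
    return left + body + right, insertions


# Toy read: 2 soft-clipped bases, one insertion, and one deletion.
aligned, insertions = cigar_to_cleartext(3, "2S3M1I2M1D2M", "TTACGTACGA", 12)
print(aligned)
print(insertions)
```

The padded string has the same length as the reference, which is the shape shown under "alignedNucleotideSequences" in the mock output.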
For each read, sr2silo outputs a JSON object (mock output):
{
"metadata":{
"read_id":"AV233803:AV044:2411515907:1:10805:5199:3294",
...
},
"nucleotideInsertions":{
"main":[10 : ACTG]
},
"aminoAcidInsertions":{
"E":[],
...
"ORF1a":[2323 : TG, 2389 : CA],
...
"S":[23 : A]
},
"alignedNucleotideSequences":
{
"main":"NNNNNNNNNNNNNNNNNNCGGTTTCGTCCGTGTTGCAGCCG...GTGTCAACATCTTAAAGATGGCACTTGTGNNNNNNNNNNNNNNNNNNNNNNNN"
},
"unalignedNucleotideSequences":{
"main":"CGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTCCGGGTGTGA...TACAGGTTCGCGACGTGCTCGTGTGAAAGATGGCACTTGTG"
},
"alignedAminoAcidSequences":{
"E":"",
...
"ORF1a":"...XXXMESLVPGFNEKTHVQLSLPVLQVRVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVXXXXXX...",
...
"S":""}
}
The complete output is stored in an .ndjson.zst file.
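Since NDJSON stores one JSON object per line, the output can be consumed record by record once it has been decompressed. The sketch below assumes that each real output line is valid JSON with the keys shown in the mock output above; the helper name `iter_reads` and the two sample records are invented for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_reads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty NDJSON line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up, already decompressed lines that mimic the structure of the mock output.
sample_lines = [
    '{"metadata": {"read_id": "read-1"}, "alignedNucleotideSequences": {"main": "NNACGTNN"}}',
    "",
    '{"metadata": {"read_id": "read-2"}, "alignedNucleotideSequences": {"main": "NACGTTNN"}}',
]

records = list(iter_reads(sample_lines))
read_ids = [record["metadata"]["read_id"] for record in records]
print(read_ids)
```

For a real file, the lines would first have to be decompressed with a Zstandard reader (for example the third-party `zstandard` package); that step is left out here to keep the sketch dependency-free.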
When running sr2silo, particularly the import-to-loculus command, be aware of memory and storage requirements:
- Standard configuration uses 8GB RAM and one CPU core
- Processing batches of 100k reads requires ~3GB RAM plus ~3GB for Diamond
- Temporary storage needs (especially on clusters) can reach 30-50GB
For detailed information about resource requirements, especially for cluster environments, please refer to the Resource Requirements documentation.
Originally, this project was started to wrangle short-read genomic alignments from wastewater sampling into a format for easy import into Loculus and its sequence database SILO.
sr2silo is designed to process nucleotide alignments from .bam files with metadata, translate and align reads in amino acids, gracefully handle all insertions and deletions, and upload the results to the backend LAPIS-SILO.
For the V-Pipe to Silo implementation we carry through the following metadata:
"metadata":{
"read_id":"AV233803:AV044:2411515907:1:10805:5199:3294",
"sample_id":"A1_05_2024_10_08",
"batch_id":"20241024_2411515907",
"sampling_date":"2024-10-08",
"sequencing_date":"2024-10-24",
"location_name":"Lugano (TI)",
"read_length":"250",
"primer_protocol":"v532",
"location_code":"05",
"flow_cell_serial_number":"2411515907",
"sequencing_well_position":"A1",
"primer_protocol_name":"SARS-CoV-2 ARTIC V5.3.2",
"nextclade_reference":"sars-cov-2"
}
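In the example above, several metadata fields appear to be encoded in the sample and batch IDs: `A1_05_2024_10_08` matches the well position, location code, and sampling date, and `20241024_2411515907` matches the sequencing date and flow cell serial number. The following sketch splits the IDs under that assumption. The naming scheme is inferred from this single example rather than confirmed, and the function names are hypothetical.

```python
from datetime import date, datetime


def parse_sample_id(sample_id: str) -> dict:
    """Split an ID like 'A1_05_2024_10_08' (assumed: well_location_YYYY_MM_DD)."""
    well, location_code, year, month, day = sample_id.split("_")
    return {
        "sequencing_well_position": well,
        "location_code": location_code,
        "sampling_date": date(int(year), int(month), int(day)).isoformat(),
    }


def parse_batch_id(batch_id: str) -> dict:
    """Split an ID like '20241024_2411515907' (assumed: YYYYMMDD_flowcell)."""
    raw_date, flow_cell = batch_id.split("_")
    sequencing_date = datetime.strptime(raw_date, "%Y%m%d").date()
    return {
        "sequencing_date": sequencing_date.isoformat(),
        "flow_cell_serial_number": flow_cell,
    }


sample_fields = parse_sample_id("A1_05_2024_10_08")
batch_fields = parse_batch_id("20241024_2411515907")
print(sample_fields)
print(batch_fields)
```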
To build the package and maintain dependencies, we use Poetry. We recommend installing it and becoming familiar with its basic functionality by reading its documentation.
sr2silo can be installed either from Bioconda or from source.
The easiest way to install sr2silo is through the Bioconda channel:
# Add necessary channels if you haven't already
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# Install sr2silo
conda install sr2silo
For development purposes, or to install the latest version, you can install from source using Poetry.
The project uses a modular environment system to separate core functionality, development requirements, and workflow dependencies. Environment files are located in the environments/ directory.
For basic usage of sr2silo:
make setup
This creates the core conda environment with essential dependencies and installs the package using Poetry.
For development work:
make setup-dev
This command sets up the development environment with Poetry.
For working with the snakemake workflow:
make setup-workflow
This creates an environment specifically configured for running sr2silo in Snakemake workflows.
You can set up all environments at once:
make setup-all
After setting up the development environment:
conda activate sr2silo-dev
poetry install --with dev
poetry run pre-commit install
To run the tests:
make test
or
conda activate sr2silo-dev
pytest
The sr2silo CLI has two main commands:
- run: Not yet implemented; reserved for future functionality
- import-to-loculus: Convert BAM alignments to SILO format and optionally upload
The main command you'll use is import-to-loculus:
sr2silo import-to-loculus \
--input-file INPUT.bam \
--sample-id SAMPLE_ID \
--batch-id BATCH_ID \
--timeline-file TIMELINE.tsv \
--primer-file PRIMERS.yaml \
--output-fp OUTPUT.ndjson \
--reference sars-cov-2
- --input-file, -i: Path to the input BAM alignment file
- --sample-id, -s: Sample ID to use for metadata
- --batch-id, -b: Batch ID to use for metadata
- --timeline-file, -t: Path to the timeline metadata file
- --primer-file, -p: Path to the primers configuration file
- --output-fp, -o: Path for the output file (will be auto-suffixed with .ndjson.zst)
- --reference, -r: Reference genome to use (default: "sars-cov-2")
- --upload/--no-upload: Whether to upload results to S3 and submit to SILO (default: no-upload)
Here's a complete example with sample data:
sr2silo import-to-loculus \
--input-file ./data/sample/alignments/REF_aln_trim.bam \
--sample-id "A1_05_2024_10_08" \
--batch-id "20241024_2411515907" \
--timeline-file ./data/timeline.tsv \
--primer-file ./data/primers.yaml \
--output-fp ./results/output.ndjson \
--reference sars-cov-2
To also upload the results to SILO, add the --upload flag:
sr2silo import-to-loculus \
# ...same arguments as above... \
--upload
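When sr2silo is called from a Python script rather than a shell, the same command can be assembled as an argument list. The sketch below uses only the flags documented above. The helper `build_import_command` is hypothetical and not part of sr2silo, and actually running the command assumes `sr2silo` is installed and on the PATH.

```python
import shlex


def build_import_command(
    input_file,
    sample_id,
    batch_id,
    timeline_file,
    primer_file,
    output_fp,
    reference="sars-cov-2",
    upload=False,
):
    """Assemble the argv list for `sr2silo import-to-loculus` (hypothetical helper)."""
    cmd = [
        "sr2silo", "import-to-loculus",
        "--input-file", str(input_file),
        "--sample-id", sample_id,
        "--batch-id", batch_id,
        "--timeline-file", str(timeline_file),
        "--primer-file", str(primer_file),
        "--output-fp", str(output_fp),
        "--reference", reference,
    ]
    # State the upload choice explicitly instead of relying on the default.
    cmd.append("--upload" if upload else "--no-upload")
    return cmd


cmd = build_import_command(
    "./data/sample/alignments/REF_aln_trim.bam",
    "A1_05_2024_10_08",
    "20241024_2411515907",
    "./data/timeline.tsv",
    "./data/primers.yaml",
    "./results/output.ndjson",
)
print(shlex.join(cmd))

# To execute it (requires sr2silo to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with sample IDs or paths that contain spaces.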
The code quality checks that run on GitHub for the Python package CI/CD are defined in .github/workflows/test.yml.
We are using:
- Ruff to lint the code.
- Black to format the code.
- Pyright to check the types.
- Pytest to run the unit tests for the code and workflows.
- Interrogate to check the documentation.
This project welcomes contributions and suggestions. For details, visit the repository's Contributor License Agreement (CLA) and Code of Conduct pages.