snippy.qmd

# Find Sequence Variants {#sec-snippy}

[`snippy`](https://github.com/tseemann/snippy) is a sequence variant finding pipeline, designed for haploid genomes such as prokaryotes and viruses. It uses `BWA-MEM` to align reads to the reference genome, but it also performs a number of filtering and statistical steps to improve the quality of SNP identification.

- [`snippy` documentation](https://github.com/tseemann/snippy)

As with most kinds of bioinformatics data, there is a data format designed specifically to capture the important elements of the data, and `snippy` outputs its results in this format (called **VCF**) as well as in a data table.

::: { .callout-important }
## Good practice

When reporting how you identified variants for your manuscript or dissertation, you should always state:

1. The software tool you used, with its version number and a citation of the paper describing it (if available; provide a URL to the software if there is no paper)
2. The parameters used when running the tool (if default parameters were used, state this)
:::

## Using `snippy`

1. Navigate to the `snippy` tool
2. Select `snippy`
3. Select `Use a genome from history and build index` from `Will you select a reference genome…`
4. Choose the `GCF_000006765.1_ASM676v1_genomic.fna` file in `Use the following dataset as the reference sequence`
5. Select `Paired` under `Single or Paired-end reads`
6. Choose the cleaned post-`trimmomatic` `(R1 paired)` data for the forward reads
7. Choose the cleaned post-`trimmomatic` `(R2 paired)` data for the reverse reads
8. Click `Run Tool`

::: { .callout-note }
## Video: Identifying sequence variants using `snippy`

{{< video assets/movies/galaxy-14-snippy.mp4 title="Identifying sequence variants using `snippy`" >}}
:::

::: { .callout-caution }
**`snippy` can take a few minutes to run to completion.**
:::

## `snippy` Output

`snippy` provides output in `.vcf` format, which is designed to be processed by bioinformatics software and is not the most understandable form for humans. To see the `snippy` output in a more human-readable form, click on the `eye` icon for the `snps table` output (@fig-snippy-output).


![The `snippy` `snps table` output. Key columns in the output are: column 2, indicating the position of the variant; column 4, indicating the reference base(s); column 5, indicating the dominant variant in the mapped data; and column 6, showing the counts of each variant in the mapped data.](assets/images/snippy_table.png){#fig-snippy-output width="80%"}

::: { .callout-note }
## Video: Inspecting `snippy` output

{{< video assets/movies/galaxy-15-snippy.mp4 title="Inspecting `snippy` output" >}}
:::

::: { .callout-questions }
1. How many variants are there in the SNPs table? How does this compare with your estimate from `JBrowse` visualisation?
2. Can you find SNPs from the `snippy` output manually using JBrowse in your earlier visualisation?
3. Can you find any SNPs that are different or inconsistent between the `JBrowse` and `snippy` outputs?

::: { .callout-tip collapse="true" }
## Hint 1

Inspect the SNPs table output box. How many lines are in the table? I estimated about 49,000 SNPs from visual inspection of `JBrowse`, which was much greater than the ≈22,000 reported by `snippy`.
:::

::: { .callout-caution collapse="true" }
## Explaining your observations

Bioinformatic tools that perform the same kind of task may differ in exactly how they carry out that task.
This can result in them generating output that is not identical, even if they claim to calculate the same kind of value or make the same kind of measurement.

Even different versions of the same tool might differ and produce different output, so **it is good practice to explain exactly which version of a software tool you use in your work**.
:::

:::

::: { .callout-challenge }
The PAO1 isolate was sequenced at the turn of the millennium (@Stover2000-fp) and is a common laboratory reference straiin.
It would be useful to map your isolate's reads against an assembled genome from your study.
You have one such genome assembly in `ERR531413_asm.fasta`

1. Map your `ERR531380` isolate reads to the `ERR531413_asm.fasta` assembly using `snippy`.
2. How many sequence variants are there between the two genomes?
3. Why might you think that your study isolate shares a more, or less, recent common ancestor with `ERR531413` than with PAO1?
:::

## Next Steps

In the next section you will assemble the genome of your isolate. The resulting sequence will be informative for tracing the likely routes of transmission in the hospital ward.