Skip to content

Changing References

Adam Novak edited this page Nov 15, 2023 · 13 revisions

Some vg formats, like GBZ, distinguish between paths of different types, or "senses".

For example, we can put a P-line path chr1 and a haplotype for sample sample on contig chr1 into a GBZ file:

cat >demo.gfa <<EOF
H	VN:Z:1.1
S	1	GATTACT
S	2	A
S	3	T
S	4	CATTAG
L	1	+	2	+	*
L	1	+	3	+	*
L	2	+	4	+	*
L	3	+	4	+	*
P	chr1	1+,2+,4+	*
W	sample1	0	chr1	0	14	>1>3>4
W	sample2	0	chr1	0	14	>1>2>4
EOF
vg gbwt -G demo.gfa --gbz-format -g demo.gbz
rm demo.gfa

Then we can inspect the path metadata as a TSV with vg paths -M:

vg paths -M -x demo.gbz

The result will be:

#NAME	SENSE	SAMPLE	HAPLOTYPE	LOCUS	PHASE_BLOCK	SUBRANGE
chr1	GENERIC	NO_SAMPLE_NAME	NO_HAPLOTYPE	chr1	NO_PHASE_BLOCK	NO_SUBRANGE
sample1#0#chr1#0	HAPLOTYPE	sample1	0	chr1	0	NO_SUBRANGE
sample2#0#chr1#0	HAPLOTYPE	sample2	0	chr1	0	NO_SUBRANGE

Changing at GBZ Build Time

Tools like vg surject won't operate on haplotype paths by default. If we want to change the graph to make a haplotype path into a reference path, we can apply a transformation when converting formats to promote haplotypes for a sample:

vg convert -a --ref-sample sample1 demo.gbz >demo-promoted.vg

If we check the path metadata again:

vg paths -M -x demo-promoted.vg

The result will be:

#NAME	SENSE	SAMPLE	HAPLOTYPE	LOCUS	PHASE_BLOCK	SUBRANGE
chr1	GENERIC	NO_SAMPLE_NAME	NO_HAPLOTYPE	chr1	NO_PHASE_BLOCK	NO_SUBRANGE
sample1#0#chr1	REFERENCE	sample1	0	chr1	NO_PHASE_BLOCK	NO_SUBRANGE
sample2#0#chr1#0	HAPLOTYPE	sample2	0	chr1	0	NO_SUBRANGE

Note that the sense of the path for sample sample1 has changed to reference, and that its haplotype and phase block information have been removed. (It is safe to remove the haplotype because it is 0, the value used for a haploid haplotype. If we were using a diploid sample, 1 and 2 would have been used for the haplotypes, and they would have been preserved.)

If we decide we don't want the old original chr1 path in there anymore, we can remove all paths with a chr1 prefix:

vg paths -x demo-promoted.vg -d -Q chr1 >demo-removed.vg

Then we can list the path names:

vg paths -L -x demo-removed.vg

And we will see that it is no longer there:

sample1#0#chr1
sample2#0#chr1#0

Converting the resulting graph back to GBZ format, for use with vg giraffe, can be done like this:

vg gbwt -x demo-removed.vg --index-paths -g demo-reindexed.gbz --gbz-format 

Then we can look for generic paths:

vg paths -L -G -x demo-reindexed.gbz | wc -l

And we will see that there aren't any:

0

Large GBZ and GBWT files with Many Haplotypes

When there are a lot of haplotypes stored in a GBWT or GBZ, it can be infeasible to convert to .vg format and back to manipulate the references. In this case, you can edit the GBWT or GBZ file to adjust how the stored paths are interpreted.

In a GBWT or GBZ, some samples have reference paths, and other samples have haplotype paths. The samples which have reference paths are stored in a reference_samples tag in the GBWT tags section, analogous to the RS GFA header tag that vg supports. This tag contains a space-separated list of reference sample names (e.g. RS:Z:GRCh38 CHM13). See File Formats for further information.

If we look at this tag in the GBZ file with a reference that we just prepared:

vg gbwt -Z --tags demo-reindexed.gbz | grep "^reference_samples"

The result will be:

reference_samples	sample1

We can adjust the tag and write a modified GBZ file:

vg gbwt -Z --set-tag "reference_samples=sample2" --gbz-format -g demo-retagged.gbz demo-reindexed.gbz

Starting from version 1.53.0, the --set-tag "reference_samples=sample2 option can be replaced with --set-reference sample2.

Note that the sample name should not contain any # characters; these are used in PanSN sequence names to delimit sample, phase, and contig portions of the name, and additionally in vg to append a phase block number to haplotypes. (See Path Metadata Model.) If your graph has a path HG002#1#ctg1234#0 which you would like to make into a reference path, add the tag reference_samples=HG002.

After that, if we inspect the path metadata:

vg paths -M -x demo-retagged.gbz

The result will be:

#NAME	SENSE	SAMPLE	HAPLOTYPE	LOCUS	PHASE_BLOCK	SUBRANGE
sample1#0#chr1#0	HAPLOTYPE	sample1	0	chr1	0	NO_SUBRANGE
sample2#0#chr1	REFERENCE	sample2	0	chr1	NO_PHASE_BLOCK	NO_SUBRANGE

Note that path name has changed to reflect its new status as a reference. Haplotypes have a #0 (or #1, #2, etc.) at the end to denote "phase block", in case the entire underlying contig is not phased together in the sample. Reference paths do not have this, but can instead have a trailing bracketed number to indicate that they represent only part of a full contig, starting at that offset (as would be found in an extracted subgraph or if part of the graph is removed). See Path Metadata Model for more information on how these names are generated for GBZ graphs from the GBZ's structured metadata.

Multiple References

Instead of a single reference sample name, we can use a space-separated list of them:

vg gbwt -Z --set-tag "reference_samples=sample1 sample2" --gbz-format -g demo-multi.gbz demo-reindexed.gbz

Starting from version 1.53.0, you can instead use the --set-reference option multiple times: --set-reference sample1 --set-reference sample2.

If we inspect the path metadata once again:

vg paths -M -x demo-multi.gbz

The result will be:

#NAME	SENSE	SAMPLE	HAPLOTYPE	LOCUS	PHASE_BLOCK	SUBRANGE
sample1#0#chr1	REFERENCE	sample1	0	chr1	NO_PHASE_BLOCK	NO_SUBRANGE
sample2#0#chr1	REFERENCE	sample2	0	chr1	NO_PHASE_BLOCK	NO_SUBRANGE

You can use this to allow you to vg surject into paths from different references using the same GBZ, or to vg annotate -p a set of GAM reads with their positions along paths from multiple samples. But, be careful when using very large numbers of references in a GBZ. There is per-reference startup overhead when loading a GBZ, because each path in each reference sample needs to be traced to build position lookup data.

Cleanup

To clean up:

rm demo.gbz demo-promoted.vg demo-removed.vg demo-reindexed.gbz demo-retagged.gbz
Clone this wiki locally