Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When importing bcf/vcf to plink, alleles can get flipped. #321

Open
rajwanir opened this issue Sep 13, 2024 · 1 comment
Open

When importing bcf/vcf to plink, alleles can get flipped. #321

rajwanir opened this issue Sep 13, 2024 · 1 comment

Comments

@rajwanir
Copy link

rajwanir commented Sep 13, 2024

          > > No observable difference with standard GTC input.

I don't observe any difference in the QC_Report.docx. But we do see differences in the corresponding Plink files created (BED/BIM/FAM). For example:

$ diff samples.bim ../../020a/sample_level/samples.bim | head
5c5
< 1     rs3131972       0       817341  G       A
---
> 1     rs3131972       0       817341  A       G

We see the alleles flipped. Can you address this?

This seems a flip on the gtc-to-vcf workflow when it imports to plink rather than on bcf entry point with bcf to plink. In the bcf entry point I have protected the allele order with --keep-allele-order switch so allele order would always be preserved from bcf to plink conversion. The bcf/vcf encodes alleles in the REF/ALT space but plink bim file encodes in the major/minor space. If we want to preserve the VCF allele order and go with the typical notion that REF allele is major allele. So,

vcf/bcf record (CHROM ID REF ALT):
chr1 rs3131972 A G
should translated into bim record (CHROM ID CM POS ALLELE1(minor) ALLELE2 (major)):
1 rs3131972 0 817341 G A

bim format is described here: https://www.cog-genomics.org/plink/1.9/formats#bim

Another simpler and emperical way to verify this is if you re-export samples_level/samples.bim to vcf, you would see the orignal allele order as in vcf.

You can reexport plink bed/bim/fam to vcf.gz with the following:
plink --bfile bcf_out/samples --recode vcf-iid bgz --out bcf_out/samples --allow-extra-chr --keep-allele-order --const-fid 0
And compare the plink rexported with the orignal input like this:
diff <(bcftools query -f"%ID %REF %ALT" /scratch/marksja/samples.bcf |sort) <(bcftools query -f"%ID %REF %ALT" bcf_out/samples.vcf.gz |sort)

You should expect to see no difference except for multi-allellics not imported which is not supported by plink/1.9 and would need to be fixed upstream in preparation of bcf.

Overall, I can open a separate issue and we can adress this more holistically throughout the pipeline in a different PR. However, for the BCF entry point, I see the transformation is as expected with no flip.

Feel free to let me know if this seems acceptable.

Originally posted by @rajwanir in #314 (comment)

@rajwanir
Copy link
Author

We will need to:

  1. Verify that gtc-to-vcf also includes the --keep-allele-order switch.
  2. Also may be worth bring up in meeting with all users to ensure which allele order would they prefer to have in the output. REF/ALT manner or MAJOR/MINOR. If --keep-allele-order is not added at the import, plink will switch alleles depending on the dataset cohort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant