-doSaf retains no site with -sites #385

Closed
kwiyounghan opened this issue Mar 16, 2021 · 16 comments · Fixed by #533

Comments

@kwiyounghan

kwiyounghan commented Mar 16, 2021

Hi

I am trying to estimate the SFS from whole-genome resequencing data: 16 BAM files with very uneven coverage.
The reference sequence is around 0.7 Gb.

I am running with angsd version: 0.933 (htslib: 1.9) build(May 6 2020 21:25:11)

To get the SFS estimations, my workflow looked like

  1. Filter sites based on SNP filters or other quality filters
  2. save it as a site file then index the site file
  3. run -doSaf 1

SFS1 select sites
$ FILTERS="-uniqueOnly 1 -minMapQ 30 -minQ 30 -minInd 12 -SNP_pval 1e-6 -skipTriallelic 1 -sb_pval 1e-5"
$ TODO="-doMajorMinor 1 -doMaf 1 -dosnpstat 1 -doHWE 1"
$ angsd -b bams -GL 1 -anc $REFGEN -P 32 -out SFS1_sites1 $TODO $FILTERS

-> Total number of sites analyzed: 601,095,263
-> Number of sites retained after filtering: 7,221,984

  2. saving and indexing sites that pass the filters (excluding mitochondrion and unplaced contigs)
    $ zcat SFS1_sites1.mafs.gz | cut -f 1,2 | tail -n +2 | grep "NC.044" > SFS1_sites1

then, to only get chr1,

$ grep 'NC_044048.1' SFS1_sites1 > SFS1_sites1_chr1
$ angsd sites index SFS1_sites1_chr1

$ head SFS1_sites1_chr1
NC_044048.1 60
NC_044048.1 66
NC_044048.1 114
NC_044048.1 116

SFS1_sites1_chr1 contains 311,175 sites (only the filtered sites on the first chromosome of the reference sequence).
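
The site-selection step above can be sketched as a single shell function. This is a minimal sketch, assuming the .mafs.gz file is tab-delimited with one header line and the chromosome and position in its first two columns (as the cut/tail commands above imply). mafs_to_sites is a hypothetical helper name, not part of ANGSD.

```shell
# Hypothetical helper: extract "chromosome<TAB>position" pairs for one
# chromosome from an ANGSD .mafs.gz file, skipping the header line.
mafs_to_sites() {
    mafs_gz=$1   # e.g. SFS1_sites1.mafs.gz
    chrom=$2     # e.g. NC_044048.1
    gzip -dc "$mafs_gz" |
        awk -F'\t' -v OFS='\t' -v chr="$chrom" \
            'NR > 1 && $1 == chr { print $1, $2 }'
}

# Usage (the index step needs angsd itself):
#   mafs_to_sites SFS1_sites1.mafs.gz NC_044048.1 > SFS1_sites1_chr1
#   angsd sites index SFS1_sites1_chr1
```

Matching the chromosome name exactly in awk avoids the regex behaviour of grep, where the "." in a pattern such as NC_044048.1 matches any character.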

  3. -doSaf 1 with -sites

$ FILTERS=""
$ TODO="-doSaf 1 -doMajorMinor 1 -doMaf 1 -dosnpstat 1 -doHWE 1"
$ angsd -b bams -GL 1 -anc $REFGEN -P 4 -sites SFS1_sites1_chr1 -out SFS3_SAF5 $TODO

The problem is that the run doesn't retain any sites in the end:
-> Total number of sites analyzed: 645625341
-> Number of sites retained after filtering: 0

Here the number of sites retained after filtering is 0.

Also, when I try to print the result file with realSFS:
$ realSFS print SFS3_SAF5.saf.idx
-> Version of fname:SFS3_SAF5.saf.idx is:2
[safreader.cpp.persaf_init():112] Problem reading data: SFS3_SAF5.saf.idx

realSFS cannot read the output file.

I have played around with different options to figure out what the problem is: other filters, with and without -r or -rf, with and without TODOs other than -doSaf, and different combinations of these.
All the runs behave more or less the same way.

Then I realized that when I apply filters directly with -doSaf, without -sites, angsd gives me different results with different issues.

$ FILTERS="-uniqueOnly 1 -minMapQ 30 -minQ 30 -minInd 12 -SNP_pval 1e-6 -skipTriallelic 1 -sb_pval 1e-5"
$ TODO="-doSaf 1 -doMajorMinor 1 -doMaf 1 -dosnpstat 1 -doHWE 1"
$ angsd -b bams -GL 1 -anc $REFGEN -P 4 -r NC_044048.1 -out SFS3_SAF7 $TODO

-> Total number of sites analyzed: 29600590
-> Number of sites retained after filtering: 29584867

Here, barely any sites get filtered out, which shouldn't be the case.
Please note that, based on step 1) above, I know a lot more sites should be filtered out.

Also, when I try to read the output, realSFS throws an error:
$ realSFS print SFS3_SAF7.saf.idx
-> Version of fname:SFS3_SAF7.saf.idx is:2
[safreader.cpp->persaf_init():117] Problem reading data: SFS3_SAF7.saf.idx

There seem to be several issues here.
With -sites, -doSaf doesn't retain any sites in the end.
Without -sites, sites don't get filtered out based on the filters I applied.
In both cases, realSFS can't read the output files.
Are these caused by bugs in the program?
Or is there any way to fix this?

Thanks,
Kwi

@James-S-Santangelo

Hey Kwi,

Out of curiosity are you using ANGSD installed through Conda (see here)? I've noticed at least 2 of these issues (error reading the SAF files and 0 sites retained when using -sites) when using ANGSD v0.933 (HTSLIB v.1.10.2) installed through Conda.

By contrast, when using ANGSD v0.933 (HTSLIB v1.11) installed into a standalone Singularity container (see here), these issues disappeared and everything worked as expected.

I'm not sure if this is because of the different HTSLIB versions (1.10.2 through Conda vs. 1.11 in the Singularity container) or some other compilation issue. Nonetheless, these problems suggest some issue with the ANGSD installation. It might be worth trying to re-compile ANGSD or perhaps using the Singularity container I linked above, which solved these issues for me.

Hope this helps!
James

@TeresaPegan

TeresaPegan commented Mar 26, 2021

I am having this exact same problem as well! When I give -dosaf a sites file (of sites I identified with the exact same dataset), and no other filters, it returns 0 sites after filtering.

I just tried downloading a fresh version of angsd/htslib. I did not use conda, I used git clone etc. Unfortunately, I get the exact same result -- 0 sites retained.

Is this a recent problem? Might this work if I were to install/use an older version of angsd?

In the meantime, is there any other way to filter a saf file? E.g. to get a saf file for the entire genome and then filter that down to the sites that I need, using awk or something? Unfortunately the fact that the saf file is in an unusual format means that I am not sure how I would do that, but if anyone knows a way it would be great to hear!

Also, I am using the saf file with the aim of making a site frequency spectrum (as I believe many people who use -dosaf are). Could I apply a sites filter at the realSFS step? Would that give me an SFS built with only my sites of interest even if the saf input contains all sites in the genome?

Thanks,
-Teresa

@kwiyounghan
Author

Hi James,

I have used your Singularity container and it did solve the problems.
Both -doSaf with -sites and realSFS worked fine, with a reasonable number of sites retained and no errors.
I guess the errors are due to some bug related to the htslib version?

Thanks a lot !
Kwi

@kwiyounghan
Author

Dear Teresa,

As I commented above, James' solution worked for me.
Check the versions of your angsd and htslib and try using the Singularity container he provided or install the exact versions as above.

For your other questions, I am not sure if you can do that.
But as long as -doSaf works fine, I don't see much point in filtering sites in later steps?

I hope you find your answers!
Best,
Kwi

@James-S-Santangelo

@kwiyounghan Glad to hear the Singularity container worked!

@TeresaPegan I'm not sure about filtering the SAF file. You could print it using realSFS print and pipe that to awk, but I'm not sure you'll be able to recover the output in a format that is useful for downstream processing. However, you could use the -sites argument to realSFS while passing the genome-wide SAF file, as you suggested. This is what I had done prior to fixing the -sites issue with -doSaf, and it seemed to work perfectly fine. The only downside is having to run -doSaf on the whole genome, which takes time and disk space, but so long as that is not an issue, that approach should work well in your case.

James
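
A minimal sketch of the print-and-pipe-to-awk idea mentioned above, assuming the text written by realSFS print has the chromosome and position in its first two whitespace-separated columns (an assumption about the output layout, not something confirmed in this thread). keep_sites and genome.saf.idx are hypothetical names. The result is only a text table, so the caveat above about downstream usability still applies.

```shell
# Hypothetical helper: keep only the rows read from stdin whose first two
# columns (chromosome, position) are listed in a two-column sites file.
keep_sites() {
    sites=$1   # sites file: chromosome and position per line
    # First pass (NR == FNR) loads the sites file into the keep[] array;
    # second pass prints stdin rows whose (chromosome, position) is in it.
    awk 'NR == FNR { keep[$1, $2]; next } ($1, $2) in keep' "$sites" -
}

# Usage (needs realSFS and a genome-wide SAF index):
#   realSFS print genome.saf.idx | keep_sites SFS1_sites1_chr1 > subset.txt
```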

@TeresaPegan

TeresaPegan commented Mar 31, 2021

Thank you both for your feedback. I will look into trying to set up the singularity container you linked with my university's computing cluster support group. In the meantime, though, I tried just installing and using ANGSD v0.933 and HTSLIB v1.11 on my cluster account. This did not work, however, because -dosaf just had some other kind of error. I'll paste some of the output below. I wonder if this means the singularity container would not work for me anyway?

It's great to hear that using the -sites filter on realSFS should work. It is too bad that you have to run -dosaf on the whole genome first to use it, but I have enough flexibility in time and storage space that I think it's the best thing for me to do at this point. I'll keep an eye out for future updates to ANGSD that might fix this bug with the sites file and -dosaf!

Thanks,
-Teresa

         -> angsd version: 0.933 (htslib: 1.11) build(Mar 31 2021 15:34:10)
        -> [prep_sites.cpp] Reading binary representation of '/scratch/wingerb_root/wingerb1/tmpegan/BWASWTHv2/ANGSD_demog/Cgutt3x_SNPlist_rarefied.txt'
        -> [prep_sites.cpp] nChr: 41 loaded from binary filter file
        -> [abcFilter.cpp] -sites is still beta, use at own risk...
        -> Reading fasta: /scratch/wingerb_root/wingerb1/tmpegan/GCA_009819885.2_bCatUst1.pri.v2_genomic.fna
        -> Reading fasta: /scratch/wingerb_root/wingerb1/tmpegan/GCA_009819885.2_bCatUst1.pri.v2_genomic.fna
[bammer_main] 38 samples in 38 input files
        -> Parsing 38 number of samples 
No data for chromoId=0 chromoname=CM020336.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program regSize=0 notDone=38
No data for chromoId=1 chromoname=CM020337.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program regSize=0 notDone=38
No data for chromoId=2 chromoname=CM020338.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program regSize=0 notDone=38
No data for chromoId=3 chromoname=CM020339.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program regSize=0 notDone=38
No data for chromoId=4 chromoname=CM020340.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program regSize=0 notDone=38
No data for chromoId=5 chromoname=CM020341.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program regSize=0 notDone=38
No data for chromoId=6 chromoname=CM020342.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program regSize=0 notDone=38
No data for chromoId=7 chromoname=CM020343.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program regSize=0 notDone=38
No data for chromoId=8 chromoname=CM020344.1

...and this continues for all of my chromosomes, so the program exits after about 10 seconds and nothing is done. When I ran the same code using the most up-to-date versions of ANGSD and HTSLIB I did not get this error about not finding data for the chromosomes (though of course I did get the 0 sites retained issue), so I assume this has something to do with the older versions I was trying out.

@nspope
Contributor

nspope commented Mar 31, 2021

The nspope_bandedDP branch has storage/memory requirements for SAF files that are far lower than those of the master branch, and it is probably the better choice if you're outputting the entire genome (at least, it'll be better if you have more than a few samples). For now, use git clone -b nspope_bandedDP https://github.com/ANGSD/angsd.git; the branch should be merged into master before too long.

@ANGSD given all these htslib issues that seem to be popping up, maybe need a container or at least a Dockerfile with well-behaved commits ... ?

@TeresaPegan

@James-S-Santangelo I was able to get your singularity container on my cluster and run dosaf with -sites successfully on it, thanks!!

@esnielsen

Hi it seems I am having the same problem as Kwi, and I am interested in using the singularity container version, but the link to that by James isn't working... I tried googling it and it got me here:

https://github.com/ANGSD/angsd/blob/master/README.md

Is this correct?

Thanks!
Erica

@James-S-Santangelo

Hey Erica,

Singularity seems to have changed how they build and store containers in the past few days so that link is now broken. However, the previously linked container is still available in their archived repository (see here).

I haven't yet looked into what (if anything) has changed in terms of how these containers are now supposed to be maintained, but I have confirmed that the following command works for downloading the container locally:

singularity pull shub://James-S-Santangelo/singularity-recipes:angsd_v0.933

Hope this helps!

James

@esnielsen

Sweet, thanks James!

@James-S-Santangelo

This issue looks related to #348

@esnielsen

Hi, so I was able to install with the singularity container, but now I am running into errors running angsd in my bash script. My script is:


#!/bin/bash

singularity exec singularity-recipes_angsd_v0.933.sif /opt/bin/angsd

./angsd -bam subsetbam.filelist -GL 1 -doMaf 1 -doMajorMinor 1 -nThreads 1 -out test


When I try the command with './angsd' at the beginning I get the error "./angsd: Is a directory"

And when I try with just "angsd" at the beginning I get the error "angsd: command not found"

Could anyone let me know how they ran angsd with singularity on a cluster?

Many thanks!
Erica

@TeresaPegan

TeresaPegan commented May 4, 2021

On my cluster, I have a folder in my home directory called "angsd_sing." Within that is a script called "angsd_sing" that has this in it:

#!/bin/bash
exec singularity exec ~/angsd_sing/singularity-recipes_angsd_v0.933.sif angsd "$@"

The .sif file referred to here is the Singularity image that the cluster IT people helped me with.

I also add this to my path:

export PATH=~/angsd_sing:$PATH

I have to load a module called singularity. Once I do that, I just call the program with

angsd_sing

Hope this helps!
-Teresa
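
The approach above can also be sketched as a shell function, for anyone who prefers not to edit PATH. The key point is that the angsd arguments go on the same line as singularity exec, since the angsd binary presumably exists only inside the container. ANGSD_SIF and ANGSD_DRY_RUN are illustrative variable names, not part of Singularity or ANGSD, and the default image path simply mirrors the one in the script above.

```shell
# Hypothetical wrapper: run angsd inside the Singularity image, forwarding
# every argument unchanged.
angsd_sing() {
    _sif="${ANGSD_SIF:-$HOME/angsd_sing/singularity-recipes_angsd_v0.933.sif}"
    if [ -n "${ANGSD_DRY_RUN:-}" ]; then
        # Dry run: print the command that would be executed.
        echo singularity exec "$_sif" angsd "$@"
        return 0
    fi
    if [ ! -f "$_sif" ]; then
        echo "container image not found: $_sif" >&2
        return 1
    fi
    singularity exec "$_sif" angsd "$@"
}

# Usage:
#   angsd_sing -bam subsetbam.filelist -GL 1 -doMaf 1 -doMajorMinor 1 -nThreads 1 -out test
```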

@esnielsen

Great, I'll try that- thanks Teresa!

@ANGSD
Owner

ANGSD commented Mar 2, 2022

Hello, there is a lot of information in this thread. Some of it relates to a conda version, with which I am not familiar.

Here is how it is supposed to work with the GitHub version of angsd:

  1. get a list of sites to be included. This should be in the format
chromosome position
  2. then this should be indexed with
angsd sites index yourfile

Then this should be passed as an option to angsd with

angsd -sites yourfile [other options and parameters]

You can validate that angsd correctly interprets your original file with

angsd sites print yourfile

There is more information here
http://www.popgen.dk/angsd/index.php/Sites

If you have already filtered out the "bad" sites when making your sites file, then there is no benefit in including the filter parameters again.

Another trick is to use -rf in combination with the -sites argument, so that you only use the chromosomes/scaffolds/contigs that are in your sites file:

cut -f1 yourfile | sort|uniq >your.rf
angsd -sites yourfile -rf your.rf
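
A minimal sketch of the preparation steps above, assuming a tab-separated sites file with the chromosome in the first column and the position in the second. check_sites_format and sites_to_rf are hypothetical helper names, not angsd commands. The check only looks at the first two columns.

```shell
# Hypothetical helper: fail if any line lacks a tab-separated chromosome
# and an integer position in its first two columns.
check_sites_format() {
    bad=$(awk -F'\t' 'NF < 2 || $2 !~ /^[0-9]+$/ { print NR; exit }' "$1")
    if [ -n "$bad" ]; then
        echo "malformed sites line $bad in $1" >&2
        return 1
    fi
}

# Hypothetical helper: write the unique chromosome names of a sites file
# to a region file for use with -rf.
sites_to_rf() {
    cut -f1 "$1" | sort -u > "$2"
}

# Usage (the angsd steps need angsd and your own inputs):
#   check_sites_format yourfile && angsd sites index yourfile
#   sites_to_rf yourfile your.rf
#   angsd -sites yourfile -rf your.rf [other options and parameters]
```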

It would seem that the installation issue might have to do with a problem in specific older versions of htslib, so if this is still relevant you should use the most recent versions of angsd and htslib. I will close this issue, but if it is still causing problems you can reopen it.

Best

@ANGSD ANGSD closed this as completed Mar 2, 2022
isinaltinkaya added a commit that referenced this issue Oct 11, 2022
Add function aio::doAssert to replace asserts
Did not use aio::assert as name  since aio.h
namespace complains due to assert being a macro
Fixes the major bug explained in #527
Fixes issues #520 #474 #466 #420 #405 #396 #385
Possibly others; other issues should rerun the
commands using the latest version.
6 participants