12. Creation of Specific Datasets

George Pacheco edited this page Aug 2, 2021 · 7 revisions

We used ANGSD v0.921 together with the wrapper_angsd.sh script to create the specific datasets used by the different downstream analyses.

12.1. Dataset I

ALL GOOD SAMPLES With ALL WGS, GBS, WGS-GBS Trios (257 SAMPLES / 184 GBS, 50 WGS & 23 WGS-GBS):

Runs ANGSD (list of samples as in section 11):
xsbatch -c 64 --mem-per-cpu 7800 -J PBGP_AllSites --time 5-00 --force -- $SCRIPTS/scripts/wrapper_angsd.sh -debug 2 -nThreads 64 -ref ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun.fasta -bam ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.list -sites ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged_RemovedPossibleParalogs-g800--Article--Ultra.pos -rf ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun_ChrGreater1kb.id -remove_bads 1 -uniqueOnly 1 -baq 1 -C 50 -minMapQ 30 -minQ 20 -minInd $((257*95/100)) -doCounts 1 -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 -doPost 2 -doGeno 3 -doPlink 2 -geno_minDepth 3 -setMaxDepth $((257*275)) -dumpCounts 2 -postCutoff 0.95 -doHaploCall 1 -doVcf 1 -out ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra
Number of SITES: 1,997,420
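The -minInd and -setMaxDepth values in the ANGSD command are computed with shell integer arithmetic, so the 95% threshold is truncated rather than rounded. A minimal sketch of that arithmetic for the three sample counts used on this page follows; the function name angsd_thresholds is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper: prints N, the -minInd value and the -setMaxDepth value.
angsd_thresholds() {
    # $(( ... )) is integer arithmetic: 257*95/100 = 24415/100 -> 244 (truncated).
    printf '%s\t%s\t%s\n' "$1" "$(( $1 * 95 / 100 ))" "$(( $1 * 275 ))"
}

# Sample counts of Datasets I, II and III.
for n in 257 210 207; do
    angsd_thresholds "$n"
done
```

For 257 samples this gives -minInd 244 and -setMaxDepth 70675, so a site is kept only if at least 244 individuals have data.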
Gets a .labels file:
awk '{split($0,a,"/"); print a[9]}' ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.list | awk '{split($0,b,"."); print b[1]}' > ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.labels
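The command above takes the 9th "/"-separated field of each BAM path, so it depends on the directory depth of the list entries. As an illustration only, the sketch below derives the same kind of label from the last path component, which does not depend on depth; the function name and the example paths are hypothetical, and this is an alternative, not the command that was actually run.

```shell
# Hypothetical helper: reads BAM paths on stdin and prints one label per line.
labels_from_paths() {
    # -F/ splits on "/", $NF is the file name; keep the text before the first ".".
    awk -F/ '{ split($NF, b, "."); print b[1] }'
}

# Illustrative paths only (not real sample names).
printf '%s\n' \
    '/groups/lab/user/data/Pigeons/Analysis/PaleoMix_GBS/SampleA_GBS.bam' \
    '/groups/lab/user/data/Pigeons/Analysis/PaleoMix_Re-Sequencing/SampleB.sorted.bam' |
    labels_from_paths
```

The labels must stay in the same order as the BAM list, because the later paste steps match labels to ANGSD output columns by position.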
Gets Real Coverage (Genotype Likelihoods):
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.counts.gz | tail -n +2 | gawk ' {for (i=1;i<=NF;i++){a[i]+=$i;++count[i]}} END{ for(i=1;i<=NF;i++){print a[i]/count[i]}}' | paste ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.labels - > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/RealCoverage/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.txt
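The coverage step above skips the header of the per-individual counts table and averages each column over all sites. A toy sketch of that column-mean logic, using a made-up three-site table with two individuals (the function name col_means and the numbers are illustrative):

```shell
# Hypothetical helper: prints the mean of every whitespace-separated column.
col_means() {
    awk '{ for (i = 1; i <= NF; i++) sum[i] += $i; n = NF }
         END { for (i = 1; i <= n; i++) print sum[i] / NR }'
}

# Toy counts table: one header line, then one row per site.
printf 'ind0TotDepth\tind1TotDepth\n3\t0\n5\t2\n4\t4\n' |
    tail -n +2 |   # drop the header, as in the command above
    col_means
```

Here the first individual has a mean depth of 4 and the second of 2. Sites with zero reads count toward the mean, so the value is the average depth over all retained sites.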
Gets Missing Data (Genotype Likelihoods):
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.beagle.gz | tail -n +2 | perl /groups/hologenomics/fgvieira/scripts/call_geno.pl --skip 3 | cut -f 4- | awk '{ for(i=1;i<=NF; i++){ if($i==-1)x[i]++} } END{ for(i=1;i<=NF; i++) print i"\t"x[i] }' | paste ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.labels - | awk '{print $1"\t"$3"\t"$3*100/1997420}' > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/MissingDataCalc/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.GL-Missing.txt
Gets Missing Data (Random Haplotype Calling):
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.haplo.gz | cut -f 4- | tail -n +2 | awk '{ for(i=1;i<=NF; i++){ if($i=="N")x[i]++} } END{ for(i=1;i<=NF; i++) print i"\t"x[i] }' | paste ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.labels - | awk '{print $1"\t"$3"\t"$3*100/1997420}' > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/MissingDataCalc/PBGP--GoodSamples_WithAllWGS-GBSPairs--Article--Ultra.RHC-Missing.txt
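Both missing-data steps count, per sample column, how many sites carry the missing marker (-1 for called genotypes, N for random haplotype calls) and divide by the number of sites. The sketch below shows that counting on a toy table; the function name is hypothetical, it divides by the number of input rows instead of a hard-coded site count, and it adds 0 to each counter so that a column with no missing calls prints 0 rather than an empty field.

```shell
# Hypothetical helper: $1 is the missing marker ("N" or "-1").
# Prints column index, missing count and missing percentage.
missing_per_column() {
    awk -v miss="$1" '
        { for (i = 1; i <= NF; i++) if ($i == miss) x[i]++; n = NF }
        END { for (i = 1; i <= n; i++)
                  printf "%d\t%d\t%.2f\n", i, x[i] + 0, (x[i] + 0) * 100 / NR }'
}

# Toy haplotype calls: 4 sites (rows) by 3 samples (columns).
printf 'A\tN\tC\nN\tN\tG\nT\tN\tG\nA\tC\tG\n' | missing_per_column N
```

In this toy table the three samples are missing at 1, 3 and 0 of the 4 sites, that is 25%, 75% and 0%.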

12.2. Dataset II

ALL GOOD SAMPLES with WGS version of the WGS_GBS_WGS-GBS Trios (210 SAMPLES / 161 GBS & 49 WGS):

Gets a list of samples:
find ~/data/Pigeons/Analysis/PaleoMix_GBS/*.bam ~/data/Pigeons/Analysis/PaleoMix_Re-Sequencing/*.bam | grep -f ~/data/Pigeons/Analysis/Lists/ALL_Re-Seqed-GBSBreedPlates--Article.list | grep -v -f ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--BadSamples--Article.list | grep -v -f ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--NoCrupestris--Article.list | grep -v -f ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GBS_Pairs--Article.list > ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithWGSs_NoCrupestris--Article--Ultra.list
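The sample list is built by chaining pattern files: grep -f keeps paths that match any line of an inclusion list, and each grep -v -f drops paths that match any line of an exclusion list. A toy sketch of that chain follows; the file names, patterns and paths are invented for illustration.

```shell
# Temporary pattern files standing in for the real .list files (hypothetical content).
tmp=$(mktemp -d)
printf '%s\n' Plate1 Plate2 > "$tmp/keep.list"
printf '%s\n' Bad07 > "$tmp/bad.list"

# Hypothetical helper: keep paths matching keep.list, then drop those matching bad.list.
filter_samples() {
    grep -f "$tmp/keep.list" | grep -v -f "$tmp/bad.list"
}

printf '%s\n' \
    /x/Plate1_Good01.bam /x/Plate1_Bad07.bam \
    /x/Plate3_Good02.bam /x/Plate2_Good03.bam |
    filter_samples
```

Only /x/Plate1_Good01.bam and /x/Plate2_Good03.bam survive. Each list line is treated as a regular expression and matched as a substring, so a short ID can match longer sample names, and an empty line in a pattern file matches every path.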
Runs ANGSD:
xsbatch -c 34 --mem-per-cpu 9200 -J PBGP_SNPs --time 10-00:00 --force -- $SCRIPTS/scripts/wrapper_angsd.sh -debug 2 -nThreads 34 -ref ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun.fasta -bam ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithWGSs_NoCrupestris--Article--Ultra.list -sites ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged_RemovedPossibleParalogs-g800--Article--Ultra.pos -rf ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun_ChrGreater1kb.id -remove_bads 1 -uniqueOnly 1 -baq 1 -C 50 -minMapQ 30 -minQ 20 -minInd $((210*95/100)) -doCounts 1 -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 -MinMaf 0.005 -SNP_pval 1e-6 -doPost 2 -doGeno 3 -doPlink 2 -geno_minDepth 3 -setMaxDepth $((210*275)) -dumpCounts 2 -postCutoff 0.95 -doHaploCall 1 -doVcf 1 -out ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra
Number of SITES: 26,082
Gets Real Coverage (Genotype Likelihoods):
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra.counts.gz | tail -n +2 | gawk ' {for (i=1;i<=NF;i++){a[i]+=$i;++count[i]}} END{ for(i=1;i<=NF;i++){print a[i]/count[i]}}' | paste ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithWGSs_NoCrupestris--Article--Ultra.labels - > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/RealCoverage/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra-RealCoverage.txt
Gets Missing Data (Genotype Likelihoods):
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra.beagle.gz | tail -n +2 | perl $SCRIPTS/scripts/call_geno.pl --skip 3 | cut -f 4- | awk '{ for(i=1;i<=NF; i++){ if($i==-1)x[i]++} } END{ for(i=1;i<=NF; i++) print i"\t"x[i] }' | paste ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithWGSs_NoCrupestris--Article--Ultra.labels - | awk '{print $1"\t"$3"\t"$3*100/26082}' > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/MissingDataCalc/PBGP--GoodSamples_WithWGSs_NoCrupestris_SNPCalling--Article--Ultra.GL-Missing.txt

12.3. Dataset III

ALL GOOD SAMPLES with WGS version of the WGS_GBS_WGS-GBS Trios, NO Ferals & NO ODD SAMPLES (207 SAMPLES / 159 GBS & 48 WGS):

Gets a list of samples:
find ~/data/Pigeons/Analysis/PaleoMix_GBS/*.bam ~/data/Pigeons/Analysis/PaleoMix_Re-Sequencing/*.bam | grep -f ~/data/Pigeons/Analysis/Lists/ALL_Re-Seqed-GBSBreedPlates--Article.list | grep -v -f ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--BadSamples--Article.list | grep -v -f ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--NoOddSamplesNoFerals--Article.list | grep -v -f ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GBS_Pairs--Article.list > ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals--Article--Ultra.list
Runs ANGSD:
xsbatch -c 40 --mem-per-cpu 6000 -J PBGP_SNPs --time 5-00:00 --force -- $SCRIPTS/scripts/wrapper_angsd.sh -debug 2 -nThreads 40 -ref ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun.fasta -bam ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals--Article--Ultra.list -sites ~/data/Pigeons/Reference/PBGP_FinalRun.EcoT22I_Extended_Merged_RemovedPossibleParalogs-g800--Article--Ultra.pos -rf ~/data/Pigeons/Reference/DanishTumbler_Dovetail_ReRun_ChrGreater1kb.id -remove_bads 1 -uniqueOnly 1 -baq 1 -C 50 -minMapQ 30 -minQ 20 -minInd $((207*95/100)) -doCounts 1 -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 -MinMaf 0.005 -SNP_pval 1e-6 -doPost 2 -doGeno 3 -doPlink 2 -geno_minDepth 3 -setMaxDepth $((207*275)) -dumpCounts 2 -postCutoff 0.95 -doHaploCall 1 -doVcf 1 -out ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals_SNPCalling--Article--Ultra
Number of SITES: 26,504
Gets Real Coverage (Genotype Likelihoods):
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals_SNPCalling--Article--Ultra.counts.gz | tail -n +2 | gawk ' {for (i=1;i<=NF;i++){a[i]+=$i;++count[i]}} END{ for(i=1;i<=NF;i++){print a[i]/count[i]}}' | paste ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals--Article--Ultra.labels - > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/RealCoverage/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals_SNPCalling--Article--Ultra-RealCoverage.txt
Gets Missing Data (Genotype Likelihoods):
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals_SNPCalling--Article--Ultra.beagle.gz | tail -n +2 | perl $SCRIPTS/scripts/call_geno.pl --skip 3 | cut -f 4- | awk '{ for(i=1;i<=NF; i++){ if($i==-1)x[i]++} } END{ for(i=1;i<=NF; i++) print i"\t"x[i] }' | paste ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals--Article--Ultra.labels - | awk '{print $1"\t"$3"\t"$3*100/26504}' > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/MissingDataCalc/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals_SNPCalling--Article--Ultra.GL-Missing.txt
Gets Missing Data (Genotype Calling):
zcat ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--ANGSDRuns/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals_SNPCalling--Article--Ultra.geno.gz | cut -f 5- | awk '{ for(i=1;i<=NF; i++){ if($i==-1)x[i]++} } END{ for(i=1;i<=NF; i++) print i"\t"x[i] }' | paste ~/data/Pigeons/PBGP/PBGP--Analyses/Lists/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals--Article--Ultra.labels - | awk '{print $1"\t"$3"\t"$3*100/26504}' > ~/data/Pigeons/PBGP/PBGP--Analyses/PBGP--Miscellaneous/MissingDataCalc/PBGP--GoodSamples_WithWGSs_NoOddSamplesNoFerals_SNPCalling--Article--Ultra.GC-Missing.txt