added code for lifting over assemblies #574

Merged 1 commit into popsim-consortium:master on Nov 16, 2020

Conversation

@jradrion (Contributor) commented Aug 5, 2020

Here is my very unpolished code for doing liftOver between assemblies. I've tested this on humans and flies so far, and it is running without issue for me. The code requires that you have the UCSC liftOver tool installed and that you point to the appropriate chain file using the argument --chainFile. The --winLen option controls the interval length over which a weighted average recombination rate is calculated, and --gapThresh controls the length of gaps for which recombination rates are set to zero rather than to the chromosome average rate.

An example command for lifting over the human map would be as follows:

python liftOver_assemblies.py \
    --species HomSap \
    --map HapMapII_GRCh37 \
    --chainFile ./liftOver_chains/hg19ToHg38.over.chain.gz \
    --winLen 1000 \
    --gapThresh 1000000

@codecov (bot) commented Aug 5, 2020

Codecov Report

Merging #574 (dbe28c9) into master (566c38f) will increase coverage by 0.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #574      +/-   ##
==========================================
+ Coverage   99.65%   99.67%   +0.01%     
==========================================
  Files          33       33              
  Lines        2328     2158     -170     
  Branches      293      256      -37     
==========================================
- Hits         2320     2151     -169     
  Misses          3        3              
+ Partials        5        4       -1     
Impacted Files Coverage Δ
stdpopsim/slim_engine.py 99.67% <0.00%> (-0.01%) ⬇️
stdpopsim/cli.py 98.43% <0.00%> (-0.01%) ⬇️
stdpopsim/cache.py 100.00% <0.00%> (ø)
stdpopsim/utils.py 100.00% <0.00%> (ø)
stdpopsim/models.py 100.00% <0.00%> (ø)
stdpopsim/engines.py 100.00% <0.00%> (ø)
stdpopsim/genomes.py 100.00% <0.00%> (ø)
stdpopsim/species.py 100.00% <0.00%> (ø)
stdpopsim/citations.py 100.00% <0.00%> (ø)
stdpopsim/genetic_maps.py 100.00% <0.00%> (ø)
... and 9 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@grahamgower (Member) commented Aug 5, 2020

Nice one, thanks Jeff!

I'm a noob when it comes to lifting over, so I have some questions about the nature of this liftOver tool that may be a bit off-topic, but here goes anyway...

  • UCSC seems to offer only precompiled binaries (https://genome-store.ucsc.edu/). There is mention of source code availability, but it's not obvious where it can be obtained. The linked license terms suggest that one is not allowed to make derivatives (3.2a in https://genome-store.ucsc.edu/media/eula/2019/10/14/UCSC_GB_EULA.pdf). Jeff: does your script then constitute a derivative work?
  • How does liftOver work? Is there a publication describing it? Or is the only description to be found in the sources (which I couldn't easily find)?
  • What is a chain file anyway, and how can one be created?

@jeromekelleher (Member)

Thanks @jradrion, this is great!

@jeffspence - it would be super helpful if we could get your thoughts on what we're doing here. What we need to do is lift over older genetic maps to GRCh38 (for humans), and to whatever genome build is used in Ensembl for other species. As such, it's something we do rarely so the code doesn't need to be especially polished or automated, but we'd like to have some assurance that we've not introduced artifacts into the maps. Any thoughts would be much appreciated.

@jradrion (Contributor, Author) commented Aug 5, 2020

@grahamgower thanks!

does your script then constitute a derivative work?

I'm not a lawyer, but I would argue that this does not constitute a derivative work. All I'm doing is running the UCSC liftOver tool "as is", with some pre- and post-processing going on to convert the file formats for our use. To me this is no different than running liftOver using a shell script, which I'd have to imagine would not break the EULA. If this is a concern for people it would be pretty easy to strip out the call to liftOver and split this code into two scripts, where we run the first half, then we run liftover, then we run the second half.
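
For what it's worth, the call itself is nothing more than a thin wrapper around the binary; a minimal sketch (not the PR's actual code; it assumes liftOver is on PATH and that the pre/post-processing steps read and write BED files) would look roughly like:

# Minimal sketch only (not the PR's code): wrap the UCSC liftOver binary.
# Assumes liftOver is on PATH; the documented invocation is:
#   liftOver oldFile map.chain newFile unMapped
import subprocess

def run_liftover(in_bed, chain_file, out_bed, unmapped_file):
    subprocess.run(
        ["liftOver", in_bed, chain_file, out_bed, unmapped_file],
        check=True,
    )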

How does liftOver work?

I'm also not a liftOver expert, and this is a really good question. I guess I've always assumed it uses some kind of hash function to just grab the coordinates from the chain file. To be honest, I have no idea how it works.

What is a chain file anyway, and how can one be created?

Chain files are better described here, but it's basically just a file of coordinates from genome-wide local alignments mapping between the two assemblies. The algorithm for building the chains is published (https://www.pnas.org/content/100/20/11484.full). This page describes how to make your own.
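
For reference, the documented layout of a chain file is roughly the following: one header line per chain, followed by alignment block lines, where each block line gives the size of an ungapped aligned block and the offsets (dt, dq) to the next block on the target and query sides:

chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
size dt dq
size dt dq
...
size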

@jeromekelleher (Member)

I'm not a lawyer, but I would argue that this does not constitute a derivative work. All I'm doing is running the UCSC liftOver tool "as is", with some pre- and post-processing going on to convert the file formats for our use. To me this is no different than running liftOver using a shell script, which I'd have to imagine would not break the EULA. If this is a concern for people it would be pretty easy to strip out the call to liftOver and split this code into two scripts, where we run the first half, then we run liftover, then we run the second half.

Let's not worry about this - we're not distributing the maintenance code to end users, so I don't think we need to worry about license issues.

@jeffspence (Contributor)

This looks great! If I understand correctly -- you map the HapMap-format file to a BED, use liftOver to remap the BED, and then do a bit of post-processing (filling in gaps, and averaging within windows) to clean up the BED and map back to HapMap format. My only minor quibble is that on line 140 you use the mean of the rates to fill in short gaps. I think it would make a bit more sense to use the mean rate instead of the mean of the rates (the mean of the rates up-weights short intervals and down-weights long intervals; in species with hotspots, high rates tend to be in short intervals and low rates in long intervals, so my guess is that the mean of the rates would be a bit higher than the mean rate).
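
To make the distinction concrete, here is a tiny illustration with made-up numbers (an unweighted mean of per-interval rates versus a length-weighted mean rate):

# Illustration only, with made-up numbers: unweighted mean of rates vs. length-weighted mean rate.
import numpy as np

lengths = np.array([1_000, 99_000])   # bp: one short hotspot interval, one long cold interval
rates = np.array([1e-6, 1e-8])        # per-bp recombination rates

mean_of_rates = rates.mean()                      # ~5.1e-07: up-weights the short hotspot
mean_rate = np.average(rates, weights=lengths)    # ~2.0e-08: preserves the total map length
print(mean_of_rates, mean_rate)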

This overall approach is a little different from what Adam Auton did to lift the HapMap maps over. This is from the README for the GRCh37 liftOver of the HapMap recombination maps (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20110106_recombination_hotspots/README_hapmapII_GRCh37_map):

The map was generated by lifting the HapMap Phase II genetic map from build 35 to GRCh37. The original map was generated using LDhat
as described in the 2007 HapMap paper (Nature, 18th Sept 2007). The conversion from b35 to GRCh37 was achieved using the UCSC liftOver
tool.

Once the liftOver was completed, the map was inspected for regions in which the genome assembly had been rearranged. Such rearrangements
result in a dip in the lifted genetic map (i.e. a negative recombination rate). These regions were removed by setting the
recombination to zero within, and for 50 SNPs either side of, such regions. These regions are fairly rare, and this method removed a
total of 2013 SNPs from the following chromosomes.

chr1: 224 SNPs
chr2: 209 SNPs
chr3: 104 SNPs
chr4: 103 SNPs
chr7: 547 SNPs
chr9: 292 SNPs
chr10: 101 SNPs
chr13: 208 SNPs
chr15: 101 SNPs
chrX: 124 SNPs.

All other chromosomes did not have any SNPs removed by this method.

Adam Auton

I'm not 100% sure what this actually means. It sounds like he might have looked at the actual position in genetic distance for each line in the file, and then back-calculated the rate (from the differences in physical distances and genetic distances between adjacent lines), setting negative rates to zero. But I think that what you did makes more sense.

I did something similar to what you're doing in my paper (https://doi.org/10.1126/sciadv.aaw9206 ; Table S2), where I inferred maps on both GRCh37 and GRCh38, and then lifted over the maps from GRCh37 to GRCh38 and computed correlations between the map inferred on GRCh38 and the map inferred on GRCh37 but lifted over to GRCh38. The way I lifted over the map was a bit more naive than what you're doing -- I just converted the recombination rates to BED files and then ran liftOver on the BED files without post-processing. I only computed correlations on positions that were included in an interval in both maps, and if positions appeared in more than one interval, I used the recombination rate from the last interval. In any case, the correlations were actually quite good. It will obviously depend on the method used to infer the recombination maps, but I got correlations (Spearman and Pearson on the log-recombination rates) of ~0.98 and higher even at single-base-pair resolution. Pearson on the natural scale varied much more depending on the population. I'm guessing this is because hotspots were inferred to have somewhat different heats depending on the build, driving the differences in correlation. But overall, my take was that running liftOver had less of an effect than I might have thought (although of course YMMV, especially for lifting over between genome builds that are less similar than GRCh37 and GRCh38; also, my method uses unphased data, and I suspect that methods that use phased data might change more between builds).

Hope this helps, and overall it looks really great!
Jeff

@jradrion (Contributor, Author) commented Aug 5, 2020

@jeffspence thank you for taking a look at this!

I think it would make a bit more sense to use the mean rate instead of the mean of the rates

Great catch, I hadn't considered the issue of biased means. I also just realized that I rm the unmapped positions files automatically, which is dumb because we may want to inspect those after running this code. I'll work on modifying this commit to reflect these changes.

I'm not 100% sure what this actually means.

I'm not either. By rearrangements, I was thinking he might have meant only places where contiguous intervals are reordered in the lifted-over assembly, but I'm really not sure. This also doesn't address how they dealt with gaps when they presumably arose in the new assembly.

In any case, the correlations were actually quite good.

This is comforting to know!

@jradrion (Contributor, Author) commented Aug 6, 2020

I edited the code to use a properly weighted mean rate for the chromosome, per @jeffspence's suggestion. I also added the flag --retainIntermediates to retain all of the liftOver intermediate files for inspection. Finally, I added the option --joinThresh, which joins adjacent intervals with the same rate when rounded to --joinThresh digits of precision after the decimal point.
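
For reference, the joining step amounts to something like the following (a sketch only, not the actual implementation):

# Sketch only (not the actual implementation): join adjacent intervals whose rates
# agree when rounded to `digits` decimal places, keeping a length-weighted rate.
def join_intervals(intervals, digits):
    # intervals: sorted, non-overlapping (start, end, rate) tuples
    joined = []
    for start, end, rate in intervals:
        mergeable = (
            joined
            and joined[-1][1] == start
            and round(joined[-1][2], digits) == round(rate, digits)
        )
        if mergeable:
            prev_start, _, prev_rate = joined[-1]
            # keep a length-weighted rate so the total map length is preserved
            merged_rate = (
                prev_rate * (start - prev_start) + rate * (end - start)
            ) / (end - prev_start)
            joined[-1] = (prev_start, end, merged_rate)
        else:
            joined.append((start, end, rate))
    return joined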

@ndukler (Contributor) commented Sep 1, 2020

As per our conference call, here's some code for seeing which target regions are covered in a chain file:

#!/usr/bin/env bash

## A script designed to take a chain file and create a BED file showing its coverage on the target genome,
## where the chain has query --> target directionality. Places the output file in the same directory as the source file.
## Arguments: $1 = chain file (plain or gzipped), $2 = target assembly name, $3 = query assembly name

bname=`basename $1`
target=$2
query=$3
oname=`readlink -e $1`
oname=`dirname ${oname}`/${target}_covered_by_${query}.bed.gz

if [[ ${bname} == *.gz ]]; then
    zcat $1 | awk 'length {if($1 ~ "chain"){chrom=$8;qChrLen=$9;qStrand=$10; if(qStrand=="+"){start=$11} else{start=qChrLen-$11}} else{if(qStrand=="+"){print chrom,start,start+$1;start=start+$1+$3}; if(qStrand=="-"){print chrom,start-$1,start;start=start-$1-$3}}}' | sort-bed - | bedops -m - | awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' | gzip -c > ${oname}
else    
    awk 'length {if($1 ~ "chain"){chrom=$8;qChrLen=$9;qStrand=$10; if(qStrand=="+"){start=$11} else{start=qChrLen-$11}} else{if(qStrand=="+"){print chrom,start,start+$1;start=start+$1+$3}; if(qStrand=="-"){print chrom,start-$1,start;start=start-$1-$3}}}' $1 | sort-bed - | bedops -m - | awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' | gzip -c > ${oname}
fi

@jradrion (Contributor, Author) commented Sep 1, 2020

Thanks, @ndukler. I'll have a look.

@jradrion (Contributor, Author)

I found some issues with the code that I will be fixing over the coming days, so we should hold off on merging for the time being.

@jeromekelleher (Member)

It's annoying that we're failing on CircleCI because we can't import the new script - I think this is because we're asking nose to track coverage on the maintenance package. It might be simplest to just put matplotlib into the list of dev dependencies, although this might slow down our install time a fair bit.

@grahamgower, any thoughts on the simplest way out here? I don't think we want to try to write tests to cover this code, so I'm not sure what the best thing to do here is.

@grahamgower (Member)

Hmm... interesting. Does this go away if we stop asking for coverage in the maintenance folder? If it's as simple as that, then I'd say let's just not bother tracking coverage there.

@jeromekelleher (Member)

Hmm... interesting. Does this go away if we stop asking for coverage in the maintenance folder? If it's as simple as that, then I'd say let's just not bother tracking coverage there.

Yes, that's probably the best thing all right.

@jradrion (Contributor, Author) commented Oct 7, 2020

I've again updated this script. The newest version avoids doing a weighted average for all non-overlapping windows across the chromosome, and instead only performs weighted averaging for overlapping intervals in the lifted map (a relatively rare occurrence). This dramatically reduces the error when mapping back to the original assembly, and I think this is really the way to go. Additionally, I added a new option to fill gaps with either the chromosome mean or the mean of the flanking intervals, which seems to be a better choice in most instances.
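
For concreteness, the gap-filling choice is essentially the following (a sketch only, not the actual code; None marks a missing flanking interval, e.g. at a chromosome end):

# Sketch only: fill a gap with the mean of the flanking interval rates,
# falling back to the chromosome mean when no flanking rate is available.
def fill_gap_rate(left_rate, right_rate, chrom_mean):
    flanks = [r for r in (left_rate, right_rate) if r is not None]
    return sum(flanks) / len(flanks) if flanks else chrom_mean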

I've also added maintenance code for the liftover of the comeron map from dm3 to dm6, as discussed in #592.

@grahamgower (Member)

Thanks @jradrion, this is now at the top of my queue of stdpopsim things. It's quite long though, so I probably won't get to it for a few more days. (And Jerome is on parental leave.) Perhaps you could look into avoiding the CircleCI failure in the meantime? It might be sufficient to remove the maintenance folder from the nosetests command in .circleci/config.yml (and if that doesn't do the trick, let me know and we'll devise a new strategy).

@jradrion (Contributor, Author) commented Oct 9, 2020

@grahamgower OK, looks like removing maintenance from nosetests did the trick.

@andrewkern (Member)

@jradrion do we really want to get rid of those tests?

@grahamgower (Member)

The change made here doesn't change which tests are run. What it does change is which folders code coverage is reported for when running the tests (it was looking under both stdpopsim and maintenance; now it will just look under stdpopsim). Annoyingly, the continuous integration was failing because it wanted to import everything under the maintenance folder to check its coverage. This failed because @jradrion's script(s) import stuff that isn't installed on the CircleCI machine (and we preferred not to depend on matplotlib et al.).

@grahamgower (Member)

Hey @jradrion, I've looked over this code a couple of times now, but haven't tried running it. I think there are a few code formatting issues that should be fixed. Can you try running flake8 --max-line-length 89 maintenance/lift*.py? We should be running that in the continuous integration as well, but I guess it hasn't been added yet (you're welcome to add it, if you like).

It would be nice to include a unit test or two for this in the test suite as well. How about including a small chain file for a small section of one chromosome, and storing the pre-lifted and post-lifted rate maps for that? Then the unit test would check that the liftover script produces the same output as the pre-calculated post-lifted maps.
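
Something along these lines, say (the module, function, and data-file names here are hypothetical, just to sketch the shape of the test):

# Hypothetical sketch of such a test; names and paths are illustrative only.
import unittest
from maintenance import liftOver  # assumed import path

class TestLiftOver(unittest.TestCase):
    def test_small_region_liftover(self):
        # a small pre-lifted map and a small chain file stored with the test data
        observed = liftOver.lift_map(
            "tests/data/small_pre_lift_map.txt",
            "tests/data/small_region.over.chain.gz",
        )
        expected = liftOver.read_map("tests/data/small_post_lift_map.txt")
        self.assertEqual(observed, expected)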

@jradrion (Contributor, Author)

Thanks for looking it over, @grahamgower. I'll fix the formatting, add some tests, and update.

@jeromekelleher (Member) left a review

Looks great, thanks @jradrion!

There's a few high-level structural issues here that are resulting in us having a lot more code than we really need, but I think they can be resolved pretty easily. I'm happy to provide more input if the comments aren't clear.

from matplotlib import gridspec


def progress_bar(percent, barLen = 50):

@jeromekelleher:
tqdm is a handy way of doing this.

sys.stdout.flush()


def assign_task(tmpFile, task_q, nProcs):

@jeromekelleher:
Seems pretty heavyweight to need our own task queue here - have you tried using concurrent.futures?

run all functions to translate and liftOver file types
"""
# multithread remapping
task_q = mp.JoinableQueue()

@jeromekelleher:
This does seem unnecessarily low-level here - there's a good example in the msprime code of using concurrent.futures along with a tqdm progress bar.
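
The general pattern (not msprime's actual code; remap_one and chunks are placeholders) is roughly:

# General pattern only: a process pool plus a tqdm progress bar,
# as an alternative to a hand-rolled multiprocessing task queue.
import concurrent.futures
import tqdm

def remap_all(chunks, remap_one, n_procs):
    results = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=n_procs) as executor:
        futures = [executor.submit(remap_one, chunk) for chunk in chunks]
        completed = concurrent.futures.as_completed(futures)
        for future in tqdm.tqdm(completed, total=len(futures)):
            results.append(future.result())
    return results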

from matplotlib import gridspec


def progress_bar(percent, barLen = 50):

@jeromekelleher:
This looks like a near-identical copy of the liftover_assemblies.py file. Having multiple copies of the same code is almost never a good idea, and will be a significant maintenance burden over time (especially if we have lots of species - consider what happens if we find a small bug in the liftover code, which has been copied n times).

How about we have a file liftover.py which provides the basic utilities you're using here, like hapmap_to_bed etc., and then the code specific to each liftover we do lives in its own Python file that imports liftover?

@jeromekelleher:
That is, basically just the main functions which you currently have?
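
In other words, something like the following layout (a sketch only; hapmap_to_bed is from the suggestion above, while run_liftover, bed_to_hapmap, and the file names are hypothetical):

# maintenance/liftOver_comeron2012.py -- sketch of the suggested layout, not actual code.
import liftOver  # shared utilities module

def main():
    liftOver.hapmap_to_bed("comeron2012_dm3_maps.txt", "comeron2012_dm3.bed")
    liftOver.run_liftover("comeron2012_dm3.bed", "dm3ToDm6.over.chain.gz", "comeron2012_dm6.bed")
    liftOver.bed_to_hapmap("comeron2012_dm6.bed", "comeron2012_dm6_maps.txt")

if __name__ == "__main__":
    main()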

@jradrion (Contributor, Author)

@jeromekelleher Thank you for your comments! I'll look into implementing both tqdm and concurrent.futures, and I'll fix the redundancies.

@PopSim-bot

📖 Docs for this PR can be previewed here

@jradrion (Contributor, Author) commented Nov 13, 2020

@jeromekelleher @grahamgower
I made the requested flake8 formatting changes and moved all redundant functions to liftOver.py, which is then imported by both liftOver_catalog.py and liftOver_comeron2012.py. As a reminder, liftOver_comeron2012.py is meant to be the archived code for how we generated our D. melanogaster dm6 map from the Comeron et al. 2012 map, whereas liftOver_catalog.py is meant to be used when we want to lift over any of the current maps to a new assembly in the future (e.g. GRCh37 -> GRCh38).

I had quite a difficult time trying to wrap my head around how exactly I would modify this code to use concurrent.futures. I think this would take a sizeable amount of restructuring. I'm wondering if we could potentially hold off until after this code is merged and then open a new issue if we still want to restructure with concurrent.futures. Is the main motivation for using concurrent.futures just to have simpler maintenance code going forward? The way the code is currently structured, there are really only a few low-level functions pertaining to multiprocessing. Let me know if you would be agreeable to this or if this needs to be changed now.

@jradrion (Contributor, Author)

Also, is there a consensus on whether we want to have tests for all maintenance code? This code is still failing on CircleCI. Should I be writing tests for these scripts next?

@jradrion (Contributor, Author)

Thank you @mufernando for helping me out of git hell.

@jeromekelleher (Member)

This is a big improvement, thanks @jradrion! I'm happy for this to be merged as-is.

There's still a bit of cleanup that could be done (variable naming conventions are a bit inconsistent, with a mixture of camelCase, CamelCase, ALLCAPS, and snake_case), but it's not that important for maintenance code like this. The motivation for using concurrent.futures would be to remove redundant code, making it easier to understand what the file is actually for. Readers looking at the code might currently think there's some deep reason for needing to do tricky low-level concurrency stuff, and it takes a few minutes to figure out that there isn't.

Anyway, this is nit-picking; awesome work, thanks!

@grahamgower mentioned this pull request on Nov 16, 2020
@grahamgower (Member)

Awesome! Thanks @jradrion! I've added a small commit to fix the circleci failure. I couldn't seem to push to your PR branch, so I had to open a new PR (#691). Will merge that and close this as soon as the CI lights are green.

@grahamgower merged commit 62381e3 into popsim-consortium:master on Nov 16, 2020
@grahamgower (Member)

Thanks again @jradrion. And sorry for the mess of CI changes you had to deal with during the life of this PR. Would you be able to open a couple of follow-up issues about the points raised by @jeromekelleher?

@jradrion (Contributor, Author)

Awesome, thanks @grahamgower. I will open issues about the comments made by @jeromekelleher.

@jradrion (Contributor, Author)

Thank you to @jeromekelleher, @grahamgower, @jeffspence, and @mufernando for all of your help with this.

@jradrion deleted the liftOver branch on April 8, 2021.