added code for lifting over assemblies #574

Merged 1 commit into popsim-consortium:master on Nov 16, 2020

Conversation

@jradrion (Contributor) commented Aug 5, 2020

Here is my very unpolished code for doing liftOver between assemblies. I've tested this on humans and flies so far, and it is running without issue for me. The code requires that you have the UCSC liftOver tool installed and that you point to the appropriate chain file using the argument --chainFile. The --winLen option controls the interval length over which a weighted average recombination rate is calculated, and --gapThresh controls the length of gaps for which recombination rates are set to zero rather than to the chromosome average rate.

An example command for lifting over the human map would be as follows:

python liftOver_assemblies.py \
    --species HomSap \
    --map HapMapII_GRCh37 \
    --chainFile ./liftOver_chains/hg19ToHg38.over.chain.gz \
    --winLen 1000 \
    --gapThresh 1000000

@codecov (bot) commented Aug 5, 2020

Codecov Report

Merging #574 (dbe28c9) into master (566c38f) will increase coverage by 0.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #574      +/-   ##
==========================================
+ Coverage   99.65%   99.67%   +0.01%     
==========================================
  Files          33       33              
  Lines        2328     2158     -170     
  Branches      293      256      -37     
==========================================
- Hits         2320     2151     -169     
  Misses          3        3              
+ Partials        5        4       -1     
Impacted Files Coverage Δ
stdpopsim/slim_engine.py 99.67% <0.00%> (-0.01%) ⬇️
stdpopsim/cli.py 98.43% <0.00%> (-0.01%) ⬇️
stdpopsim/cache.py 100.00% <0.00%> (ø)
stdpopsim/utils.py 100.00% <0.00%> (ø)
stdpopsim/models.py 100.00% <0.00%> (ø)
stdpopsim/engines.py 100.00% <0.00%> (ø)
stdpopsim/genomes.py 100.00% <0.00%> (ø)
stdpopsim/species.py 100.00% <0.00%> (ø)
stdpopsim/citations.py 100.00% <0.00%> (ø)
stdpopsim/genetic_maps.py 100.00% <0.00%> (ø)
... and 9 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@grahamgower (Member) commented Aug 5, 2020

Nice one, thanks Jeff!

I'm a noob when it comes to lifting over, so I have some questions about the nature of this liftOver tool that may be a bit off-topic, but here goes anyway...

  • UCSC seems to offer only precompiled binaries (https://genome-store.ucsc.edu/). There is mention of source code availability, but it's not obvious where it can be obtained. The linked license terms suggest that one is not allowed to make derivatives (3.2a in https://genome-store.ucsc.edu/media/eula/2019/10/14/UCSC_GB_EULA.pdf). Jeff: does your script then constitute a derivative work?
  • How does liftOver work? Is there a publication describing it? Or is the only description to be found in the sources (which I couldn't easily find)?
  • What is a chain file anyway, and how can one be created?

@jeromekelleher (Member)

Thanks @jradrion, this is great!

@jeffspence - it would be super helpful if we could get your thoughts on what we're doing here. What we need to do is lift over older genetic maps to GRCh38 (for humans), and to whatever genome build is used in Ensembl for other species. As such, it's something we do rarely so the code doesn't need to be especially polished or automated, but we'd like to have some assurance that we've not introduced artifacts into the maps. Any thoughts would be much appreciated.

@jradrion (Contributor, Author) commented Aug 5, 2020

@grahamgower thanks!

does your script then constitute a derivative work?

I'm not a lawyer, but I would argue that this does not constitute a derivative work. All I'm doing is running the UCSC liftOver tool "as is", with some pre- and post-processing going on to convert the file formats for our use. To me this is no different than running liftOver using a shell script, which I'd have to imagine would not break the EULA. If this is a concern for people it would be pretty easy to strip out the call to liftOver and split this code into two scripts, where we run the first half, then we run liftover, then we run the second half.
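
For what it's worth, the call itself is nothing more than a thin wrapper around the binary; a minimal sketch (not the PR's actual code; it assumes liftOver is on PATH and that the pre/post-processing steps read and write BED files) would look roughly like:

# Minimal sketch only (not the PR's code): wrap the UCSC liftOver binary.
# Assumes liftOver is on PATH; the documented invocation is:
#   liftOver oldFile map.chain newFile unMapped
import subprocess

def run_liftover(in_bed, chain_file, out_bed, unmapped_file):
    subprocess.run(
        ["liftOver", in_bed, chain_file, out_bed, unmapped_file],
        check=True,
    )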

How does liftOver work?

I'm also not a liftOver expert, and this is a really good question. I guess I've always assumed it uses some kind of hash function to just grab the coordinates from the chain file. To be honest, I have no idea how it works.

What is a chain file anyway, and how can one be created?

Chain files are better described here, but it's basically just a file of coordinates from genome-wide local alignments mapping between the two assemblies. The algorithm for building the chains is published (https://www.pnas.org/content/100/20/11484.full). This page describes how to make your own.
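
For reference, the documented layout of a chain file is roughly the following: one header line per chain, followed by alignment block lines, where each block line gives the size of an ungapped aligned block and the offsets (dt, dq) to the next block on the target and query sides:

chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
size dt dq
size dt dq
...
size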

@jeromekelleher (Member)

I'm not a lawyer, but I would argue that this does not constitute a derivative work. All I'm doing is running the UCSC liftOver tool "as is", with some pre- and post-processing going on to convert the file formats for our use. To me this is no different than running liftOver using a shell script, which I'd have to imagine would not break the EULA. If this is a concern for people it would be pretty easy to strip out the call to liftOver and split this code into two scripts, where we run the first half, then we run liftover, then we run the second half.

Let's not worry about this - we're not distributing the maintenance code to end users, so I don't think we need to worry about license issues.

@jeffspence (Contributor)

This looks great! If I understand correctly -- you map the HapMap-format file to a BED, use liftOver to remap the BED, and then do a bit of post-processing (filling in gaps, and averaging within windows) to clean up the BED and map back to HapMap format. My only minor quibble is that on line 140 you use the mean of the rates to fill in short gaps. I think it would make a bit more sense to use the mean rate instead of the mean of the rates (the mean of the rates up-weights short intervals and down-weights long intervals; in species with hotspots, high rates tend to be in short intervals and low rates in long intervals, so my guess is that the mean of the rates would be a bit higher than the mean rate).
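
To make the distinction concrete, here is a tiny illustration with made-up numbers (an unweighted mean of per-interval rates versus a length-weighted mean rate):

# Illustration only, with made-up numbers: unweighted mean of rates vs. length-weighted mean rate.
import numpy as np

lengths = np.array([1_000, 99_000])   # bp: one short hotspot interval, one long cold interval
rates = np.array([1e-6, 1e-8])        # per-bp recombination rates

mean_of_rates = rates.mean()                      # ~5.1e-07: up-weights the short hotspot
mean_rate = np.average(rates, weights=lengths)    # ~2.0e-08: preserves the total map length
print(mean_of_rates, mean_rate)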

This overall approach is a little different from what Adam Auton did to lift the HapMap maps over. This is from the README for the GRCh37 liftOver of the HapMap recombination maps (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20110106_recombination_hotspots/README_hapmapII_GRCh37_map):

The map was generated by lifting the HapMap Phase II genetic map from build 35 to GRCh37. The original map was generated using LDhat
as described in the 2007 HapMap paper (Nature, 18th Sept 2007). The conversion from b35 to GRCh37 was achieved using the UCSC liftOver
tool.

Once the liftOver was completed, the map was inspected for regions in which the genome assembly had been rearranged. Such rearrangements
result in a dip in the lifted genetic map (i.e. a negative recombination rate). These regions were removed by setting the
recombination to zero within, and for 50 SNPs either side of, such regions. These regions are fairly rare, and this method removed a
total of 2013 SNPs from the following chromosomes.

chr1: 224 SNPs
chr2: 209 SNPs
chr3: 104 SNPs
chr4: 103 SNPs
chr7: 547 SNPs
chr9: 292 SNPs
chr10: 101 SNPs
chr13: 208 SNPs
chr15: 101 SNPs
chrX: 124 SNPs.

All other chromosomes did not have any SNPs removed by this method.

Adam Auton

I'm not 100% sure what this actually means. It sounds like he might have looked at the actual position in genetic distance for each line in the file, and then back-calculated the rate (from the differences in physical distances and genetic distances between adjacent lines), setting negative rates to zero. But I think that what you did makes more sense.

I did something similar to what you're doing in my paper (https://doi.org/10.1126/sciadv.aaw9206 ; Table S2), where I inferred maps on both GRCh37 and GRCh38, and then lifted over the maps from GRCh37 to GRCh38 and computed correlations between the map inferred on GRCh38 and the map inferred on GRCh37 but lifted over to GRCh38. The way I lifted over the map was a bit more naive than what you're doing -- I just converted the recombination rates to BED files and then ran liftOver on the BED files without post-processing. I only computed correlations on positions that were included in an interval in both maps, and if positions appeared in more than one interval, I used the recombination rate from the last interval. In any case, the correlations were actually quite good. It will obviously depend on the method used to infer the recombination maps, but I got correlations (Spearman and Pearson on the log-recombination rates) of ~0.98 and higher even at single-base-pair resolution. Pearson on the natural scale varied much more depending on the population. I'm guessing this is because hotspots were inferred to have somewhat different heats depending on the build, driving the differences in correlation. But overall, my take was that running liftOver had less of an effect than I might have thought (although of course YMMV, especially for lifting over between genome builds that are less similar than GRCh37 and GRCh38; also, my method uses unphased data, and I suspect that methods that use phased data might change more between builds).

Hope this helps, and overall it looks really great!
Jeff

@jradrion (Contributor, Author) commented Aug 5, 2020

@jeffspence thank you for taking a look at this!

I think it would make a bit more sense to use the mean rate instead of the mean of the rates

Great catch, I hadn't considered the issue of biased means. I also just realized that I rm the unmapped positions files automatically, which is dumb because we may want to inspect those after running this code. I'll work on modifying this commit to reflect these changes.

I'm not 100% sure what this actually means.

I'm not either. By rearrangements, I was thinking he might have meant only places where contiguous intervals are reordered in the lifted-over assembly, but I'm really not sure. This also doesn't address how they dealt with gaps when they presumably arose in the new assembly.

In any case, the correlations were actually quite good.

This is comforting to know!

@jradrion (Contributor, Author) commented Aug 6, 2020

I edited the code to use a properly weighted mean rate for the chromosome, per @jeffspence's suggestion. I also added the flag --retainIntermediates to retain all of the liftOver intermediate files for inspection. Finally, I added the option --joinThresh, which joins adjacent intervals with the same rate when rounded to --joinThresh digits of precision after the decimal point.
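
For reference, the joining step amounts to something like the following (a sketch only, not the actual implementation):

# Sketch only (not the actual implementation): join adjacent intervals whose rates
# agree when rounded to `digits` decimal places, keeping a length-weighted rate.
def join_intervals(intervals, digits):
    # intervals: sorted, non-overlapping (start, end, rate) tuples
    joined = []
    for start, end, rate in intervals:
        mergeable = (
            joined
            and joined[-1][1] == start
            and round(joined[-1][2], digits) == round(rate, digits)
        )
        if mergeable:
            prev_start, _, prev_rate = joined[-1]
            # keep a length-weighted rate so the total map length is preserved
            merged_rate = (
                prev_rate * (start - prev_start) + rate * (end - start)
            ) / (end - prev_start)
            joined[-1] = (prev_start, end, merged_rate)
        else:
            joined.append((start, end, rate))
    return joined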

@ndukler (Contributor) commented Sep 1, 2020

As per our conference call, here's some code for seeing which target regions are covered in a chain file:

#!/usr/bin/env bash

## A script designed to take a chain file and create a BED file showing its coverage on the target genome,
## where the chain has query --> target directionality. Places the output file in the same directory as the source file.
## Arguments: $1 = chain file (plain or gzipped), $2 = target assembly name, $3 = query assembly name

bname=`basename $1`
target=$2
query=$3
oname=`readlink -e $1`
oname=`dirname ${oname}`/${target}_covered_by_${query}.bed.gz

if [[ ${bname} == *.gz ]]; then
    zcat $1 | awk 'length {if($1 ~ "chain"){chrom=$8;qChrLen=$9;qStrand=$10; if(qStrand=="+"){start=$11} else{start=qChrLen-$11}} else{if(qStrand=="+"){print chrom,start,start+$1;start=start+$1+$3}; if(qStrand=="-"){print chrom,start-$1,start;start=start-$1-$3}}}' | sort-bed - | bedops -m - | awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' | gzip -c > ${oname}
else    
    awk 'length {if($1 ~ "chain"){chrom=$8;qChrLen=$9;qStrand=$10; if(qStrand=="+"){start=$11} else{start=qChrLen-$11}} else{if(qStrand=="+"){print chrom,start,start+$1;start=start+$1+$3}; if(qStrand=="-"){print chrom,start-$1,start;start=start-$1-$3}}}' $1 | sort-bed - | bedops -m - | awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' | gzip -c > ${oname}
fi

@jradrion (Contributor, Author) commented Sep 1, 2020

Thanks, @ndukler. I'll have a look.

@jradrion (Contributor, Author)

I found some issues with the code that I will be fixing over the coming days, so we should hold off on merging for the time being.

@jeromekelleher (Member)

It's annoying that we're failing on CircleCI because we can't import the new script - I think this is because we're asking nose to track coverage on the maintenance package. It might be simplest to just put matplotlib into the list of dev dependencies, although this might slow down our install time a fair bit.

@grahamgower, any thoughts on the simplest way out here? I don't think we want to try to write tests to cover this code, so I'm not sure what the best thing to do here is.

@grahamgower (Member)

Hmm... interesting. Does this go away if we stop asking for coverage in the maintenance folder? If it's as simple as that, then I'd say let's just not bother tracking coverage there.

@jeromekelleher (Member)

Hmm... interesting. Does this go away if we stop asking for coverage in the maintenance folder? If it's as simple as that, then I'd say let's just not bother tracking coverage there.

Yes, that's probably the best thing all right.

@jradrion (Contributor, Author) commented Oct 7, 2020

I've again updated this script. The newest version avoids doing a weighted average for all non-overlapping windows across the chromosome, and instead only performs weighted averaging for overlapping intervals in the lifted map (a relatively rare occurrence). This dramatically reduces the error when mapping back to the original assembly, and I think this is really the way to go. Additionally, I added a new option to fill gaps with either the chromosome mean or the mean of the flanking intervals, which seems to be a better choice in most instances.
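
For concreteness, the gap-filling choice is essentially the following (a sketch only, not the actual code; None marks a missing flanking interval, e.g. at a chromosome end):

# Sketch only: fill a gap with the mean of the flanking interval rates,
# falling back to the chromosome mean when no flanking rate is available.
def fill_gap_rate(left_rate, right_rate, chrom_mean):
    flanks = [r for r in (left_rate, right_rate) if r is not None]
    return sum(flanks) / len(flanks) if flanks else chrom_mean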

I've also added maintenance code for the liftover of the comeron map from dm3 to dm6, as discussed in #592.

@grahamgower (Member)

Thanks @jradrion, this is now at the top of my queue of stdpopsim things. It's quite long though, so I probably won't get to it for a few more days. (And Jerome is on parental leave.) Perhaps you could look into avoiding the CircleCI failure in the meantime? It might be sufficient to remove the maintenance folder from the nosetests command in .circleci/config.yml (and if that doesn't do the trick, let me know and we'll devise a new strategy).

@jradrion (Contributor, Author) commented Oct 9, 2020

@grahamgower OK, looks like removing maintenance from nosetests did the trick.

@andrewkern (Member)

@jradrion do we really want to get rid of those tests?

@grahamgower (Member)

The change made here doesn't change which tests are run. What it does change is which folders code coverage is reported for when running the tests (it was looking under both stdpopsim and maintenance; now it will just look under stdpopsim). Annoyingly, the continuous integration was failing because it wanted to import everything under the maintenance folder to check its coverage. This failed because @jradrion's script(s) import stuff that isn't installed on the CircleCI machine (and we preferred not to depend on matplotlib et al.).

@grahamgower (Member)

Hey @jradrion, I've looked over this code a couple of times now, but haven't tried running it. I think there are a few code formatting issues that should be fixed. Can you try running flake8 --max-line-length 89 maintenance/lift*.py? We should be running that in the continuous integration as well, but I guess it hasn't been added yet (you're welcome to add it, if you like).

It would be nice to include a unit test or two for this in the test suite as well. How about including a small chain file for a small section of one chromosome, and storing the pre-lifted and post-lifted rate maps for that? Then the unit test would check that the liftover script produces the same output as the pre-calculated post-lifted maps.
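
Something along these lines, say (the module, function, and data-file names here are hypothetical, just to sketch the shape of the test):

# Hypothetical sketch of such a test; names and paths are illustrative only.
import unittest
from maintenance import liftOver  # assumed import path

class TestLiftOver(unittest.TestCase):
    def test_small_region_liftover(self):
        # a small pre-lifted map and a small chain file stored with the test data
        observed = liftOver.lift_map(
            "tests/data/small_pre_lift_map.txt",
            "tests/data/small_region.over.chain.gz",
        )
        expected = liftOver.read_map("tests/data/small_post_lift_map.txt")
        self.assertEqual(observed, expected)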

@jradrion (Contributor, Author)

Thanks for looking it over, @grahamgower. I'll fix the formatting, add some tests, and update.

@jeromekelleher (Member) left a review

Looks great, thanks @jradrion!

There's a few high-level structural issues here that are resulting in us having a lot more code than we really need, but I think they can be resolved pretty easily. I'm happy to provide more input if the comments aren't clear.

from matplotlib import gridspec


def progress_bar(percent, barLen = 50):

@jeromekelleher:
tqdm is a handy way of doing this.

sys.stdout.flush()


def assign_task(tmpFile, task_q, nProcs):

@jeromekelleher:
Seems pretty heavyweight to need our own task queue here - have you tried using concurrent.futures?

run all functions to translate and liftOver file types
"""
# multithread remapping
task_q = mp.JoinableQueue()

@jeromekelleher:
This does seem unnecessarily low-level here - there's a good example in the msprime code of using concurrent.futures along with a tqdm progress bar.
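
The general pattern (not msprime's actual code; remap_one and chunks are placeholders) is roughly:

# General pattern only: a process pool plus a tqdm progress bar,
# as an alternative to a hand-rolled multiprocessing task queue.
import concurrent.futures
import tqdm

def remap_all(chunks, remap_one, n_procs):
    results = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=n_procs) as executor:
        futures = [executor.submit(remap_one, chunk) for chunk in chunks]
        completed = concurrent.futures.as_completed(futures)
        for future in tqdm.tqdm(completed, total=len(futures)):
            results.append(future.result())
    return results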

from matplotlib import gridspec


def progress_bar(percent, barLen = 50):

@jeromekelleher:
This looks like a near-identical copy of the liftover_assemblies.py file. Having multiple copies of the same code is almost never a good idea, and will be a significant maintenance burden over time (especially if we have lots of species - consider what happens if we find a small bug in the liftover code, which has been copied n times).

How about we have a file liftover.py which provides the basic utilities you're using here, like hapmap_to_bed etc., and then the code specific to each liftover we do lives in its own Python file that imports liftover?

@jeromekelleher:
That is, basically just the main functions which you currently have?
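
In other words, something like the following layout (a sketch only; hapmap_to_bed is from the suggestion above, while run_liftover, bed_to_hapmap, and the file names are hypothetical):

# maintenance/liftOver_comeron2012.py -- sketch of the suggested layout, not actual code.
import liftOver  # shared utilities module

def main():
    liftOver.hapmap_to_bed("comeron2012_dm3_maps.txt", "comeron2012_dm3.bed")
    liftOver.run_liftover("comeron2012_dm3.bed", "dm3ToDm6.over.chain.gz", "comeron2012_dm6.bed")
    liftOver.bed_to_hapmap("comeron2012_dm6.bed", "comeron2012_dm6_maps.txt")

if __name__ == "__main__":
    main()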

@jradrion (Contributor, Author)

@jeromekelleher Thank you for your comments! I'll look into implementing both tqdm and concurrent.futures, and I'll fix the redundancies.

@PopSim-bot

📖 Docs for this PR can be previewed here

@jradrion (Contributor, Author) commented Nov 13, 2020

@jeromekelleher @grahamgower
I made the requested flake8 formatting changes and moved all redundant functions to liftOver.py, which is then imported by both liftOver_catalog.py and liftOver_comeron2012.py. As a reminder, liftOver_comeron2012.py is meant to be the archived code for how we generated our D. melanogaster dm6 map from the Comeron et al. 2012 map, whereas liftOver_catalog.py is meant to be used when we want to lift over any of the current maps to a new assembly in the future (e.g. GRCh37 -> GRCh38).

I had quite a difficult time trying to wrap my head around how exactly I would modify this code to use concurrent.futures. I think this would take a sizeable amount of restructuring. I'm wondering if we could potentially hold off until after this code is merged and then open a new issue if we still want to restructure with concurrent.futures. Is the main motivation for using concurrent.futures just to have simpler maintenance code going forward? The way the code is currently structured, there are really only a few low-level functions pertaining to multiprocessing. Let me know if you would be agreeable to this or if this needs to be changed now.

@jradrion (Contributor, Author)

Also, is there a consensus on whether we want to have tests for all maintenance code? This code is still failing on CircleCI. Should I be writing tests for these scripts next?

@jradrion (Contributor, Author)

Thank you @mufernando for helping me out of git hell.

@jeromekelleher (Member)

This is a big improvement, thanks @jradrion! I'm happy for this to be merged as-is.

There's still a bit of cleanup that could be done (variable naming conventions are a bit inconsistent, with a mixture of camelCase, CamelCase, ALLCAPS, and snake_case), but it's not that important for maintenance code like this. The motivation for using concurrent.futures would be to remove redundant code, making it easier to understand what the file is actually for. Readers looking at the code might currently think there's some deep reason for needing to do tricky low-level concurrency stuff, and it takes a few minutes to figure out that there isn't.

Anyway, this is nit-picking; awesome work, thanks!

@grahamgower mentioned this pull request on Nov 16, 2020
@grahamgower (Member)

Awesome! Thanks @jradrion! I've added a small commit to fix the circleci failure. I couldn't seem to push to your PR branch, so I had to open a new PR (#691). Will merge that and close this as soon as the CI lights are green.

@grahamgower merged commit 62381e3 into popsim-consortium:master on Nov 16, 2020
@grahamgower (Member)

Thanks again @jradrion. And sorry for the mess of CI changes you had to deal with during the life of this PR. Would you be able to open a couple of follow-up issues about the points raised by @jeromekelleher?

@jradrion (Contributor, Author)

Awesome, thanks @grahamgower. I will open issues about the comments made by @jeromekelleher.

@jradrion (Contributor, Author)

Thank you to @jeromekelleher, @grahamgower, @jeffspence, and @mufernando for all of your help with this.

@jradrion deleted the liftOver branch on April 8, 2021.