Update match.py #11
Conversation
Nice, thanks for the contribution. Do you know why the unit tests are failing?
Yes, so the five tests fail for the following reasons:
{20: [8, 10], 23: [1, 2, 4, 7, 9, 11, 12], 25: [0, 3, 5, 6, 13]}

The test script expects the key 23 to be 22, but otherwise the dictionary is the same. To trace back where the two scripts diverge, I first observed that they produce the same gc_perc vector of [0.5, 0.45, 0.45, 0.5, 0.45, 0.5, 0.5, 0.45, 0.4, 0.45, 0.4, 0.45, 0.45, 0.5]. Adding half of gc_bin_width gives [0.51, 0.46, 0.46, 0.51, 0.46, 0.51, 0.51, 0.46, 0.41, 0.46, 0.41, 0.46, 0.46, 0.51], and to find the GC bins we do integer division by gc_bin_width = 0.02. We can basically do that in our heads and see that 0.46 should be converted to a bin value of 23. For some reason the current version gets 22, which is also what the test expects, but this appears to be wrong. The current implementation uses a dtype of float32, while I changed this to the standard numpy precision of float64, which is likely the reason for the discrepancy. I did a few additional tests and got:
The TL;DR is that I think the new version I've proposed has the 'correct' output, but that in practice it doesn't really matter, as these issues are only likely to arise with small, clean window sizes like 20. I was able to get exactly the same matching loci with both implementations using a standard in_window of 2114 for ~112k input loci.
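For anyone who wants to reproduce the precision effect, here is a minimal sketch (not the PR's code; the value 0.45 and gc_bin_width = 0.02 are taken from the example above) that evaluates the same arithmetic under both dtypes:

```python
import numpy

# Illustrative only: the same arithmetic as (loci_gc + gc_bin_width / 2.) // gc_bin_width,
# evaluated in float32 and float64. Because 0.45, 0.01, and 0.02 are not exactly
# representable in binary floating point, the resulting bin index can differ by one
# depending on the dtype and how the intermediate values round.
gc_bin_width = 0.02
for dtype in (numpy.float32, numpy.float64):
    gc = numpy.array([0.45], dtype=dtype)
    bins = ((gc + gc_bin_width / 2) // gc_bin_width).astype(int)
    print(dtype.__name__, bins)
```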
Sorry for the delay in getting this merged. I'll try to take a closer look and get it merged this weekend.
No problem. Thanks for taking the time to look at it. I updated the fork so that peak loci are no longer filtered based on the input chroms, hence test 5) no longer fails.

I looked a bit further into test 4) above. It is indeed related to floating point precision, and to the fact that numbers like 0.01 cannot be represented exactly in binary floating point. Hence you are not guaranteed correct results even by choosing a higher floating point precision, so float64 is not necessarily better than float32. As far as I can tell, it is possible to 'round away' the slight error introduced by the addition and division in (loci_gc + gc_bin_width / 2.) // gc_bin_width by rounding before casting to an integer. So doing something like this works in the cases I have looked at: gc_perc = ((gc_perc + gc_bin_width / 2) / gc_bin_width).round(10).astype(int) # note: normal division, not floor division

One could also just bin the GC percentages directly. For instance, bins = numpy.linspace(0, 1, int(1/gc_bin_width) + 1) creates evenly sized bins, though their widths are in some cases not exactly equal to gc_bin_width, while bins = numpy.arange(0, 1 + gc_bin_width, gc_bin_width) creates bins of width gc_bin_width, though the last bin can extend beyond 1. The linspace approach is more numerically stable according to the numpy docs. Both binning strategies create one less bin than the current approach, shifted a bit, so if you want the same binning as currently, you could add an extra bin and shift them by gc_bin_width / 2, though then you may start to run into some of the same issues. To do the binning: gc_perc = numpy.digitize(gc_perc, bins) - 1

The three other tests fail due to small changes in dtypes and input verification. While I think these small changes are reasonable and slight improvements, I realize you may have reasons for keeping your current checks and output dtypes, such as reproducibility and potential future functions you would want to fail and/or work with certain inputs.
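Pulling the two alternatives above together, a minimal sketch (variable names follow the ones used in the comment; the input values are made up for illustration):

```python
import numpy

gc_perc = numpy.array([0.5, 0.45, 0.4])  # made-up GC fractions for illustration
gc_bin_width = 0.02

# Option 1: keep the current arithmetic, but round away the floating point error
# before casting to int (note: true division, not floor division).
bins_rounded = ((gc_perc + gc_bin_width / 2) / gc_bin_width).round(10).astype(int)

# Option 2: bin the GC fractions directly against evenly spaced edges.
edges = numpy.linspace(0, 1, int(1 / gc_bin_width) + 1)
bins_digitized = numpy.digitize(gc_perc, edges) - 1
```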
Reduces runtime in the bigwig case by ~5s in previous benchmark (from ~40s to ~35s)
Thanks for being so thorough in your investigation. I'm ready to merge this but would you mind (1) updating |
Correction to a docstring and update author information.
Should be updated now. I've made a lot of use of your libraries in my work, so I'm happy to be able to contribute.
You can use |
Might want to specify exact=True "just to be sure", although 1) I don't see any difference with or without exact=True in the tests I've run, and 2) even approximate values should suffice for this application.
Handling the case where there are no entries in the bigwig file in a given region. Here bw_stream.stats will return None, and this is then converted to 0.
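For context, a small sketch of what the two points above look like with pyBigWig (the file path and coordinates are placeholders, not the PR's actual code; type="sum" requires a reasonably recent pyBigWig version):

```python
import pyBigWig

bw_stream = pyBigWig.open("signal.bw")  # placeholder path

# exact=True asks pyBigWig to compute the statistic from the raw intervals rather
# than the zoom-level summaries. stats() returns a list with one value per
# requested bin, and that value is None when the region has no entries in the
# bigwig file, so it is converted to 0 here.
value = bw_stream.stats("chr1", 1000, 3114, type="sum", exact=True)[0]
count = 0.0 if value is None else value
```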
Hi @bovagner -- saw you pushed some more commits. Is this ready to be merged?
Yes, I think it should be. It was just a few little things I corrected. I will try to refrain from any further updates.
It might be worth considering using pybigtools. Semi-relevant: jackh726/bigtools#42
bigtools does look like a very interesting package. It does, however, look like you would need to write the data to a bed file and use that file path as input to the function, which would require a bit of rewriting of the code, i.e. https://bigtools.readthedocs.io/en/latest/pybigtools.html#BBIRead.average_over_bed. So I think the Python API still needs a bit of added flexibility, which it looks like they're also working on. But I suppose it could be done without too much work -- have you done any benchmarks against pyBigWig? In the thread you linked it looks like it takes ~5s to get the mean for an entire bed file, which you can also do with pyBigWig, of course depending on the machine, file sizes, etc.
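For reference, a rough sketch of the pybigtools call being discussed, based on the documentation linked above (the file names are placeholders, and the return format and optional arguments of average_over_bed should be checked against those docs):

```python
import pybigtools

# average_over_bed takes a BED file path rather than in-memory coordinates,
# which is the extra file-writing step mentioned above.
bw = pybigtools.open("signal.bw")
result = bw.average_over_bed("loci.bed")
```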
I don't have a direct comparison between pyBigWig and pybigtools for reading bigWig files. A speed comparison between bigWigAverageOverBed from UCSC Kent tools and pybigtools that I could find:
For writing bigWig files, I have some timings (the script does more than just writing bigWig files, so the speedup for writing bigWig files alone is likely much greater):
Thanks for the helpful pointers and discussion, @ghuls. Given that these functions will likely be run only infrequently (e.g., once per experiment), I don't know if further optimization is necessary. I am wary of adding more dependencies and file writing, though, especially for packages I am not personally familiar with.
I couldn't help trying to further optimize the resource usage of extract_matching_loci from match.py.
For a sample run using the human genome and ~112k loci, I estimated the peak memory usage at around 18 GB by observing htop. In addition, I recorded the runtime of the different parts of the function and got the following:
Setting up: 0.0751s (0.06%)
Extracting loci: 5.4964s (4.25%)
Getting loci gc bin count: 0.8073s (0.62%)
Extracting mask: 0.0763s (0.06%)
Getting background gc: 115.4431s (89.18%)
Merging bg gc, getting bg bin count: 0.4058s (0.31%)
Matching: 0.0009s (0.00%)
Extracting loci: 0.0775s (0.06%)
Verbosity analysis: 7.0401s (5.44%)
Sorting: 0.0247s (0.02%)
Total time: 129.4472s (100.00%)
When additionally providing a bigwig file, the times were:
Setting up: 0.0771s (0.04%)
Extracting loci: 23.0924s (13.46%)
Getting loci gc bin count: 0.8186s (0.48%)
Extracting mask: 0.0796s (0.05%)
Getting background gc: 123.4100s (71.93%)
Merging bg gc, getting bg bin count: 0.1639s (0.10%)
Matching: 0.0009s (0.00%)
Extracting loci: 0.0674s (0.04%)
Verbosity analysis: 23.7682s (13.85%)
Sorting: 0.0999s (0.06%)
Total time: 171.5779s (100.00%)
Hence it is clear that extracting loci and getting background GC are the most resource-demanding tasks, both in terms of time and memory usage. Both tasks essentially amount to extracting sequences from a fasta file and calculating the GC and N percentages, plus the total count in a window when a bigwig file is given. I therefore created a set of utility functions that handle this process by:
1) Generating a list of coords, either for an input loci file or for tiling a chromosome.
and from the coords either
2a) Extract the sequence and calculate the char percentage.
or
2b) Extract the signal and calculate the sum.
This approach can then handle both use cases. It relies on the Python string count method, rather than encoding all the sequences as torch tensors or numpy arrays, making it much faster and also more memory efficient, since only a single locus sequence needs to be kept in memory at a time. (I do still load entire chromosomal sequences into memory when calculating background GC and N, because it is faster than streaming, but beyond that, only a single slice is kept in memory.)
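As a rough illustration of the idea (this is a sketch, not the PR's actual utility functions; pyfaidx is used here only as an example FASTA reader, and the function name is made up):

```python
import pyfaidx  # any FASTA reader that returns plain strings would work here

def locus_gc_and_n(fasta_path, loci):
    # Sketch: one locus sequence is held in memory at a time, and counting uses
    # str.count rather than encoding the sequence as a tensor or array.
    fasta = pyfaidx.Fasta(fasta_path)
    for chrom, start, end in loci:
        seq = str(fasta[chrom][start:end]).upper()
        width = end - start
        gc = (seq.count("G") + seq.count("C")) / width
        n = seq.count("N") / width
        yield gc, n
```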
With these changes, and some additional minor ones, the resource usage now looks as follows:
Without a bigwig file:
Peak memory usage of around 2 GB
Setting up: 0.0776s (0.94%)
Extracting loci: 3.5094s (42.33%)
Getting loci gc bin count: 0.0233s (0.28%)
Extracting mask: 0.0795s (0.96%)
Getting background gc: 4.0153s (48.43%)
Merging bg gc, getting bg bin count: 0.4111s (4.96%)
Matching: 0.0009s (0.01%)
Extracting loci: 0.0763s (0.92%)
Verbosity analysis: 0.0701s (0.85%)
Sorting: 0.0275s (0.33%)
Total time: 8.2910s (100.00%)
With a bigwig file:
Peak memory usage of around 5 GB
Setting up: 0.0774s (0.19%)
Extracting loci: 16.9000s (41.37%)
Getting loci gc bin count: 0.0231s (0.06%)
Extracting mask: 0.0766s (0.19%)
Getting background gc: 9.5343s (23.34%)
Merging bg gc, getting bg bin count: 0.2759s (0.68%)
Matching: 0.0010s (0.00%)
Extracting loci: 0.0738s (0.18%)
Verbosity analysis: 13.8576s (33.93%)
Sorting: 0.0273s (0.07%)
Total time: 40.8470s (100.00%)
With the tests I've run, I've been able to reproduce the output of the current script. The only exception is the case where the bigwig file is missing certain chromosomes; the current script outputs a signal of all zeroes in this case, giving a total count of 0, whereas the functions I've implemented instead return nan. This will result in a somewhat different threshold value for filtering background regions. I think this change is quite reasonable.
With these changes the slowest part is extracting the counts from the bigwig file. This could possibly be sped up by some parallel processing, but I didn't bother to look further into that.
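If someone does want to try the parallelisation mentioned above, a rough sketch (not part of the PR; function names are made up, and type="sum" assumes a recent pyBigWig) could look like this, with each worker opening its own pyBigWig handle since handles should not be shared across processes:

```python
from concurrent.futures import ProcessPoolExecutor
import pyBigWig

def _chrom_sums(args):
    # Each worker opens its own file handle.
    bigwig_path, chrom, windows = args  # windows: list of (start, end) tuples
    bw = pyBigWig.open(bigwig_path)
    if chrom not in bw.chroms():
        # Mirror the behaviour described above: nan when the chromosome is
        # missing from the bigwig file entirely.
        sums = [float("nan")] * len(windows)
    else:
        sums = [bw.stats(chrom, start, end, type="sum", exact=True)[0] or 0.0
                for start, end in windows]
    bw.close()
    return chrom, sums

def parallel_bigwig_sums(bigwig_path, windows_by_chrom, max_workers=4):
    tasks = [(bigwig_path, chrom, windows)
             for chrom, windows in windows_by_chrom.items()]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(_chrom_sums, tasks))
```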
Thanks for creating this amazing library and for this intuitive implementation of extract_matching_loci. I hope these proposed changes can serve as a useful addition and are somewhat in line with your design philosophy.