
Update match.py #11

Merged
merged 7 commits into from
Aug 9, 2024

Conversation

bovagner
Contributor

I couldn't help trying to further optimize the resource usage of extract_matching_loci in match.py.
In a sample run using the human genome and ~112k loci, I estimated the peak memory usage at around 18 GB by observing htop. In addition, I recorded the runtime of the different parts of the function and got the following:

Setting up: 0.0751s (0.06%)
Extracting loci: 5.4964s (4.25%)
Getting loci gc bin count: 0.8073s (0.62%)
Extracting mask: 0.0763s (0.06%)
Getting background gc: 115.4431s (89.18%)
Merging bg gc, getting bg bin count: 0.4058s (0.31%)
Matching: 0.0009s (0.00%)
Extracting loci: 0.0775s (0.06%)
Verbosity analysis: 7.0401s (5.44%)
Sorting: 0.0247s (0.02%)
Total time: 129.4472s (100.00%)

When additionally providing a bigwig file, the times were:

Setting up: 0.0771s (0.04%)
Extracting loci: 23.0924s (13.46%)
Getting loci gc bin count: 0.8186s (0.48%)
Extracting mask: 0.0796s (0.05%)
Getting background gc: 123.4100s (71.93%)
Merging bg gc, getting bg bin count: 0.1639s (0.10%)
Matching: 0.0009s (0.00%)
Extracting loci: 0.0674s (0.04%)
Verbosity analysis: 23.7682s (13.85%)
Sorting: 0.0999s (0.06%)
Total time: 171.5779s (100.00%)

Hence it is clear that extracting loci and getting background GC are the most resource-demanding tasks, both in terms of time and memory usage. Both tasks essentially amount to extracting sequences from a FASTA file and calculating the percentage of GC and N, and, in addition, calculating the total count in a window when a bigwig file is given. Hence I created a set of utility functions that can handle this process by:

1) Generating a list of coords, either for an input loci file or for tiling a chromosome,
and from the coords either
2a) extracting the sequence and calculating the char percentage,
or
2b) extracting the signal and calculating the sum.

This approach can then handle both use cases. It relies on the Python string count method, rather than encoding all the sequences as torch tensors or numpy arrays, making it much faster and also more memory efficient, since only a single locus sequence needs to be kept in memory at a time. (I do still load entire chromosomal sequences into memory when calculating background GC and N, because it is faster than streaming, but beyond that, only a single slice is kept in memory at a time.)
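
To sketch what step 2a) looks like with str.count, here is a minimal illustration of what calculate_char_perc might do; the body is illustrative, not the PR's actual code:

    import numpy

    def calculate_char_perc(sequence, width, chars="GC"):
        # Percentage of `chars` in each non-overlapping window of `width`.
        # str.count runs in C, and only one window slice is alive at a time.
        n_windows = len(sequence) // width
        perc = numpy.empty(n_windows, dtype=numpy.float64)
        for i in range(n_windows):
            window = sequence[i * width:(i + 1) * width]
            perc[i] = sum(window.count(c) for c in chars) / width
        return perc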

With these changes, and some additional minor ones, the resource usage now looks as follows:

Without a bigwig file:
Around 2 GB peak memory usage
Setting up: 0.0776s (0.94%)
Extracting loci: 3.5094s (42.33%)
Getting loci gc bin count: 0.0233s (0.28%)
Extracting mask: 0.0795s (0.96%)
Getting background gc: 4.0153s (48.43%)
Merging bg gc, getting bg bin count: 0.4111s (4.96%)
Matching: 0.0009s (0.01%)
Extracting loci: 0.0763s (0.92%)
Verbosity analysis: 0.0701s (0.85%)
Sorting: 0.0275s (0.33%)
Total time: 8.2910s (100.00%)

With a bigwig file:
Around 5 GB peak memory usage
Setting up: 0.0774s (0.19%)
Extracting loci: 16.9000s (41.37%)
Getting loci gc bin count: 0.0231s (0.06%)
Extracting mask: 0.0766s (0.19%)
Getting background gc: 9.5343s (23.34%)
Merging bg gc, getting bg bin count: 0.2759s (0.68%)
Matching: 0.0010s (0.00%)
Extracting loci: 0.0738s (0.18%)
Verbosity analysis: 13.8576s (33.93%)
Sorting: 0.0273s (0.07%)
Total time: 40.8470s (100.00%)

With the tests I've run, I've been able to reproduce the output of the current script. The only exception is the case where the bigwig file is missing certain chromosomes; the current script outputs a signal of all zeroes in this case, giving a total count of 0. With the functions I've implemented, I instead return NaN in this case. This results in a somewhat different threshold value for filtering background regions. I think this change is quite reasonable.
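
A minimal sketch of that behavior (the helper name is hypothetical; the pyBigWig calls are the library's standard API):

    import pyBigWig

    def bigwig_total_count(bw_path, chrom, start, end):
        # Sum of the signal in [start, end); NaN when the chromosome is
        # entirely absent from the bigwig file.
        bw = pyBigWig.open(bw_path)
        try:
            if chrom not in bw.chroms():
                return float("nan")
            # stats returns [None] when the region has no entries; treat
            # that as a count of 0 (see the later commit note below).
            total = bw.stats(chrom, start, end, type="sum")[0]
            return 0.0 if total is None else total
        finally:
            bw.close()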

With these changes the slowest part is extracting the counts from the bigwig file. This could possibly be sped up by some parallel processing, but I didn't bother to look further into that.

Thanks for creating this amazing library and for creating this intuitive implementation of extract_matching_loci. I hope these proposed changes can serve as a useful addition, and are somewhat in line with your design philosophy.

@jmschrei
Owner

Nice, thanks for the contribution. Do you know why the unit tests are failing?

@bovagner
Contributor Author

Yes, so the five tests fail for the following reasons:

  1. test_calculate_char_perc: I changed the output type from torch tensor to numpy array, in accordance with the existing docstring, and because there was no need to convert the numpy array to a torch tensor. So now gc.dtype is numpy.float64.
  2. test_calculate_char_perc_raises_long: Fails because I removed the assert that the width must not be greater than the sequence length. In that case the function just returns an empty array, so I didn't think the function needed to fail - say you have a very short chromosome in your list of chromosomes.
  3. test_calculate_char_perc_raises_sequence: Fails because I removed the check that the sequence must be a string. The function works perfectly fine with a list or tuple of characters as input. It will not, however, work if one tries to put in a one-hot-encoded sequence, so if you prefer, I can add an explicit check that the input sequence is a str, tuple, or list.
  4. test_extract_and_filter_chrom: This is quite subtle, and I'm not completely sure why it happens, but it seems to be due to numerical precision. To explain in a bit more detail, the modifications I've proposed produce the following regions dictionary when calling _extract_and_filter_chrom("tests/data/test.fa", chrom='chr1', in_window=20, out_window=20):

{20: [8, 10], 23: [1, 2, 4, 7, 9, 11, 12], 25: [0, 3, 5, 6, 13]}

The test script expects the key 23 to be 22, but otherwise the dictionary is the same. To trace where the two scripts diverge, I first observe that they produce the same gc_perc vector of

[0.5, 0.45, 0.45, 0.5, 0.45, 0.5, 0.5, 0.45, 0.4, 0.45, 0.4, 0.45, 0.45, 0.5]

When adding half the gc_bin_width, we arrive at

[0.51, 0.46, 0.46, 0.51, 0.46, 0.51, 0.51, 0.46, 0.41, 0.46, 0.41, 0.46, 0.46, 0.51]

To find the GC bins, we do integer division by gc_bin_width = 0.02. We can basically do that in our heads and see that 0.46 should be converted to a bin value of 23. For some reason the current version gets 22, which is also what the test expects, but this appears to be wrong. The current implementation uses a dtype of float32, while I changed this to the standard numpy precision of float64, which is likely the reason behind the discrepancy. I did a few additional tests and got:

  • 0.46 // 0.02 -> 23.0

  • t = torch.tensor([0.45], dtype=torch.float32)
    (t + 0.01) // 0.02 -> tensor([22.])

  • t = torch.tensor([0.46], dtype=torch.float32)
    t // 0.02 -> tensor([23.])

  • t = torch.tensor([0.45], dtype=torch.float64)
    (t + 0.01) // 0.02 -> tensor([23.], dtype=torch.float64)

  • t = torch.tensor([0.46], dtype=torch.float64)
    t // 0.02 -> tensor([23.], dtype=torch.float64)

  • t = numpy.array([0.45], dtype=numpy.float32)
    (t + 0.01) // 0.02 -> array([22.], dtype=float32)

  • t = numpy.array([0.46], dtype=numpy.float32)
    t // 0.02 -> array([23.], dtype=float32)

  • t = numpy.array([0.45], dtype=numpy.float64)
    (t + 0.01) // 0.02 -> array([23.])

  • t = numpy.array([0.46], dtype=numpy.float64)
    t // 0.02 -> array([23.])

The TL;DR is that I think the new version I've proposed has the 'correct' output, but that in practice it doesn't really matter, as these issues are likely to arise only with small, clean window sizes like 20. I was able to get the exact same matching loci from both implementations using a standard in_window of 2114 for ~112k input loci.

  5. test_extract_matching_loci: Here the test expects 5 matching loci but only 3 are given as output. I assumed that the input loci should be filtered based on the provided chroms, but I can see that you do not do that in your script; you only filter loci based on whether they are within bounds on the chromosome. So if you want the chroms parameter to determine only which chroms the matching loci are found on, and not which chroms the input loci can be on, then I will make a modification that doesn't filter the input loci based on the chroms parameter.

@jmschrei
Owner

Sorry for the delay in getting this merged. I'll try to take a closer look and get it merged this weekend.

@bovagner
Contributor Author

bovagner commented Aug 1, 2024

No problem. Thanks for taking the time to look at it. I updated the fork so that peak loci are no longer filtered based on the input chroms; hence test 5 no longer fails.

I looked a bit further into test 4) above. It is indeed related to floating point precision, and the fact that numbers like 0.01 cannot be represented exactly as floating points in a binary representation. Hence you are not guaranteed to get correct results even by choosing a larger floating point precision, so float64 is not necessarily better than float32. As far as I can tell, it's possible to 'round away' the slight error introduced by the addition and division in (loci_gc + gc_bin_width / 2.) // gc_bin_width by rounding before casting to an integer. So something like this works in the cases I have looked at:

gc_perc = ((gc_perc + gc_bin_width / 2) / gc_bin_width).round(10).astype(int) # note: normal division, not floor division
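
A quick self-contained check of the fix, reusing the float32 values from the tests above:

    import numpy

    gc_bin_width = 0.02
    gc = numpy.array([0.45], dtype=numpy.float32)

    naive = ((gc + gc_bin_width / 2) // gc_bin_width).astype(int)  # -> [22]
    fixed = ((gc + gc_bin_width / 2) / gc_bin_width).round(10).astype(int)  # -> [23]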

One could also just bin the gc percentages directly. For instance creating some bins:

bins = numpy.linspace(0,1,int(1/gc_bin_width)+1)

which creates evenly sized bins, though the widths are in some cases not exactly equal to gc_bin_width. Or

bins = numpy.arange(0, 1+gc_bin_width, gc_bin_width)

which creates bins of width gc_bin_width, though the last bin can extend beyond 1. The linspace approach is more numerically stable according to the docs. Both binning strategies create one fewer bin than the current approach, shifted a bit, so if you want the same binning as currently, you could add an extra bin and shift them by gc_bin_width/2, though then you may start to run into some of the same issues.

Then do the binning:

gc_perc = numpy.digitize(gc_perc, bins)-1
gc_bin_count = numpy.bincount(gc_perc, minlength=len(bins)-1)

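Putting the binning together, a small self-contained check (values are illustrative):

    import numpy

    gc_bin_width = 0.02
    gc_perc = numpy.array([0.45, 0.46, 0.5])

    bins = numpy.linspace(0, 1, int(1 / gc_bin_width) + 1)
    gc_bins = numpy.digitize(gc_perc, bins) - 1
    # gc_bins -> [22 23 25]; note that 0.45 now falls in bin 22, i.e. the
    # shift relative to the (gc + gc_bin_width / 2) // gc_bin_width scheme
    gc_bin_count = numpy.bincount(gc_bins, minlength=len(bins) - 1)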

The three other tests fail due to small changes in dtypes and input verification. While I think these small changes are reasonable and slight improvements, I realize you may have reasons for keeping your current checks and output dtypes, such as reproducibility, or potential future functions you would want to fail and/or work with certain inputs.

Reduces runtime in the bigwig case by ~5s in previous benchmark (from ~40s to ~35s)
@jmschrei
Owner

jmschrei commented Aug 2, 2024

Thanks for being so thorough in your investigation. I'm ready to merge this, but would you mind (1) updating docs/whats_new.rst to briefly describe your contribution and (2) updating the author list on match.py to include your name and contact info, if you'd like to be recognized for writing a significant portion of the code?

Correction to a docstring and update author information.
@bovagner
Contributor Author

bovagner commented Aug 3, 2024

Should be updated now. I've gotten a lot of use out of your libraries in my work, so I'm only happy to be able to contribute.

@ghuls

ghuls commented Aug 6, 2024

> I looked a bit further into test 4) above. It is indeed related to floating point precision, and the fact that numbers like 0.01 cannot be represented exactly as floating points in a binary representation. Hence you are not guaranteed to get correct results even by choosing a larger floating point precision, so float64 is not necessarily better than float32. As far as I can tell, it's possible to 'round away' the slight error introduced by the addition and division in (loci_gc + gc_bin_width / 2.) // gc_bin_width by rounding before casting to an integer. So something like this works in the cases I have looked at:

You can use assertAlmostEqual:
https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertAlmostEqual
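
For example (inside a unittest.TestCase; the values are just the classic illustration):

    self.assertAlmostEqual(0.1 + 0.2, 0.3, places=7)  # passes even though 0.1 + 0.2 != 0.3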

Might want to specify exact=True "just to be sure", although 1) I don't see any difference with and without exact=True in the tests I've run and 2) even approximate values should suffice for this application.
Handling the case where there are no entries in the bigwig file in a given region; here bw_stream.stats returns None, which is then converted to 0.
@jmschrei
Owner

jmschrei commented Aug 8, 2024

Hi @bovagner -- saw you pushed some more commits. Is this ready to be merged?

@bovagner
Contributor Author

bovagner commented Aug 9, 2024

Yes, I think it should be. It was just a few little things I corrected. I will try to refrain from updating it any further.

@ghuls

ghuls commented Aug 9, 2024

It might be worth considering using bigtools' average_over_bed to get the sum, as it is quite a bit faster than pyBigWig.

Semi-relevant: jackh726/bigtools#42
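
A hedged sketch of what that could look like, going by the pybigtools docs (the exact signature and return shape of average_over_bed may differ; file names are placeholders):

    import pybigtools

    bbi = pybigtools.open("signal.bw")
    # average_over_bed takes a path to a BED file rather than in-memory
    # coordinates, which is the rewriting cost discussed in the reply below.
    for value in bbi.average_over_bed("loci.bed"):
        print(value)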

@bovagner
Contributor Author

bovagner commented Aug 9, 2024

bigtools does look like a very interesting package. However, it looks like you would need to write the data to a BED file and use that file path as input to the function, which would require a bit of rewriting of the code, i.e. https://bigtools.readthedocs.io/en/latest/pybigtools.html#BBIRead.average_over_bed. So I think the Python API still needs a bit of added flexibility, which it looks like they're also working on. But I suppose it could be done without too much work - have you done any benchmarks against pyBigWig? In the thread you linked, it looks like it takes ~5s to get the mean for an entire BED file, which you can also do with pyBigWig, depending of course on the machine, file sizes, etc.

@ghuls

ghuls commented Aug 9, 2024

I don't have a direct comparison between pyBigWig and pybigtools for reading bigWig files.
Internally we had code with pyBigWig that used a similar approach as here to calculate the mean for a region, and people in the lab complained that it was slow. So I suggested using bigWigAverageOverBed from UCSC Kent tools, which already gave a big speed improvement.

A speed comparison between bigWigAverageOverBed from UCSC Kent tools and pybigtools that I could find back:

Vip.bw pybigtools: 6.85 seconds
Vip.bw bigWigAverageOverBed: 9.85 seconds
OPC.bw pybigtools: 10.25 seconds
OPC.bw bigWigAverageOverBed: 5.93 seconds
L2_3IT.bw pybigtools: 13.42 seconds
L2_3IT.bw bigWigAverageOverBed: 39.39 seconds
Pvalb.bw pybigtools: 8.42 seconds
Pvalb.bw bigWigAverageOverBed: 28.17 seconds
Micro_PVM.bw pybigtools: 6.84 seconds
Micro_PVM.bw bigWigAverageOverBed: 9.49 seconds
SstChodl.bw pybigtools: 5.20 seconds
SstChodl.bw bigWigAverageOverBed: 4.92 seconds
L6b.bw pybigtools: 6.18 seconds
L6b.bw bigWigAverageOverBed: 15.95 seconds
VLMC.bw pybigtools: 5.94 seconds
VLMC.bw bigWigAverageOverBed: 17.92 seconds
L6CT.bw pybigtools: 24.00 seconds
L6CT.bw bigWigAverageOverBed: 23.54 seconds
L5ET.bw pybigtools: 7.20 seconds
L5ET.bw bigWigAverageOverBed: 8.13 seconds
Endo.bw pybigtools: 9.05 seconds
Endo.bw bigWigAverageOverBed: 5.61 seconds
Sst.bw pybigtools: 11.64 seconds
Sst.bw bigWigAverageOverBed: 12.42 seconds
Sncg.bw pybigtools: 10.28 seconds
Sncg.bw bigWigAverageOverBed: 5.88 seconds
Lamp5.bw pybigtools: 10.37 seconds
Lamp5.bw bigWigAverageOverBed: 11.03 seconds
L6IT.bw pybigtools: 11.20 seconds
L6IT.bw bigWigAverageOverBed: 12.14 seconds

For writing bigWig files, I have some timings (the script does more than just write bigWig files, so the speedup from writing bigWig files alone is likely much greater):

❯ timeit scatac_fragment_tools bigwig -c mm10.chrom_sizes_and_alias.tsv -i atac_fragments.tsv.gz -o atac_fragments.bigtools.bw -n -w pybigtools

Time output:
------------

  * Command: scatac_fragment_tools bigwig -c mm10.chrom_sizes_and_alias.tsv -i atac_fragments.tsv.gz -o atac_fragments.bigtools.bw -n -w pybigtools
  * Elapsed wall time: 1:52.44 = 112.44 seconds
  * Elapsed CPU time:
     - User: 152.63
     - Sys: 14.97
  * CPU usage: 149%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 347298
     - Involuntarily (time slice expired): 4328
  * Maximum resident set size (RSS: memory) (kiB): 17762696
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 2184472
  * Exit status: 0


❯ timeit scatac_fragment_tools bigwig -c mm10.chrom_sizes_and_alias.tsv -i atac_fragments.tsv.gz -o atac_fragments.pybigwig.bw -n -w pybigwig

Time output:
------------

  * Command: scatac_fragment_tools bigwig -c mm10.chrom_sizes_and_alias.tsv -i atac_fragments.tsv.gz -o atac_fragments.pybigwig.bw -n -w pybigwig
  * Elapsed wall time: 2:43.26 = 163.26 seconds
  * Elapsed CPU time:
     - User: 180.53
     - Sys: 10.95
  * CPU usage: 117%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 36579
     - Involuntarily (time slice expired): 3795
  * Maximum resident set size (RSS: memory) (kiB): 16395720
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 1338728
  * Exit status: 0

@jmschrei
Owner

jmschrei commented Aug 9, 2024

Thanks for the helpful pointers and discussion, @ghuls. Given that these functions will likely be run only infrequently (e.g., once per experiment), I don't know if further optimization is necessary. I am wary of adding more dependencies and file writing, though, especially for packages I am not personally familiar with.

@jmschrei jmschrei merged commit 0064991 into jmschrei:main Aug 9, 2024
0 of 5 checks passed