Add more bigwigaverageoverbed stats #42
Conversation
Closes: #38
A few thoughts. Really appreciate the PR!
Just btw, I have a rebase over #44 at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats. (Once I get that other PR merged, I'll get this updated for review comments - if you can't get to it in time - so we can get it merged)
Looks like quite a bit of work in #44. I was trying out returning a named tuple and got it to work, but noticed that my command was taking 5 seconds instead of around 4 seconds before. So I decided to test the creation times of tuples, dicts, named tuples and class instances. Named tuples seem to be the slowest to create:

In [1]: %timeit tuple_style = ('chr1\t3094805\t3095305\tchr1:3094805-3095305', 0.06325921607762575, 0.06325921607762575, 0.044780001044273376, 0.044780001044273376, 0.08956000208854675, 0.08956000208854675, 500, 500, 31.629608038812876)
6.97 ns ± 0.0829 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
In [3]: %timeit dict_style = {"name": 'chr1\t3094805\t3095305\tchr1:3094805-3095305', "mean": 0.06325921607762575, "mean0": 0.06325921607762575, "min": 0.044780001044273376, "min0": 0.044780001044273376, "max": 0.08956000208854675, "max0": 0.08956000208854675, "size": 500, "bases": 500, "sum": 31.629608038812876}
202 ns ± 6.27 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [5]: %timeit named_tuple_style = pybigtools.BigWigAverageOverBedSummaryStatistics(name='chr1\t3094805\t3095305\tchr1:3094805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, min0=0.044780001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.629608038812876)
548 ns ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
class BigWigAverageOverBedSummaryStatisticsClass:
    def __init__(self, name, mean, mean0, min, min0, max, max0, size, bases, sum):
        self.name = name
        self.mean = mean
        self.mean0 = mean0
        self.min = min
        self.min0 = min0
        self.max = max
        self.max0 = max0
        self.size = size
        self.bases = bases
        self.sum = sum

In [12]: %timeit class_style = BigWigAverageOverBedSummaryStatisticsClass(name='chr1\t3094805\t3095305\tchr1:3094805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, min0=0.044780001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.629608038812876)
449 ns ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

It would be nice if bigwigaverageoverbed would return mean, min and max (only considering positions that have actual values, returning NaN if needed) and mean0, min0 and max0 (assuming 0.0 for all positions that don't have an entry in the bigWig). This would change the behavior of mean compared to Kent tools (NaN instead of 0.0 when no values are found), but feels more consistent.

-fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {
+fn pybigtools(py: Python, m: &PyModule) -> PyResult<()> {
     m.add("__version__", env!("CARGO_PKG_VERSION"))?;
     m.add_wrapped(wrap_pyfunction!(open))?;
     m.add_wrapped(wrap_pyfunction!(bigWigAverageOverBed))?;
@@ -1366,5 +1382,25 @@ fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {
     m.add_class::<EntriesIterator>()?;

+    let collections = PyModule::import(py, "collections").unwrap();
+    let namedtuple = collections.getattr("namedtuple").unwrap();
+    let bigwigaverageoverbed_summary_statistics = namedtuple
+        .call(
+            (
+                "BigWigAverageOverBedSummaryStatistics".to_object(py),
+                (
+                    "name", "mean", "mean0", "min", "min0", "max", "max0", "size", "bases", "sum",
+                )
+                    .to_object(py),
+            ),
+            None,
+        )
+        .unwrap();
+    m.add(
+        "BigWigAverageOverBedSummaryStatistics",
+        bigwigaverageoverbed_summary_statistics,
+    )
+    .unwrap();
+
     Ok(())
 }
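The namedtuple registered by the diff above can be exercised from the Python side. A minimal sketch that mirrors the exported type with `collections.namedtuple` directly (the field values here are illustrative, not real output):

```python
from collections import namedtuple

# Mirrors the type the Rust snippet above registers as
# pybigtools.BigWigAverageOverBedSummaryStatistics.
BigWigAverageOverBedSummaryStatistics = namedtuple(
    "BigWigAverageOverBedSummaryStatistics",
    ["name", "mean", "mean0", "min", "min0", "max", "max0", "size", "bases", "sum"],
)

stats = BigWigAverageOverBedSummaryStatistics(
    name="chr1:3094805-3095305",
    mean=0.0633, mean0=0.0633,
    min=0.0448, min0=0.0448,
    max=0.0896, max0=0.0896,
    size=500, bases=500, sum=31.63,
)

# Fields are accessible by name or by position, and it unpacks like a tuple.
print(stats.mean0)   # → 0.0633
name, *rest = stats
print(name)          # → chr1:3094805-3095305
```

This name-or-index duality is what makes the namedtuple attractive despite the extra construction cost measured above.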
I'm simultaneously both surprised and not that both dict and namedtuple have that much worse perf than a tuple. I think, at least for now, I'm not so worried about the time difference. I imagine if speed is needed at that level, people would either drop down to Rust, or use some future API that allows batching and returning a pre-allocated array or something.

I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of those that exist compared to the ucsc tools, since that will either just get confusing or too much. For now, I'd recommend just calling

I think I'm pretty happy with the branch I have at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats. I updated the API to match the comments above. Would love to hear your thoughts on it.
Looks good. Do you know how I can check that branch out to try it locally? |
should work |
Managed to get the correct branch:

git fetch https://github.com/jackh726/bigtools.git ghuls/add_more_bigwigaverageoverbed_stats:bigtools_ghuls_add_more_bigwigaverageoverbed_stats

Help needs to be fixed:
Giving an unknown statistic (like

In [32]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0"]))
    ...: x[0]
CPU times: user 4.93 s, sys: 36.2 ms, total: 4.97 s
Wall time: 4.96 s
Out[32]: 0.06325921607762575

In [33]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all"]))
    ...: x[0]
CPU times: user 90 µs, sys: 2 µs, total: 92 µs
Wall time: 95.8 µs
Out[33]: 0.06325921607762575

In [34]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all2"]))
    ...: x[0]
CPU times: user 49 µs, sys: 1e+03 ns, total: 50 µs
Wall time: 52.9 µs
Out[34]: 0.06325921607762575

In [35]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0", "mean", "min", "max", "size", "bases", "sum"]))
    ...: x[0]
CPU times: user 5.13 s, sys: 80 ms, total: 5.21 s
Wall time: 5.2 s
Out[35]:
(0.06325921607762575,
 0.06325921607762575,
 0.044780001044273376,
 0.08956000208854675,
 500,
 500,
 31.629608038812876)
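The near-instant wall times for `"all"` and `"all2"` suggest unknown stat names are only hit lazily (or never), rather than rejected up front. A caller-side guard can fail fast; a hypothetical sketch (`VALID_STATS` and `check_stats` are not part of the pybigtools API):

```python
# Hypothetical set of accepted stat names; the real set is defined by pybigtools.
VALID_STATS = {"mean", "mean0", "min", "max", "size", "bases", "sum"}

def check_stats(stats):
    """Raise immediately on unknown stat names instead of getting a
    silently wrong (or instant, empty) result back from a lazy iterator."""
    unknown = set(stats) - VALID_STATS
    if unknown:
        raise ValueError(f"unknown stats: {sorted(unknown)}")
    return list(stats)

check_stats(["mean0", "max"])   # passes
# check_stats(["all2"])         # would raise ValueError
```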
It would be nice to have a method on BBIRead to reinitialize it (similar to

with pybigtools.open("test.bw", "r") as bw:
    x = list(bw.average_over_bed("consensus_peaks_set1.bed", stats=["mean0"]))
    y = list(bw.average_over_bed("consensus_peaks_set2.bed", stats=["mean0"]))
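Until such a reinitialize method exists, one workaround is reopening the bigWig per query. A sketch with the opener injected as a parameter so the pattern stands on its own (the helper name is hypothetical):

```python
def average_over_beds(open_fn, bigwig_path, bed_paths, stats):
    """Run average_over_bed once per BED file, reopening the bigWig each
    time so every call starts from a fresh reader state."""
    results = {}
    for bed in bed_paths:
        with open_fn(bigwig_path, "r") as bw:
            results[bed] = list(bw.average_over_bed(bed, stats=stats))
    return results

# With pybigtools this would be called as:
# average_over_beds(pybigtools.open, "test.bw",
#                   ["consensus_peaks_set1.bed", "consensus_peaks_set2.bed"],
#                   ["mean0"])
```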
Opening BED files as a Path does not work (works for opening the bigwig):

bw.average_over_bed(Path("file.bed"))
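Accepting both str and Path on the Python side usually comes down to coercing the argument through `os.fspath` before it reaches lower-level code; a minimal sketch of the pattern (not the bigtools code):

```python
import os
from pathlib import Path

def normalize_path(path):
    """Coerce str, bytes, or any os.PathLike (e.g. pathlib.Path)
    to a plain string before handing it to lower-level code."""
    p = os.fspath(path)
    return p.decode() if isinstance(p, bytes) else p

print(normalize_path(Path("file.bed")))  # → file.bed
print(normalize_path("file.bed"))        # → file.bed
```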
Done
Strange...there is a test that shows that it does:
This already should have worked - added a test to show.
Added this + test
(feel free to just force push that branch to this PR branch - I can do it directly if you don't know how. Do want to give you credit for your work here)
Calculating multiple

In [14]: %%time
    ...: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
    ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="all"))
    ...:     print(y_list[0], x_list[0])
0.06325921607762575 SummaryStatistics(size=500, bases=500, sum=31.629608038812876, mean0=0.06325921607762575, mean=0.06325921607762575, min=0.044780001044273376, max=0.08956000208854675)
CPU times: user 8.71 s, sys: 188 ms, total: 8.9 s
Wall time: 8.89 s

Raising an exception from a with block does not work for me:

import pybigtools
from pathlib import Path
# Don't import numpy to show with block does not pass exception.
# import numpy as np
In [2]: %%time
    ...: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
    ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"))
    ...:     print(y_list[0], x_list[0])
0.06325921607762575 0.08956000208854675
CPU times: user 8.1 s, sys: 228 ms, total: 8.33 s
Wall time: 8.32 s

# Numpy is not imported, so the code should throw an exception, but doesn't:
In [3]: %%time
    ...: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
    ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
    ...:     print(y_array[0], x_array[0])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:4
NameError: name 'y_array' is not defined

In [4]: %%time
    ...: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
    ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
    ...:     #print(y_array[0], x_array[0])
CPU times: user 47 μs, sys: 1 μs, total: 48 μs
Wall time: 50.5 μs

# In a normal with block it throws an exception:
In [5]: %%time
    ...: with open(Path("test.bw"), "r") as bw:
    ...:     y_array = np.fromiter(bw.read(), dtype=np.float32)
    ...:     x_array = np.fromiter(bw.read(), dtype=np.float32)
    ...:     print(y_array[0], x_array[0])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:2
NameError: name 'np' is not defined

# After importing numpy
import numpy as np

In [9]: %%time
    ...: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
    ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
    ...:     print(y_array[0], x_array[0])
0.063259214 0.08956
CPU times: user 8.24 s, sys: 160 ms, total: 8.4 s
Wall time: 8.4 s

Minimal example:

In [11]: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")

In [12]: with open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 2
      1 with open(Path("test.bw"), "r") as bw:
----> 2     raise ValueError("Some error")
ValueError: Some error
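The swallowed exception matches a context manager whose `__exit__` returns a truthy value, which Python interprets as "suppress the exception". A toy sketch of the correct behavior (not the pybigtools implementation): returning None or False lets the exception propagate out of the with block.

```python
class Reader:
    """Toy context manager. __exit__ must return a falsy value,
    otherwise any exception raised in the with-block is suppressed."""
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Release resources here; returning None propagates exceptions.
        return None

propagated = False
try:
    with Reader():
        raise ValueError("Some error")
except ValueError:
    propagated = True
print(propagated)  # → True
```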
Thanks, this was a problem with #44, which is fixed now (with a test) |
86485e6 to b391f93
…gWigAverageOverBed.
… Allow passing Path objects.
I now assume that this was the reason why invoking

Exceptions are now raised properly.

Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.
Seems like bigWigAverageOverBed is semi-deprecated:

If that is really the case, could the
b391f93 to 34a3691
Good to hear
Maybe today, most likely sometime this weekend
Potentially! I do think that the general question of whether uncovered is treated as 0 or not is important - it's relevant to
Just pushed a commit for this.
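To make the 0-vs-uncovered question concrete, a small sketch (not the bigtools implementation) of mean versus mean0 over a region where uncovered bases are represented as NaN:

```python
import math

def mean_and_mean0(values):
    """values: per-base signal with math.nan where the bigWig has no entry.
    mean ignores uncovered bases (NaN if nothing is covered);
    mean0 counts uncovered bases as 0.0."""
    covered = [v for v in values if not math.isnan(v)]
    mean = sum(covered) / len(covered) if covered else math.nan
    mean0 = sum(covered) / len(values)
    return mean, mean0

# 500 bp region, first half covered at 0.1, second half uncovered:
vals = [0.1] * 250 + [math.nan] * 250
m, m0 = mean_and_mean0(vals)  # m ≈ 0.1, m0 ≈ 0.05
```

With no covered bases at all, this returns NaN for mean (the behavior discussed above) rather than the Kent tools' 0.0.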
Going to go ahead and merge this - happy to follow up on min0/max0.
Great!
I think pyBigWig returns np.nan for those positions.
Thanks.
Seems like UCSC does not have time to fix bigWigAverageOverBed: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ
So I guess adding min0/max0 to bigtools should not be a big issue if
Closes #38